Written by Ramón Saquete
In another post about images and WPO I already mentioned that the factors that most often affect overall load time are images on the front-end and database access on the back-end. Today we are going to talk about database optimization.
Optimizing data access is vital to page load time, as it is often the factor that most affects the time the browser has to wait to receive the HTML. This waiting time is very important, since the rest of the page resources (images, scripts and style sheets) do not start downloading until the browser reads the HTML that references them.
This type of optimization is probably the most complex of all. First, because it depends on two factors that vary over time: on the one hand, how the queries are made and what type of queries they are; on the other hand, the workload the server or servers have to carry. Second, because of the large amount of knowledge it requires: you need to know how to rewrite queries, rewrite the code that executes them, create indexes, materialized views, horizontal and vertical partitions, replicas and support tables; how to choose the data types to use; how to optimize the schema without losing the logic of the business model; how to adjust the DBMS configuration parameters; and how to use external caching systems. In addition, each of these points, about which many books have been written, must be approached differently depending on which DBMS is involved (MySQL, SQL Server, Oracle, …) and even which version.
If we turn to non-relational or NoSQL DBMSs, we have a wide variety of data models: key/value, graph-oriented, columnar, multidimensional, object-oriented, etc. But the data model that has most popularized the term NoSQL is the document-oriented database, where we have key/value pairs in which the value is usually a hierarchical structure. Many developers take the plunge with these new DBMSs (MongoDB, Google BigTable, Apache Cassandra, …) thinking about the speed they are going to gain, without considering that this speed is achieved by not implementing ACID properties or JOIN operations: the ACID properties are what keep the data from ending up corrupted and unusable, and by choosing a NoSQL DBMS it is assumed that JOIN operations will never be needed.
What happens in these cases is that customer requirements change, either during development or maintenance, so the database schema has to be modified, the NoSQL data model no longer fits the business model, and in the end it is the programmer who ends up implementing in application code what the relational DBMS already does, but with much slower code than we would get from one of those DBMSs. With this I am not saying that with relational DBMSs there is no need to write code to adjust the logical model of the database to the conceptual model of the business, but it will be much less than with a NoSQL DBMS.
What NoSQL DBMSs are very good for, especially the key/value type, is being used as caching systems, either memory-only like memcached or hybrid like Redis, and as such they are a good complement to any relational DBMS that has to perform many reads.
Now that I’ve sketched a little bit of the big picture, let’s briefly look at a few interesting tricks for common Web application situations.
For the following explanations I will use the standard nomenclature. I mention this so that you keep in mind that when I say relation I am referring to what many incorrectly call a table in relational DBMSs.
Quick tree structure queries
Tree data structures are very common on the Web, for example in breadcrumbs, menus with nested submenus and comment threads. Normally, whenever you want to implement a tree in a relational database, you create a reflexive foreign key from a parent column to the primary key of the relation. Example:
ID | Parent |
1 | NULL |
2 | 1 |
3 | 2 |
The problem is that to find all the ancestors or all the descendants of a node we have to perform several queries or several JOIN operations of the table with itself, one for each level of the tree, and this is very inefficient.
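As a minimal sketch of this adjacency list approach, assuming a hypothetical Category relation holding the data above, the schema and the kind of self-JOIN it forces could look like this:
CREATE TABLE Category (id INT PRIMARY KEY, name VARCHAR(100) NOT NULL, parent INT NULL, FOREIGN KEY (parent) REFERENCES Category(id));
-- Two self-JOINs are already needed just to climb two levels, and one more for every additional level of depth:
SELECT c1.name, c2.name, c3.name FROM Category c1 LEFT JOIN Category c2 ON c1.parent = c2.id LEFT JOIN Category c3 ON c2.parent = c3.id WHERE c1.id = 3;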
There are many possible solutions to this problem, but the one I like the most is to store every node that hangs from another one in a new many-to-many relation, also reflexive, whose composite primary key is formed by the ancestor node and the descendant node, both of which are at the same time foreign keys to the primary key of the original relation. We will also have another column for the level of distance between the nodes. Example showing the same information as in the previous example:
Ancestor | Descendant | Level |
1 | 2 | 1 |
1 | 3 | 2 |
2 | 3 | 1 |
In this way, with a single JOIN with the original relation we can obtain all the ancestors or descendants of a node and display them on the Web.
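As a sketch, and assuming the same hypothetical Category relation as before, the closure relation and the single-JOIN query to fetch all descendants of node 1 might look like this (ancestors are obtained symmetrically by filtering on the descendant column):
CREATE TABLE CategoryPath (ancestor INT NOT NULL, descendant INT NOT NULL, level INT NOT NULL, PRIMARY KEY (ancestor, descendant), FOREIGN KEY (ancestor) REFERENCES Category(id), FOREIGN KEY (descendant) REFERENCES Category(id));
-- All descendants of node 1, ordered by depth, with a single JOIN against the original relation:
SELECT c.id, c.name, p.level FROM CategoryPath p INNER JOIN Category c ON c.id = p.descendant WHERE p.ancestor = 1 ORDER BY p.level;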
Optimize pagination
Pagination is usually a costly task when we have to display several page numbers, because the number of rows in the relation has to be calculated, which may require a complete read of the relation, depending on the DBMS. It is always more efficient to display only previous and next links, retrieving all the rows to display plus one more, so that the next-page link is shown only if we get that additional row from the database. Alternatively, we can keep the row count in a separate relation.
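A minimal sketch of the previous/next approach, assuming a hypothetical Foo relation and pages of 10 rows: we ask for 11 rows and, if the eleventh arrives, we know a next-page link is needed.
SELECT id, data1 FROM Foo ORDER BY data1 LIMIT 30, 11; -- fourth page: rows 31 to 40 plus one extra row used only to decide whether to show the next link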
Another thing that can affect pagination is that, when we are on an inner page, a certain number of rows belonging to the previous pages must be read from disk with all the columns that appear in the query, only to be discarded. One way to avoid this is to perform a JOIN with a subquery that retrieves only the identifiers we need before reading all the information. Example:
SELECT data1, data2, data3 FROM Foo INNER JOIN (SELECT id FROM Foo ORDER BY data1 LIMIT 50, 10) AS FooPaginated USING(id);
Show random rows
It is quite common to show random rows of a relation to give a sense of freshness to the contents. The problem is that this is often implemented in rather inefficient ways, such as sorting by a random value, which usually means reading all the rows of the table. In MySQL, if we have auto-increment identifiers, the best approach is to cross-reference against a randomly generated identifier. Here is an example of how to implement this to retrieve a random row from the Foo relation:
MySQL: SELECT name FROM Foo AS f1 JOIN (SELECT (RAND() * (SELECT MAX(id) FROM Foo)) AS id) AS f2 WHERE f1.id >= f2.id ORDER BY f1.id ASC LIMIT 1
In other DBMSs we have proprietary solutions:
SQL Server: SELECT * FROM Foo TABLESAMPLE (1 ROWS);
Oracle: SELECT * FROM (SELECT * FROM Foo SAMPLE (1) ORDER BY dbms_random.value) WHERE ROWNUM=1
Caching the most frequent queries
Enabling the DBMS cache can sometimes worsen performance. This depends on the volume of data accessed most frequently: if that volume is high and frequently updated, we will always be writing to the cache instead of reading from it. So it is usually preferable to use an external cache system, in which we can make sure that the most frequently requested and freshest data is actually what gets cached. If the cache system does not provide this functionality, we can implement it ourselves. A good way to do it, without the overhead of storing the number of times each query is requested, is to generate a random number and cache the query only when that number falls within a certain range. For example, suppose we generate a random number between 1 and 100 and cache the query if that number is less than 10: the query will then be cached with a 10% probability on each request, so the more requests a query receives, the more likely it is to end up cached.
Query optimization
Finally, here are a few brief tips for optimizing queries:
- Change ORs to IN when we have more than one value to compare against the same column (see the sketch after this list).
- Minimize the cost of JOIN: the natural join, or JOIN, is the most expensive operation in relational databases, since it requires a Cartesian product followed by a selection of values. Some techniques we can use to minimize its cost are the following:
- Reorder the JOINs so that the relations with fewer rows are joined first, reducing the number of row combinations.
- Create subqueries that filter or limit the rows of large relations before performing the subsequent JOINs.
- Sometimes splitting a query into several is better than doing everything in a single query: a first query can obtain a few identifiers that we then pass with an IN to the next query, instead of performing a JOIN.
- Change the JOIN to EXISTS if no data needs to be displayed from the relation being joined (see the sketch after this list).
- Take into account the N+1 problem: it usually occurs when, to display a list the way we want, we need to perform an additional query for each item (the plus one of the N+1). In this case it is usually better to perform one or more additional JOINs in the query that retrieves the list of items; this way we get the list as we need it and do not have to launch additional queries for each item (see the sketch after this list).
- Always specify the column names in the SELECT; otherwise the DBMS will read all the columns from disk. The asterisk should only be used with COUNT, in which case the DBMS knows that it does not have to read all the columns.
- Create indexes: indexes allow much faster access to non-sequential data, but they are expensive to maintain, so it is not worth using them if we have many more writes than reads. We must analyze the execution plan of the queries (each DBMS has its own way of showing it) to know where indexes should be created.
Normally we will create the indexes on foreign keys and on the columns used in ORDER BY or WHERE clauses. If you create composite indexes, you must put the columns in the same order in which they are used in the queries (see the sketch after this list).
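To make some of these tips concrete, here is a minimal sketch using hypothetical Foo and Bar relations (the names and columns are only for illustration):
-- ORs changed to IN when comparing the same column against several values:
SELECT data1 FROM Foo WHERE id IN (1, 2, 3);
-- JOIN changed to EXISTS when nothing from Bar needs to be displayed:
SELECT f.data1 FROM Foo f WHERE EXISTS (SELECT 1 FROM Bar b WHERE b.foo_id = f.id);
-- N+1: instead of one extra query per item in the list, a single JOIN retrieves everything at once:
SELECT f.data1, b.data2 FROM Foo f INNER JOIN Bar b ON b.foo_id = f.id;
-- Composite index with the columns in the same order in which the queries use them, e.g. for SELECT data1 FROM Foo WHERE data2 = 'x' ORDER BY data3:
CREATE INDEX idx_foo_data2_data3 ON Foo (data2, data3);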
That’s all, I hope you enjoyed it.