How to Optimize Database Queries - Basics

Most questions on this topic seem to be very specific, and while I appreciate specific examples, I'm interested in the fundamentals of SQL optimization. I'm quite comfortable with SQL and have a background in hardware / low-level software.

What I want are both the software tools and a mental model of how the MySQL database actually performs its lookups, so that when I look at a query I know, for example, what difference the order of joins and of the other operators makes.

I want to know why, for example, an index helps and what exactly it changes. I want to know precisely what happens differently, and I want to know how I can actually look at what is happening. I don't need a tool that breaks down every step of my SQL; I just want to be able to pop the hood myself, so that if nobody can tell me which column to index, I can pull out a sheet of paper and, given some time, work out the answer.

Databases are complex, but they are not that complex, and there must be some good material for learning the basics, so that you know how to approach the optimization problems you encounter even when you haven't tracked down the exact answer on a forum.

Please recommend something short to read that is concise, intuitive, and not afraid to get down to the low-level nuts and bolts. I prefer free online resources, but if a book recommendation hits the nail on the head, I would consider it.

+6
optimization sql database mysql query-optimization
5 answers

Everything said below about conditions applies equally to the conditions given on each join ... they work the same way.

Suppose we write

select name from customer where customerid=37; 

Somehow the DBMS must find the record or records with customerid = 37. If there is no index, the only way to do this is to read every record in the table, comparing each customerid to 37. Even when it finds one, it has no way of knowing there is only one, so it must keep searching for others.

If you create an index on customerid, the DBMS has a way to search that index very quickly. It is not a sequential search but, depending on the database, a binary search or some other efficient method. Exactly how doesn't really matter; accept that it is much faster than a sequential scan. The index then points directly to the matching record or records. Furthermore, if you declare the index "unique", the database knows there can only be one match, so it doesn't waste time looking for a second one. (And the DBMS will prevent you from adding a second.)
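As a concrete sketch (MySQL-style syntax; the index names here are invented for illustration), the indexes discussed above would be created along these lines:

 create index idx_customer_id on customer (customerid);
 -- or, if customerid is known to be unique:
 create unique index idx_customer_id on customer (customerid);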

Now consider this query:

 select name from customer where city='Albany' and state='NY'; 

Now we have two conditions. If you have an index on only one of those fields, the DBMS will use that index to find a subset of the records and then search that subset sequentially. For example, if you have an index on state, the DBMS will quickly find the first record for NY, then search sequentially for city = 'Albany', stopping when it reaches the last record for NY.

If you have an index that includes both fields, i.e. "create index ... on customer (state, city)", then the DBMS can go straight to the desired records.

If you have two separate indexes, one on each field, the DBMS has rules it applies to decide which index to use. Again, exactly how this is done depends on the particular DBMS you are using, but basically it keeps statistics on the total number of records, the number of distinct values, and the distribution of values. It uses those statistics to pick an index, and then sequentially searches the records found through that index for the ones that satisfy the other condition. In this case the DBMS would probably observe that there are far more cities than there are states, so the city index gets it close to the Albany records quickly. It then searches those sequentially, checking the state of each against NY. Any records for Albany, California would be skipped.
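For reference, here is a sketch of the indexing choices discussed above (MySQL-style syntax; the index names are invented for illustration):

 -- one composite index covering both conditions:
 create index idx_customer_state_city on customer (state, city);

 -- or two separate single-column indexes, from which the DBMS picks one:
 create index idx_customer_state on customer (state);
 create index idx_customer_city on customer (city);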

Every join requires some kind of lookup.

Let's say we write

 select customer.name from transaction join customer on transaction.customerid=customer.customerid where transaction.transactiondate='2010-07-04' and customer.type='Q'; 

Now the DBMS has to decide which table to read first, select the relevant records from it, and then find the matching records in the other table.

If you have indexes on transaction.transactiondate and customer.customerid, the best plan is probably to find all transactions with that date, then for each of them find the customer with the matching customerid, and then check that the customer is of the correct type.

If you don't have an index on customer.customerid, the DBMS can still find the transactions quickly, but for each transaction it will have to sequentially scan the customer table looking for the matching customer. (This is likely to be very slow.)

Suppose the only indexes you have are on transaction.customerid and customer.type. Then the DBMS will likely use a completely different plan. It will probably scan the customer table for all customers of the correct type, then for each of them find all the transactions for that customer and search those sequentially for the desired date.
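To make those alternatives concrete, here is a sketch of the two index sets being compared (MySQL-style syntax; the index names are invented for illustration):

 -- plan 1: find transactions by date, then look up each customer by id
 create index idx_transaction_date on transaction (transactiondate);
 create index idx_customer_id on customer (customerid);

 -- plan 2: find customers by type, then their transactions by customerid
 create index idx_customer_type on customer (type);
 create index idx_transaction_customer on transaction (customerid);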

The single biggest key to optimization is figuring out which indexes will really help and creating those indexes. Extra, unused indexes are a burden on the database, because maintaining them takes work, and if they are never used that work is wasted.

You can see which indexes the DBMS will use for any given query with the EXPLAIN command. I use this all the time to check whether my queries are well optimized or whether I should create additional indexes. (Read the documentation for this command for an explanation of its output.)
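As a minimal sketch of that workflow, using the join query from above (the exact columns in EXPLAIN's output vary between MySQL versions):

 explain select customer.name
 from transaction join customer on transaction.customerid=customer.customerid
 where transaction.transactiondate='2010-07-04' and customer.type='Q';

For each table the output shows, among other things, which index (if any) the optimizer chose and roughly how many rows it expects to examine.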

Caveat: Remember that I said the DBMS keeps statistics on the number of records, the number of distinct values, and so on for each table. EXPLAIN may give you a completely different plan today than it did yesterday if the data has changed. For example, if you have a query that joins two tables and one of them is very small while the other is large, the plan will tend to read the small table first and then look up matching records in the large one. Adding records can change which table is larger, and thus lead the DBMS to change its plan. So you should run your EXPLAINs against a database with realistic data. Running them against a test database with 5 records per table is far less valuable than running them against the live database.

Well, much more could be said, but I don't want to write a book here.

+6

Say you are looking for a friend in another city. One way would be to go door to door and ask whether this is the house you're looking for. Another way is to look at a map.

An index is a map of a table. It can tell the database engine exactly where to look. So you index each column you expect to search on, and leave out the columns you only ever read data from and never search on.

There is good technical reading available on how indexes work and on ORDER BY optimization. And if you want to see exactly what is happening, you want EXPLAIN.

+7

Do not think about database optimization. Think about query optimization.

As a rule, you optimize one case at the expense of others. You just need to decide which cases interest you.

+2

"I'm particularly interested in how indexes affect joins."

As an example, I'll take the case of an equijoin (SELECT ... FROM A, B WHERE A.x = B.y).

If there are no indexes at all (which is possible in theory, but I think not in practice in SQL), then basically the only way to compute the join is to take the whole of table A and partition it on x, take the whole of table B and partition it on y, then match up the partitions and, finally, compute the result rows for each pair of matching partitions. This is expensive (or even outright impossible because of memory limits) for all but the smallest tables.

The same story applies if there are indexes on A and/or B, but none of them has x (respectively y) as its first attribute.

If there is an index on x but not on y (or vice versa), another possibility opens up: scan table B, and for each row take the value of y, look that value up in the index on x, and fetch the corresponding rows of A to compute the join. Note that this still doesn't buy you much unless additional restrictions apply (AND z = ...) so that there are only a few matches between the x and y values.

If ordered indexes (hash-based indexes are not ordered) exist on both x and y, then a third possibility opens up: perform a merge scan over the indexes themselves (the indexes will presumably be smaller than the tables, so scanning them takes less time), and for matching x/y values compute the join of the corresponding rows.
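As a sketch in SQL terms, using the answer's hypothetical tables A and B, this is the setup that makes the index-based strategies above possible (index names invented for illustration):

 create index idx_a_x on A (x); -- enables probing A by x for each row of B
 create index idx_b_y on B (y); -- with ordered indexes on both sides, a merge scan over the indexes becomes possible
 select A.*, B.* from A join B on A.x = B.y;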

That is the basic picture. Variations arise for joins on x > y, etc.

+1

I don't know about MySQL tools, but in MS SQL Server you have a tool that displays all the operations a query will perform and how much each of them contributes to processing the whole query.

Using this tool taught me more about how queries are handled by the query optimizer than I think any book could, because what the optimizer does is often not easy to understand. By tweaking the query, and sometimes the underlying database, I could see how each change affected the query plan. There are certain key points to keep in mind when writing queries, but it sounds like you already have those, and optimization in your case is much more about this kind of exploration than about any general rules. After several years of database development I looked at a couple of books specifically aimed at optimization on SQL Server and found very little useful information in them.
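For what it's worth, the textual counterparts of that graphical tool can be switched on per session in SQL Server; a sketch (the output details vary by version):

 set statistics io on;   -- report logical and physical reads per table
 set statistics time on; -- report parse/compile and execution times
 set showplan_all on;    -- return the estimated plan instead of executing the query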

A quick googling came up with the following: http://www.mysql.com/products/enterprise/query.html, which looks like a similar tool.

This is all at the query level, of course; database-level optimization is a whole different kettle of fish, where you look at options such as spreading the database across hard drives and so on. In SQL Server, at least, you can choose to place tables on separate disks and even separate disk platters, and this can have a big effect, since the disks and heads can work in parallel. Another aspect is writing your queries so that the database can execute them on multiple threads and processors in parallel, but both of these topics again depend on the database engine and even on the version you are using.

+1
