SQL query: inner join optimization between large tables

I have the following 3 tables in MySQL 4.x DB:

  • hosts: (300,000 entries)
    • id (UNSIGNED INT) PRIMARY KEY
    • name (VARCHAR 100)
  • paths: (6,000,000 entries)
    • id (UNSIGNED INT) PRIMARY KEY
    • name (VARCHAR 100)
  • urls: (7,000,000 entries)
    • host (UNSIGNED INT) PRIMARY KEY <--- links to hosts.id
    • path (UNSIGNED INT) PRIMARY KEY <--- links to paths.id

As you can see, the schema is really simple, but the problem is the amount of data in these tables.

Here is the query that I run:

 SELECT CONCAT(H.name, P.name) FROM hosts AS H INNER JOIN urls AS U ON H.id = U.host INNER JOIN paths AS P ON U.path = P.id;

This query works fine, but it takes 50 minutes to run. Does anyone know how I can speed it up?

Thanks in advance. Nicolas

+6
optimization sql inner-join mysql bigtable
14 answers

For one thing, I would not do the CONCAT in the query. Do it outside.

But really, the query is slow because you are pulling back millions of rows.
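
For illustration, the same query with the concatenation left to the client (nothing else changed):

 SELECT H.name, P.name
 FROM hosts AS H
 INNER JOIN urls AS U ON H.id = U.host
 INNER JOIN paths AS P ON U.path = P.id;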

+1

Perhaps you should include a WHERE clause? Or do you really need ALL the data?
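
For example, if only certain hosts are of interest, a filter like this (the host value here is a made-up placeholder) would shrink the result set dramatically:

 SELECT CONCAT(H.name, P.name)
 FROM hosts AS H
 INNER JOIN urls AS U ON H.id = U.host
 INNER JOIN paths AS P ON U.path = P.id
 WHERE H.name = 'www.example.com';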

+5

This looks like a case where excessive zeal for surrogate keys is slowing you down. If the tables were:

  • hosts:

    • name (VARCHAR 100) PRIMARY KEY
  • paths:

    • name (VARCHAR 100) PRIMARY KEY
  • urls:

    • host (VARCHAR 100) PRIMARY KEY <--- links to hosts.name
    • path (VARCHAR 100) PRIMARY KEY <--- links to paths.name

Then your query would not require a join at all:

 SELECT CONCAT(U.host, U.path) FROM urls U; 

True, the urls table would take up more disk space - but does that matter?

EDIT: On second thought, what is the point of the paths table? How often do different hosts actually share the same paths?

Why not:

  • hosts:

    • name (VARCHAR 100) PRIMARY KEY
  • urls:

    • host (VARCHAR 100) PRIMARY KEY <--- links to hosts.name
    • path (VARCHAR 100) PRIMARY KEY <--- no link anywhere

EDIT2: Or if you really need a surrogate key for hosts:

  • hosts:

    • id integer PRIMARY KEY
    • name (VARCHAR 100)
  • urls:

    • host integer PRIMARY KEY <--- links to hosts.id
    • path (VARCHAR 100) PRIMARY KEY <--- no link anywhere

    SELECT CONCAT(H.name, U.path) FROM urls U JOIN hosts H ON H.id = U.host;

+4

Overall, the best advice is to measure and profile to see what actually takes the time. But here are my thoughts on specific things to look at.

(1) I would say you want to make sure that indexes are NOT used when executing this query. Since you have no filtering conditions, it should be more efficient to fully scan all the tables and then combine them with a sort or hash operation (see the sketch after these points).

(2) String concatenation certainly takes some time, but I don't understand why people recommend removing it. You would then have to perform the concatenation in some other piece of code, where it would still take about the same amount of time (unless MySQL is for some reason terribly slow at concatenating strings).

(3) Transferring the data from server to client probably takes considerable time, quite possibly more than the time the server needs to retrieve the data. If you have tools to trace this kind of thing, use them. If you can increase the fetch size in your client, experiment with different sizes (for example, in JDBC, use Statement.setFetchSize()). This can be significant even when the client and server are on the same host.
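
To experiment with point (1), MySQL's index hints can steer the optimizer away from the primary keys; a minimal sketch (whether this actually helps depends on which join strategies your MySQL version supports):

 SELECT CONCAT(H.name, P.name)
 FROM hosts AS H
 INNER JOIN urls AS U IGNORE INDEX (PRIMARY) ON H.id = U.host
 INNER JOIN paths AS P IGNORE INDEX (PRIMARY) ON U.path = P.id;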

+2

Have you declared indexes on the join attributes?

PS: See here [broken link] for indexes on MySQL 4.x
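
If not, something along these lines may be worth trying (the index name here is invented; note that if urls has a composite primary key on (host, path), the host column is already covered as its leftmost prefix, while path is not):

 ALTER TABLE urls ADD INDEX idx_urls_path (path);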

+1

Try optimizing your tables before running the query:

 optimize table hosts, paths, urls; 

This can save you some time, especially if rows have been deleted from the tables. (See the MySQL documentation for more on OPTIMIZE TABLE.)

+1

I would try creating a new table holding exactly the data you want to get out. That means you lose real-time freshness, but you gain speed. Isn't this idea similar to OLAP or something like that?

Of course, you would need to refresh this table periodically (daily or whatever suits you).
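
A minimal sketch of such a summary table (the table and column names here are invented; CREATE TABLE ... SELECT is supported by MySQL):

 CREATE TABLE url_cache
   SELECT CONCAT(H.name, P.name) AS url
   FROM hosts AS H
   INNER JOIN urls AS U ON H.id = U.host
   INNER JOIN paths AS P ON U.path = P.id;

The daily refresh could then be a TRUNCATE TABLE url_cache followed by an INSERT ... SELECT of the same query.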

+1

I am not a MySQL expert, but it looks like MySQL primary keys are clustered - you want to make sure that is the case with your primary keys; clustered indexes will definitely help speed things up.

One thing, though - I do not believe you can have two "primary" keys on any table, so your urls table looks rather suspicious. First of all, make sure those two columns in the urls table are indexed to the hilt - there should be an index on each of them, because you are joining on them, so the DBMS needs a way to find those rows quickly; this may well be what is happening in your case. If you are doing full scans of tables with that many rows, then yes, you could sit there for quite a while as the server tries to find everything you asked for.

I also suggest removing the CONCAT function from the select statement and seeing how that affects your results. I would be amazed if it were not a factor. Just pull back both columns and handle the concatenation afterwards, and see how that goes.

Finally, have you figured out where the bottleneck actually is? A straight join of three tables with a few million rows should not take very long (I would expect maybe a second or so, just eyeballing your tables and query) if the tables are correctly indexed. But if you are pushing those rows over a slow or already saturated network adapter, to an overloaded application server, etc., then the slowness may have nothing to do with your query and everything to do with what happens after it. Seven million rows is quite a lot of data to assemble and move around, however quickly the rows themselves can be found. Try selecting just one row instead of seven million and see how that compares. If that is fast, the problem is not the query, it is the result set.
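
A quick way to run that one-row check - the same query, just capped with LIMIT:

 SELECT CONCAT(H.name, P.name)
 FROM hosts AS H
 INNER JOIN urls AS U ON H.id = U.host
 INNER JOIN paths AS P ON U.path = P.id
 LIMIT 1;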

+1

Since your result set returns all the data, there is very little optimization to be done at all. You are scanning one table in full and then joining the other tables to it via their indexes.

Are the primary keys clustered? That ensures the data is stored on disk in index order, avoiding bouncing around different parts of the disk.

In addition, you can split the data across multiple disks. If you put urls on the PRIMARY disk and paths/hosts on a SECONDARY one, you will get higher disk throughput.

+1

You need to look at your server configuration. MySQL's default memory settings will hurt performance on tables of this size. If you are using the defaults, increase at least key_buffer_size and join_buffer_size to at least 4x their defaults, possibly much more. Look through the documentation; there are other memory settings you can tune.
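
A sketch of raising them at runtime (the values here are arbitrary examples - size them to your RAM; on servers where SET GLOBAL is unavailable, put the equivalents in my.cnf and restart):

 SET GLOBAL key_buffer_size = 256*1024*1024;
 SET GLOBAL join_buffer_size = 4*1024*1024;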

MySQL performance is ridiculous here: once your tables grow past a certain size, queries that return most of the data see performance go down the toilet. Unfortunately, it cannot tell you when that threshold has been reached. It looks to me like you have hit it.

+1

CONCAT is definitely slowing you down. Can we see the results of MySQL's EXPLAIN for this query? (See the EXPLAIN documentation.)

The biggest win would be to pull back only the data you need. If you can retrieve fewer records, that will speed you up more than anything else. But the EXPLAIN output should help us figure out whether any indexes would help.
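
For reference, running EXPLAIN on the original query is just a matter of prefixing it:

 EXPLAIN SELECT CONCAT(H.name, P.name)
 FROM hosts AS H
 INNER JOIN urls AS U ON H.id = U.host
 INNER JOIN paths AS P ON U.path = P.id;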

0

I understand that you need the complete list of URLs - all 7 million entries. Perhaps, as Mitch said, you should use a WHERE clause to filter your results. Or perhaps the time is mostly spent fetching and displaying the records.

Check the time for this query:

 select count(*) FROM hosts AS H INNER JOIN urls as U ON H.id = U.host INNER JOIN paths AS P ON U.path = P.id 

If it is still slow, I would check the time for select count(*) from urls,

then

 select count(*) from urls u inner join hosts h on u.host = h.id 

then

 select count(*) from urls u inner join hosts h on u.host = h.id inner join paths p on u.path = p.id 

to find the source of the slowdown.

Also, sometimes reordering the query can help:

 SELECT CONCAT(u.host, u.path) from urls u inner join hosts h on u.host = h.id inner join paths p on u.path = p.id 
0

I cannot say for sure about MySQL, but I know that in SQL Server primary keys automatically create an index and foreign keys do not. Make sure your foreign key fields are indexed.

0

Since I'm not a big fan of MySQL, I would ask whether you have tried PostgreSQL. In that database you would want to make sure your work_mem setting is quite high, but you can set it per connection, with SET work_mem = '64MB', for example.

Another suggestion is to look at how much de-duplication the paths table actually buys you. Are there really that many URLs sharing the same paths?

Another thing that may or may not help is using fixed-length text fields instead of varchars. That used to make a difference in speed, but I'm not sure about current database engines.

If you use PostgreSQL, that would let you use JOIN USING, but even in MySQL I prefer the convention anyway: name the id field the same in every table. Instead of id in hosts and host in urls, call it host_id in both places.
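
A sketch of what that buys you, assuming the rename to host_id described above (|| is PostgreSQL's standard string-concatenation operator):

 SELECT h.name || u.path AS url
 FROM urls u
 JOIN hosts h USING (host_id);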

Now a few more comments. :) The data layout you have here is very useful when you select a small set of rows, perhaps all the URLs from the same domain. It can also help a lot if your queries often need to sequentially scan the urls table for other data stored in it, because the scan can skip over the large text fields (unless your database stores long text out-of-line via pointers anyway, in which case it does not matter).

However, if you almost always select all of the host and path data, then it makes more sense to store it all in one table.

0
