Category Filter Optimization

Question

Category Filter Optimization

This recent question made me think about optimizing a category filter.

Suppose we want to create a database that references a huge number of audio tracks, with the release date and a list of world locations from which you can download an audio track.

The queries that we want to optimize are:

Give me the last 10 tracks downloaded from location A.
Give me the last 10 tracks downloaded from places A or B.
Give me the 10 most recent tracks downloaded from places A and B.

How to organize the structuring of this database? It's hard for me to find a simple solution that does not require reading all the tracks for at least one location ...

+4

sql query-optimization

Victor nicollet Sep 01 '11 at 15:44

source share

2 answers

In a classic relational scheme, you should have a many-to-many relationship between tracks and locations to avoid redundancy:

 CREATE TABLE tracks ( id INT, ... release_date DATETIME, PRIMARY KEY (id) ) CREATE TABLE locations ( id INT, ... PRIMARY KEY (id) ) CREATE TABLE tracks_locations ( location_id INT, track_id INT, ... PRIMARY KEY (location_id, track_id) ) SELECT tracks.* FROM tracks_locations LEFT JOIN tracks ON tracks.id = tracks_locations.location_id WHERE tracks_locations.location_id = A ORDER BY tracks.release_date DESC LIMIT 10

You can change this layout using table sections by location. The problem is that it depends on implementation problems or usage restrictions. For example, AFAIK in MySQL, you cannot have foreign keys in partitioned tables. To solve this problem, you can also have a collection of tables (call it "manual markup"), for example tracks_by_location_# , where # is the identifier of a known location. These tables can save filtered results and create / update / delete using triggers.

0

elitalon Sep 05 '11 at 11:13

source share

MatBailie · Accepted Answer · 2011-09-07T16:01:01+0000

To optimize these queries, you need to slightly de-normalize the data.

For example, you might have a track table containing the id , name and release date tracks, and a map_location_to_track table that describes which tracks can be downloaded from. To answer the "10 most recent tracks for location A," you need to get ALL tracks for location A from map_location_to_track , and then attach them to the track table to arrange them by release date and select the top 10.

If instead all of the data is in the same table, the sequencing step can be avoided. For instance...

 CREATE TABLE map_location_to_track ( location_id INT, track_id INT, release_date DATETIME, PRIMARY KEY (location_id, release_date, track_id) ) SELECT * FROM map_location_to_track WHERE location_id = A ORDER BY release_date DESC LIMIT 10

Having location_id as the first entry in the primary key ensures that the WHERE clause is just an index lookup. Then there is no need to reorder the data that he already ordered for us with the primary key, but instead just select 10 records at the end.

You can really join the track table to get a name, price, etc., but now you only need to do this for 10 entries, not just everything in this place.

To solve the same query for “locations A OR B”, there are several options that can be performed differently depending on the RDBMS you are using.

The first is simple, although some RDBMSs do not play well with IN ...

 SELECT track_id, release_date FROM map_location_to_track WHERE location_id IN (A, B) GROUP BY track_id, release_date ORDER BY release_date DESC LIMIT 10

The next option is almost identical, but still some RDBMSs do not play well with the OR logic applied to the INDEX.

 SELECT track_id, release_date FROM map_location_to_track WHERE location_id = A or location_id = B GROUP BY track_id, release_date ORDER BY release_date DESC LIMIT 10

In any case, the algorithm used to streamline the list of entries to 10 is hidden from you. It is a question to try and see; an index is still available so this MAY be performed.

An alternative is to explicitly define part of the approach in your SQL statement ...

 SELECT * FROM ( SELECT track_id, release_date FROM map_location_to_track WHERE location_id = A ORDER BY release_date DESC LIMIT 10 UNION SELECT track_id, release_date FROM map_location_to_track WHERE location_id = B ORDER BY release_date DESC LIMIT 10 ) AS data ORDER BY release_date DESC LIMIT 10 -- NOTE: This is a UNION and not a UNION ALL -- The same track can be available in both locations, but should only count once -- It in place of the GROUP BY in the previous 2 examples

The optimizer can still understand that these two combined datasets are ordered and therefore make the outer order very fast. However, even if not, an order for 20 items is pretty quick. More importantly, these are fixed overheads: it doesn’t matter if you have a billion tracks in each place, we simply combine the two lists of 10.

The most difficult thing to optimize is the AND condition, but even then the existence of the “TOP 10” restriction can help work wonders.

Adding a HAVING clause to IN or OR approaches can solve this problem, but, again, depending on your RDBMS, it may work less optimally.

 SELECT track_id, release_date FROM map_location_to_track WHERE location_id = A or location_id = B GROUP BY track_id, release_date HAVING COUNT(*) = 2 ORDER BY release_date DESC LIMIT 10

An alternative is to try the two-query approach ...

 SELECT location_a.* FROM ( SELECT track_id, release_date FROM map_location_to_track WHERE location_id = A ) AS location_a INNER JOIN ( SELECT track_id, release_date FROM map_location_to_track WHERE location_id = B ) AS location_b ON location_a.release_date = location_b.release_date AND location_a.track_id = location_b.track_id ORDER BY location_a.release_date DESC LIMIT 10

This time we cannot limit the two subqueries to only 10 records; for all we know, the last 10 at location a do not appear at location b at all. However, the primary key saves us again. Two data sets are omitted by release date, RDBMScan can start from the top record of each set and combine the two until there are 10 records, and then stop.

NOTE. Since release_date is in the primary key, and before track_id , make sure that it is used in the connection.

Depending on the DBMS, you do not even need subqueries. Perhaps you can just join the table without changing the RDBMS plan ...

 SELECT location_a.* FROM map_location_to_track AS location_a INNER JOIN map_location_to_track AS location_b ON location_a.release_date = location_b.release_date AND location_a.track_id = location_b.track_id WHERE location_a.location_id = A AND location_b.location_id = B ORDER BY location_a.release_date DESC LIMIT 10

All in all, a combination of three things makes this pretty effective:
- Partially De-Normalization of data to ensure it is friendly for our needs
- Knowing that we only need the first 10 results
- Knowing that we ever deal with two locations maximum

There are variations that can be optimized for any number of records and any number of locations, but they are much less effective than the problem posed in this matter.

Category Filter Optimization

More articles: