MySQL SELECT DISTINCT statement takes 10 minutes

I'm fairly new to MySQL, and I'm trying to select a distinct set of rows using this statement:

    SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
    FROM `transportdata`.stoppoints AS sp
    INNER JOIN `vehicledata`.gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
    INNER JOIN `vehicledata`.gtfstrips AS trip ON st.trip_id = trip.trip_id
    INNER JOIN `vehicledata`.gtfsroutes AS route ON trip.route_id = route.route_id
    INNER JOIN `vehicledata`.gtfsagencys AS agency ON route.agency_id = agency.agency_id
    WHERE agency.agency_id IN (1,2,3,4);

However, the select statement takes about 10 minutes, so something is clearly wrong.

One significant factor is that the gtfsstop_times table is huge (about 250 million records).

The indexes seem to be set up correctly; all of the joins listed above use indexed columns. Table sizes are approximately:

    gtfsagencys - 4 rows
    gtfsroutes - 56,000 rows
    gtfstrips - 5,500,000 rows
    gtfsstop_times - 250,000,000 rows
    `transportdata`.stoppoints - 400,000 rows

The server has 22 GB of memory, I have set the InnoDB buffer pool to 8 GB, and I am using MySQL 5.6.

Can anyone see a way to make this run faster? Or indeed, at all!

Does it matter that the stoppoints table is in a different schema?

EDIT: EXPLAIN SELECT ... returns this:

(Screenshot of the EXPLAIN output; image not preserved.)

+7
4 answers

It looks like you are trying to find a set of stop points based on certain criteria, and you are using SELECT DISTINCT to avoid duplicate stop points. Is that correct?

atcoCode looks like a unique key for your stoppoints table. Is that correct?

If so, try the following:

    SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
    FROM `transportdata`.stoppoints AS sp
    JOIN (
        SELECT DISTINCT st.fk_atco_code AS atcoCode
        FROM `vehicledata`.gtfsroutes AS route
        JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
        JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
        WHERE route.agency_id BETWEEN 1 AND 4
    ) ids ON sp.atcoCode = ids.atcoCode

This does a few things. It eliminates a table (agency) that you don't appear to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally, it moves the DISTINCT processing from a place where it has to handle a whole ton of data (every selected column of every joined row) into a subquery, where it only has to handle the ID values.

(JOIN and INNER JOIN are the same. I used JOIN to make the query easier to read.)

This should speed things up for you. But I must say, a quarter-gigarow table is a big table.
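One way to check whether the rewrite actually changes anything is to run EXPLAIN on the inner query by itself and compare it with the EXPLAIN output of the original statement. This is only a sketch of what to look at, not a guaranteed diagnosis: the interesting parts are the join order, the `rows` estimate for gtfsstop_times, and whether `Extra` mentions a temporary table for the DISTINCT step.

```sql
-- Plan for the subquery only: DISTINCT here works on one column
-- (fk_atco_code) instead of four columns from stoppoints.
EXPLAIN
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4;
```

If the plan still starts from gtfsstop_times rather than from gtfsroutes, the optimizer is reading the big table first, and the indexes on trip_id and route_id are worth re-checking.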

+6

With 250M rows, I would partition the gtfsstop_times table on one column. Each partition can then be joined in a separate query, and those queries can be run in parallel in separate threads; you will need to merge the result sets afterwards.
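A rough sketch of what that could look like in MySQL 5.6 follows. The choice of trip_id as the partitioning column and the count of 16 partitions are assumptions made for illustration; the answer does not say which column to use. MySQL also requires the partitioning column to be part of every unique key on the table (including the primary key), and partitioned InnoDB tables cannot have foreign keys, so the table definition may need changes first.

```sql
-- Assumption: trip_id is part of every unique key on gtfsstop_times.
-- Rebuilding a 250M-row table this way will take a long time.
ALTER TABLE `vehicledata`.gtfsstop_times
    PARTITION BY KEY (trip_id)
    PARTITIONS 16;

-- Run one query like this per partition (p0 .. p15), each on its own
-- connection. PARTITION (...) selection is available from MySQL 5.6.
SELECT DISTINCT st.fk_atco_code
FROM `vehicledata`.gtfsstop_times PARTITION (p0) AS st
JOIN `vehicledata`.gtfstrips AS trip ON trip.trip_id = st.trip_id
JOIN `vehicledata`.gtfsroutes AS route ON route.route_id = trip.route_id
WHERE route.agency_id IN (1, 2, 3, 4);
```

The same fk_atco_code can show up in more than one partition, so the merged result still has to be de-duplicated on the client before looking up the stop points.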

+3

The trick is to reduce the number of gtfsstop_times rows that SQL has to evaluate. In this case, SQL first evaluates every row in the inner join of gtfsstop_times and `transportdata`.stoppoints, right? How many rows does `transportdata`.stoppoints have? Then SQL evaluates the WHERE clause, and then it evaluates DISTINCT. How does it do DISTINCT? By looking at every row multiple times to determine whether there are other rows like it. That would take forever, right?

However, GROUP BY quickly squashes all matching rows together without evaluating each one. I usually use joins to quickly reduce the number of rows the query has to evaluate, and then I look at my grouping.

In this case, you want to replace DISTINCT with grouping.

Try this:

    SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
    FROM `transportdata`.stoppoints AS sp
    INNER JOIN `vehicledata`.gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
    INNER JOIN `vehicledata`.gtfstrips AS trip ON st.trip_id = trip.trip_id
    INNER JOIN `vehicledata`.gtfsroutes AS route ON trip.route_id = route.route_id
    INNER JOIN `vehicledata`.gtfsagencys AS agency ON route.agency_id = agency.agency_id
    WHERE agency.agency_id IN (1,2,3,4)
    GROUP BY sp.name, sp.longitude, sp.latitude, sp.atcoCode

+2

There are other valuable answers to your question; mine is an addition to them. I am assuming that sp.atcoCode and st.fk_atco_code are indexed columns in your tables.

If you can check and confirm that the agency IDs in the WHERE clause are valid, you can drop `vehicledata`.gtfsagencys from the joins, since you are not selecting any columns from that table.

    SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
    FROM `transportdata`.stoppoints AS sp
    INNER JOIN `vehicledata`.gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
    INNER JOIN `vehicledata`.gtfstrips AS trip ON st.trip_id = trip.trip_id
    INNER JOIN `vehicledata`.gtfsroutes AS route ON trip.route_id = route.route_id
    WHERE route.agency_id IN (1,2,3,4);

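The "check and confirm" step can be done with two small queries before removing the join. This is a sketch that assumes agency_id is the key of gtfsagencys, as the original join condition suggests; the alias orphan_routes is just an illustrative name.

```sql
-- Each agency ID used in the WHERE clause should come back as one row.
SELECT agency_id
FROM `vehicledata`.gtfsagencys
WHERE agency_id IN (1, 2, 3, 4);

-- Routes that reference one of those IDs without a matching agency row.
-- If this count is zero, the inner join to gtfsagencys filters nothing out.
SELECT COUNT(*) AS orphan_routes
FROM `vehicledata`.gtfsroutes AS route
LEFT JOIN `vehicledata`.gtfsagencys AS agency
    ON route.agency_id = agency.agency_id
WHERE route.agency_id IN (1, 2, 3, 4)
  AND agency.agency_id IS NULL;
```

With only 4 agencies and about 56,000 routes, both checks touch small tables and should return almost immediately.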
+1
