Poor database performance when using ORDER BY

I work with a nonprofit organization that maps solar potential in the United States. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is quick as long as the ORDER BY line is commented out, but with it included the same query runs seemingly forever (185 ms without the sort versus 25 minutes with it). What steps can I take to ensure this and other queries run in a more manageable and reasonable amount of time?

    select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
    from global_site A cross join na_utility_line B
    where (A.power_peak between 1.0 AND 100.0)
      and A.area_acre >= 500
      and A.solar_avg >= 5.0
      AND A.pc_num <= 1000
      and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
      and B.volt_mn_kv >= 69
      and B.fips_code like '%US06%'
      and B.status = 'active'
      and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
    --order by A.area_acre
    offset 0 limit 11;
+4
7 answers

Sorting is not your problem: in fact, the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort, where the result set is scanned while keeping only a small sort buffer holding the top N rows up to date.

    select count(*) from (1 million row table);               -- 0.17 s
    select * from (1 million row table) order by x limit 10;  -- 0.18 s
    select * from (1 million row table) order by x;           -- 1.80 s

So you can see that the Top-10 sort adds only about 10 ms over a plain fast count(*), versus much longer for a real sort. It is a very neat feature; I use it a lot.

OK, now without EXPLAIN ANALYZE it is impossible to be sure, but my feeling is that the cross join is the real problem. You basically filter the rows in both tables using:

    where (A.power_peak between 1.0 AND 100.0)
      and A.area_acre >= 500
      and A.solar_avg >= 5.0
      AND A.pc_num <= 1000
      and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
      and B.volt_mn_kv >= 69
      and B.fips_code like '%US06%'
      and B.status = 'active'

Now, I don't know how many rows are selected in each table (only EXPLAIN ANALYZE can tell), but it probably matters. Knowing those numbers would help.
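
To get those numbers, something like the following would do (a sketch, reusing the tables and filters from the question):

    -- How many rows of A survive its filters?
    explain analyze select count(*)
    from global_site A
    where (A.power_peak between 1.0 and 100.0)
      and A.area_acre >= 500
      and A.solar_avg >= 5.0
      and A.pc_num <= 1000
      and (A.fips_level1 = '06' and A.fips_country = 'US' and A.fips_level2 = '025');

    -- How many rows of B survive its filters?
    explain analyze select count(*)
    from na_utility_line B
    where B.volt_mn_kv >= 69
      and B.fips_code like '%US06%'
      and B.status = 'active';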

Then we get to the worst part, the CROSS JOIN condition:

 and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000)) 

This means that every row of A is matched against every row of B (so this expression will be evaluated a great many times), using a bunch of rather complex, slow, CPU-intensive functions.

Of course it's awfully slow!

When you remove the ORDER BY, postgres just happens (by chance?) to hit a bunch of matching rows right at the beginning, outputs those, and stops once the LIMIT is reached.

Here is a small example:

Tables a and b are identical, each containing 1,000 rows and a column b of type BOX.

    select * from a cross join b where (a.b && b.b);  -- 0.28 s

Here, 1,000,000 overlap tests (the && operator) complete in 0.28 s. The test data set is generated so that the result set contains only 1,000 rows.

    create index a_b on a using gist(b);
    create index b_b on b using gist(b);
    select * from a cross join b where (a.b && b.b);  -- 0.01 s

Here, the index is used to optimize the cross join, and the speed is ridiculously fast.

So here is what you need to do to optimize the geometry matching:

  • add columns that will cache:
    • ST_Centroid(A.wkb_geometry)
    • ST_Buffer(B.wkb_geometry, 1000)

There is NO POINT in recomputing these slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.

  • add columns of type BOX that will cache:

    • the bounding box of ST_Centroid(A.wkb_geometry)
    • the bounding box of ST_Buffer(B.wkb_geometry, 1000)
  • add gist indexes on those BOXes

  • add a box overlap test (using the && operator), which will use the indexes

  • keep the ST_Within test, which will act as a final filter on the rows that pass (a sketch of the whole recipe follows below)
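
Here is a minimal sketch of that recipe, assuming the tables from the question; the column, trigger, and index names are hypothetical, and only table A's trigger is shown (B needs the analogous one):

    -- Cached BOX columns (names are hypothetical).
    alter table global_site     add column centroid_box box;
    alter table na_utility_line add column buffer_box box;

    -- Populate once; PostGIS provides a box(geometry) cast.
    update global_site     set centroid_box = box(ST_Centroid(wkb_geometry));
    update na_utility_line set buffer_box   = box(ST_Buffer(wkb_geometry, 1000));

    -- Keep the cache current on insert/update (shown for A only).
    create function global_site_cache_box() returns trigger as $$
    begin
      new.centroid_box := box(ST_Centroid(new.wkb_geometry));
      return new;
    end;
    $$ language plpgsql;

    create trigger global_site_cache_box_trg
      before insert or update on global_site
      for each row execute procedure global_site_cache_box();

    -- Gist indexes on the BOXes.
    create index global_site_centroid_box_idx on global_site using gist (centroid_box);
    create index na_utility_line_buffer_box_idx on na_utility_line using gist (buffer_box);

    -- Then add the cheap, indexed pre-filter to the original query,
    -- just before the exact (and expensive) ST_within test:
    --   and A.centroid_box && B.buffer_box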

Perhaps you can just index the ST_Centroid and ST_Buffer expressions directly... and use the (indexed) "contains" operator, see here:

http://www.postgresql.org/docs/8.2/static/functions-geometry.html
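
For instance, a sketch of such expression-based indexes (untested, and only valid if the functions are immutable, which they must be to appear in an index expression; the index names are hypothetical):

    -- Functional gist indexes over the slow calls.
    create index global_site_centroid_gist
      on global_site using gist (ST_Centroid(wkb_geometry));

    create index na_utility_line_buffer_gist
      on na_utility_line using gist (ST_Buffer(wkb_geometry, 1000));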

+5

I would suggest creating an index on area_acre. You can take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html

I would recommend doing such things outside of peak hours, as it can be somewhat intensive with a large amount of data. One thing to watch out for is that you will also need to rebuild the indexes on a schedule to maintain performance over time. Again, that schedule should be outside peak hours.
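
A sketch of both steps (the index name is an assumption); CONCURRENTLY lets the index build without blocking writes:

    -- Build the index without locking out writes.
    create index concurrently global_site_area_acre_idx on global_site (area_acre);

    -- Scheduled rebuild, run outside peak hours (REINDEX does take a lock).
    reindex index global_site_area_acre_idx;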

You might want to take a look at this article from a fellow SO'er and his experience with database performance degrading over time and being restored by rebuilding an index: Why PostgreSQL query performance decreases over time, but is restored when the index is rebuilt

+2

If the A.area_acre field is not indexed, that may be slowing the query down. You can run the query with EXPLAIN to see what it is doing at runtime.
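
For example, a quick check (a sketch) of whether the area_acre filter can use an index:

    -- EXPLAIN prints the plan without executing the query;
    -- a "Seq Scan on global_site" here suggests no usable index on area_acre.
    explain select count(*) from global_site where area_acre >= 500;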

+1

First, I would look at creating indexes, making sure your db gets vacuumed, and increasing the shared_buffers and work_mem settings in your db configuration.
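
A minimal sketch of those steps (the configuration values are illustrative assumptions, not tuned recommendations):

    -- Reclaim dead rows and refresh the planner's statistics.
    vacuum analyze global_site;
    vacuum analyze na_utility_line;

    -- In postgresql.conf (illustrative values; size them to your RAM):
    --   shared_buffers = 2GB
    --   work_mem = 64MB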

0

The first thing to look at is whether you have an index on the field that you order by. If not, adding one should give a significant performance boost. I don't know postgresql, but it would be something similar to:

 CREATE INDEX area_acre ON global_site(area_acre) 

As noted in other answers, building an index is intensive when working with a large data set, so do this during off-peak hours.

0

I am not familiar with PostgreSQL optimization, but it sounds like what happens when the query is executed with the ORDER BY clause is that the entire result set is created, then sorted, and then the top 11 rows are taken from the sorted result. Without the ORDER BY, the query engine can simply generate the first 11 rows in whatever order it likes and is then done.

Having an index on the area_acre field may or may not help with the sorting (ORDER BY), depending on how the result set is built. In theory, the engine could generate the result set by traversing the global_site table via the index on area_acre; in that case, the results would be generated in the desired order (and it could stop once 11 rows had been produced). If it does not generate the results in that order (and it looks like it might not), then the index will not help the sort.

One thing you could try is removing the "CROSS JOIN" from the query. I doubt this will make a difference, but it's worth a test. Since the WHERE clause relates the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that using the CROSS JOIN syntax causes the optimizer to make an undesirable choice.
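
A sketch of that rewrite, moving the spatial test into an explicit inner join condition (only one of the original filters is repeated here, for brevity):

    select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
    from global_site A
    join na_utility_line B
      on ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer(B.wkb_geometry, 1000))
    where A.power_peak between 1.0 and 100.0
      -- ...plus the remaining filters from the original query...
    order by A.area_acre offset 0 limit 11;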

Otherwise (besides making sure indexes exist for the filtered fields), you could play a bit of a guessing game with the query. One condition that stands out is area_acre >= 500. This means the query engine will consider all rows matching that condition, even though only the first 11 rows are ultimately taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue, where somevalue is the part that has to be tuned by guessing so that you still get at least 11 rows. This is, however, a rather hacky approach, so I mention it with some reservation.

0

Have you considered creating expression-based indexes to support the more complex joins and conditions?

0
