
Why is the following Postgres SQL query taking so long?

The raw query is as follows:

SELECT "TIME", "TRADEPRICE" FROM "YEAR" where "DATE"='2010-03-01' and "SECURITY"='STW.AX' AND "TIME" < '10:16:00' AND "TYPE" = 'TRADE' ORDER BY "TIME" ASC LIMIT 3 

I built three indexes as follows

 Columns "DATE" DESC NULLS LAST Columns "SECURITY" DESC NULLS LAST Columns "TIME" DESC NULLS LAST 

I do not index TYPE because it accepts only one of two possible values
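For reference, the indexes described above were presumably created roughly like this (security_desc and date_desc appear in the plan below; the name of the time index is a guess):

  -- Reconstructed from the description above; only the first two index
  -- names are confirmed by the query plan, time_desc is assumed.
  CREATE INDEX date_desc     ON "YEAR" ("DATE" DESC NULLS LAST);
  CREATE INDEX security_desc ON "YEAR" ("SECURITY" DESC NULLS LAST);
  CREATE INDEX time_desc     ON "YEAR" ("TIME" DESC NULLS LAST);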

EXPLAIN ANALYZE gives the following:

 "Limit (cost=50291.28..50291.28 rows=3 width=16) (actual time=1794484.566..1794484.567 rows=3 loops=1)" " -> Sort (cost=50291.28..50291.29 rows=4 width=16) (actual time=1794484.562..1794484.563 rows=3 loops=1)" " Sort Key: "TIME"" " Sort Method: top-N heapsort Memory: 25kB" " -> Bitmap Heap Scan on "YEAR" (cost=48569.54..50291.24 rows=4 width=16) (actual time=1794411.662..1794484.498 rows=20 loops=1)" " Recheck Cond: (("SECURITY" = 'STW.AX'::bpchar) AND ("DATE" = '2010-03-01'::date))" " Filter: (("TIME" < '10:16:00'::time without time zone) AND ("TYPE" = 'TRADE'::bpchar))" " -> BitmapAnd (cost=48569.54..48569.54 rows=430 width=0) (actual time=1794411.249..1794411.249 rows=0 loops=1)" " -> Bitmap Index Scan on security_desc (cost=0.00..4722.94 rows=166029 width=0) (actual time=1793917.506..1793917.506 rows=1291933 loops=1)" " Index Cond: ("SECURITY" = 'STW.AX'::bpchar)" " -> Bitmap Index Scan on date_desc (cost=0.00..43846.35 rows=2368764 width=0) (actual time=378.698..378.698 rows=2317130 loops=1)" " Index Cond: ("DATE" = '2010-03-01'::date)" "Total runtime: 1794485.224 ms" 

The database contains about 1 billion rows, running on a Core2 Quad with 8GB RAM on 64-bit Ubuntu. Surely this query should not take half an hour.

+5
3 answers

The database contains about 1 billion rows, running on a Core2 Quad with 8GB RAM on 64-bit Ubuntu. Surely this query should not take half an hour.

It takes half an hour because of how you set up your indexes.

Your query has no multi-column index it could use to go straight to the relevant rows. It does the next best thing, which is a bitmap index scan on two barely selective indexes, followed by a top-N sort of the result set.

The two indexes it picked, on security and on date, yield 1.3M and 2.3M rows respectively. Combining them is painfully slow, because you end up randomly visiting over a million rows and filtering each one of them.

To add insult to injury, your data is structured so that two highly correlated fields (date and time) are stored and processed separately. This throws off the query planner, because Postgres does not collect cross-column correlation statistics. As a result, your queries almost invariably end up filtering through huge swaths of data and then sorting the filtered set by separate criteria.

I would suggest the following changes:

  • Alter the table and add a datetime column of type timestamp with time zone, combining your date and time columns into it.

  • Drop the now-redundant date and time fields, along with the indexes on them. Also drop the security index.

  • Create an index on (security, datetime). (And don't bother with nulls first / nulls last unless your ordering criteria contain those clauses as well.)

  • Optionally, add a separate index on (datetime) or on (datetime, security) if you ever need to run queries that aggregate statistics for all trades in a date or datetime range.

  • Vacuum analyze the whole lot once you are done with the above. (A SQL sketch of these steps follows this list.)
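A minimal SQL sketch of those steps, assuming the new column is named "DATETIME" as in the rewritten query below (all index names here are made up):

  -- 1. Add the combined column and populate it
  --    (date + time yields a timestamp, interpreted in the session time zone).
  ALTER TABLE "YEAR" ADD COLUMN "DATETIME" timestamp with time zone;
  UPDATE "YEAR" SET "DATETIME" = "DATE" + "TIME";

  -- 2. Drop the now-redundant columns and the old single-column indexes.
  DROP INDEX date_desc;
  DROP INDEX security_desc;
  DROP INDEX time_desc;                        -- name assumed
  ALTER TABLE "YEAR" DROP COLUMN "DATE", DROP COLUMN "TIME";

  -- 3. The composite index that serves the query below.
  CREATE INDEX year_security_datetime ON "YEAR" ("SECURITY", "DATETIME");

  -- 4. Optionally, for range/aggregate queries across all securities:
  -- CREATE INDEX year_datetime ON "YEAR" ("DATETIME");

  VACUUM ANALYZE "YEAR";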

You can then rewrite your query as follows:

 SELECT "TIME", "TRADEPRICE" FROM "YEAR" WHERE '2010-03-01 00:00:00' <= "DATETIME" AND "DATETIME" < '2010-03-01 10:16:00' AND "SECURITY"='STW.AX' AND "TYPE" = 'TRADE' ORDER BY "DATETIME" ASC LIMIT 3 

This should give the most optimal plan possible: fetching the top 3 rows from a filtered index scan on (security, datetime), which I expect (even with your billion rows) will take at most 25 ms or so.

+12

Add a combined index covering several of the search terms together, for example ON "YEAR" ("TYPE", "SECURITY", "DATE", "TIME"). The database can then look up a single index that matches all of them, instead of scanning multiple indexes and combining the results afterwards (the BitmapAnd of bitmap scans you see in the plan).

Exactly which columns to include (for example, whether to include TYPE) and in which order depends on your data characteristics and on what other queries you run (any leading subset of a composite index can be reused for free), so experiment a bit; but to let the index satisfy the ORDER BY, keep the ORDER BY column as the last index column you use, with a matching sort direction.

You may also want to ANALYZE the table to update the statistics for the query planner, as some of the row-count estimates seem a bit off.
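A hedged sketch of that suggestion (the index name is made up; the exact column list and order are what you would experiment with):

  -- One composite index matching all the search terms, with the ORDER BY
  -- column ("TIME") placed last so the sort can come straight off the index.
  CREATE INDEX year_type_security_date_time
      ON "YEAR" ("TYPE", "SECURITY", "DATE", "TIME");

  ANALYZE "YEAR";  -- refresh the planner's row-count statistics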

+1

I do not index TYPE because it accepts only one of two possible values

You have to understand how indexes work and why they help. An index duplicates the indexed data into small index blocks that contain only the indexed columns. Out of your X GB of raw data, only roughly X/20 (a guesstimate) ends up in the index. If your query uses data that is not in the index, then for every record that satisfies the other criteria, the DBMS has to read the corresponding raw-data block in addition to the index block to decide whether the record matches.

The best case is to have at least one index that contains everything the query asks for, so that there is no need to touch the data blocks at all.
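As an illustration of that best case (not something this answer prescribes; the index name and column order are made up), an index that also carries the selected TRADEPRICE column could satisfy the whole query without visiting the table's data blocks, at least on Postgres versions with index-only scans (9.2+):

  -- Covers every column the query reads, so the heap need not be visited.
  CREATE INDEX year_covering
      ON "YEAR" ("SECURITY", "DATE", "TYPE", "TIME", "TRADEPRICE");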

Another hint: it is usually recommended to lead with columns whose values are queried as a range (in your case, "TIME").

My suggestion: drop all the existing indexes. Create a single index on the fields TIME (ASC), DATE, SECURITY, TYPE (in that order). Use this query:

 SELECT "TIME", "TRADEPRICE" FROM "YEAR" WHERE "TIME" < '10:16:00' AND "DATE"='2010-03-01' AND "SECURITY"='STW.AX' AND "TYPE" = 'TRADE' 

And watch the incredible speed.
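For concreteness, the index this answer recommends might be created like this (the index name is made up):

  CREATE INDEX year_time_date_security_type
      ON "YEAR" ("TIME" ASC, "DATE", "SECURITY", "TYPE");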

0
