Optimize PostgreSQL query

I have two tables in PostgreSQL 9.1: flight_2012_09_12, containing about 500,000 rows, and position_2012_09_12, containing about 5.5 million rows. I run a simple join query and it takes a long time; the tables are not small, but I am convinced there are significant gains to be had in how it executes.

Query:

 SELECT f.departure, f.arrival, p.callsign, p.flightkey, p.time, p.lat, p.lon, p.altitude_ft, p.speed
 FROM position_2012_09_12 AS p
 JOIN flight_2012_09_12 AS f ON p.flightkey = f.flightkey
 WHERE p.lon < 0
   AND p.time BETWEEN '2012-9-12 0:0:0' AND '2012-9-12 23:0:0'

The EXPLAIN ANALYZE output:

 Hash Join  (cost=239891.03..470396.82 rows=4790498 width=51) (actual time=29203.830..45777.193 rows=4403717 loops=1)
   Hash Cond: (f.flightkey = p.flightkey)
   ->  Seq Scan on flight_2012_09_12 f  (cost=0.00..1934.31 rows=70631 width=12) (actual time=0.014..220.494 rows=70631 loops=1)
   ->  Hash  (cost=158415.97..158415.97 rows=3916885 width=43) (actual time=29201.012..29201.012 rows=3950815 loops=1)
         Buckets: 2048  Batches: 512 (originally 256)  Memory Usage: 1025kB
         ->  Seq Scan on position_2012_09_12 p  (cost=0.00..158415.97 rows=3916885 width=43) (actual time=0.006..14630.058 rows=3950815 loops=1)
               Filter: ((lon < 0::double precision) AND ("time" >= '2012-09-12 00:00:00'::timestamp without time zone) AND ("time" <= '2012-09-12 23:00:00'::timestamp without time zone))
 Total runtime: 58522.767 ms

I think the problem is the sequential scan on the position table, but I can't see why it is there. The table structures and indexes are below:

  Table "public.flight_2012_09_12" Column | Type | Modifiers --------------------+-----------------------------+----------- callsign | character varying(8) | flightkey | integer | source | character varying(16) | departure | character varying(4) | arrival | character varying(4) | original_etd | timestamp without time zone | original_eta | timestamp without time zone | enroute | boolean | etd | timestamp without time zone | eta | timestamp without time zone | equipment | character varying(6) | diverted | timestamp without time zone | time | timestamp without time zone | lat | double precision | lon | double precision | altitude | character varying(7) | altitude_ft | integer | speed | character varying(4) | asdi_acid | character varying(4) | enroute_eta | timestamp without time zone | enroute_eta_source | character varying(1) | Indexes: "flight_2012_09_12_flightkey_idx" btree (flightkey) "idx_2012_09_12_altitude_ft" btree (altitude_ft) "idx_2012_09_12_arrival" btree (arrival) "idx_2012_09_12_callsign" btree (callsign) "idx_2012_09_12_departure" btree (departure) "idx_2012_09_12_diverted" btree (diverted) "idx_2012_09_12_enroute_eta" btree (enroute_eta) "idx_2012_09_12_equipment" btree (equipment) "idx_2012_09_12_etd" btree (etd) "idx_2012_09_12_lat" btree (lat) "idx_2012_09_12_lon" btree (lon) "idx_2012_09_12_original_eta" btree (original_eta) "idx_2012_09_12_original_etd" btree (original_etd) "idx_2012_09_12_speed" btree (speed) "idx_2012_09_12_time" btree ("time") Table "public.position_2012_09_12" Column | Type | Modifiers -------------+-----------------------------+----------- callsign | character varying(8) | flightkey | integer | time | timestamp without time zone | lat | double precision | lon | double precision | altitude | character varying(7) | altitude_ft | integer | course | integer | speed | character varying(4) | trackerkey | integer | the_geom | geometry | Indexes: "index_2012_09_12_altitude_ft" btree (altitude_ft) "index_2012_09_12_callsign" btree (callsign) "index_2012_09_12_course" btree (course) "index_2012_09_12_flightkey" btree (flightkey) "index_2012_09_12_speed" btree (speed) "index_2012_09_12_time" btree ("time") "position_2012_09_12_flightkey_idx" btree (flightkey) "test_index" btree (lon) "test_index_lat" btree (lat) 

I can't think of another way to rewrite the query, so I'm at a loss. If the current setup is as good as it gets, so be it, but it seems to me it should be much faster than it currently is. Any help would be greatly appreciated.

+7
2 answers

The reason you get a sequential scan is that Postgres believes it will read fewer disk pages that way than by using the indexes. It is probably right. Consider: with a non-covering index, you first have to read all the matching index pages, which essentially gives you a list of row identifiers. The database engine then has to read each of the matching data pages.

Your position table uses about 71 bytes per row, plus whatever the geometry type takes (I'll assume 16 bytes for illustration), making 87 bytes. A Postgres page is 8192 bytes, so you have roughly 90 rows per page.
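
If you want to check that arithmetic against the planner's own statistics, something like this works (a rough sketch; it assumes ANALYZE has been run recently so reltuples and relpages are current):

 -- Approximate rows-per-page density of the position table,
 -- straight from the catalog statistics.
 SELECT c.reltuples::bigint                       AS estimated_rows,
        c.relpages                                AS pages,
        round(c.reltuples / NULLIF(c.relpages, 0)) AS rows_per_page
 FROM   pg_class AS c
 WHERE  c.relname = 'position_2012_09_12';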

Your query matches 3950815 of the 5563070 rows, or about 71% of the total. Assuming the data is randomly scattered with respect to your filters, the chance that a given data page contains no matching row at all is roughly 0.29^90, which is essentially zero. So no matter how good your indexes are, you still have to read all the data pages, and if you have to read all the pages anyway, a table scan is usually the right approach.
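
To put a number on that, using the row counts from the plan and the ~90 rows-per-page estimate above:

 -- Probability that a page of ~90 rows contains no matching row at all,
 -- given that roughly 71% of all rows match the filter.
 SELECT power((1 - 3950815.0 / 5563070.0)::double precision, 90);
 -- on the order of 1e-49, i.e. effectively zero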

The way out of this, as I hinted, is a covering index. If you are willing to create indexes that can answer the query by themselves, you may not need to visit the data pages at all, which puts you back in the game. I would suggest looking at the following:

 flight_2012_09_12 (flightkey, departure, arrival)
 position_2012_09_12 (flightkey, time, lon, ...)
 position_2012_09_12 (lon, time, flightkey, ...)
 position_2012_09_12 (time, lon, flightkey, ...)

The dots represent the rest of the columns you are selecting. You only need one of the indexes on position, but it is hard to say which will turn out best. The first approach permits a merge join on presorted data, at the cost of reading the whole second index for the filtering. The second and third allow the data to be prefiltered, but require a hash join. Given how much of the cost appears to be in the hash join, the merge join might be a good option.
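
As a rough sketch, the flight-table index plus the second position-table option could be created like this (the index names are illustrative, and the trailing columns just need to cover whatever your query actually selects):

 -- Covering index on the flight side: join key first, then the selected columns.
 CREATE INDEX flight_2012_09_12_cover_idx
     ON flight_2012_09_12 (flightkey, departure, arrival);

 -- One of the position-side options: join/filter columns first,
 -- then the remaining columns the query selects.
 CREATE INDEX position_2012_09_12_cover_idx
     ON position_2012_09_12 (flightkey, "time", lon, callsign, lat, altitude_ft, speed);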

Since your query requires 52 of the 87 bytes per row, and indexes have overhead, you may not end up with an index that takes much, if any, less space than the table itself.

Another approach is to attack the "randomly scattered" side of things by looking at clustering.
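
For example (a sketch only; CLUSTER rewrites the table under an exclusive lock, and the ordering is not maintained for rows inserted later):

 -- Physically reorder the table along the time index so rows that are
 -- close in time end up on the same pages.
 CLUSTER position_2012_09_12 USING index_2012_09_12_time;
 ANALYZE position_2012_09_12;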

+2

The row count estimates are pretty reasonable, so I doubt this is a statistics problem.

I would try the following (a combined sketch of the first three items appears after the list):

  • Creating an index on position_2012_09_12 (lon, "time"), or possibly a partial index on position_2012_09_12 ("time") WHERE (lon < 0), if you regularly search for lon < 0.

  • Lowering the random_page_cost setting, perhaps to 1.1. See whether (a) it changes the plan and (b) the new plan is actually faster. For testing purposes, to check whether avoiding the seqscan would be faster, you can SET enable_seqscan = off; if it is, adjust the cost parameters.

  • Increasing work_mem for this query: SET work_mem = '10MB' or so before running it.

  • Running the latest PostgreSQL if you aren't already. Always state your PostgreSQL version in questions. (Update after edit): you're on 9.1, which is fine. The biggest performance improvement in 9.2 was index-only scans, and it doesn't look like you would benefit much from them for this query.
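
A combined sketch of the first three suggestions (the partial-index name is illustrative; the SET commands only affect the current session):

 -- Partial index on time, restricted to lon < 0,
 -- assuming that is a filter you use regularly.
 CREATE INDEX position_2012_09_12_time_westlon_idx
     ON position_2012_09_12 ("time")
     WHERE lon < 0;

 -- Planner and memory experiments for this session only.
 SET random_page_cost = 1.1;     -- make random I/O look cheaper to the planner
 SET work_mem = '10MB';          -- more memory for the hash join
 -- SET enable_seqscan = off;    -- testing only: see whether an index plan is faster

 EXPLAIN ANALYZE
 SELECT f.departure, f.arrival, p.callsign, p.flightkey, p.time,
        p.lat, p.lon, p.altitude_ft, p.speed
 FROM position_2012_09_12 AS p
 JOIN flight_2012_09_12 AS f ON p.flightkey = f.flightkey
 WHERE p.lon < 0
   AND p.time BETWEEN '2012-09-12 00:00:00' AND '2012-09-12 23:00:00';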

You will also get a small performance boost if you can get rid of columns to narrow the rows. It won't make a huge difference, but it will make some.

+3
