Very slow Bitmap Heap Scan in Postgres

I have the following simple table that contains traffic measurement data:

 CREATE TABLE "TrafficData"
 (
   "RoadID" character varying NOT NULL,
   "DateID" numeric NOT NULL,
   "ExactDateTime" timestamp NOT NULL,
   "CarsSpeed" numeric NOT NULL,
   "CarsCount" numeric NOT NULL
 );

 CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");

The RoadID column uniquely identifies the road whose data is being recorded, and DateID identifies the day of the year (1..365) of the data - essentially ExactDateTime rounded down to the day.

I have about 100,000,000 rows; the RoadID column has 1,000 distinct values and the DateID column has 365 distinct values.

Then I run the following query:

 SELECT *
 FROM "TrafficData"
 WHERE "RoadID" = 'Station_1'
   AND "DateID" > 20100610
   AND "DateID" < 20100618;

It takes a breathtaking three seconds to complete, and I can't for the life of me figure out WHY.

EXPLAIN ANALYZE gives me the following result:

 Bitmap Heap Scan on "TrafficData"  (cost=104.84..9743.06 rows=2496 width=47) (actual time=35.112..2162.404 rows=2016 loops=1)
   Recheck Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
   ->  Bitmap Index Scan on "RoadDate_Idx"  (cost=0.00..104.22 rows=2496 width=0) (actual time=1.637..1.637 rows=2016 loops=1)
         Index Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
 Total runtime: 2163.985 ms

My specifications:

  • Windows 7
  • Postgres 9.0
  • RAM 4 GB

I would really appreciate some helpful tips!

+6
performance indexing postgresql
3 answers

The slow part is fetching the rows from the table itself, since the index access appears to be very fast. You can either tune the RAM usage parameters (see http://wiki.postgresql.org/wiki/Performance_Optimization and http://www.varlena.com/GeneralBits/Tidbits/perf.html ) or optimize the physical layout of the data in the table with the CLUSTER command (see http://www.postgresql.org/docs/8.3/static/sql-cluster.html ).

 CLUSTER "TrafficData" USING "RoadDate_Idx"; 

should do it.
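A possible full sequence, sketched below: CLUSTER takes an exclusive lock and rewrites the table, so an ANALYZE afterwards refreshes the planner statistics (this is a sketch against the schema from the question, not a tested recipe):

```sql
-- Physically reorder the table rows to match the composite index,
-- so the ~2000 rows for one (RoadID, DateID range) land in few pages.
-- Note: CLUSTER locks the table exclusively while it rewrites it.
CLUSTER "TrafficData" USING "RoadDate_Idx";

-- Refresh planner statistics after the rewrite.
ANALYZE "TrafficData";
```

CLUSTER is a one-time reorder; rows inserted afterwards are not kept in index order, so the command has to be re-run periodically if the table keeps growing.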

+5

Adding to Daniel's answer: the CLUSTER operation is a one-time process that reorders the data on disk. The goal is to fetch your ~2,000 result rows from fewer disk blocks.

Since this is dummy data, used to figure out how fast you can query it, I would recommend reloading it in a pattern closer to how it will be loaded in production. I suspect the data is generated one day at a time, which effectively produces a strong correlation between DateID and location on disk. If so, I would either cluster by DateID, or split your test data into 365 separate loads and reload them one day at a time.
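The clustering-by-DateID alternative could be sketched as follows (the index name "Date_Idx" is hypothetical, not from the question):

```sql
-- Hypothetical single-column index on the date, used only as a
-- clustering target so rows for the same day sit in adjacent pages.
CREATE INDEX "Date_Idx" ON "TrafficData" ("DateID");

-- Reorder the table by date, then refresh statistics.
CLUSTER "TrafficData" USING "Date_Idx";
ANALYZE "TrafficData";
```

This mimics the on-disk layout the data would have if it had been loaded one day at a time.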

Without that, and with randomly generated data, you are most likely performing more than 2,000 seeks on your disk.

I would also check that nothing else you run on Windows 7 is adding time to reads that you don't need - for example, make sure an anti-virus scanner isn't checking every block you read for virus signatures, and that the disk isn't being automatically defragmented at the same time (which would leave the disk head almost never anywhere near the place where a database block was last read).

+2
  • 4 GB RAM → 6 GB or more. Your 100M records are small, but memory can still matter on a desktop machine; if this isn't a desktop, I'm not sure why it has so little memory.
  • Rewrite AND "DateID">20100610 AND "DateID"<20100618 as "DateID" BETWEEN 20100611 AND 20100617
  • Create an index on DateID
  • Get rid of the double quotes around field names
  • Make RoadID a text column instead of varchar
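Taken together, the bullet points above amount to something like the following sketch (lowercase unquoted identifiers and the index name dateid_idx are illustrative choices, not from the question):

```sql
-- Revised schema: text instead of varchar, no quoted identifiers.
CREATE TABLE trafficdata
(
  roadid        text      NOT NULL,
  dateid        numeric   NOT NULL,
  exactdatetime timestamp NOT NULL,
  carsspeed     numeric   NOT NULL,
  carscount     numeric   NOT NULL
);

-- Separate index on the date column, as suggested above.
CREATE INDEX dateid_idx ON trafficdata (dateid);

-- Inclusive BETWEEN instead of the two open-ended comparisons.
SELECT *
FROM trafficdata
WHERE roadid = 'Station_1'
  AND dateid BETWEEN 20100611 AND 20100617;
```

Note that BETWEEN is inclusive on both ends, which is why the bounds shift from >20100610 / <20100618 to 20100611 and 20100617.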
0
