As part of a school assignment, I was given this query to optimize on PostgreSQL 9.2:
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C
INNER JOIN Items I ON (C.id = I.category)
INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;
I created these indexes on the corresponding tables: items(category) → secondary B-tree, bids(item_id) → secondary B-tree, and categories(id) → the primary index.
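For reference, the secondary indexes boil down to something like this (index names are illustrative; bid_itemid_idx is the one that later shows up in the second plan, and categories(id) is already covered by the implicit primary-key index):

-- secondary B-tree indexes (names are illustrative)
CREATE INDEX items_category_idx ON items (category);
CREATE INDEX bid_itemid_idx ON bids (item_id);
-- categories(id) needs no extra index: the PRIMARY KEY constraint
-- already creates a unique B-tree index on it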
The weird part: PostgreSQL does a sequential scan on my Items, Categories, and Bids tables, and when I set enable_seqscan=off, the index-scan plan is even worse than the result below.
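Roughly, this is how I compared the two plans (a sketch; enable_seqscan only changes planner costing for the current session):

SET enable_seqscan = off;   -- discourage sequential scans for this session
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C
INNER JOIN Items I ON (C.id = I.category)
INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;
RESET enable_seqscan;       -- restore the default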
When I run EXPLAIN (ANALYZE, BUFFERS) in PostgreSQL, this is the result:
GroupAggregate  (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
  Buffers: shared hit=30 read=12306, temp read=6600 written=6598
  ->  Sort  (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
        Sort Key: c.name
        Sort Method: external merge  Disk: 20160kB
        Buffers: shared hit=30 read=12306, temp read=6274 written=6272
        ->  Hash Join  (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
              Hash Cond: (b.item_id = i.id)
              Buffers: shared hit=30 read=12306, temp read=994 written=992
              ->  Seq Scan on bids b  (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
                    Buffers: shared hit=2 read=4999
              ->  Hash  (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
                    Buckets: 4096  Batches: 2  Memory Usage: 989kB
                    Buffers: shared hit=28 read=7307, temp written=111
                    ->  Hash Join  (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
                          Hash Cond: (i.category = c.id)
                          Buffers: shared hit=28 read=7307
                          ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
                                Buffers: shared hit=27 read=7307
                          ->  Hash  (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                Buffers: shared hit=1
                                ->  Seq Scan on categories c  (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
                                      Buffers: shared hit=1
Total runtime: 9473.257 ms
See this plan at explain.depesz.com.
I just want to know why this is happening, i.e. why using the indexes makes the query so much slower than the sequential scans.
Edit: I think I managed to figure out a couple of things by reading the PostgreSQL documentation. PostgreSQL decided to do a seq scan on some tables, for example bids and items, because it predicted it would have to retrieve every row in the table (compare the estimated rows=... in the parentheses before the actual time with the rows=... in the actual-time part). A sequential scan is the best way to retrieve all rows, so nothing can be done about that part.
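If it helps, a quick way to sanity-check those estimates against the table sizes (just a sketch using the standard catalogs; it assumes the tables have been analyzed recently):

SELECT relname, reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname IN ('categories', 'items', 'bids');
-- the planner's rows=... estimates in the plan are derived from these statistics,
-- and here they match the actual row counts, so the seq scans are expected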
I created an additional index on categories(name), and the plan below is what I get. It has somehow improved, but now the hash join is replaced by a nested loop. Any clue why?
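In case it matters, the new index is essentially just this (categories_name_idx is the name that appears in the plan):

-- plain B-tree index on the grouping column
CREATE INDEX categories_name_idx ON categories (name);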
GroupAggregate  (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
  Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
  ->  Nested Loop  (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
        Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
        ->  Nested Loop  (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
              Join Filter: (c.id = i.category)
              Rows Removed by Join Filter: 950000
              Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
              ->  Index Scan using categories_name_idx on categories c  (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
                    Buffers: shared hit=1 read=1
              ->  Materialize  (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
                    Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
                    ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
                          Buffers: shared hit=1 read=7333 written=1
        ->  Index Scan using bid_itemid_idx on bids b  (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
              Index Cond: (item_id = i.id)
              Buffers: shared hit=178580 read=30139 written=13
Total runtime: 7726.392 ms
This plan is also on explain.depesz.com, if that is easier to read.
I managed to reduce the cost to 114062.92 by creating indexes on categories(id) and items(category); PostgreSQL used both indexes to get to that 114062.92 cost. However, now PostgreSQL is playing games with me by not using an index! Why is it so buggy?