PostgreSQL query optimization (no inner/outer join allowed)

As part of a school assignment, I was given this query to optimize on PostgreSQL 9.2:

 SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
 FROM Categories C
 INNER JOIN Items I ON (C.id = I.category)
 INNER JOIN Bids B ON (I.id = B.item_id)
 GROUP BY C.name

I created these indexes on the corresponding tables: items(category) → secondary B+tree, bids(item_id) → secondary B+tree, and categories(id) → the primary index.
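
(In case it helps, the index definitions look roughly like this; the names are my guesses, except bid_itemid_idx, which appears in a plan further down, and categories(id) is already covered by the primary key:)

 -- Sketch of the secondary B+tree indexes; names are illustrative:
 CREATE INDEX items_category_idx ON Items (category);
 CREATE INDEX bid_itemid_idx ON Bids (item_id);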

The weird part: PostgreSQL does a sequential scan on my Items, Categories, and Bids tables, and when I set enable_seqscan=off, the index-based plan is even worse than the result below.
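
(For reference, the enable_seqscan experiment was done roughly like this, at session level only:)

 -- Discourage sequential scans for this session only; an experiment, not a fix:
 SET enable_seqscan = off;
 EXPLAIN (ANALYZE, BUFFERS)
 SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
 FROM Categories C
 INNER JOIN Items I ON (C.id = I.category)
 INNER JOIN Bids B ON (I.id = B.item_id)
 GROUP BY C.name;
 RESET enable_seqscan;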

When I run EXPLAIN (ANALYZE, BUFFERS) in PostgreSQL, this is the result. PLEASE DO NOT REMOVE THE INDENTATION, AS IT IS IMPORTANT!

 GroupAggregate (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
   Buffers: shared hit=30 read=12306, temp read=6600 written=6598
   ->  Sort (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
         Sort Key: c.name
         Sort Method: external merge  Disk: 20160kB
         Buffers: shared hit=30 read=12306, temp read=6274 written=6272
         ->  Hash Join (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
               Hash Cond: (b.item_id = i.id)
               Buffers: shared hit=30 read=12306, temp read=994 written=992
               ->  Seq Scan on bids b (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
                     Buffers: shared hit=2 read=4999
               ->  Hash (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
                     Buckets: 4096  Batches: 2  Memory Usage: 989kB
                     Buffers: shared hit=28 read=7307, temp written=111
                     ->  Hash Join (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
                           Hash Cond: (i.category = c.id)
                           Buffers: shared hit=28 read=7307
                           ->  Seq Scan on items i (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
                                 Buffers: shared hit=27 read=7307
                           ->  Hash (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 Buffers: shared hit=1
                                 ->  Seq Scan on categories c (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
                                       Buffers: shared hit=1
 Total runtime: 9473.257 ms

See this plan at explain.depesz.com.

I just want to know why this is happening, i.e. why using the indexes makes the query so much slower than using sequential scans.

Edit: I think I figured out a couple of things by reading the PostgreSQL documentation. PostgreSQL decided to do a seq scan on some tables, such as bids and items, because it predicted it would have to retrieve every row in the table (compare the estimated row count in the parentheses before the actual time with the row count in the actual time part). A sequential scan is the best way to retrieve all rows, so nothing can be done about this part.
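
(A quick way to see the row counts the planner is working from, assuming the tables were created with unquoted lowercase names, is pg_class:)

 -- Approximate table sizes as known to the planner; refresh with ANALYZE if stale:
 SELECT relname, reltuples::bigint AS estimated_rows
 FROM pg_class
 WHERE relname IN ('categories', 'items', 'bids');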

I created an additional index on categories(name), and the result below is what I get. It has somehow improved, but now the hash join is replaced by a nested loop. Any clue why?
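
(The new index was created with something like this; the name matches what shows up in the plan below:)

 CREATE INDEX categories_name_idx ON Categories (name);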

 GroupAggregate (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
   Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
   ->  Nested Loop (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
         Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
         ->  Nested Loop (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
               Join Filter: (c.id = i.category)
               Rows Removed by Join Filter: 950000
               Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
               ->  Index Scan using categories_name_idx on categories c (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
                     Buffers: shared hit=1 read=1
               ->  Materialize (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
                     Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
                     ->  Seq Scan on items i (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
                           Buffers: shared hit=1 read=7333 written=1
         ->  Index Scan using bid_itemid_idx on bids b (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
               Index Cond: (item_id = i.id)
               Buffers: shared hit=178580 read=30139 written=13
 Total runtime: 7726.392 ms

Check out the plan on explain.depesz.com if that is easier to read.

Edit: I managed to reduce the cost to 114062.92 by creating indexes on categories(id) and items(category); PostgreSQL used both indexes to get to that cost. However, now PostgreSQL is playing games with me by not using the indexes! Why is it so buggy?
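
(If it is unclear whether an index is being used at all, the usage counters in pg_stat_user_indexes are one way to check; this assumes the default statistics collector is enabled:)

 -- Number of scans of each index on these tables since the stats were last reset:
 SELECT relname, indexrelname, idx_scan
 FROM pg_stat_user_indexes
 WHERE relname IN ('items', 'bids', 'categories');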

+6
3 answers

Thanks for posting the EXPLAIN output without being asked, and for using EXPLAIN (BUFFERS, ANALYZE).

A significant part of the query's performance problem is likely the Sort plan node, which is doing an on-disk external merge sort using a temporary file:

 Sort Method: external merge Disk: 20160kB 

You can get that sort done in memory by setting:

 SET work_mem = '50MB'; 

before running your query. This parameter can also be set per user, per database, or globally in postgresql.conf.
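
For example (the role and database names below are placeholders):

 -- Session level (current connection only):
 SET work_mem = '50MB';
 -- Per user (placeholder role name; applies to new sessions):
 ALTER ROLE some_user SET work_mem = '50MB';
 -- Per database (placeholder database name):
 ALTER DATABASE some_db SET work_mem = '50MB';
 -- Globally: set work_mem = '50MB' in postgresql.conf and reload.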

I'm not sure that adding indexes will be of much benefit with the query as it is currently structured. It has to read and join all rows from all three tables, and hash joins are probably the fastest way to do that.

I suspect there are ways to express this query that would use completely different and more efficient execution strategies, but I'm drawing a blank on what they might be and don't want to spend time making up dummy tables to play around with. More work_mem should significantly improve the query as it stands.

+1

From the query plans, we can see that:
1. The result and categories have 20 rows.
2. Items per category are about 5% of all items ("Rows Removed by Join Filter: 950000", "rows=50000" in the sequential scan).
3. Joined bids: rows=600036 (can you give us the total number of bids?).
4. Does every category have a bid?

So we want to use the indexes on items(category) and bids(item_id), and we also want the sort to happen in memory.

 select (select name from Categories where id = foo.category) as name,
        count(foo.id),
        sum(foo.bids_count)
 from (select id,
              category,
              (select count(item_id) from Bids where item_id = i.id) as bids_count
       from Items i
       where category in (select id from Categories)
         and exists (select 1 from Bids where item_id = i.id)
      ) as foo
 group by foo.category
 order by name

Of course, keep in mind that this depends heavily on the data characteristics from points 1 and 2.

If point 4 is true, you can remove the EXISTS part from the query.

Any tips or ideas?

0

Note that if bids is systematically and significantly larger than items, it may be cheaper to go over items twice (especially if items fits in RAM) than to carry those item IDs through the join result (even if you sort in memory). Moreover, depending on how Postgres shuffles the data around for those duplicate scans, there may be only a limited penalty even under adverse load or memory conditions (this would be a good question to ask over on pgsql-general). Consider:

 SELECT name, IC.cnt, BC.cnt
 FROM Categories C,
      (SELECT category, count(1) cnt
       FROM Items I
       GROUP BY category) IC,
      (SELECT category, count(1) cnt
       FROM Bids B
       INNER JOIN Items I ON (I.id = B.item_id)
       GROUP BY category) BC
 WHERE IC.category = C.id
   AND BC.category = C.id;

How much cheaper? At least 4x with sufficient caching, i.e. 610 ms versus 2500 ms (with the sort done in memory) for 20 categories, 50 thousand items, and 600 thousand bids, and still more than 2x faster without the benefit of the filesystem cache on my machine.

Note that this is not a drop-in replacement for the original query; for one thing, it assumes a 1:1 mapping between category ids and names (which may be a very reasonable assumption, but if not, simply SUM(BC.cnt) and SUM(IC.cnt) as you GROUP BY name), but more importantly the per-category item count includes items that have no bids, unlike your original INNER JOIN. If you only want to count items that have bids, you can add WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id) to the IC subquery; that traverses bids a second time (in my case it added roughly a 200 ms penalty on top of the existing 600 ms plan, still well under 2400 ms).
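
For concreteness, the IC subquery with that filter applied would look roughly like this (a sketch; it restores the original INNER JOIN semantics for the item count):

 -- Items per category, counting only items that have at least one bid:
 SELECT category, count(1) AS cnt
 FROM Items I
 WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id)
 GROUP BY category;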

0

Source: https://habr.com/ru/post/928163/

