Slow GROUP BY query in PostgreSQL

In PostgreSQL 9.2, I have a table of items that are rated by users:

    id   | userid | itemid |    rating     |      timestamp      |       update_time
  -------+--------+--------+---------------+---------------------+-------------------------
  522241 |   3991 |   6887 |  0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
  522242 |   3991 |   6934 |  0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
  522243 |   3991 |   6936 | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
  522244 |   3991 |   6942 | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
  522245 |   3991 |   6951 | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
     ... |    ... |    ... |           ... |                 ... |                     ...

I want to run a very simple query: for each user, count the total number of ratings they have in the database.

I use the following simple approach:

 SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid 

The table contains 10M rows. The query takes ... well, about 2 or 3 minutes. Honestly, I am not happy with this, and I believe that 10M rows is not so much data that the query should take that long. (Or is it?)

Next, I asked PostgreSQL to show me the execution plan:

 EXPLAIN SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid 

This leads to:

 GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
   ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
         Sort Key: userid
         ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)

I read it as follows: first, the entire table is read from disk (seq scan). Second, it is sorted by userid in n*log(n) time (sort). Finally, the sorted table is read row by row and aggregated in linear time. This does not seem like the optimal algorithm to me; if I implemented it myself, I would use a hash table and build the result in a single pass. Never mind.

The sort by userid seems to be what takes so long, so I added an index:

 CREATE INDEX ratings_userid_index ON ratings (userid) 

Unfortunately, this did not help, and the performance did not change. I definitely do not consider myself an advanced user, and I believe that I am doing something fundamentally wrong. However, I am stuck here. I would appreciate any ideas on how to run this query in a reasonable amount of time. One more note: the PostgreSQL worker process uses 100% of one of my CPU cores while the query runs, which suggests that disk access is not the main bottleneck.

EDIT

As requested by @a_horse_with_no_name. Wow, pretty advanced for me:

 EXPLAIN (analyze on, buffers on, verbose on)
 SELECT userid, COUNT(userid) AS rcount
 FROM movielens_10m.ratings
 GROUP BY userId

Outputs:

 GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
   Output: userid, count(userid)
   Buffers: shared hit=906 read=82433, temp read=19358 written=19358
   ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
         Output: userid
         Sort Key: ratings.userid
         Sort Method: external merge  Disk: 154840kB
         Buffers: shared hit=906 read=82433, temp read=19358 written=19358
         ->  Seq Scan on movielens_10m.ratings  (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
               Output: userid
               Buffers: shared hit=901 read=82433
 Total runtime: 127193.524 ms
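
The "Sort Method: external merge  Disk: 154840kB" line shows that the sort spilled to disk rather than running in memory; how much memory a sort may use is governed by the work_mem setting. As a quick sanity check of the current value (standard PostgreSQL, shown as a sketch):

 SHOW work_mem;

 -- or, with a bit more detail:
 SELECT name, setting, unit FROM pg_settings WHERE name = 'work_mem';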

EDIT 2

A comment by @a_horse_with_no_name resolved the issue. I am happy to share my findings:

 SET work_mem = '1MB';
 EXPLAIN SELECT userid, COUNT(userid) AS rcount
 FROM movielens_10m.ratings
 GROUP BY userId

produces the same as above:

 GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
   ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
         Sort Key: userid
         ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)

but

 SET work_mem = '10MB';
 EXPLAIN SELECT userid, COUNT(userid) AS rcount
 FROM movielens_10m.ratings
 GROUP BY userId

gives

 HashAggregate  (cost=233334.81..233580.16 rows=24535 width=5)
   ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)

Now the query takes only about 3.5 seconds.
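
For completeness, work_mem does not have to be changed globally in postgresql.conf; it can be raised at a narrower scope. The values and the role name below are only illustrations, not recommendations:

 -- Raise work_mem for the current session only:
 SET work_mem = '100MB';

 -- Or for the current transaction only (run inside BEGIN ... COMMIT):
 SET LOCAL work_mem = '100MB';

 -- Or persist it for a single role (role name here is hypothetical):
 ALTER ROLE movielens_app SET work_mem = '100MB';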

2 answers

Consider how your query could possibly return the result ... You can either build a variable-length hash table and create/increment its entries, or you can sort all rows by userid and count them. Computationally, the latter option is cheaper. This is what Postgres does.

Then consider how to sort the data, taking disk IO into account. One option is to fetch disk pages A, B, C, D, etc., and then sort the rows by userid in memory. In other words, a seq scan followed by a sort. The other option, called an index scan, is to pull the rows in order using the index: visit page B, then D, then A, then B again, A again, C, ad nauseam.

An index scan is efficient when you pull a few rows in order; not so much for fetching lots of rows in order, let alone all rows in order. So the plan you got is optimal:

  • Plow through all rows (seq scan)
  • Sort the rows by userid (sort)
  • Count the rows in each group (aggregate)

The problem is that you are sorting roughly 10 million rows just to count them by userid. Nothing will make this faster short of investing in more RAM and super-fast SSDs.

However, you can avoid this query altogether. Either:

  • Compute the counts only for the few users you actually need (using a WHERE clause) instead of pulling the whole set; or
  • Add a rating_count field to your users table and keep it up to date with triggers on ratings (a sketch follows this list); or
  • Use a materialized view, if an exact count matters less than a rough idea of it (also sketched below).
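
A minimal sketch of the trigger idea; everything other than ratings and userid is an assumption (a users table with an id primary key, for instance), so adapt it to the actual schema:

 -- Add a counter column to the users table:
 ALTER TABLE users ADD COLUMN rating_count bigint NOT NULL DEFAULT 0;

 -- Backfill the counts for the ratings that already exist:
 UPDATE users u
 SET rating_count = (SELECT COUNT(*) FROM ratings r WHERE r.userid = u.id);

 -- Keep the counter in sync on every insert/delete:
 CREATE OR REPLACE FUNCTION maintain_rating_count() RETURNS trigger AS $$
 BEGIN
     IF TG_OP = 'INSERT' THEN
         UPDATE users SET rating_count = rating_count + 1 WHERE id = NEW.userid;
     ELSIF TG_OP = 'DELETE' THEN
         UPDATE users SET rating_count = rating_count - 1 WHERE id = OLD.userid;
     END IF;
     RETURN NULL;  -- return value is ignored for AFTER triggers
 END;
 $$ LANGUAGE plpgsql;

 CREATE TRIGGER ratings_count_trg
 AFTER INSERT OR DELETE ON ratings
 FOR EACH ROW EXECUTE PROCEDURE maintain_rating_count();

The materialized-view option could look like the following; note that CREATE MATERIALIZED VIEW only arrived in PostgreSQL 9.3, so on 9.2 an ordinary summary table refreshed by a scheduled job plays the same role:

 CREATE MATERIALIZED VIEW user_rating_counts AS
 SELECT userid, COUNT(*) AS rcount
 FROM ratings
 GROUP BY userid;

 -- Re-run whenever a slightly stale count is no longer acceptable:
 REFRESH MATERIALIZED VIEW user_rating_counts;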

Try this, because COUNT(*) and COUNT(userid) are very different from each other:

 SELECT userid, COUNT(userid) AS rcount FROM ratings GROUP BY userid 
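
For what it's worth, the semantic difference between the two only shows up when the counted column can be NULL; a minimal illustration:

 -- COUNT(*) counts rows; COUNT(userid) counts only rows where userid IS NOT NULL.
 -- If userid is declared NOT NULL, both yield the same numbers.
 SELECT userid,
        COUNT(*)      AS all_rows,
        COUNT(userid) AS non_null_userids
 FROM ratings
 GROUP BY userid;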
