In PostgreSQL 9.2, I have a table of elements that are evaluated by users:
   id   | userid | itemid |    rating     |      timestamp      |       update_time
--------+--------+--------+---------------+---------------------+------------------------
 522241 |   3991 |   6887 |  0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
 522242 |   3991 |   6934 |  0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
 522243 |   3991 |   6936 | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
 522244 |   3991 |   6942 | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
 522245 |   3991 |   6951 | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
  ...   |  ...   |  ...   |      ...      |         ...         |          ...
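For reference, the table definition is roughly the following (the column types here are approximated from the sample above rather than copied from my schema):

CREATE TABLE ratings (
    id          bigint,            -- row id
    userid      integer,           -- the user who rated the item
    itemid      integer,           -- the item being rated
    rating      double precision,  -- normalized rating value
    "timestamp" timestamp,         -- when the rating was made
    update_time timestamp          -- when the row was last updated
);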
I want to run a very simple query: for each user, count the total number of ratings they have in the database.
I use the following simple approach:
SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid
The table contains 10M rows. The query takes ... well, about 2 or 3 minutes. Honestly, I am not happy with that, and I believe that 10M rows are not so many that the query should take this long. (Or are they?)
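(For what it is worth, this is how I would check the row count and the on-disk size; nothing fancy:)

SELECT COUNT(*) FROM ratings;                              -- about 10M rows
SELECT pg_size_pretty(pg_total_relation_size('ratings'));  -- total size on disk, including indexes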
So I asked PostgreSQL to show me the execution plan:
EXPLAIN SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid
This leads to:
GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
        Sort Key: userid
        ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)
I read it as follows: first, the entire table is read from disk (Seq Scan); then it is sorted by userid in n*log(n) time (Sort); finally, the sorted rows are read one by one and aggregated in linear time. That does not seem like an optimal algorithm to me: if I implemented it myself, I would use a hash table and build the result in a single pass. Never mind.
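(Side note: the rows=24535 estimate for the aggregate comes from the planner's statistics on userid; I believe something like this shows the estimated number of distinct users, although I am not sure how much it matters here:)

SELECT n_distinct
FROM pg_stats
WHERE tablename = 'ratings'
  AND attname   = 'userid';   -- negative values mean a fraction of the row count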
It seemed that the sort on userid was what took so long, so I added an index:
CREATE INDEX ratings_userid_index ON ratings (userid)
Unfortunately, this did not help; the performance did not change at all. I certainly do not consider myself an advanced user, and I suspect I am doing something fundamentally wrong, but I am stuck here. I would appreciate any ideas on how to make this query run in a reasonable amount of time. One more note: the PostgreSQL worker process uses 100% of one of my CPU cores while the query runs, which suggests that disk access is not the main bottleneck.
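In case it is relevant, these are the settings I would expect to matter for a big sort/aggregate like this (I am running with the defaults, as far as I know):

SHOW work_mem;        -- memory available per sort/hash operation
SHOW shared_buffers;  -- PostgreSQL's own buffer cache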
EDIT
As requested by @a_horse_with_no_name. Wow, pretty advanced for me:
EXPLAIN (analyze on, buffers on, verbose on) SELECT userid,COUNT(userid) AS rcount FROM movielens_10m.ratings GROUP BY userId
Outputs:
GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
  Output: userid, count(userid)
  Buffers: shared hit=906 read=82433, temp read=19358 written=19358
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
        Output: userid
        Sort Key: ratings.userid
        Sort Method: external merge  Disk: 154840kB
        Buffers: shared hit=906 read=82433, temp read=19358 written=19358
        ->  Seq Scan on movielens_10m.ratings  (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
              Output: userid
              Buffers: shared hit=901 read=82433
Total runtime: 127193.524 ms
EDIT 2
The comment by @a_horse_with_no_name resolved the issue. I am happy to share my findings:
SET work_mem = '1MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount FROM movielens_10m.ratings GROUP BY userId
produces the same as above:
GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
        Sort Key: userid
        ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)
but
SET work_mem = '10MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount FROM movielens_10m.ratings GROUP BY userId
gives
HashAggregate  (cost=233334.81..233580.16 rows=24535 width=5)
  ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)
Now the query takes only about 3.5 seconds.
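Note that SET only changes work_mem for the current session. If the setting should apply more permanently, something along these lines should work (mydb and myuser are placeholders for the actual database and role names):

ALTER DATABASE mydb SET work_mem = '10MB';   -- applies to new sessions in this database
ALTER ROLE myuser SET work_mem = '10MB';     -- or per role
-- ... or set work_mem = '10MB' in postgresql.conf and reload.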