Huge performance differences between sum (column_name), sum (1) and count (*) on a large dataset

EDIT:
Since you guys suggested creating separate tables for player / tournament names and replacing rows with foreign keys, I did the following:

SELECT DISTINCT tournament INTO tournaments FROM chess_data2 ALTER TABLE tournaments ADD COLUMN id SERIAL PRIMARY KEY 

I repeated that for namew and nameb, and then replaced the strings with foreign keys. That's where it got tricky - I can't do it at the "legal" time.

I tried two approaches: 1) Delete existing indexes
1) Create individual indexes for namew, nameb and tournament separately
1) Run the query by inserting the necessary data into a new table:

 SELECT date, whiterank, blackrank, t_round, result, (SELECT p.id FROM players p WHERE c_d2.namew = p.name) AS whitep, (SELECT p2.id FROM players p2 WHERE c_d2.nameb = p2.name) AS blackp, (SELECT t.id FROM tournaments t WHERE t_d2.tournament = t.t_name) AS tournament INTO final_chess from chess_data2 c_d2 

Unfortunately, it was very slow, so I returned to the user Boris Schegolev. In a comment, he suggested creating a new column in the existing chess_data2 table and updating it. So I did this:

 ALTER TABLE chess_data2 ADD COLUMN name_id INTEGER UPDATE chess_data2 cd2 SET namew_id = (SELECT id FROM players WHERE name = cd2.namew)" 

I started these queries half an hour ago, the first of them was instantaneous, and the second - forever.

What should I do now?

INITIAL QUESTION:

Database Schema:
date DATE
nemew TEXT
nameb TEXT whiterank INTEGER
blackrank INTEGER
tournament TEXT
t_round INTEGER
REAL result
id BIGINT
chess_data2_pkey (identifier)
black_index (name, tournament, date)
chess_data2_pkey (id) UNIQUE
w_b_t_d_index (namew, nameb, tournament, date)
white_index (namew, tournament, date)

Problem:
The following statement is very good (~ 60-70 seconds in a database with 3 million records):

 # Number of points that the white player has so far accrued throughout the tournament (SELECT coalesce(SUM(result),0) from chess_data2 t2 where (t1.namew = t2.namew) and t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90) + SELECT coalesce(SUM(1-result),0) from chess_data2 t2 where (t1.namew = t2.nameb) and t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90 ) AS result_in_t_w from chessdata2 t1 

Meanwhile, the next choice (which is EXACTLY the same as the sentences) takes forever for calculation.

 # Number of games that the white player has so far played in the tournament (SELECT coalesce(count(*),0) from chess_data t2 where (t1.namew = t2.namew) and t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90) + (SELECT coalesce(count(*),0) from chess_data2 t2 where (t1.namew = t2.nameb) and t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1 

I tried a different approach (with a sum), and it also did not improve:

 # Number of games that the white player has so far played in the tournament (SELECT coalesce(sum(1),0) from chess_data t2 where (t1.namew = t2.namew) and t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90) + (SELECT coalesce(sum(1),0) from chess_data2 t2 where (t1.namew = t2.nameb) and t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1 

Any idea what is going on here and how to fix it? I am using python 3.5 and psycopg2 in PyCharm to run these requests. I will be very happy to provide any additional information, since for me this is a very important project.
EXPLAIN ANALYZE (used for the last request):

 Seq Scan on chess_data2 t1 (cost=0.00..49571932.96 rows=2879185 width=86) (actual time=0.061..81756.896 rows=2879185 loops=1) Planning time: 0.161 ms Execution time: 81883.716 ms SubPlan 2 SubPlan 1 -> Aggregate (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185) -> Aggregate (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185) -> Index Only Scan using white_index on chess_data2 t2 (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=2879185) -> Index Only Scan using black_index on chess_data2 t2_1 (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=2879185) Rows Removed by Filter: 1 Rows Removed by Filter: 1 Index Cond: ((namew = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date)) Index Cond: ((nameb = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date)) Heap Fetches: 6009767 Heap Fetches: 5303160 Filter: (t1.date < (date + 90)) Filter: (t1.date < (date + 90)) 
+5
source share
2 answers

Requests do not work well due to poor table design. It is obvious from EXPLAIN that the database uses indexes, but indexed fields are all TEXT , and indexes are huge.

To fix this:

  • create table names
  • replace namew and nameb with namew_id and nameb_id , referring to names
  • create table tournaments
  • replace tournament with tournament_id link tournaments
  • reindex black_index as (nameb_id, tournament_id, date)
  • reindex white_index as (namew_id, tournament_id, date)
  • drop w_b_t_d_index if you are not using it in another request
  • remove useless coalesce from count(*) request

Then your request will look like this:

 SELECT ( SELECT count(*) FROM chess_data t2 WHERE t1.namew_id = t2.namew_id AND t1.tournament_id = t2.tournament_id AND t1.date > t2.date AND t1.date < t2.date + 90 ) + ( SELECT count(*) FROM chess_data2 t2 WHERE t1.namew_id = t2.nameb_id AND t1.tournament_id = t2.tournament_id AND t1.date > t2.date AND t1.date < t2.date + 90 ) AS games_t_w FROM chess_data2 t1 
+3
source

The entire compression operation (except for composite indices):

 DONT CREATE TABLE games ( id serial not null PRIMARY KEY , zdate DATE , namew TEXT , nameb TEXT , whiterank INTEGER NOT NULL DEFAULT 0 , blackrank INTEGER NOT NULL DEFAULT 0 , tournament TEXT , t_round INTEGER NOT NULL DEFAULT 0 , result REAL NOT NULL DEFAULT 0.0 ); CREATE TABLE tournaments ( id serial not null PRIMARY KEY , tournament TEXT UNIQUE ); CREATE TABLE players ( id serial not null PRIMARY KEY , name TEXT UNIQUE ); INSERT INTO tournaments(tournament) SELECT DISTINCT tournament FROM games; VACUUM ANALYZE tournaments; INSERT INTO players(name) SELECT DISTINCT namew FROM games; VACUUM ANALYZE players; INSERT INTO players(name) SELECT DISTINCT nameb FROM games g WHERE NOT EXISTS ( SELECT * FROM players nx WHERE nx.name = g.nameb ); VACUUM ANALYZE players; ALTER TABLE games ADD COLUMN id_w INTEGER , ADD COLUMN id_b INTEGER , ADD COLUMN id_t INTEGER ; UPDATE games g SET id_w = p.id FROM players p WHERE p.name = g.namew; UPDATE games g SET id_b = p.id FROM players p WHERE p.name = g.nameb; UPDATE games g SET id_t = t.id FROM tournaments t WHERE t.tournament = g.tournament; ALTER TABLE games ALTER COLUMN id_w SET NOT NULL , ALTER COLUMN id_b SET NOT NULL , ALTER COLUMN id_t SET NOT NULL ; CREATE INDEX ON games(id_w); CREATE INDEX ON games(id_b); CREATE INDEX ON games(id_t); ALTER TABLE games ADD CONSTRAINT fk_games_idw FOREIGN KEY(id_w) REFERENCES players(id) , ADD CONSTRAINT fk_games_idb FOREIGN KEY(id_b) REFERENCES players(id) , ADD CONSTRAINT fk_games_idt FOREIGN KEY(id_t) REFERENCES tournaments(id) , DROP COLUMN namew , DROP COLUMN nameb , DROP COLUMN tournament ; VACUUM ANALYZE games; 
0
source

All Articles