Query on a partitioned table still scans all partitions

I have a table with over a billion records. To improve performance, I split it into 30 partitions. The most common queries have (id = ...) in their WHERE clause, so I decided to partition the table on the id column.

Basically, partitions were created this way:

    CREATE TABLE foo_0 (CHECK (id % 30 = 0)) INHERITS (foo);
    CREATE TABLE foo_1 (CHECK (id % 30 = 1)) INHERITS (foo);
    CREATE TABLE foo_2 (CHECK (id % 30 = 2)) INHERITS (foo);
    CREATE TABLE foo_3 (CHECK (id % 30 = 3)) INHERITS (foo);
    . . .

I ran ANALYZE on the entire database and, in particular, made it collect extra statistics for this table's id column:

 ALTER TABLE foo ALTER COLUMN id SET STATISTICS 10000; 

However, when I run queries that filter on the id column, the planner shows that it still scans all partitions. constraint_exclusion is set to partition, so that is not the problem.

    EXPLAIN ANALYZE SELECT * FROM foo WHERE (id = 2);

                                    QUERY PLAN
    --------------------------------------------------------------------------
     Result  (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=30.544..215.540 rows=171477 loops=1)
       ->  Append  (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=30.539..106.446 rows=171477 loops=1)
             ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=203) (actual time=0.002..0.002 rows=0 loops=1)
                   Filter: (id = 2)
             ->  Bitmap Heap Scan on foo_0 foo  (cost=3293.44..281055.75 rows=122479 width=52) (actual time=0.020..0.020 rows=0 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_0_idx_1  (cost=0.00..3262.82 rows=122479 width=0) (actual time=0.018..0.018 rows=0 loops=1)
                         Index Cond: (id = 2)
             ->  Bitmap Heap Scan on foo_1 foo  (cost=3312.59..274769.09 rows=122968 width=56) (actual time=0.012..0.012 rows=0 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_1_idx_1  (cost=0.00..3281.85 rows=122968 width=0) (actual time=0.010..0.010 rows=0 loops=1)
                         Index Cond: (id = 2)
             ->  Bitmap Heap Scan on foo_2 foo  (cost=3280.30..272541.10 rows=121903 width=56) (actual time=30.504..77.033 rows=171477 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_2_idx_1  (cost=0.00..3249.82 rows=121903 width=0) (actual time=29.825..29.825 rows=171477 loops=1)
                         Index Cond: (id = 2)
     . . .

What can I do to make the planner smarter? Do I need to run ALTER TABLE foo ALTER COLUMN id SET STATISTICS 10000; for all partitions?

EDIT

After applying Erwin's suggested change to the query, the planner scans only the correct partition. However, the execution time is actually worse than scanning all partitions (at least via their indexes).

    EXPLAIN ANALYZE select * from foo where (id = 2);

                                    QUERY PLAN
    --------------------------------------------------------------------------
     Result  (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=32.611..224.934 rows=171477 loops=1)
       ->  Append  (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=32.606..116.565 rows=171477 loops=1)
             ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=203) (actual time=0.002..0.002 rows=0 loops=1)
                   Filter: (id = 2)
             ->  Bitmap Heap Scan on foo_0 foo  (cost=3293.44..281055.75 rows=122479 width=52) (actual time=0.046..0.046 rows=0 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_0_idx_1  (cost=0.00..3262.82 rows=122479 width=0) (actual time=0.044..0.044 rows=0 loops=1)
                         Index Cond: (id = 2)
             ->  Bitmap Heap Scan on foo_1 foo  (cost=3312.59..274769.09 rows=122968 width=56) (actual time=0.021..0.021 rows=0 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_1_idx_1  (cost=0.00..3281.85 rows=122968 width=0) (actual time=0.020..0.020 rows=0 loops=1)
                         Index Cond: (id = 2)
             ->  Bitmap Heap Scan on foo_2 foo  (cost=3280.30..272541.10 rows=121903 width=56) (actual time=32.536..86.730 rows=171477 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_2_idx_1  (cost=0.00..3249.82 rows=121903 width=0) (actual time=31.842..31.842 rows=171477 loops=1)
                         Index Cond: (id = 2)
             ->  Bitmap Heap Scan on foo_3 foo  (cost=3475.87..285574.05 rows=129032 width=52) (actual time=0.035..0.035 rows=0 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_3_idx_1  (cost=0.00..3443.61 rows=129032 width=0) (actual time=0.031..0.031 rows=0 loops=1)
     . . .
             ->  Bitmap Heap Scan on foo_29 foo  (cost=3401.84..276569.90 rows=126245 width=56) (actual time=0.019..0.019 rows=0 loops=1)
                   Recheck Cond: (id = 2)
                   ->  Bitmap Index Scan on foo_29_idx_1  (cost=0.00..3370.28 rows=126245 width=0) (actual time=0.018..0.018 rows=0 loops=1)
                         Index Cond: (id = 2)
     Total runtime: 238.790 ms

Versus:

    EXPLAIN ANALYZE select * from foo where (id % 30 = 2) and (id = 2);

                                    QUERY PLAN
    --------------------------------------------------------------------------
     Result  (cost=0.00..273120.30 rows=611 width=56) (actual time=31.519..257.051 rows=171477 loops=1)
       ->  Append  (cost=0.00..273120.30 rows=611 width=56) (actual time=31.516..153.356 rows=171477 loops=1)
             ->  Seq Scan on foo  (cost=0.00..0.00 rows=1 width=203) (actual time=0.002..0.002 rows=0 loops=1)
                   Filter: ((id = 2) AND ((id % 30) = 2))
             ->  Bitmap Heap Scan on foo_2 foo  (cost=3249.97..273120.30 rows=610 width=56) (actual time=31.512..124.177 rows=171477 loops=1)
                   Recheck Cond: (id = 2)
                   Filter: ((id % 30) = 2)
                   ->  Bitmap Index Scan on foo_2_idx_1  (cost=0.00..3249.82 rows=121903 width=0) (actual time=30.816..30.816 rows=171477 loops=1)
                         Index Cond: (id = 2)
     Total runtime: 270.384 ms
3 answers

For non-trivial expressions, you have to repeat the condition more or less verbatim in your queries so that the Postgres query planner understands it can rely on the CHECK constraint. Even if it seems redundant!

Per the documentation:

With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan.

The emphasis on "can prove" is mine. The planner does not understand complex expressions. Of course, this also has to be met:

Ensure that the constraint_exclusion configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired.

Instead of:

 SELECT * FROM foo WHERE (id = 2); 

Try:

 SELECT * FROM foo WHERE id % 30 = 2 AND id = 2; 

And:

The default (and recommended) setting of constraint_exclusion is actually neither on nor off, but an intermediate setting called partition, which causes the technique to be applied only to queries that are likely to be working on partitioned tables. The on setting causes the planner to examine CHECK constraints in all queries, even simple ones that are unlikely to benefit.

You can experiment with constraint_exclusion = on to see whether the planner catches on without the redundant verbatim condition. But you have to weigh the costs and benefits of this setting.
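A minimal sketch of such an experiment, changing the setting for the current session only instead of editing postgresql.conf. It reuses the foo table and the query from the question and makes no promise about what the resulting plan will look like:

```sql
-- Show the current setting (the default is "partition").
SHOW constraint_exclusion;

-- Change it for the current session only.
SET constraint_exclusion = on;

-- Check whether partitions other than foo_2 still appear in the plan.
EXPLAIN SELECT * FROM foo WHERE id = 2;

-- Return to the configured default.
RESET constraint_exclusion;
```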

The alternative would be simpler conditions for your partitions, as already outlined by @harmic.

No, increasing the number for STATISTICS will not help in this case. Only the CHECK constraints and the WHERE conditions in your query matter.


Unfortunately, partitioning in PostgreSQL is fairly primitive. It only works for range-based and list-based constraints. Your partition constraints are too complex for the query planner to use them to exclude partitions.

The manual says:

Keep the partitioning constraints simple, else the planner may not be able to prove that partitions don't need to be visited. Use simple equality conditions for list partitioning, or simple range tests for range partitioning, as illustrated in the preceding examples. A good rule of thumb is that partitioning constraints should contain only comparisons of the partitioning column(s) to constants using B-tree-indexable operators.

You might get away with modifying your WHERE clause so that the modulus expression is mentioned explicitly, as Erwin suggested. I have not had much luck with that in the past, although I have not tried it recently and, as he says, there have been improvements in the planner. That is probably the first thing to try.

Otherwise, you will have to restructure your partitions to use ranges of id values instead of the modulus method you are using now. Not a great solution, I know.
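Range-based partitions could look roughly like this. It is only a sketch: the child table names (foo_r0, foo_r1) and the range boundaries are made up for illustration, and moving the existing rows into the new tables is left out.

```sql
-- Hypothetical layout: each child table holds a contiguous range of id values.
-- The boundaries below are illustrative assumptions, not taken from the question.
CREATE TABLE foo_r0 (CHECK (id >= 0        AND id < 40000000)) INHERITS (foo);
CREATE TABLE foo_r1 (CHECK (id >= 40000000 AND id < 80000000)) INHERITS (foo);
-- ... one child table per range ...

-- The constraints now only compare the partitioning column with constants,
-- which is the form the manual recommends for constraint exclusion.
EXPLAIN SELECT * FROM foo WHERE id = 2;
```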

Another solution is to store the modulus of the id in a separate column, which you can then use in a simple equality check for the partition constraint. That wastes a bit of disk space, though, and you would also need to add a matching term to the WHERE clauses of your queries.
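A rough sketch of that idea follows. The column name id_mod and the child table names (foo_m0, foo_m1) are assumptions made for illustration, and filling the new column for existing and newly inserted rows is not shown.

```sql
-- Hypothetical extra column that stores id % 30; the name id_mod is an assumption.
ALTER TABLE foo ADD COLUMN id_mod integer;

-- Child tables constrained by a simple equality check on the new column.
CREATE TABLE foo_m0 (CHECK (id_mod = 0)) INHERITS (foo);
CREATE TABLE foo_m1 (CHECK (id_mod = 1)) INHERITS (foo);
-- ... up to foo_m29 ...

-- Queries have to supply the matching term so the planner can use the constraints.
EXPLAIN SELECT * FROM foo WHERE id_mod = 2 AND id = 2;
```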


In addition to Erwin's answer about the details of how the planner handles partitions, there is a bigger problem here.

Partitioning is not a magic bullet. There are a handful of very specific things for which partitioning is very useful. If none of them apply to you, you cannot expect a performance improvement from partitioning and will most likely see a decrease instead.

To partition correctly, you need a detailed understanding of your usage patterns, or of your data loading and unloading patterns.
