Optimizing the count query for PostgreSQL

I have a table in PostgreSQL with an array column that is constantly being updated.

In my application, I need to get the number of rows whose array column does not contain a specific parameter value. My query looks like this:

    select count(id) from table
    where not (ARRAY['parameter value'] <@ table.array_column)

But as the number of rows grows and this query is executed more often (several times per second, possibly hundreds or thousands), performance deteriorates significantly. It seems to me that counting in PostgreSQL may take linear time (I'm not quite sure about this).
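A plan check like the following should show whether the count is doing a sequential scan (a sketch; tbl stands in for my real table name, since table itself is a reserved word):

    -- Inspect the plan: a "Seq Scan on tbl" node means every row is visited
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(id) FROM tbl
    WHERE NOT (ARRAY['parameter value'] <@ tbl.array_column);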

Basically my question is:

Is there an existing pattern that I'm not aware of that applies to this situation? What would be the best approach for this?

Any suggestion you could give me would be truly appreciated.

3 answers

"Is there an existing pattern that I'm not aware of that applies to this situation? What would be the best approach for this?"

Your best bet in this situation may be to normalize your schema. Split the array out into a separate table. Add a b-tree index on the properties table, or order the primary key so that it can be searched efficiently by property_id.

    CREATE TABLE demo (id integer PRIMARY KEY);

    INSERT INTO demo (id)
    SELECT id FROM arrtable;

    CREATE TABLE properties (
        demo_id  integer NOT NULL REFERENCES demo(id),
        property integer NOT NULL,
        PRIMARY KEY (demo_id, property)
    );

    CREATE INDEX properties_property_idx ON properties(property);
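One way to populate properties from the original array column is to expand the arrays with unnest (a sketch, assuming the same arrtable source table; DISTINCT guards against duplicate array elements violating the primary key):

    -- Expand each row's array into one (demo_id, property) row per element
    INSERT INTO properties (demo_id, property)
    SELECT DISTINCT demo_id, property
    FROM (
        SELECT id AS demo_id, unnest(array_column) AS property
        FROM arrtable
    ) AS expanded;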

Then you can query the properties:

    SELECT count(id) FROM demo
    WHERE NOT EXISTS (
        SELECT 1 FROM properties
        WHERE demo.id = properties.demo_id
          AND property = 1
    );

I expected this to be much faster than the original query, but in fact it is much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It is the same issue: searching for what is not there is much slower than searching for what is. If we are looking for rows that do contain a property, we can avoid the seqscan of demo and just scan properties for matching IDs directly.
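For comparison, the positive-match count can be answered from properties alone, without visiting demo (illustrative, using the schema above; since (demo_id, property) is the primary key, each match is a single row, so a plain count works):

    -- Counting rows that DO contain property 1 never has to touch demo
    SELECT count(*) FROM properties WHERE property = 1;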

Again, a seq scan on the table containing the array does the job just as well.


PostgreSQL actually supports GIN indexes on array columns. Unfortunately, that does not help for NOT ARRAY[...] <@ indexed_col, and GIN indexes are also unsuitable for frequently updated tables.

Demo:

    CREATE TABLE arrtable (id integer PRIMARY KEY, array_column integer[]);

    INSERT INTO arrtable (id, array_column) VALUES (1, ARRAY[1,2,3,4]);

    CREATE INDEX arrtable_arraycolumn_gin_arr_idx
        ON arrtable USING GIN(array_column);

    -- Use the following *only* for testing whether Pg can use an index.
    -- Do not use it in production.
    SET enable_seqscan = off;

    EXPLAIN (BUFFERS, ANALYZE)
    SELECT count(id) FROM arrtable
    WHERE NOT (ARRAY[1] <@ arrtable.array_column);

Unfortunately, this shows that, as written, we cannot use the index. If you don't negate the condition it can be used, so you can search for and count the rows that do contain the search term (by removing NOT).
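For example, the non-negated form of the same query can use the GIN index (a sketch, reusing the demo table above):

    -- Without NOT, the <@ containment test is indexable
    EXPLAIN (BUFFERS, ANALYZE)
    SELECT count(id) FROM arrtable
    WHERE ARRAY[1] <@ arrtable.array_column;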

You can use the index to count the rows containing the target value, and then subtract that result from a count of all rows. Since counting all the rows in a table is rather slow in PostgreSQL (9.1 and older) and requires a sequential scan, this will actually be slower than your current query. It is possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:

    SELECT (
        SELECT count(id) FROM arrtable
    ) - (
        SELECT count(id) FROM arrtable
        WHERE ARRAY[1] <@ arrtable.array_column
    );

It is guaranteed to perform worse than the original version on Pg 9.1 and below, because in addition to the seqscan the original requires, it also needs a GIN index scan. I have now tested this on 9.2, and it does appear to use the index for the count, so it is worth exploring on 9.2. With some less trivial dummy data:

    drop index arrtable_arraycolumn_gin_arr_idx;

    truncate table arrtable;

    insert into arrtable (id, array_column)
    select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4]
    FROM generate_series(1,1000000) s;

    CREATE INDEX arrtable_arraycolumn_gin_arr_idx
        ON arrtable USING GIN(array_column);

Note that a GIN index like this slows down updates a LOT, and is rather slow to create in the first place. It is not suitable for tables that get updated much at all, like your table.

Worse, a query using this index takes up to twice as long as your original query, and at best half as long, on the same dataset. It is worst for cases where the index is not very selective, like ARRAY[1]: 4s versus 2s for the original query. Where the index is highly selective (that is, there are not many matches, for example ARRAY[199]), it runs in about 1.2 seconds versus the original's 3s. This index is simply not worth having for this query.

The lesson here? Sometimes the right answer is just to do a sequential scan.

Since that won't do for your hit rates, either maintain a materialized view with a trigger, as @debenhur suggests (see the sketch below), or try to invert the array into a list of parameters that the row lacks, so that you can use a GiST index, as @maniek suggests.
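A rough sketch of the trigger-maintained counter idea (all names here are illustrative, not from the original answer, and the tracked value is hard-coded as 1):

    -- One row holds the current count of arrtable rows lacking value 1
    CREATE TABLE missing_count (
        tracked_value integer PRIMARY KEY,
        n bigint NOT NULL
    );

    INSERT INTO missing_count (tracked_value, n)
    SELECT 1, count(*) FROM arrtable
    WHERE NOT (ARRAY[1] <@ array_column);

    CREATE OR REPLACE FUNCTION maintain_missing_count() RETURNS trigger AS $$
    BEGIN
        -- The old row stops counting if it lacked the value...
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            IF NOT (ARRAY[1] <@ OLD.array_column) THEN
                UPDATE missing_count SET n = n - 1 WHERE tracked_value = 1;
            END IF;
        END IF;
        -- ...and the new row starts counting if it lacks the value.
        IF TG_OP IN ('INSERT', 'UPDATE') THEN
            IF NOT (ARRAY[1] <@ NEW.array_column) THEN
                UPDATE missing_count SET n = n + 1 WHERE tracked_value = 1;
            END IF;
        END IF;
        RETURN NULL;  -- AFTER trigger: return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER arrtable_missing_count
    AFTER INSERT OR UPDATE OR DELETE ON arrtable
    FOR EACH ROW EXECUTE PROCEDURE maintain_missing_count();

    -- The hot-path query is now a single-row lookup
    SELECT n FROM missing_count WHERE tracked_value = 1;

Note that every write to arrtable now contends on the single counter row, so at very high write rates the counter itself can become a hotspot.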


I think you are out of luck with your current data model. Try to think of an algorithm that the database would have to execute for your query: it cannot work without scanning the data sequentially.

Can you arrange the column so that it stores the inverse data (so the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? That query could use a gin/gist index.
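For example (a sketch, with a hypothetical inverted column missing_params listing the values each row lacks):

    -- GIN can serve positive containment tests against the inverted column
    CREATE INDEX tbl_missing_params_gin ON tbl USING GIN (missing_params);

    -- "Rows that do not have 'parameter value'" becomes a positive match
    SELECT count(id) FROM tbl
    WHERE ARRAY['parameter value'] <@ tbl.missing_params;

This only works if the set of possible parameter values is known and bounded, since each row must explicitly list everything it lacks.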

