Count the number of rows matching a query when the number of matches is large (100,000+)

Purpose:

Get the number of times something happened between two points in time, where the resulting count is on the order of 100,000 - 10,000,000.

Current implementation:

  • Using PostgreSQL
  • Each "incident" is recorded as a separate row in the table

Columns:

  • Incident type
  • Timestamp of when it happened

Count query (pseudo-code):

COUNT rows WHERE time_occurred > <begin_time> AND time_occurred < <end_time> 
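For concreteness, here is a minimal sketch of what this looks like as real SQL. The table name incidents and the column name incident_type are assumptions (time_occurred is taken from the query above), and the literal bounds are just example values standing in for <begin_time> / <end_time>:

 -- Assumed shape of the table described above (names are illustrative).
 create table incidents (
     incident_type text      not null,
     time_occurred timestamp not null
 );

 -- Example bounds substituted for <begin_time> / <end_time>.
 select count(*)
 from incidents
 where time_occurred > '2013-01-01 10:37'
   and time_occurred < '2013-03-02 11:20';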

Problem:

This works, but the query is very inefficient and takes about 40 seconds to return. As far as I can tell, PostgreSQL is not a great database for this kind of query.

I sat down and came up with several ways that this type of query could be indexed and answered in O(log n) time, so I know it is possible.

What tools should I use for this? Should I use a different database to store the counts? Is there a package I could install on top of PostgreSQL to make this easy? What are my options?

Note:

In case it isn't clear: the result of the COUNT should be on the order of 100,000 - 10,000,000. That is, the number of rows matching the query is on the order of 100,000 - 10,000,000, while the actual number of rows in the table is an order of magnitude greater.

Many thanks!

+7
4 answers

Before PostgreSQL 9.2, the MVCC implementation required a query to visit each row in the table to check whether that version of the row was visible to the current transaction, even when the query only touched indexed columns. This shows up as slow counts on large tables, even in simple cases.

PostgreSQL 9.2 implements index-only scans, which can help alleviate this problem for some workloads.
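For example, assuming the table and column names from the question (the index name below is made up), a plain btree index lets 9.2+ answer the range count with an index-only scan, provided autovacuum has kept the visibility map reasonably current:

 -- Sketch only: index name and bounds are illustrative.
 create index incidents_time_occurred_idx on incidents (time_occurred);

 -- On 9.2+, EXPLAIN ANALYZE should show an "Index Only Scan" with a low
 -- "Heap Fetches" count once the table has been vacuumed.
 explain analyze
 select count(*)
 from incidents
 where time_occurred >= '2013-01-01 10:37'
   and time_occurred <  '2013-03-02 11:20';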

If you are stuck below 9.2, there are some well-known workarounds if an approximate row count for a simple query is good enough. See http://wiki.postgresql.org/wiki/Count_estimate .
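As one example from that page, the planner's own statistics give a cheap whole-table estimate (note: not restricted to a time range); the table name is assumed from the question:

 -- Approximate total row count, maintained by ANALYZE / autovacuum.
 select reltuples::bigint as approx_row_count
 from pg_class
 where relname = 'incidents';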

+5

Keep a table of incidents aggregated by day.

 create table incidents_aggregated_by_day (
     "day" date primary key,
     total integer
 );

Daily Run:

 insert into incidents_aggregated_by_day ("day", total)
 select date_trunc('day', time_occurred)::date as "day", count(*) as total
 from incidents
 where time_occurred < current_date
   and date_trunc('day', time_occurred)::date not in (
       select "day" from incidents_aggregated_by_day
   )
 group by 1;

Suppose you want the count between '2013-01-01 10:37' and '2013-03-02 11:20':

 select
   ( select coalesce(sum(total), 0)
     from incidents_aggregated_by_day
     where "day" >= '2013-01-02'::date
       and "day" <  '2013-03-02'::date )
   +
   ( select count(*)
     from incidents
     where (time_occurred >= '2013-01-01 10:37'::timestamp
            and time_occurred <  '2013-01-02'::timestamp)
        or (time_occurred >= '2013-03-02'::timestamp
            and time_occurred <= '2013-03-02 11:20'::timestamp) ) as total;

Instead of reading 100 million rows, you will read a few hundred or a few thousand. With the right indexes in place, it will be fast.

+1

Another approach might be to partition the table. This person seems to have solved a very similar problem with partitioning:

http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/

My concern with his approach would be maintainability. In his example (you need to go back to part 1 of the tutorial to see how he created the partitions), he creates each child table by hand and hard-codes the routing to the child tables in a trigger. If your table grows continuously, you will be doing a lot of DBA work.

However, he does seem to get a big performance boost, so if you can figure out how to make it more maintainable, this could be a good way to proceed.
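For a rough idea of what that tutorial's approach involves, here is a sketch of old-style inheritance partitioning by month with a routing trigger. All names and date ranges below are made up (this is the spirit of the technique, not his exact code), and each new month needs a new child table plus an updated trigger, which is the maintenance cost noted above:

 -- Child tables carry CHECK constraints describing their time range.
 create table incidents_2013_01 (
     check (time_occurred >= '2013-01-01' and time_occurred < '2013-02-01')
 ) inherits (incidents);

 create table incidents_2013_02 (
     check (time_occurred >= '2013-02-01' and time_occurred < '2013-03-01')
 ) inherits (incidents);

 -- Trigger function that routes inserts on the parent to the right child.
 create or replace function incidents_insert_router() returns trigger as $$
 begin
     if new.time_occurred >= '2013-01-01' and new.time_occurred < '2013-02-01' then
         insert into incidents_2013_01 values (new.*);
     elsif new.time_occurred >= '2013-02-01' and new.time_occurred < '2013-03-01' then
         insert into incidents_2013_02 values (new.*);
     else
         raise exception 'no partition for time_occurred = %', new.time_occurred;
     end if;
     return null;  -- the row has been redirected into a child table
 end;
 $$ language plpgsql;

 create trigger incidents_insert_trigger
     before insert on incidents
     for each row execute procedure incidents_insert_router();

With constraint_exclusion set to partition (the usual default), the planner uses those CHECK constraints to skip child tables that cannot contain rows in the queried time range, so a range count only scans the relevant months.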

+1

This is exactly the kind of problem that dimensional modelling and data warehousing solutions are designed to solve.

On a previous project I spent several weeks building a data warehouse in Ruby to deal with queries like this, and exposed it to the main application through a simple REST API. Basically, you extract your data and transform it into a "star schema", which is highly optimized for the kind of queries you describe.

PostgreSQL works well as a data warehouse database.

This is a very deep topic, and a great starting resource is The Data Warehouse Toolkit: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-dimensional/dp/0471200247
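As a very rough sketch of the star-schema idea (all names and the daily grain are illustrative assumptions, not taken from the book or from the project described above):

 -- A small date dimension plus a fact table holding pre-aggregated daily
 -- counts per incident type; range counts then scan this compact fact
 -- table instead of the raw incident rows.
 create table dim_date (
     date_key  integer primary key,  -- e.g. 20130101
     full_date date    not null unique
 );

 create table fact_incident_counts (
     date_key       integer not null references dim_date (date_key),
     incident_type  text    not null,
     incident_count integer not null,
     primary key (date_key, incident_type)
 );

 select sum(f.incident_count) as total
 from fact_incident_counts f
 join dim_date d on d.date_key = f.date_key
 where d.full_date >= '2013-01-01'
   and d.full_date <  '2013-03-02';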

+1
