PostgreSQL - fetching a row that has a Max value for a column

I am dealing with a Postgres table (called "lives") that contains columns for time_stamp, usr_id, trans_id and lives_remaining. I need a query that will give me the most recent lives_remaining total for each usr_id.

  • There are several users (different usr_id)
  • time_stamp is not a unique identifier: sometimes user events (one per row in the table) will occur with the same time_stamp.
  • trans_id is unique only for very small time ranges: over time, it repeats
  • lives_remaining (for a given user) may both increase and decrease over time

Example:

  time_stamp | lives_remaining | usr_id | trans_id
 ------------+-----------------+--------+----------
  07:00      |        1        |   1    |    1
  09:00      |        4        |   2    |    2
  10:00      |        2        |   3    |    3
  10:00      |        1        |   2    |    4
  11:00      |        4        |   1    |    5
  11:00      |        3        |   1    |    6
  13:00      |        3        |   3    |    1

Since I will need to access the other columns of the row with the latest data for each given usr_id, I need a query that gives the result as follows:

  time_stamp | lives_remaining | usr_id | trans_id
 ------------+-----------------+--------+----------
  11:00      |        3        |   1    |    6
  10:00      |        1        |   2    |    4
  13:00      |        3        |   3    |    1

As already mentioned, each usr_id can gain or lose lives, and sometimes these timestamped events occur so close together that they end up with the same time stamp! Therefore, this query won't work:

 SELECT b.time_stamp, b.lives_remaining, b.usr_id, b.trans_id
 FROM (SELECT usr_id, MAX(time_stamp) AS max_timestamp
       FROM lives
       GROUP BY usr_id
       ORDER BY usr_id) a
 JOIN lives b ON a.max_timestamp = b.time_stamp

Instead, I need to use both time_stamp (primary) and trans_id (as a tie-breaker) to identify the correct row. I also need to pass that information from the subquery to the main query, which will supply the data for the other columns of the matching rows. This is the hacked-up query I've gotten working:

 SELECT b.time_stamp, b.lives_remaining, b.usr_id, b.trans_id
 FROM (SELECT usr_id, MAX(time_stamp || '*' || trans_id) AS max_timestamp_transid
       FROM lives
       GROUP BY usr_id
       ORDER BY usr_id) a
 JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id
 ORDER BY b.usr_id

Okay, so this works, but I don't like it. It requires a query within a query and a self join, and it seems to me it could be much simpler by grabbing the row that MAX found to have the largest time_stamp and trans_id. The "lives" table has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMs and Postgres in particular, so I know I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.

I found a similar discussion here . Can I perform some kind of Postgres equivalent of an Oracle analytic function?

Any advice on accessing the other column values of the row picked out by an aggregate function (e.g. MAX), on creating indexes, and on writing better queries would be much appreciated!

PS You can use the following to create the example table:

 CREATE TABLE lives (time_stamp timestamp, lives_remaining integer, usr_id integer, trans_id integer);
 INSERT INTO lives VALUES ('2000-01-01 07:00', 1, 1, 1);
 INSERT INTO lives VALUES ('2000-01-01 09:00', 4, 2, 2);
 INSERT INTO lives VALUES ('2000-01-01 10:00', 2, 3, 3);
 INSERT INTO lives VALUES ('2000-01-01 10:00', 1, 2, 4);
 INSERT INTO lives VALUES ('2000-01-01 11:00', 4, 1, 5);
 INSERT INTO lives VALUES ('2000-01-01 11:00', 3, 1, 6);
 INSERT INTO lives VALUES ('2000-01-01 13:00', 3, 3, 1);
+54
sql postgresql query-optimization
Feb 25 '09 at 16:37
7 answers

On a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30):

By query cost, below, I mean the Postgres cost-based optimizer's cost estimate (with Postgres' default xxx_cost values), which is a weighed estimate of the required I/O and CPU resources; you can obtain it by firing up PgAdminIII and running "Query/Explain (F7)" on the query with the "Query/Explain options" set to "Analyze".
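If you prefer the command line, the same numbers can be obtained from psql by prefixing a query with EXPLAIN (cost estimate only) or EXPLAIN ANALYZE (also executes the query and reports actual times); a minimal sketch, using a throwaway aggregate query as a stand-in for the queries below:

 -- EXPLAIN prints the planner's cost estimate only;
 -- EXPLAIN ANALYZE also runs the query and reports actual run times
 EXPLAIN ANALYZE
 SELECT usr_id, MAX(time_stamp)
 FROM lives
 GROUP BY usr_id;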

  • Quassnoi's query has a cost estimate of 745k (!) and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
  • The LEFT OUTER JOIN query (shown in another answer below) has a cost estimate of 93k and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
  • Query #1 below has a cost estimate of 16k and completes in 800 ms (given a compound index on (usr_id, trans_id, time_stamp))
  • Query #2 below has a cost estimate of 14k and completes in 800 ms (given a compound expression index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id))
    • this one is Postgres-specific
  • Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once, and if you temporarily increase work_mem (if needed) so that the sort fits in memory, it will be by far the fastest of all the queries.

All times above include retrieval of the full 10k-row result set.

Your goal is to minimize both the cost estimate and the query run time, with an emphasis on the cost estimate. Query execution can depend heavily on runtime conditions (e.g. whether the relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that a cost estimate is exactly that, an estimate.

The best query run time is obtained when running on a dedicated database with no load (e.g. playing with pgAdminIII on a development PC). Query time will vary in production depending on actual machine load and data-access spread. When one query appears slightly faster (<20%) than another but has a much higher cost, it will generally be wiser to choose the one with the longer run time but the lower cost.

If you expect that there will be no competition for memory on your production machine at the time the query runs (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity), then the query time you obtained in standalone mode (e.g. pgAdminIII on a development PC) will be representative. If there is contention on the production system, query time will degrade proportionally to the cost ratio, since the query with the lower cost does not rely as much on the cache, whereas the query with the higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:

                cost | time (dedicated machine) |     time (under load) |
  -------------------+--------------------------+-----------------------+
  some query A:   5k | (all data cached) 900ms  |    (less i/o) 1000ms  |
  some query B:  50k | (all data cached) 900ms  | (lots of i/o) 10000ms |

Remember to run ANALYZE lives once after creating the necessary indexes.
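For reference, a sketch of the compound indexes mentioned in the bullets above, followed by the ANALYZE step (the index names are my own and purely illustrative):

 -- compound index used by Quassnoi's query and query #1
 CREATE INDEX lives_usr_trans_time_idx ON lives (usr_id, trans_id, time_stamp);
 -- compound index used by query #3
 CREATE INDEX lives_usr_time_trans_idx ON lives (usr_id, time_stamp, trans_id);
 -- compound expression index used by query #2 (Postgres-specific)
 CREATE INDEX lives_usr_epoch_trans_idx ON lives (usr_id, (EXTRACT(EPOCH FROM time_stamp)), trans_id);
 -- refresh planner statistics once the indexes exist
 ANALYZE lives;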




Query #1

 -- incrementally narrow down the result set via inner joins
 -- the CBO may elect to perform one full index scan combined
 -- with cascading index lookups, or as hash aggregates terminated
 -- by one nested index lookup into lives - on my machine
 -- the latter query plan was selected given my memory settings and
 -- histogram
 SELECT l1.*
 FROM lives AS l1
 INNER JOIN (
    SELECT usr_id, MAX(time_stamp) AS time_stamp_max
    FROM lives
    GROUP BY usr_id
 ) AS l2
   ON l1.usr_id = l2.usr_id AND l1.time_stamp = l2.time_stamp_max
 INNER JOIN (
    SELECT usr_id, time_stamp, MAX(trans_id) AS trans_max
    FROM lives
    GROUP BY usr_id, time_stamp
 ) AS l3
   ON l1.usr_id = l3.usr_id AND l1.time_stamp = l3.time_stamp AND l1.trans_id = l3.trans_max

Query #2

 -- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
 -- this results in a single table scan and one nested index lookup into lives,
 -- by far the least I/O intensive operation even in case of great scarcity
 -- of memory (least reliant on cache for the best performance)
 SELECT l1.*
 FROM lives AS l1
 INNER JOIN (
    SELECT usr_id, MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp), trans_id]) AS compound_time_stamp
    FROM lives
    GROUP BY usr_id
 ) AS l2
   ON l1.usr_id = l2.usr_id
  AND EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1]
  AND l1.trans_id = l2.compound_time_stamp[2]

Update for 2013/01/29

Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:

Query #3

 -- use Window Functions
 -- performs a SINGLE scan of the table
 SELECT DISTINCT ON (usr_id)
        last_value(time_stamp) OVER wnd,
        last_value(lives_remaining) OVER wnd,
        usr_id,
        last_value(trans_id) OVER wnd
 FROM lives
 WINDOW wnd AS (
    PARTITION BY usr_id ORDER BY time_stamp, trans_id
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
 );
+62
Feb 26 '09 at 1:19

I would suggest a clean version based on DISTINCT ON (see docs ):

 SELECT DISTINCT ON (usr_id)
        time_stamp,
        lives_remaining,
        usr_id,
        trans_id
 FROM lives
 ORDER BY usr_id, time_stamp DESC, trans_id DESC;
+24
Jun 11 '15 at 8:40

Here's another method that doesn't use correlated subqueries or GROUP BY. I am not an expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions provided by other people to find out which ones are best for you.

 SELECT l1.*
 FROM lives l1
 LEFT OUTER JOIN lives l2
   ON (l1.usr_id = l2.usr_id
       AND (l1.time_stamp < l2.time_stamp
            OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
 WHERE l2.usr_id IS NULL
 ORDER BY l1.usr_id;

I assume that trans_id is unique for at least any given time_stamp value.

+7
Feb 25 '09 at 18:26

I like the style of Mike Woodhouse's answer on the other page you mentioned. It's especially concise when the thing being maximised is just a single column, in which case the subquery can simply use MAX(some_col) and GROUP BY the other columns; but in your case you have a 2-part quantity to be maximised, so you can do the same thing by using ORDER BY plus LIMIT 1 instead (as done by Quassnoi):

 SELECT *
 FROM lives o
 WHERE (usr_id, time_stamp, trans_id) IN (
    SELECT usr_id, time_stamp, trans_id
    FROM lives sq
    WHERE sq.usr_id = o.usr_id
    ORDER BY time_stamp DESC, trans_id DESC
    LIMIT 1
 )

I find the WHERE (a, b, c) IN (subquery) row-constructor syntax nice because it cuts down on the amount of verbiage needed.
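For comparison, here is a minimal sketch of the single-column case mentioned above; it assumes (contrary to the asker's actual data) that time_stamp alone is unique per usr_id, so it is illustration only:

 SELECT l.*
 FROM lives l
 WHERE (l.usr_id, l.time_stamp) IN (
    SELECT usr_id, MAX(time_stamp)
    FROM lives
    GROUP BY usr_id
 );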

+3
Jun 15 '18

There is actually a hacky solution for this problem. Suppose you want to select the biggest tree of each forest in a region.

 SELECT (array_agg(tree.id ORDER BY tree_size.size DESC))[1]
 FROM tree
 JOIN forest ON (tree.forest = forest.id)
 GROUP BY forest.id

When you group the trees by forest, you get an unsorted list of trees, and you need to find the biggest one. Aggregating into an array lets you sort the rows by size and select the first element of the list. It may seem inefficient, but if you have millions of rows it will be considerably faster than the solutions with JOIN and WHERE conditions.

By the way, note that ORDER BY inside array_agg was introduced in PostgreSQL 9.0.
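A sketch of the same trick applied to the asker's lives table (my own adaptation, assuming, as elsewhere in this thread, that the latest event per user is the one with the greatest (time_stamp, trans_id); requires PostgreSQL 9.0+ for the ORDER BY inside array_agg):

 SELECT usr_id,
        (array_agg(time_stamp      ORDER BY time_stamp DESC, trans_id DESC))[1] AS time_stamp,
        (array_agg(lives_remaining ORDER BY time_stamp DESC, trans_id DESC))[1] AS lives_remaining,
        (array_agg(trans_id        ORDER BY time_stamp DESC, trans_id DESC))[1] AS trans_id
 FROM lives
 GROUP BY usr_id;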

+3
Jan 18 '13 at 0:04
 SELECT l.*
 FROM (
    SELECT DISTINCT usr_id
    FROM lives
 ) lo, lives l
 WHERE l.ctid = (
    SELECT ctid
    FROM lives li
    WHERE li.usr_id = lo.usr_id
    ORDER BY time_stamp DESC, trans_id DESC
    LIMIT 1
 )

Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.

You should always have a PRIMARY KEY in your tables.
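A sketch of both suggestions (the index name is my own; the primary key assumes, as elsewhere in this thread, that (usr_id, time_stamp, trans_id) is unique):

 -- compound index that lets the correlated subquery find each user's latest ctid quickly
 -- (same compound index as suggested in the first answer above)
 CREATE INDEX lives_latest_idx ON lives (usr_id, time_stamp, trans_id);

 -- one possible primary key; its backing index would also cover the query above,
 -- making the separate CREATE INDEX redundant
 ALTER TABLE lives ADD PRIMARY KEY (usr_id, time_stamp, trans_id);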

+2
Feb 25 '09 at 17:01

I think you have one major problem here: there is no monotonically increasing "counter" to guarantee that a given row happened later in time than another. Take this example:

 timestamp   lives_remaining   user_id   trans_id
 10:00       4                 3         5
 10:00       5                 3         6
 10:00       3                 3         1
 10:00       2                 3         2

You cannot determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data that will give you the correct answer.

Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.

Alternatively, use a trans_id that won't roll over for a very, very long time. With a trans_id that rolls over, you cannot tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.
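A minimal sketch of that second suggestion (the column name event_seq is my own, and it assumes adding a surrogate column is acceptable):

 -- a bigserial column provides a monotonically increasing counter that will not
 -- realistically roll over; note that values assigned to pre-existing rows follow
 -- physical table order, so ordering is only guaranteed for rows inserted afterwards
 ALTER TABLE lives ADD COLUMN event_seq bigserial;

 -- the latest row per user then becomes unambiguous
 SELECT DISTINCT ON (usr_id) *
 FROM lives
 ORDER BY usr_id, event_seq DESC;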

+1
Feb 26 '09 at 1:48


