Return only the newest rows from a BigQuery table with duplicate items

Question


I have a table with many duplicate items: many rows share the same id, perhaps differing only in the requested_at column.

I would like to do a select * from the table, but return only one row per id: the most recently requested one.

I looked into group by id, but then I need an aggregate for each column. This is easy for requested_at (max(requested_at) as requested_at), but the other columns are tough.

How can I get the value for title, etc. that corresponds to the most recently updated row?

+5

google-bigquery

Kevin Moore Dec 08 '15 at 20:09


2 answers

Try something like this:

  SELECT *
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY <id_column> ORDER BY <timestamp_column> DESC) AS row_number
    FROM <table>
  )
  WHERE row_number = 1

Note that this adds a row_number column to the output, which you might not want. To avoid that, select the individual columns by name in the outer SELECT statement.

In your case, it sounds like requested_at is the column you want to use in the ORDER BY.
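As a concrete sketch of the two points above, here is the same query written with the column names from the question (id, title, requested_at) and the columns listed explicitly in the outer SELECT so the helper column is not returned. The table name mydataset.requests and the alias rn are placeholders, not names from the question; substitute your own.

```sql
-- Assumption: mydataset.requests is a hypothetical table name.
-- id, title and requested_at are the columns mentioned in the question.
SELECT
  id,
  title,
  requested_at
FROM (
  SELECT
    id,
    title,
    requested_at,
    -- Number the rows within each id, newest first.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY requested_at DESC) AS rn
  FROM mydataset.requests
)
-- Keep only the newest row for each id.
WHERE rn = 1
```

Because ROW_NUMBER() assigns distinct numbers even when two rows have the same requested_at, this should return exactly one row per id; which of the tied rows is kept is not defined.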

You will also want to use allow_large_results, set a destination table, and specify no flattening of the results (if you have a schema with repeated fields).

+2

Jordan Tigani Dec 08 '15 at 20:16


Matthew Wesley · Accepted Answer · 2015-12-08T20:23:44+0000

I suggest a similar form that avoids sorting in a window function:

  SELECT *
  FROM (
    SELECT
      *,
      MAX(<timestamp_column>) OVER (PARTITION BY <id_column>) AS max_timestamp
    FROM <table>
  )
  WHERE <timestamp_column> = max_timestamp
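Applied to the columns from the question, this form might look like the sketch below. As before, mydataset.requests is a placeholder table name and max_requested_at is just an illustrative alias.

```sql
-- Assumption: mydataset.requests is a hypothetical table name.
SELECT
  id,
  title,
  requested_at
FROM (
  SELECT
    id,
    title,
    requested_at,
    -- Newest timestamp seen for this id; no ORDER BY needed in the window.
    MAX(requested_at) OVER (PARTITION BY id) AS max_requested_at
  FROM mydataset.requests
)
-- Keep the rows whose timestamp equals the newest one for their id.
WHERE requested_at = max_requested_at
```

One difference from the ROW_NUMBER() form: if several rows for the same id share the same newest requested_at, all of them pass the filter, so this can return more than one row per id.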
