Eliminate duplicate records in a BigQuery table

Question

I plan to add incremental data to a BigQuery table daily. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the data already in the table. One approach would be to:

Collect the set of keys from the incremental data (let's call it INCR_KEYS).
Run a query along the lines of SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS) and save the results in a new table.
Append the incremental data to the new table.
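
As a rough illustration of the three steps above, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for BigQuery. The table names existing, incremental, and merged and the val column are hypothetical; only pkey_col comes from the question. It shows the logic only, not BigQuery's API or billing behaviour.

```python
import sqlite3

# SQLite is only a stand-in for BigQuery; table names and the val column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE existing (pkey_col INTEGER, val TEXT);
    CREATE TABLE incremental (pkey_col INTEGER, val TEXT);
    INSERT INTO existing VALUES (1, 'old-1'), (2, 'old-2'), (3, 'old-3');
    INSERT INTO incremental VALUES (2, 'new-2'), (4, 'new-4');
""")

# First two steps: keep only the existing rows whose key is absent from
# the incremental data, and save them in a new table.
conn.execute("""
    CREATE TABLE merged AS
    SELECT * FROM existing
    WHERE pkey_col NOT IN (SELECT pkey_col FROM incremental)
""")

# Third step: append the incremental rows to the new table.
conn.execute("INSERT INTO merged SELECT * FROM incremental")

rows = conn.execute(
    "SELECT pkey_col, val FROM merged ORDER BY pkey_col"
).fetchall()
print(rows)  # [(1, 'old-1'), (2, 'new-2'), (3, 'old-3'), (4, 'new-4')]
```

Key 2 ends up with the incremental value, and every key appears exactly once.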

My concern with this approach is that it creates a duplicate copy of a large table and adds to my bill.

Is there a better way to achieve the same without creating a duplicate table?

google-bigquery

user1659408 10 Sep '12 at 7:15

3 answers

Jordan Tigani · Answer 1 · 2012-09-10T15:13:08+0000

I don't know of a way to do this without creating a duplicate table. This actually sounds like a pretty clever solution.

The incremental cost to you is likely to be very small: BigQuery only bills you for data for the length of time that it exists. If you delete the old table right away, you only pay for both tables for a matter of seconds or minutes.

Siddartha Naidu · Answer 2 · 2014-04-04T05:41:06+0000

You can run the query with the destination table set to the existing table and the write disposition set to truncate (the --replace flag), then append the update:

    bq query --allow_large_results --replace --destination_table=mydataset.mytable \
        'SELECT * FROM mydataset.mytable
         WHERE key NOT IN (SELECT key FROM mydataset.update)'
    bq cp --append_table mydataset.update mydataset.mytable

I believe this will work, but it is worth taking a backup first, especially since you can delete it soon afterwards.

    bq cp mydataset.mytable mydataset.backup
    # You can also build the new table in one pass:
    bq query --allow_large_results --replace --destination_table=mydataset.mytable \
        'SELECT * FROM (
           SELECT * FROM mydataset.mytable
           WHERE key NOT IN (SELECT key FROM mydataset.update)
         ), (
           SELECT * FROM mydataset.update
         )'
    bq rm mydataset.backup

Rich Reinheimer · Answer 3 · 2015-12-02T18:09:00+0000

You can set a new destination table and simply query for a count, grouping by all columns:

    SELECT FIELD1, FIELD2, FIELD3, FIELD4
    FROM (
      SELECT COUNT(*), FIELD1, FIELD2, FIELD3, FIELD4
      FROM [<TABLE>]
      GROUP BY FIELD1, FIELD2, FIELD3, FIELD4
    )
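
To see what grouping by every column does and does not remove, here is a small sketch, again using Python's sqlite3 module as a stand-in for BigQuery. The table t and its two columns are hypothetical. It suggests that this approach collapses rows that are identical in all columns, while rows that share a key but differ in another column are both kept.

```python
import sqlite3

# SQLite is only a stand-in for BigQuery; table t and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (field1 INTEGER, field2 TEXT);
    INSERT INTO t VALUES (1, 'a'), (1, 'a'), (2, 'b'), (2, 'c');
""")

# Grouping by every column collapses rows that match in all columns.
deduped = conn.execute("""
    SELECT field1, field2
    FROM t
    GROUP BY field1, field2
    ORDER BY field1, field2
""").fetchall()
print(deduped)  # [(1, 'a'), (2, 'b'), (2, 'c')]
```

The exact duplicate (1, 'a') is reduced to one row, but field1 = 2 still appears twice, so this is not equivalent to deduplicating on a primary key column alone.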
