Eliminate duplicate records in a BigQuery table

I plan to append incremental data to a BigQuery table daily. Each time I append incremental data to the existing table, I want to remove from the existing data any records that duplicate the new ones (based on the primary key column). One approach is:

  • Collect the set of keys from the incremental data (call it INCR_KEYS).
  • Run a query along the lines of SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS) and save the results to a new table.
  • Append the incremental data to the new table.
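
The three steps above can be tried out locally before spending BigQuery quota. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for BigQuery, and the table names main_table, incr_table and new_table, as well as the sample rows, are hypothetical.

```python
import sqlite3


def dedupe_into_new_table(conn):
    """Build new_table from main_table minus the incremental keys, then append the increment."""
    cur = conn.cursor()
    # Steps 1 and 2: keep only existing rows whose key is absent from the increment.
    cur.execute(
        "CREATE TABLE new_table AS "
        "SELECT * FROM main_table "
        "WHERE pkey_col NOT IN (SELECT pkey_col FROM incr_table)"
    )
    # Step 3: append the incremental rows to the new table.
    cur.execute("INSERT INTO new_table SELECT * FROM incr_table")
    conn.commit()


# Hypothetical sample data: key 2 exists in both the old table and the increment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (pkey_col INTEGER, val TEXT);
    CREATE TABLE incr_table (pkey_col INTEGER, val TEXT);
    INSERT INTO main_table VALUES (1, 'old-a'), (2, 'old-b'), (3, 'old-c');
    INSERT INTO incr_table VALUES (2, 'new-b'), (4, 'new-d');
""")
dedupe_into_new_table(conn)
rows = conn.execute(
    "SELECT pkey_col, val FROM new_table ORDER BY pkey_col"
).fetchall()
print(rows)
```

With this sample data, the old row for key 2 is dropped and the incremental row takes its place, so each key appears exactly once in new_table.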

My concern with this approach is that it creates a duplicate copy of a large table and adds to my bill.

Is there a better way to achieve the same without creating a duplicate table?

+7
3 answers

I don't know how to do this without creating a duplicate table - this really seems like a pretty smart solution.

The incremental cost to you is likely to be very small, because BigQuery only bills you for data for the length of time that it exists. If you delete the old table right away, you only pay for both tables for a few seconds or minutes.

+4

You can run the query with the destination table set to the existing table and the write disposition set to truncate (the --replace flag), then append the update:

 bq query --allow_large_results --replace --destination_table=mydataset.mytable \
   'SELECT * FROM mydataset.mytable WHERE key NOT IN (SELECT key FROM mydataset.update)'
 bq cp --append_table mydataset.update mydataset.mytable

I think this will work, but I think it's worth making a backup, especially since you can delete it shortly afterwards.

 bq cp mydataset.mytable mydataset.backup
 # You can also build the new table in one pass:
 bq query --allow_large_results --replace --destination_table=mydataset.mytable \
   'SELECT * FROM (
      SELECT * FROM mydataset.mytable
      WHERE key NOT IN (SELECT key FROM mydataset.update)
    ), (
      SELECT * FROM mydataset.update
    )'
 bq rm mydataset.backup
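
In BigQuery legacy SQL, the comma between the two subqueries acts as a UNION ALL, so the one-pass query keeps the non-conflicting old rows and adds every row of the update. A rough local sketch of that logic, again using sqlite3 rather than BigQuery; the names mytable, updates and id are hypothetical stand-ins chosen to avoid SQLite reserved words:

```python
import sqlite3


def replace_with_merged(conn):
    """Rebuild mytable in one pass: old rows not in the update, plus the update."""
    cur = conn.cursor()
    # UNION ALL plays the role of the comma in the legacy SQL query.
    cur.execute(
        "CREATE TABLE merged AS "
        "SELECT * FROM mytable WHERE id NOT IN (SELECT id FROM updates) "
        "UNION ALL "
        "SELECT * FROM updates"
    )
    # Swap the merged result in for the original table (the --replace step).
    cur.execute("DROP TABLE mytable")
    cur.execute("ALTER TABLE merged RENAME TO mytable")
    conn.commit()


# Hypothetical sample data: id 20 is present in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, val TEXT);
    CREATE TABLE updates (id INTEGER, val TEXT);
    INSERT INTO mytable VALUES (10, 'old'), (20, 'old');
    INSERT INTO updates VALUES (20, 'new'), (30, 'new');
""")
replace_with_merged(conn)
merged_rows = conn.execute("SELECT id, val FROM mytable ORDER BY id").fetchall()
print(merged_rows)
```

Note that the sketch replaces the table in place, which is why taking a backup first, as the answer suggests, is a reasonable precaution.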
+1

You can set a new destination table and simply query a count while grouping by all columns:

 SELECT FIELD1, FIELD2, FIELD3, FIELD4
 FROM (
   SELECT COUNT(*), FIELD1, FIELD2, FIELD3, FIELD4
   FROM [<TABLE>]
   GROUP BY FIELD1, FIELD2, FIELD3, FIELD4
 )
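
One thing worth checking before relying on this: grouping by every column collapses rows that are identical in all fields, but it does not collapse rows that share a key and differ elsewhere. A small sqlite3 sketch with a hypothetical two-column table t illustrates the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: key 1 has an exact duplicate, key 2 has two differing rows.
conn.executescript("""
    CREATE TABLE t (field1 INTEGER, field2 TEXT);
    INSERT INTO t VALUES (1, 'a'), (1, 'a'), (2, 'b'), (2, 'changed');
""")

# Same shape as the query above: count in the inner query, grouped by all columns.
deduped = conn.execute(
    "SELECT field1, field2 FROM ("
    "  SELECT COUNT(*) AS n, field1, field2 FROM t GROUP BY field1, field2"
    ") ORDER BY field1, field2"
).fetchall()
print(deduped)
```

Here the exact duplicate of key 1 disappears, while both rows for key 2 survive, so this approach seems to fit only when duplicates are full-row copies rather than updated versions of a record.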
0