Migrating from unpartitioned tables to partitioned tables

In June, the BQ team announced support for date-partitioned tables, but the documentation is missing guidance on how to migrate old unpartitioned tables to the new style.

I am looking for a way to convert some, if not all, of my tables to the new style.

In addition, are there any partitioning options besides the DAY type? Does the BQ Web UI expose this? I was unable to create such a new partitioned table from the BQ Web UI.

+14
google-bigquery
5 answers

From Pavan's answer: Note that this approach will incur the cost of scanning the source table for the query as many times as you query it.


From Pentium10's comments: So, suppose I have several years of data; I need to prepare a different query for each day and run them all. If I have, say, 1000 days of history, do I need to pay 1000 times the full query price of the source table?

As we can see, the main problem here is incurring a full table scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.

So, the question below is: how to partition a table while avoiding a full table scan for each day?

The approach is shown below, step by step.

It is generic enough to extend or apply to any real use case. Meanwhile, I use `bigquery-public-data.noaa_gsod.gsod2017`, and I limit the "exercise" to only 10 days to keep it readable.

Step 1 - Create a pivot table
In this step we a) aggregate each row's content into a record/array and b) put it into the respective "daily" column.

#standardSQL
SELECT
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170101' THEN r END) AS day20170101,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170102' THEN r END) AS day20170102,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170103' THEN r END) AS day20170103,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170104' THEN r END) AS day20170104,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170105' THEN r END) AS day20170105,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170106' THEN r END) AS day20170106,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170107' THEN r END) AS day20170107,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170108' THEN r END) AS day20170108,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170109' THEN r END) AS day20170109,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170110' THEN r END) AS day20170110
FROM (
  SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
  FROM (
    SELECT
      stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
    FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
    GROUP BY stn, d
  )
)
GROUP BY line

Run the above query in the Web UI with pivot_table (or whatever name you like) as the destination.

As you can see, we get a table with 10 columns, one column per day, and the schema of each column is a copy of the schema of the original table.


Step 2 - Process partitions one by one, ONLY scanning the respective column (no full table scan), and insert into the respective partition

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170101) AS r

Run the above query from the Web UI with mytable$20170101 as the destination table.

The next day, you run:

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170102) AS r

with mytable$20170102 as the destination table, and so on.


You should be able to automate / script this step with any client of your choice.
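For instance, here is a minimal sketch of that automation using the google-cloud-bigquery Python client. The project, dataset, and table names are placeholders of my own, not anything from the answer above; it assumes pivot_table sits in a dataset you control.

# Minimal sketch: run one query per day, writing each result into its partition.
# Assumed names: my-project, mydataset, pivot_table, mytable.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
dataset_ref = bigquery.DatasetReference("my-project", "mydataset")

days = [f"201701{d:02d}" for d in range(1, 11)]  # the 10 example days

for day in days:
    # Destination uses the partition decorator, e.g. mytable$20170101.
    dest = bigquery.TableReference(dataset_ref, f"mytable${day}")
    job_config = bigquery.QueryJobConfig(
        destination=dest,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = f"SELECT r.* FROM `my-project.mydataset.pivot_table`, UNNEST(day{day}) AS r"
    client.query(sql, job_config=job_config).result()  # wait for each day's job

Each iteration scans only the one column of pivot_table for that day, which is the whole point of the approach.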

There are many options for how you can use the above approach - it depends on your creativity.

Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of a year is definitely not a problem :o) Whereas there is a limitation on how far back you can go with new partitions - I have heard (but have not had a chance to check yet) that it is currently no more than 90 days back.

Update

Please note: the version above has a little extra logic for packing all the aggregated cells into as few final rows as possible.

ROW_NUMBER() OVER(PARTITION BY d) AS line
and then GROUP BY line
along with ARRAY_CONCAT_AGG(…)
do this.

This works well when the row size in the original table is not that big, so the final combined row size will still be within the row size limit that BigQuery has (which I believe is 10 MB at the moment).

If your source table already has a row size close to this limit, use the adjusted version below

In this version, the grouping is removed so that each row has a value for only one column.

#standardSQL
SELECT
  CASE WHEN d = 'day20170101' THEN r END AS day20170101,
  CASE WHEN d = 'day20170102' THEN r END AS day20170102,
  CASE WHEN d = 'day20170103' THEN r END AS day20170103,
  CASE WHEN d = 'day20170104' THEN r END AS day20170104,
  CASE WHEN d = 'day20170105' THEN r END AS day20170105,
  CASE WHEN d = 'day20170106' THEN r END AS day20170106,
  CASE WHEN d = 'day20170107' THEN r END AS day20170107,
  CASE WHEN d = 'day20170108' THEN r END AS day20170108,
  CASE WHEN d = 'day20170109' THEN r END AS day20170109,
  CASE WHEN d = 'day20170110' THEN r END AS day20170110
FROM (
  SELECT
    stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
  FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
  GROUP BY stn, d
)
WHERE d BETWEEN 'day20170101' AND 'day20170110'

As you can see, the pivot table (sparce_pivot_table) is now quite sparse (the same 21.5 MB, but now 114,089 rows versus 11,584 rows in pivot_table), so the average row size is 190 B versus 1.9 KB in the initial version, which is roughly 10 times smaller, in line with the number of columns in the example.
So, before using this approach, some math is needed to project/estimate what can be done and how!


Still, each cell in the pivot table is a sort of JSON representation of the whole row in the original table, such that it holds not only the values, as the rows in the original table do, but also has the schema in it.


As such, it is quite verbose, so the cell size can be a multiple of the original size [which limits the use of this approach... unless you get even more creative :o)... though there are still plenty of areas to apply it :o)]

+14

Until the new feature is rolled out in BigQuery, there is another (much cheaper) way to partition tables by using Cloud Dataflow. We used this approach instead of running hundreds of SELECT * statements, which would have cost us thousands of dollars.

  1. Create a partitioned table in BigQuery using the normal partition command
  2. Create a Dataflow pipeline and use a BigQuery.IO.Read source to read the table
  3. Use a Partition transform to partition each row
  4. Using a maximum of 200 shards/sinks at a time (any more and you will hit API limits), create one BigQuery.IO.Write sink per day/shard that writes to the corresponding partition using the partition decorator syntax - "$YYYYMMDD"
  5. Repeat N times until all the data is processed.

Here is an example on Github to get you started.
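The linked example is a Java Dataflow pipeline; purely as an illustration of the shape described in steps 2-4, here is a rough Apache Beam (Python SDK) sketch. The table names, the date_col field, and the day list are placeholder assumptions, not the code from GitHub, and it assumes every row's day appears in the list.

# Rough sketch of the described pipeline shape with the Beam Python SDK.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

DAYS = ["20170101", "20170102"]  # keep under ~200 sinks per pipeline run

def day_index(row, num_partitions):
    # Route each row to the shard for its day; assumes a 'date_col' field
    # formatted like '2017-01-01' in the source rows.
    return DAYS.index(row["date_col"].replace("-", ""))

with beam.Pipeline(options=PipelineOptions()) as p:
    rows = p | "Read" >> beam.io.ReadFromBigQuery(
        table="my-project:mydataset.source_table"
    )
    shards = rows | "SplitByDay" >> beam.Partition(day_index, len(DAYS))
    for i, day in enumerate(DAYS):
        # One sink per day/shard, writing via the $YYYYMMDD partition decorator.
        shards[i] | f"Write_{day}" >> beam.io.WriteToBigQuery(
            table=f"my-project:mydataset.mytable${day}",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )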

You still have to pay for the Dataflow pipeline, but it is a fraction of the cost of running multiple SELECT * queries in BigQuery.

+7

Today you can create a partitioned table from a non-partitioned table by querying it and specifying the partition column. You will pay for one full table scan of the original (non-partitioned) table. Note: this is currently in beta.

https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result

To create a partitioned table from a query result, write the results to a new destination table. You can create a partitioned table by querying either a partitioned table or a non-partitioned table. You cannot change an existing standard table to a partitioned table using query results.
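As an illustration of the quoted behaviour, here is a hedged sketch using the google-cloud-bigquery Python client. The project, dataset, table, and date_column names are assumptions; it presumes the source table has a DATE or TIMESTAMP column to partition on.

# Sketch: write query results to a new, day-partitioned destination table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "my-project.mydataset.new_partitioned_table"
    ),
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="date_column",  # assumed DATE/TIMESTAMP column in the source table
    ),
)
# This costs one full scan of the original, non-partitioned table.
client.query(
    "SELECT * FROM `my-project.mydataset.table_to_copy`",
    job_config=job_config,
).result()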

+6

If you have date-sharded tables today, you can use this approach:

https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_dated_tables_into_a_partitioned_table

If you have a single non-partitioned table to be converted to a partitioned table, you can try the approach of running a SELECT * query with allow large results and using the table's partition as the destination (similar to what you would do when restating data for a partition):

https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
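For example, a sketch of restating a single day with the Python client, assuming a hypothetical date_column in the source table and placeholder project/dataset/table names. This is the pattern the cost warning below refers to, since every such query rescans the whole source table.

# Sketch: write one day's rows into its partition; WRITE_TRUNCATE restates it.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

dest = bigquery.TableReference.from_string("my-project.mydataset.mytable$20170101")
job_config = bigquery.QueryJobConfig(
    destination=dest,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    """
    SELECT * FROM `my-project.mydataset.table_to_copy`
    WHERE DATE(date_column) = '2017-01-01'  -- assumed date column
    """,
    job_config=job_config,
).result()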

Note that this approach will charge you the cost of scanning the source table for the query as many times as you query it.

We are working to make this scenario significantly better in the next few months.

+4

For me, the following set of queries worked when applied directly in BigQuery (BigQuery creates a new table):

CREATE TABLE (new?)dataset.new_table PARTITION BY DATE(date_column) AS SELECT * FROM dataset.table_to_copy;

Then, as a next step, I drop the table:

DROP TABLE dataset.table_to_copy;

I got this solution from https://fivetran.com/docs/warehouses/bigquery/partition-table using only step 2

0
