How to scale pivoting in BigQuery?

Let's say I have a table of music video plays, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). Simplified schema: UserGUID String, ArtistGUID String

I need to pivot / transpose values from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, ... Artist8000 Int
where each value is the number of plays of that artist by the respective user.

An approach was suggested in How to transpose rows to columns with large amount of data in BigQuery / SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it does not scale to the numbers in my example.

Can this approach be scaled for my example?

1 answer

I tried this approach with up to 6,000 features, and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.

STEP 1 - Aggregate plays per user / artist

SELECT userGUID AS uid, artistGUID AS aid, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2

STEP 2 - Normalize uid and aid so they are sequential numbers 1, 2, 3, ...
We need this for at least two reasons: a) to make the dynamically created SQL later on as compact as possible, and b) to have more usable / friendly column names.

Combined with the first step, the query is:

SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays
  FROM [mydataset.stats] GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid
  FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid
  FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID

Let's write the output of this query to the table mydataset.aggs.
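The original steps don't spell out how the query is actually executed and its output saved. As one illustration only, here is a minimal sketch using the google-cloud-bigquery Python client; the project ID my-project and the helper run_to_table are placeholders of this write-up, not part of the approach itself.

# Minimal sketch: run the STEP 2 query and write its output to mydataset.aggs
# using the google-cloud-bigquery Python client. "my-project" and run_to_table
# are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

def run_to_table(sql, table_name, dataset="mydataset", project="my-project"):
    """Run a legacy-SQL query and write its result to dataset.table_name."""
    config = bigquery.QueryJobConfig()
    config.use_legacy_sql = True           # all queries in this answer are legacy SQL
    config.allow_large_results = True      # required for large legacy-SQL result sets
    config.destination = bigquery.DatasetReference(project, dataset).table(table_name)
    config.write_disposition = "WRITE_TRUNCATE"    # overwrite the table on re-runs
    client.query(sql, job_config=config).result()  # wait for the job to finish

step2_sql = """
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (SELECT userGUID, artistGUID, COUNT(1) AS plays
      FROM [mydataset.stats] GROUP BY 1, 2) AS s
JOIN (SELECT userGUID, ROW_NUMBER() OVER() AS uid
      FROM [mydataset.stats] GROUP BY 1) AS u ON u.userGUID = s.userGUID
JOIN (SELECT artistGUID, ROW_NUMBER() OVER() AS aid
      FROM [mydataset.stats] GROUP BY 1) AS a ON a.artistGUID = s.artistGUID
"""
run_to_table(step2_sql, "aggs")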

STEP 3 - Use the approach already proposed (in the questions linked above) for N features (artists) at a time.
In my specific example, while experimenting, I found that the basic approach works well for between 2,000 and 3,000 features. To be safe, I decided to use 2,000 features at a time.

The script below dynamically generates a query, which is then run to create the partitioned tables.

SELECT 'SELECT uid,' +
  GROUP_CONCAT_UNQUOTED(
    'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
  ) +
  ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

The above query produces another query, like the one below:

SELECT uid, SUM(IF(aid=1,plays,NULL)) a1, SUM(IF(aid=3,plays,NULL)) a3,
  SUM(IF(aid=2,plays,NULL)) a2, SUM(IF(aid=4,plays,NULL)) a4, . . .
FROM [mydataset.aggs] GROUP EACH BY uid

This should be run, and its result written to mydataset.pivot_1_2000.
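To make the generate-then-run pattern concrete, here is a sketch of how the generated query string could be fetched and executed into mydataset.pivot_1_2000. It reuses the client and run_to_table helper from the STEP 2 sketch; generate_pivot_sql is likewise an illustrative name, not part of the original approach.

# Sketch of the generate-then-run pattern for STEP 3. Reuses `client` and
# run_to_table() from the STEP 2 sketch; generate_pivot_sql is illustrative.
from google.cloud import bigquery

def generate_pivot_sql(lo, hi):
    """Run the generator query for features with lo < aid < hi and return the
    pivot-query text it produces (one row with one string field)."""
    template = (
        "SELECT 'SELECT uid,' + GROUP_CONCAT_UNQUOTED("
        "'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)"
        ") + ' FROM [mydataset.aggs] GROUP EACH BY uid' "
        "FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid "
        "HAVING aid > {lo} and aid < {hi})"
    )
    config = bigquery.QueryJobConfig()
    config.use_legacy_sql = True
    rows = list(client.query(template.format(lo=lo, hi=hi), job_config=config).result())
    return rows[0][0]

# features 1..2000 -> mydataset.pivot_1_2000
run_to_table(generate_pivot_sql(0, 2001), "pivot_1_2000")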

Performing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN), we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features 1 through 2000; mydataset.pivot_2001_4000 has only features 2001 through 4000; and so on.
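Rather than editing the HAVING bounds by hand, the remaining passes can be driven by a small loop; again this is just a sketch built on the illustrative helpers above.

# Repeat STEP 3 for the remaining feature ranges, using the illustrative
# generate_pivot_sql() and run_to_table() helpers from the earlier sketches.
for lo, hi in [(2000, 4001), (4000, 6001)]:
    # HAVING aid > lo AND aid < hi covers features lo+1 .. hi-1
    run_to_table(generate_pivot_sql(lo, hi), "pivot_{}_{}".format(lo + 1, hi - 1))
# -> mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000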

STEP 4 - Merge all the partitioned pivot tables into one final pivot table, with all features represented as columns of a single table.

Same as in the previous steps: first we generate a query, then run it. Initially we will "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then the result with mydataset.pivot_4001_6000.

SELECT 'SELECT x.uid uid,' +
  GROUP_CONCAT_UNQUOTED('a' + STRING(aid)) +
  ' FROM [mydataset.pivot_1_2000] AS x JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid '
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)

The output string of the above should be run, and the result written to mydataset.pivot_1_4000.

Then we repeat STEP 4 as shown below

SELECT 'SELECT x.uid uid,' +
  GROUP_CONCAT_UNQUOTED('a' + STRING(aid)) +
  ' FROM [mydataset.pivot_1_4000] AS x JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid '
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

The result should be written to mydataset.pivot_1_6000.
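Both stitching passes follow the same generate-then-run pattern as STEP 3. Here is a sketch using the same illustrative helpers; generate_stitch_sql is, again, a placeholder of this write-up rather than part of the original approach.

# Sketch of the two STEP 4 stitching passes. Reuses `client` and run_to_table()
# from the earlier sketches; generate_stitch_sql is an illustrative helper.
from google.cloud import bigquery

def generate_stitch_sql(left, right, max_aid):
    """Return the query text that joins two partial pivot tables on uid,
    carrying over the columns a1 .. a(max_aid - 1)."""
    template = (
        "SELECT 'SELECT x.uid uid,' + GROUP_CONCAT_UNQUOTED('a' + STRING(aid)) + "
        "' FROM [mydataset.{left}] AS x JOIN EACH [mydataset.{right}] AS y ON y.uid = x.uid ' "
        "FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid "
        "HAVING aid < {max_aid} ORDER BY aid)"
    )
    config = bigquery.QueryJobConfig()
    config.use_legacy_sql = True
    rows = list(client.query(template.format(left=left, right=right, max_aid=max_aid),
                             job_config=config).result())
    return rows[0][0]

# first stitch 1-2000 with 2001-4000, then the result with 4001-6000
run_to_table(generate_stitch_sql("pivot_1_2000", "pivot_2001_4000", 4001), "pivot_1_4000")
run_to_table(generate_stitch_sql("pivot_1_4000", "pivot_4001_6000", 6001), "pivot_1_6000")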

This final table has the following schema:

 uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 

Notes:
a. I tried this approach only with up to 6,000 features, and it worked as expected.
b. The execution time of the second (main) queries in STEP 3 and STEP 4 ranged from 20 to 60 minutes.
c. IMPORTANT: the billing tier in STEP 3 and STEP 4 ranged from 1 to 90. The good news is that the respective tables are relatively small (30-40 MB), so the billed bytes are small as well. For "before 2016" projects everything is billed as tier 1, but after October 2016 this may become an issue.
For more information, see Timing in high-compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that maintaining a materialized feature matrix is not the best idea.
