I tried this approach with up to 6,000 features, and it worked as expected. I believe it will work with up to 10K features, which is a hard limit on the number of columns in a table
STEP 1 - Aggregate user / artist plays
SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays FROM [mydataset.stats] GROUP BY 1, 2
STEP 2 - Normalize uid and aid so they are sequential numbers 1, 2, 3, ...
We need this for at least two reasons: a) to make the dynamically generated SQL later on as compact as possible, and b) to have more usable / friendly column names
Combined with the first step, it will be:
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays
  FROM [mydataset.stats]
  GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid
  FROM [mydataset.stats]
  GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid
  FROM [mydataset.stats]
  GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID
Write the output to a table - mydataset.aggs
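The resulting mydataset.aggs table then has a simple schema (written in the same notation as the final table's schema below; types inferred from the query above):
uid int, aid int, plays int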
STEP 3 - Use the approach already suggested (in the questions mentioned above) for N features (artists) at a time. In my specific example, while experimenting, I found that the basic approach works well for between 2000 and 3000 features. To be safe, I decided to use 2000 features at a time
The script below is used to dynamically generate a query, which is then run to create partitioned tables.
SELECT 'SELECT uid,' +
  GROUP_CONCAT_UNQUOTED(
    'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
  ) +
  ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The query above generates another query, like the one shown below:
SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3, SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . . FROM [mydataset.aggs] GROUP EACH BY uid
This generated query then needs to be run, with its result written to mydataset.pivot_1_2000
Running STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN), we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on
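For example, adjusting only the HAVING clause, the generator for the second range (its output to be run and written to mydataset.pivot_2001_4000) looks like this:
SELECT 'SELECT uid,' +
  GROUP_CONCAT_UNQUOTED(
    'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
  ) +
  ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)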
STEP 4 - Join all the partitioned pivot tables into a final pivot table, with all features represented as columns in one table
Same as in the previous steps: first we generate the query, then run it. So, initially we "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then that result with mydataset.pivot_4001_6000
SELECT 'SELECT x.uid uid,' +
  GROUP_CONCAT_UNQUOTED('a' + STRING(aid)) +
  ' FROM [mydataset.pivot_1_2000] AS x JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid '
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
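As in STEP 3, the generator above produces a join query, roughly of the form sketched below (column list abbreviated):
SELECT x.uid uid, a1, a2, . . . , a3999, a4000
FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid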
The query generated above should then be run, and its result written to mydataset.pivot_1_4000
Then we repeat STEP 4 as shown below
SELECT 'SELECT x.uid uid,' +
  GROUP_CONCAT_UNQUOTED('a' + STRING(aid)) +
  ' FROM [mydataset.pivot_1_4000] AS x JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid '
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
Write the result to mydataset.pivot_1_6000
The final table has the following schema:
uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
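As a quick illustration (a hypothetical query, not part of the workflow above), the materialized pivot table can then be consumed directly, for example to pull a few feature columns per user:
SELECT uid, a1, a2, a3
FROM [mydataset.pivot_1_6000]
WHERE a1 IS NOT NULL
LIMIT 10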
Note:
a. I tried this approach with up to 6000 features only, and it worked as expected
b. The run time of the second / main queries in steps 3 and 4 ranged from 20 to 60 minutes
c. IMPORTANT: the billing tier in steps 3 and 4 ranged from 1 to 90. The good news is that the size of the respective tables is relatively small (30-40 MB), and so are the billed bytes. For projects created before 2016, everything is billed as tier 1, but after October 2016 this may become an issue.
For more information, see High-Compute queries in the BigQuery pricing documentation.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that maintaining a materialized feature matrix is not the best idea