How to create dummy variable columns for thousands of categories in Google BigQuery?

I have a simple table with two columns: UserID and Category, and each UserID can be repeated with several categories, for example:

    UserID  Category
    ------  --------
    1       A
    1       B
    2       C
    3       A
    3       C
    3       B

I want to "dummify" this table: create an output table that has a unique column for each category consisting of dummy variables (0/1 depending on whether the UserID belongs to this particular category):

    UserID  A  B  C
    ------  -- -- --
    1       1  1  0
    2       0  0  1
    3       1  1  1

My problem is that I have THOUSANDS of categories (not just 3, as in this example), so writing a CASE WHEN statement for each one by hand is not practical.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without writing thousands of CASE WHEN statements?

2) Is this a situation where a UDF would work well? It looks like it might, but I'm not familiar enough with UDFs in BigQuery to solve this problem. Can anyone help?

Thanks.

1 answer

You can use the "technique" below.

First, run query #1. It generates another query (query #2), which you then run to get the result you need. Please still consider Moshi's comments before going "wild" with thousands of categories :o)

Query #1:

    SELECT 'select UserID, ' +
           GROUP_CONCAT_UNQUOTED(
             'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
           ) +
           ' from YourTable group by UserID'
    FROM (
      SELECT category
      FROM YourTable
      GROUP BY category
    )

Its result is the text of query #2, shown below:

    SELECT
      UserID,
      SUM(IF(category = "A", 1, 0)) AS A,
      SUM(IF(category = "B", 1, 0)) AS B,
      SUM(IF(category = "C", 1, 0)) AS C
    FROM YourTable
    GROUP BY UserID

Of course, for three categories you can write this by hand, but for thousands it will definitely make your day!

The result of query #2 will look as you expect:

    UserID  A  B  C
    1       1  1  0
    2       0  0  1
    3       1  1  1
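
If it helps to try the two-step idea locally before running it on BigQuery, here is a minimal sketch in Python. It is not BigQuery code: it uses the standard-library sqlite3 module as a stand-in engine, and the helper name build_dummy_query is hypothetical. It assumes the table is called YourTable with columns UserID and Category, and it writes CASE WHEN instead of IF so the generated SQL runs on SQLite.

```python
import sqlite3


def build_dummy_query(categories, table="YourTable"):
    """Return the text of query #2 for the given distinct category values.

    Hypothetical helper: it plays the role of query #1, but in Python.
    """
    columns = []
    for cat in categories:
        literal = cat.replace("'", "''")  # escape for a SQL string literal
        alias = cat.replace('"', '""')    # escape for a quoted column name
        columns.append(
            f"SUM(CASE WHEN Category = '{literal}' THEN 1 ELSE 0 END) AS \"{alias}\""
        )
    return (
        f"SELECT UserID, {', '.join(columns)} "
        f"FROM {table} GROUP BY UserID ORDER BY UserID"
    )


# Sample data from the question, loaded into an in-memory SQLite table
# (an assumption for local testing; BigQuery itself is not involved).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (UserID INTEGER, Category TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [(1, "A"), (1, "B"), (2, "C"), (3, "A"), (3, "C"), (3, "B")],
)

# Step 1: collect the distinct categories (the inner subquery of query #1).
categories = [
    row[0]
    for row in conn.execute(
        "SELECT Category FROM YourTable GROUP BY Category ORDER BY Category"
    )
]

# Step 2: generate query #2 from those categories and run it.
query = build_dummy_query(categories)
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data this prints one tuple per UserID with the same 0/1 values as the table above. Note that SUM counts rows, so if a (UserID, Category) pair can appear more than once, swapping SUM for MAX keeps the columns strictly 0/1.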