RANK or ROW_NUMBER in BigQuery for a large dataset

I need to add line numbers to a large (approximately one billion rows) dataset in BigQuery. When I try:

SELECT * ROW_NUMBER() OVER (ORDER BY d_arf DESC) plarf FROM [trigram.trigrams8] 

I get "Resources exceeded at query time." Since the analytic / window function should correspond to one node.

How to add line numbers to a large dataset in BigQuery?

+4
source share
1 answer

You did not give me a work request, so I had to create my own, so you need to transfer it to your problem space. Also, I'm not sure why you want to assign a row number to each row in such a huge dataset, but the call is accepted:

 SELECT a.enc, plarf, plarf+COALESCE(INTEGER(sumc), (0)) row_num FROM ( SELECT STRING(year)+STRING(month)+STRING(mother_age)+state enc, ROW_NUMBER() OVER (PARTITION BY year ORDER BY enc) plarf, year FROM [publicdata:samples.natality] ) a LEFT JOIN ( SELECT COUNT(*) c, year+1 year, SUM(c) OVER(ORDER BY year) sumc FROM [publicdata:samples.natality] GROUP BY year ) b ON a.year=b.year 
  • I want to do ROW_NUMBER () OVER (), but I cannot, because there are too many elements.
  • Having OVER (PARTITION) fixes this problem, but now every section starts with 1.
  • But it normal. In another subquery, I will calculate how many elements are in each section.
  • And in the surrounding request it will take the row_number value of each partition and add it to the local-to-the-partition account.
  • Ta da.
+5
source

All Articles