RANK or ROW_NUMBER in BigQuery for a large dataset

Question

RANK or ROW_NUMBER in BigQuery for a large dataset

I need to add line numbers to a large (approximately one billion rows) dataset in BigQuery. When I try:

SELECT * ROW_NUMBER() OVER (ORDER BY d_arf DESC) plarf FROM [trigram.trigrams8]

I get "Resources exceeded at query time." Since the analytic / window function should correspond to one node.

How to add line numbers to a large dataset in BigQuery?

+4

row-number large-data google-bigquery

uner Oct 20 '15 at 10:11

source share

1 answer

Felipe hoffa · Accepted Answer · 2015-10-21T02:29:02+0000

You did not give me a work request, so I had to create my own, so you need to transfer it to your problem space. Also, I'm not sure why you want to assign a row number to each row in such a huge dataset, but the call is accepted:

 SELECT a.enc, plarf, plarf+COALESCE(INTEGER(sumc), (0)) row_num FROM ( SELECT STRING(year)+STRING(month)+STRING(mother_age)+state enc, ROW_NUMBER() OVER (PARTITION BY year ORDER BY enc) plarf, year FROM [publicdata:samples.natality] ) a LEFT JOIN ( SELECT COUNT(*) c, year+1 year, SUM(c) OVER(ORDER BY year) sumc FROM [publicdata:samples.natality] GROUP BY year ) b ON a.year=b.year

I want to do ROW_NUMBER () OVER (), but I cannot, because there are too many elements.
Having OVER (PARTITION) fixes this problem, but now every section starts with 1.
But it normal. In another subquery, I will calculate how many elements are in each section.
And in the surrounding request it will take the row_number value of each partition and add it to the local-to-the-partition account.
Ta da.

RANK or ROW_NUMBER in BigQuery for a large dataset

More articles: