BigQuery COUNT(DISTINCT value) vs COUNT(value)

Question


I think I found a bug in BigQuery. We have a table of banking statistics at starschema.net:clouddb:bank.Banks_token

If I run the following query:

SELECT count(*) as totalrow, count(DISTINCT BankId ) as bankidcnt FROM bank.Banks_token;

I get the following result:

Row  totalrow  bankidcnt
1    9513      9903

My problem is: if the table has only 9513 rows, how can I get 9903 distinct values, which is 390 more than the number of rows in the table?

+11

google-bigquery

Balazs Gunics May 17 '13 at 12:36


2 answers

I used EXACT_COUNT_DISTINCT() to get an exact distinct count. It is cleaner and more general than COUNT(DISTINCT value, n) with n > numRows.

Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions
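Applied to the query from the question, this could look like the sketch below. It assumes legacy SQL (where EXACT_COUNT_DISTINCT is available) and reuses the table and column names from the question; it has not been run against that dataset, so no result is shown.

```sql
-- Legacy SQL sketch: EXACT_COUNT_DISTINCT always returns the exact
-- number of distinct values, regardless of how many there are.
SELECT
  COUNT(*) AS totalrow,
  EXACT_COUNT_DISTINCT(BankId) AS bankidcnt
FROM bank.Banks_token;
```

With an exact count, bankidcnt can never be larger than totalrow.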

+20

smntx May 22, '15 at 22:34


Jeremy Condit · Accepted Answer · 2013-05-19T03:40:06+0000

In BigQuery, COUNT(DISTINCT ...) returns a statistical approximation whenever the result is greater than 1000. That is why the reported distinct count can be higher than the number of rows.

You can provide an optional second argument to set the threshold above which approximation is used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual number of rows is less than 10000). Note, however, that a larger threshold can be costly in terms of performance.
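As a sketch, the query from the question with the threshold argument added might look like this. It assumes legacy SQL and the table from the question; 10000 is simply the value suggested above, chosen because it exceeds the 9513 rows in the table.

```sql
-- Legacy SQL sketch: the second argument to COUNT(DISTINCT ...) raises
-- the threshold below which the count is exact (default 1000).
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId, 10000) AS bankidcnt
FROM bank.Banks_token;
```

For tables with many more distinct values, the threshold has to grow with them, which is where the performance cost mentioned above comes in.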

See the full documentation here: https://developers.google.com/bigquery/docs/query-reference#aggfunctions

UPDATE 2017:

With BigQuery #standardSQL, COUNT(DISTINCT) is always exact. For approximate results, use APPROX_COUNT_DISTINCT(). Why would anyone use approximate results? See the article.
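A minimal sketch of the two options side by side in standard SQL, again using the table and column names from the question (the output aliases exact_bankidcnt and approx_bankidcnt are illustrative names, not from the original):

```sql
#standardSQL
-- COUNT(DISTINCT ...) is exact here; APPROX_COUNT_DISTINCT trades
-- exactness for speed and may differ slightly from the true count.
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId) AS exact_bankidcnt,
  APPROX_COUNT_DISTINCT(BankId) AS approx_bankidcnt
FROM bank.Banks_token;
```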
