BigQuery COUNT(DISTINCT value) vs COUNT(value)

I think I found a bug in BigQuery. We have a table based on banking statistics at starschema.net:clouddb:bank.Banks_token.

If I run the following query:

SELECT count(*) as totalrow, count(DISTINCT BankId ) as bankidcnt FROM bank.Banks_token; 

And I get the following result:

Row  totalrow  bankidcnt
1    9513      9903

My problem is this: if the table has 9513 rows, how can the distinct count be 9903, which is 390 more than the number of rows in the table?

2 answers

In BigQuery, COUNT(DISTINCT ...) returns a statistical approximation whenever the result exceeds 1000.

You can pass an optional second argument to set the threshold above which approximation is used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual number of rows is less than 10000). Note, however, that a larger threshold can be costly in terms of performance.
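As a sketch, the query from the question with the threshold raised would look something like this. It assumes the same table and column names as the question, and that 10000 is comfortably above the number of distinct BankId values:

```sql
-- Legacy SQL: the second argument to COUNT(DISTINCT ...) is the threshold
-- below which the result is exact rather than approximated.
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId, 10000) AS bankidcnt
FROM bank.Banks_token;
```

With the threshold above the true distinct count, bankidcnt should no longer exceed totalrow.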

See the full documentation here: https://developers.google.com/bigquery/docs/query-reference#aggfunctions


UPDATE 2017:

With BigQuery #standardSQL, COUNT(DISTINCT) is always exact. For approximate results, use APPROX_COUNT_DISTINCT(). Why would anyone use approximate results? See the article.
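For comparison, a minimal standard SQL sketch that returns both counts side by side. The table reference assumes the dataset bank is in the default project; the column aliases are only illustrative:

```sql
#standardSQL
-- COUNT(DISTINCT ...) is exact here; APPROX_COUNT_DISTINCT trades
-- accuracy for speed and may differ slightly from the exact value.
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId) AS exact_bankidcnt,
  APPROX_COUNT_DISTINCT(BankId) AS approx_bankidcnt
FROM bank.Banks_token;
```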


I used EXACT_COUNT_DISTINCT() to get an exact distinct count. It is cleaner and more general than COUNT(DISTINCT value, n) with n > numRows, because you do not have to know the row count in advance.
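Applied to the table from the question, that would be roughly the following (legacy SQL, same table and column names as in the question):

```sql
-- Legacy SQL: EXACT_COUNT_DISTINCT needs no threshold argument.
SELECT
  COUNT(*) AS totalrow,
  EXACT_COUNT_DISTINCT(BankId) AS bankidcnt
FROM bank.Banks_token;
```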

Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions
