BigQuery flattens when using a field with the same name as a repeated field

Question


Edited to use a public dataset

I have a table with the following schema, which you can view here: https://bigquery.cloud.google.com/table/realself-main:rs_public.test_count

If I run the following query, I get a different result for cnt1 vs. cnt2.

SELECT COUNT(*) AS cnt1, COUNT(dr_id) as cnt2, FROM (SELECT * FROM rs_public.test_count) AS tc WHERE tc.is_published

If I remove the tc alias from the where clause, I get the same result for both counters:

 SELECT COUNT(*) AS cnt1, COUNT(dr_id) as cnt2, FROM (SELECT * FROM rs_public.test_count) AS tc WHERE is_published

If, however, I rerun the first query but use the is_claimed field in the where clause instead, both counts are the same again.

 SELECT COUNT(*) AS cnt1, COUNT(dr_id) as cnt2, FROM (SELECT * FROM rs_public.test_count) AS tc WHERE tc.is_claimed

I think this is a bug: BigQuery gets confused because is_published is both a top-level field and a leaf field of the repeated cover_photos record. It incorrectly uses the cover_photos.is_published field when deciding whether the results should be flattened, but uses the top-level is_published field when filtering the results in the where clause.

Here's a counterexample that doesn't use select *, which I mention in my comment on Felipe's answer below:

 SELECT COUNT(*) FROM ( SELECT dr_id, cover_photos.is_published FROM [realself-main:rs_public.test_count] )

returns 3.

 SELECT COUNT(*), COUNT(0) FROM ( SELECT dr_id, cover_photos.is_published FROM [realself-main:rs_public.test_count] )

returns 7 and 3! To me, it seems like the only safe option is to never use COUNT(*).


google-bigquery

alan Nov 09 '15 at 7:11


3 answers

Felipe Hoffa · Answer 1 · 2015-11-09T08:05:34+0000

If a query can be interpreted in different ways, BigQuery will do its best to guess what you intended, which sometimes leads to unintuitive results. This is true of every database, since SQL leaves room for these ambiguities.

Solution: eliminate the ambiguity from your queries. Perhaps both results are correct, depending on what you are trying to compute.

(Eliminate the ambiguity by not using * and by making prefixes explicit. You can also state explicitly how you want the table to be flattened.)

I would really like to comment on your specific data and results, but since you did not provide a public sample, I cannot.

Felipe Hoffa · Answer 2 · 2015-11-10T01:08:58+0000

Thanks for sharing the dataset, @alan! Let's see how it looks:

This is an interesting table: it has 3 columns and 3 rows (a small but regular SQL table). The interesting part is that the third column can contain nested records. The first row has nothing (null), the second row has only 1 value, and the third row has 5 different values.

Things get interesting when you start counting columns:

 SELECT COUNT(*) FROM [realself-main:rs_public.test_count]

returns 3.

That makes sense: the dataset has 3 rows.

 SELECT COUNT(dr_id) FROM [realself-main:rs_public.test_count]

returns 3.

That also makes sense: there are 3 dr_id values.

 SELECT COUNT(cover_photos.is_published) FROM [realself-main:rs_public.test_count]

returns 6.

Now it gets more interesting. The result is 6 because there are 6 values of cover_photos.is_published (nulls are not counted).

 SELECT COUNT(cover_photos.is_published), COUNT(dr_id) FROM [realself-main:rs_public.test_count]

returns 6 and 3.

That still makes sense: 6 cover_photos.is_published, 3 dr_id.

 SELECT COUNT(*) FROM ( SELECT cover_photos.is_published, dr_id FROM [realself-main:rs_public.test_count] )

returns 3.

This is also interesting: when we run a subquery, COUNT(*) looks at the number of rows the subquery returns. 3 rows were returned, so that still makes sense.

But then:

 SELECT COUNT(*), COUNT(cover_photos.is_published) FROM ( SELECT cover_photos.is_published, dr_id FROM [realself-main:rs_public.test_count] )

returns 7 and 6.

Seven? Why 7?

Well, BigQuery had to choose a flattening strategy for the subquery. Look at the table I pasted above: see how it has 7 rows? Those are the 7 rows being counted.

Let's look at them explicitly:

 SELECT dr_id, cover_photos.is_published FROM ( SELECT cover_photos.is_published, dr_id FROM [realself-main:rs_public.test_count] )

See? There are 7 rows. When selecting rows with nested records (a nice BigQuery feature), BigQuery sometimes has to flatten the data to process certain queries. The first 2 rows were flattened into exactly 2 rows (one with cover_photos.is_published as null), and the third row was flattened into 5 rows, one for each of its cover_photos.is_published values.

The moral of this story is to be careful when working with nested data: some queries will flatten it in ways the user does not expect, but which make perfect sense to the computer trying to answer them.

Going deeper, by request:
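The three counts discussed above (3 rows, 6 non-null values, 7 flattened rows) can be reproduced with a small Python model of the table. This is only an illustrative sketch, not how BigQuery executes queries: the `flatten` helper is hypothetical, and the pairing of dr_id values to rows and the order of the nested values are assumptions (only the totals come from the results shown in this thread).

```python
# Toy model of the 3-row table. Assumed: which dr_id belongs to which row,
# and the order of the nested is_published values.
TABLE = [
    {"dr_id": 1234, "cover_photos": []},                       # no nested records
    {"dr_id": 4321, "cover_photos": [{"is_published": 1}]},    # 1 nested record
    {"dr_id": 9999, "cover_photos": [{"is_published": v} for v in (1, 1, 0, 0, 0)]},
]


def flatten(rows):
    """Hypothetical helper: emit one output row per nested record.

    A row with no nested records still yields one row, with None for the leaf.
    """
    out = []
    for row in rows:
        photos = row["cover_photos"] or [{"is_published": None}]
        for photo in photos:
            out.append({"dr_id": row["dr_id"], "is_published": photo["is_published"]})
    return out


flat = flatten(TABLE)

top_level_rows = len(TABLE)   # row count when nothing is flattened
flattened_rows = len(flat)    # row count after flattening
non_null_values = sum(1 for r in flat if r["is_published"] is not None)

print(top_level_rows, flattened_rows, non_null_values)  # prints: 3 7 6
```

Which of the first two numbers a given COUNT(*) reports depends on whether the query caused the table to be flattened before counting.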

Look at the difference between these two queries:

 SELECT COUNT(*) FROM ( SELECT * FROM ( SELECT * FROM [realself-main:rs_public.test_count] WHERE is_published ) )

 SELECT COUNT(*) FROM ( SELECT * FROM ( SELECT * FROM [realself-main:rs_public.test_count] ) ) WHERE is_published

Before looking at the results, can you guess what each query will return? No, you cannot. Both queries are ambiguous, so to answer them BigQuery has to make some guesses and optimizations.

The result of the first query is 7, and of the second, 3. Go ahead and try it.

Why the difference? Looking at the results, I can tell that in the second query BigQuery saw that the only column you care about is is_published, so it optimized the query tree to read only that column. The first query is harder for BigQuery to optimize, so it guesses: "Maybe they really want *", which "means I need to flatten the table before passing it to the next level." It flattens the table, so the outer query sees 7 rows.

Is either of these results guaranteed? No, the queries are ambiguous. How do you reduce the ambiguity? Instead of using "SELECT *", tell BigQuery which columns you want, so it doesn't need to guess.

Felipe Hoffa · Answer 3 · 2015-11-10T08:16:04+0000

I am adding a new answer because you keep adding elements to the question, and each of them deserves its own answer.

You say this query surprises you:

 SELECT COUNT(*), COUNT(0) FROM ( SELECT dr_id, cover_photos.is_published FROM [realself-main:rs_public.test_count] )

You are surprised because the results are 7 and 3.

Maybe this will make sense if I try this:

 SELECT COUNT(*), COUNT(0), GROUP_CONCAT(STRING(cover_photos.is_published)), GROUP_CONCAT(STRING(dr_id)), GROUP_CONCAT(IFNULL(STRING(cover_photos.is_published),'null')), GROUP_CONCAT("0") FROM ( SELECT dr_id, cover_photos.is_published FROM [realself-main:rs_public.test_count] )

See? It is the same query, plus 4 different aggregations over the same subquery columns, one of which is nested repeated data and also has a null value in one row.

Query Results:

 7 3 1,1,1,0,0,0 1234,4321,9999 null,1,1,1,0,0,0 0,0,0

7 comes from the full expansion of the nested data into 7 rows, as the 5th column hints. 3 comes from counting "0" three times, as the 6th column shows.

These subtleties come with working with nested repeated data. I advise you not to work with nested repeated data until you are ready to accept that they can occur.
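As a rough illustration, the six result columns can be rebuilt in Python from a toy copy of the table. This is a sketch of the observed results only, not of BigQuery's execution: the pairing of dr_id values to rows and the order of the nested values are assumptions, and treating COUNT(0) as "once per top-level row" is simply the reading that matches the output shown.

```python
# Toy copy of the table as (dr_id, nested is_published values).
# Assumed: which dr_id owns which nested values, and their order.
rows = [(1234, []), (4321, [1]), (9999, [1, 1, 0, 0, 0])]

# Flattened view: one entry per nested value, None where a row has none.
flat_values = [v for _, photos in rows for v in (photos or [None])]

count_star = len(flat_values)   # column 1: counts the flattened rows
count_zero = len(rows)          # column 2: the constant 0, once per top-level row
col3 = ",".join(str(v) for v in flat_values if v is not None)             # nulls skipped
col4 = ",".join(str(dr_id) for dr_id, _ in rows)                          # one per top-level row
col5 = ",".join("null" if v is None else str(v) for v in flat_values)     # nulls made visible
col6 = ",".join("0" for _ in rows)                                        # one "0" per top-level row

print(count_star, count_zero, col3, col4, col5, col6)
```

Under these assumptions the printed line matches the query results quoted above, which shows the two counts are taken over different row sets: the flattened one and the top-level one.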
