Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery

I am writing a background job to automatically process A/B test data in BigQuery, and I am hitting "Resources exceeded during query execution" when executing large GROUP EACH BY statements. I saw from Resources exceeded during query execution that reducing the number of groups can make queries succeed, so I split my data into smaller pieces, but I still hit errors (albeit less often). It would be nice to get a better intuition about what actually causes this error. In particular:

  • Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
  • What is the correct way to approximate my memory usage and the total memory I have available? Am I correct in assuming that each shard tracks about 1/n of the groups and stores the group key and all aggregates for each group, or is there another way I should think about it?
  • How is the number of shards determined? In particular, do I get fewer shards (and fewer resources) if I query a smaller dataset?

The problematic query looks like this (in practice it is used as a subquery, and an outer query further aggregates the results):

SELECT alternative, snapshot_time, SUM(column_1), ... SUM(column_139)
FROM my_table
CROSS JOIN [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time

(Here is an example job that failed: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4)

I understand that this query may be asking for trouble, but in this case the table is only 22 MB, the query produces fewer than a million groups, and it still fails with "resources exceeded". Reducing the number of timestamps processed at once fixes the error (see the sketch below), but I'm worried that I will eventually reach a data scale at which this approach as a whole stops working.
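For reference, one batch of the split currently looks roughly like this; the subquery and its timestamp bounds are placeholders for whatever subset of the 24 timestamps a given run covers:

SELECT alternative, snapshot_time, SUM(column_1), ... SUM(column_139)
FROM my_table
CROSS JOIN (
  -- Process only a subset of the 24 timestamps per run;
  -- the bounds below are placeholder values.
  SELECT snapshot_time
  FROM [table containing 24 unix timestamps]
  WHERE snapshot_time >= 1388534400
    AND snapshot_time < 1391212800
) timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time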

google-bigquery
1 answer

As you may have guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated on. This is a rough heuristic, but in practice it works quite well.

What's interesting about your query is that the GROUP EACH runs over a larger table than the original one because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
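To put numbers on that expansion: the CROSS JOIN against 24 timestamps can emit up to 24 copies of every input row (the WHERE clause then filters some out), so the GROUP EACH effectively runs over up to 24 × 22 MB ≈ 528 MB of intermediate data, while the shard-count heuristic only sees the 22 MB input table.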

To answer your specific questions:

  • Resources exceeded almost always means that a worker ran out of memory. That worker can be either a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results; GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).

  • There is no good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.

  • The number of shards is determined by the total bytes processed in the query. As you noticed, this heuristic does not work well with joins that expand the underlying datasets. That said, active work is underway to be smarter about how we choose the number of shards. To give you an idea of scale, your query was scheduled onto only 20 shards, which is a tiny fraction of what a larger table would get.

As a workaround, you can save the intermediate result of the CROSS JOIN as a table and run the GROUP EACH BY over that temporary table; a sketch follows. This should let BigQuery use the expanded size when picking the number of shards. (If this does not work, let me know, and we may need to tune our shard thresholds.)
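A minimal sketch of that two-step workaround, reusing the tables from the question; the intermediate table name [mydataset.expanded] is a placeholder, and step 1 is run with that table configured as the query's destination (with large results allowed) rather than named in the SQL itself:

-- Step 1: materialize the CROSS JOIN expansion.
-- Run with a destination table configured (called
-- [mydataset.expanded] below; the name is a placeholder).
SELECT alternative, user_id, last_updated_time,
       column_1, ... column_139,
       timestamps.snapshot_time AS snapshot_time
FROM my_table
CROSS JOIN [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time

-- Step 2: aggregate over the materialized table; its size now
-- reflects the expansion, so more shards should be assigned.
SELECT alternative, snapshot_time,
       SUM(column_1), ... SUM(column_139)
FROM [mydataset.expanded]
GROUP EACH BY alternative, user_id, snapshot_time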

