I'm writing a background job to automatically process A/B test data in BigQuery, and I'm hitting "Resources exceeded during query execution" when running large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I've split my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
- Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
- What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming that each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way I should think about it? (A rough worked example of this model follows the list.)
- How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying a smaller dataset?
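To make the second bullet concrete, here is a back-of-envelope estimate under my assumed model, using the figures from the query below (a group key of alternative, user_id, and snapshot_time; 139 SUM aggregates; under a million groups). The ~30-byte key size and the even hash split are guesses on my part:

    per-group state:  ~30-byte key + 139 aggregates x 8 bytes  =~ 1.1 KB
    total state:      ~1,000,000 groups x ~1.1 KB              =~ 1.1 GB
    per-shard state:  ~1.1 GB / n shards (if groups hash evenly)

If that model is right, even a small input table can produce around a gigabyte of aggregation state, which is why I'd like to know what the actual per-shard budget is.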
The problematic query looks like this (in practice, it's used as a subquery, and an outer query aggregates the results; a sketch of that outer query follows below):
SELECT
  alternative,
  snapshot_time,
  SUM(column_1),
  ...
  SUM(column_139)
FROM my_table
CROSS JOIN [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
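For context, the outer query is roughly this shape (a simplified sketch, not the exact production query; the s1 ... s139 aliases exist only to make the rollup readable, and whether the outer aggregates are SUMs is incidental here):

SELECT
  alternative,
  snapshot_time,
  SUM(s1),
  ...
  SUM(s139)
FROM (
  SELECT
    alternative,
    snapshot_time,
    SUM(column_1) AS s1,
    ...
    SUM(column_139) AS s139
  FROM my_table
  CROSS JOIN [table containing 24 unix timestamps] timestamps
  WHERE last_updated_time < timestamps.snapshot_time
  GROUP EACH BY alternative, user_id, snapshot_time
)
GROUP BY alternative, snapshot_time

The inner GROUP EACH BY is the step that fails; the outer rollup is much smaller, since it collapses user_id away.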
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4)
I realize this query may be asking for trouble, but in this case the table is only 22 MB and the query produces under a million groups, and it still fails with "resources exceeded". Reducing the number of timestamps to process at once immediately fixes the error (see the batched sketch below), but I'm worried that I'll eventually hit a data scale where this approach as a whole stops working.
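Concretely, the workaround that avoids the error today is to restrict the timestamps table to a smaller batch per job and run several jobs, something like this (the numeric bounds are placeholders, not my real values):

SELECT alternative, snapshot_time, SUM(column_1), ... SUM(column_139)
FROM my_table
CROSS JOIN (
  SELECT snapshot_time
  FROM [table containing 24 unix timestamps]
  WHERE snapshot_time >= 1388534400  -- batch lower bound (placeholder)
    AND snapshot_time < 1389139200   -- batch upper bound (placeholder)
) timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time

Each batch succeeds on its own, but the number of batches I need seems likely to grow with the data, which is what worries me.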
google-bigquery
Alan Pierce