Update, a year later:
A few important things have happened since this incident:
- Query prices have dropped by 85%.
- GithubArchive now publishes daily and yearly tables, so when designing your queries, always test them first on the smaller datasets.
BigQuery's pricing is based on the amount of data queried. One of its highlights is how easily it scales: from scanning a few gigabytes to terabytes in a few seconds. Linear scaling of pricing is a feature: most (or all?) of the other databases I know either require exponentially more expensive resources or simply cannot process these volumes of data, at least not in a reasonable amount of time.
However, linear scaling also means that a terabyte query is 1,000 times more expensive than a gigabyte query. BigQuery users should be aware of this and plan accordingly. For this purpose, BigQuery offers a "dry run" flag, which shows exactly how much data a query will scan before you actually run it, so you can adjust accordingly.
In this case, WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries over it quickly add up to a terabyte of scanned data, and so on.
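To put numbers on this, here is a back-of-the-envelope sketch. The 105 GB figure is from above; the $5/TB rate is my illustrative assumption, not a quote from this post, so check current pricing before relying on it:

```python
# Back-of-the-envelope cost of repeated full-table scans.
PRICE_PER_TB = 5.00  # USD -- illustrative assumption, not an official rate
TABLE_GB = 105       # size of the table WeiGong was querying

def query_cost(gb_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of one query that scans `gb_scanned` gigabytes.

    `gb_scanned` is what a dry run would report for the query.
    """
    return gb_scanned / 1024 * price_per_tb

# A SELECT * LIMIT 10 still scans the whole table:
print(f"one full scan:  ${query_cost(TABLE_GB):.2f}")
print(f"ten full scans: ${10 * query_cost(TABLE_GB):.2f}")  # ~1 TB scanned
```

The key point the arithmetic makes concrete: LIMIT reduces the rows returned, not the bytes scanned, so repeating a "small" query pays the full-table price every time.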
There are ways to make these same queries consume far less data:
- Instead of SELECT * LIMIT 10, query only the columns you actually need. BigQuery bills based on the columns you query, so unnecessary columns add unnecessary cost. For example, SELECT * ... scans 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] scans only 8.72 GB, making the query more than 10 times cheaper.
- Extract the subset of data you want into a new table and query that instead:
SELECT * FROM [githubarchive:github.timeline] WHERE created_at BETWEEN '2014-01-01' AND '2014-01-02' -> save this into a new table 'timeline_201401'
For example, extracting the January data leaves a new table of only 91.7 MB. Querying this table is a thousand times cheaper than querying the big one!
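Both of these savings can be checked with quick arithmetic, using the sizes quoted above (105 GB for the full table, 8.72 GB for the four columns, 91.7 MB for the extracted January table):

```python
# Since billing is proportional to bytes scanned, the cost ratio
# is just the ratio of table (or column) sizes.
full_table_gb = 105.0    # SELECT * over githubarchive:github.timeline
four_columns_gb = 8.72   # selecting only the four needed columns
january_table_mb = 91.7  # the extracted January table

column_savings = full_table_gb / four_columns_gb
small_table_savings = full_table_gb * 1024 / january_table_mb

print(f"column pruning: ~{column_savings:.0f}x cheaper")      # ~12x
print(f"small table:    ~{small_table_savings:.0f}x cheaper") # ~1000x
```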
Combining these methods, you can go from a $4,000 bill to a $4 bill for the same amount of quick and insightful results.
(I'm working with the owner of the GitHub Archive to publish monthly tables instead of a single monolithic one, to make this even easier.)