Google BigQuery pricing

I am a PhD student from Singapore Management University, currently working at Carnegie Mellon University on a research project that needs historical events from the Github Archive ( http://www.githubarchive.org/ ). I noticed that Google BigQuery hosts the Github Archive data, so I ran a program to crawl the data using the Google BigQuery service.

I just found that the billing information shown in the Google console is not updated in real time ... After running the program for several hours, the fee shown was only a bit over $4, so I thought the price was reasonable and kept the program running. After 1~2 days, when I checked the bill again on September 13, 2013, it had become $1388 ... I immediately stopped using the Google BigQuery service. And when I checked the bill again just now, it turned out that I need to pay $4179 ...

It is my fault that I did not realize I would need to pay such a large amount of money for running queries and retrieving data from Google BigQuery.

This project is for research purposes only, not for commercial purposes. I would like to know whether the charges can be waived. I really need your help, Google BigQuery team.

Thank you very much and best regards, Lisa

google-bigquery

1 answer

Update, one year later:

Take note of some important developments that have happened since this situation:

  • Query prices have dropped by 85%.
  • GithubArchive now publishes daily and yearly tables, so always design and test your queries against the smaller datasets first.

BigQuery's pricing is based on the amount of data queried. One of its highlights is how easily it scales: it goes from scanning a few gigabytes to scanning terabytes in seconds.

This linear scaling of the pricing is a feature: most (or all?) of the other databases I know of would require exponentially more expensive resources, or simply could not handle these volumes of data - at least not in a reasonable amount of time.

However, linear scaling also means that a terabyte query is 1000 times more expensive than a gigabyte query. BigQuery users need to be aware of this and plan accordingly. For this purpose BigQuery offers the "dry run" flag, which lets you see exactly how much data a query will scan before running it, and adjust accordingly.
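For example, here is a minimal sketch of a dry run using the google-cloud-bigquery Python client (a library that postdates this answer; the bq command line tool's --dry_run flag does the same thing). It assumes credentials and a default project are already configured, and the query itself is only an illustration:

 from google.cloud import bigquery

 client = bigquery.Client()  # assumes credentials and a default project are set up

 sql = "SELECT repository_url FROM [githubarchive:github.timeline]"

 config = bigquery.QueryJobConfig(
     dry_run=True,            # validate and estimate only, do not execute
     use_query_cache=False,   # report the real scan size, not a cached result
     use_legacy_sql=True,     # the bracketed table name above is legacy SQL
 )

 job = client.query(sql, job_config=config)
 print("This query would scan {:.2f} GB".format(job.total_bytes_processed / 1e9))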

In this case, Wei Gong queried a 105 GB table. Ten SELECT * LIMIT 10 queries over it already add up to more than a terabyte of scanned data, and so on.
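To put rough numbers on it, here is a back-of-the-envelope sketch. The $5 per terabyte on-demand rate used below is the price after the 85% drop mentioned above and is an assumption of this sketch, not a figure from the original bill:

 # Query cost grows linearly with the number of bytes scanned.
 PRICE_PER_TB_USD = 5.00  # assumed on-demand rate after the price drop

 def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
     """Approximate cost in USD of scanning the given number of bytes."""
     return bytes_scanned / 1e12 * price_per_tb

 full_scan = 105e9                  # one full scan of the 105 GB timeline table
 print(query_cost(full_scan))       # roughly $0.53 per full scan
 print(query_cost(10 * full_scan))  # ten such scans: over a terabyte, roughly $5.25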

There are ways to make these same queries consume much less data:

  • Instead of querying SELECT * LIMIT 10, select only the columns you are actually looking for. BigQuery billing is based on the columns you query, so pulling in unnecessary columns adds unnecessary cost.

For example, SELECT * ... queries 105 GB, while

 SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] 

queries only 8.72 GB, which makes it more than 10 times cheaper.

  • Instead of "SELECT *" use tabledata.list when you want to load the entire table. It's free.

  • The Github archive table contains data for all time. If you only want to look at one month, extract that month into its own table.

For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying that table is a thousand times cheaper than querying the big one!

 SELECT *
 FROM [githubarchive:github.timeline]
 WHERE created_at BETWEEN '2014-01-01' and '2014-01-02'
 -> save this into a new table 'timeline_201401'
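One way to save that result, sketched with the Python client (again, the library postdates this answer, and the project and dataset in the destination table id are hypothetical):

 from google.cloud import bigquery

 client = bigquery.Client()

 sql = ("SELECT * FROM [githubarchive:github.timeline] "
        "WHERE created_at BETWEEN '2014-01-01' and '2014-01-02'")

 # Hypothetical destination: replace with your own project and dataset.
 dest = bigquery.TableReference.from_string("my-project.my_dataset.timeline_201401")

 config = bigquery.QueryJobConfig(
     destination=dest,
     use_legacy_sql=True,        # the bracketed table name is legacy SQL
     allow_large_results=True,   # needed when writing large legacy-SQL results to a table
 )

 client.query(sql, job_config=config).result()  # wait until the new table is written
 # From here on, query the small timeline_201401 table instead of the 105 GB one.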

Combining these methods, you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.

(I am working with the owner of the Github Archive to have them store monthly data instead of a single monolithic table, to make this even easier.)
