Insert a large amount of data into BigQuery through the bigquery-python library

I have large CSV and Excel files. I read them and dynamically generate a CREATE TABLE script based on the fields and types they contain, and then insert the data into the created table.

I read about this and realized that for large amounts of data I have to send it with jobs.insert() instead of tabledata.insertAll().

This is the call I make (it works for small files, but not for large ones).

 result = client.push_rows(datasetname,table_name,insertObject) # insertObject is a list of dictionaries 
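For reference, the call is set up roughly like this (a minimal sketch assuming the get_client()/push_rows() interface shown in the BigQuery-Python README; the project ID, service account, key file, and dataset/table names are placeholders, and parameter names may differ between library versions):

    # Sketch only: the credentials and names below are placeholders.
    from bigquery import get_client

    client = get_client('my-project-id',
                        service_account='my-sa@my-project-id.iam.gserviceaccount.com',
                        private_key_file='key.pem',
                        readonly=False)

    datasetname = 'my_dataset'
    table_name = 'my_table'
    insertObject = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]  # one dict per row

    result = client.push_rows(datasetname, table_name, insertObject)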

When I use push_rows with a large file, it gives this error on Windows:

 [Errno 10054] An existing connection was forcibly closed by the remote host 

and this one on Ubuntu:

 [Errno 32] Broken pipe 

So, when I went through the BigQuery-Python source, I found that it uses table_data.insertAll().

How can I do this using this library? I know that I could stage the data through Google Cloud Storage, but I need a direct upload method with this library.

1 answer

When handling large files, don't use streaming; use batch loading. Streaming will easily handle up to 100,000 rows per second, which is very good for streaming, but not for loading large files.

The sample code linked here does the right thing (batch load instead of streaming), so what we are seeing is a different problem: that sample code tries to load all of this data straight into BigQuery, but the upload through POST fails. gsutil has a more robust upload algorithm than a plain POST.

Solution: instead of uploading big chunks of data through POST, stage them in Google Cloud Storage first, and then tell BigQuery to read the files from GCS.
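
A minimal sketch of that two-step flow, using the standard google-cloud-storage and google-cloud-bigquery client libraries rather than BigQuery-Python (the bucket, dataset, and table names are placeholders, and CSV schema autodetection is assumed):

    # Sketch: stage the CSV in Google Cloud Storage, then run a BigQuery
    # batch load job that reads it from GCS. Names below are placeholders.
    from google.cloud import bigquery, storage

    # 1. Upload the local file to GCS (the client uses resumable uploads
    #    for large files, which is far more reliable than a single POST).
    storage_client = storage.Client()
    bucket = storage_client.bucket('my-staging-bucket')
    blob = bucket.blob('uploads/data.csv')
    blob.upload_from_filename('data.csv')

    # 2. Tell BigQuery to load the staged file as a batch load job.
    bq_client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # or pass an explicit schema instead
    )
    load_job = bq_client.load_table_from_uri(
        'gs://my-staging-bucket/uploads/data.csv',
        'my_dataset.my_table',
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes

The staging step can also be done from the command line with gsutil cp data.csv gs://my-staging-bucket/uploads/, which is the more robust uploader mentioned above.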

See also BigQuery script crashing for large file
