Google Cloud Dataflow ETL (Datastore → Transform → BigQuery)

We have an application running on Google App Engine that uses Datastore as its internal storage. Currently the application has mainly "OLTP" functionality and some rudimentary reporting. While implementing the reports we found that processing a large amount of data (millions of objects) is very difficult with Datastore and GQL. To enhance our application with proper reports and Business Intelligence functions, we believe it is better to set up an ETL process to move the data from Datastore to BigQuery.

We initially thought about implementing the ETL process as a cron task in the App Engine application, but it looks like Dataflow can also be used for this. We have the following requirements for the process:

  • Ability to push all existing data to BigQuery using the non-streaming (batch) BigQuery API.
  • Once the above is done, push any new data to BigQuery using the streaming API whenever it is created or updated in Datastore (a minimal sketch of the streaming API follows this list).
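For reference, this is roughly what the streaming API usage could look like with the BigQuery Python client; the project, dataset, table, and row contents below are placeholders, not part of our actual application:

```python
# Minimal sketch of a BigQuery streaming insert using the Python client.
# Project, dataset, table, and row contents are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

rows = [{"id": 42, "amount": 19.99}]  # one row per new/updated Datastore entity
errors = client.insert_rows_json("my-project.analytics.orders", rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```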

My questions

  • Is Cloud Dataflow the right choice for implementing this pipeline?
  • Will it be able to handle the existing data? Some of the kinds contain millions of objects.
  • What would be the right approach to implement it? We are considering two approaches. The first approach is to go through Pub/Sub: for the existing data, create a cron job that pushes all of it to Pub/Sub, and for any new updates, push the data to Pub/Sub at the same time it is written to Datastore; the Dataflow pipeline then picks it up from Pub/Sub and pushes it to BigQuery (a minimal sketch of that publish step follows this list). The second approach is to create a batch pipeline in Dataflow that queries Datastore and pushes any new data to BigQuery.
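For illustration, a minimal sketch of the publish step in the first approach, using the Pub/Sub Python client; the project, topic name, and entity-to-dict conversion are hypothetical:

```python
# Sketch of publishing a Datastore change to Pub/Sub (approach one).
# Project id, topic name, and the entity dict are hypothetical placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "datastore-changes")  # hypothetical

def publish_entity_change(entity_dict):
    """Publish a Datastore entity (already converted to a dict) to Pub/Sub.

    Called right after the entity is written to Datastore, either from the
    OLTP code path or from the cron job that replays the existing data.
    """
    data = json.dumps(entity_dict).encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    return future.result()  # blocks until the message id is returned
```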

The question is, are these two approaches feasible? Which one is better cost-wise? Is there any other way that is better than the two above?

Thanks,

rizTaak

+7
google-app-engine google-cloud-datastore google-bigquery google-cloud-dataflow
2 answers

For this purpose, you can absolutely use Dataflow. In fact, the scalability of Dataflow should make the process fast and relatively simple.

Both of your approaches should work. I would lean towards the second option: use a batch pipeline to move the existing data, and then a streaming pipeline to handle new data via Cloud Pub/Sub. Besides moving the data, Dataflow also lets you do arbitrary analysis/manipulation of the data itself.
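If it helps, here is a rough sketch of what that batch pipeline could look like with the Apache Beam Python SDK on the Dataflow runner. The project, kind, table, schema, bucket, and the entity_to_row conversion are assumptions you would need to adapt to your own data model:

```python
# Rough sketch of a batch Dataflow (Apache Beam) pipeline that reads a
# Datastore kind and writes it to BigQuery. Project, kind, dataset/table,
# schema, and bucket are hypothetical; entity_to_row must be adapted.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

PROJECT = "my-project"                    # hypothetical
KIND = "Order"                            # hypothetical Datastore kind
TABLE = "my-project:analytics.orders"     # hypothetical BigQuery table


def entity_to_row(entity):
    # Flatten a Datastore entity into a dict matching the BigQuery schema.
    props = entity.properties
    return {"id": entity.key.path_elements[-1], "amount": props.get("amount")}


def run():
    options = PipelineOptions(project=PROJECT, runner="DataflowRunner",
                              temp_location="gs://my-bucket/tmp")  # hypothetical bucket
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromDatastore" >> ReadFromDatastore(Query(kind=KIND, project=PROJECT))
         | "ToTableRows" >> beam.Map(entity_to_row)
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               TABLE,
               schema="id:INTEGER,amount:FLOAT",
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))


if __name__ == "__main__":
    run()
```

For the streaming part, a very similar pipeline could read from Pub/Sub with beam.io.ReadFromPubSub and reuse the same WriteToBigQuery step, which uses streaming inserts when the pipeline runs in streaming mode.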

However, BigQuery and Datastore can also be connected directly. See, for example, the "Loading data from Cloud Datastore" section of the BigQuery documentation.
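As a sketch of that direct route, assuming the Datastore data has already been exported to Cloud Storage with the managed export/backup mechanism (the bucket path, dataset, and table below are placeholders):

```python
# Sketch of loading a Datastore export from Cloud Storage directly into
# BigQuery, without Dataflow. Bucket path, dataset, and table are
# hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP)

# The URI must point at the export's metadata file produced by the export.
uri = "gs://my-bucket/datastore-export/all_namespaces_kind_Order.export_metadata"

load_job = client.load_table_from_uri(
    uri, "my-project.analytics.orders", job_config=job_config)
load_job.result()  # wait for the load job to finish
```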

+7

Another option is to use third-party solutions to load data into Google BigQuery. There are many of them; most are paid, but there is a free one with a limited load frequency. In this case you do not need to write any code.

+1
