We have an application running on Google App Engine that uses Datastore as its internal storage. Currently, the application is mostly "OLTP"-style with some rudimentary reporting. While implementing the reports, we found that processing large amounts of data (millions of entities) is very difficult with Datastore and GQL. To extend the application with proper reporting and Business Intelligence features, we think it is better to set up an ETL process that moves data from Datastore to BigQuery.
Initially we thought about implementing the ETL as an App Engine cron job, but it looks like Dataflow can also be used for this purpose. Our requirements for the process are:
- Push all existing data into BigQuery using the non-streaming (batch load) BigQuery API.
- After that, push any new data to BigQuery using the streaming API whenever it is created or updated in Datastore (see the sketch below).
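For the streaming part, this is roughly what we have in mind: a minimal sketch, assuming the App Engine code path that writes an entity can also call the BigQuery streaming API. The project, dataset, table and entity fields here are hypothetical placeholders, not our real schema.

```python
# Sketch: stream a just-written Datastore entity to BigQuery.
# Project, dataset/table and field names are assumptions for illustration.
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")   # assumed project id
TABLE_ID = "my-project.analytics.orders"     # assumed dataset.table

def stream_entity_to_bigquery(entity_dict):
    """Push one created/updated entity to BigQuery via the streaming insert API."""
    errors = bq.insert_rows_json(TABLE_ID, [entity_dict])
    if errors:
        # Real code would retry or log; streaming inserts return per-row errors.
        raise RuntimeError("BigQuery streaming insert failed: %s" % errors)

# Example call right after a Datastore put():
# stream_entity_to_bigquery({"order_id": "123", "amount": 42.0,
#                            "updated": "2015-05-01T00:00:00Z"})
```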
My questions:
- Is Cloud Dataflow the right choice for implementing this pipeline?
- Will it be able to push the existing data? Some of the kinds contain millions of entities.
- What would be the right approach to implement it? We are considering two approaches. The first is to go through Pub/Sub: for the existing data, a cron job would push everything to Pub/Sub, and for new updates, the data would be published to Pub/Sub at the same time it is written to Datastore; a Dataflow pipeline would then pick it up from Pub/Sub and push it to BigQuery. The second approach is to run a batch pipeline in Dataflow that queries Datastore and pushes the data to BigQuery (a sketch of this approach follows below).
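For the second approach, this is the kind of batch pipeline we imagine: a sketch using Apache Beam's Datastore and BigQuery connectors, assuming a single kind and a flat schema. The project id, kind, table name, staging bucket and property mapping are all assumptions for illustration.

```python
# Sketch of approach 2: a batch Dataflow (Apache Beam) pipeline that reads all
# entities of one kind from Datastore and loads them into BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

PROJECT = "my-project"                   # assumed project id
KIND = "Order"                           # assumed Datastore kind
TABLE = "my-project:analytics.orders"    # assumed BigQuery table

def entity_to_row(entity):
    # Flatten the Datastore entity into a dict matching the BigQuery schema.
    props = dict(entity.properties)
    return {
        "key": str(entity.key.path_elements[-1]),  # entity id or name
        "amount": props.get("amount"),
        "updated": props.get("updated"),
    }

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project=PROJECT,
        region="us-central1",
        temp_location="gs://my-bucket/tmp")        # assumed staging bucket
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromDatastore" >> ReadFromDatastore(Query(kind=KIND, project=PROJECT))
         | "EntityToRow" >> beam.Map(entity_to_row)
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               TABLE,
               schema="key:STRING,amount:FLOAT,updated:TIMESTAMP",
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

if __name__ == "__main__":
    run()
```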
So the question is: are these two approaches feasible? Which one is better cost-wise? Is there any other way that is better than the two above?
Thanks,
rizTaak