Writing Dataflow output to a partitioned destination

We have a single source of streaming events, thousands of events per second, and every event is tagged with an identifier indicating which of our tens of thousands of customers it belongs to. We want to use this event source to populate our data store (in streaming mode), but since the event source is not permanent we would also like to archive the raw data to GCS so that we can replay it through our pipeline if we make changes. Because of retention requirements, any raw data we store must be sharded by customer so that we can easily delete it per customer.

What would be the easiest way to accomplish this in Dataflow? At the moment we are writing a Dataflow job with a custom sink that writes per-customer files to GCS / BigQuery. Is this a reasonable approach?

google-cloud-storage google-cloud-dataflow
1 answer

To control the output file name and path, see the TextIO documentation. You supply the file name / path, etc. to the sink.
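
A minimal sketch of pointing TextIO at a GCS path (Apache Beam Java SDK; the older Dataflow SDK 1.x exposed the same idea via TextIO.Write). The bucket name, path, and customer id below are placeholders for illustration, not anything from the question:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class TextIoNamingSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the real event stream; "my-bucket" and "customer-123" are hypothetical.
    PCollection<String> events = p.apply(Create.of("event-1", "event-2"));

    // The file name and path are given to the sink: output lands at
    // gs://my-bucket/raw/customer-123/events-<shard>.txt
    events.apply(TextIO.write()
        .to("gs://my-bucket/raw/customer-123/events")
        .withSuffix(".txt"));

    p.run();
  }
}
```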

If you need multiple output files, you can use the Partition transform to split a single source PCollection into multiple PCollections.
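
A hedged sketch of Partition: it splits one PCollection into a fixed number of PCollections, each of which can then get its own TextIO sink. The record format ("customerId,payload"), the hash bucketing, and the bucket/path names are assumptions made for illustration, not part of the original answer:

```java
import java.util.List;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionSketch {

  // Split events into numBuckets PCollections, bucketed by a hash of the customer id.
  static PCollectionList<String> splitByCustomer(PCollection<String> events, int numBuckets) {
    return events.apply(Partition.of(numBuckets,
        (Partition.PartitionFn<String>) (event, numPartitions) -> {
          String customerId = event.split(",", 2)[0];
          return Math.floorMod(customerId.hashCode(), numPartitions);
        }));
  }

  // Give each bucket its own file prefix so its raw data can be deleted independently.
  static void writeBuckets(PCollectionList<String> buckets) {
    List<PCollection<String>> parts = buckets.getAll();
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i).apply("WriteBucket" + i,
          TextIO.write()
              .to("gs://my-bucket/raw/bucket-" + i + "/events")
              .withSuffix(".txt"));
    }
  }
}
```

Note that Partition needs the number of partitions to be known when the pipeline is constructed, so with tens of thousands of customers you would typically hash them into a fixed number of buckets (as above) rather than create one partition per customer.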

