Reading files from PCollection of GCS files in Pipeline?

I have a stream pipeline connected to pub / sub that publishes GCS file names. From there, I want to read every file and analyze the events in each line (events are what I ultimately want to process).

Can i use TextIO? Can you use it in the stream pipeline when the file name is determined at runtime (as opposed to using TextIO as the source, and the name of the file (s) is known at build time). If not, I'm going to do something like the following:

Get the topic from pub / sub ParDo read every file and get lines Process lines of the file ...

Can I use FileBasedReader or something similar in this case to read files? The files are not too large, so I will not need to parallelize the reading of one file, but I will need to read many files.

+6
source share
1 answer

You can use the TextIO.readAll() transform that was recently added to Beam at # 3443 . For instance:

 PCollection<String> filenames = p.apply(PubsubIO.readStrings()...); PCollection<String> lines = filenames.apply(TextIO.readAll()); 

This will read all the lines in every file coming in through pubsub.

+3
source

All Articles