We have a pipeline that reads data from BigQuery and processes historical data for various calendar years. It fails with OutOfMemoryError errors even though the input is small (~500 MB).
On startup it reads about 10,000 elements/sec from BigQuery; after a short time it slows down to a few hundred elements/sec and then hangs completely.
Watching "Elements added" on the next processing step (BQImportAndCompute), the value increases and then decreases again. It looks as if some already loaded data is dropped and then loaded again.
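For context, the relevant part of the pipeline looks roughly like this; the class name, table name, field name and output type are simplified placeholders, and we are on the Dataflow Java SDK 1.x, as in the stack traces below:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class HistoricalYearPipeline {
      public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
            .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        Pipeline p = Pipeline.create(options);

        p.apply(BigQueryIO.Read.named("ReadHistoricalYear")
                .from("my-project:my_dataset.events_2014"))   // placeholder table for one calendar year
         .apply(ParDo.named("BQImportAndCompute").of(new DoFn<TableRow, String>() {
             @Override
             public void processElement(ProcessContext c) {
               TableRow row = c.element();
               // per-row computation happens here
               c.output((String) row.get("some_field"));      // placeholder field
             }
         }));

        p.run();
      }
    }

The runner is selected via the usual flags (--runner=DirectPipelineRunner locally, --runner=DataflowPipelineRunner in the cloud).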
The Stackdriver Logging console shows errors with various stack traces containing java.lang.OutOfMemoryError, for example:
Error reporting the work item progress update to the Dataflow service:
"java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)
I would suspect a problem with the topology of the pipeline, but the same pipeline
- works fine locally with DirectPipelineRunner
- works fine in the cloud with DataflowPipelineRunner on a larger dataset (5 GB, for another calendar year)
I suspect the problem lies in how Dataflow parallelizes and distributes the work in the pipeline. Is there any way to inspect or influence that?
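The only knobs I can think of are the worker pool options, though it is not clear to me whether they address the root cause. A sketch of what I mean, reusing the options object from the pipeline sketch above (assuming DataflowPipelineWorkerPoolOptions is the right place for this):

    import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions;

    // Equivalent command-line flags: --numWorkers=10 --workerMachineType=n1-highmem-4 --diskSizeGb=250
    DataflowPipelineWorkerPoolOptions workerOptions =
        options.as(DataflowPipelineWorkerPoolOptions.class);
    workerOptions.setNumWorkers(10);                     // start with more workers for the read stage
    workerOptions.setWorkerMachineType("n1-highmem-4");  // more heap per worker
    workerOptions.setDiskSizeGb(250);                    // more local disk per worker

Is this the intended way to influence the distribution of work, or is there something at the pipeline/step level I should be looking at instead?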