We have a pipeline that reads data from BigQuery and processes historical data for various calendar years. It fails with OutOfMemoryError errors even though the input is small (~500 MB).
On startup it reads about 10,000 elements/sec from BigQuery; after a short time it slows down to a few hundred elements/sec and then hangs completely.
Watching "Elements added" on the next processing step (BQImportAndCompute), the value increases and then decreases again. It looks as if some already loaded data is dropped and then loaded again.
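For context, the relevant part of the pipeline looks roughly like this; the class name, table name, field name and output type are simplified placeholders, and we are on the Dataflow Java SDK 1.x, as in the stack traces below:

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class HistoricalYearPipeline {
      public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
            .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        Pipeline p = Pipeline.create(options);

        p.apply(BigQueryIO.Read.named("ReadHistoricalYear")
                .from("my-project:my_dataset.events_2014"))   // placeholder table for one calendar year
         .apply(ParDo.named("BQImportAndCompute").of(new DoFn<TableRow, String>() {
             @Override
             public void processElement(ProcessContext c) {
               TableRow row = c.element();
               // per-row computation happens here
               c.output((String) row.get("some_field"));      // placeholder field
             }
         }));

        p.run();
      }
    }

The runner is selected via the usual flags (--runner=DirectPipelineRunner locally, --runner=DataflowPipelineRunner in the cloud).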
The Stackdriver Logging console shows errors with various stack traces containing java.lang.OutOfMemoryError, for example:
Error reporting the work item progress update to the Dataflow service:
"java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)
I would suspect a problem with the topology of the pipeline, but the same pipeline
- works fine locally with DirectPipelineRunner
- works fine in the cloud with DataflowPipelineRunner on a larger dataset (5 GB, for another calendar year)
I suspect the problem lies in how Dataflow parallelizes and distributes the work in the pipeline. Is there any way to inspect or influence that?
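The only knobs I can think of are the worker pool options, though it is not clear to me whether they address the root cause. A sketch of what I mean, reusing the options object from the pipeline sketch above (assuming DataflowPipelineWorkerPoolOptions is the right place for this):

    import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions;

    // Equivalent command-line flags: --numWorkers=10 --workerMachineType=n1-highmem-4 --diskSizeGb=250
    DataflowPipelineWorkerPoolOptions workerOptions =
        options.as(DataflowPipelineWorkerPoolOptions.class);
    workerOptions.setNumWorkers(10);                     // start with more workers for the read stage
    workerOptions.setWorkerMachineType("n1-highmem-4");  // more heap per worker
    workerOptions.setDiskSizeGb(250);                    // more local disk per worker

Is this the intended way to influence the distribution of work, or is there something at the pipeline/step level I should be looking at instead?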