Dataflow GZIP TextIO ZipException: too many length or distance symbols

When using TextIO.Read over a large collection of compressed text files (1,000+ files ranging in size from 100 MB to 1.5 GB), we sometimes get the following error:

java.util.zip.ZipException: too many length or distance symbols
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
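For context, the read looks roughly like the sketch below. This is a minimal example assuming the Dataflow SDK 1.x API (matching the package names in the stack trace); the bucket path and class name are placeholders:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

    public class ReadCompressedText {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // CompressionType.AUTO is the default and infers GZIP from the .gz
        // extension; it is spelled out here for clarity.
        p.apply(TextIO.Read
            .from("gs://my-bucket/input/*.gz")
            .withCompressionType(TextIO.CompressionType.GZIP));

        p.run();
      }
    }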

Searching the Internet for this ZipException only turns up the following answer:

Zip file errors often occur when a hot deployer tries to deploy an application before it is fully copied to the deployment directory. This is quite common if it takes a few seconds to copy a file. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deployment directory.

Has anyone else encountered a similar exception? And is there anything we can do to fix this problem?

+5
2 answers

Looking at the code that produces this error message, it appears to be an issue with the zlib library (which the JDK uses) not supporting the particular gzip files you have.

This matches a known zlib issue: codes for reserved symbols are rejected, even if they are never used. (The DEFLATE format reserves the literal/length symbols 286 and 287 and the distance symbols 30 and 31; an encoder that declares code lengths for them produces a stream that zlib's inflater rejects with exactly this error, even when the reserved symbols never actually appear in the data.)

Unfortunately, there is probably little we can do here other than suggest producing these compressed files with a different utility.

If you can create a small example gzip file that we could use to reproduce the problem, we could look into whether there is some way to work around it, but I would not count on that succeeding.
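In the meantime, one workaround is to validate each file with the JDK's own zlib-backed inflater before running the job, and recompress any file that fails using a different utility. A minimal, self-contained sketch (the class name and usage are illustrative):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class GzipChecker {
      /** Returns true if the file inflates cleanly with the JDK's zlib-based inflater. */
      static boolean inflatesCleanly(String path) {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
          // Discard the output; we only care whether decompression succeeds.
          while (in.read(buf) != -1) {}
          return true;
        } catch (IOException e) {  // ZipException is a subclass of IOException
          System.err.println(path + ": " + e.getMessage());
          return false;
        }
      }

      public static void main(String[] args) {
        for (String path : args) {
          System.out.println(path + (inflatesCleanly(path) ? ": OK" : ": FAILED"));
        }
      }
    }

Running this over the inputs should flag exactly the files the Dataflow workers will choke on, since both go through the same zlib-backed Inflater.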

+5

This question may be a little old, but it was the first result in my Google search yesterday for this error:

HIVE_CURSOR_ERROR: too many length or distance symbols

Following the hints here, I came to realize that I had corrupted the gzip structure of the files I was trying to process. I had two processes writing gzip'd data to the same output file, and the output files were damaged as a result. I fixed the problem by changing the writing processes to each write to a unique file. I hope this answer saves someone else some time.
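For anyone with the same problem: two processes appending to one .gz file interleave their bytes and produce a stream that no inflater can parse. A minimal sketch of the fix, giving each writer its own file (the naming scheme here is illustrative):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;
    import java.util.zip.GZIPOutputStream;

    public class UniqueGzipWriter {
      public static void main(String[] args) throws IOException {
        // A unique per-writer file name guarantees two processes never
        // interleave their bytes within one gzip stream.
        String path = "output-" + UUID.randomUUID() + ".gz";
        try (Writer out = new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream(path)), StandardCharsets.UTF_8)) {
          out.write("one record per line\n");
        }
      }
    }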

0
