Efficient zip file reading in Java

I am working on a project that processes a very large amount of data. I have many (thousands of) zip files, each of which contains ONE simple text file with thousands of lines (about 80 thousand lines). I am currently doing the following:

    for (File zipFile : dir.listFiles()) {
        ZipFile zf = new ZipFile(zipFile);
        ZipEntry ze = (ZipEntry) zf.entries().nextElement();
        BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
        ...

This lets me read each file line by line, but it is definitely too slow. Given the large number of files and lines to read, I need a more efficient approach.

I was looking for a different approach, but I could not find anything. I think I should use the java.nio APIs designed for intensive I/O, but I don't know how to use them with zip files.

Any help would be really appreciated.

Thanks,

Marco

+4
6 answers

I have many (thousands of) zip files. The zipped files are about 30 MB each, and the txt inside each zip is about 60/70 MB. Reading and processing the files with this code takes many hours, around 15, but it depends.

Let's do some back-of-the-envelope calculations.

Say you have 5,000 files. If it takes 15 hours to process them, that equals ~10 seconds per file. The files are about 30 MB each, so the throughput is ~3 MB/s.

That is one to two orders of magnitude slower than the speed at which ZipFile can decompress data.
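As a quick snippet of the same arithmetic (the 5,000-file count, 15-hour total, and 30 MB size are the assumptions from above):

    long files = 5_000;                             // assumed number of zip files
    double totalSeconds = 15 * 3600;                // 15 hours of processing
    double secondsPerFile = totalSeconds / files;   // ~10.8 s per file
    double throughputMBps = 30.0 / secondsPerFile;  // 30 MB per file -> ~2.8 MB/s
    System.out.printf("%.1f s/file, %.1f MB/s%n", secondsPerFile, throughputMBps);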

Is the problem with the disks (are they local, or a network share?), or is it the actual processing that takes most of the time?

The best way to know for sure is to use a profiler.

+3

The correct way to iterate over a zip file

    final ZipFile file = new ZipFile( FILE_NAME );
    try {
        final Enumeration<? extends ZipEntry> entries = file.entries();
        while ( entries.hasMoreElements() ) {
            final ZipEntry entry = entries.nextElement();
            System.out.println( entry.getName() );
            // use entry input stream: readInputStream( file.getInputStream( entry ) )
        }
    } finally {
        file.close();
    }

    private static int readInputStream( final InputStream is ) throws IOException {
        final byte[] buf = new byte[ 8192 ];
        int read = 0;
        int cntRead;
        while ( ( cntRead = is.read( buf, 0, buf.length ) ) >= 0 ) {
            read += cntRead;
        }
        return read;
    }

A zip file consists of several entries, each of which has a field containing the size of the entry's data in bytes. Thus, it is easy to iterate over all zip file entries without actually decompressing the data. java.util.zip.ZipFile accepts a file/file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, works with streams, so it cannot jump freely. That is why it must read and decompress all the zip data in order to reach the EOF of each entry and read the next entry's header.

What does this mean? If you already have a zip file in your file system, use ZipFile to process it, whatever your task. As a bonus, you can access zip entries both sequentially and randomly (with a rather small performance penalty). On the other hand, if you are processing a stream, you will need to process all the entries sequentially using ZipInputStream.
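For contrast, here is a minimal sketch of the sequential, stream-based iteration with ZipInputStream; the archive name is an assumption:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipStreamScan {
        public static void main(String[] args) throws IOException {
            // ZipInputStream can only move forward: each getNextEntry() has to
            // decompress the current entry's data before the next header is reachable.
            try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
                final byte[] buf = new byte[8192];
                ZipEntry entry;
                while ((entry = zis.getNextEntry()) != null) {
                    long read = 0;
                    int n;
                    while ((n = zis.read(buf)) >= 0) {
                        read += n; // consume the entry's decompressed bytes
                    }
                    System.out.println(entry.getName() + ": " + read + " bytes");
                    zis.closeEntry();
                }
            }
        }
    }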

Here is an example: a zip archive (total file size = 1.6 GB) containing three 0.6 GB entries was iterated over in 0.05 seconds using ZipFile and in 18 seconds using ZipInputStream.

+1

You can use the new file API as follows:

    Path jarPath = Paths.get(...);
    try (FileSystem jarFS = FileSystems.newFileSystem(jarPath, null)) {
        Path someFileInJarPath = jarFS.getPath("/...");
        try (ReadableByteChannel rbc = Files.newByteChannel(someFileInJarPath, EnumSet.of(StandardOpenOption.READ))) {
            // read file
        }
    }

The code is for jar files, but I think it should work for zip files too.
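Building on that idea, here is a minimal, self-contained sketch that opens a zip as a FileSystem and reads a text entry line by line; the archive and entry names are assumptions, and the (ClassLoader) cast just disambiguates the overload added in newer JDKs:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ZipFsRead {
        public static void main(String[] args) throws IOException {
            Path zipPath = Paths.get("data.zip");           // assumed archive name
            try (FileSystem zipFs = FileSystems.newFileSystem(zipPath, (ClassLoader) null)) {
                Path entry = zipFs.getPath("/data.txt");    // assumed entry name
                try (BufferedReader reader = Files.newBufferedReader(entry)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // process the line here
                    }
                }
            }
        }
    }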

0

You can try this code:

    try {
        final ZipFile zf = new ZipFile("C:/Documents and Settings/satheesh/Desktop/POTL.Zip");
        final Enumeration<? extends ZipEntry> entries = zf.entries();
        while (entries.hasMoreElements()) {
            final ZipEntry zipEntry = entries.nextElement();
            final InputStream input = zf.getInputStream(zipEntry);
            final BufferedReader br = new BufferedReader(new InputStreamReader(input, "UTF-8"));
            // f2 is the destination file, defined elsewhere
            final BufferedWriter wr = new BufferedWriter(new FileWriter(f2));
            String line;
            while ((line = br.readLine()) != null) {
                wr.write(line);
                wr.newLine();
            }
            wr.flush();
            wr.close();
            br.close();
        }
        zf.close();
        System.out.println("The file has been extracted successfully");
    } catch (Exception e) {
        System.out.print(e);
    }

This code works well.

0

Intel has released an improved version of zlib, which Java uses internally to perform zip/unzip. It requires patching the zlib sources with Intel IPP files. I ran tests showing throughput gains of 1.4x to 3x.

0

Asynchronous Unpacking and Synchronous Processing

Using advice from Java Performance, which largely resembles the answers from Vasim Vani and Satish Kumar — iterating over the ZIP entries to get each one's InputStream and doing something with it — I created my own solution.

In my case, the processing is the bottleneck, so I launch a massively parallel extraction at the beginning and put each result into a ConcurrentLinkedQueue that the processing thread consumes. My ZIP contains a collection of XML files that represent serialized Java classes, so my "extraction" involves deserializing the classes, and the deserialized objects are what get queued (see the sketch at the end of this answer).

For me, this has several advantages over my previous approach of sequentially obtaining each file from the ZIP and processing it:

  1. most compelling: a 10% reduction in total time
  2. the files are released earlier
  3. the full amount of RAM is allocated sooner, so if there is not enough RAM it fails faster (within tens of minutes instead of over an hour); note that the amount of memory I keep after processing is close to the amount occupied by the unzipped files; otherwise, it would be better to unzip and discard sequentially to lower the memory footprint
  4. unzipping and deserializing have high CPU utilization, so the sooner they finish, the sooner the CPU is free for the processing, which is what really matters

There is one drawback: the control flow is a bit more complicated when you run the work in parallel.
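A minimal sketch of the producer/consumer shape described above, assuming a single archive and hypothetical deserialize/process steps in place of the real XML deserialization and processing:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Enumeration;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ParallelUnzip {
        private static final ConcurrentLinkedQueue<Object> QUEUE = new ConcurrentLinkedQueue<>();

        public static void main(String[] args) throws Exception {
            final ExecutorService pool = Executors.newFixedThreadPool(4); // extraction threads
            final ZipFile zf = new ZipFile("archive.zip");                // assumed archive name
            final Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                final ZipEntry entry = entries.nextElement();
                pool.submit(() -> {
                    // producer: extract and deserialize one entry, then enqueue the result
                    try (InputStream in = zf.getInputStream(entry)) {
                        QUEUE.add(deserialize(in));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            // consumer: the single processing thread drains the queue while extraction runs
            while (!pool.isTerminated() || !QUEUE.isEmpty()) {
                final Object item = QUEUE.poll();
                if (item != null) {
                    process(item);
                } else {
                    pool.awaitTermination(10, TimeUnit.MILLISECONDS); // short bounded wait
                }
            }
            zf.close();
        }

        private static Object deserialize(InputStream in) { return new Object(); } // hypothetical
        private static void process(Object item) { }                              // hypothetical
    }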

0

Source: https://habr.com/ru/post/1414171/

