Reading and processing a large text file of 25 GB

I need to read a large text file, say 25 GB, and the processing must complete within 15-20 minutes. The file contains several header and footer sections.

I tried using csplit to split this file on the headers, but it takes 24 to 25 minutes just to split it into several files, which is not acceptable at all.

I tried sequential read and write using BufferedReader and BufferedWriter along with FileReader and FileWriter. It takes more than 27 minutes. Again, this is unacceptable.

I tried a different approach: finding the starting index of each header, then running multiple threads that each read the file from a specific offset using RandomAccessFile. That did not help either.

How can I fulfill my requirement?

Possible duplicate:

Read large files in Java

4 answers

Try using a large read buffer (e.g. 20 MB instead of 2 MB) to process your data faster. Also, avoid BufferedReader: it is slow because it converts bytes to characters.
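A minimal sketch of what this suggestion looks like in practice: reading the file through a single large, reusable byte buffer with FileInputStream, with no character decoding at all. The 20 MB buffer size is just the figure suggested above, and the processing step is left as a placeholder:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LargeBufferRead {

    // Read a file through one large reusable byte buffer.
    // bufferSize is a tunable assumption (20 MB per the suggestion above).
    static long countBytes(String path, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // real code would process buf[0..n) here
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small temp file; a real run would use the 25 GB input.
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "header\ndata line\nfooter\n".getBytes());
        long bytes = countBytes(tmp.toString(), 20 * 1024 * 1024);
        System.out.println("Read " + bytes + " bytes");
        Files.delete(tmp);
    }
}
```

Since everything stays as bytes, the per-character overhead of BufferedReader is avoided entirely.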

This question has been asked before: Read large files in Java


You need to measure how fast the IO is on its own, without any processing, because I suspect that the processing, not the IO, is slowing you down. You can get about 80 MB/s from a hard drive and up to 400 MB/s from an SSD. That means you could read the whole 25 GB in roughly five minutes from a hard drive, or about a minute from an SSD.

Try the following, which is not the fastest, but the easiest.

 long start = System.nanoTime();
 byte[] bytes = new byte[32 * 1024];
 try (FileInputStream fis = new FileInputStream(fileName)) {
     int len;
     while ((len = fis.read(bytes)) > 0) {
         // do nothing: we only want to measure raw read speed
     }
 }
 long time = System.nanoTime() - start;
 System.out.printf("Took %.3f seconds%n", time / 1e9);

If this does not show at least 50 MB/s, you have a hardware problem.


Try using java.nio to make better use of your operating system's functionality. Avoid copying the data (e.g. into a String) and instead work with offsets. I believe the java.nio classes even have methods to transfer data from one channel to another without pulling the data up into the Java layer at all (at least on Linux); these essentially translate directly into operating-system calls.

For many modern web servers, this technique is key to the performance with which they serve static data: in essence, they delegate as much as possible to the operating system to avoid copying the data through main memory.

Let me emphasize this: simply searching through 25 GB of bytes is much faster than converting them into Java Strings (which may require charset decoding and copying). Anything that saves you copies and memory management will help.
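To illustrate working on raw bytes with offsets, here is a sketch that memory-maps a file and records the offset of every line starting with a given header marker, without ever building a String. Note the simplifications: a single MappedByteBuffer is limited to 2 GB, so a real 25 GB file would need a window of several maps, and the "HEADER" marker is a made-up placeholder:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class HeaderScan {

    // Return the byte offsets of lines beginning with marker,
    // scanning a memory-mapped view of the file (no Strings created).
    // Simplification: one map, so files must be under 2 GB here.
    static List<Integer> headerOffsets(Path file, byte[] marker) throws IOException {
        List<Integer> offsets = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int size = map.limit();
            int lineStart = 0;
            for (int i = 0; i <= size; i++) {
                if (i == size || map.get(i) == '\n') { // end of a line
                    if (startsWith(map, lineStart, marker)) {
                        offsets.add(lineStart);
                    }
                    lineStart = i + 1;
                }
            }
        }
        return offsets;
    }

    static boolean startsWith(MappedByteBuffer map, int pos, byte[] marker) {
        if (pos + marker.length > map.limit()) return false;
        for (int j = 0; j < marker.length; j++) {
            if (map.get(pos + j) != marker[j]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("scan", ".txt");
        Files.write(tmp, "HEADER a\nx\nHEADER b\n".getBytes());
        System.out.println(headerOffsets(tmp, "HEADER".getBytes()));
        Files.delete(tmp);
    }
}
```

The offsets found this way could then feed the RandomAccessFile-per-thread approach the question already tried, with each thread handed one header-to-header section.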


If the platform allows it, you may want to shell out and call a combination of cat and sed. If not, you can still shell out and use perl from the command line. For the case where it is absolutely imperative that Java does the actual processing, the other answers are sufficient.

Be on your guard, though: shelling out is not without its own problems. But perl or sed may be the only tools available that can scan and change 25 GB of text within your timeframe.
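A small sketch of shelling out from Java with ProcessBuilder, running a sed substitution over a file. It assumes a POSIX system with sed on the PATH; the substitution expression and file names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ShellOut {

    // Run `sed <expr> <in>` and redirect its stdout to the out file.
    // Assumes a POSIX system with sed available on the PATH.
    static int runSed(String expr, Path in, Path out)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("sed", expr, in.toString());
        pb.redirectOutput(out.toFile());
        pb.redirectErrorStream(true); // send stderr to the same place
        return pb.start().waitFor();  // exit code 0 means success
    }

    public static void main(String[] args) throws Exception {
        Path in = Files.createTempFile("in", ".txt");
        Path out = Files.createTempFile("out", ".txt");
        Files.write(in, "foo bar\nfoo baz\n".getBytes());
        int exit = runSed("s/foo/FOO/", in, out);
        System.out.println("exit=" + exit);
        System.out.print(new String(Files.readAllBytes(out)));
        Files.delete(in);
        Files.delete(out);
    }
}
```

One of the problems alluded to above: the child process's exit code and stderr must be checked explicitly, or failures pass silently.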

