Java: How to improve reading a 50 gigabyte file

I am reading a 50 GB file containing millions of lines separated by newlines. I am currently using the following code to read the file:

    String line = null;
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
    while ((line = br.readLine()) != null) {
        // Processing each line here.
        // All processing is done in memory. No I/O required here.
    }

Since the file is so large, it takes 2 hours to process the whole thing. Can I improve reading the file from the hard disk so that the I/O (read) part takes as little time as possible? The constraint on my code is that I have to process the lines in their original order.

+8
java file bufferedreader
6 answers

It takes 2 hours to process the entire file.

50 GB / 2 hours is approximately 7 MB/s. That is not a bad rate at all, but a good (modern) hard disk should be able to sustain considerably higher speeds continuously, so perhaps your bottleneck is not the I/O. You are already using BufferedReader, which, as the name suggests, buffers (in memory) what it reads. You could experiment with creating the reader with a somewhat larger buffer than the default (8192 bytes), for example:

    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream("FileName")), 100000);

Note that with the default 8192-byte buffer and a throughput of 7 MB/s, BufferedReader refills its buffer almost 1000 times per second, so reducing that refill frequency by enlarging the buffer really can shave off some overhead. But if the processing you do, rather than the I/O, is the bottleneck, then no I/O trick will help you. You might then consider making it multithreaded, but whether that is possible, and how, depends entirely on what "processing" means here.
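
If the per-line work can be pipelined, one simple form of "multithreaded" is a producer/consumer pair: one thread does nothing but read lines, one thread does nothing but process them, and a bounded queue between them lets I/O and computation overlap while lines are still handled in their original order. A rough sketch, not a tested solution (the class name, queue capacity and process() method are placeholders for whatever your real code does):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelinedReader {

        // Sentinel marking end-of-file; compared by identity below.
        private static final String EOF = new String("EOF");

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

            // Producer: only reads lines and hands them off.
            Thread reader = new Thread(() -> {
                try (BufferedReader br = new BufferedReader(new FileReader("FileName"), 1 << 20)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        queue.put(line);
                    }
                    queue.put(EOF);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });

            // Single consumer, so lines are still processed in order.
            Thread worker = new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != EOF) {
                        process(line); // the in-memory work from the question
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            reader.start();
            worker.start();
            reader.join();
            worker.join();
        }

        private static void process(String line) {
            // placeholder for the per-line processing
        }
    }

If reading and processing each take a comparable share of the 2 hours, overlapping them like this can at best roughly halve the wall-clock time; if one side completely dominates, it will change very little.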

+10

Your only hope is to parallelize the reading and the processing of what is read. Your strategy should be to never require the entire contents of the file to be in memory at once.

Start by profiling the code to see where the time is actually spent. Rewrite the part that takes longest, then re-profile to see whether it improved. Keep repeating until you get an acceptable result.
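
One crude way to get that first picture, before reaching for a real profiler, is to accumulate the time spent in readLine() separately from the time spent in the per-line work. A sketch under the assumption that the processing happens in a single process(line) call (the per-line nanoTime() calls add a little overhead of their own, but as a first estimate it is usually good enough):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class PhaseTimer {
        public static void main(String[] args) throws Exception {
            long readNanos = 0, processNanos = 0;
            try (BufferedReader br = new BufferedReader(new FileReader("FileName"))) {
                while (true) {
                    long t0 = System.nanoTime();
                    String line = br.readLine();
                    readNanos += System.nanoTime() - t0;
                    if (line == null) {
                        break;
                    }
                    long t1 = System.nanoTime();
                    process(line); // the existing per-line work
                    processNanos += System.nanoTime() - t1;
                }
            }
            System.out.printf("reading: %d s, processing: %d s%n",
                    readNanos / 1_000_000_000L, processNanos / 1_000_000_000L);
        }

        private static void process(String line) {
            // placeholder
        }
    }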

I would also think about Hadoop and a distributed solution; data sets much larger than yours are routinely processed that way. You may need to be a little more creative in your thinking.

+8

Without NIO you are unlikely to break through the throughput barrier. For example, try using new Scanner(File) instead of creating the readers directly. I looked at its source code recently; it uses NIO file channels.
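
For what it's worth, the Scanner variant mentioned above would look roughly like this; whether it actually beats a plain BufferedReader for straight line reading has to be measured, since Scanner adds its own tokenizing overhead on top:

    import java.io.File;
    import java.util.Scanner;

    public class ScannerRead {
        public static void main(String[] args) throws Exception {
            try (Scanner sc = new Scanner(new File("FileName"), "UTF-8")) {
                while (sc.hasNextLine()) {
                    String line = sc.nextLine();
                    // process the line here
                }
            }
        }
    }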

But the first thing I would suggest is to run an empty loop with BufferedReader that does nothing but read. Note the throughput, and also keep an eye on the CPU. If that loop already pegs the CPU, then something is definitely wrong with the I/O code.
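
Such a read-only loop could look something like this; it reports the observed throughput, and you can watch CPU usage in top or Task Manager while it runs:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    public class ReadOnlyBenchmark {
        public static void main(String[] args) throws Exception {
            File f = new File("FileName");
            long start = System.nanoTime();
            long lines = 0;
            try (BufferedReader br = new BufferedReader(new FileReader(f))) {
                while (br.readLine() != null) {
                    lines++; // deliberately no other work at all
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            double mbPerSec = f.length() / (1024.0 * 1024.0) / seconds;
            System.out.printf("%d lines in %.1f s, about %.1f MB/s%n", lines, seconds, mbPerSec);
        }
    }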

+5
  • Disable antivirus and any other program that adds disk contention while the file is being read.

  • Defragment the disk.

  • Create a raw disk partition and read the file from there.

  • Read the file from an SSD.

  • Create a 50 GB RAMdisk and read the file from there.

+2

I think you may get the best results by rethinking the problem you are trying to solve. There is clearly a reason why you are loading this 50 GB file. Consider whether there is a better way to break up the stored data so that you only read the data you actually need.

+1

The way you read the file is fine. There may be ways to make it faster, but that usually requires understanding where your bottleneck is. Since the I/O throughput is actually on the low end, I assume the computation is what is hurting performance. If it is not too long, you could show us your whole program.

Alternatively, you could run your program without the loop body and see how long it takes just to read the file :)

0