Reading a large file in Java - Java heap space

I am reading a large TSV file (~40 GB) and trying to trim it by reading it line by line and printing only certain lines to a new file. However, I keep getting the following exception:

    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
        at java.lang.StringBuffer.append(StringBuffer.java:323)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at java.io.BufferedReader.readLine(BufferedReader.java:379)

The following is the main part of the code. I set the buffer size to 8192 just in case. Doesn't Java flush the buffer once it reaches the buffer size limit? I do not see what could lead to a lot of memory usage here. I tried increasing the heap size, but it did not matter (the machine has 4 GB of RAM). I also tried flushing the output file every X lines, but that didn't help either. I think maybe I need to call the GC, but that doesn't sound right.

Any thoughts? Many thanks. BTW, I know that I should call trim() only once, save the result, and then reuse it.

    Set<String> set = new HashSet<String>();
    set.add("AB");
    ...

    static public void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
        PrintStream output = new PrintStream(outputFile, "UTF-8");

        String line = reader.readLine();
        while (line != null) {
            String[] fields = line.split("\t");
            if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
                output.println(fields[0].trim() + "-" + fields[1].trim());
            line = reader.readLine();
        }
        output.close();
    }
5 answers

Most likely, what is happening is that the file does not have line terminators, so the reader just keeps growing its StringBuffer without bound until it runs out of memory.

The solution is to read a fixed number of characters at a time using the reader's read method, and then look for newlines (or other parsing tokens) within each smaller buffer.
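This is not the original poster's code, just a minimal sketch of that approach, assuming placeholder file names and an arbitrary cap on line length so a file with no terminators fails fast instead of exhausting the heap:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.PrintStream;
    import java.io.Reader;

    public class ChunkedScan {
        public static void main(String[] args) throws Exception {
            // Placeholder paths; substitute the real input and output files.
            Reader reader = new InputStreamReader(new FileInputStream("input.tsv"), "UTF-8");
            PrintStream output = new PrintStream("output.tsv", "UTF-8");

            char[] buffer = new char[8192];
            StringBuilder current = new StringBuilder();
            final int MAX_LINE = 1000000;   // arbitrary sanity limit on line length

            int n;
            while ((n = reader.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buffer[i];
                    if (c == '\n') {
                        handleLine(current.toString(), output);
                        current.setLength(0);
                    } else if (c != '\r') {
                        current.append(c);
                    }
                    if (current.length() > MAX_LINE) {
                        // No terminator in sight: the file is probably not line-oriented.
                        throw new IllegalStateException("Line exceeds " + MAX_LINE + " chars");
                    }
                }
            }
            if (current.length() > 0) {
                handleLine(current.toString(), output);   // last line without a trailing newline
            }
            reader.close();
            output.close();
        }

        private static void handleLine(String line, PrintStream output) {
            // The original per-line filtering (split on tab, check the set) would go here.
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
                output.println(fields[0].trim() + "-" + fields[1].trim());
            }
        }
    }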


Are you sure the "lines" in the file are separated by newlines?


I have 3 theories:

  • The input file is not UTF-8 but some unspecified binary format, which results in extremely long lines when read as UTF-8.

  • The file contains some extremely long "lines" ... or no line breaks at all.

  • Something else is happening in the code that you are not showing us; for example, you are adding new elements to set.


To help diagnose this:

  • Use a tool like od (on UNIX / Linux) to confirm that the input file does contain valid line terminators; i.e. CR, NL, or CR NL.
  • Use a tool to verify that the file is valid UTF-8.
  • Add a static line counter to your code, and when the application blows up with an OOME, print the counter's value.
  • Keep track of the longest line seen so far and print it when you get the OOME (a sketch combining these last two checks follows this list).
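A rough sketch of that instrumentation, assuming it replaces the loop in the question's main method and reuses its reader variable:

    // Hypothetical instrumentation around the existing read loop: count lines and
    // track the longest one, then report both if an OutOfMemoryError occurs.
    long lineCount = 0;
    int longestSoFar = 0;
    try {
        String line = reader.readLine();
        while (line != null) {
            lineCount++;
            if (line.length() > longestSoFar) {
                longestSoFar = line.length();
            }
            // ... the original filtering and printing logic ...
            line = reader.readLine();
        }
    } catch (OutOfMemoryError e) {
        System.err.println("OOME after " + lineCount + " lines; longest line so far: "
                + longestSoFar + " chars");
        throw e;
    }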

For the record, your slightly suboptimal use of trim() is not relevant to this problem.


One possibility is that you are running out of heap space during garbage collection. The HotSpot JVM uses a parallel collector by default, which means that your application can allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by allocating and discarding objects quickly.

Instead, you can use the old (pre-1.5) serial collector with the -XX:+UseSerialGC flag. There are several other "advanced" options you can use to tune collection.
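For example, assuming the main class is called Trim (a hypothetical name), the collector flag goes on the java command line, possibly together with a larger maximum heap:

    java -Xmx2g -XX:+UseSerialGC Trim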


You might want to move the String[] fields declaration out of the loop, since you create a new array on each iteration. Could you just reuse the old one?
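A sketch of what that suggestion looks like; note that split() still allocates a new array on every call, so hoisting the declaration changes readability more than memory behavior:

    String[] fields;                      // declared once, outside the loop
    String line = reader.readLine();
    while (line != null) {
        fields = line.split("\t");        // split() still creates a new array each call
        // ... the original filtering and printing ...
        line = reader.readLine();
    }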

