I am reading a large TSV file (~40 GB) and trying to trim it by reading it line by line and printing only certain lines to a new file. However, I keep getting the following exception:
    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
        at java.lang.StringBuffer.append(StringBuffer.java:323)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at java.io.BufferedReader.readLine(BufferedReader.java:379)
The following is the main part of the code. I set the buffer size to 8192 just in case. Doesn't Java flush the buffer once it reaches the buffer size limit? I don't see what could lead to high memory usage here. I tried increasing the heap size, but it didn't matter (the machine has 4 GB of RAM). I also tried flushing the output file every X lines, but that didn't help either. I'm thinking maybe I need to call the GC, but that doesn't sound right.
Any thoughts? Many thanks. BTW, I know that I should call trim() only once, save the result, and then reuse it.
    Set<String> set = new HashSet<String>();
    set.add("AB");
    ...
    ...
    static public void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
        PrintStream output = new PrintStream(outputFile, "UTF-8");
        String line = reader.readLine();
        while (line != null) {
            String[] fields = line.split("\t");
            if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
                output.println(fields[0].trim() + "-" + fields[1].trim());
            line = reader.readLine();
        }
        output.close();
    }
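For reference, here is a minimal, self-contained sketch of the trim-once refactor I mentioned above (the class name `TrimOnce`, the helper `keyOf`, and the `StringReader` input are just placeholders for illustration; my real code reads from the 40 GB file):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class TrimOnce {
    // Build the "field0-field1" key once per line instead of
    // calling trim() twice for the lookup and again for the print.
    static String keyOf(String line) {
        String[] fields = line.split("\t");
        return fields[0].trim() + "-" + fields[1].trim();
    }

    public static void main(String[] args) throws Exception {
        Set<String> set = new HashSet<String>();
        set.add("A-B");

        // StringReader stands in for the real file input here.
        BufferedReader reader = new BufferedReader(
            new StringReader("A\tB\tx\nC\tD\ty\n"));

        String line;
        while ((line = reader.readLine()) != null) {
            String key = keyOf(line); // compute once, reuse
            if (set.contains(key))
                System.out.println(key); // prints "A-B" for the first line
        }
        reader.close();
    }
}
```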