An easier way to extract patterns from a large file (over 700 MB)

I have a problem that requires me to parse a text file from the local machine. There are several complications:

  • Files can be quite large (700 MB+).
  • The pattern occurs on many lines.
  • I need to store the part of each line that follows the pattern.

I wrote simple code using BufferedReader, String.indexOf and String.substring (to handle point 3 above).

Inside the file there is a key (the pattern) named code=, which occurs many times in different blocks. The program reads each line of the file with BufferedReader.readLine, uses indexOf to check whether the pattern appears, and then extracts the text after the pattern and appends it to a common string.

When I ran my program on a 600 MB file, I noticed that performance was very poor while processing the file. I read a thread on CodeRanch saying that the Scanner class does not perform well on large files.

Are there any techniques or libraries that could improve my performance?

Thanks in advance.

Here is my source code:

 String codeC = "code=[";
 String source = "";
 try {
     FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
     DataInputStream in = new DataInputStream(f1);
     BufferedReader br = new BufferedReader(new InputStreamReader(in));
     String strLine;
     boolean bPrnt = false;
     int ln = 0;
     // Read file line by line
     while ((strLine = br.readLine()) != null) {
         // Print the content on the console
         if (strLine.indexOf(codeC) != -1) {
             ln++;
             System.out.println(strLine + " ---- register : " + ln);
             strLine = strLine.substring(codeC.length(), strLine.length());
             source = source + "\n" + strLine;
         }
     }
     System.out.println("");
     System.out.println("Lines :" + ln);
     f1.close();
 } catch ( ... ) {
     ...
 }
4 answers

This code is suspicious and may account for at least part of your performance problem:

 FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
 DataInputStream in = new DataInputStream(f1);
 BufferedReader br = new BufferedReader(new InputStreamReader(in));

You bring DataInputStream into the chain for no good reason; in fact, using it as the input to a Reader can be seen as an instance of broken code. Instead, write:

 InputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
 BufferedReader br = new BufferedReader(new InputStreamReader(f1));

A huge performance drain is the System.out printing you do, especially if you measure performance while running inside Eclipse, but even when starting from the command line. My guess is that this is the main cause of your bottleneck. Make sure you print nothing in the main loop when you are aiming for maximum performance.
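
As a rough illustration of both points, here is a sketch of the loop with the DataInputStream removed and nothing printed inside the loop. It assumes the file path and the code=[ marker from the question, the usual java.io imports, and (as a further small improvement, not part of the answer above) a StringBuilder instead of repeated String concatenation:

 String codeC = "code=[";
 StringBuilder source = new StringBuilder();
 int ln = 0;
 try {
     // no DataInputStream in the chain
     BufferedReader br = new BufferedReader(
             new InputStreamReader(new FileInputStream("c:\\Temp\\fo1.txt")));
     String strLine;
     while ((strLine = br.readLine()) != null) {
         int pos = strLine.indexOf(codeC);
         if (pos != -1) {
             ln++;
             // collect the match; do not print inside the loop
             source.append('\n').append(strLine.substring(pos + codeC.length()));
         }
     }
     br.close();
 } catch (IOException e) {
     e.printStackTrace();
 }
 // report once, after the loop
 System.out.println("Lines : " + ln);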


In addition to what Marco answered, I suggest closing br, not f1:

 br.close() 

This will not affect performance, but it is cleaner: closing the outermost stream also closes the streams it wraps.
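
Not something the answer itself requires, but on Java 7 and later a try-with-resources block gives you this for free; a minimal sketch:

 // try-with-resources closes br, and through it the wrapped InputStreamReader
 // and FileInputStream, even if an exception is thrown
 try (BufferedReader br = new BufferedReader(
         new InputStreamReader(new FileInputStream("c:\\Temp\\fo1.txt")))) {
     String strLine;
     while ((strLine = br.readLine()) != null) {
         // process the line
     }
 } catch (IOException e) {
     e.printStackTrace();
 }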


Take a look at java.util.regex

There is a great tutorial from Oracle.

Copied from the Javadoc:

Classes for matching character sequences against patterns specified by regular expressions.

An instance of the Pattern class is a regular expression that is specified in string form in syntax similar to that used by Perl.

Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.

Unless otherwise specified, passing a null argument to a method in any class or interface in this package will throw a NullPointerException.
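
For the question's code=[ marker, a sketch could look like the following. The exact regex is an assumption about the data (it captures everything after code=[ up to a closing ]); the important part is compiling the Pattern once, outside the read loop:

 // uses java.util.regex.Pattern and java.util.regex.Matcher;
 // compiled once, reused for every line
 Pattern codePattern = Pattern.compile("code=\\[([^\\]]*)\\]");

 // inside the read loop:
 Matcher m = codePattern.matcher(strLine);
 if (m.find()) {
     String value = m.group(1);   // the text between code=[ and ]
     source.append('\n').append(value);
 }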


It works great!

I followed the advice of OldCurmudgeon, Marco Topolnik and AlexWien, and my performance improved by 1000%. Before, the program took 2 hours to complete the described operation and write the result to a file. Now it takes 5 minutes! And the SYSO (System.out.println) calls are still in the source code!

I think the main reason for the improvement is changing the source String to a HashSet, as OldCurmudgeon suggested. But I also removed DataInputStream and used br.close().
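
Roughly, that change looks like the following sketch (the exact code is not shown in the answers here; this assumes the same codeC marker as in the question). Each match is added to the set directly, instead of recopying an ever-growing String on every concatenation:

 // uses java.util.Set and java.util.HashSet
 Set<String> source = new HashSet<String>();
 // inside the read loop, after the indexOf check:
 source.add(strLine.substring(codeC.length()));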

Thanks guys!

