How to quickly search a large file for a String in Java?

I am trying to search a large text file (400 MB) for a particular string using the following:

    File file = new File("fileName.txt");
    int count = 0;
    // try-with-resources closes the Scanner (and the file) automatically
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().contains("particularString")) {
                count++;
                System.out.println("Number of instances of String: " + count);
            }
        }
    } catch (FileNotFoundException e) {
        System.out.println(e);
    }

This works fine for small files, but it takes far too long (> 10 minutes) for this particular file and other large files.

What is the fastest and most efficient way to do this?

I have now changed to the following, which completes in a few seconds:

    int count = 0;
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains("particularString")) {
                count++;
                System.out.println("Number of instances of String " + count);
            }
        }
    } catch (IOException e) {
        System.out.println(e);
    }
+6
3 answers

First measure how long it takes just to read the contents of the entire file, and separately how long it takes to scan them for your pattern.
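A minimal sketch of how the two phases could be timed separately. The class name ReadVsScanTiming and its helper methods are made up for illustration, the file is assumed to be UTF-8, and the whole file is assumed to fit in the JVM heap (for a 400 MB file you will probably need to raise -Xmx). Wall-clock timings like these are rough, but they should be enough to tell which phase dominates.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReadVsScanTiming {

    /** Reads every line into memory; this phase is mostly disk I/O and decoding. */
    static List<String> readAllLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Counts the lines containing the needle; this phase is pure CPU work. */
    static long countMatches(List<String> lines, String needle) {
        long count = 0;
        for (String line : lines) {
            if (line.contains(needle)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "fileName.txt");

        long t0 = System.nanoTime();
        List<String> lines = readAllLines(path);
        long t1 = System.nanoTime();
        long matches = countMatches(lines, "particularString");
        long t2 = System.nanoTime();

        System.out.printf("read: %d ms, scan: %d ms, matching lines: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, matches);
    }
}
```

Run it a couple of times: the second run usually reads from the operating system's file cache, so the first run is the more honest measure of disk read time.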

If read time dominates your results (and assuming you are reading the file properly, i.e. with channels or at least buffered readers), there is not much more you can do.

If scan time dominates, you can read all the lines and hand small batches of them to a work queue, where several threads pick up batches of lines and search them.
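The batching idea could be sketched roughly as follows, using a fixed thread pool as the work queue. The class name BatchedSearch, the BATCH_SIZE value, and the UTF-8 charset are assumptions made for this illustration, not something the answer specifies. One thread reads lines and submits batches; the pool's worker threads count matching lines in each batch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedSearch {

    // Hypothetical tuning value: number of lines handed to a worker per task.
    static final int BATCH_SIZE = 10_000;

    static long countMatchingLines(Path path, String needle, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    results.add(submit(pool, batch, needle));
                    batch = new ArrayList<>(BATCH_SIZE); // workers keep the old list
                }
            }
            if (!batch.isEmpty()) {
                results.add(submit(pool, batch, needle)); // final partial batch
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // waits for that batch to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    private static Future<Long> submit(ExecutorService pool, List<String> batch, String needle) {
        Callable<Long> task = () -> batch.stream().filter(s -> s.contains(needle)).count();
        return pool.submit(task);
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args.length > 0 ? args[0] : "fileName.txt");
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(countMatchingLines(path, "particularString", threads));
    }
}
```

The pool's internal queue is unbounded here, so if reading outpaces searching, pending batches accumulate in memory. A bounded queue would be the safer choice for very large files.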

Ballpark figures:

  • Assuming a hard drive read speed of 50 MB/s (which is slow by modern standards), you should be able to read the entire file into memory in under 10 seconds (400 MB at 50 MB/s is 8 seconds).
  • Looking at MD5 hashing speed benchmarks (example here) shows that hashing can be at least as fast as (often faster than) disk read speed. Searching for a string is also faster and simpler than hashing.

Given these two estimates, I think a proper implementation can easily get you a run time of around 10 seconds (if you start running search tasks as you read batches of lines), and that time will be dominated by disk reads.

+5

Scanner is simply not the right tool in this case. Under the hood, it does all kinds of parsing, validation, caching, and much more. If your use case is simply "iterate over all lines of the file", use something based on a plain BufferedReader.

In your specific case, I recommend using Files.lines.

Example:

    long count;
    try (Stream<String> lines = Files.lines(Paths.get("testfile.txt"))) {
        count = lines.filter(s -> s.contains("particularString")).count();
    }
    System.out.println(count);

(Note that this particular use of the Streams API probably does not cover what you are actually trying to achieve; unfortunately your question does not say what the result of the method should be.)

On my system, Files.lines() or a buffered reader takes about 15% of the Scanner runtime.

0

Use Scanner's findWithinHorizon method. Scanner internally opens a FileChannel to read the file, and for pattern matching it uses the Boyer-Moore algorithm to search for strings efficiently.
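A minimal sketch of what this suggestion might look like. The class name HorizonSearch is made up for the example, and Pattern.LITERAL is my choice so that the search string is treated as plain text rather than a regex. Note that this counts occurrences, not matching lines, so a line containing the string twice counts twice. A horizon of 0 means "search without bound", and the Scanner documentation warns that in this mode it may buffer all of the input while looking for a match, so memory use on a 400 MB file is worth checking. Whether this is actually faster than a BufferedReader loop is something to measure, not assume.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class HorizonSearch {

    /** Counts occurrences (not lines) of a literal string in the file. */
    static long countOccurrences(File file, String needle) throws FileNotFoundException {
        // LITERAL: treat the needle as plain text, not as a regular expression.
        Pattern pattern = Pattern.compile(needle, Pattern.LITERAL);
        long count = 0;
        try (Scanner scanner = new Scanner(file)) {
            // Horizon 0 = no bound; each call resumes after the previous match
            // and returns null once no further match exists.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws FileNotFoundException {
        File file = new File(args.length > 0 ? args[0] : "fileName.txt");
        System.out.println("Occurrences: " + countOccurrences(file, "particularString"));
    }
}
```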

-1