Quick search of lines in a very large file

What is the fastest way to search a file for lines containing any of a set of strings? I have a file containing the strings to search for. This small file (smallF) has about 50,000 lines and looks like this:

stringToSearch1
stringToSearch2
stringToSearch3

I need to look for all of these strings in a larger file (about 100 million lines). If a line in the larger file contains any of the search strings, that line should be printed.

The best method I've come up with so far is

grep -F -f smallF largeF 

But it is not very fast. With just 100 search strings in smallF it already takes about 4 minutes, so all 50,000 search strings will take a very long time.

Is there a more efficient method?

linux bash grep
3 answers

I noticed that using -E or multiple -e options is faster than using -f. Note that this may not scale to your problem, since you are searching for 50,000 strings in a much larger file. Still, I wanted to show what can be done and what might be worth testing:

Here is what I noticed in detail:

Take a 1.2 GB file filled with random strings:

    > ls -has | grep string
    1,2G strings.txt

    > head strings.txt
    Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
    Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
    BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
    etrulbGONKT3pact1SHg2ipcCr7TZ9jc
    .....
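
Such a test file can be generated with standard tools; here is a minimal sketch, assuming GNU coreutils. The 32-character line width matches the sample above, and the byte count is an arbitrary way to reach roughly 1.2 GB:

    # Stretch /dev/urandom into printable characters, cut the stream
    # into 32-character lines, and stop after ~1.2 GB.
    base64 < /dev/urandom | tr -dc 'A-Za-z0-9' | fold -w 32 | head -c 1200000000 > strings.txt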

Now I want to look for the strings "ab", "cd" and "ef" using different grep approaches:

  1. Using grep without flags, searching for each pattern one by one:

    grep "ab" strings.txt> m1.out
    2.76s user 0.42s system 96% cpu 3.313 total

    grep "cd" strings.txt โ†’ m1.out
    2.82s user 0.36s system 95% cpu 3.322 total

    grep "ef" strings.txt โ†’ m1.out
    2.78s user 0.36s system 94% cpu 3.360 total

In total, the three searches take almost 10 seconds.

  2. Using grep with the -f flag, with the search strings in search.txt:

     > cat search.txt
     ab
     cd
     ef

     > grep -F -f search.txt strings.txt > m2.out
     31,55s user 0,60s system 99% cpu 32,343 total

For some reason, it takes almost 32 seconds .

  3. Now using multiple search patterns with -E:

     grep -E "ab|cd|ef" strings.txt > m3.out 3,80s user 0,36s system 98% cpu 4,220 total 

    or

     grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
     3,86s user 0,38s system 98% cpu 4,323 total

The third method, using -E, took only 4.22 seconds to search the file.

Now let's check if the results match:

    cat m1.out | sort | uniq > m1.sort
    cat m3.out | sort | uniq > m3.sort
    diff m1.sort m3.sort

diff produces no output, which means the results are identical.

Maybe give this approach a try; otherwise I would advise you to look at the topic "The fastest grep possible", in particular the comment from Cyrus.
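
One widely cited tip from that thread is to force the C locale so that grep can skip multibyte-character handling; how much this helps depends on your grep version and current locale, so treat it as something to measure rather than a guaranteed win:

    LC_ALL=C grep -F -f smallF largeF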


Note: I understand that the following solution is not bash-based, but given the size of your search space, a parallel solution is called for.


If your machine has more than one core/processor, you can compile the following function with Pythran to parallelize the search:

    #!/usr/bin/env python
    #pythran export search_in_file(str, str)
    def search_in_file(long_file_path, short_file_path):
        # Read the large file once up front; testing membership against
        # an open file handle would exhaust it after the first string.
        _long = open(long_file_path, "r").read()
        _strings = open(short_file_path, "r").readlines()
        #omp parallel for schedule(guided)
        for i in range(len(_strings)):
            if _strings[i].strip() in _long:
                print(_strings[i])

    if __name__ == "__main__":
        search_in_file("long_file_path", "short_file_path")

Note: Behind the scenes, Pythran takes the Python code and tries to aggressively compile it into very fast C++.
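
For reference, a Pythran build with OpenMP enabled looks roughly like this; the module file name search_in_file.py is an assumption here, and the two paths are the files from the question:

    # Compile the annotated function into a native Python extension
    # with OpenMP support.
    pythran -fopenmp search_in_file.py

    # Call the compiled module with the real file paths.
    python -c "import search_in_file as s; s.search_in_file('largeF', 'smallF')"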


You can try sift or ag. sift in particular lists some pretty impressive benchmarks against grep.
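
For example, a literal (non-regex) search for one of the strings from the question might look like this with ag; -Q is ag's literal-match flag in recent versions, so check ag --help if it is not recognized:

    # Search largeF for a fixed string, not a regular expression.
    ag -Q "stringToSearch1" largeF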

