The fastest grep possible

I would like to know if there is any advice on how to make grep as fast as possible. I have a fairly large corpus of text files that I need to search as quickly as possible. I converted everything to lower case so that I could drop the -i option, which speeds up the search considerably.
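
A one-off lowercasing pass along these lines (paths are illustrative) is roughly what I mean:

 # one-off pass: write lowercased copies so later searches can drop -i
 for f in corpus/*.txt; do
     tr '[:upper:]' '[:lower:]' < "$f" > "${f%.txt}.lc.txt"
 done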

In addition, I found that the -F and -P modes are faster than the default mode. I use the former when the search string is not a regular expression (just plain text), and the latter when a regular expression is involved.
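
For illustration, the two modes look like this (the patterns and file names are placeholders):

 # fixed-string search, no regex interpretation
 grep -F 'connection timed out' logs/*.txt
 # Perl-compatible regex search for patterns that need it
 grep -P 'error_code=\d+' logs/*.txt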

Does anyone have experience with speeding grep up? Maybe compile it from source with particular flags (I'm on Linux, CentOS), organize the files in a certain way, or perhaps parallelize the search somehow?

+78
unix bash grep
Jan 30 '12 at 3:50
12 answers

Try GNU parallel, which includes an example of how to use it with grep:

grep -r greps recursively through directories. On multi-core processors, GNU parallel can often speed this up.

 find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {} 

This will run 1.5 jobs per core and give 1000 arguments to each invocation of grep.

For big files, it can split the input into several chunks with the --pipe and --block arguments:

  parallel --pipe --block 2M grep foo < bigfile 

You can also run it on several machines over SSH (with ssh-agent set up to avoid password prompts):

 parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile 
+103
Jan 30 '12 at 16:18

If you are searching very large files, setting your locale can really help.

GNU grep is much faster in the C locale than with UTF-8.

 export LC_ALL=C 
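
You can also limit the locale change to a single invocation instead of exporting it for the whole shell (the pattern and file name are placeholders):

 # apply the C locale to just this command
 LC_ALL=C grep -F 'needle' bigfile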
+69

Ripgrep claims to be the fastest now.

https://github.com/BurntSushi/ripgrep

It also runs in parallel by default:

  -j, --threads ARG The number of threads to use. Defaults to the number of logical CPUs (capped at 6). [default: 0] 

From the README:

It is built on top of Rust's regex engine, which uses finite automata, SIMD, and aggressive literal optimizations to make searching very fast.
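
A typical invocation (the path and thread count are placeholders) looks like:

 # recursive search with an explicit thread count and a fixed-string pattern
 rg -j 8 -F 'STRING' /path/to/corpus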

+12
Oct 13 '16 at 23:22

Not strictly a code improvement, but something I found helpful after running grep over 2+ million files.

I moved the operation onto a cheap SSD (120 GB). At around $100, it is an affordable option if you regularly crunch lots of files.

+4
May 31 '12 at 2:23

If you don't care which files contain the match, you may want to separate reading and grepping into two jobs, since spawning grep many times (once for each small file) can be costly.

  • If you have one very large file:

    parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>

  • Many small compressed files (sorted by inode):

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>

I usually compress files with lz4 for maximum throughput (see the sketch after this list for how that fits into the pipeline).

  • If you want only the name of the file containing a match:

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
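
Adapting the gzcat pipeline above for lz4-compressed files might look like this (PATTERN is a placeholder; assumes the lz4 command-line tool is installed):

 # decompress lz4 chunks on the fly and feed them to parallel greps
 ls *.lz4 | parallel -j80% --group "lz4 -dc {}" | parallel -j50% --pipe --round-robin -u -N1000 grep PATTERN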

+3
Oct 25 '15 at 13:00

Based on Sandro's answer, I looked at the link he provided and played with BSD grep vs. GNU grep. My quick benchmarks showed that GNU grep is way, way faster.

So my recommendation for the original question, "the fastest grep possible": make sure you are using GNU grep rather than BSD grep (which is the default on macOS).
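
To check which flavor you have, look at the first line of the version output (GNU grep reports itself as GNU grep; the BSD variant names itself BSD grep):

 # the first line of output names the implementation and version
 grep --version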

+2
Mar 26 '14 at 12:51

I personally use ag (the silver searcher) instead of grep, and it is faster; you can also combine it with parallel and pipe blocks.

https://github.com/ggreer/the_silver_searcher
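
For example, combining ag with GNU parallel's pipe blocks might look like this (a sketch that assumes ag searches standard input when data is piped to it):

 # split a big file into 2 MB blocks and search each block with ag
 parallel --pipe --block 2M ag foo < bigfile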

Update: I now use https://github.com/BurntSushi/ripgrep , which is faster than ag depending on your use case.

+2
May 25 '16 at 7:32 a.m.

One thing I found that speeds up grep searches (especially with changing patterns) over a single large file is to use split + grep + xargs with its parallel flag. For example:

Say you have a file of identifiers you want to search for, called my_ids.txt, and the big file is called bigfile.txt.

Use split to split the file into parts:

 # Use split to split the file into x number of files; consider your big file
 # size and try to stay under 26 split files to keep the filenames
 # easy from split (xa[a-z]). In my example I have 10 million rows in bigfile
 split -l 1000000 bigfile.txt
 # Produces output files named xa[a-t]
 # Now use the split files + xargs to iterate and launch parallel greps with output
 for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep $id >> matches.txt ; done
 # Here you can tune your parallel greps with -P; in my case I am being greedy
 # Also be aware that there is no point in allocating more greps than split files

In my case, this cut what would have been a 17-hour job down to about 1 hour 20 minutes. I am sure there is some kind of bell-shaped efficiency curve, and obviously going beyond the available cores will not do you any good, but this was a much better solution than any of the comments above for my requirements. It also has the advantage over the parallel script of mostly using native (Linux) tools.

+1
Jun 23 '16 at 12:59

cgrep, if available, can be an order of magnitude faster than grep.

0
Aug 26 '13 at 12:29

MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.

https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep

https://metacpan.org/release/MCE

There is no need to convert to lowercase when you want -i to run fast. Just pass --lang=C to mce_grep.
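
The command line below is an assumption based on that description (mce_grep wraps the grep binaries, so ordinary grep options such as -i should pass through); check the script's usage output for the exact syntax:

 # assumed invocation: case-insensitive search in the C locale, no lowercasing needed
 mce_grep --lang=C -i 'pattern' /path/to/files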

The output order is preserved, and the output of -n and -b is also correct. Unfortunately, that is not the case for the GNU parallel approach mentioned on this page; I had really hoped GNU Parallel would work here. In addition, mce_grep does not spawn a sub-shell (sh -c /path/to/grep) when calling the binary.

Another alternative is the MCE::Grep module included with MCE.

0
Jan 21 '14 at 18:20

A slight departure from the original topic: the indexed-search command-line utilities from Google's codesearch project are much faster than grep: https://github.com/google/codesearch

After compiling it (Go is required), you can index a folder with:

 # index the current folder
 cindex .

The index will be created in ~/.csearchindex

Now you can search:

 # search folders previously indexed with cindex
 csearch eggs

I still pipe the results through grep to get colorized matches.
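
One way that re-filtering can look (assuming csearch prints the matching lines, so a second grep can add the highlighting):

 # re-filter csearch output through grep for color highlighting
 csearch eggs | grep --color=always eggs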

0
Oct 31 '17 at 16:16


