Why is grep so slow and memory intensive with the -w flag (--word-regexp)?

I have a file with a list of identifiers and a data file (~3.2 GB). I want to extract from the data file every line containing one of the identifiers, as well as the line that follows it. I did the following:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data 

This worked, but it also matched unwanted substrings: for example, the id EA4 also pulled in lines containing EA40.

So I tried the same command, but added the -w flag (--word-regexp) to the first grep to match whole words only. However, the command then ran for over an hour (instead of ~26 seconds) and started using 10 GB of memory, so I had to kill the job.

Why does adding -w make the command so slow and memory hungry? How can I run this efficiently to get the desired result? Thank you.

file.ids looks like this:

 >EA4
 >EA9

file.data looks like this:

 >EA4
 text data
 >E40
 blah more_data
 >EA9
 text_again data_here

output.data should look like this:

 >EA4
 text data
 >EA9
 text_again data_here
1 answer

grep -F string file simply looks for occurrences of string in the file, but grep -w -F string file also has to check the character before and after each match to see whether it is a word character. That is a lot of extra work, and one possible implementation would be to first split every line into every possible substring delimited by non-word characters, with overlaps, which could take a lot of memory; I don't know whether that is what is causing your memory usage, though.
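
As a rough illustration of those boundary checks (GNU grep shown; demo.txt is just a throwaway file made up for this example):

 $ printf '>EA4 text\n>EA40 blah\n' > demo.txt
 $ grep -F '>EA4' demo.txt    # plain fixed-string search also matches substrings
 >EA4 text
 >EA40 blah
 $ grep -wF '>EA4' demo.txt   # -w must also verify the characters around each match
 >EA4 text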

In any case, grep is simply the wrong tool for this job. Since you only want to match against a specific field in the input file, you should use awk instead:

 $ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
 >EA4
 text data
 >EA9
 text_again data_here
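
If the one-liner is hard to read, here is the same program spread out with comments (nothing new, just whitespace and annotations):

 awk '
     NR==FNR { ids[$0]; next }   # 1st file: store each id line as an array key
     /^>/    { f = ($1 in ids) } # id line in the data file: set flag if wanted
     f                           # while the flag is set, print the current line
 ' file.ids file.data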

The above assumes that your "data" lines cannot begin with > . If they can, then tell us how to distinguish data lines from id lines.

Note that the above will work no matter how many lines of data you have between the id lines, even if there are 0 or 100 of them:

 $ cat file.data
 >EA4
 text
 >E40
 blah
 more_data
 >EA9
 text_again
 data 1
 data 2
 data 3
 $ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
 >EA4
 text
 >EA9
 text_again
 data 1
 data 2
 data 3

In addition, you do not need to pipe the output to grep -v as in:

 grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data 

just do it all in one script:

 awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data 
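
and redirect to a file if you want the same output.data as before:

 awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data > output.data

One reason this is so much faster: awk keeps the ids as keys of an associative array, so each id line in the data file costs a single lookup of its first field, and no per-character word-boundary checking is needed at all.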
