Space in the search bar when matching with grep.

Question

I have a file that looks like this.

 10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
 10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
 10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
 10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
 10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
 17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
 15gs+VWW+A+210 11ba-VAL-A-47- 0.413793 0.215385

I want to grep lines matching a pattern that includes a space. Let's say the pattern is: '10gs+VWW+A+210 11ba-'

When I give such a pattern directly as the grep argument, I get the correct lines. However, the problem arises when I want to match several patterns like this from a file, for example pattern.txt, which lists one pattern per line.

pattern.txt as follows:

 10gs+VWW+A+210 11ba-
 10gs+VWW+A+210 10gs-

When I use the shell script as follows:

 for i in `cat pattern.txt`; do grep -e "^$i" bigfile.txt ; done

The loop takes 10gs+VWW+A+210 and 11ba- as two separate patterns for matching. I want to match the whole line (including the space), i.e. 10gs+VWW+A+210 11ba-, not the two parts separately.

How do I modify the existing shell script so that the space character in the search string is preserved?

Also, the file I am matching these patterns against is large, ~50 GB, so a memory-efficient solution is welcome. Thanks.

grep

ana Jun 03 '12 at 15:09

2 answers

Dmitri Chubarov · Answer 1 · 2012-06-03T15:22:32+0000

Replace spaces with other characters

Assuming # never occurs in the patterns:

  for i in $( cat pattern.txt | tr ' ' '#' ) ; do
    j=$(echo "$i" | tr '#' ' ' )
    grep -e "^$j" bigfile.txt
  done
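The placeholder character can be avoided altogether by reading pattern.txt one line at a time instead of word-splitting the output of cat. The following is a minimal sketch, not a tested solution for the 50 GB case; the temporary directory and the small sample files are made up here so that the example runs on its own.

```shell
#!/bin/sh
# Hypothetical demo files standing in for pattern.txt and bigfile.txt.
tmp=$(mktemp -d)
printf '%s\n' '10gs+VWW+A+210 11ba-' '10gs+VWW+A+210 10gs-' > "$tmp/pattern.txt"
printf '%s\n' \
  '10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872' \
  '10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846' \
  '17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282' > "$tmp/bigfile.txt"

# IFS= keeps leading/trailing blanks, -r keeps backslashes; each line
# (spaces included) ends up in $pattern as a single string.
result=$(
  while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue   # skip empty lines in the pattern file
    grep -e "^$pattern" "$tmp/bigfile.txt"
  done < "$tmp/pattern.txt"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the for loop, this still scans the big file once per pattern, so it only fixes the splitting problem, not the run time.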

Timings on a test file:

 real 0m20.739s
 user 0m11.773s
 sys 0m8.345s

Use -f flag in grep

  grep -f pattern.txt bigfile.txt

Timings on the same test file:

 real 0m2.190s
 user 0m2.163s
 sys 0m0.026s

In other words, with a large pattern file grep -f is about 10 times faster than the loop, since it reads the big file only once.
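Two things differ between plain grep -f and the loop: the patterns are no longer anchored with ^, and an empty line in pattern.txt acts as an empty pattern that matches every line. A possible way to handle both, sketched here with made-up sample files and a hypothetical helper file anchored.txt, is to preprocess the pattern file with sed first.

```shell
#!/bin/sh
# Hypothetical demo files; note the empty line in the pattern file.
tmp=$(mktemp -d)
printf '%s\n' '10gs+VWW+A+210 11ba-' '' '10gs+VWW+A+210 10gs-' > "$tmp/pattern.txt"
printf '%s\n' \
  '10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872' \
  '10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846' \
  '17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282' > "$tmp/bigfile.txt"

# Delete empty lines, then put a ^ in front of every remaining pattern.
sed -e '/^$/d' -e 's/^/^/' "$tmp/pattern.txt" > "$tmp/anchored.txt"

# One pass over the big file with all anchored patterns.
result=$(grep -f "$tmp/anchored.txt" "$tmp/bigfile.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Matches come out in the order they appear in the big file, not grouped by pattern as with the loop.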

Sicco · Answer 2 · 2012-06-03T15:23:36+0000

Do the following command and its result work for you? The patterns must be separated by a pipe so that a line matching any of them is printed.

Command:

 egrep '10gs\+VWW\+A\+210 11ba-|10gs\+VWW\+A\+210 10gs-' bigfile.txt

Result:

 10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
 10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
 10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
 10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
 10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
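With many patterns, typing the alternation by hand gets tedious. As a rough sketch, it could be generated from pattern.txt by escaping the extended-regex metacharacters with sed and joining the lines with paste; the sample files below are invented for the demonstration, and grep -E is used as the equivalent of egrep.

```shell
#!/bin/sh
# Hypothetical demo files standing in for pattern.txt and bigfile.txt.
tmp=$(mktemp -d)
printf '%s\n' '10gs+VWW+A+210 11ba-' '10gs+VWW+A+210 10gs-' > "$tmp/pattern.txt"
printf '%s\n' \
  '10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872' \
  '10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846' \
  '17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282' > "$tmp/bigfile.txt"

# Drop empty lines, backslash-escape ERE metacharacters (here only the +
# signs are affected), then join all lines with | into one expression.
alternation=$(sed -e '/^$/d' -e 's/[][\.*^$+?(){}|]/\\&/g' "$tmp/pattern.txt" | paste -sd'|' -)

result=$(grep -E "$alternation" "$tmp/bigfile.txt")
printf '%s\n' "$alternation"
printf '%s\n' "$result"
rm -rf "$tmp"
```

For a very long pattern list the resulting command line may become unwieldy, in which case reading the patterns from a file is probably the more practical route.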
