Spaces in the search pattern when matching with grep

I have a file that looks like this.

 10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
 10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
 10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
 10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
 10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
 17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
 15gs+VWW+A+210 11ba-VAL-A-47- 0.413793 0.215385

I want to grep the lines that match a pattern containing a space. Let's say the pattern is: '10gs+VWW+A+210 11ba-'

When I pass such a pattern directly as the grep argument, I get the correct lines. The problem arises when I want to match several patterns like this from a file, for example pattern.txt, which lists one pattern per line.

pattern.txt looks as follows:

 10gs+VWW+A+210 11ba-
 10gs+VWW+A+210 10gs-

When I use a shell script like this:

 for i in `cat pattern.txt`; do grep -e "^$i" bigfile.txt ; done 

The loop treats 10gs+VWW+A+210 and 11ba- as two separate patterns. I want to match the whole string, space included, i.e. 10gs+VWW+A+210 11ba-, as a single pattern rather than as two.

How do I modify the existing shell script so that the space character in the search pattern is preserved?

Also, the file I am matching these patterns against is large, ~50 GB, so a memory-efficient solution is welcome. Thanks.

+4
2 answers

Replace spaces with other characters

Assuming # never occurs in the patterns:

 for i in $( cat pattern.txt | tr ' ' '#' ) ; do
     j=$(echo "$i" | tr '#' ' ' )
     grep -e "^$j" bigfile.txt
 done

Timings on a test file:

 real 0m20.739s
 user 0m11.773s
 sys 0m8.345s
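The character swap above works around the shell's word splitting. As an alternative sketch, the patterns can be read one line at a time with `while IFS= read -r`, which keeps each line, embedded space included, as a single pattern. The file names are the ones from the question; the sample data and the `matches.txt` output file are only there to make the example runnable.

```shell
#!/bin/sh
# Sample data (a small stand-in for the real 50 GB file).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > bigfile.txt <<'EOF'
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
EOF
cat > pattern.txt <<'EOF'
10gs+VWW+A+210 11ba-
10gs+VWW+A+210 10gs-
EOF

# read -r takes one whole line at a time, so the embedded space is kept.
while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue      # skip blank lines
    grep -e "^$pattern" bigfile.txt
done < pattern.txt > matches.txt
cat matches.txt
```

This still runs grep once per pattern, so on a very large file it scans the data as many times as there are patterns.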

Use -f flag in grep

  grep -f pattern.txt bigfile.txt 

Timings on the same test file:

 real 0m2.190s
 user 0m2.163s
 sys 0m0.026s

In other words, with a large pattern file, grep -f is about 10 times faster than the loop.
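One difference worth noting: the loop anchors each pattern with ^, while plain grep -f matches the patterns anywhere in the line, and an empty line in the pattern file matches every line of the input. A minimal sketch that keeps the single-pass speed of -f while restoring the anchor is to preprocess the pattern file with sed. The `anchored.txt` and `matches.txt` names are hypothetical, as is the third sample record.

```shell
#!/bin/sh
# Sample data; the third line is a hypothetical record added to show anchoring.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > bigfile.txt <<'EOF'
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
99zz+VWW+A+210 10gs+VWW+A+210 11ba-SER-A-15- 0.400000
EOF
printf '%s\n' '10gs+VWW+A+210 11ba-' '' '10gs+VWW+A+210 10gs-' > pattern.txt

# Drop blank lines, then prefix every remaining pattern with ^.
sed -e '/^$/d' -e 's/^/^/' pattern.txt > anchored.txt

# One pass over the big file, all patterns at once.
grep -f anchored.txt bigfile.txt > matches.txt
cat matches.txt
```

In grep's default (basic) regex syntax the + characters are literal, so the patterns need no further escaping here.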

+1

Do the following command and its result work for you? The patterns are joined with a pipe (|) so that a line matching any of them is printed.

Command:

 egrep '10gs\+VWW\+A\+210 11ba-|10gs\+VWW\+A\+210 10gs-' bigfile.txt 

Result:

 10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
 10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
 10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
 10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
 10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
 10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
 10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
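The alternation does not have to be typed by hand. As a rough sketch, it can be generated from pattern.txt by escaping each + and joining the lines with |. The `regex` variable and `matches.txt` file are hypothetical names, and only + is escaped, on the assumption that the patterns contain no other extended-regex metacharacters. The example uses `grep -E`, the equivalent spelling of egrep.

```shell
#!/bin/sh
# Sample data (a small stand-in for the real file).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > bigfile.txt <<'EOF'
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
EOF
cat > pattern.txt <<'EOF'
10gs+VWW+A+210 11ba-
10gs+VWW+A+210 10gs-
EOF

# Drop blank lines, escape each literal + for extended regex syntax,
# then join all lines into one alternation separated by |.
regex=$(sed -e '/^$/d' -e 's/+/\\+/g' pattern.txt | paste -s -d '|' -)
printf '%s\n' "$regex"

grep -E "$regex" bigfile.txt > matches.txt
cat matches.txt
```

For a long pattern list the generated expression can grow past the command-line length limit, in which case passing the pattern file with grep -f is likely the more practical route.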
0

Source: https://habr.com/ru/post/1415824/