AWK: using a field value in a regular expression

I am trying to match a pattern consisting of the word CONCLUSIONS, followed by the value of field $2 and then the value of field $3 from the same record, within field $5.

For example, my_file.txt uses "|" as the field separator:

 1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
 2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
 3|substance5|substance6|red|Substance5 interacts with substance6...|

So, in this example, I want only the first record to be printed, because its field $5 contains the word "CONCLUSIONS" followed by substance1 and then substance2.

This is what I have tried, but it does not work:

 awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt 

Any help is much appreciated

1 answer
 $ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
 1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|

How it works

  • BEGIN{FS="|";IGNORECASE=1}

    This part is unchanged from the code in the question. Note that IGNORECASE is a GNU awk (gawk) extension.

  • $5 ~ "conclusions.*" $2 ".*" $3

    This is a condition: it is true if $5 matches the regular expression formed by concatenating four strings: "conclusions.*", $2, ".*", and $3.

    We did not specify an action for this condition. Therefore, if the condition is true, awk performs the default action, which is to print the line.
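The concatenation described above can be tried on the question's sample data. The following is a minimal sketch rather than the answer's exact command: it assumes an awk without IGNORECASE and swaps in tolower() on both sides of the match to get case-insensitivity. The data variable is a hypothetical stand-in for my_file.txt.

```shell
# Hypothetical stand-in for my_file.txt (same three records as in the question)
data='1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|'

result=$(printf '%s\n' "$data" | awk -F'|' '
  # Lowercase the text and both field values, then match $5 against a
  # regex built by concatenating four strings (no action: print the record)
  tolower($5) ~ ("conclusions.*" tolower($2) ".*" tolower($3))
')
printf '%s\n' "$result"
```

With this sample, only the first record should be printed, as in the answer's output.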

Simple examples

Consider:

 $ echo "aa aa" | awk '$2 ~ /$1/' 

This prints nothing, because awk does not substitute variables inside a regular expression constant.

Note that no match is found here either:

 $ echo '$1' | awk '$0 ~ /$1/' 

There is no match because, inside a regular expression, $ matches only at the end of the line. Thus, /$1/ would only match an end of line followed by 1, which cannot occur. To get a match here, we need to escape the dollar sign:

 $ echo '$1' | awk '$0 ~ /\$1/'
 $1

To build a regex from awk variables, we can, as in the main command of this answer, do the following:

 $ echo "aa aa" | awk '$2 ~ $1'
 aa aa

This does produce a match.
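As a small side-by-side sketch of the examples above, the following captures the output of a regex constant and of a dynamic regex on the same input. The third command is an added illustration, not from the answer: it assumes you want $2 to equal $1 exactly, and builds an anchored pattern by concatenation. The variable names are hypothetical.

```shell
# Regex constant: the text between the slashes is used as is; $1 is not expanded
constant=$(echo "aa aa" | awk '$2 ~ /$1/')

# Dynamic regex: the string value of $1 ("aa") is used as the pattern
dynamic=$(echo "aa aa" | awk '$2 ~ $1')

# Dynamic regex built by concatenation: "^aa$" does not match "aab"
anchored=$(echo "aa aab" | awk '$2 ~ ("^" $1 "$")')

printf 'constant=[%s]\ndynamic=[%s]\nanchored=[%s]\n' "$constant" "$dynamic" "$anchored"
```

Only the dynamic form prints the line; the other two results stay empty.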

Further improvement

As Ed Morton suggests in the comments, it can be important to require that the substances match only as whole words. In that case, we can use the GNU awk word-boundary operators \\<...\\> to restrict each substance to whole-word matches. Thus:

 awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt 

This way, substance1 will not match substance10.
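For an awk that lacks the \\< and \\> operators, the same whole-word idea can be approximated. The sketch below swaps in a different technique and is not the answer's command: it lowercases the text, pads it with spaces, and requires a non-word character (anything outside a-z, 0-9 and _) on each side of the substance names. The sample data, including record 4, is hypothetical, and substance names containing regex metacharacters are not handled.

```shell
# Hypothetical sample: record 4 mentions substance10, not substance1
data='1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
4|substance1|substance2|red|Conclusions: substance10 and substance2 interact...|'

whole_words=$(printf '%s\n' "$data" | awk -F'|' '
{
  nw = "[^a-z0-9_]"                  # one non-word character
  text = " " tolower($5) " "         # padding lets nw match at both ends
  # conclusions ... <nw>$2<nw> ... <nw>$3<nw>; the optional group allows
  # a single separator character between the two substances
  re = "conclusions.*" nw tolower($2) nw "(.*" nw ")?" tolower($3) nw
  if (text ~ re) print $1
}')
printf '%s\n' "$whole_words"
```

Here only record 1 is reported, because in record 4 the name substance1 is followed by a digit.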


Source: https://habr.com/ru/post/1213773/

