AWK: using a field value in a regular expression

Question

I am trying to match a pattern consisting of the word CONCLUSIONS, followed by the value of field $2 and then the value of field $3 from the same record, within field $5.

For example, my_file.txt is delimited by the "|" character:

 1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
 2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
 3|substance5|substance6|red|Substance5 interacts with substance6...|

So, in this example, I want the first record to be printed because it contains the word "CONCLUSIONS" followed by substance1 and then substance2.

This is what I am trying, but it is not working:

 awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt

Any help is much appreciated

regex awk

Hallucigeniak Feb 20 '15 at 2:29

1 answer

John1024 · Accepted Answer · 2015-02-20T02:53:30+0000

 $ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
 1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|

How it works

BEGIN{FS="|";IGNORECASE=1}
This part is unchanged from the code in the question. Note that IGNORECASE is a GNU awk (gawk) extension; in other awk implementations it is an ordinary variable with no effect on matching.
$5 ~ "conclusions.*" $2 ".*" $3
This is a condition: it is true if $5 matches the regular expression formed by concatenating four strings: "conclusions.*", $2, ".*", and $3.
No action is specified for this condition. Therefore, when the condition is true, awk performs the default action, which is to print the line.
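
To see what the concatenation produces, one can print the assembled string instead of matching against it. The following is a small sketch, not part of the original answer; the input is a single made-up record in the question's format, with the last field shortened to x:

```shell
# Print the regex string that awk builds for one record.
# The input line is a shortened, hypothetical record in the question's format.
regex=$(printf '1|substance1|substance2|red|x|\n' |
    awk 'BEGIN{FS="|"} { print "conclusions.*" $2 ".*" $3 }')
echo "$regex"
```

The printed string is what awk then interprets as a regular expression on the right-hand side of ~.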

Simple examples

Consider:

 $ echo "aa aa" | awk '$2 ~ /$1/'

This prints nothing because awk does not expand variables or field references inside a regular expression constant (/.../).

Note that this does not produce a match either:

 $ echo '$1' | awk '$0 ~ /$1/'

There is no match because, inside a regular expression, $ matches only at the end of the line. Thus, /$1/ would match only an end of line followed by 1, which can never occur. To get a match here, we need to escape the dollar sign:

 $ echo '$1' | awk '$0 ~ /\$1/'
 $1

To build a regex from awk variables, we can do the following, which is the basis for this answer:

 $ echo "aa aa" | awk '$2 ~ $1'
 aa aa

This does produce a match.
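
One consequence of building the regex from field values is that any regex metacharacters in a field (such as . or +) keep their special meaning. If a literal substring test is wanted instead, awk's index() function is one option. A minimal sketch with made-up input:

```shell
# "a.c" used as a dynamic regex: "." matches any character, so "abc" matches
# and the line is printed.
as_regex=$(echo "a.c abc" | awk '$2 ~ $1')

# index() searches for a literal substring, so "a.c" is not found in "abc"
# and nothing is printed.
as_literal=$(echo "a.c abc" | awk 'index($2, $1)')

echo "regex match:   [$as_regex]"
echo "literal match: [$as_literal]"
```

For the substance names in the question this makes no difference, since they contain no metacharacters.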

Further improvement

As Ed Morton suggests in the comments, it may be important to require that the substances match only as whole words. In that case, we can use \\<...\\> to restrict each substance to whole-word matches. Thus:

 awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt

With this change, substance1 will not match substance10.
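
IGNORECASE and \\<...\\> are gawk extensions. Where gawk is not available, a similar effect can be approximated in POSIX awk. The sketch below is one possible approach, not part of the original answer: it lowercases the text with tolower(), replaces every run of non-word characters with two spaces, and then looks for the substances surrounded by spaces. It assumes the substance names contain only letters, digits, and underscores. The second input record is made up to show the substance10 case.

```shell
result=$(awk -F'|' '
{
    text = tolower($5)
    # Turn every run of non-word characters into two spaces, so that each
    # word has its own space on both sides even when two words are adjacent.
    gsub(/[^a-z0-9_]+/, "  ", text)
    text = " " text " "
}
# Default action (print the record) runs when the padded text matches.
text ~ ("conclusions.* " tolower($2) " .* " tolower($3) " ")
' <<'EOF'
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE10 and SUBSTANCE2 in humans...|
EOF
)
echo "$result"
```

Only the first record should be printed, because in the second record substance1 appears only as part of substance10.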
