Non-greedy matching using ? with grep

I am writing a bash script that parses an HTML file, and I want to get the contents of each <tr>...</tr>. My command looks like this:

 $ tr -d \\012 < price.html | grep -oE '<tr>.*?</tr>' 

But grep seems to give me the same result as the greedy version:

 $ tr -d \\012 < price.html | grep -oE '<tr>.*</tr>' 

How can I make .* non-greedy?

+7
bash regex grep
4 answers

If you have GNU grep, you can use -P to make the match non-greedy:

 $ tr -d \\012 < price.html | grep -Po '<tr>.*?</tr>' 

The -P option enables Perl-Compatible Regular Expressions (PCRE), which is necessary for non-greedy matching with ?, because Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) do not support it.
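To see the difference, here is a small sketch. The `sample` string is a made-up stand-in for price.html after the newlines have been stripped, and the second command assumes a GNU grep built with -P support:

```shell
# Hypothetical sample: two table rows on a single line.
sample='<tr><td>1</td></tr><tr><td>2</td></tr>'

# Greedy ERE: .* runs on to the last </tr>, so both rows come back as one match.
greedy=$(printf '%s\n' "$sample" | grep -oE '<tr>.*</tr>')

# Non-greedy PCRE: .*? stops at the first </tr>, giving one match per row.
lazy=$(printf '%s\n' "$sample" | grep -oP '<tr>.*?</tr>')

printf 'greedy:\n%s\n\nnon-greedy:\n%s\n' "$greedy" "$lazy"
```

The greedy run prints a single line containing both rows, while the non-greedy run prints each row on its own line.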

If you use -P, you can also use lookaround to avoid printing the tags in the match, like this:

 $ tr -d \\012 < price.html | grep -Po '(?<=<tr>).*?(?=</tr>)' 
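As a quick check of the lookaround form, the sketch below runs it on a made-up one-line `sample` (not taken from the question) and again assumes GNU grep with -P:

```shell
# Hypothetical sample: two table rows on a single line.
sample='<tr><td>1</td></tr><tr><td>2</td></tr>'

# The lookbehind and lookahead must be present around the match,
# but they are not part of it, so only the row contents are printed.
inner=$(printf '%s\n' "$sample" | grep -Po '(?<=<tr>).*?(?=</tr>)')

printf '%s\n' "$inner"
```

Each output line holds the contents of one row, without the surrounding <tr> and </tr>.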

If you don't have GNU grep and the HTML is well-formed, you can simply do:

 $ tr -d \\012 < price.html | grep -o '<tr>[^<]*</tr>' 

Note: the above example will not work with nested tags inside <tr>, because [^<]* stops at the first < it meets.
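If nested tags do occur and grep -P is not available, one possible workaround is to do the non-greedy search by hand in awk. This is only a sketch: the function name `extract_rows` is made up for illustration, and it assumes rows open with a bare <tr> (no attributes) and are not nested inside one another.

```shell
# Hypothetical helper: read HTML on stdin, print each <tr>...</tr> on its own line.
extract_rows() {
  tr -d '\n' | awk '{
    s = $0
    # Find the next opening tag, then the first closing tag after it.
    while ((i = index(s, "<tr>")) > 0) {
      s = substr(s, i)
      j = index(s, "</tr>")
      if (j == 0) break              # unterminated row: stop
      print substr(s, 1, j + 4)      # j + 4 is the last character of the closing tag
      s = substr(s, j + 5)           # continue after the closing tag
    }
  }'
}

# Usage with made-up input; with a real file: extract_rows < price.html
printf '<tr>\n<td>a</td>\n</tr>\n<tr><td>b</td></tr>\n' | extract_rows
```

Because index() always returns the first occurrence, each match ends at the nearest </tr>, which is the same effect .*? has in PCRE.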

+14

Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep -E. Use grep -P instead if you have it, or switch to Perl / Python / Ruby / what have you. (Oh, and pcregrep.)

+4

.*? is Perl regular expression syntax. Change your grep to

 grep -oP '<tr>.*?</tr>' 
+3

Try Perl-style regexps:

 $ grep -Po '<tr>.*?</tr>' input
 <tr>stuff</tr>
 <tr>more stuff</tr>
+3