How to match content between HTML tags with an attribute using grep?

What regular expression should I use with the grep command to match the text between the <div class="Message"> tag and its closing </div> in an HTML file?

+8
regex grep
3 answers

Here is one way, using GNU grep:

 grep -oP '(?<=<div class="Message"> ).*?(?= </div>)' file 
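As a rough illustration of what this command prints, here is a small sketch run against a made-up sample file (the file contents and the `sample` and `result` variable names are assumptions, not part of the question). It assumes GNU grep built with `-P` support. Note that the pattern as written expects a space after the opening tag and before the closing tag.

```shell
# Hypothetical sample file; the contents are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="Message"> Hello world </div>
<div class="Other"> skip me </div>
<div class="Message">no padding</div>
EOF

# -o prints only the matched part; -P enables Perl-style lookarounds (GNU grep).
result=$(grep -oP '(?<=<div class="Message"> ).*?(?= </div>)' "$sample")
echo "$result"   # prints: Hello world
rm -f "$sample"
```

The third line is skipped because it has no space between the tags and the text, so the lookbehind and lookahead do not match there.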

If your tags span multiple lines, try:

 < file tr -d '\n' | grep -oP '(?<=<div class="Message"> ).*?(?= </div>)' 
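A minimal sketch of the multi-line case, again on an invented sample file and assuming GNU grep with `-P`. It also shows one side effect worth knowing: deleting newlines with `tr -d` glues the last word of one line to the first word of the next, while replacing them with spaces keeps the words apart.

```shell
# Hypothetical sample: the first div is split across two lines.
sample=$(mktemp)
printf '%s\n' \
  '<div class="Message"> first' \
  'line </div>' \
  '<div class="Message"> second </div>' > "$sample"

pattern='(?<=<div class="Message"> ).*?(?= </div>)'

# Deleting newlines joins "first" and "line" into one word.
joined=$(< "$sample" tr -d '\n' | grep -oP "$pattern")
echo "$joined"   # prints "firstline", then "second"

# Replacing newlines with spaces keeps the words separate.
spaced=$(< "$sample" tr '\n' ' ' | grep -oP "$pattern")
echo "$spaced"   # prints "first line", then "second"
rm -f "$sample"
```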
+8

You can do this by specifying a regex:

 grep -E "^<div class=\"Message\">.*</div>$" input_files 

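Note that this only prints matches where the opening and closing tags are on the same line, and the ^ and $ anchors also require the tag to fill that whole line. If your tag spans multiple lines, you can try joining the lines first. Once the file is a single line, the anchors would match only if the entire file were one div, so drop them and use -o to print just the match (.* is greedy, so with several divs the match runs to the last </div>):

 tr '\n' ' ' < input_file | grep -oE "<div class=\"Message\">.*</div>" 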
+1

You cannot do this reliably with grep only. You need to parse HTML using an HTML parser.

What if the HTML code has something like:

 <!-- <div class="Message">blah blah</div> --> 

You will get a false hit on this commented-out code.
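The false hit is easy to reproduce. The sketch below uses an invented sample file and a simple unanchored pattern chosen for illustration (not one of the exact patterns from the other answers); grep has no notion of HTML comments, so it reports the div even though a browser would never render it.

```shell
# Hypothetical sample: the only Message div is inside an HTML comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!-- <div class="Message">blah blah</div> -->
<p>No live Message div in this file.</p>
EOF

# grep treats the comment as ordinary text and matches inside it.
hits=$(grep -o '<div class="Message">.*</div>' "$sample")
echo "$hits"   # prints: <div class="Message">blah blah</div>
rm -f "$sample"
```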

Consider using xmlgrep from the XML::Grep Perl module, as described here: Retrieve html file header using grep

+1