How to extract body contents using regexp

I have this HTML in a variable:

<html> <head> . . anything . . </head> <body anything=""> content </body> </html> 

or

 <html> <head> . . anything . . </head> <body> content </body> </html> 

The result should be:

 content 
+6
3 answers

Note that the string-based answers given above should work in most cases. The main advantage of a regex solution is that it makes case-insensitive matching of the opening and closing body tags easier. If that does not concern you, there is no strong reason to use regular expressions here.
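A non-regex version using plain string searching might look like the sketch below. The function name `extractBodyByIndex` is hypothetical, and the sketch assumes lowercase tags and no `>` character inside the attributes of the opening tag.

```javascript
// Hypothetical helper: extracts the body contents with plain string searching.
// Assumes lowercase <body ...> and </body> tags.
// Returns null when either tag is missing.
function extractBodyByIndex(html) {
  const openStart = html.indexOf('<body');
  if (openStart === -1) return null;

  // End of the opening tag, so attributes such as anything="" are skipped.
  const openEnd = html.indexOf('>', openStart);
  if (openEnd === -1) return null;

  // Use the last closing tag so the whole body is captured.
  const closeStart = html.lastIndexOf('</body>');
  if (closeStart === -1 || closeStart < openEnd) return null;

  return html.slice(openEnd + 1, closeStart);
}
```

Because `indexOf` is case-sensitive, this returns null for `<BODY>`, which is exactly the case where the regex version is more convenient.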

And for people who see HTML and regex together and object: since you're not really trying to parse HTML here, this is something regular expressions can do. If for some reason the content itself contained </body>, it could fail, but apart from that you have a rather specific scenario where regular expressions are able to do what you want:

 // This line can be omitted: assign your string to the name strVal,
 // or pass your own string variable to the pattern.exec call below.
 const strVal = yourStringValue;
 const pattern = /<body[^>]*>((.|[\n\r])*)<\/body>/im;
 const array_matches = pattern.exec(strVal);

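After running the above, array_matches[1] will hold everything between the opening <body ...> tag and the closing </body> tag. If the string contains no body element, pattern.exec returns null, so check array_matches before indexing into it.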
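Wrapped in a function with a guard for the no-match case, the regex approach might look like the sketch below. The name `extractBody` is hypothetical, and `[\s\S]` is swapped in here as a common way to match any character, newlines included; it is not the exact pattern from the answer.

```javascript
// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when the input has no body element.
function extractBody(html) {
  // i: case-insensitive, so <BODY> and <Body> also match.
  // [\s\S]* is greedy, so the capture runs up to the last </body>.
  const pattern = /<body[^>]*>([\s\S]*)<\/body>/i;
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

const sample = '<html> <head> x </head> <body anything=""> content </body> </html>';
console.log(extractBody(sample)); // logs " content "
```

Returning null rather than throwing leaves it to the caller to decide how to handle input without a body element.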

+20
 // xhr is assumed to be a completed XMLHttpRequest instance, not the constructor itself.
 var matched = xhr.responseText.match(/<body[^>]*>([\w\W]*)<\/body>/im);
 if (matched) { alert(matched[1]); }
+1

I suggest loading your HTML document into a .NET HTMLDocument object and then just reading HTMLDocument.body.innerHTML.

I am sure the newer XDocument is even simpler and easier.

And just to repeat some of the comments above: a regular expression is not the best tool for this, since HTML is not a regular language, and there are edge cases that are difficult to handle.

https://en.wikipedia.org/wiki/Regular_language

Enjoy it!

-3