Regex to Extract HTML Body

How can I use a regex to extract the body from an HTML document, given that the html and body tags may be uppercase, lowercase, or missing?

+5
3 answers

Do not use regex for this; use something like the Html Agility Pack.

This is a flexible HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you do not actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse HTML files found "out on the web". The parser is very tolerant of malformed "real world" HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).

Then you can extract the body using XPath.
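For illustration, here is a minimal sketch of that approach. It assumes the HtmlAgilityPack NuGet package is referenced; the class name BodyExtractor and the fallback to the whole document when no body element exists are my own choices, not something the library prescribes.

```csharp
using System;
using HtmlAgilityPack;

public static class BodyExtractor
{
    // Returns the inner HTML of the <body> element. If the document has
    // no <body> element, returns the whole document's markup instead
    // (an assumed fallback for the "tags are missing" case).
    public static string ExtractBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Html Agility Pack exposes element names in lowercase,
        // so "//body" should also find <BODY> or <Body>.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");

        return body != null ? body.InnerHtml : doc.DocumentNode.InnerHtml;
    }
}
```

Calling BodyExtractor.ExtractBody(html) then gives you the body markup as a string, without the caller having to care about tag case or a missing body element.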

+9

How about something like this?

It captures everything between the <body></body> tags (case-insensitively, because of RegexOptions.IgnoreCase) in a group named theBody.

RegexOptions.Singleline makes . match newlines as well, so multi-line HTML can be processed as a single string.

If the HTML does not contain <body></body> tags, match.Success will be false.

        // Requires: using System.Text.RegularExpressions;
        string html = string.Empty;

        // Populate the html string here

        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        Regex regx = new Regex( "<body>(?<theBody>.*)</body>", options );

        Match match = regx.Match( html );

        if ( match.Success ) {
            // The text between the tags is held in the named group
            string theBody = match.Groups["theBody"].Value;
        }
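
The snippet above only matches a bare <body> tag. As a rough sketch, the same named-group idea can be wrapped in a helper that also accepts attributes on the opening tag and falls back to the whole input when no body tags are found. The class name RegexBodyExtractor, the lazy quantifier, and the fallback are assumptions for illustration, not part of the answer above. Note that [^>]* will still trip over attribute values that contain a > character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBodyExtractor
{
    // <body\b[^>]*> allows attributes on the opening tag, e.g. <BODY class="x">.
    // .*? is lazy, so the capture stops at the first closing body tag.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(?<theBody>.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text between the body tags, or the unchanged input
    // when no matching pair of body tags is present (assumed fallback).
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups["theBody"].Value : html;
    }
}
```

For example, RegexBodyExtractor.ExtractBody("<HTML><BODY class=\"x\">Hi</BODY></HTML>") returns "Hi", while an input without body tags comes back as is.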
+11

This should be pretty close:

(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)
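
To see how a pattern like this behaves, here is a small sketch in C#. It uses a variant in which the attribute group after "body" is optional, so that a bare <body> tag matches too; the class name BodyPatternDemo and the sample inputs are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyPatternDemo
{
    // (?is) turns on IgnoreCase and Singleline inline.
    // The body ends at </body>, at </html>, or at the end of the input.
    private static readonly Regex Pattern = new Regex(
        @"(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)");

    // Returns the captured body text, or null if there is no opening body tag.
    public static string Extract(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("<BODY>upper</BODY>"));             // upper
        Console.WriteLine(Extract("<body id=\"b\">attrs</ body >"));  // attrs
        Console.WriteLine(Extract("<html><body>no close</html>"));    // no close
        Console.WriteLine(Extract("<body>unterminated"));             // unterminated
    }
}
```

The pattern still requires an opening body tag, so a document with no body tag at all yields null here and needs separate handling.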
0