Regex Extract html Body

Question

How can I use a regex to extract the body from an HTML document, given that the html and body tags may be upper case, lower case, or missing?

+5

c# regex vb.net

Bruce adams Jun 11 '09 at 17:32

3 answers

How about something like this?

It captures everything between the <body> and </body> tags (case-insensitively, because of RegexOptions.IgnoreCase) in a group named theBody.

RegexOptions.Singleline makes . match newline characters as well, so multi-line HTML is processed as if it were a single line.

If the HTML does not contain <body></body> tags, match.Success will be false.

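        // Requires: using System.Text.RegularExpressions;

        // Populate the html string here
        string html = string.Empty;

        // IgnoreCase matches <BODY> as well as <body>; Singleline lets . span newlines
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        // Note: a literal <body> does not match a body tag that carries attributes
        Regex regx = new Regex( "<body>(?<theBody>.*)</body>", options );

        Match match = regx.Match( html );

        if ( match.Success ) {
            // Everything between the opening and closing body tags
            string theBody = match.Groups["theBody"].Value;
        }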

+11

Darryl 17 . '09 15:04

This should be pretty close:

(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)
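
A minimal sketch of how a pattern along these lines might be applied in C#. The class and method names (BodyExtractor, ExtractBody) are made up for illustration. The attribute group is written as optional here so that a bare <body> tag with no attributes also matches, and the closing alternatives let the match end at </body>, at </html>, or at the end of the input when both are missing. It still returns null when there is no opening body tag at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // (?is) turns on IgnoreCase and Singleline inline.
    // Group 1 lazily captures everything after the opening body tag
    // up to </body>, </html>, or the end of the input.
    private static readonly Regex BodyPattern = new Regex(
        @"(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)");

    // Hypothetical helper: returns the body content,
    // or null when no opening <body> tag is found.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For example, BodyExtractor.ExtractBody("<html><BODY class=\"x\">Hello</Body></html>") returns "Hello", and BodyExtractor.ExtractBody("<body>Hi") returns "Hi" even though the closing tag is missing.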

0

Jeremy stein Jun 11 '09 at 19:55

Andrew Hare · Accepted Answer · 2009-06-11T17:33:56+0000

Do not use a regex for this; use something like the Html Agility Pack.

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you do not actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).

Then you can extract the body using XPath.
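
A rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (BodyReader, GetBodyHtml) are made up for illustration; HtmlDocument, LoadHtml, DocumentNode, SelectSingleNode and InnerHtml are the library's own members. As far as I know the parser lower-cases element names, so "//body" should also find <BODY>, and it does not invent a body element when the document has none, in which case this returns null.

```csharp
using HtmlAgilityPack;

public static class BodyReader
{
    // Hypothetical helper: returns the inner HTML of the first body element,
    // or null when the document contains no body tag.
    public static string GetBodyHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parse; malformed markup does not throw

        // XPath query for the first body element anywhere in the document
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body?.InnerHtml;
    }
}
```

Compared with the regex approaches, the parser deals with attributes, casing and unclosed tags itself, so the calling code only has to handle the case where no body element exists.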
