Erase ALL HTML from a string?

I saw a regex that can remove tags, which is great, but I also have things like &nbsp; and so on.

This isn't an HTML file; the HTML is inline. I'm pulling data from SharePoint web services, which return the HTML that users enter or that gets generated, such as:

<div>Hello! Please remember to clean the break room!!! &quot;bob&quote; <BR> </div>

So I'm processing 100-900 rows of 8-20 columns each.

1 answer

Take a look at the HTML Agility Pack, an HTML parser that you can use to extract the InnerText from the HTML nodes in a document.

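As has been pointed out many times on SO, you can't rely on regular expressions to parse HTML. There are cases where it may be acceptable (for very limited tasks); in general, though, HTML is too complex and too irregular. Parsing HTML with a regex tends to go wrong.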

With a parser such as HAP, you have much more flexibility. A (rough) example of how it could be used for this task:

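// Requires "using System.Text;" and a reference to HtmlAgilityPack.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("path to your HTML document"); // for an in-memory string, use doc.LoadHtml(html)

StringBuilder content = new StringBuilder();
foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
{
    // Leaf nodes only, so each piece of text is appended once.
    if (!node.HasChildNodes)
    {
        content.AppendLine(node.InnerText);
    }
}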
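
Since the question is about strings coming back from a web service rather than files, the loop above could be adapted roughly as follows. This is only a sketch: the HtmlText class and StripHtml method are hypothetical names made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. It uses LoadHtml to parse an in-memory string and HtmlEntity.DeEntitize to turn entities such as &nbsp; and &quot; into characters.

```csharp
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration; not part of HAP itself.
public static class HtmlText
{
    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse a string instead of loading a file

        var content = new StringBuilder();
        foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
        {
            // Keep text nodes only, which also skips comment nodes.
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Decode entities (e.g. &amp;) into plain characters.
                content.Append(HtmlEntity.DeEntitize(node.InnerText));
            }
        }

        return content.ToString().Trim();
    }
}
```

Calling HtmlText.StripHtml on each cell value would probably be enough for a few hundred rows; malformed entities such as the &quote; in the sample may be left as they are.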

You can also use an XPATH query to retrieve only the nodes you need:

var nodes = doc.DocumentNode.SelectNodes("your XPATH query here");
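
As one possible way to fill in the SelectNodes call above, the XPATH expression //text() selects every text node in the document. The sketch below assumes the HtmlAgilityPack NuGet package is referenced; the XPathText class and TextNodes method are hypothetical names used only for illustration. Note that SelectNodes returns null rather than an empty collection when nothing matches, so the result should be checked before use.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration; not part of HAP itself.
public static class XPathText
{
    public static List<string> TextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//text()" matches all text nodes anywhere in the document.
        var nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null)
        {
            // No match: SelectNodes gives null, so return an empty list.
            return new List<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0) // drop whitespace-only fragments
            .ToList();
    }
}
```

A narrower query, for example one limited to a particular element, can be substituted for //text() when only part of the markup is of interest.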

