How to clean badly formed HTML using HTML Agility Pack

I am trying to replace the god-awful collection of regular expressions that is currently used to clean up blocks of poorly formed HTML, and I stumbled upon the HTML Agility Pack for C#. It looks very powerful, but I could not find an example of the one thing I want to do with it, which I would expect to be built-in functionality. I am sure I am just being an idiot and missing the right method in the documentation.

Let me explain. Say I have the following HTML:

<p class="someclass"> <font size="3"> <font face="Times New Roman"> this is some text <a href="somepage.html">Some link</a> </font> </font> </p> 

...which I want to look like this:

 <p> this is some text <a href="somepage.html">Some link</a> </p> 

When I use the HtmlNode.Remove() method, it removes the node plus all of its children. Is there a way to remove a node while keeping its children?

Thanks:)

+7
3 answers

HtmlNode has this overload of the RemoveChild method:

 public HtmlNode RemoveChild(HtmlNode oldChild, bool keepGrandChildren); 

So here is how you do it:

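    HtmlDocument doc = new HtmlDocument();
    doc.Load("yourfile.htm");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection fonts = doc.DocumentNode.SelectNodes("//font");
    if (fonts != null)
    {
        foreach (HtmlNode font in fonts)
        {
            font.ParentNode.RemoveChild(font, true);
        }
    }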

EDIT: It appears that RemoveChild with the keepGrandChildren parameter does not work as expected, so here is an alternative implementation:

    // Requires: using System; using System.Collections.Generic;
    //           using System.Linq; using HtmlAgilityPack;
    // Both members below are meant to live inside a class of your own.

    public static HtmlNode RemoveChild(HtmlNode parent, HtmlNode oldChild, bool keepGrandChildren)
    {
        if (oldChild == null)
            throw new ArgumentNullException("oldChild");

        if (oldChild.HasChildNodes && keepGrandChildren)
        {
            HtmlNode prev = oldChild.PreviousSibling;
            List<HtmlNode> nodes = new List<HtmlNode>(oldChild.ChildNodes.Cast<HtmlNode>());

            // Sorted last-to-first, so that inserting each node right after
            // 'prev' leaves the grandchildren in their original order.
            nodes.Sort(new StreamPositionComparer());
            foreach (HtmlNode grandchild in nodes)
            {
                parent.InsertAfter(grandchild, prev);
            }
        }
        parent.RemoveChild(oldChild);
        return oldChild;
    }

    // This helper class sorts nodes by their position in the file (descending).
    private class StreamPositionComparer : IComparer<HtmlNode>
    {
        int IComparer<HtmlNode>.Compare(HtmlNode x, HtmlNode y)
        {
            return y.StreamPosition.CompareTo(x.StreamPosition);
        }
    }
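
For the exact markup in the question, the same idea can also be sketched without sorting: move each child in front of the unwanted node, then remove the node. This is only a minimal sketch, not part of HTML Agility Pack itself. The names HtmlCleaner, Unwrap and Clean are made up for illustration, and dropping the class attribute from every p element is an assumption based on the desired output shown in the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner // hypothetical helper, not a library type
{
    // Replaces every element matched by the XPath expression with its own children.
    public static void Unwrap(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection matches = doc.DocumentNode.SelectNodes(xpath);
        if (matches == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in matches)
        {
            HtmlNode parent = node.ParentNode;
            if (parent == null) continue;

            // Copy the child list first so moving nodes cannot disturb the enumeration.
            foreach (HtmlNode child in node.ChildNodes.ToList())
            {
                parent.InsertBefore(child, node);
            }
            parent.RemoveChild(node);
        }
    }

    // Unwraps all <font> elements and strips the class attribute from <p> elements.
    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Unwrap(doc, "//font");

        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                p.Attributes.Remove("class"); // assumption: class is not wanted on <p>
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling HtmlCleaner.Clean on the HTML from the question should leave the text and the link inside a plain p element. The whitespace that sat between the font tags is kept, so the result may contain a few extra spaces compared with the hand-written target.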
+6

You can try using AngleSharp instead: https://github.com/AngleSharp/AngleSharp

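    // Body of a method that takes the raw markup in 'html' and returns a string.
    var parser = new HtmlParser();
    // Note: in AngleSharp 0.10 and later this method is named ParseDocument.
    var document = parser.Parse(html);
    using (var writer = new StringWriter())
    {
        document.ToHtml(writer, new PrettyMarkupFormatter());
        return writer.ToString();
    }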
+1

Once you have found the element, read its InnerText property to get the text. Then remove the element and insert the text in its place.

-1