How to parse HTML to change all words

This seems to be a recurring question, but here it goes.

I have HTML that is well formatted (it comes from a controlled source, so this can be considered given). I need to iterate over the contents of the HTML body, look for all the words in the document, make some changes to these words, and save the results.

For example, I have a sample.html file, and I want to run it through my application and produce output.html, which matches the original exactly apart from my changes.

I found the following using HtmlAgilityPack, but all the examples I found look at the attributes of specific tags. Is there a slight modification that will look at the content instead and make my changes?

    HtmlDocument HD = new HtmlDocument();
    HD.Load(@"e:\test.htm");
    var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
    if (NoAltElements != null)
    {
        foreach (HtmlNode HN in NoAltElements)
        {
            HN.Attributes.Append("alt", "no alt image");
        }
    }
    HD.Save(@"e:\test.htm");

The above example finds img tags that have no alt attribute. I want to visit all tags inside <body> and do something with their content (which may include creating new tags in the process).

A very simple example of what I might do is take the following input:

    <html>
    <head><title>Some Title</title></head>
    <body>
    <h1>This is my page</h1>
    <p>This is a paragraph of text.</p>
    </body>
    </html>

and produce output in which the words alternate between being uppercased and being italicized:

    <html>
    <head><title>Some Title</title></head>
    <body>
    <h1>THIS <em>is</em> MY <em>page</em></h1>
    <p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
    </body>
    </html>

Ideas, suggestions?

2 answers

Personally, given this setup, I would use the InnerText property of HtmlNode to find the words (possibly with a Regex, so that punctuation is excluded rather than relying on spaces alone), and then make the changes through the InnerHtml property using repeated calls to Regex.Replace (there is a Regex.Replace overload that lets you specify both the starting position and the maximum number of replacements).

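Processing code:

    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    // "something" is a placeholder filter; substitute whatever condition selects the nodes to process.
    // ToList() takes a snapshot, because assigning InnerHtml changes the tree being enumerated.
    List<HtmlNode> nodes = doc.DocumentNode.DescendantNodes()
        .Where(n => n.InnerText == "something")
        .ToList();
    foreach (HtmlNode node in nodes)
    {
        string[] words = getWords(node.InnerText);
        node.InnerHtml = processHtml(node.InnerHtml, words);
    }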

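Identify the words (there may be an easier way to do this, but here is a first attempt):

    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    private string[] getWords(string text)
    {
        // \w+ matches runs of word characters, so punctuation and whitespace are skipped.
        Regex reg = new Regex(@"\w+");
        MatchCollection matches = reg.Matches(text);
        List<string> words = new List<string>();
        foreach (Match match in matches)
        {
            words.Add(match.Value);
        }
        return words.ToArray();
    }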

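Process the HTML:

    // Requires: using System; using System.Text.RegularExpressions;
    private string processHtml(string html, string[] words)
    {
        int startPosition = 0;
        foreach (string word in words)
        {
            // Find the next occurrence at or after the previous replacement.
            // Note: this searches raw markup, so a word that also appears in a
            // tag name or attribute can be matched there first.
            int index = html.IndexOf(word, startPosition, StringComparison.Ordinal);
            if (index < 0) continue;
            string replacement = alterWord(word);
            // Escape the word so it is matched literally; the evaluator inserts
            // the replacement verbatim (no "$" substitution).
            Regex reg = new Regex(Regex.Escape(word));
            html = reg.Replace(html, m => replacement, 1, index);
            // Resume after the inserted text so it is never matched again.
            startPosition = index + replacement.Length;
        }
        return html;
    }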

I will leave the details of alterWord() to you. :)
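
For illustration only, here is one way alterWord() could look if the goal is the alternating output from the question. The answer does not specify it, so the class name WordAlterer, the method AlterWord, and the counter-based alternation are all assumptions rather than part of the original approach.

```csharp
using System;

// Hypothetical helper: one possible shape for the alterWord() step that the
// answer leaves open. It keeps a counter so that successive calls alternate
// between uppercasing a word and wrapping it in <em>.
public class WordAlterer
{
    // Number of words altered so far by this instance.
    private int wordIndex = 0;

    public string AlterWord(string word)
    {
        bool uppercaseTurn = wordIndex % 2 == 0;
        wordIndex++;
        return uppercaseTurn
            ? word.ToUpperInvariant()
            : "<em>" + word + "</em>";
    }
}
```

Creating a new WordAlterer for each node would restart the alternation at the beginning of every heading or paragraph, while sharing a single instance would continue it across the whole document. Which of the two is wanted depends on the requirements, and the question does not say.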


Try .SelectNodes("//body//*"). This will give you all the elements inside any body element, at any depth.
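
A related variation, which is a different technique from the one suggested above, is to select the text nodes under body directly with "//body//text()" and rewrite each one, so that tag names and attributes are never touched. The sketch below is only an assumption about how this could be wired together: BodyWordRewriter, Rewrite, and the alterWord delegate are hypothetical names. It does not decode HTML entities (so "&amp;" would be seen as the word "amp"), and it would also rewrite text inside script or style elements placed in the body.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class BodyWordRewriter
{
    // Applies alterWord to every word in the text nodes under <body> and
    // returns the resulting markup. alterWord may return an HTML fragment.
    public static string Rewrite(string html, Func<string, string> alterWord)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null) return html;

        // ToList() takes a snapshot so the tree can be modified inside the loop.
        foreach (HtmlNode textNode in textNodes.ToList())
        {
            // Skip the whitespace-only nodes that sit between tags.
            if (string.IsNullOrWhiteSpace(textNode.InnerText)) continue;

            string replaced = Regex.Replace(
                textNode.InnerText, @"\w+", m => alterWord(m.Value));

            // Parse the new markup separately, then move the parsed nodes
            // into the position of the original text node.
            var fragment = new HtmlDocument();
            fragment.LoadHtml(replaced);

            HtmlNode parent = textNode.ParentNode;
            foreach (HtmlNode child in fragment.DocumentNode.ChildNodes.ToList())
            {
                parent.InsertBefore(child, textNode);
            }
            parent.RemoveChild(textNode);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

A call such as BodyWordRewriter.Rewrite(File.ReadAllText("sample.html"), w => w.ToUpperInvariant()) would uppercase every word in the body while leaving the head untouched. Any other transformation, including one that emits new tags, can be passed in the same way.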
