KeyNotFoundException using the HtmlEntity.DeEntitize() method

I am currently working on a scraper written in C# 4.0. I use various tools, including .NET's built-in WebClient and Regex classes. For part of my scraper, I am parsing an HTML document using HtmlAgilityPack. I got everything working the way I wanted and then moved on to cleaning up the code.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran some tests, and the method seemed to work fine. But once I used it in my actual code, I kept getting a KeyNotFoundException. The exception carries no further details, so I'm pretty lost. My code is as follows:

    WebClient client = new WebClient();
    string html = HtmlEntity.DeEntitize(client.DownloadString(path));
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

The loaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?

3 answers

As I understand it, the problem is caused by non-standard characters, for example Chinese or Japanese ones.

Once you find out which characters are causing the problem, perhaps you can find a suitable patch for HtmlAgilityPack here

This may help you if you want to modify the HtmlAgilityPack source yourself.
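To find out which characters are causing the problem, one option is to test each entity-like token on its own. The following is only a minimal sketch: EntityProbe and FindProblematicEntities are hypothetical names, and the decoder is passed in as a delegate so the helper itself does not depend on HtmlAgilityPack. It assumes the exception is triggered by individual tokens of the form &name; or &#123;, so it will not catch cases where stray text merely looks like an entity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityProbe
{
    // Matches anything shaped like an entity: "&name;", "&#123;" or "&#x1F;".
    private static readonly Regex EntityPattern =
        new Regex(@"&#?[A-Za-z0-9]+;", RegexOptions.Compiled);

    // Returns the distinct entity-like tokens in html for which decode throws
    // KeyNotFoundException. The decode delegate is an assumption of this sketch;
    // in the scraper it would be HtmlEntity.DeEntitize.
    public static List<string> FindProblematicEntities(string html, Func<string, string> decode)
    {
        var seen = new HashSet<string>();
        var failures = new List<string>();

        foreach (Match match in EntityPattern.Matches(html))
        {
            if (!seen.Add(match.Value))
                continue; // this token was already tested

            try
            {
                decode(match.Value);
            }
            catch (KeyNotFoundException)
            {
                failures.Add(match.Value);
            }
        }

        return failures;
    }
}
```

With HtmlAgilityPack referenced, the call would look like EntityProbe.FindProblematicEntities(html, HtmlEntity.DeEntitize), and the returned list shows which entities need a patch or a workaround.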


Four years later, and I have the same problem with some encoded characters (version 1.4.9.5). In my case, only a limited set of characters can cause the problem, so I just wrote a function to do the replacements:

    // to be called before HtmlEntity.DeEntitize
    public static string ReplaceProblematicHtmlEntities(string str)
    {
        var sb = new StringBuilder(str);
        //TODO: add other replacements, as needed
        return sb.Replace("&period;", ".")
            .Replace("&abreve;", "ă")
            .Replace("&acirc;", "â")
            .ToString();
    }

In my case, the string contains both HTML-encoded characters and UTF-8 characters, but the problem is only with some of the encoded characters.

This is not an elegant solution, but it is a quick fix for any text with a limited (and known) number of problematic encoded characters.


My HTML had this block of text:

... found in sections: 233.9 & 517.3; ...

Despite the space and the decimal point, DeEntitize interpreted & 517.3; as a Unicode character.

Simply HTML-encoding the source first fixed the problem for me.

    string raw = "sections: 233.9 & 517.3;";
    // turn '&' into '&amp;', etc., before DeEntitizing
    string encoded = System.Web.HttpUtility.HtmlEncode(raw);
    string deEntitized = HtmlEntity.DeEntitize(encoded);
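
Note that HtmlEncode escapes every ampersand, including those that begin real entities, so this fix suits text that contains no entities you still want decoded. For mixed input, a narrower variant is to escape only the ampersands that do not start a well-formed entity. The sketch below is an assumption on my part, not something HtmlAgilityPack provides: AmpersandEscaper and EscapeStrayAmpersands are hypothetical names, and the pattern only checks the shape of an entity, not whether its name is one the library knows.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandEscaper
{
    // An ampersand that is NOT followed by a well-formed entity body:
    // a name (&amp;), a decimal code (&#259;) or a hex code (&#x103;).
    private static readonly Regex StrayAmpersand = new Regex(
        @"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)",
        RegexOptions.Compiled);

    // Rewrites each stray '&' as "&amp;" and leaves entity-shaped tokens alone.
    public static string EscapeStrayAmpersands(string html)
    {
        return StrayAmpersand.Replace(html, "&amp;");
    }
}
```

The result would then be passed to HtmlEntity.DeEntitize in place of the fully encoded string. Entities with unknown names still pass through unchanged, so they may need separate handling.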
