KeyNotFoundException using the HtmlEntity.DeEntitize() method

Question

I am currently working on a scraper written in C# 4.0. I use various tools, including .NET's built-in WebClient and Regex classes. For part of the scraper, I parse an HTML document using HtmlAgilityPack. I got everything working the way I wanted and then moved on to cleaning up the code.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran some tests, and the method seemed to work fine. But once I used it in my actual code, I kept getting a KeyNotFoundException. The exception carries no further details, so I'm pretty lost. My code is as follows:

    WebClient client = new WebClient();
    string html = HtmlEntity.DeEntitize(client.DownloadString(path));
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

The loaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?

c# html-agility-pack keynotfoundexception

Sebastian brandes kraaijenzank Nov 07 '12 at 18:12

3 answers

Shoaib mohamed · Answer 1 · 2012-11-18T15:33:49+0000

As I understand it, the problem is caused by non-standard characters in the input, for example Chinese or Japanese characters.

Once you find out which characters are causing the problem, you may be able to find a suitable patch for HtmlAgilityPack here

This may help you if you want to modify the HtmlAgilityPack source yourself.
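One way to find out which entities trigger the exception is to probe them one at a time. The following is only a minimal sketch: the class name `EntityProbe` and the method `FindFailingEntities` are hypothetical, and the decoder is passed in as a delegate so the probe does not hard-code a library. It only examines well-formed references such as `&name;`, `&#123;` or `&#x1F;`, so malformed sequences will not be reported.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityProbe
{
    // Matches named references (&nbsp;) and numeric ones (&#160; or &#xA0;).
    private static readonly Regex EntityPattern =
        new Regex(@"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Returns the distinct entities for which `decode` throws KeyNotFoundException.
    // `decode` is an assumption of this sketch: pass HtmlEntity.DeEntitize
    // to probe HtmlAgilityPack, or any other string-to-string decoder.
    public static List<string> FindFailingEntities(string html, Func<string, string> decode)
    {
        var failing = new List<string>();
        var seen = new HashSet<string>();
        foreach (Match m in EntityPattern.Matches(html))
        {
            if (!seen.Add(m.Value)) continue; // this entity was already probed
            try
            {
                decode(m.Value);
            }
            catch (KeyNotFoundException)
            {
                failing.Add(m.Value);
            }
        }
        return failing;
    }
}
```

With HtmlAgilityPack referenced, a call along the lines of `EntityProbe.FindFailingEntities(html, HtmlEntity.DeEntitize)` should list the entities that need special handling.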

Alexei · Answer 2 · 2017-03-20T20:56:57+0000

Four years later, and I have the same problem with some encoded characters (version 1.4.9.5). In my case, only a limited set of characters can cause the problem, so I just wrote a function to perform the replacements:

    // To be called before HtmlEntity.DeEntitize
    public static string ReplaceProblematicHtmlEntities(string str)
    {
        var sb = new StringBuilder(str);
        // TODO: add other replacements, as needed
        return sb.Replace("&period;", ".")
                 .Replace("&abreve;", "ă")
                 .Replace("&acirc;", "â")
                 .ToString();
    }

In my case, the string contains both HTML-encoded characters and UTF-8 characters, but the problem occurs only with some of the encoded characters.

This is not an elegant solution, but it is a quick fix for text with a limited (and known) set of problematic encoded characters.
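If the list of problematic entities grows, a table-driven variant may be easier to maintain than a long chain of Replace calls. This is a sketch under assumptions, not part of the answer above: the class `EntityFixups`, the method `ReplaceKnownEntities` and the table contents are illustrative. It makes a single regex pass and rewrites only the entities present in the table, leaving everything else for HtmlEntity.DeEntitize to handle afterwards.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityFixups
{
    // Hypothetical lookup table: entity name (without '&' and ';') to replacement.
    // Extend it with whichever entities fail in your own input.
    private static readonly Dictionary<string, string> Replacements =
        new Dictionary<string, string>
        {
            { "period", "." },
            { "abreve", "ă" },
            { "acirc", "â" }
        };

    // Matches named entity references such as &acirc;
    private static readonly Regex NamedEntity =
        new Regex(@"&([A-Za-z][A-Za-z0-9]*);");

    // Replaces entities found in the table; unknown entities are returned unchanged.
    public static string ReplaceKnownEntities(string input)
    {
        return NamedEntity.Replace(input, m =>
        {
            string value;
            return Replacements.TryGetValue(m.Groups[1].Value, out value)
                ? value
                : m.Value;
        });
    }
}
```

The lookup is case-sensitive on purpose, since entity names such as `&acirc;` and `&Acirc;` denote different characters.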

djs · Answer 3 · 2017-05-10T19:29:47+0000

My HTML had this block of text:

... found in sections: 233.9 & 517.3; ...

Despite the space and the decimal point, DeEntitize interpreted & 517.3; as a Unicode character.

Simply HTML-encoding the source text first fixed the problem for me.

    string raw = "sections: 233.9 & 517.3;";
    // turn '&' into '&amp;', etc, before DeEntitizing
    string encoded = System.Web.HttpUtility.HtmlEncode(raw);
    string deEntitized = HtmlEntity.DeEntitize(encoded);
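Note that HtmlEncode also escapes the ampersand of entities that are already valid, so `&eacute;` becomes `&amp;eacute;`. If the input mixes real entities with stray ampersands, a narrower alternative (a different technique from the one above, and only a sketch) is to escape just the ampersands that do not start something shaped like an entity reference. The names `AmpersandEscaper` and `EscapeBareAmpersands` are hypothetical, and I have not verified the result against every HtmlAgilityPack version.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandEscaper
{
    // Matches an '&' that is NOT followed by a well-formed reference body:
    //   name;   #123;   #x1F;
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)");

    // Rewrites stray ampersands as &amp; and leaves well-formed references alone.
    public static string EscapeBareAmpersands(string input)
    {
        return BareAmpersand.Replace(input, "&amp;");
    }
}
```

The output would then be passed to HtmlEntity.DeEntitize as before. Well-formed references that the library does not know are left as they are, so this only addresses the stray-ampersand case.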
