HttpUtility.HtmlEncode escapes too much?

Question

In our ASP.NET MVC 3 project, the HttpUtility.HtmlEncode method seems to escape too many characters. Our web pages are served as UTF-8, but the method escapes characters such as ü or the yen symbol ¥, even though these characters can be represented in UTF-8.

So, when my ASP.NET MVC view contains the following code snippet:

@("<strong>ümlaut</strong>")

I would expect the encoder to escape the HTML tags, but not the ü:

  &lt;strong&gt;ümlaut&lt;/strong&gt;

But instead, it gives me the following HTML snippet:

  &lt;strong&gt;&#252;mlaut&lt;/strong&gt;

For completeness, I should also mention that responseEncoding in web.config is explicitly set to utf-8, so I would expect the HtmlEncode method to respect this setting.

  <globalization requestEncoding="utf-8" responseEncoding="utf-8" />

+7

.net asp.net unicode utf-8 razor

Thomas Feb 03 '12 at 13:07

3 answers

Yes, I have the same problem with my web pages. If we look at the HtmlEncode code, there is a point where it translates this range of characters. Here is the part that converts characters from U+00A0 up to (but not including) U+0100 into numeric entities:

    if ((ch >= '\x00a0') && (ch < '\x0100'))
    {
        output.Write("&#");
        // The (int) cast is required; without it the character itself would be written
        output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
        output.Write(';');
    }
    else
    {
        output.Write(ch);
    }

Here is the HtmlEncode code

    public static unsafe void HtmlEncode(string value, TextWriter output)
    {
        if (value != null)
        {
            if (output == null)
            {
                throw new ArgumentNullException("output");
            }
            int num = IndexOfHtmlEncodingChars(value, 0);
            if (num == -1)
            {
                // Nothing to encode: write the string unchanged
                output.Write(value);
            }
            else
            {
                int num2 = value.Length - num;
                fixed (char* str = value)
                {
                    char* chPtr = str;
                    char* chPtr2 = chPtr;
                    // Copy the leading characters that need no encoding
                    while (num-- > 0)
                    {
                        output.Write(chPtr2[0]);
                        chPtr2++;
                    }
                    while (num2-- > 0)
                    {
                        char ch = chPtr2[0];
                        if (ch <= '>')
                        {
                            switch (ch)
                            {
                                case '&':
                                    output.Write("&amp;");
                                    chPtr2++;
                                    continue;
                                case '\'':
                                    output.Write("&#39;");
                                    chPtr2++;
                                    continue;
                                case '"':
                                    output.Write("&quot;");
                                    chPtr2++;
                                    continue;
                                case '<':
                                    output.Write("&lt;");
                                    chPtr2++;
                                    continue;
                                case '>':
                                    output.Write("&gt;");
                                    chPtr2++;
                                    continue;
                            }
                            output.Write(ch);
                            chPtr2++;
                            continue;
                        }
                        // !here is the point!
                        if ((ch >= '\x00a0') && (ch < '\x0100'))
                        {
                            output.Write("&#");
                            output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
                            output.Write(';');
                        }
                        else
                        {
                            output.Write(ch);
                        }
                        chPtr2++;
                    }
                }
            }
        }
    }
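To see which characters the marked range check affects, the condition can be isolated in a few lines. This is only a minimal sketch, not the framework's code: the class `Latin1RangeDemo` and the method `EncodeHighChar` are hypothetical names introduced for illustration, and only the range test itself is taken from the listing above.

```csharp
using System;
using System.Globalization;

// Hypothetical demo class; only the range check mirrors the decompiled listing.
public static class Latin1RangeDemo
{
    // Returns a numeric character reference for U+00A0..U+00FF
    // and the character itself for everything else.
    public static string EncodeHighChar(char ch)
    {
        if (ch >= '\x00a0' && ch < '\x0100')
        {
            return "&#" + ((int)ch).ToString(NumberFormatInfo.InvariantInfo) + ";";
        }
        return ch.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EncodeHighChar('\u00FC')); // ü (U+00FC) -> &#252;
        Console.WriteLine(EncodeHighChar('\u00A5')); // ¥ (U+00A5) -> &#165;
        Console.WriteLine(EncodeHighChar('\u20AC')); // € (U+20AC) is above the range, written as-is
        Console.WriteLine(EncodeHighChar('a'));      // ASCII is below the range, written as-is
    }
}
```

So the behaviour appears to depend only on the code point, which would explain why the responseEncoding setting makes no difference to the output.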

Possible solutions are to write your own HtmlEncode or to use the Anti-Cross Site Scripting (AntiXSS) library from Microsoft.

http://msdn.microsoft.com/en-us/security/aa973814

+2

Aristos Feb 03 '12 at 13:21

Based on Thomas's answer, this version slightly improves the handling of tabs and newlines, as they can break the HTML structure:

    // Relies on the toEscape dictionary from Thomas's HtmlEncoder class.
    public static string HtmlEncode(string value, bool removeNewLineAndTabs)
    {
        if (value == null) return string.Empty;

        // Init capacity to length of string to encode
        var builder = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            string result;
            bool success = toEscape.TryGetValue(c, out result);
            // toEscape stores bare entity names ("lt", "gt", ...), so wrap them in & and ;
            string character = success ? "&" + result + ";" : c.ToString();
            builder.Append(character);
        }

        string retVal = builder.ToString();
        if (removeNewLineAndTabs)
        {
            // Replace each line break or tab with a single space
            retVal = retVal.Replace("\r\n", " ");
            retVal = retVal.Replace("\r", " ");
            retVal = retVal.Replace("\n", " ");
            retVal = retVal.Replace("\t", " ");
        }
        return retVal;
    }

0

Ers Mar 27 '13 at 16:18

Thomas · Accepted Answer · 2012-02-06T09:12:33+0000

As Aristos said, we could use Microsoft's AntiXSS library. It contains a UnicodeCharacterEncoder, which behaves as you would expect.

But since we did not really want to depend on a third-party library just for HTML encoding, and we were sure that our content did not exceed the UTF-8 range, we decided to implement our own very simple HTML encoder. You can find the code below. Please feel free to adapt, comment on or improve it if you see any problems.

    using System.Collections.Generic;
    using System.Text;
    using System.Web; // IHtmlString

    public static class HtmlEncoder
    {
        // Maps each character to escape to its entity name (without & and ;)
        private static IDictionary<char, string> toEscape = new Dictionary<char, string>()
        {
            { '<', "lt" },
            { '>', "gt" },
            { '"', "quot" },
            { '&', "amp" },
            { '\'', "#39" },
        };

        /// <summary>
        /// HTML-Encodes the provided value
        /// </summary>
        /// <param name="value">object to encode</param>
        /// <returns>An HTML-encoded string representing the provided value.</returns>
        public static string Encode(object value)
        {
            if (value == null) return string.Empty;

            // If value is bare HTML, we expect it to be encoded already
            if (value is IHtmlString) return value.ToString();

            string toEncode = value.ToString();

            // Init capacity to length of string to encode
            var builder = new StringBuilder(toEncode.Length);
            foreach (char c in toEncode)
            {
                string result;
                bool success = toEscape.TryGetValue(c, out result);
                string character = success ? "&" + result + ";" : c.ToString();
                builder.Append(character);
            }
            return builder.ToString();
        }
    }
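As a quick check against the example from the question, the sketch below applies the same escaping rule to the string `<strong>ümlaut</strong>`. To keep it self-contained it uses a compact stand-in, `EscapeMarkupOnly` in a class `EncoderCheck` (both hypothetical names, not part of the class above), which escapes the same five characters as the `toEscape` table and leaves everything else untouched.

```csharp
using System;
using System.Text;

public static class EncoderCheck
{
    // Hypothetical stand-in for HtmlEncoder.Encode: escapes only < > " & '
    public static string EscapeMarkupOnly(string value)
    {
        if (value == null) return string.Empty;

        var builder = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '<': builder.Append("&lt;"); break;
                case '>': builder.Append("&gt;"); break;
                case '"': builder.Append("&quot;"); break;
                case '&': builder.Append("&amp;"); break;
                case '\'': builder.Append("&#39;"); break;
                default: builder.Append(c); break; // includes ü, ¥ and all other non-markup characters
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        string encoded = EscapeMarkupOnly("<strong>\u00FCmlaut</strong>");
        // The tags are escaped while the ü (U+00FC) passes through unchanged.
        Console.WriteLine(encoded == "&lt;strong&gt;\u00FCmlaut&lt;/strong&gt;"); // True
    }
}
```

Note that the encoder returns an ordinary string, so in a Razor view the result would presumably have to be emitted through `Html.Raw(...)`; otherwise the `@` expression would encode it a second time.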
