HTML and attribute encoding

I came across a post on Meta SO, and I wonder what the subtle differences between HTML encoding and attribute encoding are.

+7
2 answers

HTML encoding replaces certain characters that are semantically significant in HTML markup with equivalent entity references that can be displayed to the user without affecting how the markup is parsed.

The most significant and obvious characters are <, >, &, and ", which are replaced by &lt;, &gt;, &amp;, and &quot; respectively. In addition, an encoder may replace high-order (non-ASCII) characters with the equivalent HTML entity encoding, so the content is preserved and displays correctly even if the page is sent to the browser as ASCII.
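As a minimal sketch of the HTML encoding described above, Python's standard `html.escape` handles the markup characters, and the `xmlcharrefreplace` error handler turns non-ASCII characters into numeric character references. The function name `html_encode` is just an illustrative choice, and `html.escape` also escapes the single quote, which goes slightly beyond the four characters listed here.

```python
import html


def html_encode(text: str) -> str:
    """HTML-encode text so it can be emitted as pure ASCII markup."""
    # Escape the characters that are significant in markup: & < > "
    # (html.escape with quote=True also turns ' into &#x27;).
    escaped = html.escape(text, quote=True)
    # Replace every non-ASCII character with a numeric character
    # reference such as &#233; so the result survives an ASCII transport.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(html_encode('<b>café & "tea"</b>'))
```

The browser decodes the references when rendering, so the user sees the original text rather than the entities.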

HTML attribute encoding, on the other hand, replaces only the subset of those characters that matters for preventing a string from breaking out of the attribute of an HTML element. Specifically, you would typically just replace ", &, and < with &quot;, &amp;, and &lt;. This is because of the nature of attributes, the data they contain, and how a browser or HTML parser parses and interprets them, as opposed to how it reads an HTML document and its elements.
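A minimal sketch of that narrower attribute encoding might look like the following. The name `attr_encode` is hypothetical, and the sketch assumes the value is always placed inside a double-quoted attribute; it is not sufficient for unquoted or single-quoted attributes.

```python
def attr_encode(value: str) -> str:
    """Encode a string for use inside a double-quoted HTML attribute."""
    # Replace & first so the entities produced by the later
    # replacements are not escaped a second time.
    return (
        value.replace("&", "&amp;")
        .replace('"', "&quot;")
        .replace("<", "&lt;")
    )


if __name__ == "__main__":
    tooltip = 'Say "hi" & <wave>'
    # The encoded value cannot close the surrounding double quotes.
    print(f'<a href="#" title="{attr_encode(tooltip)}">link</a>')
```

Note that > is left alone here, which is the main visible difference from the HTML encoding described earlier.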


In terms of how this relates to XSS, you want to properly sanitize strings from an outside source (such as a user) so that they don't break your page or, more importantly, inject markup and script that can alter or destroy your application or affect your users' machines (by exploiting vulnerabilities in the browser or platform).

If you want to display user-generated content in your page, HTML-encode the string and then emit it in your markup; everything the user entered will be displayed literally, without you having to worry about XSS or broken markup.

If you need to attach user-generated content to an element in an attribute (for example, a tooltip on a link), you need to attribute-encode it to make sure the content does not break the element's markup.

Could you use the same function for HTML encoding to handle attribute encoding? Technically, yes. In the case of the Meta question you linked, it looks like they take HTML that has already been encoded, decode it, and then use that result as the attribute value, which causes the encoded markup to be displayed literally, if you follow.
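To illustrate the "technically, yes" point, here is a rough sketch that puts the output of a general HTML encoder (`html.escape`) into a double-quoted attribute and reads it back with the standard library's `HTMLParser`, which decodes attribute values once, much as a browser would. The `TitleGrabber` class and `round_trip` helper are hypothetical names, and the `times=2` case is only a generic illustration of mismatched encode/decode steps, not a reconstruction of the exact bug from the Meta post.

```python
import html
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collect the decoded value of every title attribute."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "title":
                self.titles.append(value)


def round_trip(tooltip: str, times: int = 1) -> str:
    """Encode tooltip `times` times, embed it, and return what a parser sees."""
    encoded = tooltip
    for _ in range(times):
        encoded = html.escape(encoded, quote=True)
    parser = TitleGrabber()
    parser.feed(f'<a href="#" title="{encoded}">link</a>')
    parser.close()
    return parser.titles[0]


if __name__ == "__main__":
    print(round_trip('"><script>alert(1)</script>'))  # original text comes back
    print(round_trip("<b>", times=2))                 # entities show literally
```

Encoding once round-trips cleanly; encoding twice leaves one layer of entities visible to the user, which is the kind of symptom a mismatched number of encode and decode steps produces.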

+9

I would recommend looking at Rules 1 and 2 of the OWASP XSS Prevention Cheat Sheet.
Short summary:

Rule 1 for HTML

Escape the following characters with HTML entity encoding:
& → &amp;
< → &lt;
> → &gt;
" → &quot;
' → &#x27;
/ → &#x2F;
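The Rule 1 mapping above can be sketched as a single-pass translation table. The names `RULE1_ENTITIES` and `escape_html_rule1` are illustrative only and do not come from OWASP.

```python
# Rule 1 mapping: each significant character and its entity.
RULE1_ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "/": "&#x2F;",
}

# str.translate walks the input once, so the & inside an entity that
# was just produced is never escaped a second time.
_RULE1_TABLE = str.maketrans(RULE1_ENTITIES)


def escape_html_rule1(text: str) -> str:
    """Escape text for insertion into HTML element content."""
    return text.translate(_RULE1_TABLE)


if __name__ == "__main__":
    print(escape_html_rule1("</script> & 'x'"))
```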

Rule 2 for common HTML attributes

Except for alphanumeric characters, escape all characters with ASCII values less than 256 using the &#xHH; format (or a named entity, if one is available) to prevent switching out of the attribute. The reason this rule is so broad is that developers frequently leave attributes unquoted. Properly quoted attributes can only be escaped with the corresponding quote. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and |.
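A minimal sketch of Rule 2 as summarized above could look like this. The name `escape_attr_rule2` is hypothetical, and the sketch assumes "alphanumeric" means ASCII letters and digits only, so characters in the 128 to 255 range are escaped as well; it always uses the numeric form rather than named entities.

```python
def escape_attr_rule2(value: str) -> str:
    """Escape a string for use in a common (non-URL, non-script) attribute."""
    out = []
    for ch in value:
        code = ord(ch)
        is_ascii_alnum = ch.isascii() and ch.isalnum()
        if code < 256 and not is_ascii_alnum:
            # Two-digit uppercase hex, e.g. a space becomes &#x20;
            out.append(f"&#x{code:02X};")
        else:
            # ASCII letters/digits and characters >= 256 pass through.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Without escaping, this value would add an event handler
    # to a tag whose attribute was left unquoted.
    print(escape_attr_rule2("x onmouseover=alert(1)"))
```

Because the space and = are escaped, the value stays a single attribute value even if the surrounding quotes are missing.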

+3
