When do I need to escape an HTML string?

In my previous project, I saw escapeHtml being called on a string before it was sent to the browser.

StringEscapeUtils.escapeHtml(stringBody); 

I know from the API doc what escapeHtml does. Here is an example:

 For example: "bread" & "butter" becomes: &quot;bread&quot; &amp; &quot;butter&quot;. 

My understanding is that when we send the string after HTML-escaping it, it is the browser's responsibility to convert it back to the original characters. Is that correct?

But I don't understand why and when escaping is required. What happens if we send the string body without HTML escaping? What is the cost of not calling escapeHtml before sending it to the browser?

+7
4 answers

I can think of several possibilities to explain why sometimes a string is not escaped:

  • Perhaps the original programmer was sure that in certain places the string contains no special characters. (In my opinion this would be bad programming practice; escaping the string as protection against future changes costs very little.)
  • The string was already escaped at that point in the code. You definitely don't want to escape a string twice; the user will see an escape sequence instead of the intended text.
  • The string was itself HTML. You do not want to escape HTML; you want the browser to process it!
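
The double-escaping problem from the second point can be reproduced with a minimal sketch. The escapeHtml method below is a hypothetical stand-in written for this illustration (it handles only &, <, > and "); it is not the Commons Lang implementation.

```java
final class DoubleEscapeSketch {
    // Hypothetical stand-in for a library escaper; handles only & < > ".
    // The ampersand is replaced first so that the entities added by the
    // later replacements are not escaped again within the same call.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String raw = "\"bread\" & \"butter\"";
        String once = escapeHtml(raw);   // what should be sent to the browser
        String twice = escapeHtml(once); // what is sent if the string is escaped again

        System.out.println(once);
        System.out.println(twice);
    }
}
```

A browser decodes entities only once, so the twice-escaped string would be shown to the user as the literal text &quot;bread&quot; &amp; &quot;butter&quot; rather than the original.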

EDIT - The reason for escaping is that special characters such as & and < may cause the browser to display something other than what you intended. A bare & is technically an error in HTML. Most browsers try to deal reasonably with such errors and in most cases will display them correctly. (This will almost certainly be the case for your sample text if the string were, for example, the text inside a <div>.) However, since this is bad markup, some browsers will not handle it well; assistive technologies (such as text-to-speech) may fail; and there may be other problems.

There are several cases that will fail, despite all the efforts of the browser to recover from incorrect markup. If your sample string was an attribute value, escaping quotes would be absolutely necessary. There is no way that the browser is going to correctly process something like:

 <img alt=""bread" & "butter"" ... > 

The general rule is that any character that is not markup, but could be confused for markup, must be escaped.

Note that there are several contexts in which text can appear in an HTML document, and they have separate escaping requirements. The following should be escaped:

  • all characters that are not represented in the document character set (unlikely if you use UTF-8, but this is not always the case)
  • Within attribute values, quotation marks ( ' or " , whichever matches the delimiter used for the attribute value itself) and ampersands ( & ), but not <
  • In text nodes, only & and <
  • Within href values, characters that need to be escaped in a URL (and sometimes they need to be escaped twice so that they are still escaped after the browser has unescaped them once)
  • Inside a CDATA block, as a rule, nothing (at the HTML level).
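
The per-context rules above could be sketched roughly as follows. All method names, the search URL, and the page parameter are made up for this illustration; URLEncoder.encode with a Charset argument assumes Java 10 or later, and it applies form encoding (a space becomes +), which is appropriate for query parameter values only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ContextEscapeSketch {
    // Text node: only & and < are escaped.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    // Double-quoted attribute value: & and the " delimiter are escaped, < is not.
    static String escapeDoubleQuotedAttribute(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;");
    }

    // href value: URL-encode the parameter value first, then
    // attribute-escape the finished URL (hypothetical URL layout).
    static String searchHref(String query) {
        String url = "search?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8) + "&page=2";
        return escapeDoubleQuotedAttribute(url);
    }

    public static void main(String[] args) {
        String s = "\"bread\" & \"butter\"";
        System.out.println("<div>" + escapeText(s) + "</div>");
        System.out.println("<img alt=\"" + escapeDoubleQuotedAttribute(s) + "\">");
        System.out.println("<a href=\"" + searchHref("bread & butter") + "\">search</a>");
    }
}
```

The href case shows the two layers of escaping: the & inside the query value becomes %26 at the URL level, while the & that separates parameters becomes &amp; at the HTML level.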

Finally, apart from the danger of double escaping, the cost of escaping all text is minimal: a little extra processing and a few extra bytes on the network.

+12

HTML (or rather XML, for the moment) defines a number of so-called "special" characters, which have a special meaning for the browser, in contrast to "normal" characters, which simply stand for themselves. For example, the string "Hello, World!" contains only normal characters and therefore literally means "Hello, World!" to the browser. The string "<b>Hello, World!</b>" contains the special characters '<' , '>' and '/' , and to the browser it means: typeset the string "Hello, World!" in bold, rather than typeset "<b>Hello, World!</b>" .

The escapeHtml(String) method probably (I can't say for sure, because I don't know how it is implemented) converts an arbitrary string into HTML code that instructs the browser to typeset that string literally. For example, escapeHtml("<b>Hello, World!</b>") returns HTML code that the browser will interpret as: typeset "<b>Hello, World!</b>" as ordinary text, rather than typeset the string "Hello, World!" in bold. If the escapeHtml(String) method is implemented correctly, you don't need to care what the HTML it generates looks like. Just use it wherever you want the browser to typeset a string literally.
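
As a rough illustration of that round trip, here is a minimal sketch. Both methods are hypothetical helpers written for this example: escapeHtml stands in for the library method, and unescapeHtml only imitates the entity decoding a browser performs for these four entities.

```java
final class RoundTripSketch {
    // Hypothetical stand-in for the library escaper; handles only & < > ".
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Imitates the browser's entity decoding. &amp; is decoded last so that
    // text such as "&amp;lt;" ends up as "&lt;" and is not decoded twice.
    static String unescapeHtml(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String literal = "<b>Hello, World!</b>";
        String escaped = escapeHtml(literal);

        System.out.println(escaped);               // markup sent to the browser
        System.out.println(unescapeHtml(escaped)); // text the user ends up seeing
    }
}
```

The escaped form contains no '<' at all, so the browser has nothing to interpret as a tag and shows the original characters as plain text.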

+3

You need to escape HTML or XML whenever there is a chance that it could be interpreted as part of the generated HTML (read: JSP) page.

This good question also explains it.

+2

In my experience, all strings should be HTML-escaped before being displayed on the page. Our current project manages all organizational units from Active Directory, and the names of these units can contain any special character (including HTML characters). When displaying them on the page, you might have the following code to show an entry called User <Marketing>

 <a href="viewDetail.do"> <%= request.getAttribute("Name") %> </a> 

After the page is rendered, the output becomes

 <a href="viewDetail.do"> User <Marketing> </a> 

which actually appears as just a "User" hyperlink on the page, because the browser treats <Marketing> as an unknown tag rather than as text.

However, if you HTML-escape the value before passing it to the page

 request.setAttribute("Name", StringEscapeUtils.escapeHtml("User <Marketing>")); 

then after the page is rendered, the output becomes

  <a href="viewDetail.do"> User &lt;Marketing&gt; </a> 

which is displayed correctly on the JSP page.

In short, you use HTML escaping to neutralize special characters in the input. If the input contains HTML characters and is not escaped, your page will not render correctly.

+1
