When escaping strings with HTML entities, can I safely skip encoding characters above Unicode 127 if I use UTF-8?

When outputting a string in HTML, special characters ("&", "<", ">", etc.) must be escaped as HTML entities, for obvious reasons.

I reviewed two Java implementations: org.apache.commons.lang.StringEscapeUtils.escapeHtml(String) and net.htmlparser.jericho.CharacterReference.encode(CharSequence).

Both escape all characters above Unicode code point 127 (0x7F), which in practice means all non-English characters.

This works, but the strings it produces are not human-readable when the text is not English (for example, Hebrew or Arabic). I have seen that when characters above Unicode 127 are left unescaped, they still display correctly in browsers. I believe this is because the HTML page is encoded in UTF-8, so the browser can interpret these characters directly.

My question is: can I safely turn off escaping of characters above code point 127 when escaping HTML entities, provided my web page is encoded in UTF-8?

java html encoding escaping html-entities
2 answers

You only need to use HTML entities in two cases:

  • To escape a character that has special meaning in HTML (e.g., < )
  • To display a character that cannot be represented in the document's encoding (for example, the € character in ISO-8859-1)

Given that UTF-8 can represent all Unicode characters, only the first case applies.
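As a rough illustration of "only the first case applies", here is a minimal sketch of an escaper that touches only the HTML-significant characters and passes everything else through, including characters above code point 127. The class and method names (MinimalHtmlEscaper, escape) are made up for this example and are not part of Commons Lang or Jericho; the sketch also assumes the page really is served as UTF-8.

```java
// Hypothetical helper, not a library API: escapes only characters that are
// special in HTML and leaves all other characters (including non-ASCII) alone.
final class MinimalHtmlEscaper {
    private MinimalHtmlEscaper() {}

    static String escape(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break; // needed inside attribute values
                case '\'': out.append("&#39;");  break; // needed inside single-quoted attributes
                default:   out.append(c);               // everything else is copied unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u05E9\u05DC\u05D5\u05DD" is the Hebrew word "shalom", written with
        // Unicode escapes so the source file compiles regardless of its encoding.
        System.out.println(escape("<b>\u05E9\u05DC\u05D5\u05DD & co</b>"));
    }
}
```

The Hebrew letters come out exactly as they went in; only the markup characters are replaced.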

When typing HTML manually, you may find it practical to insert an HTML entity now and then if your editor and/or keyboard does not let you type a particular character (it is simply easier to type &copy; than to figure out how to type an actual ©), but when escaping text automatically you are just increasing the page size ;-)

I don't know much about Java, but other languages have separate functions for encoding only the special characters and for encoding all possible entities.


If the encoding is sent in the Content-Type HTTP header:

 Content-Type: text/html; charset=utf-8 

then the browser interprets your source as UTF-8, and you can send all of these characters as regular UTF-8 encoded bytes.
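To make "regular UTF-8 encoded bytes" concrete on the Java side, the sketch below encodes a short Hebrew word as UTF-8 and compares its size with the same word written as numeric character references. Utf8SizeDemo, escapeNonAscii and utf8Length are hypothetical names invented for this illustration; escapeNonAscii only imitates the general idea of escaping everything above 127 and is not the actual behavior of any particular library.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: compares raw UTF-8 output with numeric-entity output.
final class Utf8SizeDemo {
    private Utf8SizeDemo() {}

    // Replaces every code point above 127 with a decimal character reference.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "\u05E9\u05DC\u05D5\u05DD"; // Hebrew "shalom", 4 letters
        System.out.println(utf8Length(word));                 // 8: two bytes per letter
        System.out.println(escapeNonAscii(word));             // &#1513;&#1500;&#1493;&#1501;
        System.out.println(utf8Length(escapeNonAscii(word))); // 28: seven ASCII bytes per letter
    }
}
```

Both forms render the same in a browser that decodes the page as UTF-8; the unescaped one is smaller and stays readable in the page source.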

Alternatively, you can specify the encoding in the head of your HTML page as follows:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 

This has the advantage that the encoding information stays with the HTML page if the user saves it and reopens it from the hard disk later.

Personally, I would do both (send the correct header and add the meta tag to your HTML page). This works fine as long as the two places agree on the encoding.

Update: HTML5 added a new syntax for specifying the encoding:

 <meta charset="utf-8"> 