When outputting a string in HTML, special characters ("&", "<", ">", etc.) should be escaped as HTML entities, for obvious reasons.
I reviewed two Java implementations: org.apache.commons.lang.StringEscapeUtils.escapeHtml(String) and net.htmlparser.jericho.CharacterReference.encode(CharSequence).
Both also escape every character above Unicode code point 127 (0x7F), which in practice means all non-English characters.
This works, but the strings it produces are not human-readable when the text is not English (for example, Hebrew or Arabic). I noticed that when characters above code point 127 are left unescaped, they still display correctly in browsers. I believe this is because the HTML page is encoded in UTF-8, so the browser can decode these characters directly.
My question is: can I safely turn off escaping of characters above code point 127 when escaping HTML entities, provided my webpage is encoded in UTF-8?
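To make the question concrete, here is a minimal sketch of the kind of escaping I have in mind. It is not taken from either library: the class name MinimalHtmlEscaper and its escape method are hypothetical, and it assumes the page is both served and declared as UTF-8. It escapes only the characters that are significant in HTML markup and passes everything else, including characters above code point 127, through unchanged.

```java
// Hypothetical helper, not part of Commons Lang or Jericho.
// Assumption: the output is written to a page served and declared as UTF-8.
final class MinimalHtmlEscaper {

    private MinimalHtmlEscaper() {
    }

    // Escapes only the markup-significant characters; all other characters,
    // including those above code point 127, are copied as-is.
    static String escape(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                case '\'':
                    out.append("&#39;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u05e9\u05dc\u05d5\u05dd is the Hebrew word "shalom", written with
        // Unicode escapes so the source file encoding does not matter.
        System.out.println(escape("<b>\u05e9\u05dc\u05d5\u05dd & \"hi\"</b>"));
    }
}
```

With this approach the Hebrew text stays readable in the page source, and only "<", ">", "&" and the quote characters are replaced by entities.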
java html encoding escaping html-entities
Amos