See the W3C note Unicode in XML and other Markup Languages. It defines a character class "discouraged for use in markup", which I would definitely filter for most websites. It includes characters such as:
U+2028–9, which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
U+202A–E, which are bidi control codes that tricky users can embed to make text appear in reverse order in some browsers, even spilling outside a given HTML element;
language override controls, which can also have an effect outside the element;
BOM.
In addition, you'd want to filter/replace characters that are not valid in Unicode at all (U+FFFF et al.), and, if you are using a language that works in UTF-16 natively (e.g. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
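A minimal sketch of such a filter in Python, covering an illustrative subset of the discouraged characters listed above (the function name and the exact character class are my own; the full W3C list is longer, and unpaired-surrogate handling is omitted here since well-formed Python 3 strings normally don't contain them):

```python
import re

# Illustrative subset of characters "discouraged for use in markup":
# line/paragraph separators, bidi controls, BOM, and some noncharacters.
DISCOURAGED = re.compile(
    '[\u2028\u2029'    # LINE/PARAGRAPH SEPARATOR (break JS string literals)
    '\u202a-\u202e'    # bidi embedding/override controls
    '\ufeff'           # BOM / zero-width no-break space
    '\ufdd0-\ufdef'    # noncharacters U+FDD0..U+FDEF
    '\ufffe\uffff]'    # noncharacters U+FFFE, U+FFFF
)

def strip_discouraged(text: str) -> str:
    """Remove markup-discouraged characters from user input."""
    return DISCOURAGED.sub('', text)
```

Whether to drop or replace (e.g. with U+FFFD) is a policy choice; dropping is shown here for simplicity.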
Range 0x00-0x19 (mainly control characters), excluding 0x09 (tab), 0x0A (LF) and 0x0D (CR)
And maybe (especially for web applications) also drop CR and convert tabs to spaces.
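That normalization might look like this sketch (the function name and the four-space tab width are assumptions, not anything the answer prescribes):

```python
def normalize_whitespace(text: str, tab_width: int = 4) -> str:
    """Fold CRLF and lone CR into LF, then expand tabs to spaces."""
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    return text.replace('\t', ' ' * tab_width)
```

Note that CRLF is folded first, so a Windows line ending becomes a single LF rather than LF plus a stray newline.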
Range 0x7F-0x9F (more control characters)
Yes, away with those, except in the unlikely case where people really mean them. (SO used to allow them, which let people post strings that had been mis-decoded, which is sometimes useful for diagnosing Unicode problems.) For most sites, I don't think you'd want them.