Bare-minimum sanitization

In an application that accepts, stores, processes, and displays Unicode text (say, for the sake of discussion, a web application), which characters should always be removed from incoming text?

I can think of a few, mainly drawn from the Wikipedia article on C0 and C1 control codes:

  • Range 0x00 - 0x1F (mainly control characters), excluding 0x09 (tab), 0x0A (LF) and 0x0D (CR)

  • Range 0x7F - 0x9F (more control characters)
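A minimal sketch of that filter in Python (the function name and regex are my own illustration, not part of the question):

```python
import re

# C0 controls except tab (0x09), LF (0x0A), CR (0x0D),
# plus DEL (0x7F) and the C1 controls (0x80-0x9F).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")

def strip_controls(text: str) -> str:
    """Remove disallowed control characters, keeping tab/LF/CR."""
    return _CONTROL_CHARS.sub("", text)
```

Note that this removes characters outright rather than replacing them, which is what the question asks for; a replacement policy (e.g. substituting U+FFFD) is an alternative design.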

Even better would be a well-known list of character ranges that can be safely accepted.

There are other levels of text filtering: one could canonicalize characters that have multiple representations (Unicode normalization), replace non-printing characters, and strip zero-width characters, but I'm mostly interested in the bare minimum.
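Those extra levels could be sketched like this, assuming NFC normalization and a small, illustrative subset of zero-width characters (note that stripping U+200D will break emoji ZWJ sequences, so this is a sketch, not a production policy):

```python
import re
import unicodedata

# A small illustrative subset of zero-width characters:
# ZERO WIDTH SPACE, ZWNJ, ZWJ, and the BOM/ZWNBSP.
_ZERO_WIDTH = re.compile("[\u200B\u200C\u200D\uFEFF]")

def canonicalize(text: str) -> str:
    """Normalize to NFC so equivalent sequences compare equal,
    then drop zero-width characters."""
    return _ZERO_WIDTH.sub("", unicodedata.normalize("NFC", text))
```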

+4
2 answers

See the W3C note Unicode in XML and other Markup Languages. It defines a class of characters as "discouraged for use in markup", which I would definitely filter on most websites. It includes characters such as:

  • U+2028-U+2029, the funky newlines (LINE SEPARATOR and PARAGRAPH SEPARATOR) that will confuse JavaScript if you try to use them in a string literal;

  • U+202A-U+202E, the bidi control codes that tricky users can embed to make text appear reversed in some browsers, with an effect that can extend beyond the enclosing HTML element;

  • the language override characters, whose scope can likewise extend outside the element;

  • the BOM (U+FEFF).

In addition, you'll want to filter or replace characters that are not valid in Unicode at all (noncharacters such as U+FFFF, etc.), and, if you're working in a language whose strings are natively UTF-16 (e.g. Java, or Python on Windows), any surrogate characters (U+D800-U+DFFF) that do not form valid surrogate pairs.

Range 0x00-0x1F (mainly control characters), excluding 0x09 (tab), 0x0A (LF) and 0x0D (CR)

And maybe (especially for a web application) also drop the CRs and convert tabs to spaces.
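That whitespace normalization might look like this (a sketch; the tab width is an arbitrary choice, and `expandtabs` aligns to tab stops rather than doing a flat replacement):

```python
def normalize_whitespace(text: str, tab_width: int = 4) -> str:
    """Fold CRLF and lone CR to LF, then expand tabs to spaces."""
    return text.replace("\r\n", "\n").replace("\r", "\n").expandtabs(tab_width)
```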

Range 0x7F-0x9F (more control characters)

Yes, away with those, except where people might genuinely have a use for them. (SO used to allow them, which let people post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites, though, I think you won't want them.

+1

I suppose it depends on your goal. You can restrict users to "keyboard" characters if that's your whim: 9, 10, 13, and [32-126]. In UTF-8, any byte of 0x80 or above means you are inside a multibyte Unicode character. In legacy extended-ASCII encodings, the values above 0x7F hold special display/formatting characters and locale-dependent extensions whose meaning varies with the language of the locale.

Note that "keyboard characters" vary by locale: users can enter characters in their own language outside the range 0x00-0x7F if their language doesn't use an unaccented Latin script (Arabic, Chinese, Japanese, Greek, Cyrillic, etc.).
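The "keyboard characters" restriction described above could be sketched as a whitelist check (the function name and regex are my own illustration):

```python
import re

# Tab (9), LF (10), CR (13), and printable ASCII (32-126).
_KEYBOARD_ONLY = re.compile(r"[\t\n\r\x20-\x7E]*")

def is_keyboard_ascii(text: str) -> bool:
    """True if the text uses only tab, LF, CR, and printable ASCII."""
    return _KEYBOARD_ONLY.fullmatch(text) is not None
```

As the caveat above says, this rejects perfectly legitimate non-Latin input, so it is only appropriate when such a restriction is an explicit requirement.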

If you look at a UTF-8 code chart, you can see which characters will be displayed.

0

Source: https://habr.com/ru/post/1314984/

