Checking the sanity of messages with characters that are not part of any ASCII superset (for example: JIS X 0208)?

Question

I DO NOT want to check whether a string in Python is ASCII. :)

There is an interesting requirement in the HTTP specification, and I was wondering how it can be implemented and tested.

Recipients MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII [USASCII].
Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A).

In another answer, https://stackoverflow.com/a/166269/2123/2128, there is an example of a character set that is not a superset of US-ASCII. But I am more interested in testing this requirement, or at least some kind of test for it. The requirement simply means that the parser must assume a superset of ASCII when it ingests the data, but I was wondering in which cases you would want to check beforehand whether there are any strange characters inside the message.

Say the message is MSG.

    def is_ascii_superset(self, MSG):
        "take any string, and return True or False"
        # Test here
        if test(MSG):
            return True
        else:
            return False

Any idea whether there is a list of all character sets that are supersets of ASCII?

UPDATE:

People seem to misunderstand this question. I am not asking whether the string is ASCII. That is trivial.

ISO-8859-1, UTF-8, etc. are supersets of ASCII.
JIS X 0208 is NOT a superset of ASCII.
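
To make the distinction above concrete, here is a rough sketch that probes a Python codec one byte at a time: a codec counts as an ASCII superset here if every byte from %x00 to %x7F decodes to the same code point it has in ASCII. The function name `codec_is_ascii_superset` is hypothetical, and the probe is only an approximation. It cannot characterize stateful encodings (ISO-2022-JP, for instance, switches into JIS X 0208 with escape sequences), so treat the result as an indication rather than a definitive list.

```python
def codec_is_ascii_superset(name):
    """Return True if every byte 0x00-0x7F decodes to the same ASCII character.

    Hypothetical helper: a single-byte probe, not a full characterization
    of the encoding (stateful encodings are not handled).
    """
    for b in range(128):
        try:
            if bytes([b]).decode(name) != chr(b):
                return False
        except UnicodeError:
            # The byte cannot be decoded on its own (e.g. UTF-16 needs two bytes).
            return False
    return True
```

For example, `codec_is_ascii_superset("utf-8")` and `codec_is_ascii_superset("latin-1")` return True, while `"utf-16"` and the EBCDIC codec `"cp037"` return False.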

python unit-testing unicode character-encoding

karlcow Mar 11 '13 at 21:50

1 answer

Pavel anossov · Accepted Answer · 2013-03-11T21:55:15+0000

You do not need to check for this; you just treat everything as an ASCII superset, i.e. always treat %x0A as LF, assume bytes up to %x7F are ASCII, and don't try to parse multibyte sequences. A superset of ASCII can use every byte value, so there are no "strange" characters.
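
As a minimal sketch of that approach, assuming the raw message is available as `bytes`: split on the LF octet at the byte level, decode only the field names as ASCII, and leave the field values as opaque bytes. The function name `parse_header_block` is hypothetical, and the sketch ignores the request/status line and other details of real HTTP parsing.

```python
def parse_header_block(raw):
    """Parse header lines from raw bytes without decoding multibyte sequences.

    Hypothetical helper: returns a list of (name, value) pairs where the
    name is an ASCII str and the value is left as undecoded bytes.
    """
    headers = []
    for line in raw.split(b"\n"):        # %x0A is always treated as LF
        line = line.rstrip(b"\r")        # tolerate CRLF line endings
        if not line:                     # an empty line ends the header block
            break
        name, sep, value = line.partition(b":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        # Field names must be ASCII; values are kept as opaque bytes.
        headers.append((name.decode("ascii"), value.strip(b" \t")))
    return headers
```

Because the input is never decoded as a character stream, a truncated multibyte sequence such as `b"\xe3\x81"` in a field value cannot swallow the LF that follows it; the value is simply returned as bytes.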
