unicode_escape doesn't work in general
It turns out that the string_escape or unicode_escape solution does not work in general - in particular, it doesn't work in the presence of actual Unicode.
If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.
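As a quick check of the all-ASCII case, here is a minimal sketch (the variable names are illustrative, not from the original answer). Every non-ASCII character is written as an escape, so the round trip through bytes cannot corrupt anything:

```python
# Every non-ASCII character is spelled as an escape, so the text is pure ASCII.
escaped = 'na\\xefve \\t test \\u0151'

# Pure ASCII text can be encoded with the 'ascii' codec without loss,
# and unicode_escape then interprets each escape sequence.
decoded = escaped.encode('ascii').decode('unicode_escape')

print(repr(decoded))
```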
unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places - for example, Python source code - the source data is already Unicode text.
The only way this can work correctly is to first encode the text into bytes. UTF-8 is a reasonable encoding for all text, so this should work, right?
The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations in both Python 2 and 3.
    >>> s = 'naïve \\t test'
    >>> print(s.encode('utf-8').decode('unicode_escape'))
    naÃ¯ve   test
Well, that’s wrong.
The new recommended way to use codecs that decode text to text is to call codecs.decode directly. Does it help?
    >>> import codecs
    >>> print(codecs.decode(s, 'unicode_escape'))
    naÃ¯ve   test
Not at all. (Also the above is a UnicodeError in Python 2.)
The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:
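The Latin-1 assumption can be seen directly. The following is a small sketch in Python 3 (the names via_utf8 and expected_mojibake are made up for illustration): each non-ASCII byte is mapped to the code point with the same value, so the two UTF-8 bytes of 'ï' come out as two separate characters.

```python
s = 'naïve \\t test'

# A single non-ASCII byte decodes to the Latin-1 character with that value.
single = b'\xef'.decode('unicode_escape')
print(single == '\xef')  # '\xef' is 'ï'

# UTF-8 encodes 'ï' as two bytes (0xC3 0xAF). unicode_escape reads each
# byte as its own Latin-1 character, which produces mojibake.
via_utf8 = s.encode('utf-8').decode('unicode_escape')

# The same damage, reproduced by explicitly misreading UTF-8 as Latin-1.
expected_mojibake = 'naïve \t test'.encode('utf-8').decode('latin-1')
print(via_utf8 == expected_mojibake)
```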
    >>> print(s.encode('latin-1').decode('unicode_escape'))
    naïve    test
But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!
    >>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151' in position 3: ordinal not in range(256)
Adding a regex to solve the problem
(Surprisingly, we do not now have two problems.)
What we need to do is only apply the unicode_escape decoder to things that we are certain to be ASCII text. In particular, we can make sure to apply it only to valid Python escape sequences, which are guaranteed to be ASCII text.
The plan is to find escape sequences using a regular expression, and to pass a function as the replacement argument to re.sub so that each one is replaced with its unescaped value.
    import re
    import codecs

    ESCAPE_SEQUENCE_RE = re.compile(r'''
        ( \\U........      # 8-digit hex escapes
        | \\u....          # 4-digit hex escapes
        | \\x..            # 2-digit hex escapes
        | \\[0-7]{1,3}     # Octal escapes
        | \\N\{[^}]+\}     # Unicode characters by name
        | \\[\\'"abfnrtv]  # Single-character escapes
        )''', re.UNICODE | re.VERBOSE)

    def decode_escapes(s):
        def decode_match(match):
            return codecs.decode(match.group(0), 'unicode-escape')

        return ESCAPE_SEQUENCE_RE.sub(decode_match, s)
And with that:
    >>> print(decode_escapes('Ernő \\t Rubik'))
    Ernő     Rubik
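One design note: the hex patterns above use `.`, so they also match text such as `\xzz`, which the decoder then rejects with a UnicodeDecodeError. If that matters for your input, a possible variant is sketched below. It is an assumption layered on top of the answer, not part of it, and the names STRICT_ESCAPE_RE and decode_escapes_strict are hypothetical. It only narrows the hex escapes to actual hex digits and leaves anything else untouched.

```python
import codecs
import re

# Hypothetical stricter pattern: hex escapes must consist of hex digits.
STRICT_ESCAPE_RE = re.compile(r'''
    ( \\U[0-9a-fA-F]{8}  # 8-digit hex escapes
    | \\u[0-9a-fA-F]{4}  # 4-digit hex escapes
    | \\x[0-9a-fA-F]{2}  # 2-digit hex escapes
    | \\[0-7]{1,3}       # Octal escapes
    | \\N\{[^}]+\}       # Unicode characters by name
    | \\[\\'"abfnrtv]    # Single-character escapes
    )''', re.VERBOSE)

def decode_escapes_strict(s):
    """Decode only well-formed escape sequences; leave other text as-is."""
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return STRICT_ESCAPE_RE.sub(decode_match, s)

# Literal non-ASCII text survives, and the escape is decoded.
print(decode_escapes_strict('Ernő \\t Rubik') == 'Ernő \t Rubik')

# A malformed hex escape no longer matches, so it is passed through unchanged.
print(decode_escapes_strict('\\xzz') == '\\xzz')
```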