Process escape sequences in a string in Python

Question

Sometimes, when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let myString be defined as:

    >>> myString = "spam\\neggs"
    >>> print(myString)
    spam\neggs

I need a function (I will call it process) that does this:

    >>> print(process(myString))
    spam
    eggs

It is important that the function can handle all escape sequences in Python (listed in the table in the link above).

Does Python have a function for this?

+78

python string escaping

dln385 Oct 26 '10

7 answers

unicode_escape doesn't work in general

It turns out that the string_escape or unicode_escape solution does not work in general. In particular, it does not work in the presence of actual Unicode text.

If you can be sure that every non-ASCII character will be escaped (and remember that anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if your string already contains literal non-ASCII characters, things will go wrong.
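As a minimal sketch of the safe case described above, assume the input is pure ASCII and every non-ASCII character is already written as an escape. The variable names here are illustrative only.

```python
# Input in which every non-ASCII character is already written as an escape,
# so the text itself is pure ASCII.
escaped = 'na\\xefve \\t test \\u0151'

# Encoding pure ASCII to bytes is lossless, so unicode_escape sees only
# ASCII bytes and has nothing to misinterpret.
decoded = escaped.encode('ascii').decode('unicode_escape')

print(decoded)
```

If the input contained a literal non-ASCII character, the encode('ascii') call would raise UnicodeEncodeError, which at least fails loudly instead of producing mojibake.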

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places (for example, Python source code) the source data is already Unicode text.

The only way this can work correctly is to first encode the text into bytes. UTF-8 is a reasonable encoding for all text, so this should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

    >>> s = 'naïve \\t test'
    >>> print(s.encode('utf-8').decode('unicode_escape'))
    naÃ¯ve    test

Well, that’s wrong.

The new recommended way to use codecs that decode text to text is to call codecs.decode directly. Does it help?

    >>> import codecs
    >>> print(codecs.decode(s, 'unicode_escape'))
    naÃ¯ve    test

Not at all. (Also the above is a UnicodeError in Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

    >>> print(s.encode('latin-1').decode('unicode_escape'))
    naïve    test

But this is terrible. It limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

    >>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
    in position 3: ordinal not in range(256)

Adding a regex to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is apply the unicode_escape decoder only to things that are certain to be ASCII text. In particular, we can make sure to apply it only to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is to find escape sequences using a regular expression, and to pass a function as the argument to re.sub that replaces each one with its unescaped value.

    import re
    import codecs

    ESCAPE_SEQUENCE_RE = re.compile(r'''
        ( \\U........      # 8-digit hex escapes
        | \\u....          # 4-digit hex escapes
        | \\x..            # 2-digit hex escapes
        | \\[0-7]{1,3}     # Octal escapes
        | \\N\{[^}]+\}     # Unicode characters by name
        | \\[\\'"abfnrtv]  # Single-character escapes
        )''', re.UNICODE | re.VERBOSE)

    def decode_escapes(s):
        def decode_match(match):
            return codecs.decode(match.group(0), 'unicode-escape')

        return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

    >>> print(decode_escapes('Ernő \\t Rubik'))
    Ernő     Rubik

+82

rspeer Jul 01 '14 at 21:12

Actually the correct and convenient answer for python 3:

    >>> import codecs
    >>> myString = "spam\\neggs"
    >>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
    spam
    eggs
    >>> myString = "naïve \\t test"
    >>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
    naïve    test

Details for codecs.escape_decode:

codecs.escape_decode is a bytes-to-bytes decoder.
codecs.escape_decode decodes ASCII escape sequences, such as: b"\\n" → b"\n", b"\\xce" → b"\xce".
codecs.escape_decode does not know or care about the encoding of the bytes object, but the encoding of the escaped bytes should match the encoding of the rest of the object.
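The points above could be wrapped into a small helper along these lines. This is only a sketch: the name unescape_via_bytes and its encoding parameter are hypothetical, and codecs.escape_decode is an undocumented CPython function, so its behaviour may differ between versions.

```python
import codecs


def unescape_via_bytes(text, encoding="utf-8"):
    """Hypothetical helper: unescape text by round-tripping through bytes."""
    # Encode first, because escape_decode is a bytes-to-bytes decoder.
    raw = text.encode(encoding)
    # escape_decode returns a tuple; the first item is the unescaped bytes.
    unescaped = codecs.escape_decode(raw)[0]
    # Decode with the same encoding so literal non-ASCII characters survive.
    return unescaped.decode(encoding)


print(unescape_via_bytes("Ernő \\t Rubik"))
```

Because this follows the rules for bytes literals, escapes that only exist in str literals (\u, \U and \N{...}) are not expanded by this approach.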

Background:

@rspeer is correct: unicode_escape is the wrong solution for Python 3. This is because unicode_escape decodes the escaped bytes, then decodes the bytes into a Unicode string, but receives no information about which codec to use for the second operation.
@Jerub is correct: avoid AST or eval.
I first discovered codecs.escape_decode from this answer to the question "how do I .decode('string-escape') in Python3?". As that answer states, the function is not currently documented for Python 3.
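The missing-codec problem in the first point can be observed directly. The following is a small sketch with illustrative variable names: for raw non-ASCII bytes, unicode_escape produces the same result as a Latin-1 decode, whatever encoding the bytes were actually in.

```python
# 'ï' becomes two bytes in UTF-8: b'\xc3\xaf'
raw = 'ï'.encode('utf-8')

# unicode_escape is not told that the bytes are UTF-8 ...
via_escape = raw.decode('unicode_escape')
# ... and its result matches a plain Latin-1 decode of the same bytes.
via_latin1 = raw.decode('latin-1')
print(via_escape == via_latin1)

# Undoing the Latin-1 step and decoding as UTF-8 recovers the character.
roundtrip = via_escape.encode('latin-1').decode('utf-8')
print(roundtrip)
```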

+14

user19087 May 05 '16 at 20:27

The ast.literal_eval function comes close, but it expects the string to be properly quoted first.

Of course, how Python interprets backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc.), so you may want to wrap the user input in suitable quotes and pass it to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, a tuple, a dictionary, etc.

Things can still get tricky if the user types unescaped quotes of the same type that you intend to wrap around the string.
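A rough sketch of the wrap-and-evaluate idea, including the quote problem just mentioned, might look like this. The helper name unescape_with_literal_eval is hypothetical, and the choice of double quotes as the wrapper is an assumption.

```python
import ast


def unescape_with_literal_eval(text):
    """Hypothetical helper: wrap text in double quotes and parse it as a literal."""
    return ast.literal_eval('"' + text + '"')


print(unescape_with_literal_eval('spam\\neggs'))

# An unescaped double quote in the input ends the literal early,
# so the wrapped text is no longer a valid string literal.
try:
    unescape_with_literal_eval('say "hi"')
except SyntaxError as exc:
    print('could not parse:', exc.msg)
```

Input containing a real newline character would break the single-line literal in the same way, so this approach needs extra handling before it can accept arbitrary input.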

+6

Greg Hewgill Oct 26 '10 at 3:50

rspeer's answer correctly points out that unicode-escape implies an implicit decode using latin-1, but it does not follow through. If unicode-escape decodes escapes correctly but mishandles raw non-ASCII bytes by decoding them as latin-1, then the direct fix is not to reach for a regular expression, but to re-encode the result as latin-1 afterwards (to undo the erroneous part of the process) and then decode it in the correct encoding. For example, this misuse:

    >>> s = 'naïve \\t test'
    >>> print(s.encode('utf-8').decode('unicode_escape'))
    naÃ¯ve    test

can be made trivially correct by adding .encode('latin-1').decode('utf-8'), like this:

    >>> s = 'naïve \\t test'
    >>> print(s.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8'))
    naïve    test
    >>> # Or using codecs.decode to replace the first encode/decode pair with a single text->text transform:
    >>> print(codecs.decode(s, 'unicode_escape').encode('latin-1').decode('utf-8'))
    naïve    test

Of course, that is a lot of back and forth, and I would not want to inline it in my code, but it can be factored out into a stand-alone function that works for both str and bytes (with an optional decoding step for bytes if the result is in a known encoding):

    def decode_escapes(s, encoding=None):
        if isinstance(s, str):
            if encoding is not None:
                raise TypeError("Do not pass encoding for string arguments")
            # UTF-8 will allow correct interpretation of escapes when bytes form
            # interpreted as latin-1
            s = s.encode('utf-8')
            encoding = 'utf-8'
        decoded = s.decode('unicode_escape').encode('latin-1')
        if encoding is not None:
            # If encoding is provided, or we started with an arbitrary string, decode
            decoded = decoded.decode(encoding)
        return decoded

+2

ShadowRanger Aug 18 '18 at 2:46

Below is code that should work for a \n that needs to appear in a string as a real newline.

    our_str = 'The String is \\n, \\n and \\n!'
    # Replace the first literal backslash-n with a real newline.
    new_str = our_str.replace('\\n', '\n', 1)
    print(new_str)

0

Vignesh Ramsubbose Mar 26 '18 at 9:42

If you trust the source of the data, why not just put quotes around it and eval() it?

    >>> myString = 'spam\\neggs'
    >>> print(eval('"' + myString.replace('"','') + '"'))
    spam
    eggs

PS. Added a countermeasure against evil code execution: it now strips every " before eval-ing.

-4

Nas Banov Oct 26 '10

Jerub · Accepted Answer · 2010-10-26 05:01

The correct thing to do is use the 'string-escape' codec to decode the string.

    >>> myString = "spam\\neggs"
    >>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape")  # python3
    >>> decoded_string = myString.decode('string_escape')  # python2
    >>> print(decoded_string)
    spam
    eggs

Don't use AST or eval. Using the string codecs is much safer.
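The safety difference can be illustrated with a harmless payload. This is only a sketch: the payload and variable names are made up, and the eval call here is the naive wrap-in-quotes form with no quote stripping.

```python
# Input that closes the wrapping quote and injects an expression.
payload = '" + str(6 * 7) + "'

# Naive eval: the injected expression is executed as Python code.
evaluated = eval('"' + payload + '"')
print(repr(evaluated))

# Codec-based decoding treats the same input purely as text.
decoded = bytes(payload, 'utf-8').decode('unicode_escape')
print(repr(decoded))
```

Here the injected expression only computes a number, but the same mechanism would run any expression an attacker supplied, whereas the codec never executes the input.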
