"surrogateescape" cannot escape certain characters

Question

Regarding reading and writing text files in Python, one of the main Python contributors says this about the surrogateescape Unicode error handler:

[surrogateescape] handles decoding errors by offloading data in an unused portion of the Unicode code space. When encoding, it translates those hidden values back into the exact original byte sequence that could not be decoded correctly.
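To make the quoted behaviour concrete, here is a minimal sketch (the sample bytes and variable names are only illustrative). It assumes the documented mapping in which each undecodable byte 0xNN is turned into the lone surrogate U+DCNN:

```python
# Sketch: how surrogateescape carries undecodable bytes through a str.
raw = b"Zo\xc3\xab"  # UTF-8 bytes for "Zoë"; 0xC3 and 0xAB are not ASCII

text = raw.decode("ascii", errors="surrogateescape")

# Each undecodable byte 0xNN shows up as the lone surrogate U+DCNN.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escaped)  # ['0xdcc3', '0xdcab']

# Encoding with the same codec and handler restores the original bytes.
restored = text.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True
```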

However, opening a file and then writing its contents to another file:

input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')

for line in input_file:
    output_file.write(line)

Results in:

  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed

Note that the input file is not pure ASCII: the line on which the error is raised contains an accented character.

Here is that line in UTF-8:

'Zoë\'s Coffee House'

And here it is in hex:

$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017

Why would the surrogateescape Unicode error handler return a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.

+4

python encoding unicode utf-8

dotancohen Jan 14 '14 at 14:30

3

UTF-8 - .

, , "". , , , , 1, ASCII UTF-8. "" , " " , , , .

xml ( ) - , .

= replace "Zo? Coffee House", . (, - - Unicode ASCII, "?" ).

surrogateescape , : ", ? , . , ... ". Python ( ) , .

Python . , , . ( , , .)

, , , , , () ... () .

+3

user2784358 12 . '14 14:58

Why should the low surrogate U+DCC3 be encoded into UTF-8 at all? That is not allowed and would be useless, because a surrogate is NOT a character. Find the high surrogate that belongs to the low surrogate, decode the pair into its code point, and then create the correct UTF-8 sequence for that code point.

0

brighty Jan 15 '14 at 12:43

Ignacio Vazquez-Abrams · Accepted Answer · 2014-01-14T14:39:34+0000

Why would the surrogateescape Unicode error handler return a character that is not ASCII?

. , , , .

3>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
3>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"
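Applied to the file copy from the question, the same round trip could look like the sketch below. This is one possible approach, not the only fix: `copy_text` is a hypothetical helper, and the temporary files merely stand in for someFile.txt and anotherFile.txt. The point is that the output file is opened with the same codec and errors="surrogateescape" as the input, so the escaped surrogates are turned back into the original bytes instead of being handed to the strict UTF-8 encoder.

```python
import os
import tempfile


def copy_text(src_path, dst_path, encoding="ascii"):
    """Copy text line by line, passing undecodable bytes through unchanged.

    Hypothetical helper: both files use the same codec and surrogateescape.
    """
    # newline="" disables newline translation so the copy is byte-exact.
    with open(src_path, "r", encoding=encoding,
              errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding=encoding,
              errors="surrogateescape", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo: temporary files stand in for the files from the question.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "someFile.txt")
dst_path = os.path.join(tmp_dir, "anotherFile.txt")

data = b"'Zo\xc3\xab\\'s Coffee House'\n"  # the bytes from the hex dump
with open(src_path, "wb") as f:
    f.write(data)

copy_text(src_path, dst_path)

with open(dst_path, "rb") as f:
    copied = f.read()
print(copied == data)  # True
```

If the input is actually known to be UTF-8, opening both files with encoding="utf-8" is the simpler choice; surrogateescape is mainly useful when the real encoding is unknown and the bytes only need to pass through intact.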