"surrogateescape" cannot escape certain characters

Regarding reading and writing text files in Python, one of Python's main contributors says this about the surrogateescape Unicode error handler:

[surrogateescape] handles decoding errors by offloading data in an unused portion of the Unicode code space. When encoding, it translates those hidden values back into the exact original byte sequence that could not be decoded correctly.

However, when opening a file and trying to write the output to another file:

input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')

for line in input_file:
    output_file.write(line)

Results in:

  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed
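
For context, the '\udcc3' in the traceback is how surrogateescape represents the undecodable byte 0xC3: each byte in the range 0x80 to 0xFF is mapped to a lone surrogate in the range U+DC80 to U+DCFF. A minimal sketch, using the two non-ASCII bytes from the offending line (the variable names are illustrative only):

```python
raw = b"\xc3\xab"  # the two non-ASCII bytes from the offending line

escaped = raw.decode("ascii", errors="surrogateescape")

# Each undecodable byte b becomes the code point 0xDC00 + b.
for byte, ch in zip(raw, escaped):
    print("0x{:02x} -> U+{:04X}".format(byte, ord(ch)))

# The strict UTF-8 encoder (the default for the output file here)
# rejects lone surrogates, which is the error shown in the traceback.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)
```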

Note that the input file is not pure ASCII; it contains non-ASCII bytes.

The offending line, in UTF-8:

'Zoë\'s Coffee House'

In hex:

$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017

Why would the surrogateescape Unicode error handler return a character that is not ASCII? This is Python 3.2.3 on Kubuntu Linux 12.10.


Why would the surrogateescape Unicode error handler return a character that is not ASCII?

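Because that is explicitly what it does. That way, you can use the same error handler in the other direction, and it will know what to do.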

3>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
3>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"
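
Applied to the file copy in the question, this suggests opening the output file with the same encoding and error handler as the input file, so that the escaped surrogates are turned back into the original bytes on write. A minimal sketch, assuming temporary files (`tmp_dir`, `src_path` and `dst_path` are illustrative names, not from the question):

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "someFile.txt")
dst_path = os.path.join(tmp_dir, "anotherFile.txt")

# Recreate the offending line from the hex dump.
original = b"'Zo\xc3\xab\\'s Coffee House'\n"
with open(src_path, "wb") as f:
    f.write(original)

# Use the same codec and error handler on both sides of the copy.
# newline="" disables newline translation so the bytes stay identical.
with open(src_path, "r", encoding="ascii",
          errors="surrogateescape", newline="") as input_file, \
     open(dst_path, "w", encoding="ascii",
          errors="surrogateescape", newline="") as output_file:
    for line in input_file:
        output_file.write(line)

with open(dst_path, "rb") as f:
    copied = f.read()

print(copied == original)
```

If the content never needs to be treated as text, opening both files in binary mode ("rb" and "wb") avoids the decoding step entirely.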

UTF-8 - .

, , "". , , , , 1, ASCII UTF-8. "" , " " , , , .

xml ( ) - , .

= replace "Zo? Coffee House", . (, - - Unicode ASCII, "?" ).

surrogateescape , : ", ? , . , ... ". Python ( ) , .

Python . , , . ( , , .)

, , , , , () ... () .


Why should the lone low surrogate U+DCC3 be encoded in UTF-8? That is not allowed and makes no sense, because a surrogate is NOT a character. Find the high surrogate that belongs to the low surrogate, decode the code point from the pair, and then create the correct UTF-8 sequence for that code point.
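
The pairing step described here can be sketched as follows. The surrogate pair below is an arbitrary example and is not taken from the question. In the question's data, the surrogate comes from surrogateescape rather than from a UTF-16 pair, so there may be no matching high surrogate to find.

```python
high, low = "\ud83d", "\ude00"  # example surrogate pair, not from the question

# Combine the pair into one code point (standard UTF-16 arithmetic).
code_point = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
char = chr(code_point)

print(hex(code_point))
print(char.encode("utf-8"))

# A lone low surrogate is not a Unicode scalar value,
# so the strict UTF-8 encoder refuses it.
try:
    "\udcc3".encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)
```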
