Regarding reading and writing text files in Python, one of Python's main contributors mentions this in relation to the surrogateescape Unicode error handler:
[surrogateescape] handles decoding errors by offloading data in an unused portion of the Unicode code space. When encoding, it translates those hidden values back into the exact original byte sequence that could not be decoded correctly.
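To make the quoted behaviour concrete, here is a minimal in-memory sketch of the round trip it describes. The byte string is a made-up sample (not read from the actual file), and the names sample_bytes, text and restored are only for illustration:

```python
# Hypothetical sample: "Zo" followed by the two bytes 0xC3 0xAB, which are not ASCII.
sample_bytes = b"Zo\xc3\xab"

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising UnicodeDecodeError.
text = sample_bytes.decode("ascii", errors="surrogateescape")
print(ascii(text))  # 'Zo\udcc3\udcab'

# Encoding with the same error handler turns the surrogates back into
# the original bytes.
restored = text.encode("ascii", errors="surrogateescape")
print(restored == sample_bytes)  # True
```

In this sketch the handler is named on both the decoding and the encoding side.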
However, when opening a file and trying to write the output to another file:
input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')
for line in input_file:
    output_file.write(line)
Results in:
  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed
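The same failure can apparently be reproduced without any files. This is only a sketch: it assumes the output file's default encoding is UTF-8 (as the traceback suggests), and it uses a hand-typed byte string containing the non-ASCII sequence c3 ab. The names raw_line, decoded_line and error are hypothetical:

```python
# Hand-typed sample line containing the non-ASCII bytes 0xC3 0xAB.
raw_line = b"'Zo\xc3\xab\\'s Coffee House'\n"

# Same decoding settings as the input file in the snippet above.
decoded_line = raw_line.decode("ascii", errors="surrogateescape")

# Encode with UTF-8 and the default (strict) error handling, which is
# assumed here to be what the output file does on write().
try:
    decoded_line.encode("utf-8")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(error))
```

The exception raised here reports the same reason, "surrogates not allowed", starting at position 3.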
Note that the input file is not pure ASCII; it contains some non-ASCII bytes.
Here is the line that triggers the exception, which is valid UTF-8:
'Zoë\'s Coffee House'
Here is its hex dump:
$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017
Why is the surrogateescape Unicode error handler not round-tripping the non-ASCII bytes here? This is with Python 3.2.3 on Kubuntu Linux 12.10.