I have a source text file containing only the following line and a new line:
Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439
Characters are escaped as shown above, which means that, for example, \u05E9 is indeed a backslash followed by 5 alphanumeric characters (not a single Unicode character). I am trying to decode the file using the following code:
    import codecs

    with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
        with open("wikidata-terms3.nt", "w") as output:
            for line in input:
                output.write(line)
Using print is not possible here, see comments.
Running it gives me the following error:
    Traceback (most recent call last):
      File "terms2.py", line 5, in <module>
        for line in input:
      File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
        return next(self.reader)
      File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
        line = self.readline()
      File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
        data = self.read(readsize, firstline=True)
      File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape
What's happening?
I am running Python 3.5.1 on Windows 8.1, and this code seems to work for most other Unicode characters (the line shown above is the first one that causes the crash).
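For reference, here is a sketch of a possible workaround, not a confirmed fix. It assumes the failure comes from the stream reader passing the codec a chunk that ends in the middle of an escape (the reported positions 67-71 end right at a 72-byte boundary), so it reads the file in binary mode and decodes each complete line on its own. The function name `decode_escaped_file` is hypothetical, and writing the output as UTF-8 is an assumption about what the result should look like.

```python
def decode_escaped_file(src_path, dst_path):
    """Decode backslash-u escapes one complete line at a time."""
    # Binary mode yields whole lines, so an escape is never split
    # across two separate calls to the decoder.
    # newline="" keeps the original line endings untouched on write.
    # encoding="utf-8" is an assumption; the platform default on
    # Windows may not be able to represent Cyrillic characters.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw_line in src:
            dst.write(raw_line.decode("unicode_escape"))
```

It could be called as `decode_escaped_file("wikidata-terms20.nt", "wikidata-terms3.nt")`. Note that `unicode_escape` interprets any non-ASCII bytes as Latin-1, so this sketch only makes sense if the input is pure ASCII apart from the escapes.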
See the change history for the original question.