Unicode file processing error

I have a source text file containing only the following line, followed by a newline:

Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439 

Characters are escaped as shown above, which means that \u05E9 is indeed a backslash followed by 5 alphanumeric characters (not a single Unicode character). I am trying to decode the file using the following code:

    import codecs

    with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
        with open("wikidata-terms3.nt", "w") as output:
            for line in input:
                output.write(line)

Using print is not possible here, see comments.

Running it gives me the following error:

    Traceback (most recent call last):
      File "terms2.py", line 5, in <module>
        for line in input:
      File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
        return next(self.reader)
      File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
        line = self.readline()
      File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
        data = self.read(readsize, firstline=True)
      File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape

What's happening?

I am running Python 3.5.1 on Windows 8.1, and this code seems to work for most other Unicode characters (this line is the first one that causes the crash).

See the change history for the original question.

1 answer

It looks like the data handed to the decoder is truncated after character 72 (index 71 when counting from 0), which matches the position reported in the traceback. This appears to be related to a reported bug.

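The following calls produce the same error as in your example (both read the file through codecs.open with the unicode_escape codec, as your code does):

    import codecs

    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline()
    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(72)

Raising the read size above the actual length of the input, or setting it to -1, makes the error go away:

    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(1000)
    codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(-1)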

Evidently, for line in input: fetches the data to decode through readline(), which effectively cuts the data handed to the decoder down to 72 characters and can split a \uXXXX escape in two.
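A cut in the middle of an escape is enough to trigger this message, independently of any file handling. The following is a minimal sketch of that effect; the byte strings are made-up examples, not the actual contents of the file.

```python
import codecs

# Two complete escapes (illustrative data, not taken from the real file).
complete = b"\\u0410\\u043D"
# Drop the last hex digit so the final escape is cut short.
truncated = complete[:-1]

# The complete input decodes to two Cyrillic characters.
decoded = codecs.decode(complete, "unicode_escape")

# The truncated input cannot be decoded as a whole.
try:
    codecs.decode(truncated, "unicode_escape")
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason
```

Here error_reason holds the decoder's explanation, which should mention a truncated escape just like the traceback in the question.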

So here are some workarounds:

Workaround 1:

    import codecs

    # Read complete lines with the plain open(), then decode each line on its own.
    with open("wikidata-terms20.nt", 'r') as input:
        with open("wikidata-terms3.nt", "w") as output:
            for line in input:
                output.write(codecs.decode(line, 'unicode_escape'))

Workaround 2:

    import codecs

    # readlines() reads and decodes the whole file in one go instead of in 72-character pieces.
    with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
        with open("wikidata-terms3.nt", "w") as output:
            for line in input.readlines():
                output.write(line)
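For experimenting without the real file, workaround 1 can be wrapped in a small function and run against a temporary copy of the sample line. This is only a sketch: decode_escaped_file is a hypothetical helper name, and the explicit ascii/utf-8 encodings are assumptions added here (the default encoding on Windows may not be able to represent Cyrillic output).

```python
import codecs
import os
import tempfile


def decode_escaped_file(src_path, dst_path):
    """Hypothetical helper: decode \\uXXXX escapes one line at a time."""
    # Assumption: the escaped input is pure ASCII and UTF-8 output is wanted.
    with open(src_path, "r", encoding="ascii") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Plain open() yields whole lines, so no escape is cut in half.
            dst.write(codecs.decode(line, "unicode_escape"))


# Usage with the sample line from the question, written to a temporary file.
sample = (r"Q853 \u0410\u043D\u0434\u0440\u0435\u0439 "
          r"\u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 "
          r"\u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439" "\n")

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "escaped.nt")
    dst_path = os.path.join(tmp, "decoded.nt")
    with open(src_path, "w", encoding="ascii") as f:
        f.write(sample)
    decode_escaped_file(src_path, dst_path)
    with open(dst_path, "r", encoding="utf-8") as f:
        result = f.read()
```

After the run, result contains the decoded line with the Cyrillic name in place of the escapes.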