Python 3.5 UnicodeDecodeError for a file in utf-8 (language "eng", Old English)

This is the first time I've asked a question on StackOverflow, but you have collectively saved so many of my projects over the years that I already feel at home.

I'm using Python 3.5 and NLTK to parse the Complete Corpus of Old English, which was delivered to me as 77 text files plus an XML document that defines the sequence of files as contiguous corpus segments in TEI format. Here's the relevant part of the header from the XML document, showing that we are essentially working with TEI:

    <?xml version="1.0" encoding="UTF-8"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader type="ISBD-ER">
        <fileDesc>

Right now, as a test, I'm just trying to use the NLTK MTECorpusReader to open the corpus and call its words() method to prove that I can read it. I'm doing all of this from the interactive Python shell, just for convenience while testing. Here is everything I'm actually doing:

    # import the reader module
    import nltk.corpus.reader as reader
    # open the sequence of files and the XML doc with the MTECorpusReader
    oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163', '.*')
    # print the first few words in the corpus to the interactive shell
    oecorpus.words()

When I try this, I get the following trace:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/util.py", line 765, in __repr__
        for elt in self:
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 397, in iterate_from
        for tok in piece.iterate_from(max(0, start_tok-offset)):
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
        tokens = self.read_block(self._stream)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/mte.py", line 25, in read_block
        return list(filter(lambda x: x is not None, XMLCorpusView.read_block(self, stream, tagspec, elt_handler)))
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 307, in read_block
        xml_fragment = self._read_xml_fragment(stream)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 252, in _read_xml_fragment
        xml_block = stream.read(self._BLOCK_SIZE)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1097, in read
        chars = self._read(size)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1367, in _read
        chars, bytes_decoded = self._incr_decode(bytes)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1398, in _incr_decode
        return self.decode(bytes, 'strict')
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 59: invalid start byte

So, being the intrepid StackOverflow user that I am, I figured that either one or more of the files is corrupted, or one of the files contains a character that Python's utf-8 decoder doesn't know how to handle. I can be fairly confident about the integrity of the files (take my word for it), so I'm pursuing the second possibility.
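To try to isolate where the bad byte is, I have something like this rough sketch in mind (the directory is the same one I pass to MTECorpusReader, and I'm assuming every file in it can be checked the same way):

    import os

    corpus_dir = '/Users/me/Documents/0163'

    # Try to decode each file strictly as UTF-8 and report where the first failure occurs
    for name in sorted(os.listdir(corpus_dir)):
        path = os.path.join(corpus_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            raw = f.read()
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte
            print(name, 'fails at byte', err.start, 'value', hex(raw[err.start]))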

I tried the following to re-encode the 77 text files, with no visible effect:

    import os

    loglist = [name for name in os.listdir('.') if os.path.isfile(name)]

    for file in loglist:
        bufferfile = open(file, encoding='utf-8', errors='replace')
        bufferfile.close()

So my questions are:

1) Does my approach so far make sense, or have I gone down the wrong troubleshooting path?

2) Is it fair to conclude at this point that the problem must lie with the XML document, given that the UTF-8 error shows up very early (the byte at position 59) and that my utf-8 error-replacement script had no effect on the problem? If I'm wrong to assume this, how can I better isolate the issue?

3) If we can conclude that the problem is in the XML document, what is the best way to clean it up? Could I track down that hex byte, figure out which character it corresponds to, and change it? (A rough sketch of what I mean is below.)
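By "find that byte" I mean something like the following rough sketch; the XML filename here is just a placeholder for whatever the TEI document in my corpus directory is actually called:

    # Peek at the raw bytes around the offset reported in the traceback.
    # 'header.xml' is a placeholder name for the TEI XML document.
    with open('/Users/me/Documents/0163/header.xml', 'rb') as f:
        raw = f.read()

    # Position reported by the UnicodeDecodeError; it is relative to the block
    # being decoded, which for the first block is the start of the file.
    offset = 59
    print(raw[max(0, offset - 20):offset + 20])  # surrounding context, as bytes
    print(hex(raw[offset]))                      # the suspect byte itself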

Thank you in advance for your help!

+7
python utf-8 nltk
2 answers

Your conversion technique didn't work because you never actually read the file and never wrote it back out.
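For comparison, a version of your loop that actually rewrites each file might look roughly like this (a sketch only; it overwrites the files in place, so run it on a copy of the corpus):

    import os

    loglist = [name for name in os.listdir('.') if os.path.isfile(name)]

    for name in loglist:
        # Read the file, substituting U+FFFD for any undecodable bytes
        with open(name, encoding='utf-8', errors='replace') as f:
            text = f.read()
        # Write the cleaned text back out as valid UTF-8
        with open(name, 'w', encoding='utf-8') as f:
            f.write(text)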

0x80 is not a valid byte in UTF-8, nor in any of the iso-8859-* character sets. It is valid in the Windows code pages, but only Unicode can support Old English characters, so you have some very broken data.

To convert UTF-8 data that contains bad bytes, do:

    with open('input.txt', 'r', encoding='utf-8', errors='ignore') as input, \
         open('output.txt', 'w', encoding='utf-8') as output:
        output.write(input.read())

If you can't afford the data loss, you can instead use the encoding argument of MTECorpusReader:

    oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163', '.*', encoding='cp1252')

which will decode 0x80 as the euro sign (€).
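A quick illustration in the interactive shell:

    >>> b'\x80'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
    >>> b'\x80'.decode('cp1252')
    '€'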

+4

Unicode is not well supported in NLTK, generally speaking. And I suspect that since this is Old English, you're going to need some unusual characters.

There is, however, an option that can minimize the headaches. There is a library called spacy that is considerably more modern and powerful than NLTK. Here is the link.

spaCy requires that everything be Unicode, while NLTK does not, and that mismatch is not worth the headache. On top of that, NLTK's processing only works properly on fully decoded Unicode strings, which can lead to errors like this one.
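To give a feel for the API, here is a rough sketch using the current spaCy interface; spacy.blank('xx') builds a bare multi-language pipeline (tokenizer only), since there is no trained Old English model:

    import spacy

    # Blank multi-language pipeline: tokenization only, no trained components.
    nlp = spacy.blank('xx')

    # spaCy operates on str (Unicode) input only.
    doc = nlp('Hwæt! We Gardena in geardagum þeodcyninga þrym gefrunon')
    print([token.text for token in doc])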

0
