This is the first time I've asked a question on StackOverflow, but you have all collectively saved so many of my projects over the years that I already feel at home.
I'm using Python 3.5 and NLTK to parse the Complete Corpus of Old English, which was delivered to me as 77 text files plus an XML document that stitches the files together as adjacent corpus segments in TEI format. Here's the relevant part of the header from the XML document, showing that we're essentially working with TEI:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader type="ISBD-ER">
    <fileDesc>
```
Right now, as a test, I'm simply trying to use NLTK's MTECorpusReader to open the corpus and call its words() method, just to prove I can read it. I'm doing all of this from the interactive Python shell, purely for the convenience of testing. All I really do is instantiate the reader and evaluate words().
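For reference, the shell session is essentially this sketch (the corpus root and the fileids pattern are placeholders for my local layout, not the real paths):

```python
from nltk.corpus.reader.mte import MTECorpusReader

# root and the fileids pattern are placeholders for my local corpus layout
reader = MTECorpusReader(root='.', fileids=r'.*\.txt')
# Evaluating reader.words() in the shell is what prints the traceback below:
# its repr iterates over the corpus, which forces NLTK to decode each file
```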
When I try this, I get the following traceback:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 397, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/mte.py", line 25, in read_block
    return list(filter(lambda x: x is not None, XMLCorpusView.read_block(self, stream, tagspec, elt_handler)))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 307, in read_block
    xml_fragment = self._read_xml_fragment(stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 252, in _read_xml_fragment
    xml_block = stream.read(self._BLOCK_SIZE)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1097, in read
    chars = self._read(size)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 59: invalid start byte
```
So, being a dutiful StackOverflow-sketeer, I concluded that either one or more of the files is damaged, or some file contains a character that Python's UTF-8 decoder doesn't know how to handle. I'm pretty confident about the integrity of the files themselves (take my word for it), so I'm pursuing the encoding angle.
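To narrow down which of the 77 text files (or the XML document) actually trips the decoder, and at what offset, a quick stdlib-only check like this sketch should work — it scans the current directory, which I'm assuming is the corpus root:

```python
import os

def find_undecodable(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if clean."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        return None
    except UnicodeDecodeError as err:
        return err.start

# Report every corpus file that fails strict UTF-8 decoding, and where
for name in sorted(os.listdir('.')):
    if os.path.isfile(name):
        offset = find_undecodable(name)
        if offset is not None:
            print('%s: invalid byte at offset %d' % (name, offset))
```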
I tried re-encoding the 77 text files as follows, with no visible effect:
```python
import os

loglist = [name for name in os.listdir('.') if os.path.isfile(name)]
for file in loglist:
    bufferfile = open(file, encoding='utf-8', errors='replace')
    bufferfile.close()
```
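Looking at it again, that loop only ever opens and closes each file: errors='replace' affects what read() would return, but I never read anything or write anything back. A version that actually scrubs the bytes would presumably have to rewrite each file, something like this sketch (destructive, so shown as a function only):

```python
def scrub_to_utf8(path):
    """Decode a file with undecodable bytes replaced by U+FFFD, then write it back as clean UTF-8."""
    with open(path, encoding='utf-8', errors='replace') as src:
        text = src.read()
    with open(path, 'w', encoding='utf-8') as dst:
        dst.write(text)

# e.g. for name in loglist: scrub_to_utf8(name)
```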
So my questions are:
1) Does my approach make sense so far, or have I wandered off into the troubleshooting weeds?
2) Is it fair to conclude at this point that the problem lies in the XML document, based on the fact that the UTF-8 error appears very early (byte position 59) and the fact that my errors='replace' pass over the text files changed nothing? If I'm wrong to assume this, how can I better isolate the problem?
3) If we can conclude that the problem is in the XML document, what's the best way to clean it up? Could I try to find that hex byte and swap in the ASCII character it's supposed to be?
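If byte-swapping is a sane approach, I imagine doing it in binary mode so no decoding happens at all — something like this hypothetical helper, where 0x80 is just the byte from the traceback and '?' is a stand-in for whatever character it should actually be:

```python
def patch_byte(path, bad=b'\x80', replacement=b'?'):
    """Replace a raw byte sequence in binary mode, sidestepping UTF-8 decoding entirely."""
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(data.replace(bad, replacement))
```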
Thank you in advance for your help!