Python (nltk) - UnicodeDecodeError: 'ascii' codec can't decode byte

Question

I am new to NLTK and I get the error below. I searched for encoding/decoding problems, and for UnicodeDecodeError in particular, but this error seems to come from inside the NLTK source code.

Here's the error:

 Traceback (most recent call last):
   File "A:\Python\Projects\Test\main.py", line 2, in <module>
     print(pos_tag(word_tokenize("John big idea isn't all that bad.")))
   File "A:\Python\Python\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
     tagger = load(_POS_TAGGER)
   File "A:\Python\Python\lib\site-packages\nltk\data.py", line 779, in load
     resource_val = pickle.load(opened_resource)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

How do I solve this error?

This is the code that causes the error:

 from nltk import pos_tag, word_tokenize
 print(pos_tag(word_tokenize("John big idea isn't all that bad.")))

python compiler-errors error-handling nltk

user3422952 Aug 25 '14 at 20:18

4 answers

Simone dagli orti · Answer 1 · 2015-01-14T17:06:43+0000

Try this (NLTK 3.0.1 with Python 2.7.x):

 import io
 # txtFile is the path to your text file
 f = io.open(txtFile, 'rU', encoding='utf-8')
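For reading text files that contain non-ASCII characters, here is a self-contained sketch of the same idea. The sample text and the temporary file are made up for illustration. The sketch writes a small UTF-8 file and reads it back with an explicit encoding, and it should behave the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters.
sample = u"caf\u00e9 na\u00efve"

# Write the sample to a temporary UTF-8 file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading with an explicit encoding returns decoded text (unicode on
# Python 2, str on Python 3) instead of relying on a platform default.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(text == sample)  # True
```

Note that this only covers decoding your own input files. The traceback in the question is raised while NLTK loads its pickled tagger, so it may not be affected by how you open text files.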

Luckymatina · Answer 2 · 2014-09-28T17:39:39+0000

I had the same problem as you. I am using Python 3.4 on Windows 7.

I had installed "nltk-3.0.0.win32.exe" (from here). When I installed "nltk-3.0a4.win32.exe" instead (from here), my problem with nltk.pos_tag was resolved. Check it out.

EDIT: If the second link does not work, you can look here.

Dave · Answer 3 · 2014-09-03T02:22:38+0000

Duplicate: NLTK 3 POS_TAG throws UnicodeDecodeError

In short: NLTK 2.x is not compatible with Python 3. You need to use NLTK 3, which currently seems a bit experimental.
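To illustrate where the message in the traceback can come from: one plausible mechanism, assuming the tagger pickle was written by Python 2, is that Python 3's pickle module decodes Python 2 str objects as ASCII by default. The sketch below uses a tiny hand-built pickle as a stand-in (it is not NLTK's actual tagger file) and requires Python 3.

```python
import pickle

# Hand-written protocol-2 pickle of the one-byte Python 2 str '\xcb'.
# This stands in for a pickle written by Python 2; it is NOT NLTK's tagger file.
PY2_PICKLE = b"\x80\x02U\x01\xcbq\x00."


def try_load(data, **kwargs):
    """Return the unpickled value, or the UnicodeDecodeError message."""
    try:
        return pickle.loads(data, **kwargs)
    except UnicodeDecodeError as exc:
        return str(exc)


# Default: Python 3 decodes Python 2 str data as ASCII and fails on byte 0xcb.
default_result = try_load(PY2_PICKLE)
# latin-1 maps every byte to exactly one character, so decoding cannot fail.
latin1_result = try_load(PY2_PICKLE, encoding="latin-1")
# "bytes" skips decoding and returns the raw bytes object.
bytes_result = try_load(PY2_PICKLE, encoding="bytes")

print(default_result)
print(repr(latin1_result))
print(repr(bytes_result))
```

The first call reproduces the same kind of message as in the question. Whether passing a different encoding is enough for the real tagger file is not something this sketch shows. Using an NLTK release whose data loading supports Python 3 is the safer route.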

Shivamshaz · Answer 4 · 2014-09-05T09:36:40+0000

Try using the "textclean" module:

 pip install textclean

Python code:

 from nltk import pos_tag, word_tokenize
 from textclean.textclean import textclean

 # Clean the text before tokenizing and tagging it
 text = textclean.clean("John big idea isn't all that bad.")
 print(pos_tag(word_tokenize(text)))
