Cannot use punkt tokenizer with pyspark

I am trying to use the punkt tokenizer from the NLTK package with pyspark on a stand-alone Spark cluster. NLTK is installed on each of the nodes, but the nltk_data folder is not in the default location NLTK expects (/usr/share/nltk_data).

The punkt tokenizer I want to use lives in /whatever/my_user/nltk_data instead.

I have set:

    import os

    envv1 = "/whatever/my_user/nltk_data"
    os.environ['NLTK_DATA'] = envv1

Printing nltk.data.path shows that the first entry is the directory where my nltk_data folder is located.
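
To be concrete, this is how I check the search path (a simplified sketch of what I run on the driver):

    import nltk
    print(nltk.data.path)
    # first entry: /whatever/my_user/nltk_data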

from nltk import word_tokenize works fine, but when I actually call word_tokenize() I get the following error:

 ImportError: No module named nltk.tokenize 

Oddly, I have no problems accessing resources from nltk.corpus. When I run nltk.download(), it shows that the punkt tokenizer has already been downloaded, and I can use it outside of pyspark without any issue.
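
For context, this is roughly how I end up calling the tokenizer from pyspark (a stripped-down sketch; the SparkContext setup and the sample sentences are placeholders, not my real job):

    import os
    os.environ['NLTK_DATA'] = "/whatever/my_user/nltk_data"

    from pyspark import SparkContext
    from nltk import word_tokenize

    sc = SparkContext(appName="punkt-test")

    sentences = sc.parallelize([
        "This is the first sentence.",
        "And here is another one.",
    ])

    # the ImportError is raised when the workers execute this lambda
    tokens = sentences.map(lambda s: word_tokenize(s)).collect()
    print(tokens)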
