Cannot use punkt tokenizer with pyspark

I am trying to use the punkt tokenizer from the NLTK package with pyspark on a stand-alone Spark cluster. NLTK is installed on each of the nodes, but the nltk_data folder is not in the default location NLTK expects (/usr/share/nltk_data).

The punkt tokenizer I want to use lives in /whatever/my_user/nltk_data instead.

I have set:

    import os

    envv1 = "/whatever/my_user/nltk_data"
    os.environ['NLTK_DATA'] = envv1

Printing nltk.data.path shows that the first entry is the directory where my nltk_data folder is located.
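
To be concrete, this is how I check the search path (a simplified sketch of what I run on the driver):

    import nltk
    print(nltk.data.path)
    # first entry: /whatever/my_user/nltk_data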

from nltk import word_tokenize works fine, but when I actually call word_tokenize() I get the following error:

 ImportError: No module named nltk.tokenize 

Oddly, I have no problems accessing resources from nltk.corpus. When I run nltk.download(), it shows that the punkt tokenizer has already been downloaded, and I can use it outside of pyspark without any issue.
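
For context, this is roughly how I end up calling the tokenizer from pyspark (a stripped-down sketch; the SparkContext setup and the sample sentences are placeholders, not my real job):

    import os
    os.environ['NLTK_DATA'] = "/whatever/my_user/nltk_data"

    from pyspark import SparkContext
    from nltk import word_tokenize

    sc = SparkContext(appName="punkt-test")

    sentences = sc.parallelize([
        "This is the first sentence.",
        "And here is another one.",
    ])

    # the ImportError is raised when the workers execute this lambda
    tokens = sentences.map(lambda s: word_tokenize(s)).collect()
    print(tokens)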
