I am trying to use the punkt tokenizer from the NLTK package with pyspark on a standalone Spark cluster. NLTK is installed on the individual nodes, but the nltk_data folder is not in the default location that NLTK expects (/usr/share/nltk_data).
The punkt tokenizer I am trying to use is located in /whatever/my_user/nltk_data.
I have set:

    envv1 = "/whatever/my_user/nltk_data"
    os.environ['NLTK_DATA'] = envv1
Printing nltk.data.path indicates that the first entry is where the nltk_data folder is located.
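Roughly, the check I do looks like this (same path as above):

    import os

    # set NLTK_DATA before importing nltk so the custom directory is picked up
    os.environ['NLTK_DATA'] = "/whatever/my_user/nltk_data"

    import nltk
    print(nltk.data.path)  # the custom directory shows up as the first entry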
The import from nltk import word_tokenize works fine, but when I actually call word_tokenize(), I get the following error:
ImportError: No module named nltk.tokenize
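For illustration, a minimal job of the kind that triggers this looks roughly like the following; the application name and input path are placeholders, not my actual code:

    from pyspark import SparkContext
    from nltk import word_tokenize

    sc = SparkContext(appName="tokenize-example")       # placeholder app name

    lines = sc.textFile("/whatever/my_user/input.txt")   # placeholder input file
    tokens = lines.map(word_tokenize)                    # word_tokenize runs on the worker nodes
    print(tokens.take(5))                                # the ImportError is raised here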
For some reason, I have no problem accessing resources from nltk.corpus. When I run nltk.download(), it shows that the punkt tokenizer has already been downloaded. I can even use the punkt tokenizer outside of pyspark.
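For example, something along these lines runs without errors locally (not my exact script, just a minimal check):

    import nltk
    from nltk import word_tokenize

    # punkt is found in the custom nltk_data directory
    print(nltk.data.find('tokenizers/punkt'))
    print(word_tokenize("This sentence tokenizes fine outside of pyspark."))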