In short:
nltk.download('punkt')
would be enough.
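For example, a minimal sketch of the usual setup (assuming a default nltk_data location and English text):

import nltk

# Download the punkt sentence tokenizer models once;
# later calls just check that the package is up to date.
nltk.download('punkt')

from nltk import sent_tokenize, word_tokenize

text = 'This is a sentence. This is another.'
print(sent_tokenize(text))  # ['This is a sentence.', 'This is another.']
print(word_tokenize(text))  # ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.']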
In long:
You do not need to download all the models and corpora available in NLTK if you only intend to use NLTK for tokenization.

Actually, if you just use word_tokenize(), you should not really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't need any additional resources:
alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']
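So if you only need word-level tokenization of text you have already split into sentences, instantiating the Treebank tokenizer directly is one way to sidestep the punkt requirement entirely; a small sketch:

from nltk.tokenize import TreebankWordTokenizer

# TreebankWordTokenizer is purely rule/regex based,
# so it works without anything from nltk.download().
tokenizer = TreebankWordTokenizer()

for sent in ['This is a sentence.', 'This is another.']:
    print(tokenizer.tokenize(sent))
# ['This', 'is', 'a', 'sentence', '.']
# ['This', 'is', 'another', '.']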
But:
alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************
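If you do want to keep using word_tokenize() and just need to make sure punkt is present, one common pattern is to catch the LookupError and download on demand; the helper name here is only illustrative:

import nltk

def ensure_punkt():
    # Hypothetical helper: look for the punkt model and
    # download it only if the lookup fails.
    try:
        nltk.data.find('tokenizers/punkt/english.pickle')
    except LookupError:
        nltk.download('punkt')

ensure_punkt()
from nltk import word_tokenize
print(word_tokenize('This is a sentence.'))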
But that doesn't seem to be the case if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93 . It seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
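Based on the traceback above, the relevant logic in nltk/tokenize/__init__.py looks roughly like this (a paraphrase, not the exact source):

from nltk.data import load
from nltk.tokenize import TreebankWordTokenizer

_treebank_word_tokenize = TreebankWordTokenizer().tokenize

def sent_tokenize(text, language='english'):
    # load() raises the LookupError above if the pickled punkt model is missing.
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
    return tokenizer.tokenize(text)

def word_tokenize(text, language='english'):
    # Splitting into sentences first is what pulls in the punkt model.
    return [token for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]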
I'm not sure whether this is a bug or a feature, but it looks like the old idiom might be deprecated given the current code:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]
and it might simply be:
>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']
But we see that word_tokenize() flattens the list of lists of strings into a single flat list of strings.
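In other words, assuming the punkt model is installed, the one-call form is just the flattened version of the two-step idiom:

from nltk import sent_tokenize, word_tokenize

sentences = 'This is a foo bar sentence. This is another sentence.'

# Two-step idiom, then flattened by hand.
two_step = [tok for sent in sent_tokenize(sentences) for tok in word_tokenize(sent)]
# Single call over the raw text.
one_call = word_tokenize(sentences)

assert two_step == one_call  # both give the same flat list of tokens
print(one_call)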
Alternatively, you can try a new tokenizer that is being added to NLTK as toktok.py, based on https://github.com/jonsafari/tok-tok , which requires no pre-trained models.
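For example, assuming your NLTK version already ships the toktok module, usage would look roughly like this:

from nltk.tokenize.toktok import ToktokTokenizer

# ToktokTokenizer is regex based, so no nltk.download() resources are needed.
toktok = ToktokTokenizer()
print(toktok.tokenize('This is a foo bar sentence. This is another sentence.'))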