What to download to make nltk.tokenize.word_tokenize work?

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by the space quota. At home, I downloaded all the NLTK resources with nltk.download(), but as I found out, it takes ~2.5 GB.

That seems like overkill to me. Could you suggest the minimal (or nearly minimal) set of dependencies for nltk.tokenize.word_tokenize? So far I have seen nltk.download('punkt'), but I am not sure whether it is enough and how large it is. What exactly do I need to run in order to make it work?

2 answers

You're right. You need the Punkt tokenizer models. They take about 13 MB, and nltk.download('punkt') should do the trick.
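If space on the cluster is tight, you can also point the download at a directory of your choosing. A minimal sketch (the target path below is just a placeholder for wherever your quota allows):

    import nltk

    # Download only the Punkt sentence tokenizer models (roughly 13 MB unzipped).
    # download_dir is optional; by default NLTK uses ~/nltk_data.
    nltk.download('punkt', download_dir='/path/to/your/nltk_data')  # hypothetical path

    # If you use a non-default location, make sure NLTK can find it at runtime:
    nltk.data.path.append('/path/to/your/nltk_data')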


In short:

    nltk.download('punkt')

would be enough.


In long:

You do not need to download all the models and corpora available in NLTK if you are only going to use NLTK for tokenization.

In fact, if you are just using word_tokenize(), you should not really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't use any additional resources:

    alvas@ubi:~$ ls nltk_data/
    chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
    alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
    alvas@ubi:~$ python
    Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
    [GCC 5.3.1 20160413] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from nltk import word_tokenize
    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> tokenizer = TreebankWordTokenizer()
    >>> tokenizer.tokenize('This is a sentence.')
    ['This', 'is', 'a', 'sentence', '.']

But:

    alvas@ubi:~$ ls nltk_data/
    chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
    alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
    alvas@ubi:~$ python
    Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
    [GCC 5.3.1 20160413] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from nltk import sent_tokenize
    >>> sent_tokenize('This is a sentence. This is another.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
        tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
        opened_resource = _open(resource_url)
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
        return find(path_, path + ['']).open()
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
        raise LookupError(resource_not_found)
    LookupError:
    **********************************************************************
      Resource u'tokenizers/punkt/english.pickle' not found.  Please
      use the NLTK Downloader to obtain the resource:  >>> nltk.download()
      Searched in:
        - '/home/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - u''
    **********************************************************************
    >>> from nltk import word_tokenize
    >>> word_tokenize('This is a sentence.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
        return [token for sent in sent_tokenize(text, language)
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
        tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
        opened_resource = _open(resource_url)
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
        return find(path_, path + ['']).open()
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
        raise LookupError(resource_not_found)
    LookupError:
    **********************************************************************
      Resource u'tokenizers/punkt/english.pickle' not found.  Please
      use the NLTK Downloader to obtain the resource:  >>> nltk.download()
      Searched in:
        - '/home/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - u''
    **********************************************************************

But that doesn't seem to be the case if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93. It turns out that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
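Paraphrasing the linked source, word_tokenize() there does roughly the following (a sketch of the code at that revision, not an exact copy):

    # Rough sketch: sentence-split first (this is what needs the punkt pickle),
    # then run the Treebank word tokenizer over each sentence.
    def word_tokenize(text, language='english'):
        return [token
                for sent in sent_tokenize(text, language)   # loads tokenizers/punkt/<language>.pickle
                for token in _treebank_word_tokenize(sent)]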

I'm not sure whether this is a bug or a feature, but it seems that the old idiom might become obsolete given the current code:

    >>> from nltk import sent_tokenize, word_tokenize
    >>> sentences = 'This is a foo bar sentence. This is another sentence.'
    >>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
    >>> tokenized_sents
    [['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]

And it can simply become:

    >>> word_tokenize(sentences)
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.


Alternatively, you can try the new tokenizer that was added to NLTK as toktok.py, based on https://github.com/jonsafari/tok-tok, which requires no pre-trained models.
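If your NLTK version already ships it, usage would look something like this (a sketch assuming the module is available as nltk.tokenize.toktok):

    >>> from nltk.tokenize.toktok import ToktokTokenizer
    >>> toktok = ToktokTokenizer()
    >>> toktok.tokenize('This is a foo bar sentence.')  # no punkt model needed
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']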

