POS-Tagger is incredibly slow

Question

I use nltk to generate n-grams from sentences, after first removing the given stop words. However, nltk.pos_tag() is very slow: it takes up to 0.6 seconds per sentence on my CPU (Intel i7).

Output:

    ['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
    0.620481014252
    ["It simply the best meal in NYC."]
    0.640982151031
    ['You cannot go wrong at the Red Eye Grill.']
    0.644664049149

The code:

    for sentence in source:
        nltk_ngrams = None
        if stop_words is not None:
            start = time.time()
            sentence_pos = nltk.pos_tag(word_tokenize(sentence))
            print(time.time() - start)
            filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
        else:
            filtered_words = ngrams(sentence.split(), n)

Is it really this slow, or am I doing something wrong?

+5

python nlp nltk pos-tagger

Stefan falk Nov 12 '15 at 16:32

3 answers

NLTK's pos_tag is defined as:

    from nltk.tag.perceptron import PerceptronTagger

    def pos_tag(tokens, tagset=None):
        tagger = PerceptronTagger()
        return _pos_tag(tokens, tagset, tagger)

Therefore, each call to pos_tag creates a new instance of the PerceptronTagger class (constructing it loads the trained model), and that takes up most of the computation time. You can save this time by creating the tagger once and calling tagger.tag directly:

    from nltk.tag.perceptron import PerceptronTagger

    tagger = PerceptronTagger()
    sentence_pos = tagger.tag(word_tokenize(sentence))
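Applied to the loop in the question, the same idea might look roughly like the sketch below. It assumes Python 3 and an installed NLTK perceptron model; the helper names `load_tagger` and `filter_by_pos` are made up for illustration and are not part of NLTK. The tagger is passed in as an argument so that it is built once, outside the loop.

```python
def load_tagger():
    """Build the tagger once (assumes NLTK and its perceptron model are installed)."""
    from nltk.tag.perceptron import PerceptronTagger
    return PerceptronTagger()


def filter_by_pos(sentences, tagger, tokenize, excluded_tags):
    """Tag every sentence with one shared tagger and drop words whose tag is excluded."""
    results = []
    for sentence in sentences:
        tagged = tagger.tag(tokenize(sentence))  # no new tagger per sentence
        results.append([word for word, pos in tagged if pos not in excluded_tags])
    return results


# Hypothetical usage with the names from the question:
#   from nltk import word_tokenize
#   tagger = load_tagger()  # slow step, done once
#   filtered = filter_by_pos(source, tagger, word_tokenize, stop_words)
```

Whether this removes all of the overhead depends on the NLTK version in use, so it is worth timing before and after.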

+5

Abhiram pappula Oct 4 '16 at 7:58

If you are looking for another fast POS tagger in Python, you can try RDRPOSTagger. For example, for English POS tagging, the tagging speed is 8K words/second for the single-threaded Python implementation on a 2.4 GHz Core 2 Duo computer. You can get a higher tagging speed simply by using multi-threaded mode. RDRPOSTagger achieves very competitive accuracy compared to state-of-the-art taggers and now supports pre-trained models for 40 languages. See the experimental results in this article.

0

NQD Nov 20 '15 at 7:51

alvas · Accepted Answer · 2015-11-12T16:58:33+0000

Use pos_tag_sents to tag multiple sentences:

    >>> import time
    >>> from nltk.corpus import brown
    >>> from nltk import pos_tag
    >>> from nltk import pos_tag_sents
    >>> sents = brown.sents()[:10]
    >>> start = time.time(); pos_tag(sents[0]); print(time.time() - start)
    0.934092998505
    >>> start = time.time(); [pos_tag(s) for s in sents]; print(time.time() - start)
    9.5061340332
    >>> start = time.time(); pos_tag_sents(sents); print(time.time() - start)
    0.939551115036
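To repeat this comparison on your own data, a small timing helper along these lines may be convenient. This is only a sketch, assuming Python 3; `timed` and `compare_tagging` are hypothetical names, and the two tagging callables are passed in rather than imported, so the helper itself does not depend on NLTK.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def compare_tagging(sentences, tag_one, tag_many):
    """Time per-sentence tagging against one batched call on the same input."""
    per_sentence, loop_seconds = timed(lambda: [tag_one(s) for s in sentences])
    batched, batch_seconds = timed(tag_many, sentences)
    return {
        "loop_seconds": loop_seconds,
        "batch_seconds": batch_seconds,
        "same_output": list(per_sentence) == list(batched),
    }


# Hypothetical usage with the functions from the session above:
#   from nltk import pos_tag, pos_tag_sents
#   from nltk.corpus import brown
#   print(compare_tagging(brown.sents()[:10], pos_tag, pos_tag_sents))
```

The `same_output` flag is a quick sanity check that the batched call returns the same tags as the loop.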
