NLTK Collocations for Specific Words

Question

I know how to get bigrams and trigrams using NLTK, and I apply them to my own corpora. The code is below.

However, I am not sure about two things: (1) How do I get collocations for a specific word? (2) Does NLTK have a collocation measure based on the log-likelihood ratio?

    import nltk
    from nltk.collocations import *
    from nltk.tokenize import word_tokenize

    text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_words(word_tokenize(text))

    # print every trigram with its PMI score, highest first
    for i in finder.score_ngrams(trigram_measures.pmi):
        print(i)

+8

python nltk collocation

Sabba Jan 16 '14 at 15:18

3 answers

Question 1 - Try:

    target_word = "electronic"  # your choice of word

    # drop every trigram that does not contain target_word
    finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))

    for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
        print(i)

The idea is to filter out everything you don't want: the finder removes every n-gram for which the filter function returns True. This method is usually used to filter on words at specific positions in the n-gram, and you can tweak it to your heart's content.
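To illustrate the position-specific filtering mentioned above, here is a minimal sketch that keeps only trigrams whose first word is the target. It reuses the sample text from the question; the choice of "sheep" as target_word and the plain whitespace split (instead of word_tokenize, to avoid needing tokenizer data) are assumptions made for the example.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # assumption: whitespace split is good enough here

target_word = "sheep"  # hypothetical choice of word

finder = TrigramCollocationFinder.from_words(tokens)

# apply_ngram_filter removes every n-gram for which the function returns
# True, so this keeps only trigrams whose FIRST word is target_word.
finder.apply_ngram_filter(lambda w1, w2, w3: w1 != target_word)

# score the surviving trigrams by likelihood ratio, highest score first
scored = finder.score_ngrams(TrigramAssocMeasures().likelihood_ratio)
for trigram, score in scored:
    print(trigram, score)
```

On this text the only survivors are ('sheep', 'foo', 'bar') and ('sheep', 'shep', 'bar'). Changing the condition to `w3 != target_word` would keep trigrams that end with the word instead.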

+2

dmvianna Jan 17 '14 at 4:22

Regarding question 2: yes! NLTK includes a likelihood ratio among its association measures. The first question remains unanswered!

http://nltk.org/api/nltk.metrics.html?highlight=likelihood_ratio#nltk.metrics.association.NgramAssocMeasures.likelihood_ratio
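As a rough sketch of how that measure can be plugged in, the snippet below scores bigrams from the question's sample text with likelihood_ratio rather than pmi. The whitespace split and the cutoff of 3 results are assumptions for the example, not part of the original code.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # assumption: simple whitespace tokenization

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# score_ngrams returns (ngram, score) pairs sorted from highest to lowest
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)

# nbest returns just the n-grams, without their scores
top3 = finder.nbest(bigram_measures.likelihood_ratio, 3)
print(top3)
```

Any scoring function on the measures object (pmi, likelihood_ratio, and so on) can be passed to score_ngrams or nbest in the same way.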

0

Sabba Jan 17 '14 at 3:57

bogs · Accepted Answer · 2014-01-17T11:54:31+0000

Try this code:

    import nltk
    from nltk.collocations import *

    # requires the genesis corpus: nltk.download('genesis')
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    # Filter function: True for n-grams WITHOUT 'creature' (these get removed)
    creature_filter = lambda *w: 'creature' not in w

    ## Bigrams
    finder = BigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    # only bigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only bigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood ratio
    print(finder.nbest(bigram_measures.likelihood_ratio, 10))

    ## Trigrams
    finder = TrigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    # only trigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only trigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood ratio
    print(finder.nbest(trigram_measures.likelihood_ratio, 10))

It scores n-grams with the likelihood ratio measure and filters out those that do not contain the word "creature".
