NLTK Collocations for Specific Words

I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpora. The code is below.

I am not sure, however, about (1) how to get the collocations for a specific word, and (2) whether NLTK has a collocation measure based on the log-likelihood ratio.

    import nltk
    from nltk.collocations import TrigramCollocationFinder
    from nltk.tokenize import word_tokenize

    text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_words(word_tokenize(text))

    # Print every trigram together with its PMI score
    for i in finder.score_ngrams(trigram_measures.pmi):
        print(i)
3 answers

Try this code:

    import nltk
    from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    # Filter function: returns True (meaning "remove") for n-grams
    # that do NOT have 'creature' as a member
    creature_filter = lambda *w: 'creature' not in w

    ## Bigrams
    # Requires the Genesis corpus: nltk.download('genesis')
    finder = BigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    # only bigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only bigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood-ratio scores
    print(finder.nbest(bigram_measures.likelihood_ratio, 10))

    ## Trigrams
    finder = TrigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))
    # only trigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only trigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood-ratio scores
    print(finder.nbest(trigram_measures.likelihood_ratio, 10))

It uses the likelihood-ratio measure and also filters out n-grams that do not contain the word "creature".
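If you want to try the same idea without downloading a corpus, here is a minimal sketch on the toy sentence from the question. The token list, the `target` word, and the use of `str.split` instead of a real tokenizer are assumptions made for illustration. It uses `score_ngrams` rather than `nbest` so that the scores are returned alongside the n-grams.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures

# Toy tokens taken from the question's example sentence; str.split is
# used here only to avoid needing any downloaded tokenizer data.
tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()

target = "black"  # hypothetical target word, chosen for illustration

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# apply_ngram_filter removes every n-gram for which the function returns
# True, so return True when the target word is absent.
finder.apply_ngram_filter(lambda *words: target not in words)

# List of ((w1, w2), score) pairs, sorted from highest to lowest score
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)
for ngram, score in scored:
    print(ngram, round(score, 2))
```

Only bigrams that contain the target word should remain in `scored`.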


Question 1 - Try:

    target_word = "electronic"  # your choice of word
    # Remove every trigram that does not contain the target word
    finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))
    for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
        print(i)

The idea is to filter out everything you don't want. This function is typically used to filter on words in specific positions of the n-gram, and you can customize it to your heart's content.
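As a sketch of position-specific filtering, the variant below keeps only trigrams whose middle word is the target. The toy tokens (the question's example sentence), the `target_word` value, and the choice of the middle position are all assumptions for illustration.

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics.association import TrigramAssocMeasures

# Toy tokens from the question's example sentence (split on whitespace)
tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()

target_word = "bar"  # hypothetical target word

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)

# The filter sees the three words separately, so it can test a single
# position. Returning True removes the trigram; here every trigram whose
# middle word is not the target is dropped.
finder.apply_ngram_filter(lambda w1, w2, w3: w2 != target_word)

middle_matches = finder.score_ngrams(trigram_measures.likelihood_ratio)
for trigram, score in middle_matches:
    print(trigram, round(score, 2))
```

Swapping `w2` for `w1` or `w3` in the lambda restricts the match to the first or last position instead.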


Regarding question 2: yes! NLTK includes the likelihood ratio among its association measures. The first question remains unanswered!

http://nltk.org/api/nltk.metrics.html?highlight=likelihood_ratio#nltk.metrics.association.NgramAssocMeasures.likelihood_ratio
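For intuition about what a log-likelihood ratio score measures, here is a plain-Python sketch of the standard G^2 statistic for a bigram, computed from a 2x2 contingency table. This is not NLTK's code: the function name and argument names are made up for illustration, and NLTK's own implementation may differ in details such as how zero counts are smoothed.

```python
import math

def log_likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """G^2 for a bigram (w1, w2), a sketch under the assumptions above.

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams with w1 in first position
    n_xi: count of bigrams with w2 in second position
    n_xx: total number of bigrams
    """
    # Observed 2x2 table: rows = first word is / is not w1,
    # columns = second word is / is not w2.
    observed = [
        [n_ii, n_ix - n_ii],
        [n_xi - n_ii, n_xx - n_ix - n_xi + n_ii],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                # Expected count if w1 and w2 occurred independently
                e = row_totals[i] * col_totals[j] / n_xx
                g2 += o * math.log(o / e)
    return 2 * g2

# Independent words score about zero; strongly associated words score high.
print(log_likelihood_ratio(10, 20, 50, 100))
print(log_likelihood_ratio(20, 20, 20, 100))
```

A higher score means the observed counts deviate more from what independence would predict, which is why the measure is used to rank collocations.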
