Find similar phrases with nltk

I have a bunch of unrelated paragraphs, and I need to traverse them to find similar occurrences; for example, given a search where I look for "object falls", I want to get a boolean True for text containing:

  • The box fell off the shelf
  • Bulb shattered on the ground
  • A piece of plaster fell from the ceiling.

And False for:

  • The blame fell on Sarah
  • The temperature dropped sharply.

I can use nltk to tokenise, tag and get Wordnet synsets, but I am finding it hard to figure out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking for synsets? Should I write a context-free grammar? Is there a best practice when translating from treebank tags to Wordnet POS tags? None of this is explained in the nltk book, and I could not find it in the nltk cookbook either.

Bonus points for answers that include pandas in the code.


[EDIT]:

Some code to get started

    In [1]:
    from nltk.tag import pos_tag
    from nltk.tokenize import word_tokenize
    from pandas import Series

    def tag(x):
        return pos_tag(word_tokenize(x))

    phrases = ['Box fell from shelf',
               'Bulb shattered on the ground',
               'A piece of plaster fell from the ceiling',
               'The blame fell on Sarah',
               'Berlin fell on May',
               'The temperature fell abruptly']

    ser = Series(phrases)
    ser.map(tag)

    Out[1]:
    0    [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
    1    [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
    2    [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
    3    [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
    4    [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
    5    [(The, DT), (temperature, NN), (fell, VBD), (a...
    dtype: object
Tags: python, search, nlp, nltk
2 answers

Here is how I would do it:

Use nltk to find nouns followed by one or two verbs. To match your exact specifications I would use Wordnet: the only nouns (NN, NNP, PRP, NNS) that should be matched are the ones that are in a semantic relation with "physical" or "material", and the only verbs (VB, VBZ, VBD, etc.) that should be matched are the ones that are in a semantic relation with "fall".
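
A minimal sketch of that semantic check, assuming the standard nltk.corpus.wordnet reader; the target synsets 'physical_entity.n.01' and 'fall.v.01' and the helper name related_to are illustrative choices, not the only sensible ones:

    from nltk.corpus import wordnet as wn

    def related_to(word, pos, target_synset_name):
        """Return True if some synset of `word` is the target synset
        or has it among its hypernym ancestors."""
        target = wn.synset(target_synset_name)
        for syn in wn.synsets(word, pos):
            if syn == target or target in syn.closure(lambda s: s.hypernyms()):
                return True
        return False

    related_to('box', 'n', 'physical_entity.n.01')    # True: box descends from physical_entity
    related_to('blame', 'n', 'physical_entity.n.01')  # False: blame is an abstraction
    related_to('fall', 'v', 'fall.v.01')              # True: it is the target itself

Note that verb hierarchies in Wordnet are shallow, so for verbs a similarity measure such as path_similarity may work better than a strict hypernym test.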

I mentioned "one or two verbs" because a verb can be preceded by an auxiliary. What you could also do is build a dependency tree to spot subject-verb relations, but it does not seem necessary in this case.

You might also want to exclude the names of places and keep the names of people (because you would accept "John fell" but not "Berlin fell"). This can also be done with Wordnet: places get the tag 'noun.location'.
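
A hedged sketch of that place filter, assuming a recent nltk where lexname() is a synset method; the helper name could_be_place is made up for illustration:

    from nltk.corpus import wordnet as wn

    def could_be_place(word):
        """True if any noun synset of `word` is filed under 'noun.location'."""
        return any(s.lexname() == 'noun.location' for s in wn.synsets(word, 'n'))

    could_be_place('Berlin')  # True: the synset for the German capital is tagged 'noun.location'
    could_be_place('Sarah')   # False: Sarah has no 'noun.location' synset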

I am not sure in which context you would have to convert the tags, so I cannot give a definite answer to that; it seems to me that you might not need it in this case: you use the POS tags to identify nouns and verbs, and then you check whether each noun and verb belongs to the synsets you are after.
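
If the tag conversion does turn out to be needed, the usual convention is to map the first letters of the Penn Treebank tag onto the WordNet POS constants; a short sketch (unmapped tags are simply skipped):

    from nltk.corpus import wordnet as wn

    def treebank_to_wordnet(treebank_tag):
        """Map a Penn Treebank POS tag to a WordNet POS constant, or None."""
        if treebank_tag.startswith('NN'):
            return wn.NOUN  # 'n'
        if treebank_tag.startswith('VB'):
            return wn.VERB  # 'v'
        if treebank_tag.startswith('JJ'):
            return wn.ADJ   # 'a'
        if treebank_tag.startswith('RB'):
            return wn.ADV   # 'r'
        return None         # determiners, prepositions, etc. have no WordNet entries

    treebank_to_wordnet('VBD')  # 'v'
    treebank_to_wordnet('DT')   # None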

Hope this helps.


Not perfect, but most of the work is there. What remains is hard-coding pronouns (such as "it") and closed-class words, and adding multiple synset targets to handle cases like "shattered". Not a trivial task, but not an impossible one either!

    from nltk.tag import pos_tag
    from nltk.tokenize import word_tokenize
    from pandas import Series, DataFrame
    from collections.abc import Iterable
    from nltk.corpus import wordnet as wn


    def tag(x):
        return pos_tag(word_tokenize(x))


    def flatten(l):
        for el in l:
            if isinstance(el, Iterable) and not isinstance(el, str):
                for sub in flatten(el):
                    yield sub
            else:
                yield el


    def noun_verb_match(phrase, nouns, verbs):
        res = []
        for i in range(len(phrase) - 1):
            if (phrase[i][1] in nouns) and (phrase[i + 1][1] in verbs):
                res.append((phrase[i], phrase[i + 1]))
        return res


    def hypernym_paths(word, pos):
        res = [x.hypernym_paths() for x in wn.synsets(word, pos)]
        return set(flatten(res))


    def bool_syn(double, noun_syn, verb_syn):
        """
        Returns True if the noun/verb double contains the target Wordnet Synsets.

        Arguments:
        double: ((noun, tag), (verb, tag))
        noun_syn, verb_syn: Wordnet Synset string (e.g. 'travel.v.01')
        """
        noun = double[0][0]
        verb = double[1][0]
        noun_bool = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
        verb_bool = wn.synset(verb_syn) in hypernym_paths(verb, 'v')
        return noun_bool & verb_bool


    def bool_loop(l, f):
        """
        Applies f to every list element and returns True if any result is True.

        Arguments:
        l: List.
        f: Function returning boolean.
        """
        if len(l) == 0:
            return False
        return f(l[0]) | bool_loop(l[1:], f)


    def bool_noun_verb(series, nouns, verbs, noun_synset_target, verb_synset_target):
        tagged = series.map(tag)
        nvm = lambda x: noun_verb_match(x, nouns, verbs)
        matches = tagged.apply(nvm)
        bs = lambda x: bool_syn(x, noun_synset_target, verb_synset_target)
        return matches.apply(lambda x: bool_loop(x, bs))


    phrases = ['Box fell from shelf',
               'Bulb shattered on the ground',
               'A piece of plaster fell from the ceiling',
               'The blame fell on Sarah',
               'Berlin fell on May',
               'The temperature fell abruptly',
               'It fell on the floor']

    nouns = "NN NNP PRP NNS".split()
    verbs = "VB VBD VBZ".split()
    noun_synset_target = 'artifact.n.01'
    verb_synset_target = 'travel.v.01'

    df = DataFrame()
    df['text'] = Series(phrases)
    df['fall'] = bool_noun_verb(df.text, nouns, verbs,
                                noun_synset_target, verb_synset_target)
    df
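
One way to tackle the "multiple targets" point above is to accept several verb synset targets instead of a single one, so that phrases like 'Bulb shattered on the ground' can match as well. A sketch building on the hypernym_paths helper from the code above; the function name bool_syn_multi is invented here, and the extra target string is left for you to pick after inspecting wn.synsets('shatter', 'v'):

    def bool_syn_multi(double, noun_syn, verb_syns):
        """Like bool_syn, but accepts a list of candidate verb synset targets."""
        noun, verb = double[0][0], double[1][0]
        noun_ok = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
        verb_ok = any(wn.synset(v) in hypernym_paths(verb, 'v') for v in verb_syns)
        return noun_ok and verb_ok

    # e.g. keep 'travel.v.01' and add whichever 'shatter'/'break' synset fits your data
    verb_targets = ['travel.v.01']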
