I have a bunch of unrelated paragraphs, and I need to go through them to find similar occurrences. For example, when I search for "object falls", I want a boolean True for text containing:
- The box fell off the shelf
- Bulb destroyed on the ground
- A piece of plaster fell from the ceiling.
And False for:
- Guilt fell on Sarah
- The temperature dropped sharply.
I can use nltk to tokenise, tag and get WordNet synsets, but I find it hard to figure out how to tie nltk's moving parts together to achieve the desired result. Should I chunk before looking up the synsets? Should I write a context-free grammar? Is there a best practice for translating treebank tags to WordNet grammar tags? None of this is explained in the nltk book, and I could not find it in the nltk cookbook.
Bonus points for answers that use pandas.
[EDIT]:
Some code to get started
    In [1]: from nltk.tag import pos_tag
       ...: from nltk.tokenize import word_tokenize
       ...: from pandas import Series
       ...:
       ...: def tag(x):
       ...:     return pos_tag(word_tokenize(x))
       ...:
       ...: phrases = ['Box fell from shelf',
       ...:            'Bulb shattered on the ground',
       ...:            'A piece of plaster fell from the ceiling',
       ...:            'The blame fell on Sarah',
       ...:            'Berlin fell on May',
       ...:            'The temperature fell abruptly']
       ...:
       ...: ser = Series(phrases)
       ...: ser.map(tag)

    Out[1]:
    0    [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
    1    [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
    2    [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
    3    [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
    4    [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
    5    [(The, DT), (temperature, NN), (fell, VBD), (a...
    dtype: object
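For the tag translation step, one common approach is to map on the prefix of the treebank tag. The following is only a minimal sketch under stated assumptions: `treebank_to_wordnet` and `content_words` are hypothetical helper names, the sample tags are hand-written rather than real tagger output, and every tag outside the four WordNet word classes is simply dropped. Whether that is the best practice is exactly what I am unsure about.

```python
# Hypothetical helper (not part of nltk): map a Penn Treebank tag to the
# single-letter part of speech that WordNet uses
# ('n' noun, 'v' verb, 'a' adjective, 'r' adverb).
def treebank_to_wordnet(tag):
    if tag.startswith('JJ'):   # JJ, JJR, JJS
        return 'a'
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag.startswith('NN'):   # NN, NNS, NNP, NNPS
        return 'n'
    if tag.startswith('RB'):   # RB, RBR, RBS (but not RP, the particle tag)
        return 'r'
    return None                # determiners, prepositions, etc.


# Hypothetical helper: keep only the words that have a WordNet word class.
def content_words(tagged):
    pairs = []
    for word, tag in tagged:
        pos = treebank_to_wordnet(tag)
        if pos is not None:
            pairs.append((word.lower(), pos))
    return pairs


# Hand-written (word, tag) pairs for illustration, not real pos_tag output.
tagged = [('The', 'DT'), ('blame', 'NN'), ('fell', 'VBD'),
          ('on', 'IN'), ('Sarah', 'NNP')]
print(content_words(tagged))
```

With the Series from the session, `ser.map(tag).map(content_words)` should give one list of pairs per phrase, and each pair could then be passed to `wordnet.synsets(word, pos)`. This does not yet distinguish a literal fall from a figurative one; it only narrows down which synsets get looked up.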