The resource joyceschan mentioned deals with detecting duplicate content and contains plenty of food for thought.
If you are looking for a quick comparison of key terms, the standard nltk functions may be all you need. With nltk you can look up synonyms for your terms by browsing the synsets contained in WordNet:
>>> from nltk.corpus import wordnet
>>> wordnet.synsets('donation')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
>>> wordnet.synsets('donations')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
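If you want the synonym words themselves rather than Synset objects, you can pull out the lemma names. A minimal sketch, assuming the WordNet data has been fetched with nltk.download('wordnet'); the exact output depends on your WordNet version:

>>> sorted({lemma.name() for s in wordnet.synsets('donation') for lemma in s.lemmas()})
['contribution', 'donation']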
It understands plurals, and the synset name also tells you which part of speech the synonym belongs to (the n in contribution.n.02 marks a noun).
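You can also ask a synset for its part of speech directly; a quick check:

>>> wordnet.synsets('donation')[0].pos()
'n'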
Synsets are arranged in a tree, with more specific terms at the leaves and more general terms toward the root. The more general terms are called hypernyms. You can measure similarity by how close two terms are to a common hypernym.
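You can walk that tree yourself. A small sketch; the exact synsets returned depend on your WordNet version:

>>> wordnet.synset('donation.n.01').hypernyms()       # one step toward the root
[Synset('gift.n.01')]
>>> wordnet.synset('donation.n.01').root_hypernyms()  # every noun path ends here
[Synset('entity.n.01')]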
Beware of different parts of speech: according to the NLTK cookbook, they have no overlapping paths, so you should not try to measure similarity between them.
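A cheap guard is to compare parts of speech first. A short sketch, using the verb donate as the mismatched term:

>>> n = wordnet.synset('donation.n.01')
>>> v = wordnet.synset('donate.v.01')
>>> (n.pos(), v.pos())
('n', 'v')
>>> n.pos() == v.pos()   # only measure similarity when these match
False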
Say you have two terms, donation and gift. You could get them from synsets, but in this example I initialized them directly:
>>> d = wordnet.synset('donation.n.01')
>>> g = wordnet.synset('gift.n.01')
The cookbook recommends the Wu-Palmer similarity measure:
>>> d.wup_similarity(g)
0.93333333333333335
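For contrast, try an unrelated pair; it should score noticeably lower (the exact number depends on your WordNet version, so I leave it out here):

>>> d.wup_similarity(wordnet.synset('table.n.02'))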
This approach gives you a quick way to check whether the terms you use correspond to related concepts. Check out Natural Language Processing with Python to find out what else NLTK can do to help you analyze text.
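Putting it together, here is a hypothetical helper (word_similarity is my own name, not an nltk function) that scores two words by the best Wu-Palmer score across their synsets of a given part of speech:

from nltk.corpus import wordnet

def word_similarity(word1, word2, pos=wordnet.NOUN):
    """Best Wu-Palmer score over all synset pairs, or None if none exists."""
    scores = [
        s1.wup_similarity(s2)
        for s1 in wordnet.synsets(word1, pos=pos)   # compare same-POS synsets only
        for s2 in wordnet.synsets(word2, pos=pos)
    ]
    return max((s for s in scores if s is not None), default=None)

For example, word_similarity('donation', 'gift') should come out close to the 0.93 above.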
Dragan chupacabric