The resource joyceschan mentioned deals with detecting duplicate content and contains plenty of food for thought.
If you are looking for a quick comparison of key terms, the standard nltk functions may be all you need. With nltk you can look up synonyms for your terms by browsing the synsets contained in WordNet:
>>> from nltk.corpus import wordnet
>>> wordnet.synsets('donation')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
>>> wordnet.synsets('donations')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
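If you want the synonym words themselves rather than Synset objects, you can pull out the lemma names. A minimal sketch, assuming the WordNet data has been fetched with nltk.download('wordnet'); the exact output depends on your WordNet version:

>>> sorted({lemma.name() for s in wordnet.synsets('donation') for lemma in s.lemmas()})
['contribution', 'donation']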
It understands plurals, and the synset name also tells you which part of speech the synonym belongs to (the n in contribution.n.02 marks a noun).
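You can also ask a synset for its part of speech directly; a quick check:

>>> wordnet.synsets('donation')[0].pos()
'n'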
Synsets are arranged in a tree, with more specific terms at the leaves and more general terms toward the root. The more general terms are called hypernyms. You can measure similarity by how close two terms are to a common hypernym.
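You can walk that tree yourself. A small sketch; the exact synsets returned depend on your WordNet version:

>>> wordnet.synset('donation.n.01').hypernyms()       # one step toward the root
[Synset('gift.n.01')]
>>> wordnet.synset('donation.n.01').root_hypernyms()  # every noun path ends here
[Synset('entity.n.01')]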
Beware of different parts of speech: according to the NLTK cookbook, they have no overlapping paths, so you should not try to measure similarity between them.
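A cheap guard is to compare parts of speech first. A short sketch, using the verb donate as the mismatched term:

>>> n = wordnet.synset('donation.n.01')
>>> v = wordnet.synset('donate.v.01')
>>> (n.pos(), v.pos())
('n', 'v')
>>> n.pos() == v.pos()   # only measure similarity when these match
False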
Say you have two terms, donation and gift. You could get them from synsets, but in this example I initialized them directly:
>>> d = wordnet.synset('donation.n.01')
>>> g = wordnet.synset('gift.n.01')
The cookbook recommends the Wu-Palmer similarity measure:
>>> d.wup_similarity(g)
0.93333333333333335
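For contrast, try an unrelated pair; it should score noticeably lower (the exact number depends on your WordNet version, so I leave it out here):

>>> d.wup_similarity(wordnet.synset('table.n.02'))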
This approach gives you a quick way to check whether the terms you use correspond to related concepts. Check out Natural Language Processing with Python to find out what else NLTK can do to help you analyze text.
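Putting it together, here is a hypothetical helper (word_similarity is my own name, not an nltk function) that scores two words by the best Wu-Palmer score across their synsets of a given part of speech:

from nltk.corpus import wordnet

def word_similarity(word1, word2, pos=wordnet.NOUN):
    """Best Wu-Palmer score over all synset pairs, or None if none exists."""
    scores = [
        s1.wup_similarity(s2)
        for s1 in wordnet.synsets(word1, pos=pos)   # compare same-POS synsets only
        for s2 in wordnet.synsets(word2, pos=pos)
    ]
    return max((s for s in scores if s is not None), default=None)

For example, word_similarity('donation', 'gift') should come out close to the 0.93 above.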
Dragan chupacabric