Check if two words are related to each other

I have two lists: one of a user's interests, and one of keywords describing a book. I want to recommend the book to the user based on his list of interests. I use the SequenceMatcher class from Python's difflib library to match similar words like "game", "games", "gamer", etc. Its ratio() method gives me a number in [0, 1] indicating how similar the two strings are. But I got stuck on one example: the similarity it computes between "looping" and "shooting" is 0.6667.
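As a quick illustration (not part of the original question), ratio() only measures character overlap, not meaning, which is why two unrelated words that happen to share letters can score as high as genuinely related forms:

```python
from difflib import SequenceMatcher

# "shooting" and "looping" share "oo" and "ing" (5 of 15 characters),
# so ratio() = 2*5/15 ~= 0.667 even though the words are unrelated.
print(SequenceMatcher(None, "shooting", "looping").ratio())  # ~0.667

# Related inflected forms also score highly, which is the intended use.
print(SequenceMatcher(None, "game", "games").ratio())  # ~0.889
```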

    for interest in self.interests:
        for keyword in keywords:
            s = SequenceMatcher(None, interest, keyword)
            match_freq = s.ratio()
            if match_freq >= self.limit:
                # print interest, keyword, match_freq
                final_score += 1
                break

Is there any other way to accomplish such a match in Python?

+4
3 answers

Firstly, a word can have many senses, so when you try to find similar words you may need some word-sense disambiguation: http://en.wikipedia.org/wiki/Word-sense_disambiguation .

Given a pair of words, if we take the similarity of their most similar pair of senses as an indicator of whether the two words are similar, we can try the following:

    from itertools import product

    from nltk.corpus import wordnet as wn

    wordx, wordy = "cat", "dog"
    sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

    maxscore = 0
    for i, j in product(sem1, sem2):
        score = i.wup_similarity(j)  # Wu-Palmer similarity
        # wup_similarity can return None for unrelated parts of speech
        if score is not None and score > maxscore:
            maxscore = score

There are other similarity measures you can use: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html . The only problem is when you meet words that are not in WordNet; in that case I suggest falling back to difflib .
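A minimal sketch of that fallback, assuming NLTK and its WordNet corpus are installed (the word_similarity helper is hypothetical, not from the answer):

```python
from difflib import SequenceMatcher
from itertools import product

try:
    from nltk.corpus import wordnet as wn
    wn.synsets("test")  # force the corpus to load; raises LookupError if missing
except (ImportError, LookupError):
    wn = None  # NLTK or the WordNet corpus is unavailable

def word_similarity(wordx, wordy):
    """Hypothetical helper: Wu-Palmer similarity when both words are in
    WordNet, otherwise a plain difflib character ratio as a fallback."""
    if wn is not None:
        sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
        if sem1 and sem2:
            scores = (i.wup_similarity(j) for i, j in product(sem1, sem2))
            return max((s for s in scores if s is not None), default=0.0)
    return SequenceMatcher(None, wordx, wordy).ratio()
```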

+10

At first I thought of regular expressions to do extra tests to distinguish between low-ratio matches. That might solve a specific problem, for example words ending in ing . But that is only one limited case, and there can be numerous other cases, each of which would need its own specific treatment.

Then I thought we could look for an additional criterion to exclude pairs that share enough characters to be scored as a match even though they are semantically unrelated, while at the same time still capturing genuinely related terms whose ratio is low simply because they are short.

Here is one possibility:

    from difflib import SequenceMatcher

    interests = ('shooting', 'gaming', 'looping')
    keywords = ('loop', 'looping', 'game')

    s = SequenceMatcher(None)
    limit = 0.50

    for interest in interests:
        s.set_seq2(interest)
        for keyword in keywords:
            s.set_seq1(keyword)
            # a match needs both a high enough ratio and a single contiguous
            # matching block (get_matching_blocks() returns the block plus a
            # zero-length sentinel, hence == 2)
            b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
            print('%10s %-10s %f %s' % (interest, keyword, s.ratio(),
                                        '** MATCH **' if b else ''))
        print()

gives

      shooting loop       0.333333
      shooting looping    0.666667
      shooting game       0.166667

        gaming loop       0.000000
        gaming looping    0.461538
        gaming game       0.600000 ** MATCH **

       looping loop       0.727273 ** MATCH **
       looping looping    1.000000 ** MATCH **
       looping game       0.181818
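To see why the len(s.get_matching_blocks()) == 2 test works: get_matching_blocks() always ends with a zero-length sentinel block, so a length of 2 means the two words share exactly one contiguous run of characters. A small demonstration:

```python
from difflib import SequenceMatcher

# "loop"/"looping" match in one contiguous block plus the sentinel: length 2.
print(SequenceMatcher(None, "loop", "looping").get_matching_blocks())
# [Match(a=0, b=0, size=4), Match(a=4, b=7, size=0)]

# "shooting"/"looping" match in two scattered blocks ("oo" and "ing")
# plus the sentinel: length 3, so the criterion rejects the pair.
print(SequenceMatcher(None, "shooting", "looping").get_matching_blocks())
```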

Note this passage from the documentation:

SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence with many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.

+4

That's because SequenceMatcher is based on edit distance or something like it. Semantic similarity is more suitable for your case, or a hybrid of the two.

Take a look at the NLTK package ( sample code ), since you are using Python, and maybe also this paper .

For people using C++, you can check this open source project for reference.

+3
