At first I thought of using regular expressions to run extra tests that discriminate between low-ratio matches. That can solve one specific kind of problem, for example words ending in "ing". But it is only a limited case, and there can be numerous other cases, each of which would need its own specific treatment.
Then I thought we could look for an additional criterion that excludes words that do not match semantically but share enough characters to be detected as a match, even though the ratio is low, WHILE at the same time still capturing terms that really do match semantically but have a low ratio because they are short.
Here is one possibility:
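    from difflib import SequenceMatcher

    interests = ('shooting', 'gaming', 'looping')
    keywords = ('loop', 'looping', 'game')

    s = SequenceMatcher(None)

    limit = 0.50

    for interest in interests:
        # seq2 is the sequence SequenceMatcher caches, so set it once per interest
        s.set_seq2(interest)
        for keyword in keywords:
            s.set_seq1(keyword)
            # match only if the ratio is high enough AND there is a single
            # contiguous matching block (the list always ends with a dummy block)
            b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
            print('%10s %-10s %f %s' % (interest, keyword, s.ratio(),
                                        '** MATCH **' if b else ''))
        print('')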
gives
      shooting loop       0.333333
      shooting looping    0.666667
      shooting game       0.166667

        gaming loop       0.000000
        gaming looping    0.461538
        gaming game       0.600000 ** MATCH **

       looping loop       0.727273 ** MATCH **
       looping looping    1.000000 ** MATCH **
       looping game       0.181818
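To see why "shooting"/"looping" is rejected despite its 0.67 ratio while "gaming"/"game" is accepted, it helps to look at the matching blocks themselves. The sketch below is only an illustration: `real_blocks` is a hypothetical helper name, not part of difflib. It relies on the documented fact that `get_matching_blocks()` always ends with a dummy block of size 0, so a length of 2 means exactly one contiguous run of shared characters.

```python
from difflib import SequenceMatcher

def real_blocks(a, b):
    """Return the substrings of `a` that SequenceMatcher matched in `b`.

    Hypothetical helper for illustration; the terminating dummy block
    (size 0) that get_matching_blocks() always appends is dropped.
    """
    s = SequenceMatcher(None, a, b)
    return [a[m.a:m.a + m.size] for m in s.get_matching_blocks() if m.size > 0]

# Two separate runs of shared letters: high ratio, but rejected by the test.
print(real_blocks('looping', 'shooting'))   # ['oo', 'ing']

# One contiguous run: accepted by the test.
print(real_blocks('game', 'gaming'))        # ['gam']
```

So the extra condition asks that the shared characters form a single stretch, which is what separates a common stem from letters that merely happen to coincide.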
Note this from the documentation:
SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.
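The same test can be wrapped in a small function that follows this advice. This is a sketch under assumptions: `matching_keywords` and its `limit` parameter are names made up for the example, and the 0.5 default simply repeats the threshold used above.

```python
from difflib import SequenceMatcher

def matching_keywords(interest, keywords, limit=0.5):
    """Return the keywords that match `interest` (hypothetical helper).

    A keyword matches when the similarity ratio reaches `limit` and the
    shared characters form one contiguous block.
    """
    s = SequenceMatcher(None)
    s.set_seq2(interest)            # cached once, reused for every keyword
    hits = []
    for keyword in keywords:
        s.set_seq1(keyword)         # only the first sequence changes
        if s.ratio() >= limit and len(s.get_matching_blocks()) == 2:
            hits.append(keyword)
    return hits

keywords = ('loop', 'looping', 'game')
for interest in ('shooting', 'gaming', 'looping'):
    print(interest, matching_keywords(interest, keywords))
```

With the sample data this should report no keywords for "shooting", "game" for "gaming", and "loop" and "looping" for "looping", in line with the table above.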