Alternative to Levenshtein and Trigram

Let's say I have two rows in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My software receives free-text inputs from a data source and must match these free texts against the predefined rows in the database (above).

For example, if the software receives the string 'Alabama University', it should recognize that this is more similar to (1) than to (2).

At first I thought about using a well-known string metric such as Damerau-Levenshtein or trigrams, but these lead to undesirable results, as you can see here:

http://fuzzy-string.com/Compare/Transform.aspx?r=Levi+Watkins+Learning+Center+-+Alabama+State+University&q=Alabama+University

http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University

Difference to (1): 37
Difference to (2): 14

(2) wins because it is much shorter than (1), although (1) contains both words of the search string (Alabama and University).
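The distances quoted above can be reproduced with a textbook character-level Levenshtein implementation (a sketch; the helper name is mine):

```ruby
# Classic dynamic-programming Levenshtein distance over characters.
def levenshtein(a, b)
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1,        # deletion
                 d[i][j - 1] + 1,        # insertion
                 d[i - 1][j - 1] + cost] # substitution
                .min
    end
  end
  d[a.length][b.length]
end

query = 'Alabama University'
puts levenshtein('Levi Watkins Learning Center - Alabama State University', query) # => 37
puts levenshtein('ETH Library', query)                                             # => 14
```

The 37 is almost entirely the cost of deleting the long prefix, which is exactly the length bias the question is about.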

I also tried trigrams (using the JavaScript fuzzySet library), but I got similar results there.

Is there a string metric that recognizes the similarity of the search string to (1)?

+7
levenshtein-distance string-metric
4 answers

You could instead look at Word Mover's Distance: https://github.com/mkusner/wmd . One clear advantage of this algorithm is that it takes the implied meaning of words into account when calculating the difference between documents. The paper can be found here.

+2

You could try using the normalized Levenshtein distance:

Li Yujian, Liu Bo, "A Normalized Levenshtein Distance Metric," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091-1095, June 2007, doi:10.1109/TPAMI.2007.1078, http://www.computer.org/csdl/trans/tp/2007/06/i1091-abs.html

They suggest normalizing the Levenshtein distance. By doing this, a difference of one character weighs more between sequences of length two than the same difference between sequences of length 10.
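This is not the exact metric from the paper, but even a crude normalization, e.g. dividing by the length of the longer string, already reverses the ranking from the question (a sketch; both helper names are mine):

```ruby
# Classic dynamic-programming Levenshtein distance over characters.
def levenshtein(a, b)
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# Crude normalization: divide by the longer string's length, so long
# candidates are not penalized merely for being long.
def normalized_levenshtein(a, b)
  levenshtein(a, b).to_f / [a.length, b.length].max
end

query = 'Alabama University'
puts normalized_levenshtein('Levi Watkins Learning Center - Alabama State University', query) # => 37/55, about 0.67
puts normalized_levenshtein('ETH Library', query)                                             # => 14/18, about 0.78
```

With this normalization, (1) now scores as closer to the query than (2).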

+1

You should change your approach:

Levenshtein distance is useful for calculating similarity in units, whether those units are "characters" or "words".

Conceptually, you consider Alabama and University (2 words) as 2 units, and you want to calculate a distance between words: the Levenshtein distance should then mean how many words lie between Alabama and University, which should be 1.

But you are applying a Levenshtein implementation that works on the characters within a string. That implementation will only work for matching single words, NOT sentences.

Better: implement your own Levenshtein algorithm over the "words" (e.g. using a BK-tree), and inside each match, compare the individual words again using Levenshtein over the "characters".

With this algorithm, your result for (1) should be a match with distance 1, and (2) should be no match.
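One way to sketch this two-level idea (leaving out the BK-tree, which only speeds up the lookup): split both strings into words, and for each query word take the minimum character-level Levenshtein distance to any word of the candidate, then sum these. Both helper names are mine:

```ruby
# Classic dynamic-programming Levenshtein distance over characters.
def levenshtein(a, b)
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# For every query word, find the closest candidate word (character-level
# Levenshtein) and sum those per-word distances. Exact word hits cost 0,
# so a long candidate containing all query words is no longer penalized.
def word_match_score(candidate, query)
  words = candidate.split
  query.split.sum { |qw| words.map { |w| levenshtein(qw, w) }.min }
end

query = 'Alabama University'
puts word_match_score('Levi Watkins Learning Center - Alabama State University', query) # => 0
puts word_match_score('ETH Library', query) # => some positive value
```

Candidate (1) scores a perfect 0 because both query words occur in it verbatim, while (2) gets a large score, which matches the intuition in the question.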

+1

Keyword Counting

You haven't really defined why you consider option (1) "closer", at least not in any algorithmic sense. It seems you base your expectation on the fact that option (1) has more matching keywords than option (2), so why not simply match based on the number of keywords found in each row?

For example, using Ruby 2.0:

    string1 = 'Levi Watkins Learning Center - Alabama State University'
    string2 = 'ETH Library'
    strings = [string1, string2]
    keywords = 'Alabama University'.split
    keycount = {}

    # Count matching keywords in each string.
    strings.each do |str|
      keyword_hits = Hash.new(0)
      keywords.each { |word| keyword_hits[word] += str.scan(/#{word}/).count }
      keycount[str] = keyword_hits.values.reduce(0, :+)
    end

    # Sort by keyword count (descending), and print results.
    keycount.sort_by { |_str, count| -count }.each { |str, count| pp "#{count}: #{str}" }

This will print:

"2: Levi Watkins Learning Center - Alabama State University"
"0: ETH Library"

which meets your expectations for this case. You might still want to run additional passes over the results with other algorithms to refine the ranking or break ties, but this should at least point you in the right direction.

0
