Calculation of the relative Levenshtein distance - makes sense?

I use Daitch-Mokotoff and Damerau-Levenshtein soundtracks to find out if the user record and value in the application is โ€œthe sameโ€.

Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If a word has 4 letters ...

What I'm doing right now is distance / length to get a distance that better reflects what percentage of the word has been changed.

Is this a valid / proven approach? Or is it just stupid?

+7
compare words levenshtein distance linguistics fuzzy
source share
2 answers

Is it assumed that Levenshtein distance is used as an absolute value?

It looks like it will depend on your requirements. (To clarify: the level of Levelshtein absolute value, but, as the OP pointed out, the raw value may not be as useful as for a given application as a measure that takes into account the length of the word. This is because we are more interested in similarity than distance as such.)

I use both Daitch-Mokotoff sound and Damerau-Levenshtein to find out if the user record and value in the application are โ€œthe sameโ€.

It looks like you are trying to determine if the user of his record matches the same as the given data value?

Do you spell check? or match an invalid input to a known set of values? What are your priorities?

  • Minimize false positives (try to make sure that all the suggested words are very โ€œsimilarโ€ and the list of sentences is short).
  • Minimize false negatives (try to make sure that the line specified by the user is in the list of offers, even if it makes the list longer)
  • Maximum Accuracy Match

You can end up using Levenshtein distance in one way to determine whether to suggest a word in a list of sentences; and another way to determine how to order a list of offers.

It seems to me that if I correctly formulated your goal, then the main value that you want to measure is the similarity, and not the difference between the two lines. So you can use a Jaro or Jaro-Winkler distance , which takes into account the length of the lines and the number of common characters:

The distance jaro dj of the two given strings s1 and s2 are

(m / |s1| + m / |s2| + (m - t) / m) / 3 

Where:

  • m - the number of matching characters
  • t is the number of transpositions

The Yaro-Winkler distance uses the p scale prefix, which gives more favorable row ratings that correspond to starting with a given prefix length l.

+6
source share

The distance levenshtein is the relative meaning between the two words. Comparing LD with length does not matter, for example

cat โ†’ scat = 1 (similar to 75%)

difference โ†’ differences = 1 (90% similar?)

Both of these words have left distances of 1, i.e. differ by one character, but compared to their length, the second set seems more "similar."

I use soundexing to rank words that have the same left distance, e.g.

cat and fat both have LD 1 relative to kat , but when using soundex, the word is more likely to be kat than fat, assuming the word is spelled incorrectly and not incorrectly typed!)

So the short answer is simply using the left distance to determine the similarity.

0
source share

All Articles