Damerau - Levenshtein Distance, adding a threshold

I have the following implementation, but I want to add a threshold, so if the result is larger, just stop calculating and returning.

How can i do this?

EDIT: here is my current code, threshold is not used yet ... the goal is that it is used

  public static int DamerauLevenshteinDistance(string string1, string string2, int threshold) { // Return trivial case - where they are equal if (string1.Equals(string2)) return 0; // Return trivial case - where one is empty if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2)) return (string1 ?? "").Length + (string2 ?? "").Length; // Ensure string2 (inner cycle) is longer if (string1.Length > string2.Length) { var tmp = string1; string1 = string2; string2 = tmp; } // Return trivial case - where string1 is contained within string2 if (string2.Contains(string1)) return string2.Length - string1.Length; var length1 = string1.Length; var length2 = string2.Length; var d = new int[length1 + 1, length2 + 1]; for (var i = 0; i <= d.GetUpperBound(0); i++) d[i, 0] = i; for (var i = 0; i <= d.GetUpperBound(1); i++) d[0, i] = i; for (var i = 1; i <= d.GetUpperBound(0); i++) { for (var j = 1; j <= d.GetUpperBound(1); j++) { var cost = string1[i - 1] == string2[j - 1] ? 0 : 1; var del = d[i - 1, j] + 1; var ins = d[i, j - 1] + 1; var sub = d[i - 1, j - 1] + cost; d[i, j] = Math.Min(del, Math.Min(ins, sub)); if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1]) d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost); } } return d[d.GetUpperBound(0), d.GetUpperBound(1)]; } } 
+4
source share
4 answers

Finally got it ... although it is not as useful as I hoped

  public static int DamerauLevenshteinDistance(string string1, string string2, int threshold) { // Return trivial case - where they are equal if (string1.Equals(string2)) return 0; // Return trivial case - where one is empty if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2)) return (string1 ?? "").Length + (string2 ?? "").Length; // Ensure string2 (inner cycle) is longer if (string1.Length > string2.Length) { var tmp = string1; string1 = string2; string2 = tmp; } // Return trivial case - where string1 is contained within string2 if (string2.Contains(string1)) return string2.Length - string1.Length; var length1 = string1.Length; var length2 = string2.Length; var d = new int[length1 + 1, length2 + 1]; for (var i = 0; i <= d.GetUpperBound(0); i++) d[i, 0] = i; for (var i = 0; i <= d.GetUpperBound(1); i++) d[0, i] = i; for (var i = 1; i <= d.GetUpperBound(0); i++) { var im1 = i - 1; var im2 = i - 2; var minDistance = threshold; for (var j = 1; j <= d.GetUpperBound(1); j++) { var jm1 = j - 1; var jm2 = j - 2; var cost = string1[im1] == string2[jm1] ? 0 : 1; var del = d[im1, j] + 1; var ins = d[i, jm1] + 1; var sub = d[im1, jm1] + cost; //Math.Min is slower than native code //d[i, j] = Math.Min(del, Math.Min(ins, sub)); d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub; if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1]) d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost); if (d[i, j] < minDistance) minDistance = d[i, j]; } if (minDistance > threshold) return int.MaxValue; } return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold ? int.MaxValue : d[d.GetUpperBound(0), d.GetUpperBound(1)]; } 
0
source

This refers to ur answer to this: Damerau - Levenshtein Distance, adding a threshold (sorry, I can not comment, because I have no reputation 50 yet)

I think you made a mistake here. You have initialized:

 var minDistance = threshold; 

And ur update rule:

 if (d[i, j] < minDistance) minDistance = d[i, j]; 

In addition, ur early exit criteria:

 if (minDistance > threshold) return int.MaxValue; 

Now note that the if condition above will never be met! You should rather initialize minDistance to int.MaxValue

+5
source

Here is the most elegant way I can think of. After setting each index d, check to see if it exceeds your threshold. The estimate is constant-time, so this is a drop in the bucket compared to the theoretical complexity of the N ^ 2 algorithm:

 public static int DamerauLevenshteinDistance(string string1, string string2, int threshold) { ... for (var i = 1; i <= d.GetUpperBound(0); i++) { for (var j = 1; j <= d.GetUpperBound(1); j++) { ... var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub)); if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1]) temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost); //Does this value exceed your threshold? if so, get out now if(temp > threshold) return temp; } } return d[d.GetUpperBound(0), d.GetUpperBound(1)]; } 
+1
source

You also asked this question as a UDF SQL CLR question, so I’ll answer in this specific context: you best optmiziation will not come from optimizing Levenshtein distance, but from reducing the number of pairs that you are comparing. Yes, a faster Levenshtein algorithm will improve the situation, but not as much as reducing the number of comparisons with the N square (from N in millions of lines) to N *. My suggestion is to compare only those elements that have a length difference in the allowable delta. On your large table, add a constant computed column to LEN(Data) , and then create an index on it with Data enabled:

 ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED; CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data); 

Now you can limit the simple problem space by connecting with the maximum difference in length (e.g. 5) if your LEN(Data) data varies significantly:

 SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data) FROM Table A JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5 
+1
source

All Articles