Changing the Levenshtein distance algorithm to not calculate all distances

Question

I am working on a fuzzy search implementation, and as part of it we are using Apache's StringUtils.getLevenshteinDistance. At the moment, we are targeting a specific maximum response time for our fuzzy search. After various enhancements and some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more.

Now, I know there are limits to what can be done here, but I have read in previous SO questions and on the Wikipedia page for Levenshtein distance (LD) that if one is willing to limit the threshold to a set maximum distance, that can help curb the time spent in the algorithm. However, I am not sure exactly how to do this. Wikipedia says:

If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k + 1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string. [3]
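The reasoning behind the quoted statement can be spelled out briefly. The notation D[i][j] for the cost table entry (distance between the first i characters of one string and the first j characters of the other) is an assumption introduced here for illustration; it does not appear in the quoted text.

```latex
% Each insertion or deletion changes the length difference by at most one,
% so every table entry is bounded below by its distance from the main diagonal:
D[i][j] \;\ge\; |i - j|

% Hence any cell with |i - j| > k already holds a value greater than k and
% cannot lie on an edit path of total cost at most k. Only the stripe matters:
\{(i, j) : |i - j| \le k\}

% The stripe has at most 2k + 1 cells in each of the l + 1 rows indexed by
% the shorter string, which gives the stated bound:
(2k + 1)(l + 1) \;=\; O(kl)
```

A direct consequence is that if the two lengths differ by more than k, the answer is known to exceed k without filling in any cells.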

Below you will see the original code for LD from StringUtils. After that is my modification. I am basically trying to calculate only the distances within a set distance of the i, j diagonal (so, in my example, two diagonals above and below the i, j diagonal). However, this cannot be correct as I have done it. For example, on the highest diagonal, it is always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this work as I have described, or give some general advice on how to do so, it would be greatly appreciated.

    public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; // 'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; // placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now
        // actually has the most recent cost counts
        return p[n];
    }

My modifications (only to the for loops):

    for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

+7

java performance algorithm levenshtein distance

AHungerArtist Oct 05 '10 at 17:43

6 answers

I have written before about Levenshtein automata, which are one way to do this sort of check in O(n) time. The source code samples there are in Python, but the explanations should be helpful, and the referenced papers provide more detail.

+5

Nick johnson Oct 05 '10 at 18:48

Here someone answers a very similar question:

Quote:
I've done it a number of times. The way I do it is with a recursive depth-first tree walk of the game tree of possible changes. There is a budget k of changes that I use to prune the tree. With that routine in hand, I first run it with k = 0, then k = 1, then k = 2, until I either get a hit or don't want to go any higher.

    char* a = /* string 1 */;
    char* b = /* string 2 */;
    int na = strlen(a);
    int nb = strlen(b);

    bool walk(int ia, int ib, int k){
        /* if the budget is exhausted, prune the search */
        if (k < 0) return false;
        /* if at end of both strings we have a match */
        if (ia == na && ib == nb) return true;
        /* if the first characters match, continue walking with no reduction in budget */
        if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
        /* if the first characters don't match, assume there is a 1-character replacement */
        if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
        /* try assuming there is an extra character in a */
        if (ia < na && walk(ia+1, ib, k-1)) return true;
        /* try assuming there is an extra character in b */
        if (ib < nb && walk(ia, ib+1, k-1)) return true;
        /* if none of those worked, I give up */
        return false;
    }

This is only the main part; there is more code in the original answer.
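For readers following along in Java, here is a minimal sketch of the same budgeted walk together with the driver loop the quote describes (k = 0, then 1, then 2, and so on). The names `BudgetWalk`, `distanceUpTo` and `maxK` are hypothetical and not taken from the quoted answer, and the match and replacement cases are merged into one call.

```java
final class BudgetWalk {
    private BudgetWalk() {}

    /** True if a[ia..] can be turned into b[ib..] using at most k edits. */
    static boolean walk(String a, String b, int ia, int ib, int k) {
        // budget exhausted: prune this branch
        if (k < 0) return false;
        // both strings consumed: match found within budget
        if (ia == a.length() && ib == b.length()) return true;
        if (ia < a.length() && ib < b.length()) {
            // equal characters cost nothing, a replacement costs one edit
            int cost = a.charAt(ia) == b.charAt(ib) ? 0 : 1;
            if (walk(a, b, ia + 1, ib + 1, k - cost)) return true;
        }
        // assume an extra character in a
        if (ia < a.length() && walk(a, b, ia + 1, ib, k - 1)) return true;
        // assume an extra character in b
        if (ib < b.length() && walk(a, b, ia, ib + 1, k - 1)) return true;
        return false;
    }

    /** Smallest budget in [0, maxK] for which the walk succeeds, or -1 if none does. */
    static int distanceUpTo(String a, String b, int maxK) {
        for (int k = 0; k <= maxK; k++) {
            if (walk(a, b, 0, 0, k)) return k;
        }
        return -1;
    }
}
```

Because the first budget that succeeds is returned, the result is the edit distance whenever it is at most `maxK`. The walk is exponential in the budget, so this only makes sense for small thresholds.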

+1

Dr. belisarius Oct 05 '10 at 19:30

According to Gusfield, Dan (1997), "Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology" (p. 264), you should ignore the zeros.

+1

Collapse Dec 6 '10 at 14:48

I used the original code and placed this just before the end of the j for loop:

  if (p[n] > s.length() + 5) break;

The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle on), it doesn't really matter what is returned, because we consider the match simply too different. It's a little confusing. However, I'm pretty sure this is not the idea that the Wikipedia statement was referring to, if anyone understands that better.
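A related cutoff that is safe for any threshold is to stop as soon as every value in the current row exceeds it: values never decrease along an edit path, so the final distance cannot be smaller than the smallest value in any row. The sketch below shows that variant. It is a different technique from the diagonal stripe in the Wikipedia statement, and the names `EarlyExitLevenshtein`, `distance` and `threshold` are hypothetical.

```java
final class EarlyExitLevenshtein {
    private EarlyExitLevenshtein() {}

    /** Returns the distance if it is at most threshold, otherwise -1. */
    static int distance(String s, String t, int threshold) {
        int n = s.length();
        int m = t.length();
        int[] p = new int[n + 1]; // previous row
        int[] d = new int[n + 1]; // current row
        for (int i = 0; i <= n; i++) {
            p[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            d[0] = j;
            int rowMin = d[0];
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                rowMin = Math.min(rowMin, d[i]);
            }
            // every path to the final cell passes through this row,
            // so the result is at least rowMin
            if (rowMin > threshold) {
                return -1;
            }
            int[] tmp = p;
            p = d;
            d = tmp;
        }
        return p[n] <= threshold ? p[n] : -1;
    }
}
```

This still computes full rows, so the worst case stays O(nm); it only helps when clearly mismatched candidates can be rejected after a few rows.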

0

AHungerArtist Oct 05 '10 at 21:29

Apache Commons Lang 3.4 has this implementation:

    /**
     * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
     * threshold.</p>
     *
     * <p>This is the number of changes needed to change one String into
     * another, where each change is a single character modification (deletion,
     * insertion or substitution).</p>
     *
     * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
     * and Chas Emerick's implementation of the Levenshtein distance algorithm from
     * <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
     *
     * <pre>
     * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
     * StringUtils.getLevenshteinDistance("","", 0)               = 0
     * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
     * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
     * StringUtils.getLevenshteinDistance("aaapppp", "", 6))      = -1
     * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
     * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
     * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
     * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
     * </pre>
     *
     * @param s  the first String, must not be null
     * @param t  the second String, must not be null
     * @param threshold  the target threshold, must not be negative
     * @return result distance, or {@code -1} if the distance would be greater than the threshold
     * @throws IllegalArgumentException if either String input {@code null} or negative threshold
     */
    public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        if (threshold < 0) {
            throw new IllegalArgumentException("Threshold must not be negative");
        }

        /*
        This implementation only computes the distance if it's less than or equal to the
        threshold value, returning -1 if it's greater. The advantage is performance: unbounded
        distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
        computing a diagonal stripe of width 2k + 1 of the cost table.
        It is also possible to use this to compute the unbounded Levenshtein distance by starting
        the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
        d is the distance.

        One subtlety comes from needing to ignore entries on the border of our stripe
        eg.
        p[] = |#|#|#|*
        d[] =  *|#|#|#|
        We must ignore the entry to the left of the leftmost member
        We must ignore the entry above the rightmost member

        Another subtlety comes from our stripe running off the matrix if the strings aren't
        of the same size. Since string s is always swapped to be the shorter of the two,
        the stripe will always run off to the upper right instead of the lower left of the matrix.

        As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
        In this case we're going to walk a stripe of length 3. The matrix would look like so:

           1 2 3 4 5
        1 |#|#| | | |
        2 |#|#|#| | |
        3 | |#|#|#| |
        4 | | |#|#|#|
        5 | | | |#|#|
        6 | | | | |#|
        7 | | | | | |

        Note how the stripe leads off the table as there is no possible way to turn a string of length 5
        into one of length 7 in edit distance of 1.

        Additionally, this implementation decreases memory usage by using two
        single-dimensional arrays and swapping them back and forth instead of allocating
        an entire n by m matrix. This requires a few minor changes, such as immediately returning
        when it's detected that the stripe has run off the matrix and initially filling the arrays with
        large values so that entries we don't compute are ignored.

        See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
         */

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        // if one string is empty, the edit distance is necessarily the length of the other
        if (n == 0) {
            return m <= threshold ? m : -1;
        } else if (m == 0) {
            return n <= threshold ? n : -1;
        }

        if (n > m) {
            // swap the two strings to consume less memory
            final CharSequence tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n + 1]; // 'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally
        int _d[]; // placeholder to assist in swapping p and d

        // fill in starting table values
        final int boundary = Math.min(n, threshold) + 1;
        for (int i = 0; i < boundary; i++) {
            p[i] = i;
        }
        // these fills ensure that the value above the rightmost entry of our
        // stripe will be ignored in following loop iterations
        Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
        Arrays.fill(d, Integer.MAX_VALUE);

        // iterates through t
        for (int j = 1; j <= m; j++) {
            final char t_j = t.charAt(j - 1); // jth character of t
            d[0] = j;

            // compute stripe indices, constrain to array size
            final int min = Math.max(1, j - threshold);
            final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);

            // the stripe may lead off of the table if s and t are of different sizes
            if (min > max) {
                return -1;
            }

            // ignore entry left of leftmost
            if (min > 1) {
                d[min - 1] = Integer.MAX_VALUE;
            }

            // iterates through [min, max] in s
            for (int i = min; i <= max; i++) {
                if (s.charAt(i - 1) == t_j) {
                    // diagonally left and up
                    d[i] = p[i - 1];
                } else {
                    // 1 + minimum of cell to the left, to the top, diagonally left and up
                    d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
                }
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // if p[n] is greater than the threshold, there's no guarantee on it being the correct
        // distance
        if (p[n] <= threshold) {
            return p[n];
        }
        return -1;
    }
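The comment in that implementation mentions recovering the unbounded distance by starting the threshold at 1 and doubling it until a result is found. A minimal sketch of that driver follows. It is not part of Commons Lang: `DoublingLevenshtein`, `bounded` and `unbounded` are hypothetical names, and `bounded` is a compact stand-in for a thresholded distance function with the same contract (the distance if it is at most k, otherwise -1).

```java
import java.util.Arrays;

final class DoublingLevenshtein {
    private DoublingLevenshtein() {}

    /** Banded distance: the distance if it is <= k, otherwise -1. */
    static int bounded(CharSequence s, CharSequence t, int k) {
        if (s.length() > t.length()) { // keep s as the shorter string
            CharSequence tmp = s;
            s = t;
            t = tmp;
        }
        int n = s.length();
        int m = t.length();
        if (m - n > k) {
            return -1; // the stripe would run off the table
        }
        final int inf = Integer.MAX_VALUE / 2; // large, but safe to add 1 to
        int[] p = new int[n + 1];
        int[] d = new int[n + 1];
        Arrays.fill(p, inf);
        Arrays.fill(d, inf);
        for (int i = 0; i <= Math.min(n, k); i++) {
            p[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            int lo = Math.max(1, j - k);
            int hi = Math.min(n, j + k);
            // column 0 holds its exact value j; otherwise mask the stale cell left of the stripe
            d[lo - 1] = (lo == 1) ? j : inf;
            for (int i = lo; i <= hi; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i] = Math.min(p[i - 1] + cost, Math.min(d[i - 1], p[i]) + 1);
            }
            int[] tmp = p;
            p = d;
            d = tmp;
        }
        return p[n] <= k ? p[n] : -1;
    }

    /** Unbounded distance via thresholds 1, 2, 4, ... capped at the longer length. */
    static int unbounded(CharSequence s, CharSequence t) {
        int limit = Math.max(s.length(), t.length()); // the distance never exceeds this
        int k = 1;
        while (true) {
            int dist = bounded(s, t, k);
            if (dist >= 0) {
                return dist;
            }
            k = Math.min(limit, 2 * k); // a failure implies k < limit, so k grows
        }
    }
}
```

The cap on `k` guarantees termination, since a threshold equal to the longer length always succeeds.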

0

Richard EB Jan 28 '16 at 19:10

elindsey · Accepted Answer · 2011-02-28T04:14:46+0000

The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.

One way is to start the values you initially fill in at 1 instead of 0, and then just ignore any 0s that you encounter. You will have to subtract 1 from your final answer.
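A minimal sketch of how this first approach might look, under the assumption that every stored value is the real distance plus one and that a 0 (the default for a fresh Java int array) means "not computed". The names `OffsetBandLevenshtein`, `distance` and `k` are hypothetical and not from this answer.

```java
final class OffsetBandLevenshtein {
    private OffsetBandLevenshtein() {}

    /** Returns the distance if it is at most k, otherwise -1. */
    static int distance(String s, String t, int k) {
        if (s.length() > t.length()) { // keep s as the shorter string
            String tmp = s;
            s = t;
            t = tmp;
        }
        int n = s.length();
        int m = t.length();
        if (m - n > k) {
            return -1; // the window would run off the table
        }
        int[] p = new int[n + 1]; // previous row, stored as distance + 1
        int[] d = new int[n + 1]; // current row, stored as distance + 1
        for (int i = 0; i <= Math.min(n, k); i++) {
            p[i] = i + 1;
        }
        for (int j = 1; j <= m; j++) {
            int lo = Math.max(1, j - k);
            int hi = Math.min(n, j + k);
            // column 0 holds its exact value; otherwise wipe the stale cell left of the window
            d[lo - 1] = (lo == 1) ? j + 1 : 0;
            for (int i = lo; i <= hi; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int best = p[i - 1] + cost; // the diagonal is always inside the window
                if (d[i - 1] != 0) {
                    best = Math.min(best, d[i - 1] + 1); // left, ignored when 0
                }
                if (p[i] != 0) {
                    best = Math.min(best, p[i] + 1); // above, ignored when 0
                }
                d[i] = best;
            }
            int[] tmp = p;
            p = d;
            d = tmp;
        }
        int result = p[n] - 1; // undo the +1 offset
        return result <= k ? result : -1;
    }
}
```

The cell above the last entry never needs wiping here, because the window only moves to the right and those cells have never been written.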

Another way is to fill the entries to the left of the first and above the last with high values, so that the minimum check will never pick them. That is the way I chose when I had to implement it the other day:

    import java.util.Arrays;

    public static int levenshtein(String s, String t, int threshold) {
        int slen = s.length();
        int tlen = t.length();

        // swap so the smaller string is t; this reduces the memory usage
        // of our buffers
        if(tlen > slen) {
            String stmp = s;
            s = t;
            t = stmp;
            int itmp = slen;
            slen = tlen;
            tlen = itmp;
        }

        // if the lengths differ by more than the threshold, so does the distance;
        // returning here also keeps the window from running past the end of the arrays
        if(slen - tlen > threshold)
            return -1;

        // p is the previous and d is the current distance array; dtmp is used in swaps
        int[] p = new int[tlen + 1];
        int[] d = new int[tlen + 1];
        int[] dtmp;

        // the values necessary for our threshold are written; the ones after
        // must be filled with large integers since the trailing member of the threshold
        // window in the bottom array will run min across them
        int n = 0;
        for(; n < Math.min(p.length, threshold + 1); ++n)
            p[n] = n;
        Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
        Arrays.fill(d, Integer.MAX_VALUE);

        // this is the core of the Levenshtein edit distance algorithm
        // instead of actually building the matrix, two arrays are swapped back and forth
        // the threshold limits the amount of entries that need to be computed if we're
        // looking for a match within a set distance
        for(int row = 1; row < s.length()+1; ++row) {
            char schar = s.charAt(row-1);
            d[0] = row;

            // set up our threshold window
            int min = Math.max(1, row - threshold);
            int max = Math.min(d.length, row + threshold + 1);

            // since we're reusing arrays, we need to be sure to wipe the value left of the
            // starting index; we don't have to worry about the value above the ending index
            // as the arrays were initially filled with large integers and we progress to the right
            if(min > 1)
                d[min-1] = Integer.MAX_VALUE;

            for(int col = min; col < max; ++col) {
                if(schar == t.charAt(col-1))
                    d[col] = p[col-1];
                else
                    // min of: diagonal, left, up
                    d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
            }
            // swap our arrays
            dtmp = p;
            p = d;
            d = dtmp;
        }

        if(p[tlen] == Integer.MAX_VALUE)
            return -1;
        return p[tlen];
    }
