I am working on a fuzzy search implementation, and as part of it we are using Apache's StringUtils.getLevenshteinDistance. At the moment, we are working towards a maximum response time for our fuzzy search. After various improvements and some profiling, the place where the most time is spent is the Levenshtein distance calculation. It takes about 80-90% of the total time on search strings of three or more letters.
Now, I know there are limits to what can be done here, but from previous SO questions and the Wikipedia article on Levenshtein distance (LD) I gather that limiting the computation to a set maximum distance (a threshold) can reduce the time spent in the algorithm. I am not sure how to do this, though. The Wikipedia article puts it this way:
If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k + 1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string [3].
Below is the original LD source code from StringUtils, followed by my modification. I am basically trying to compute only the cells within a fixed distance of the i, j diagonal (in my example, two diagonals above and below it). However, what I have done cannot be correct. For example, on the highest diagonal the minimum will always pick the cell directly above, which will be 0. If anyone could show me how to make this work as described, or give some general advice on how to approach it, I would greatly appreciate it.
    public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; // 'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; // placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now
        // actually has the most recent cost counts
        return p[n];
    }
My changes (to the for loops only):
    for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;
        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
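
For illustration, here is a minimal sketch of how the diagonal-stripe idea quoted above might be implemented. It is not the StringUtils code: the class name `BoundedLevenshtein`, the method name `distance`, the threshold parameter `k`, and the convention of returning -1 when the distance exceeds `k` are all assumptions made for this example. It also addresses the stale-cell problem described above: cells just outside the band are set to a large sentinel value instead of being left at 0 or at a value from an earlier row, so the minimum never picks them.

```java
import java.util.Arrays;

final class BoundedLevenshtein {

    /**
     * Returns the Levenshtein distance between s and t if it is at most k,
     * otherwise -1. Only cells within k of the main diagonal are computed.
     * (Hypothetical helper for illustration, not part of StringUtils.)
     */
    static int distance(String s, String t, int k) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        if (k < 0) {
            throw new IllegalArgumentException("Threshold must not be negative");
        }
        int n = s.length();
        int m = t.length();
        if (n > m) { // make s the shorter string so the arrays stay small
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }
        // The distance is at least the length difference, so bail out early.
        if (m - n > k) {
            return -1;
        }
        if (n == 0) {
            return m; // m <= k is guaranteed by the check above
        }

        final int INF = Integer.MAX_VALUE / 2; // sentinel that is safe to add 1 to
        int[] p = new int[n + 1]; // previous row
        int[] d = new int[n + 1]; // current row
        Arrays.fill(p, INF);
        Arrays.fill(d, INF);
        // Row 0: only the first k + 1 cells lie inside the band.
        for (int i = 0; i <= Math.min(n, k); i++) {
            p[i] = i;
        }

        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            int lo = Math.max(1, j - k);          // leftmost cell of the band
            int hi = (k >= n - j) ? n : j + k;    // rightmost cell, overflow-safe

            if (lo == 1) {
                d[0] = j;        // true value of column 0
            } else {
                d[lo - 1] = INF; // overwrite the stale cell left of the band
            }
            if (hi < n) {
                d[hi + 1] = INF; // cell right of the band, read by the next row
            }

            for (int i = lo; i <= hi; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                // minimum of left + 1, top + 1, and diagonal + cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }

            int[] tmp = p; // swap rows
            p = d;
            d = tmp;
        }
        // After the final swap, p holds the last computed row.
        return p[n] <= k ? p[n] : -1;
    }
}
```

Under these assumptions, `BoundedLevenshtein.distance("kitten", "sitting", 3)` returns 3, while the same call with a threshold of 2 returns -1. Each row touches at most 2k + 1 cells, which is where the O(kl) bound in the quote comes from.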