Change distance

I saw a lot of code to solve this question, but I can’t understand why they use a matrix to represent the distance between two words. Can someone explain to me?

Here is an example of the code I found:

public static int minDistance(String word1, String word2) { int l1 = word1.length(), l2 = word2.length(); int[][] d = new int[l1 + 1][l2 + 1]; // the edit distance between an empty string and the prefixes of // word2 for (int i = 0; i < l2 + 1; i++) { d[0][i] = i; } // the edit distance between an empty string and the prefixes of // word1 for (int j = 0; j < l1 + 1; j++) { d[j][0] = j; } for (int i = 1; i < l1 + 1; i++) { for (int j = 1; j < l2 + 1; j++) { if (word1.charAt(i - 1) == word2.charAt(j - 1)) { d[i][j] = d[i - 1][j - 1]; } else { d[i][j] = min(1 + d[i][j - 1], 1 + d[i - 1][j], 1 + d[i - 1][j - 1]); // min of insertion, // deletion, replacement } } } return d[l1][l2]; } 
+4
source share
3 answers

The matrix contains [at the end] the editing distance between all word1 prefixes and all word2 prefixes.

 d[i][j] = edit distance between word1[0..(i-1)] and word2[0..(j-1)] 

You are interested in d[l1][l2] . In general, to calculate d[i][j] , you need to look at the three smaller neighbors d[i-1][j] , d[i-1][j-1] and d[i][j-1] . Thus, transitively, d[i][j] requires all records where at least one coordinate is less than i resp j (and the other is not greater). The exception is that the two characters word1[i-1] and word2[j-1] are equal, in which case you only need a smaller diagonal neighbor.

If you first filled in the -1 matrix to indicate that the corresponding editing distance between the prefixes was not estimated, and recursively compute d[l1][l2] using the cached value of the required d[i][j] , if it has already been calculated, computing it recursively and preserving its value, if not, some areas of the matrix may remain intact. Probably large areas, if there are many pairs of identical characters [only the diagonal will be evaluated if two words are equal], only small areas, if any, if several pairs of equal characters.

In the general case, most of the matrix is ​​necessary for calculating d[l1][l2] , therefore it is faster to calculate the matrix completely using a simple algorithm than to repeat and calculate only the necessary values.

If you do not store the values ​​of shorter prefixes, since they are necessary for calculating d[i][j] , they must be recalculated for each path d[ia][jb] from d[i][j] . Since d[ia][jb] can be achieved in many respects from d[i][j] , which will lead to a recalculation of the lot , leading to a grossly inefficient algorithm as a whole.

Since the calculation for each line uses only the previous line, you can only do with two arrays of length min{l1, l2} + 1 to save some memory, but if the words are not very long, it does not matter much, and the code is simpler with full array.

+4
source

Your code calculates the Levenshtein distance using dynamic programming .

Array d will ultimately contain a solution to various subtasks, where d[i][j] is the distance between the first i letters of the first word and the first j letters of the second. There is a relationship between d[i][j] and the entries d[i-1][j] , d[i][j-1] and d[i-1][j-1] . The algorithm computes the table entries so that the required subtasks have already been calculated (this is part of dynamic programming).

+5
source

The code may be hard to understand, but here are some tips:

  • Change the distance between any pairs of characters in two lines - this is at least the editing distance of all pairs of characters that were matched in front of them. for example, when comparing two lines abc and ade , when you reach c and e in the comparison step, you are sure that the editing distance of the two lines will at least be the minimum editing distance found so far (in this case 1)

  • If the two characters to be compared are not equal, the editing distance is increased by one, which means replacing. Therefore, if you know the editing distance of the string before the comparison, you can add 1 to it, depending on whether the characters are equal or not.

Armed with these two facts,

position [i, j] - editing distance to consider

position [i-1, j] will be the total distance of editing the line until the element of the i-th line does not exist (mean insertion cost)

position [i, j-1] will be the total distance of row editing until the element of the j-th column has been deleted (indicate the cost of removal)

position [ii, j-1] will be the minimum editing distance calculated so far for all previous elements

position [i, j] is calculated by taking at least three possible and adding the current solution.

+3
source

All Articles