1) A few words about improving the Levenshtein distance algorithm
The recursive implementation of the Levenshtein distance has exponential time complexity.
I suggest you memoize the subproblem results and compute the distance iteratively (without recursion), which reduces the complexity to O(N*M) time and O(N*M) memory, where N and M are the lengths of the two strings:
```java
public static int levenshteinDistance(String s1, String s2) {
    return dist(s1.toCharArray(), s2.toCharArray());
}

public static int dist(char[] s1, char[] s2) {
    // distance matrix - memoizes the distances between prefixes,
    // which is what lets us avoid recursion;
    // d[i][j] holds the distance between the prefixes
    // s1.substring(0, i) and s2.substring(0, j)
    int[][] d = new int[s1.length + 1][s2.length + 1];
    for (int i = 0; i < s1.length + 1; i++) {
        d[i][0] = i;
    }
    for (int j = 0; j < s2.length + 1; j++) {
        d[0][j] = j;
    }
    for (int i = 1; i < s1.length + 1; i++) {
        for (int j = 1; j < s2.length + 1; j++) {
            int d1 = d[i - 1][j] + 1;     // deletion
            int d2 = d[i][j - 1] + 1;     // insertion
            int d3 = d[i - 1][j - 1];     // substitution (or match)
            if (s1[i - 1] != s2[j - 1]) {
                d3 += 1;
            }
            d[i][j] = Math.min(Math.min(d1, d2), d3);
        }
    }
    return d[s1.length][s2.length];
}
```
Or, even better: notice that each cell of the distance matrix only depends on the previous row, so you can reduce the memory to O(N):
```java
public static int dist(char[] s1, char[] s2) {
    // memoize only the previous row of the distance matrix
    int[] prev = new int[s2.length + 1];
    for (int j = 0; j < s2.length + 1; j++) {
        prev[j] = j;
    }
    for (int i = 1; i < s1.length + 1; i++) {
        // calculate the current row of the distance matrix
        int[] curr = new int[s2.length + 1];
        curr[0] = i;
        for (int j = 1; j < s2.length + 1; j++) {
            int d1 = prev[j] + 1;
            int d2 = curr[j - 1] + 1;
            int d3 = prev[j - 1];
            if (s1[i - 1] != s2[j - 1]) {
                d3 += 1;
            }
            curr[j] = Math.min(Math.min(d1, d2), d3);
        }
        // the current row becomes the previous one
        prev = curr;
    }
    return prev[s2.length];
}
```
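As a quick sanity check, the O(N)-memory version can be exercised like this (a self-contained copy of the same routine; the demo class name is my own):

```java
public class LevenshteinDemo {
    public static int dist(char[] s1, char[] s2) {
        // memoize only the previous row of the distance matrix
        int[] prev = new int[s2.length + 1];
        for (int j = 0; j <= s2.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s1.length; i++) {
            int[] curr = new int[s2.length + 1];
            curr[0] = i;
            for (int j = 1; j <= s2.length; j++) {
                int d1 = prev[j] + 1;                                    // deletion
                int d2 = curr[j - 1] + 1;                                // insertion
                int d3 = prev[j - 1] + (s1[i - 1] != s2[j - 1] ? 1 : 0); // substitution or match
                curr[j] = Math.min(Math.min(d1, d2), d3);
            }
            prev = curr;
        }
        return prev[s2.length];
    }

    public static void main(String[] args) {
        System.out.println(dist("kitten".toCharArray(), "sitting".toCharArray()));      // 3
        System.out.println(dist("apple".toCharArray(), "green apples".toCharArray()));  // 7
    }
}
```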
2) A few words about autocomplete
Levenshtein distance on its own is useful only when you need near-exact matches against the whole query.
But what if your keyword is apple and the user types green apples? The Levenshtein distance between the query and the keyword will be large (7) - the same as the Levenshtein distance between apple and bcdfghk (a meaningless string)!
I suggest you use a full-text search engine (e.g. Lucene). The trick is to represent each keyword with an n-gram model.
In a few words:
1) represent each keyword as a document containing its n-grams (bigrams here): apple -> [ap, pp, pl, le].
2) after converting each keyword into a set of n-grams, index each keyword document by its n-grams in the search engine. You end up with an inverted index like this:
```
...
ap -> apple, map, happy
...
pp -> apple
...
pl -> apple, place
...
```
3) now you have an n-gram index. When a query arrives, break it into n-grams as well, so the query becomes a set of n-gram terms. Then let the search engine retrieve the documents (keywords) that share the most n-grams with the query. A rough, approximate match is enough at this stage.
4) to produce the best suggestions, re-rank the search engine's results by Levenshtein distance.
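To make steps 1-4 concrete without pulling in Lucene, here is a minimal in-memory sketch of the whole pipeline (all class and method names are my own; in a real setup the index and the candidate retrieval would be handled by the search engine):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NgramSuggester {
    // step 1: split a keyword into its character n-grams
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // step 2: inverted index, n-gram -> keywords containing it
    static Map<String, Set<String>> buildIndex(List<String> keywords, int n) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String word : keywords) {
            for (String gram : ngrams(word, n)) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(word);
            }
        }
        return index;
    }

    // the O(N)-memory Levenshtein distance from part 1, on Strings
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) != b.charAt(j - 1) ? 1 : 0);
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), sub);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // steps 3 and 4: collect keywords sharing an n-gram with the query,
    // then re-rank the candidates by Levenshtein distance
    static List<String> suggest(String query, Map<String, Set<String>> index, int n, int topK) {
        Set<String> candidates = new HashSet<>();
        for (String gram : ngrams(query, n)) {
            candidates.addAll(index.getOrDefault(gram, Set.of()));
        }
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt(kw -> levenshtein(query, kw)));
        return ranked.subList(0, Math.min(topK, ranked.size()));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index =
                buildIndex(List.of("apple", "apply", "map", "happy", "place"), 2);
        System.out.println(suggest("aple", index, 2, 2)); // [apple, apply]
    }
}
```

Note that the typo query aple still finds apple first, because they share the bigrams ap, pl and le and have a Levenshtein distance of only 1.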
PS I suggest you have a look at the book Introduction to Information Retrieval.