1) A few words about improving the Levenshtein distance algorithm
The recursive implementation of the Levenshtein distance has exponential time complexity.
I suggest you memoize the subproblem results and compute the distance iteratively (without recursion), which reduces the complexity to O(N*M) time and O(N*M) memory, where N and M are the lengths of the two strings:
```java
public static int levenshteinDistance(String s1, String s2) {
    return dist(s1.toCharArray(), s2.toCharArray());
}

public static int dist(char[] s1, char[] s2) {
    // distance matrix - memoizes the distances between prefixes,
    // which is what lets us avoid recursion;
    // d[i][j] holds the distance between the prefixes
    // s1.substring(0, i) and s2.substring(0, j)
    int[][] d = new int[s1.length + 1][s2.length + 1];
    for (int i = 0; i < s1.length + 1; i++) {
        d[i][0] = i;
    }
    for (int j = 0; j < s2.length + 1; j++) {
        d[0][j] = j;
    }
    for (int i = 1; i < s1.length + 1; i++) {
        for (int j = 1; j < s2.length + 1; j++) {
            int d1 = d[i - 1][j] + 1;     // deletion
            int d2 = d[i][j - 1] + 1;     // insertion
            int d3 = d[i - 1][j - 1];     // substitution (or match)
            if (s1[i - 1] != s2[j - 1]) {
                d3 += 1;
            }
            d[i][j] = Math.min(Math.min(d1, d2), d3);
        }
    }
    return d[s1.length][s2.length];
}
```
Or, even better: notice that each cell of the distance matrix only depends on the previous row, so you can reduce the memory to O(N):
```java
public static int dist(char[] s1, char[] s2) {
    // memoize only the previous row of the distance matrix
    int[] prev = new int[s2.length + 1];
    for (int j = 0; j < s2.length + 1; j++) {
        prev[j] = j;
    }
    for (int i = 1; i < s1.length + 1; i++) {
        // calculate the current row of the distance matrix
        int[] curr = new int[s2.length + 1];
        curr[0] = i;
        for (int j = 1; j < s2.length + 1; j++) {
            int d1 = prev[j] + 1;
            int d2 = curr[j - 1] + 1;
            int d3 = prev[j - 1];
            if (s1[i - 1] != s2[j - 1]) {
                d3 += 1;
            }
            curr[j] = Math.min(Math.min(d1, d2), d3);
        }
        // the current row becomes the previous one
        prev = curr;
    }
    return prev[s2.length];
}
```
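As a quick sanity check, the O(N)-memory version can be exercised like this (a self-contained copy of the same routine; the demo class name is my own):

```java
public class LevenshteinDemo {
    public static int dist(char[] s1, char[] s2) {
        // memoize only the previous row of the distance matrix
        int[] prev = new int[s2.length + 1];
        for (int j = 0; j <= s2.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s1.length; i++) {
            int[] curr = new int[s2.length + 1];
            curr[0] = i;
            for (int j = 1; j <= s2.length; j++) {
                int d1 = prev[j] + 1;                                    // deletion
                int d2 = curr[j - 1] + 1;                                // insertion
                int d3 = prev[j - 1] + (s1[i - 1] != s2[j - 1] ? 1 : 0); // substitution or match
                curr[j] = Math.min(Math.min(d1, d2), d3);
            }
            prev = curr;
        }
        return prev[s2.length];
    }

    public static void main(String[] args) {
        System.out.println(dist("kitten".toCharArray(), "sitting".toCharArray()));      // 3
        System.out.println(dist("apple".toCharArray(), "green apples".toCharArray()));  // 7
    }
}
```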
2) A few words about autocomplete
Levenshtein distance on its own is useful only when you need near-exact matches against the whole query.
But what if your keyword is apple and the user types green apples? The Levenshtein distance between the query and the keyword will be large (7) - the same as the Levenshtein distance between apple and bcdfghk (a meaningless string)!
I suggest you use a full-text search engine (e.g. Lucene). The trick is to represent each keyword with an n-gram model.
In a few words:
1) represent each keyword as a document containing its n-grams (bigrams here): apple -> [ap, pp, pl, le].
2) after converting each keyword into a set of n-grams, index each keyword document by its n-grams in the search engine. You end up with an inverted index like this:
```
...
ap -> apple, map, happy
...
pp -> apple
...
pl -> apple, place
...
```
3) now you have an n-gram index. When a query arrives, break it into n-grams as well, so the query becomes a set of n-gram terms. Then let the search engine retrieve the documents (keywords) that share the most n-grams with the query. A rough, approximate match is enough at this stage.
4) to produce the best suggestions, re-rank the search engine's results by Levenshtein distance.
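To make steps 1-4 concrete without pulling in Lucene, here is a minimal in-memory sketch of the whole pipeline (all class and method names are my own; in a real setup the index and the candidate retrieval would be handled by the search engine):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NgramSuggester {
    // step 1: split a keyword into its character n-grams
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // step 2: inverted index, n-gram -> keywords containing it
    static Map<String, Set<String>> buildIndex(List<String> keywords, int n) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String word : keywords) {
            for (String gram : ngrams(word, n)) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(word);
            }
        }
        return index;
    }

    // the O(N)-memory Levenshtein distance from part 1, on Strings
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) != b.charAt(j - 1) ? 1 : 0);
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), sub);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // steps 3 and 4: collect keywords sharing an n-gram with the query,
    // then re-rank the candidates by Levenshtein distance
    static List<String> suggest(String query, Map<String, Set<String>> index, int n, int topK) {
        Set<String> candidates = new HashSet<>();
        for (String gram : ngrams(query, n)) {
            candidates.addAll(index.getOrDefault(gram, Set.of()));
        }
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt(kw -> levenshtein(query, kw)));
        return ranked.subList(0, Math.min(topK, ranked.size()));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index =
                buildIndex(List.of("apple", "apply", "map", "happy", "place"), 2);
        System.out.println(suggest("aple", index, 2, 2)); // [apple, apply]
    }
}
```

Note that the typo query aple still finds apple first, because they share the bigrams ap, pl and le and have a Levenshtein distance of only 1.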
PS I suggest you have a look at the book Introduction to Information Retrieval.