Find the nearest match for bell names?

I have a list of cities with many misspellings for the same city. One city with an error 18 times! I am trying to clear this, but it will take several hours. Is there any algorithm that could “guess” by the correct city name for each of these spelling errors? Some form of weighing? The data is in MySQL, and I have a table with the correct spelling, and also for comparison.

Any ideas on this? An example PHP will help if possible.

+4
source share
3 answers

you can use the damerau-levenstein function to get the distance between two lines. (This also checks for transposition)

http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

If your tables are large, you may need to optimize the algorithm a bit to break it as soon as the distance to the row exceeds the thrashold.

Also, if you can assume that the first letter of the city is entered correctly, which should reduce the number of comparisons.

Its not PHP, but if any help welcomes the java version, I wrote:

class LevinshteinDistance{ public static void main(String args[]){ if(args.length != 2){ System.out.println("Displays the Levenshtein distance between 2 strings"); System.out.println("Usage: LevenshteinDistance stringA stringB"); }else{ int distance = getLevenshteinDistance(args[0], args[1]); System.out.print(getLevenshteinMatrix(args[0], args[1])); System.out.println("Distance: "+distance); } } /** * @param a first string for comparison * @param b second string for comparison * @param caseSensitive whether or not to use case sensitive matching * @return a levenshtein string distance */ public static int getLevenshteinDistance(String a, String b, boolean caseSensitive){ if(! caseSensitive){ a = a.toUpperCase(); b = b.toUpperCase(); } int[][] matrix = generateLevenshteinMatrix(a, b); return matrix[a.length()][b.length()]; } /** * @param a first string for comparison * @param b second string for comparison * @return a case sensitive levenshtein string distance */ public static int getLevenshteinDistance(String a, String b){ int[][] matrix = generateLevenshteinMatrix(a, b); return matrix[a.length()][b.length()]; } /** * @param a first string for comparison * @param b second string for comparison * @return a case sensitive string representation of the search matrix */ public static String getLevenshteinMatrix(String a, String b){ int[][] matrix = generateLevenshteinMatrix(a, b); StringBuilder result = new StringBuilder(); final int ROWS = a.length()+1; final int COLS = b.length()+1; result.append(rowSeperator(COLS-1, false)); result.append("| "+b+" |\n"); result.append(rowSeperator(COLS-1, true)); for(int r=0; r<ROWS; r++){ result.append('|'); if(r > 0){ result.append(a.charAt(r-1)); }else{ result.append(' '); } result.append(" |"); for(int c=0; c<COLS; c++){ result.append(matrix[r][c]); } result.append(" |\n"); } result.append(rowSeperator(COLS-1, false)); return result.toString(); } private static String rowSeperator(final int LEN, boolean hasGap){ StringBuilder result = new StringBuilder(); if(hasGap){ result.append("+ +-"); }else{ result.append("+----"); } for(int i=0; i<LEN; i++) result.append('-'); result.append("-+\n"); return result.toString(); } private static int[][] generateLevenshteinMatrix(String a, String b){ final int ROWS = a.length()+1; final int COLS = b.length()+1; int matrix[][] = new int[ROWS][COLS]; for(int r=0; r<ROWS; r++){ matrix[r][0]=r; } for(int c=0; c<COLS; c++){ matrix[0][c]=c; } for(int r=1; r<ROWS; r++){ char cA = a.charAt(r-1); for(int c=1; c<COLS; c++){ char cB = b.charAt(c-1); int cost = (cA == cB)?0:1; int deletion = matrix[r-1][c]+1; int insertion = matrix[r][c-1]+1; int substitution = matrix[r-1][c-1]+cost; int minimum = Math.min(Math.min(deletion, insertion), substitution); if( (r > 1 && c > 1) && a.charAt(r-2) == cB && cA == b.charAt(c-2) ){ int transposition = matrix[r-2][c-2]+cost; minimum = Math.min(minimum, transposition); } matrix[r][c] = minimum; } } return matrix; } } 
+3
source
+1
source

As the infamous " Bariteney Spears " offers, and, as most of us know from experience, Google, through this scan, got pretty good at correctly checking the spelling of popular names. A city is what it usually corrects. That way, you can try writing a PHP function that parses the Googlse search page to see what kind of correction Google offers, or, which is more complicated (since the page is more complex), you can even try to parse the parameters that Google Maps provides you with.

0
source

All Articles