Correction of errors in names

I am trying to implement an algorithm that corrects errors in names. My approach keeps a database of correct names, calculates the edit distance between each of them and the name entered, and then suggests the 5 or 10 closest matches.
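A minimal sketch of that baseline, to make the starting point concrete. It assumes an in-memory list stands in for the database; the function names `levenshtein` and `suggest` and the sample names are illustrative only, not taken from any library.

```python
import heapq


def levenshtein(a, b):
    """Classic character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def suggest(query, names, k=5):
    """Return the k known names closest to the query, nearest first."""
    q = query.lower()
    return heapq.nsmallest(k, names, key=lambda n: levenshtein(q, n.lower()))


# Hypothetical stand-in for the database of correct names.
known = ["Jonathan Smith", "John Smyth", "Jane Doe", "Jawaharlal Nehru"]
print(suggest("Jonathon Smith", known, k=2))
```

This compares the query against every stored name, so it is only practical for small name lists.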

This task differs significantly from standard spelling correction for words, because parts of a name can be replaced by initials. For example, Jonathan Smith and J. Smith are quite close and can easily be considered the same name, so the edit distance between them should be very small, if not 0. Another problem is that some names are spelled differently even though they sound the same. For example, Shnaider and Schneider are versions of the same name written by people from different locales (there are better examples of this). Another case: imagine all the possible misspellings of Jawaharlal Nehru, most of which look little like the actual name. Again, most of them are probably phonetically similar.

Obviously, Lucene's spelling-correction algorithm will not help me here, since it does not handle the cases above.

So, my question is: do you know of any library capable of correcting errors in names? Can you suggest an algorithm that handles the cases mentioned above?

I am interested in libraries for C++ or Java. As for algorithm suggestions, any language or pseudocode will do.

2 answers

For phonetic matching, see Soundex.
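To illustrate the idea, here is a minimal sketch of the common American Soundex variant: keep the first letter, encode the remaining consonants as digits, collapse adjacent equal codes, and pad to four characters. The function name `soundex` is my own; in practice an existing library implementation would be a safer choice than this sketch.

```python
import string

# Consonant groups and their Soundex digits.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code, or '' if there are no ASCII letters."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are ignored and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return result.ljust(4, "0")


print(soundex("Shnaider"), soundex("Schneider"))
```

Both Shnaider and Schneider come out as S536 under this scheme, so comparing codes instead of raw spellings treats them as the same name.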

I think that modifying the Levenshtein distance algorithm to treat “abbreviation to initial” and “expansion from initial” as edit operations should be simple, but the details escape me at the moment.
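One possible way to fill in those details, offered as a sketch rather than a definitive answer: run the edit distance over name tokens instead of characters, and make substituting a name part with its initial (or the reverse) nearly free. The cost value `INITIAL_COST` and the helper names below are assumptions chosen for illustration.

```python
INITIAL_COST = 0.1  # assumed small cost for "Jonathan" <-> "J."; tune as needed


def levenshtein(a, b):
    """Character-level edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def tokens(name):
    """Split a name into lowercase parts, dropping periods ('J.Smith' -> j, smith)."""
    parts = name.replace(".", ". ").split()
    return [p.strip(".").lower() for p in parts if p.strip(".")]


def token_cost(x, y):
    """Cost of replacing one name part with another, in the range 0..1."""
    if x == y:
        return 0.0
    short, long_ = sorted((x, y), key=len)
    if len(short) == 1 and long_.startswith(short):
        return INITIAL_COST  # abbreviation to, or expansion from, an initial
    return levenshtein(x, y) / max(len(x), len(y))


def name_distance(a, b):
    """Token-level Levenshtein; inserting or deleting a whole part costs 1."""
    ta, tb = tokens(a), tokens(b)
    prev = [float(j) for j in range(len(tb) + 1)]
    for i, x in enumerate(ta, 1):
        cur = [float(i)]
        for j, y in enumerate(tb, 1):
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0,
                           prev[j - 1] + token_cost(x, y)))
        prev = cur
    return prev[-1]


print(name_distance("Jonathan Smith", "J. Smith"))
print(name_distance("Jonathan Smith", "K. Smith"))
```

With these costs, J. Smith stays very close to Jonathan Smith, while K. Smith is a full unit away. Setting `INITIAL_COST` to 0 would make an initial and the full name count as identical, if that is the behaviour wanted.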


You can also look at Metaphone.
