Most effective edit distance for spelling errors in names?

Edit-distance algorithms define a distance between two strings.

Question: which of these measures is best suited to detecting two different spellings of a person's name that actually refer to the same person (differing only because of a spelling or data-entry error)? The catch is that it should minimize false positives. Example:

Obaama Obama => should probably be merged

Obama IBAMA => should not be merged

This is just a simple example. Have programmers and computer scientists addressed this problem more thoroughly?
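
To make the difficulty concrete, here is a minimal sketch of plain Levenshtein distance in Python. The case-insensitive comparison is an assumption made for illustration. Both example pairs come out at distance 1, so this measure alone cannot separate them.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


# Lowercasing is an assumption; a case-sensitive comparison would score
# "Obama" vs "IBAMA" very differently.
pairs = [("Obaama", "Obama"), ("Obama", "IBAMA")]
distances = {pair: levenshtein(pair[0].lower(), pair[1].lower()) for pair in pairs}
for (x, y), d in distances.items():
    print(x, y, d)
```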

2 answers

I can suggest an information retrieval (IR) approach, but it needs a large collection of documents to work well.

Index your data using standard IR analysis methods. Lucene is a good open source library that can help you.

Once you are given a name (for example, Obaama): retrieve the set of documents in which the word Obaama appears. Call this set D1.
Now, for each word w that occurs in D1 (1), search for Obaama AND w (using your IR system). Call the result set D2.

The ratio |D2|/|D1| estimates how strongly w is associated with Obaama, and it is likely to be close to 1 for w = Obama (2).
You can manually label a set of examples and use them to find the threshold value above which words are treated as matches.
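
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny `docs` corpus is made up, and a plain in-memory inverted index stands in for a real IR system such as Lucene.

```python
from collections import defaultdict

# Hypothetical toy corpus; in practice this would be a large indexed collection.
docs = [
    "obaama obama president speech",
    "obaama obama barack election",
    "obama barack president",
    "ibama brazil environment agency",
    "obaama obama senate",
]

# Inverted index: word -> set of ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for word in text.split():
        index[word].add(doc_id)


def cooccurrence_scores(name):
    """Return {w: |D2|/|D1|} for every word w that appears in D1."""
    d1 = index.get(name, set())  # D1: documents containing the name
    if not d1:
        return {}
    candidates = {w for doc_id in d1 for w in docs[doc_id].split()} - {name}
    # D2 is the result of "name AND w", i.e. the intersection of posting sets.
    return {w: len(d1 & index[w]) / len(d1) for w in candidates}


scores = cooccurrence_scores("obaama")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.2f}")
```

On this toy data `obama` scores 1.0, while `ibama` never shows up as a candidate because it shares no document with `obaama`.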

Using a standard lexicographic similarity measure, you can filter out words that are definitely not misspellings (for example, Barack).

Another commonly used solution requires a query log: look for correlations between the searched words. If obaama correlates with obama in the query log, the two are connected.
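
As a rough sketch of such a filter, the snippet below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for "lexicographic similarity". The 0.8 threshold and the candidate list are assumptions chosen for illustration and would need tuning on labeled examples.

```python
from difflib import SequenceMatcher


def similar_enough(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the two strings look lexicographically close.

    The threshold is a hypothetical value, not one taken from the answer.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Hypothetical candidates that co-occur with the query name.
candidates = ["Obama", "Barack", "president"]
kept = [w for w in candidates if similar_enough("Obaama", w)]
print(kept)
```

Here only `Obama` survives the filter; `Barack` co-occurs often but is spelled too differently to be a misspelling of `Obaama`.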


1: You can improve performance by applying the second filter first, and checking only the candidates that are "similar enough" lexicographically.

2: Normalization is usually applied, since very frequent words are likely to appear in the same documents as any given word, whether or not they are related to it.
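
The answer does not say which normalization to use. One possible choice, shown here purely as an assumption, is to divide the ratio by the background frequency of w (often called lift). All counts below are made up.

```python
def lift(d2_size: int, d1_size: int, df_w: int, n_docs: int) -> float:
    """(|D2|/|D1|) divided by the share of all documents that contain w.

    Values near 1 mean w appears with the name about as often as it
    appears anywhere; much larger values suggest a real association.
    """
    if d1_size == 0 or df_w == 0 or n_docs == 0:
        return 0.0
    return (d2_size / d1_size) / (df_w / n_docs)


# Hypothetical counts for a 1000-document collection where |D1| = 10.
common_word = lift(d2_size=10, d1_size=10, df_w=900, n_docs=1000)  # e.g. "the"
rare_word = lift(d2_size=9, d1_size=10, df_w=50, n_docs=1000)      # e.g. "obama"
print(round(common_word, 2), round(rare_word, 2))
```

With these numbers the frequent word has a raw ratio of 1.0 but a lift of only about 1.1, while the rarer word has a raw ratio of 0.9 and a lift of about 18.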


You can check out NerSim (demo), which also uses SecondString. You can look up their respective papers, or read this one: Robust Similarity Measures for Named Entities Matching.


Source: https://habr.com/ru/post/922664/