I can suggest an information retrieval (IR) approach, but it needs a large collection of documents to work properly.
Index your data using standard IR analysis methods. Lucene is a good open source library that can help you.
Once you get a name (for example, Obaama): retrieve the set of documents in which the word Obaama appears. Let this set be D1.
Now, for each word w that occurs in D1 (1), search for Obaama AND w (using your IR system). Let the result set be D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and it is likely to be close to 1 for w=Obama (2).
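To make the score concrete, here is a minimal sketch in Python. It uses a toy in-memory inverted index rather than Lucene, and the names `build_index` and `cooccurrence_scores`, the whitespace tokenizer, and the sample documents are all my own assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def cooccurrence_scores(index, docs, name):
    """Return {w: |D2| / |D1|} for every word w that occurs in D1."""
    name = name.lower()
    d1 = index.get(name, set())  # D1: documents containing the name
    if not d1:
        return {}
    # Candidate words are all tokens that appear in some document of D1.
    candidates = {t for doc_id in d1 for t in docs[doc_id].lower().split()}
    candidates.discard(name)
    scores = {}
    for w in candidates:
        d2 = d1 & index[w]  # D2: documents matching "name AND w"
        scores[w] = len(d2) / len(d1)
    return scores

# Hypothetical toy collection; a real one would be far larger.
docs = [
    "obaama obama visited chicago",
    "obaama obama gave a speech",
    "obaama barack obama signed the bill",
    "the weather in chicago",
]
index = build_index(docs)
scores = cooccurrence_scores(index, docs, "Obaama")
```

In this toy data "obama" appears in every document of D1, so it gets the top score, while "chicago" appears in only one of the three. Note that "barack" also gets a non-zero score even though it is not a spelling variant.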
You can manually label a set of examples and use them to find the score threshold above which a word should be accepted as related.
Using a standard lexicographic similarity technique, you can filter out words that are definitely not spelling mistakes (for example, Barack).
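As one possible way to do this filtering, the sketch below uses Levenshtein edit distance. The choice of edit distance, the function names, and the cutoff `max_distance=2` are my assumptions; any other lexicographic similarity measure could be substituted.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        prev = curr
    return prev[-1]

def plausible_misspellings(name, candidates, max_distance=2):
    """Keep only the candidates within max_distance edits of name."""
    name = name.lower()
    return [w for w in candidates if edit_distance(name, w.lower()) <= max_distance]

kept = plausible_misspellings("Obaama", ["obama", "barack", "chicago"])
```

Here "obama" is one deletion away from "obaama" and is kept, while "barack" and "chicago" are too far away and are dropped. The cutoff would need tuning on real data.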
Another commonly used solution requires a query log: find correlations between searched words. If obaama correlates with obama in the query log, they are connected.
(1): You can improve performance by applying the second filter first, and checking only candidates that are “similar enough” lexicographically.
(2): Normalization is usually applied as well, since more frequent words are more likely to appear in the same documents as any given word, regardless of whether they are related.
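The kind of normalization is not specified above. As one assumed option, the sketch below uses the Jaccard coefficient, which divides the overlap by the union of the two document sets so that a word appearing almost everywhere is penalized. The document-id sets are made-up numbers for illustration.

```python
def raw_score(d1, dw):
    """Unnormalized score |D1 ∩ Dw| / |D1|."""
    return len(d1 & dw) / len(d1) if d1 else 0.0

def jaccard_score(d1, dw):
    """Normalized score |D1 ∩ Dw| / |D1 ∪ Dw|."""
    union = d1 | dw
    return len(d1 & dw) / len(union) if union else 0.0

# Hypothetical document-id sets.
d_obaama = {0, 1, 2}          # documents containing "obaama"
d_obama = {0, 1, 2, 3}        # "obama" is rare and overlaps almost fully
d_the = set(range(100))       # "the" appears in nearly every document

raw_the = raw_score(d_obaama, d_the)
norm_the = jaccard_score(d_obaama, d_the)
norm_obama = jaccard_score(d_obaama, d_obama)
```

With these sets the raw score cannot tell "the" apart from "obama" (both score 1.0), whereas the normalized score ranks "obama" well above "the".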