You need to define your own rules around your strings. What defines a "similar string"?
- number of matching characters
- number of non-matching characters
- similar length
- typos or phonetic errors
- business abbreviations
- must start with the same substring
- must end with the same substring
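To make that concrete, here is a rough sketch of how a few of those rules could be combined into one score. Everything in it is an assumption for illustration: the `ABBREVIATIONS` table, the `prefix_len` parameter, and the 0.8/0.2 weights are hypothetical and would need to come from your own domain. It uses `difflib.SequenceMatcher` from the Python standard library for the character-matching part.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would come from your own domain.
ABBREVIATIONS = {"ltd": "limited", "inc": "incorporated", "corp": "corporation"}


def normalize(text):
    """Lowercase, drop periods, and expand known abbreviations word by word."""
    words = text.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def similarity(a, b, prefix_len=3):
    """Return a score in [0, 1]; the weights are illustrative, not tuned."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    # Hard rule: both strings must start with the same substring.
    if a[:prefix_len] != b[:prefix_len]:
        return 0.0
    # Share of characters that fall inside matching blocks.
    char_score = SequenceMatcher(None, a, b).ratio()
    # Ratio of shorter to longer length, so similar lengths score near 1.
    length_score = min(len(a), len(b)) / max(len(a), len(b))
    return 0.8 * char_score + 0.2 * length_score
```

With this sketch, "Acme Ltd." and "ACME Limited" normalize to the same string, while "Acme Ltd" and "Zenith Ltd" are rejected outright by the prefix rule. Whether a prefix rule or an abbreviation table belongs in your scorer at all depends entirely on your data.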
I have done quite a bit of work with string matching algorithms, and I have not yet found an existing library or piece of code that matches my specific requirements. Browse through them and borrow ideas from them, but you will always have to tweak them and write your own code.
Levenshtein's algorithm is good, but a bit slow. I had some success with the Smith-Waterman and Jaro-Winkler algorithms, but the best I found for my purposes was Monge (from memory). However, it pays to read the original research and work out why the authors wrote their algorithms and what data set they were targeting.
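For reference, a minimal sketch of Levenshtein distance using the standard dynamic-programming formulation, kept to two rows of the table. The function name `levenshtein` is my own choice, not taken from any library. The nested loop over both strings is where the cost comes from: the work grows with the product of the two lengths, which adds up quickly when you compare every string against every other.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3. Note that the raw distance is not a similarity score; how you scale it by string length is another domain decision.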
If you have not properly defined what you want to match and measure, you will find high scores for unexpected matches and low scores for expected ones. String matching is domain specific. If you do not define your domain correctly, you are like a fisherman without a clue, throwing hooks around and hoping for the best.
Kirk Broadhurst