Fuzzy match numbers

Question

Fuzzy match numbers

I worked with Double Metaphone and Caverphone2 to compare strings, and they work well on things like names, addresses, etc. (Caverphone2 works best for me). However, they cause too many false positives when you get numerical values such as phone numbers, IP addresses, credit card numbers, etc.

So, I looked at Luhn and Verhoeff , and they mostly describe what I want, but not quite. They seem to be good at checking, but don't seem to be made for fuzzy matching. Is there anything that behaves like Luhn and Verhoeff that could detect single-bit errors and transpose errors with two adjacent digits for encoding and comparison purposes, similar to fuzzy-string algorithms?

I would like to code a number and then compare it with 100,000 other numbers to find close matches. So, something like 7041234 will match 7041324 as a possible transcription error, but something like 4213704 will not.

+7

algorithm fuzzy-comparison

Jeffg Dec 28 '11 at 15:31

source share

1 answer

Thomas ahle · Answer 1 · 2011-12-31T00:09:31+0000

Levenshtein and friends can be helpful in finding the distance between specific lines or numbers. However, if you want to create a spelling corrector, you do not want to run your entire word database with every query.

Peter Norwig wrote a very enjoyable article on a simple spelling correction for “fuzzy matching,” based on some technologies related to Google’s spelling suggestions.

If your dictionary has N entries and the middle word is L , the Brute force Levenshtein approach will take O(N*L^3) time O(N*L^3) . Instead of Peter Norwig's approach, all words are generated at a certain editing distance from the input and are viewed in the dictionary. Therefore, it reaches O(L^k) , where k is the farthest editing distance.

Fuzzy match numbers

More articles: