I have "records" (mostly CSV strings) of two names and one address. I need to find entries that are similar to each other: basically names and addresses all look βthe sameβ as if they were interpreted by a person.
I used the ideas from this excellent blog post: http://knol.google.com/k/simple-simhashing# to write a simple SimHash. If the SimHash results for two or more records are identical, I pass all records of that subset to a fine-grained matching program, which is O(n^2) and compares each record of the subset against every other one.
For the SimHash part, I have parameters for the shingle size (basically a sliding window of width n over the string) and for the number of (random) hashes used to compute the SimHash. So far, a shingle size of 4 combined with 4 hashes gives the best results; I have tried various other combinations.
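For concreteness, here is a stripped-down sketch of the SimHash step as I understand it from the post: shingle the normalized string, hash every shingle with each of the k seeded hash functions, and keep the minimum per seed as the signature. The seeded-md5 construction is just a convenient stand-in for k independent random hashes, and the names are mine, not from the post:

```python
import hashlib

SHINGLE_SIZE = 4  # the sliding-window width over the string
NUM_HASHES = 4    # number of seeded hash functions in the signature

def shingles(text, size=SHINGLE_SIZE):
    """All overlapping character n-grams of the given size."""
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}

def seeded_hash(gram, seed):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    digest = hashlib.md5(f"{seed}:{gram}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def simhash(text):
    """Signature = the minimum hash over all shingles, once per seed."""
    grams = shingles(text)
    return tuple(min(seeded_hash(g, s) for g in grams)
                 for s in range(NUM_HASHES))
```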
The problem I am facing is that this method finds only about 80% of the duplicates in my datasets. I know this because I checked the entire dataset with the painfully slow full O(n^2) matcher mentioned above. That O(n^2) matcher is fine for datasets smaller than 10^4 records, but quickly becomes infeasible, since I need to handle sets of size 10^8.
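For reference, the bucketing step looks roughly like this (simhash() as sketched above; fine_grained_match is a placeholder for my O(n^2) matcher, and records are assumed to be normalized already):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, fine_grained_match):
    """Group records by SimHash signature, then run the expensive
    pairwise matcher only within each group."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[simhash(rec)].append(rec)  # simhash() from the sketch above
    for bucket in buckets.values():
        # quadratic only in the bucket size, not in the whole dataset
        for a, b in combinations(bucket, 2):
            if fine_grained_match(a, b):
                yield a, b
```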
Any ideas, suggestions or thoughts on how I can improve the accuracy of SimHash, so that more similar records are assigned the same SimHash value?
EDIT: Before SimHashing, I uppercase the records and strip all characters not in [0-9A-Z] (a sketch of this normalization follows the examples). Examples of things that need to match (spelling errors intentional):
- JOHN SMITH, 123 ANY STREET SOMETOWN ZIP
- JOHNNY SMITH, 123 ANY STRET SOMETOWNE ZIP
- ROBERT PARKER, 442 ANY STREET SOMETOWN ZIP
Here records 1 and 2 are similar; 3 is not. The output should be: 1 + 2.
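The normalization step mentioned in the EDIT, assuming characters outside [0-9A-Z] (including spaces) are dropped outright:

```python
import re

def normalize(record):
    """Uppercase, then keep only [0-9A-Z]."""
    return re.sub(r"[^0-9A-Z]", "", record.upper())

print(normalize("JOHNNY SMITH, 123 ANY STRET SOMETOWNE ZIP"))
# -> JOHNNYSMITH123ANYSTRETSOMETOWNEZIP
```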