Make the Sim Hash (Locality Sensitive Hashing) algorithm more accurate?

I have "records" (mostly CSV strings) of two names and one address. I need to find entries that are similar to each other: basically names and addresses all look β€œthe same” as if they were interpreted by a person.

I used the ideas from this excellent blog post: http://knol.google.com/k/simple-simhashing# to write a simple SimHash. If the SimHash results for two or more lines are the same, I pass all the records of this subset to a fine-grained matching program, which is O (n ^ 2), which compares each set record with each other record.

For the SimHash part, I have options where I can determine the size of the datagram (basically a sliding window of size n above the lines) and the number of iterations used to determine the number of (random) hashes that I need to use to calculate SimHash. So far, the size of datagram 4 and the use of 4 hashes to calculate SimHash. I have tried various other combinations, but this result gives the best results.

The problem I am facing is that this method finds about 80% of the duplicates in the datasets that I have. I know this because I checked the entire dataset against the painfully slow complete match O (n ^ 2) mentioned above. The O (n ^ 2) connector is suitable for datasets less than 10 ^ 4, but quickly becomes impracticable since I need to run 10 ^ 8 sized sets.

Any ideas, suggestions or thoughts on how I can improve the accuracy of SimHash, so that more similar entries are marked with the same SimHash number?

EDIT: Before SimHashing, I use and extract all characters! [0-9A-Z]. Examples of things that need to match (spelling errors intentionally):


  • JOHN SMITH, 123 ANY STREET SOMETOWN ZIP
  • JOHNNY SMITH, 123 ANY STRET
  • SOMETOWNE ZIP ROBERT PARKER, 442 ANY STREET SOMETOWN ZIP

Here 1 and 2 are similar, 3 is not. The output should be: 1 + 2

+5
source share
2 answers

Before you try to prove yourself and change the sim, did you try to apply domain knowledge to the problem?

? - ?

, , , , N, E, S, W , , , , ..?

+3

( , rep.)

? ? ? ? ?

, - , O (n ^ 2), , , . , , , ( O (n ^ 2), .

: , ( /).

- , , , , Java . .equals() ( , ). hashCode() . HashSet.

HashSet ( .equals()/.hashCode()).

, , , HashSet, , , , - .

, , . , HashSet .

- , , , . , , .

, , - , MD5 . . - . , , , , . , , .

0

All Articles