Data Deduplication Algorithm for a Large Number of Contacts

Question


I am developing an application that must find and merge duplicates among hundreds of thousands of contact records stored in a SQL Server database. I need to compare all the columns in a table, and each column has a weight value that the comparison must take into account. Based on the comparison result and the degree of similarity, I must decide whether to merge the contacts automatically or flag them for the user's attention. I know that there are many fuzzy-logic algorithms for deduplication.
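The weighted comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names, the weights, and the two thresholds are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever per-column similarity measure is eventually chosen.

```python
from difflib import SequenceMatcher

# Hypothetical column weights and decision thresholds; tune them on real data.
WEIGHTS = {"last_name": 0.4, "first_name": 0.2, "address": 0.3, "phone": 0.1}
AUTO_MERGE = 0.90    # at or above this score, merge automatically
NEEDS_REVIEW = 0.70  # between the two thresholds, ask the user


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] of two values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of the per-column similarities."""
    total = sum(WEIGHTS.values())
    score = sum(
        weight * field_similarity(rec_a.get(col, ""), rec_b.get(col, ""))
        for col, weight in WEIGHTS.items()
    )
    return score / total


def decide(rec_a: dict, rec_b: dict) -> str:
    """Return 'merge', 'review' or 'distinct' for a pair of contact records."""
    score = record_score(rec_a, rec_b)
    if score >= AUTO_MERGE:
        return "merge"
    if score >= NEEDS_REVIEW:
        return "review"
    return "distinct"
```

Calling `decide(contact_a, contact_b)` on two dictionaries keyed by the column names gives the action for that pair. Keeping two thresholds instead of one is what separates automatic merging from the cases that need the user's attention.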

I read about algorithms based on N-grams or Q-grams at http://www.melissadata.com/ . Is this kind of algorithm feasible for a large dataset? If not, can someone suggest an algorithm or point me to where to start?
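For readers unfamiliar with q-grams, here is a minimal sketch of one common way to turn them into a similarity score. The padding character `#`, the default `q = 2`, and the use of Jaccard similarity are assumptions made for illustration, not something taken from the linked site.

```python
def qgrams(text: str, q: int = 2) -> set:
    """Set of q-grams of the lowercased text, padded so word edges count too."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the two q-gram sets, in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # two empty values are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

With these definitions, `qgram_similarity("Gonzales", "Gonzalez")` shares 7 of 11 distinct bigrams, while `qgram_similarity("Gonzales", "Smith")` shares none. Note that scoring every pair of records this way grows quadratically with the number of contacts, which is the part that matters for a large dataset.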

An example of what I want to achieve

Gonzales = Gonzalez (two different spellings of the same name)
Smith = Smyth (phonetically the same)
123 Main st = 123 Main street (abbreviation)
Bob Smith = Robert Smith (synonym)

+4

algorithm duplicates fuzzy-logic record-linkage

Shankar Oct 4 '13 at 11:54


3 answers

( , ). , , . , .

Q/N- ( ) - , . , . Q- , .

(, Soundex Metaphone - ), , , , , .. , , . Soundex. , , , . .

Wikipedia , . Duke . , , . .

, , .

+4

larsga 14 . '14 20:20

, - . , :

Group: Kathryn names: [Kathryn, Katharine, Katherin, Katherynn, Kathrynn, Katherynne, Kathrynne, Catherine, Cathryn, Catharine, Catherin, Catherynn, Cathrynn, Catherynne, Cathrynne]
Group: Assaf names: [Assaf, Asaf]
Group: Megan names: [Megan, Meagan, Meghan, Meaghan]
Group: Allison names: [Allison, Alyson, Allyson, Alison, Allisyn]
==============================================================
Phonetic Encoder: Caverphone2
---- Names Group: Kathryn ----
Encoded names: {KTRN111111=16}
---- Names Group: Assaf ----
Encoded names: {ASF1111111=3}
---- Names Group: Megan ----
Encoded names: {MKN1111111=5}
---- Names Group: Allison ----
Encoded names: {ALSN111111=6}
==============================================================
Phonetic Encoder: DoubleMetaphone
---- Names Group: Kathryn ----
Encoded names: {K0RN=16}
---- Names Group: Assaf ----
Encoded names: {ASF=3}
---- Names Group: Megan ----
Encoded names: {MKN=5}
---- Names Group: Allison ----
Encoded names: {ALSN=6}
==============================================================
Phonetic Encoder: Nysiis
---- Names Group: Kathryn ----
Encoded names: {CATRYN=7, CATARA=6, CATARY=5}
---- Names Group: Assaf ----
Encoded names: {ASAF=3}
---- Names Group: Megan ----
Encoded names: {MAGAN=5}
---- Names Group: Allison ----
Encoded names: {ALASAN=3, ALYSAN=3, ALASYN=2}
==============================================================
Phonetic Encoder: Soundex
---- Names Group: Kathryn ----
Encoded names: {K365=8, C365=9}
---- Names Group: Assaf ----
Encoded names: {A210=3}
---- Names Group: Megan ----
Encoded names: {M250=5}
---- Names Group: Allison ----
Encoded names: {A425=6}
==============================================================
Phonetic Encoder: RefinedSoundex
---- Names Group: Kathryn ----
Encoded names: {C30609080=5, K3060908=5, K30609080=4, C3060908=5}
---- Names Group: Assaf ----
Encoded names: {A0302=3}
---- Names Group: Megan ----
Encoded names: {M80408=5}
---- Names Group: Allison ----
Encoded names: {A070308=6}
==============================================================

In this example, you can see that with Caverphone2 and DoubleMetaphone every name in a group was encoded to the same code. You need to work out what makes sense for your data: the right encoder depends on the language and etymology of the names (for example, English names, German names, ...).
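To make the grouping idea concrete, here is a minimal sketch of the classic Soundex encoding, the simplest of the encoders listed above. It is a plain Python re-implementation written for illustration, not the code that produced the output above, and the helper name `group_by_code` is hypothetical.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, Y, H and W have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Classic four-character Soundex code; non A-Z characters are ignored."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical digits
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated digit after it is kept
    return (encoded + "000")[:4]


def group_by_code(names):
    """Bucket names by Soundex code; only names in one bucket need comparing."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)
```

For example, `soundex("Kathryn")` gives `K365` and `soundex("Catherine")` gives `C365`, matching the two Soundex codes shown for the Kathryn group, while `soundex("Smith")` and `soundex("Smyth")` both give `S530`. Because Soundex keeps the first letter, K and C spellings land in different buckets, which is one reason the choice of encoder matters.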

+1

Asaf Oct 4 '13 at 12:48


Shankar · Accepted Answer · 2013-10-08T10:43:08+0000

A partial solution was found using the simhash algorithm. I found a good example here: http://simhash.codeplex.com/
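For context, the general simhash idea can be sketched as below. This is not the code from the linked project: the choice of character 3-grams as features, MD5 as the underlying hash, and a 64-bit fingerprint are all assumptions made for the sake of a short, self-contained example.

```python
import hashlib

BITS = 64  # fingerprint width; the hash below supplies exactly 64 bits


def _hash64(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str, q: int = 3) -> int:
    """64-bit simhash over the character q-grams of the normalized text."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    tokens = [text[i:i + q] for i in range(max(len(text) - q + 1, 1))]
    counts = [0] * BITS
    for token in tokens:
        h = _hash64(token)
        for bit in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two contacts are then treated as candidate duplicates when `hamming_distance(simhash(a), simhash(b))` falls below a threshold chosen by experiment. The appeal for a large table is that each record is hashed once and stored as a single integer, but on short strings such as a single name the fingerprints can be noisy.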
