The best way to clean and normalize large amounts of data based on a string matching algorithm

I am currently working on a data modeling project as part of my summer project at the university. Client data needs a lot of clearing, as a number of columns rely on user input and have free text.

To give an example, the Business Name has several entries for the same company. For "Hugo Boss" it includes "of Hugo Bos", "Huggo on Boss", "of Hugo Boss Ltd" .

I can potentially go through each line and determine all the values ​​that were used and create a map for each record, however, given that I am dealing with 1 million records, it is very time consuming and not very ideal.

Do people know the source code of such / similar implementation? I reviewed the matching algorithm, however they rely on a pre-computed template. What other matching method or machine learning methods can I use to design an automated process that would clear the data, i.e. Matched all the different names to one name.

Any help would be appreciated.

+4
source share
4 answers

This research field is called Data Alignment or Link Record.

, . , ( ).

, n- .

n = 3 Hugo Boss

[hug, ugo, go , o b,  bo, bos, oss]

jaccard ngrams.

, , Hugo Boss Huggo Boss:

[hug, ugo, go , o b,  bo, bos, oss]
[hug, ugg, ggo, go , o b,  bo, bos, oss]
jaccard similarity: 0.6666666666666666

, Lucene. .

+5

"Hugo Boss" "Hugo Bos", "Huggo Boss", "Hugo Boss Ltd".... soundex ( ), "LTD" ".

soundex . "Hugo Boss" "Hugo Bos" "Huggo Boss". "Hugo Boss Ltd" - LTD . , , .

, soundex , . , .

, , "", "LLC", "Corp", - . soundex, .

, ngrams, thomas , ngrams .

NYSIIS:

, -:

1. Translate first characters of name: MAC β†’ MCC, KN β†’ N, K β†’ C, PH, PF β†’ FF, SCH β†’ SSS
2. Translate last characters of name: EE β†’ Y, IE β†’ Y, DT, RT, RD, NT, ND β†’ D
3. First character of key = first character of name.
4. Translate remaining characters by following rules, incrementing by one character each time:
    1. EV β†’ AF else A, E, I, O, U β†’ A
    2. Q β†’ G, Z β†’ S, M β†’ N
    3. KN β†’ N else K β†’ C
    4. SCH β†’ SSS, PH β†’ FF
    5. H β†’ If previous or next is non-vowel, previous.
    6. W β†’ If previous is vowel, A.
    7. Add current to key if current is not same as the last key character.
5. If last character is S, remove it.
6. If last characters are AY, replace with Y.
7. If last character is A, remove it.
8. Append translated key to value from step 3 (removed first character)
9. If longer than 6 characters, truncate to first 6 characters. (only needed for true NYSIIS, some versions use the full key)

Soundex . python :

#!/usr/bin/env python

import fuzzy

names = [ 'Catherine', 'Katherine', 'Katarina',
      'Johnathan', 'Jonathan', 'John',
      'Teresa', 'Theresa',
      'Smith', 'Smyth',
      'Jessica',
      'Joshua',
      ]

for n in names:
    print '%-10s' % n, fuzzy.nysiis(n)

:

$ python show_nysiis.py
Catherine  CATARAN
Katherine  CATARAN
Katarina   CATARAN
Johnathan  JANATAN
Jonathan   JANATAN
John       JAN
Teresa     TARAS
Theresa    TARAS
Smith      SNATH
Smyth      SNATH
Jessica    JASAC
Joshua     JAS

: http://www.informit.com/articles/article.aspx?p=1848528

ngrams .

, - .

+1

Another option is to look at the Levenshtein distance https://en.wikipedia.org/wiki/Levenshtein_distance

This will help you in cases like Hugo Boss vs Huggo Boss, but will not work for Hugo Boss vs The Hugo Boss

0
source

You can try aho-corasick finite state machine and increase it with a wildcard. Another option is the suffix tree, i.e. levensthein distance. You can try my PHP implementation aho-corasick @ https://phpahocorasick.codeplex.com .

0
source

All Articles