Data comparison

We have a SQL Server table containing company name, address, and contact name (among other fields).

We regularly receive data files from external sources that we have to match against this table. Unfortunately, the data differs slightly because it comes from a completely different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example: we have "Acme, LLC" and the file contains "Acme Inc." Similarly, we have "Ed Smith" and they have "Edward Smith".

We have a legacy system that uses some fairly sophisticated and processor-intensive methods to handle these matches. Some are pure SQL, while others are VBA code in an Access database. The current system is good but not perfect, and it is bulky and difficult to maintain.

Management wants to expand its use. The developers who are inheriting support of the system want to replace it with a more flexible solution that requires less maintenance.

Is there a generally accepted way to handle this kind of data matching?

+5
7 answers

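One option is to compare strings by edit distance (Levenshtein distance, for example). Here is some code (in VB.Net) that picks the closest match from a list (modified from the source linked in the comments):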

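    'File-level imports needed by the Catch blocks in LevenshteinDistance below.
    Imports System.Reflection      'Assembly, MethodBase
    Imports System.Threading       'ThreadAbortException
    Imports System.Windows.Forms   'Application.StartupPath

    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String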
        Dim bestMatch As String = ""
        Dim bestDistance As Integer = Integer.MaxValue 'Any real candidate scores lower than this

        For Each matchCandidate As String In stringList
            Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
            If candidateDistance < bestDistance Then
                bestMatch = matchCandidate
                bestDistance = candidateDistance
            End If
        Next

        Return bestMatch
    End Function

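    'Scores how similar two strings are (lower means more similar).
    'Modified from the link below: the result is the edit distance minus the number of
    'matching character pairs, so it is not the plain Levenshtein distance and can be negative.
    'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list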
    Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim sLength As Integer = s.Length ' length of s
        Dim tLength As Integer = t.Length ' length of t
        Dim lvCost As Integer ' cost
        Dim lvDistance As Integer = 0
        Dim zeroCostCount As Integer = 0

        Try
            ' Step 1
            If tLength = 0 Then
                Return sLength
            ElseIf sLength = 0 Then
                Return tLength
            End If

            Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
            Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}

            ' fill first row
            For lvIndex As Integer = 0 To sLength
                poBuffer(lvIndex) = lvIndex
            Next

            'fill first column
            For lvIndex As Integer = 1 To tLength
                poBuffer(lvIndex * (sLength + 1)) = lvIndex
            Next

            For lvRowIndex As Integer = 0 To sLength - 1
                Dim s_i As Char = s(lvRowIndex)
                For lvColIndex As Integer = 0 To tLength - 1
                    If s_i = t(lvColIndex) Then
                        lvCost = 0
                        zeroCostCount += 1
                    Else
                        lvCost = 1
                    End If
                    ' Step 6
                    Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                    Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                    Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                    Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                    lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                    poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                Next
            Next
        Catch ex As ThreadAbortException
            Err.Clear()
        Catch ex As Exception
            'WriteDebugMessage is not defined in this snippet; substitute your own logging call.
            WriteDebugMessage(Application.StartupPath, [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
        End Try

        Return lvDistance - zeroCostCount
    End Function
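
For comparison, here is a minimal Python sketch of the same idea using the plain Levenshtein distance. It is not a port of the VB.Net above: it returns the unmodified edit distance (no subtraction of matching characters), and the function names `levenshtein` and `find_most_similar` are made up for this example.

```python
def levenshtein(s: str, t: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute each cost 1."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def find_most_similar(to_find: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest distance, or "" if there are none."""
    return min(candidates, key=lambda c: levenshtein(to_find, c), default="")


if __name__ == "__main__":
    print(find_most_similar("Acme Inc.", ["Acme, LLC", "Apex Inc.", "Acme, Inc."]))
```

Raw edit distance treats "E." versus "East" as several edits, so it tends to work better after abbreviations have been standardized than on the raw strings.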
+4

SSIS (in SQL Server 2005+ Enterprise) has a Fuzzy Lookup transformation, which is built for this kind of matching.
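
Fuzzy Lookup is configured inside an SSIS package rather than written as code, so the following is only a rough Python analogue of the general idea: score each reference row against the incoming value and accept the best one if it clears a similarity threshold. It uses the standard library's `difflib.SequenceMatcher`, which is not the algorithm SSIS uses; the names `similarity` and `fuzzy_lookup` and the default threshold are assumptions for this sketch.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(value: str, reference: list[str], threshold: float = 0.8):
    """Return (best_match, score); best_match is None if nothing reaches the threshold."""
    best, best_score = None, 0.0
    for candidate in reference:
        score = similarity(value, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


if __name__ == "__main__":
    contacts = ["Ed Smith", "Jane Doe"]
    print(fuzzy_lookup("Edward Smith", contacts, threshold=0.75))
```

The threshold is the main tuning knob: set it too low and unrelated rows get matched, too high and genuine variants fall through to manual review.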

, , .

+2

, . , .

, , , , . , , - .

, , "", " ", " VBA" " " , .

EDIT: , .NET , , . , .NET.

+2

. :

/

, .

+2

Access . SSIS . Access, SQL Server Enterprise . , .

. PIck Street, raod .. , . , - . , .
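
Standardizing address tokens such as Street and Road before comparing can be sketched as below, using the address example from the question. The abbreviation table is hypothetical and deliberately tiny; a real one would have to be built from the data actually received.

```python
import re

# Hypothetical abbreviation table (an assumption for this sketch, not a complete list).
ABBREVIATIONS = {
    "e": "east", "w": "west", "n": "north", "s": "south",
    "st": "street", "rd": "road", "ave": "avenue",
}


def normalize_address(address: str) -> str:
    """Lower-case, drop punctuation, and expand known abbreviations token by token."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(normalize_address("123 E. Main St."))
    print(normalize_address("123 East Main Street"))
```

Both calls produce the same string, so the two addresses compare as equal after normalization. A blind table like this can also go wrong (for example, "St" meaning "Saint"), so it is best treated as a first pass before any fuzzy comparison.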

, , 5 . , . , 100, Acme, Inc., ​​:

    id     field
    100    Acme, Inc.
    100    Acme, Inc
    100    Acme, Incorporated
    100    Acme, LLC
    100    Acme
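
Reading the rows above as an alias table, where every known spelling of a company maps to one id, is an assumption, but under that reading the lookup could be sketched in Python like this. `ALIAS_ROWS`, `name_key` and `lookup_id` are names invented for the example; in practice the rows would live in a SQL Server table.

```python
import re

# The example rows from above: (id, known spelling).
ALIAS_ROWS = [
    (100, "Acme, Inc."),
    (100, "Acme, Inc"),
    (100, "Acme, Incorporated"),
    (100, "Acme, LLC"),
    (100, "Acme"),
]


def name_key(name: str) -> str:
    """Lower-case the name and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


# Index the aliases by their normalized key for exact lookups.
ALIAS_INDEX = {name_key(alias): company_id for company_id, alias in ALIAS_ROWS}


def lookup_id(name: str):
    """Return the id for a known spelling, or None so the row can go to manual review."""
    return ALIAS_INDEX.get(name_key(name))


if __name__ == "__main__":
    print(lookup_id("Acme Inc."))
    print(lookup_id("Apex"))
```

Names that come back as `None` are the ones worth reviewing by hand; once resolved, the new spelling can be added to the table so the next file matches automatically.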

, , , ( ), , .

Torial , .

timeconsuming, , . , , , .

+1

, .

, , .

0

, . , ..

  • , ..
  • ( Rremove ..)

, .

. Oracle, IBM, SAS Dataflux .., .

:

, , 4,4 . , ( . )

DataMatch Enterprise, ( > 95%), ,

IBM, ( > 90%), , ( > $100K)

SAS, ( > 85%), , ( > 100K) , , .

0
