Fuzzy matching using T-SQL

I have a table of Persons with personal data, etc. There are many columns, but interest in them is here: addressindex , lastname and firstname , where addressindex is a unique address drilled to the door of the apartment. Therefore, if I have "as below", two people with lastname and one firstnames same, they are most likely duplicates.

I need a way to list these duplicates.

 tabledata: personid 1 firstname "Carl" lastname "Anderson" addressindex 1 personid 2 firstname "Carl Peter" lastname "Anderson" addressindex 1 

I know how to do this if I fit all the columns exactly, but I need a fuzzy match to do the trick with (from the above example), for example:

 Row personid addressindex lastname firstname 1 2 1 Anderson Carl Peter 2 1 1 Anderson Carl ..... 

Any clues on how to fix this problem correctly?

+57
sql-server tsql fuzzy-search
May 28 '09 at 16:52
source share
8 answers

I found that the material SQL Server gives you makes fuzzy matching rather awkward. I was very lucky with my own CLR functions using the Levenshtein distance algorithm and some weight. Using this algorithm, I made a UDF called GetSimilarityScore, which takes two lines and returns a result between 0.0 and 1.0. The closer to 1.0 the match, the better. Then query with a threshold> = 0.8 or so to get the most likely matches. Something like that:

 if object_id('tempdb..#similar') is not null drop table #similar select a.id, ( select top 1 x.id from MyTable x where x.id <> a.id order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc ) as MostSimilarId into #similar from MyTable a select *, dbo.GetSimilarityScore(a.MyField, c.MyField) from MyTable a join #similar b on a.id = b.id join MyTable c on b.MostSimilarId = c.id 

Just don't do this with really big tables. This is a slow process.

Here's the CLR UDFs:

 ''' <summary> ''' Compute the distance between two strings. ''' </summary> ''' <param name="s1">The first of the two strings.</param> ''' <param name="s2">The second of the two strings.</param> ''' <returns>The Levenshtein cost.</returns> <Microsoft.SqlServer.Server.SqlFunction()> _ Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32 If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null Dim s1 As String = string1.Value Dim s2 As String = string2.Value Dim n As Integer = s1.Length Dim m As Integer = s2.Length Dim d As Integer(,) = New Integer(n, m) {} ' Step 1 If n = 0 Then Return m If m = 0 Then Return n ' Step 2 For i As Integer = 0 To n d(i, 0) = i Next For j As Integer = 0 To m d(0, j) = j Next ' Step 3 For i As Integer = 1 To n 'Step 4 For j As Integer = 1 To m ' Step 5 Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1) ' Step 6 d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost) Next Next ' Step 7 Return d(n, m) End Function ''' <summary> ''' Returns a score between 0.0-1.0 indicating how closely two strings match. 1.0 is a 100% ''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings. ''' </summary> <Microsoft.SqlServer.Server.SqlFunction()> _ Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c) Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c) If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2) Dim letterS1 As String = GetLetterSimilarityString(s1) Dim letterS2 As String = GetLetterSimilarityString(s2) Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2) 'Dim wordS1 As String = GetWordSimilarityString(s1) 'Dim wordS2 As String = GetWordSimilarityString(s2) 'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2) If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F ' Return weighted result Return (flatLevScore * 0.2F) + (letterScore * 0.8F) End Function Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2) Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length) If maxLen = 0 Then Return 1.0F Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen) End Function ''' <summary> ''' Sorts all the alpha numeric characters in the string in alphabetical order ''' and removes everything else. ''' </summary> Private Shared Function GetLetterSimilarityString(s1 As String) As String Dim allChars = If(s1, "").ToUpper().ToCharArray() Array.Sort(allChars) Dim result As New StringBuilder() For Each ch As Char In allChars If Char.IsLetterOrDigit(ch) Then result.Append(ch) End If Next Return result.ToString() End Function ''' <summary> ''' Removes all non-alpha numeric characters and then sorts ''' the words in alphabetical order. ''' </summary> Private Shared Function GetWordSimilarityString(s1 As String) As String Dim words As New List(Of String)() Dim curWord As StringBuilder = Nothing For Each ch As Char In If(s1, "").ToUpper() If Char.IsLetterOrDigit(ch) Then If curWord Is Nothing Then curWord = New StringBuilder() End If curWord.Append(ch) Else If curWord IsNot Nothing Then words.Add(curWord.ToString()) curWord = Nothing End If End If Next If curWord IsNot Nothing Then words.Add(curWord.ToString()) End If words.Sort(StringComparer.OrdinalIgnoreCase) Return String.Join(" ", words.ToArray()) End Function 
+15
Mar 16 2018-12-12T00:
source share

In addition to other useful information, here you can consider using the phonetic Double Metaphone algorithm, which is far superior to SOUNDEX . There is a Transact-SQL version ( link to the code here ).

This will help to combine the names with small spelling mistakes, for example, Karl and Karl.

+13
May 28 '09 at 17:05
source share

I would use SQL Server Full Text Indexing, which will allow you to search and return things that not only contain a word, but can also have a spelling error.

+7
May 28 '09 at 16:56
source share

Since the first release of Master Data Services, you have access to better algorithms with fuzzy logic than SOUNDEX implements. Thus, provided that you have MDS installed, you can find the Similarity () function in the mdq schema (MDS database).

Further information on how this works: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

+7
Mar 14 2018-12-12T00:
source share

I personally use the Jaro-Winkler CLR implementation of the algorithm, which seems to work very well - it struggles a bit with strings longer than about 15 characters and doesn't like email address matching, but otherwise it's good - a complete implementation guide can be found here

If you cannot use the CLR functions for any reason, perhaps you can try to run the data through the SSIS package (using the fuzzy transform search) - detailed here

+3
Mar 13 2018-12-12T00:
source share

As far as deduplication of things is concerned, your split and line matching is great. If data is known about data that can be used to reduce workload and / or obtain better results, it is always useful to use it. Keep in mind that it is often impossible to completely eliminate manual work in order to eliminate duplication, although you can do it much easier by catching as much as you can, and then generate reports on your “cases of uncertainty”.

Regarding name matching: SOUNDEX is terrible for the quality of the match, and especially bad for the type of work you are trying to do, as it will fit things that are too far from the target. It is better to use a combination of double metaphone results and Levenshtein distance to perform name matching. With the appropriate bias, this works very well and can probably be used for the second pass after cleaning your known ones.

You can also consider using the SSIS package and searching for fuzzy search and grouping transformations (http://msdn.microsoft.com/en-us/library/ms345128(SQL.90).aspx).

Using a full-text SQL search (http://msdn.microsoft.com/en-us/library/cc879300.aspx) is also possible, but probably not suitable for your specific problem domain.

+1
Nov 24 '10 at 18:46
source share

You can use SOUNDEX and the corresponding DIFFERENCE function in SQL Server to find similar names. MSDN link here .

0
May 28 '09 at 16:56
source share

do it like that

  create table person( personid int identity(1,1) primary key, firstname varchar(20), lastname varchar(20), addressindex int, sound varchar(10) ) 

and then create a trigger

  create trigger trigoninsert for dbo.person on insert as declare @personid int; select @personid=personid from inserted; update person set sound=soundex(firstname) where personid=@personid; 

Now, what can I do, I can create a procedure that looks something like this:

  create procedure getfuzzi(@personid int) as declare @sound varchar(10); set @sound=(select sound from person where personid=@personid; select personid,firstname,lastname,addressindex from person where sound=@sound 

this will return to you all names that almost match the names provided for a particular person

0
Oct 18 '13 at 6:31
source share



All Articles