Fuzzy SQL Server Search with Percent Match

I am using SQL Server 2008 R2 SP1.

I have a table with 36,034 customer records. I am trying to implement a fuzzy search on the customer name field.

Here is the function for the fuzzy search:

    ALTER FUNCTION [Party].[FuzySearch]
    (
        @Reference VARCHAR(200),
        @Target    VARCHAR(200)
    )
    RETURNS DECIMAL(5, 2)
    WITH SCHEMABINDING
    AS
    BEGIN
        DECLARE @score DECIMAL(5, 2)

        SELECT @score = CASE
            WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5, 2))
            WHEN @Reference IS NULL OR @Target IS NULL THEN CAST(0 AS NUMERIC(5, 2))
            ELSE (
                SELECT [Score %] = CAST(SUM(LetterScore) * 100.0 / MAX(WordLength * WordLength) AS NUMERIC(5, 2))
                FROM ( -- do
                    SELECT seq = t1.n,
                           ref.Letter,
                           v.WordLength,
                           LetterScore = v.WordLength - ISNULL(MIN(tgt.n), v.WordLength)
                    FROM ( -- v
                        SELECT Reference = LEFT(@Reference + REPLICATE('_', WordLength), WordLength),
                               Target = LEFT(@Target + REPLICATE('_', WordLength), WordLength),
                               WordLength = WordLength
                        FROM ( -- di
                            SELECT WordLength = MAX(WordLength)
                            FROM ( VALUES (DATALENGTH(@Reference)), (DATALENGTH(@Target)) ) d (WordLength)
                        ) di
                    ) v
                    CROSS APPLY ( -- t1
                        SELECT TOP (WordLength) n
                        FROM ( VALUES
                            (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13), (14), (15), (16), (17), (18), (19), (20),
                            (21), (22), (23), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), (38), (39), (40),
                            (41), (42), (43), (44), (45), (46), (47), (48), (49), (50), (51), (52), (53), (54), (55), (56), (57), (58), (59), (60),
                            (61), (62), (63), (64), (65), (66), (67), (68), (69), (70), (71), (72), (73), (74), (75), (76), (77), (78), (79), (80),
                            (81), (82), (83), (84), (85), (86), (87), (88), (89), (90), (91), (92), (93), (94), (95), (96), (97), (98), (99), (100),
                            (101), (102), (103), (104), (105), (106), (107), (108), (109), (110), (111), (112), (113), (114), (115), (116), (117), (118), (119), (120),
                            (121), (122), (123), (124), (125), (126), (127), (128), (129), (130), (131), (132), (133), (134), (135), (136), (137), (138), (139), (140),
                            (141), (142), (143), (144), (145), (146), (147), (148), (149), (150), (151), (152), (153), (154), (155), (156), (157), (158), (159), (160),
                            (161), (162), (163), (164), (165), (166), (167), (168), (169), (170), (171), (172), (173), (174), (175), (176), (177), (178), (179), (180),
                            (181), (182), (183), (184), (185), (186), (187), (188), (189), (190), (191), (192), (193), (194), (195), (196), (197), (198), (199), (200)
                        ) t2 (n)
                    ) t1
                    CROSS APPLY ( SELECT Letter = SUBSTRING(Reference, t1.n, 1) ) ref
                    OUTER APPLY ( -- tgt
                        SELECT TOP (WordLength) n = ABS(t1.n - t2.n)
                        FROM ( VALUES
                            (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13), (14), (15), (16), (17), (18), (19), (20),
                            (21), (22), (23), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), (38), (39), (40),
                            (41), (42), (43), (44), (45), (46), (47), (48), (49), (50), (51), (52), (53), (54), (55), (56), (57), (58), (59), (60),
                            (61), (62), (63), (64), (65), (66), (67), (68), (69), (70), (71), (72), (73), (74), (75), (76), (77), (78), (79), (80),
                            (81), (82), (83), (84), (85), (86), (87), (88), (89), (90), (91), (92), (93), (94), (95), (96), (97), (98), (99), (100),
                            (101), (102), (103), (104), (105), (106), (107), (108), (109), (110), (111), (112), (113), (114), (115), (116), (117), (118), (119), (120),
                            (121), (122), (123), (124), (125), (126), (127), (128), (129), (130), (131), (132), (133), (134), (135), (136), (137), (138), (139), (140),
                            (141), (142), (143), (144), (145), (146), (147), (148), (149), (150), (151), (152), (153), (154), (155), (156), (157), (158), (159), (160),
                            (161), (162), (163), (164), (165), (166), (167), (168), (169), (170), (171), (172), (173), (174), (175), (176), (177), (178), (179), (180),
                            (181), (182), (183), (184), (185), (186), (187), (188), (189), (190), (191), (192), (193), (194), (195), (196), (197), (198), (199), (200)
                        ) t2 (n)
                        WHERE SUBSTRING(@Target, t2.n, 1) = ref.Letter
                    ) tgt
                    GROUP BY t1.n, ref.Letter, v.WordLength
                ) do
            )
        END

        RETURN @score
    END

Here is the query that calls the function:

 select [Party].[FuzySearch]('First Name Middle Name Last Name', C.FirstName) from dbo.Customer C 

It takes about 2 minutes 22 seconds to return the fuzzy match percentage for every row.

How can I fix this so that it runs in less than a second? Any suggestions for making my function more reliable are also welcome.

Expected output: 45.34, 40.00, 100.00, 23.00, 81.23, ...

2 answers

Here is how I would do it.

Explained further at: Fuzzy SQL Server Search - Levenshtein Algorithm

Create the file below using any editor of your choice:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class StoredFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
        public static SqlDouble Levenshtein(SqlString stringOne, SqlString stringTwo)
        {
            #region Handle for Null value
            if (stringOne.IsNull) stringOne = new SqlString("");
            if (stringTwo.IsNull) stringTwo = new SqlString("");
            #endregion

            #region Convert to Uppercase
            string strOneUppercase = stringOne.Value.ToUpper();
            string strTwoUppercase = stringTwo.Value.ToUpper();
            #endregion

            #region Quick Check and quick match score
            int strOneLength = strOneUppercase.Length;
            int strTwoLength = strTwoUppercase.Length;
            int[,] dimention = new int[strOneLength + 1, strTwoLength + 1];
            int matchCost = 0;

            if (strOneLength + strTwoLength == 0)
            {
                return 100; // both strings empty: treat as a perfect match
            }
            else if (strOneLength == 0)
            {
                return 0;
            }
            else if (strTwoLength == 0)
            {
                return 0;
            }
            #endregion

            #region Levenshtein Formula
            // First column and first row hold the distance from an empty prefix.
            for (int i = 0; i <= strOneLength; i++)
                dimention[i, 0] = i;
            for (int j = 0; j <= strTwoLength; j++)
                dimention[0, j] = j;

            for (int i = 1; i <= strOneLength; i++)
            {
                for (int j = 1; j <= strTwoLength; j++)
                {
                    if (strOneUppercase[i - 1] == strTwoUppercase[j - 1])
                        matchCost = 0;
                    else
                        matchCost = 1;

                    // Cheapest of deletion, insertion and substitution.
                    dimention[i, j] = System.Math.Min(
                        System.Math.Min(dimention[i - 1, j] + 1, dimention[i, j - 1] + 1),
                        dimention[i - 1, j - 1] + matchCost);
                }
            }
            #endregion

            // Calculate Percentage of match
            double percentage = System.Math.Round(
                (1.0 - ((double)dimention[strOneLength, strTwoLength]
                        / (double)System.Math.Max(strOneLength, strTwoLength))) * 100.0, 2);
            return percentage;
        }
    }

Name it levenshtein.cs.

Open a command prompt, change to the directory containing levenshtein.cs, and run csc.exe /t:library /out:UserFunctions.dll levenshtein.cs. You may need to provide the full path to csc.exe in the .NET Framework 2.0 directory.

Once the library is built, add it to the database under Database → Programmability → Assemblies → New Assembly.
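If you prefer a script to the Management Studio dialog, the assembly can also be registered with T-SQL. This is only a sketch: the file path is a hypothetical placeholder, and the assembly name UserFunctions is chosen to match the EXTERNAL NAME used when the function is created.

```sql
-- Hypothetical path: point this at wherever csc.exe wrote UserFunctions.dll.
-- SAFE should be sufficient here because the method only does in-memory string work.
CREATE ASSEMBLY UserFunctions
FROM 'C:\Temp\UserFunctions.dll'
WITH PERMISSION_SET = SAFE;
GO
```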

Create a function in your database:

    CREATE FUNCTION dbo.LevenshteinSVF
    (
        @S1 NVARCHAR(200),
        @S2 NVARCHAR(200)
    )
    RETURNS FLOAT
    AS EXTERNAL NAME UserFunctions.StoredFunctions.Levenshtein
    GO

In my case, I had to enable CLR:

    EXEC sp_configure 'clr enabled', 1
    GO
    RECONFIGURE
    GO
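To check whether CLR integration is already on before changing anything, a quick look at the server configuration should be enough. A minimal sketch:

```sql
-- value_in_use = 1 means CLR integration is currently enabled on this instance.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'clr enabled';
```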

Check the function:

 SELECT dbo.LevenshteinSVF('James','James Bond') 

Result: 50% match
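The 50% follows from the formula at the end of the C# method: turning 'JAMES' into 'JAMES BOND' takes 5 insertions, the longer string has 10 characters, so (1 - 5/10) * 100 = 50. As a sketch of how this could be applied to the table from the question (dbo.Customer and FirstName come from the question; the search term, the TOP (20) and the threshold of 60 are arbitrary values picked for illustration):

```sql
-- Score every customer once, then keep only the closest matches.
SELECT TOP (20)
       c.FirstName,
       s.Score
FROM dbo.Customer AS c
CROSS APPLY (SELECT Score = dbo.LevenshteinSVF('James', c.FirstName)) AS s
WHERE s.Score >= 60          -- arbitrary cut-off, tune to taste
ORDER BY s.Score DESC;
```

Wrapping the call in CROSS APPLY just gives the score a name that can be reused in WHERE and ORDER BY without repeating the function call in the query text.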


The best I could do was simplify part of the query and change it to a table-valued function. Scalar functions are known to perform poorly, and the advantage of an inline TVF is that its definition is expanded into the calling query, like a view.

This greatly reduced the execution time in the tests I ran.

    CREATE FUNCTION dbo.FuzySearchTVF (@Reference VARCHAR(200), @Target VARCHAR(200))
    RETURNS TABLE
    AS
    RETURN
    (
        WITH N (n) AS
        (   -- Numbers 1..longest input, taken from a 10 x 10 x 2 = 200-row cross join
            SELECT TOP (ISNULL(CASE WHEN DATALENGTH(@Reference) > DATALENGTH(@Target)
                                    THEN DATALENGTH(@Reference)
                                    ELSE DATALENGTH(@Target) END, 0))
                   ROW_NUMBER() OVER(ORDER BY n1.n)
            FROM (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS N1 (n)
            CROSS JOIN (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS N2 (n)
            CROSS JOIN (VALUES (1), (1)) AS N3 (n)
            WHERE @Reference IS NOT NULL AND @Target IS NOT NULL
        ),
        Src AS
        (   -- Pad the shorter string with '_' so both have the same length
            SELECT Reference = CASE WHEN DATALENGTH(@Reference) > DATALENGTH(@Target) THEN @Reference
                                    ELSE @Reference + REPLICATE('_', DATALENGTH(@Target) - DATALENGTH(@Reference)) END,
                   -- Fixed: the pad length for Target is Reference length minus Target length
                   Target = CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN @Target
                                 ELSE @Target + REPLICATE('_', DATALENGTH(@Reference) - DATALENGTH(@Target)) END,
                   WordLength = CASE WHEN DATALENGTH(@Reference) > DATALENGTH(@Target)
                                     THEN DATALENGTH(@Reference)
                                     ELSE DATALENGTH(@Target) END
            WHERE @Reference IS NOT NULL AND @Target IS NOT NULL AND @Reference != @Target
        ),
        Scores AS
        (   -- For each reference letter, score by distance to the nearest equal letter in the target
            SELECT seq = t1.n,
                   Letter = SUBSTRING(s.Reference, t1.n, 1),
                   s.WordLength,
                   LetterScore = s.WordLength - ISNULL(MIN(ABS(t1.n - t2.n)), s.WordLength)
            FROM Src AS s
            CROSS JOIN N AS t1
            INNER JOIN N AS t2
                ON SUBSTRING(@Target, t2.n, 1) = SUBSTRING(s.Reference, t1.n, 1)
            WHERE @Reference IS NOT NULL AND @Target IS NOT NULL AND @Reference != @Target
            GROUP BY t1.n, SUBSTRING(s.Reference, t1.n, 1), s.WordLength
        )
        SELECT [Score] = 100
        WHERE @Reference = @Target
        UNION ALL
        SELECT 0
        WHERE @Reference IS NULL OR @Target IS NULL
        UNION ALL
        SELECT CAST(SUM(LetterScore) * 100.0 / MAX(WordLength * WordLength) AS NUMERIC(5, 2))
        FROM Scores
        WHERE @Reference IS NOT NULL AND @Target IS NOT NULL AND @Reference != @Target
        GROUP BY WordLength
    );

It would then be called like this:

    SELECT f.Score
    FROM dbo.Customer AS c
    CROSS APPLY dbo.FuzySearchTVF('First Name Middle Name Last Name', c.FirstName) AS f
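One detail to be aware of, as far as I can tell from reading the definition: because the Scores step uses an INNER JOIN, a name that shares no characters with the reference (an empty string, for instance) produces no row at all, and CROSS APPLY then drops that customer from the result. If every customer should appear, a variant along these lines might help. It assumes the dbo.FuzySearchTVF function defined above and treats a missing score as 0, which is my assumption rather than something the original function states.

```sql
-- OUTER APPLY keeps customers for which the TVF returns no row;
-- ISNULL turns the resulting NULL score into 0.
SELECT c.FirstName,
       Score = ISNULL(f.Score, 0)
FROM dbo.Customer AS c
OUTER APPLY dbo.FuzySearchTVF('First Name Middle Name Last Name', c.FirstName) AS f;
```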

This is still a fairly complex function, and depending on the number of rows in your customer table, I think getting it down to 1 second will be a bit of a challenge.
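To see how far a given version is from the one-second target, one option is to let SQL Server report the elapsed time for the query. A minimal sketch, assuming the dbo.FuzySearchTVF function and dbo.Customer table used above:

```sql
-- Prints parse/compile and execution times (CPU and elapsed) to the Messages tab.
SET STATISTICS TIME ON;

SELECT f.Score
FROM dbo.Customer AS c
CROSS APPLY dbo.FuzySearchTVF('First Name Middle Name Last Name', c.FirstName) AS f;

SET STATISTICS TIME OFF;
```

Running the same harness around the original scalar function call gives a like-for-like comparison on your own data.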
