Fuzzy Logic Matching

I am looking at implementing fuzzy matching at my company and am not getting good results. To start with, I am trying to match our company names against the names on lists from other companies.
My first attempt was to use Soundex, but it seems that Soundex only encodes the first few sounds in a company name, so longer company names are too easily confused with each other.
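To illustrate the collision problem, here is a minimal Python sketch of classic American Soundex. It assumes the whole company name is encoded as one string with non-letters dropped; the function name is only for illustration, and a database's built-in SOUNDEX may treat spaces differently.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [c for c in name.lower() if c.isalpha()]
    if not letters_only:
        return ""

    first = letters_only[0]
    result = first.upper()
    prev = codes.get(first, "")
    for ch in letters_only[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break  # everything after the third digit is ignored
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return result.ljust(4, "0")


print(soundex("International Business Machines"))  # I536
print(soundex("Internal Revenue Service"))         # I536
```

Both names collapse to the same four-character code because only the first few consonant sounds are kept.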
Now I am working on my second attempt, using Levenshtein distance. This looks promising, especially if you remove the punctuation first. However, I still have difficulty finding duplicates without too many false positives. One of the problems I have is with companies like widgetsco vs widgets inc. If I compare only a substring the length of the shorter name, I also match things such as BBC University and CBC University Campus. I suspect that scoring with a combination of edit distance and longest common substring may be the solution.
Has anyone managed to build an algorithm that does this kind of matching with limited false positives?
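As a rough illustration of the idea in the question, the following Python sketch strips punctuation, then averages an edit-distance similarity with the share covered by the longest common substring. The function names and the equal weighting are assumptions made for this example, not a tuned or recommended scoring scheme.

```python
import string


def normalize(name: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(name.lower().translate(table).split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            run = previous[j - 1] + 1 if ca == cb else 0
            current.append(run)
            best = max(best, run)
        previous = current
    return best


def combined_score(a: str, b: str) -> float:
    """Average of edit similarity and substring share, between 0 and 1."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    edit_similarity = 1 - levenshtein(a, b) / longest
    substring_share = longest_common_substring(a, b) / longest
    return (edit_similarity + substring_share) / 2


print(combined_score("widgetsco", "Widgets, Inc."))
print(combined_score("BBC University", "CBC University Campus"))
```

Any cutoff for calling two names duplicates would have to be chosen by checking the scores against known matches and non-matches in your own data.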

3 answers

You want to use something like Levenshtein distance or another string comparison algorithm. You can take a look at this project on CodePlex.

http://fuzzystring.codeplex.com/
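If you just want to try "another string comparison algorithm" without installing anything, Python's standard library has difflib. Note that SequenceMatcher is not Levenshtein distance; it scores the share of characters that fall in matching blocks. This is only a sketch, and the candidate list and the 0.6 cutoff are made up for the example.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical candidate names, for illustration only.
candidates = ["widgetsco", "acme ltd", "gadget corp"]

# Candidates scoring at least `cutoff`, best match first.
matches = get_close_matches("widgets inc", candidates, n=3, cutoff=0.6)

print(similarity("widgets inc", "widgetsco"))
print(matches)
```

The cutoff plays the same role as a maximum edit distance would, so it needs the same kind of tuning against real data.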


Are you using Access? If so, consider the * wildcard character (without quotation marks). If you are using SQL Server, use the % wildcard instead. However, this is not really fuzzy logic; it is just the LIKE operator. If you really need fuzzy logic, export your dataset to Excel and install the add-in from the URL below.

https://www.microsoft.com/en-us/download/details.aspx?id=15011

Read the instructions very carefully. It definitely works, and it works great, but you need to follow the instructions, and they are not completely intuitive. The first time I tried this, I did not follow the instructions, and I spent a lot of time getting it to work. I eventually figured it out, and it works great!
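To show why the LIKE operator is pattern matching rather than fuzzy logic, here is a small Python sketch that uses an in-memory SQLite database as a stand-in for Access or SQL Server. SQLite uses the same % wildcard as SQL Server; the table, the sample names, and the helper function are assumptions for the example.

```python
import sqlite3


def like_matches(names, pattern):
    """Return the names matching a SQL LIKE pattern, sorted by name."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE companies (name TEXT)")
        conn.executemany("INSERT INTO companies (name) VALUES (?)",
                         [(n,) for n in names])
        rows = conn.execute(
            "SELECT name FROM companies WHERE name LIKE ? ORDER BY name",
            (pattern,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


names = ["Widgets Inc", "widgetsco", "Widgits Inc", "Acme Ltd"]

# The wildcard absorbs the differing suffix, but the misspelled
# "Widgits Inc" is missed because the fixed prefix must match exactly.
print(like_matches(names, "widgets%"))  # ['Widgets Inc', 'widgetsco']
```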


We had good results when matching names and addresses using the Metaphone function created by Lawrence Philips. It works similarly to Soundex, but creates a sound/consonant pattern for the whole value. You may find this useful in combination with some other methods, especially if you can remove some of the fluff, such as "co." and "inc.", as mentioned in other comments:

create function [dbo].[Metaphone](@str as nvarchar(70), @KeepNumeric as bit = 0)
returns nvarchar(25)
/*
    Metaphone Algorithm

    Created by Lawrence Philips.
    Metaphone presented in article in "Computer Language" December 1990 issue.

    *********** BEGIN METAPHONE RULES ***********
    Lawrence Philips' RULES follow:

    The 16 consonant sounds:
                                                |--- ZERO represents "th"
                                                |
         B  X  S  K  J  T  F  H  L  M  N  P  R  0  W  Y

    Drop vowels

    Exceptions:
        Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-"  ----> drop first letter
        Beginning of word: "wh-"                             ----> change to "w"
        Beginning of word: "x"                               ----> change to "s"
        Beginning of word: vowel or "H" + vowel              ----> Keep it

    Transformations:
        B ----> B       unless at the end of word after "m", as in "dumb", "McComb"
        C ----> X (sh)  if "-cia-" or "-ch-"
                S       if "-ci-", "-ce-", or "-cy-"
                SILENT  if "-sci-", "-sce-", or "-scy-"
                K       otherwise
                K       "-sch-"
        D ----> J       if in "-dge-", "-dgy-", or "-dgi-"
                T       otherwise
        F ----> F
        G ----> SILENT  if "-gh-" and not at end or before a vowel
                        "-gn" or "-gned"
                        "-dge-" etc., as in above rule
                J       if "gi", "ge", "gy" if not double "gg"
                K       otherwise
        H ----> SILENT  if after vowel and no vowel follows
                        or "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
                H       otherwise
        J ----> J
        K ----> SILENT  if after "c"
                K       otherwise
        L ----> L
        M ----> M
        N ----> N
        P ----> F       if before "h"
                P       otherwise
        Q ----> K
        R ----> R
        S ----> X (sh)  if "sh" or "-sio-" or "-sia-"
                S       otherwise
        T ----> X (sh)  if "-tia-" or "-tio-"
                0 (th)  if "th"
                SILENT  if "-tch-"
                T       otherwise
        V ----> F
        W ----> SILENT  if not followed by a vowel
                W       if followed by a vowel
        X ----> KS
        Y ----> SILENT  if not followed by a vowel
                Y       if followed by a vowel
        Z ----> S
*/
as
begin
    declare @Result varchar(25)
           ,@str3   char(3)
           ,@str2   char(2)
           ,@str1   char(1)
           ,@strp   char(1)
           ,@strLen tinyint
           ,@cnt    tinyint

    set @strLen = len(@str)
    set @cnt    = 0
    set @Result = ''

    -- Preserve first 5 numeric values when required
    if @KeepNumeric = 1
    begin
        set @Result =
            case when isnumeric(substring(@str,1,1)) = 1 then
                case when isnumeric(substring(@str,2,1)) = 1 then
                    case when isnumeric(substring(@str,3,1)) = 1 then
                        case when isnumeric(substring(@str,4,1)) = 1 then
                            case when isnumeric(substring(@str,5,1)) = 1 then left(@str,5)
                                 else left(@str,4)
                            end
                        else left(@str,3)
                        end
                    else left(@str,2)
                    end
                else left(@str,1)
                end
            else ''
            end
        set @str = right(@str,len(@str)-len(@Result))
    end

    -- Process beginning exceptions
    set @str2 = left(@str,2)
    if @str2 = 'wh'
    begin
        set @str = 'w' + right(@str , @strLen - 2)
        set @strLen = @strLen - 1
    end
    else if @str2 in('ae', 'gn', 'kn', 'pn', 'wr')
    begin
        set @str = right(@str , @strLen - 1)
        set @strLen = @strLen - 1
    end

    set @str1 = left(@str,1)
    if @str1 = 'x'
        set @str = 's' + right(@str , @strLen - 1)
    else if @str1 in ('a','e','i','o','u')
    begin
        set @str = right(@str, @strLen - 1)
        set @strLen = @strLen - 1
        set @Result = @Result + @str1
    end

    while @cnt <= @strLen
    begin
        set @cnt = @cnt + 1
        set @str1 = substring(@str,@cnt,1)
        set @strp = case when @cnt <> 0 then substring(@str,(@cnt-1),1) else ' ' end

        -- Check if the current character is the same as the previous character.
        -- If we are keeping numbers, only compare non-numeric characters.
        if case
               when @KeepNumeric = 1 and @strp = @str1 and isnumeric(@str1) = 0 then 1
               when @KeepNumeric = 0 and @strp = @str1 then 1
               else 0
           end = 1
            continue -- Skip this loop

        set @str2 = substring(@str,@cnt,2)

        set @Result =
            case
                when @KeepNumeric = 1 and isnumeric(@str1) = 1 then @Result + @str1
                when @str1 in('f','j','l','m','n','r') then @Result + @str1
                when @str1 = 'q' then @Result + 'k'
                when @str1 = 'v' then @Result + 'f'
                when @str1 = 'x' then @Result + 'ks'
                when @str1 = 'z' then @Result + 's'
                when @str1 = 'b' then
                    case when @cnt = @strLen then
                             case when substring(@str,(@cnt - 1),1) <> 'm' then @Result + 'b'
                                  else @Result
                             end
                         else @Result + 'b'
                    end
                when @str1 = 'c' then
                    case when @str2 = 'ch' or substring(@str,@cnt,3) = 'cia' then @Result + 'x'
                         else
                             case when @str2 in('ci','ce','cy') and @strp <> 's' then @Result + 's'
                                  else @Result + 'k'
                             end
                    end
                when @str1 = 'd' then
                    case when substring(@str,@cnt,3) in ('dge','dgy','dgi') then @Result + 'j'
                         else @Result + 't'
                    end
                when @str1 = 'g' then
                    case when substring(@str,(@cnt - 1),3) not in ('dge','dgy','dgi','dha','dhe','dhi','dho','dhu') then
                             case when @str2 in('gi', 'ge','gy') then @Result + 'j'
                                  else
                                      case when @str2 <> 'gn' or (@str2 <> 'gh' and @cnt+1 <> @strLen) then @Result + 'k'
                                           else @Result
                                      end
                             end
                         else @Result
                    end
                when @str1 = 'h' then
                    case when @strp not in ('a','e','i','o','u') and @str2 not in ('ha','he','hi','ho','hu') then
                             case when @strp not in ('c','s','p','t','g') then @Result + 'h'
                                  else @Result
                             end
                         else @Result
                    end
                when @str1 = 'k' then
                    case when @strp <> 'c' then @Result + 'k' else @Result end
                when @str1 = 'p' then
                    case when @str2 = 'ph' then @Result + 'f' else @Result + 'p' end
                when @str1 = 's' then
                    case when substring(@str,@cnt,3) in ('sia','sio') or @str2 = 'sh' then @Result + 'x'
                         else @Result + 's'
                    end
                when @str1 = 't' then
                    case when substring(@str,@cnt,3) in ('tia','tio') then @Result + 'x'
                         else
                             case when @str2 = 'th' then @Result + '0'
                                  else
                                      case when substring(@str,@cnt,3) <> 'tch' then @Result + 't'
                                           else @Result
                                      end
                             end
                    end
                when @str1 = 'w' then
                    case when @str2 not in('wa','we','wi','wo','wu') then @Result + 'w' else @Result end
                when @str1 = 'y' then
                    case when @str2 not in('ya','ye','yi','yo','yu') then @Result + 'y' else @Result end
                else @Result
            end
    end

    return @Result
end
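For the "remove the fluff" step, here is a minimal Python sketch that drops common legal suffixes before a phonetic key is computed, then groups names that share a key. The FLUFF word list and both function names are assumptions for illustration, and key_func is a placeholder for whatever encoder you use (for example, a call to the Metaphone function above).

```python
import re
from collections import defaultdict

# Hypothetical list of low-information tokens; extend it for your data.
FLUFF = {"co", "inc", "ltd", "llc", "corp", "company",
         "incorporated", "limited"}


def strip_fluff(name: str) -> str:
    """Lowercase, drop punctuation, and remove fluff tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in FLUFF]
    # If every token was fluff, keep the original tokens instead of nothing.
    return " ".join(kept or tokens)


def group_by_key(names, key_func):
    """Group names whose cleaned form maps to the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[key_func(strip_fluff(name))].append(name)
    # Only keys shared by two or more names are candidate duplicates.
    return {key: found for key, found in groups.items() if len(found) > 1}


names = ["Widgets Inc.", "Widgets Co", "Acme Ltd"]
# The identity function stands in for a real phonetic encoder here.
print(group_by_key(names, key_func=lambda s: s))
```

Note that this only removes suffixes that appear as separate words, so a run-together name like widgetsco would still need edit distance or a similar comparison.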