Find all rows that share at least X leading characters, sorted by similarity

Question


I am working on a project that stores the names of various drugs. Often I find something like Proscratinol and Proscratinol XR (extended release). I would like a query that picks up all names of this nature so that I can put the "parent" drug in a table and have these "child" drugs reference it. That way, when I write a query to count the number of drugs, I don't double count Proscratinol just because it has XR, CR and other versions. I wrote the following to take a stab at it.

    ;with x as (
        select drug_name from rx group by drug_name
    )
    select distinct *
    from x, x as x2
    where LEFT(x2.drug_name, 5) = LEFT(x.drug_name, 5)
      and x.drug_name != x2.drug_name

This gives me a list of all drugs whose names share the same first five letters. Five is completely arbitrary. What I have so far is good enough, but I would like to order the results by descending similarity. That is, for each pair I would like to know how many characters, counted from the left, are the same.

e.g. Phenytoin and Phelip would be 3 (their first three letters are the same)

    ;with x as (
        select drug_name from rx group by drug_name
    )
    select x.drug_name as xDrugName
         , x2.drug_name as x2DrugName
         , case when LEFT(x2.drug_name, 6) = LEFT(x.drug_name, 6)
                then LEN(LEFT(x.drug_name, 6))
                else 0
           end
    from x, x as x2
    where LEFT(x2.drug_name, 5) = LEFT(x.drug_name, 5)
      and x.drug_name != x2.drug_name
    group by x.drug_name, x2.drug_name

Instead of hard-coding the integer passed to the LEFT function in the query above, I need an integer expression that returns how many identical leading characters the two strings share. Is there a good way to do this?


string sql sql-server sql-server-2008

wootscootinboogie Mar 20 '13 at 15:01


2 answers

You need the longest common subsequence. Here is a SQL Server implementation:

    select dbo.lcs(@string1, @string2), len(@string1), len(@string2)

    CREATE FUNCTION [dbo].[LCS] (@s varchar(MAX), @t varchar(MAX))
    RETURNS INT
    AS
    BEGIN
        DECLARE @d varchar(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT,
                @s_i NCHAR(1), @t_j NCHAR(1)

        SET @n = LEN(@s)
        IF @n = 0 RETURN 0
        SET @m = LEN(@t)
        IF @m = 0 RETURN 0

        -- @d is a flattened (@n+1) x (@m+1) table; each cell holds a count stored as a character code
        SET @d = REPLICATE(CHAR(0), (@n+1)*(@m+1))

        SET @i = 1
        WHILE @i <= @n
        BEGIN
            SET @s_i = SUBSTRING(@s, @i, 1)
            SET @j = 1
            WHILE @j <= @m
            BEGIN
                SET @t_j = SUBSTRING(@t, @j, 1)
                IF @s_i = @t_j
                    -- characters match: diagonal cell + 1
                    SET @d = STUFF(@d, @j*(@n+1) + @i + 1, 1,
                                   NCHAR(UNICODE(SUBSTRING(@d, (@j-1)*(@n+1) + @i-1 + 1, 1)) + 1))
                ELSE
                    -- no match: larger of the two neighbouring cells
                    -- (dbo.Max2 is called here but not defined in this answer)
                    SET @d = STUFF(@d, @j*(@n+1) + @i + 1, 1,
                                   CHAR(dbo.Max2(
                                       UNICODE(SUBSTRING(@d, @j*(@n+1) + @i-1 + 1, 1)),
                                       UNICODE(SUBSTRING(@d, (@j-1)*(@n+1) + @i + 1, 1)))))
                SET @j = @j + 1
            END
            SET @i = @i + 1
        END

        SET @LD = UNICODE(SUBSTRING(@d, @n*(@m+1) + @m + 1, 1))
        RETURN @LD
    END
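The function above calls dbo.Max2, which the answer does not show. A minimal sketch of such a helper, assuming it is simply meant to return the larger of two integers (the name comes from the answer; the body and parameter names are a guess):

```sql
-- Hypothetical helper: dbo.Max2 is referenced by dbo.LCS but not defined in the answer.
-- Assumed behaviour: return the larger of two integers.
CREATE FUNCTION dbo.Max2 (@a INT, @b INT)
RETURNS INT
AS
BEGIN
    RETURN CASE WHEN @a >= @b THEN @a ELSE @b END
END
```

It would need to be created before dbo.LCS is first executed.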


David Jul 01 '13 at 19:50


Gordon Linoff · Accepted Answer · 2013-03-20T15:40:09+0000

This approach uses a number generator and then just checks the length of the overlap:

    select x.drug_name, x2.drug_name, MAX(c.seqnum) as OverlapLen
    from x
    cross join x x2
    cross join (
        select ROW_NUMBER() over (order by (select NULL)) seqnum
        from INFORMATION_SCHEMA.COLUMNS c
    ) c
    where LEFT(x.drug_name, c.seqnum) = LEFT(x2.drug_name, c.seqnum)
      and len(x.drug_name) >= c.seqnum
      and len(x2.drug_name) >= c.seqnum
    group by x.drug_name, x2.drug_name
    order by x.drug_name, OverlapLen desc

This assumes that INFORMATION_SCHEMA.COLUMNS has enough rows to cover the longest drug names.
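If relying on the row count of a catalog view is a concern, the numbers can be generated from constants instead. The following is a sketch of one common alternative, not part of the original answer; the CTE names digits and numbers are made up for illustration, and the upper bound of 1000 is arbitrary:

```sql
-- Sketch: build the sequence 1..1000 from a ten-row constant table,
-- so the result does not depend on how many columns the database has.
with digits as (
    select d
    from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) as v(d)
),
numbers as (
    select a.d + 10 * b.d + 100 * c.d + 1 as seqnum
    from digits a
    cross join digits b
    cross join digits c
)
select min(seqnum) as first_n, max(seqnum) as last_n, count(*) as n_rows
from numbers;
```

The numbers CTE could then take the place of the derived table c in the query above.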

This joins x to itself, and then joins the list of numbers. The where clause checks two conditions: (1) that the leftmost seqnum characters of the two drug names are the same; (2) that the length of each drug name is greater than or equal to seqnum.

The aggregation then takes each pair and selects the maximum value of seqnum, which is the length of the longest matching prefix.
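To try the idea end to end, here is a self-contained sketch. The table variable @rx and its four rows are invented sample data standing in for the rx table, sys.all_columns is swapped in as the row source because it is populated even in an empty database, and the HAVING threshold of 5 mirrors the arbitrary cutoff from the question:

```sql
-- Hypothetical sample data standing in for the rx table.
declare @rx table (drug_name varchar(100));
insert into @rx (drug_name)
values ('Proscratinol'), ('Proscratinol XR'), ('Proscratinol CR'), ('Prosac');

with x as (
    select drug_name from @rx group by drug_name
)
select x.drug_name  as xDrugName,
       x2.drug_name as x2DrugName,
       max(c.seqnum) as OverlapLen
from x
cross join x x2
cross join (
    -- number generator: 1, 2, 3, ... one per row of sys.all_columns
    select row_number() over (order by (select null)) as seqnum
    from sys.all_columns
) c
where left(x.drug_name, c.seqnum) = left(x2.drug_name, c.seqnum)
  and len(x.drug_name)  >= c.seqnum   -- guards keep seqnum within both names
  and len(x2.drug_name) >= c.seqnum
  and x.drug_name <> x2.drug_name     -- skip pairing a name with itself
group by x.drug_name, x2.drug_name
having max(c.seqnum) >= 5             -- "at least X characters", X = 5 here
order by x.drug_name, OverlapLen desc;
```

With this sample data, the pair Proscratinol / Proscratinol XR should report an overlap of 12, while Prosac drops out because it shares only four leading characters with the others.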
