Complex SQL join with GROUP BY

Question

I am trying to optimize a slow query. Its goal is to find the best-matching F2 values (by a custom similarity measure). Here is an example of what I have:

    CREATE TABLE Test (
        F1 varchar(124),
        F2 varchar(124),
        F3 varchar(124)
    )

    INSERT INTO Test (F1, F2, F3) VALUES ('A', 'B', 'C')
    INSERT INTO Test (F1, F2, F3) VALUES ('D', 'B', 'E')
    INSERT INTO Test (F1, F2, F3) VALUES ('F', 'I', 'G')
    INSERT INTO Test (F1, F2, F3) VALUES ('F', 'I', 'G')
    INSERT INTO Test (F1, F2, F3) VALUES ('D', 'B', 'C')
    INSERT INTO Test (F1, F2, F3) VALUES ('F', 'B', 'G')
    INSERT INTO Test (F1, F2, F3) VALUES ('D', 'I', 'C')
    INSERT INTO Test (F1, F2, F3) VALUES ('A', 'B', 'C')
    INSERT INTO Test (F1, F2, F3) VALUES ('A', 'B', 'K')
    INSERT INTO Test (F1, F2, F3) VALUES ('A', 'K', 'K')

Now, if I run this query:

    SELECT B.F2, COUNT(*) AS CNT
    FROM (SELECT F1, F3 FROM Test WHERE F2 = 'B') AS A
    INNER JOIN Test AS B
        ON A.F1 = B.F1 AND A.F3 = B.F3
    GROUP BY B.F2
    ORDER BY CNT DESC

There are over 1 million rows in the table. What would be the best way to do this?

+6

sql join sql-server sql-server-2008 group-by

jozi Sep 16 '12 at 4:24


5 answers

You can also write your query in this form. Because it uses a single SELECT, the retrieval time should be reduced:

    SELECT Test_1.F2, COUNT(Test_1.F1) AS Cnt
    FROM Test
    INNER JOIN Test AS Test_1
        ON Test.F1 = Test_1.F1 AND Test.F3 = Test_1.F3
    WHERE (Test.F2 = 'B')
    GROUP BY Test_1.F2

+3

Maryam arshi Sep 16 '12 at 4:58


Here is another way to write your query. It is close to guido's answer, but it runs on MS SQL Server.

    WITH Filtered AS (
        SELECT DISTINCT F1, F3 FROM Test WHERE F2 = 'B'
    )
    SELECT B.F2, COUNT(*) AS CNT
    FROM Test B
    INNER JOIN Filtered
        ON B.F1 = Filtered.F1 AND B.F3 = Filtered.F3
    GROUP BY B.F2
    ORDER BY CNT DESC

I think your original query might have an error, as Fred mentioned. The count for F2 = 'B' should be 6, not 8, in your example, right? If 8 is intended, remove the DISTINCT.
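The difference between 6 and 8 can be checked directly on the sample data. The sketch below is not from the original answers: it assumes Python's built-in sqlite3 module as a stand-in for SQL Server (the query uses only syntax both engines accept), and `counts` is a hypothetical helper name.

```python
import sqlite3

# Sample rows (F1, F2, F3) copied from the question.
ROWS = [("A", "B", "C"), ("D", "B", "E"), ("F", "I", "G"), ("F", "I", "G"),
        ("D", "B", "C"), ("F", "B", "G"), ("D", "I", "C"), ("A", "B", "C"),
        ("A", "B", "K"), ("A", "K", "K")]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for SQL Server
conn.execute("CREATE TABLE Test (F1 TEXT, F2 TEXT, F3 TEXT)")
conn.executemany("INSERT INTO Test VALUES (?, ?, ?)", ROWS)

# The question's join; {select} is either "SELECT" or "SELECT DISTINCT".
JOIN = """
    SELECT B.F2, COUNT(*) AS CNT
    FROM ({select} F1, F3 FROM Test WHERE F2 = 'B') AS A
    INNER JOIN Test AS B ON A.F1 = B.F1 AND A.F3 = B.F3
    GROUP BY B.F2
    ORDER BY CNT DESC
"""

def counts(select_keyword):
    """Hypothetical helper: run the join and return {F2: count}."""
    return dict(conn.execute(JOIN.format(select=select_keyword)).fetchall())

with_duplicates = counts("SELECT")
deduplicated = counts("SELECT DISTINCT")
print(with_duplicates)  # {'B': 8, 'I': 3, 'K': 1}
print(deduplicated)     # {'B': 6, 'I': 3, 'K': 1}
```

The two identical ('A', 'B', 'C') rows account for the gap: without DISTINCT each of them matches the other as well as itself.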

Another thing you can try is to make (F2, F1, F3) the clustered index of the Test table and to add a non-clustered index on (F1, F3).

Sample code is also available on SQL Fiddle.

+3

kennethc Sep 16 '12 at 8:25


If there are over 1 million rows in the Test table, the intermediate join result that you then group could have hundreds of millions of rows.
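How large that intermediate result gets depends on how many rows share each (F1, F3) pair, since every row with F2 = 'B' is paired with every row in the same (F1, F3) group. A rough way to estimate it, sketched in Python on the question's sample data (the variable names are illustrative only):

```python
from collections import Counter

# Sample rows (F1, F2, F3) copied from the question.
ROWS = [("A", "B", "C"), ("D", "B", "E"), ("F", "I", "G"), ("F", "I", "G"),
        ("D", "B", "C"), ("F", "B", "G"), ("D", "I", "C"), ("A", "B", "C"),
        ("A", "B", "K"), ("A", "K", "K")]

# Number of rows in each (F1, F3) group, over the whole table.
group_sizes = Counter((f1, f3) for f1, _, f3 in ROWS)

# Number of rows in each (F1, F3) group that have F2 = 'B'.
b_sizes = Counter((f1, f3) for f1, f2, f3 in ROWS if f2 == "B")

# The join emits (B rows in group) * (all rows in group) rows per group.
join_rows = sum(n_b * group_sizes[key] for key, n_b in b_sizes.items())
print(join_rows)  # 12
```

On the real table the same two grouped counts could be obtained with GROUP BY F1, F3 before running the full join, to see whether the product is really in the hundreds of millions.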

This will work in MySQL but, as far as I know, not in SQL Server:

    SELECT F2, COUNT(*)
    FROM Test AS B
    WHERE (B.F1, B.F3) IN (SELECT F1, F3 FROM Test WHERE F2 = 'B')
    GROUP BY F2
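For engines without row-value IN, a correlated EXISTS is one way to express the same filter. The following is only a sketch: it uses Python's built-in sqlite3 module (an assumption; SQLite 3.15 or later accepts both forms) to check that the two queries agree on the sample data.

```python
import sqlite3

# Sample rows (F1, F2, F3) copied from the question.
ROWS = [("A", "B", "C"), ("D", "B", "E"), ("F", "I", "G"), ("F", "I", "G"),
        ("D", "B", "C"), ("F", "B", "G"), ("D", "I", "C"), ("A", "B", "C"),
        ("A", "B", "K"), ("A", "K", "K")]

conn = sqlite3.connect(":memory:")  # assumption: SQLite used only for checking
conn.execute("CREATE TABLE Test (F1 TEXT, F2 TEXT, F3 TEXT)")
conn.executemany("INSERT INTO Test VALUES (?, ?, ?)", ROWS)

# Row-value IN, as in the answer above.
ROW_VALUE_IN = """
    SELECT F2, COUNT(*) FROM Test AS B
    WHERE (B.F1, B.F3) IN (SELECT F1, F3 FROM Test WHERE F2 = 'B')
    GROUP BY F2
"""

# The same filter written as a correlated EXISTS.
EXISTS_FORM = """
    SELECT B.F2, COUNT(*) FROM Test AS B
    WHERE EXISTS (SELECT 1 FROM Test AS A
                  WHERE A.F2 = 'B' AND A.F1 = B.F1 AND A.F3 = B.F3)
    GROUP BY B.F2
"""

in_counts = dict(conn.execute(ROW_VALUE_IN).fetchall())
exists_counts = dict(conn.execute(EXISTS_FORM).fetchall())
print(in_counts == exists_counts)  # True
print(exists_counts["B"])          # 6
```

Both forms count each row of Test at most once, so they give 6 for 'B', like the DISTINCT variant, rather than the 8 produced by the question's join.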

+1

ᴳᵁᴵᴰᴼ Sep 16 '12 at 6:01


I understand that this has already been answered, but I think this approach can be much faster, especially if F1 and F3 have many duplicate values:

    SELECT B.F2, SUM(A.cnt) AS CNT
    FROM (SELECT F1, F3, COUNT(*) AS cnt
          FROM Test
          WHERE F2 = 'B'
          GROUP BY F1, F3
         ) A
    INNER JOIN Test B
        ON A.F1 = B.F1 AND A.F3 = B.F3
    GROUP BY B.F2
    ORDER BY CNT DESC

If F1 and F3 do not contain many distinct combinations, the first subquery should reduce to a few hundred or a few thousand rows. (Your sample data uses single uppercase letters, so there would be 26 × 26 = 676 combinations if all letters are used.) SQL Server is then likely to use a merge or hash join on that result, which should perform well.
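To see that pre-aggregating does not change the answer, the two forms can be compared on the sample data. This is a sketch only, assuming Python's built-in sqlite3 module as a stand-in for SQL Server; it says nothing about relative speed on the real table.

```python
import sqlite3

# Sample rows (F1, F2, F3) copied from the question.
ROWS = [("A", "B", "C"), ("D", "B", "E"), ("F", "I", "G"), ("F", "I", "G"),
        ("D", "B", "C"), ("F", "B", "G"), ("D", "I", "C"), ("A", "B", "C"),
        ("A", "B", "K"), ("A", "K", "K")]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for SQL Server
conn.execute("CREATE TABLE Test (F1 TEXT, F2 TEXT, F3 TEXT)")
conn.executemany("INSERT INTO Test VALUES (?, ?, ?)", ROWS)

# The question's query: join against every row with F2 = 'B'.
ORIGINAL = """
    SELECT B.F2, COUNT(*) AS CNT
    FROM (SELECT F1, F3 FROM Test WHERE F2 = 'B') AS A
    INNER JOIN Test AS B ON A.F1 = B.F1 AND A.F3 = B.F3
    GROUP BY B.F2 ORDER BY CNT DESC
"""

# Pre-aggregated form: one row per (F1, F3) pair, carrying its multiplicity.
PREAGGREGATED = """
    SELECT B.F2, SUM(A.cnt) AS CNT
    FROM (SELECT F1, F3, COUNT(*) AS cnt
          FROM Test WHERE F2 = 'B' GROUP BY F1, F3) AS A
    INNER JOIN Test AS B ON A.F1 = B.F1 AND A.F3 = B.F3
    GROUP BY B.F2 ORDER BY CNT DESC
"""

original = conn.execute(ORIGINAL).fetchall()
preaggregated = conn.execute(PREAGGREGATED).fetchall()
print(preaggregated)              # [('B', 8), ('I', 3), ('K', 1)]
print(original == preaggregated)  # True
```

On the sample data the inner subquery shrinks from 6 rows to 5; the saving grows with the number of duplicate (F1, F3) pairs among the F2 = 'B' rows.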

You can also do this without a join, using window functions:

    select t.f2, sum(nummatches) as cnt
    from (select t.*,
                 sum(isB) over (partition by f1, f3) as nummatches
          from (select t.*,
                       (case when F2 = 'B' then 1 else 0 end) as IsB
                from test t
               ) t
         ) t
    group by t.f2
    order by 2 desc

Window functions often perform better because they avoid the self-join and work on one partition of the data at a time.
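A compact version of the window-function query can be tried on the sample data as follows. This is a sketch, assuming Python's built-in sqlite3 module with SQLite 3.25 or later (the first release with window functions) as a stand-in for SQL Server.

```python
import sqlite3

# Sample rows (F1, F2, F3) copied from the question.
ROWS = [("A", "B", "C"), ("D", "B", "E"), ("F", "I", "G"), ("F", "I", "G"),
        ("D", "B", "C"), ("F", "B", "G"), ("D", "I", "C"), ("A", "B", "C"),
        ("A", "B", "K"), ("A", "K", "K")]

conn = sqlite3.connect(":memory:")  # assumption: SQLite >= 3.25
conn.execute("CREATE TABLE Test (F1 TEXT, F2 TEXT, F3 TEXT)")
conn.executemany("INSERT INTO Test VALUES (?, ?, ?)", ROWS)

# For every row, count the F2 = 'B' rows sharing its (F1, F3) pair,
# then add those per-row counts up for each F2 value.
WINDOWED = """
    SELECT t.F2, SUM(t.nummatches) AS CNT
    FROM (SELECT F2,
                 SUM(CASE WHEN F2 = 'B' THEN 1 ELSE 0 END)
                     OVER (PARTITION BY F1, F3) AS nummatches
          FROM Test) AS t
    GROUP BY t.F2
    ORDER BY CNT DESC
"""

windowed = conn.execute(WINDOWED).fetchall()
print(windowed)  # [('B', 8), ('I', 3), ('K', 1)]
```

One difference from the join: an F2 value whose rows never share an (F1, F3) pair with a 'B' row still appears here, with a count of 0, whereas the join omits it. The sample data contains no such value.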

+1

Gordon linoff Sep 16 '12 at 17:46


Fred sobotka · Accepted Answer · 2012-09-16T06:34:44+0000

A filtered search for all rows WHERE F2 = 'B' will result in a full table scan unless you create an index that has F2 as its first or only column. Further, the join condition uses the columns F1 and F3, which you mention are already part of an index that starts with F1.
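The effect of an index led by F2 can be illustrated with a small experiment. The sketch below assumes Python's built-in sqlite3 module; SQLite's planner and its EXPLAIN QUERY PLAN output differ from SQL Server's execution plans, so this shows the general idea only. The index name `ix_test_f2_f1_f3` and the helper `plan` are made up for the example.

```python
import sqlite3

# Sample rows (F1, F2, F3) copied from the question.
ROWS = [("A", "B", "C"), ("D", "B", "E"), ("F", "I", "G"), ("F", "I", "G"),
        ("D", "B", "C"), ("F", "B", "G"), ("D", "I", "C"), ("A", "B", "C"),
        ("A", "B", "K"), ("A", "K", "K")]

conn = sqlite3.connect(":memory:")  # assumption: SQLite, not SQL Server
conn.execute("CREATE TABLE Test (F1 TEXT, F2 TEXT, F3 TEXT)")
conn.executemany("INSERT INTO Test VALUES (?, ?, ?)", ROWS)

FILTER = "SELECT F1, F3 FROM Test WHERE F2 = 'B'"

def plan(sql):
    """Hypothetical helper: return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text

plan_before = plan(FILTER)  # no index yet: the table has to be scanned

# Index with F2 as the leading column; F1 and F3 make it cover the filter.
conn.execute("CREATE INDEX ix_test_f2_f1_f3 ON Test (F2, F1, F3)")
plan_after = plan(FILTER)

print(plan_before)
print(plan_after)
print("ix_test_f2_f1_f3" in plan_after)  # True
```

On SQL Server the corresponding check would be to compare the actual execution plans before and after creating the index.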

I also notice that the first part of your query does not eliminate duplicates from the set of (F1, F3) pairs where F2 = 'B', as you would expect when intersecting it with another subset of the same table. You may have a reason for this, but we cannot know for sure until you provide some details of the similarity measurement algorithm that you are trying to implement.

Your ORDER BY also affects query execution time, potentially requiring a large internal sort of the final result set.
