SQL grouping / line overlapping

Question

SQL grouping / line overlapping

I have the following table in Postgres that has overlapping data in two columns a_sno and b_sno .

 create table data ( a_sno integer not null, b_sno integer not null, PRIMARY KEY (a_sno,b_sno) ); insert into data (a_sno,b_sno) values ( 4, 5 ) , ( 5, 4 ) , ( 5, 6 ) , ( 6, 5 ) , ( 6, 7 ) , ( 7, 6 ) , ( 9, 10) , ( 9, 13) , (10, 9 ) , (13, 9 ) , (10, 13) , (13, 10) , (10, 14) , (14, 10) , (13, 14) , (14, 13) , (11, 15) , (15, 11);

As you can see from the first 6 rows, the data values 4,5,6 and 7 in two columns intersect / overlap, which must be divided into a group. The same goes for lines 7-16 and lines 17-18, which will be marked as groups 2 and 3, respectively.

The result should look like this:

 group | value ------+------ 1 | 4 1 | 5 1 | 6 1 | 7 2 | 9 2 | 10 2 | 13 2 | 14 3 | 11 3 | 15

+5

sql postgresql common-table-expression recursive-query

bogeyman Apr 19 '15 at 19:42

source share

3 answers

I want to say differently, it may be useful, you can do it in 2 steps:

1. take max(sno) for each group:

  select q.sno, row_number() over(order by q.sno) gn from( select distinct d.a_sno sno from data d where not exists ( select b_sno from data where b_sno=d.a_sno and a_sno>d.a_sno ) )q

result:

 sno gn 7 1 14 2 15 3

2. Use recursive cte to find all related items in groups:

 with recursive cte(sno,gn,path,cycle)as( select q.sno, row_number() over(order by q.sno) gn, array[q.sno],false from( select distinct d.a_sno sno from data d where not exists ( select b_sno from data where b_sno=d.a_sno and a_sno>d.a_sno ) )q union all select d.a_sno,c.gn, d.a_sno || c.path, d.a_sno=any(c.path) from data d join cte c on d.b_sno=c.sno where not cycle ) select distinct gn,sno from cte order by gn,sno

Result:

 gn sno 1 4 1 5 1 6 1 7 2 9 2 10 2 13 2 14 3 11 3 15

here is a demo of what i did.

+4

jfun Apr 20 '15 at 0:35

source share

Here is a start that may give some ideas on the approach. A recursive query starts with a_sno each record, and then tries to follow the b_sno path until it reaches the end or forms a loop. The path is represented by an array of sno integers.

The unnest function splits an array into strings, so the sno value sno mapped to an array of paths, for example:

 4, {6, 5, 4}

will be converted to a string for each value in the array:

 4, 6 4, 5 4, 4

Then array_agg cancels the operation, aggregating the values back to the path, but getting rid of duplicates and ordering.

Now each a_sno is associated with a path, and the path forms a grouping. dense_rank can be used to match groupings (clusters) to numbers.

 SELECT array_agg(DISTINCT map ORDER BY map) AS cluster ,sno FROM ( WITH RECURSIVE x(sno, path, cycle) AS ( SELECT a_sno, ARRAY[a_sno], false FROM data UNION ALL SELECT b_sno, path || b_sno, b_sno = ANY(path) FROM data, x WHERE a_sno = x.sno AND NOT cycle ) SELECT sno, unnest(path) AS map FROM x ORDER BY 1 ) y GROUP BY sno ORDER BY 1, 2

Conclusion:

  cluster | sno --------------+----- {4,5,6,7} | 4 {4,5,6,7} | 5 {4,5,6,7} | 6 {4,5,6,7} | 7 {9,10,13,14} | 9 {9,10,13,14} | 10 {9,10,13,14} | 13 {9,10,13,14} | 14 {11,15} | 11 {11,15} | 15 (10 rows)

Wrap it again for ranking:

 SELECT dense_rank() OVER(order by cluster) AS rank ,sno FROM ( SELECT array_agg(DISTINCT map ORDER BY map) AS cluster ,sno FROM ( WITH RECURSIVE x(sno, path, cycle) AS ( SELECT a_sno, ARRAY[a_sno], false FROM data UNION ALL SELECT b_sno, path || b_sno, b_sno = ANY(path) FROM data, x WHERE a_sno = x.sno AND NOT cycle ) SELECT sno, unnest(path) AS map FROM x ORDER BY 1 ) y GROUP BY sno ORDER BY 1, 2 ) z

Conclusion:

  rank | sno ------+----- 1 | 4 1 | 5 1 | 6 1 | 7 2 | 9 2 | 10 2 | 13 2 | 14 3 | 11 3 | 15 (10 rows)

+2

Glenn Apr 19 '15 at 21:40

source share

Erwin brandstetter · Accepted Answer · 2015-04-20T02:54:00+0000

Assuming that all pairs exist in their mirror combination, as well as (4,5) and (5,4) . But the following solutions work without mirror duplicates.

Simple case

All connections can be arranged in one single ascending sequence and complications like I added to fiddle are impossible, we can use this solution without duplicates in rCTE:

I start with a minimum a_sno for each group with a minimum b_sno :

 SELECT row_number() OVER (ORDER BY a_sno) AS grp , a_sno, min(b_sno) AS b_sno FROM data d WHERE a_sno < b_sno AND NOT EXISTS ( SELECT 1 FROM data WHERE b_sno = d.a_sno AND a_sno < b_sno ) GROUP BY a_sno;

This requires only one level of queries, since the window function can be built on the basis of the aggregate:

Get a single sum of a joined table column

Result:

 grp a_sno b_sno 1 4 5 2 9 10 3 11 15

I avoid branches and duplicated (multiplied) lines - potentially much more expensive with long chains. I use ORDER BY b_sno LIMIT 1 in a correlated subquery to make this fly in a recursive CTE.

Create a unique index in a unique column

The key to performance is the corresponding index, which is already present subject to the PK PRIMARY KEY (a_sno,b_sno) : not vice versa ~~(b_sno, a_sno)~~ :

Is a composite index useful for queries in the first field?

 WITH RECURSIVE t AS ( SELECT row_number() OVER (ORDER BY d.a_sno) AS grp , a_sno, min(b_sno) AS b_sno -- the smallest one FROM data d WHERE a_sno < b_sno AND NOT EXISTS ( SELECT 1 FROM data WHERE b_sno = d.a_sno AND a_sno < b_sno ) GROUP BY a_sno ) , cte AS ( SELECT grp, b_sno AS sno FROM t UNION ALL SELECT c.grp , (SELECT b_sno -- correlated subquery FROM data WHERE a_sno = c.sno AND a_sno < b_sno ORDER BY b_sno LIMIT 1) FROM cte c WHERE c.sno IS NOT NULL ) SELECT * FROM cte WHERE sno IS NOT NULL -- eliminate row with NULL UNION ALL -- no duplicates SELECT grp, a_sno FROM t ORDER BY grp, sno;

Less simple case

All nodes can be reached in ascending order with one or more branches from the root (smallest sno ).

This time, get all the big sno and de-duplicate nodes that you can visit several times with UNION at the end:

 WITH RECURSIVE t AS ( SELECT rank() OVER (ORDER BY d.a_sno) AS grp , a_sno, b_sno -- get all rows for smallest a_sno FROM data d WHERE a_sno < b_sno AND NOT EXISTS ( SELECT 1 FROM data WHERE b_sno = d.a_sno AND a_sno < b_sno ) ) , cte AS ( SELECT grp, b_sno AS sno FROM t UNION ALL SELECT c.grp, d.b_sno FROM cte c JOIN data d ON d.a_sno = c.sno AND d.a_sno < d.b_sno -- join to all connected rows ) SELECT grp, sno FROM cte UNION -- eliminate duplicates SELECT grp, a_sno FROM t -- add first rows ORDER BY grp, sno;

Unlike the first solution, we do not get the last row with NULL here (caused by the correlated subquery).

Both should work very well - especially with long chains / many branches. Result at will:

SQL Fiddle (with added rows to demonstrate difficulties).

Unlimited graph

If there are local lows that cannot be reached from the root with an upward traversal, the above solutions will not work. Consider the Farhęg solution in this case.

SQL grouping / line overlapping

Simple case

Less simple case

Unlimited graph

More articles: