I have a table identities(i.e. aliases) for an arbitrary number of people. Each row has a previous name and a new name. There are about 1 M lines in production. For instance:
id, old, new
---
1, 'Albert', 'Bob'
2, 'Bob', 'Charles'
3, 'Mary', 'Nancy'
4, 'Charles', 'Albert'
5, 'Lydia', 'Nancy'
6, 'Zoe', 'Zoe'
I want to create a list usersand specify all their corresponding identifiers. This is similar to finding all the nodes in each graph of connected identities or searching for a spanning forest:
User 1: Albert, Bob, Charles (identities: 1,2,4)
User 2: Mary, Nancy, Lydia (identities: 3,5)
User 3: Zoe (identities: 6)
I do PostgreSQL WITH RECURSIVE, but it gives for each set and subset. For instance:
1,2,4 <-- spanning tree: good
2 <-- subset: discard
3,5 <-- spanning tree: good
4 <-- subset: discard
5 <-- subset: discard
6 <-- spanning tree: good
What do I need to do to produce only a complete set of identities (i.e. spanning tree) for each user?
SQLFiddle: http://sqlfiddle.com/#!15/9eaed/4 with my last attempt. Here is the code:
WITH RECURSIVE search_graph AS (
SELECT id
, id AS min_id
, ARRAY[id] AS path
, ARRAY[old,new] AS emails
FROM identities
UNION
SELECT identities.id
, LEAST(identities.id, sg.min_id)
, (sg.path || identities.id)
, (sg.emails || identities.old || identities.new)
FROM search_graph sg
JOIN identities ON (identities.old = ANY(sg.emails) OR identities.new = ANY(sg.emails))
WHERE identities.id <> ALL(sg.path)
)
SELECT array_agg(DISTINCT(p)) from search_graph, unnest(path) p GROUP BY min_id;
And the results:
1,2,4
2
3,5
4
5
6