Select all rows where the group count is > 1

I have a table full of accounts, each with an address. I would like to select every account that is located at the same address as another account.

If my data looks like this:

------------------------------------
| Account Number | Address         |
| 12345          | 55 Bee St       |
| 23456          | 94 Water way    |
| 34567          | 15 Beagle Drive |
| 45678          | 55 Bee St       |
| 56789          | 94 Water way    |
| 67890          | 12 Green St     |
------------------------------------

I would like to do something like:

 SELECT * FROM accounts WHERE group by address > 1; 

So my results will be as follows:

------------------------------------
| Account Number | Address         |
| 12345          | 55 Bee St       |
| 23456          | 94 Water way    |
| 45678          | 55 Bee St       |
| 56789          | 94 Water way    |
------------------------------------

In case it matters, this is a PostgreSQL database.

+4
6 answers

Left-join the table to itself to find other records with the same address, group by the selected fields, and count the matches to keep only the records that have at least one matching address:

    select a.AccountNumber, a.Address
    from accounts a
    left join accounts o
           on o.Address = a.Address
          and o.AccountNumber <> a.AccountNumber
    group by a.AccountNumber, a.Address
    having count(o.AccountNumber) >= 1

This approach returns each account number together with its address, and it does not produce duplicates when an address occurs more than twice.
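As a quick check of the "no duplicates" claim, here is a minimal sketch that runs the same query against the question's sample data. It assumes an in-memory SQLite database as a stand-in for PostgreSQL, and the account 78901 is a hypothetical extra row added so that one address occurs three times.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (AccountNumber INTEGER, Address TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        (12345, "55 Bee St"),
        (23456, "94 Water way"),
        (34567, "15 Beagle Drive"),
        (45678, "55 Bee St"),
        (56789, "94 Water way"),
        (67890, "12 Green St"),
        (78901, "55 Bee St"),  # hypothetical third account at an existing address
    ],
)

# Self left join, then keep only groups with at least one matching partner row.
left_join_rows = conn.execute(
    """
    SELECT a.AccountNumber, a.Address
    FROM accounts a
    LEFT JOIN accounts o
           ON o.Address = a.Address
          AND o.AccountNumber <> a.AccountNumber
    GROUP BY a.AccountNumber, a.Address
    HAVING count(o.AccountNumber) >= 1
    ORDER BY a.AccountNumber
    """
).fetchall()

for row in left_join_rows:
    print(row)
```

Each of the three accounts at 55 Bee St appears once, because the GROUP BY collapses the two partner rows that the join produces for each of them.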

+1

You need to join the table to itself, using a join condition that requires both addresses to be the same while making sure the account numbers differ between the two rows:

    -- columns are qualified with a1 because both sides of the self-join expose them
    select distinct a1.account_number, a1.address
    from accounts a1
    join accounts a2
      on a1.account_number > a2.account_number
     and a1.address = a2.address

Note the use of the > comparison between account numbers, which not only prevents joining a row to itself, but also prevents joining the same pair a second time in reverse order.

I added distinct in case three or more accounts share the same address; otherwise you would not need it.
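One thing worth checking with this join is which side of each pair ends up in the result. The sketch below is an illustration only: it assumes in-memory SQLite instead of PostgreSQL, and `shared_address_accounts` is a hypothetical helper name. It runs the join once with `>` and once with `<>` on the question's data.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number INTEGER, address TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        (12345, "55 Bee St"),
        (23456, "94 Water way"),
        (34567, "15 Beagle Drive"),
        (45678, "55 Bee St"),
        (56789, "94 Water way"),
        (67890, "12 Green St"),
    ],
)


def shared_address_accounts(comparison):
    """Run the self-join with the given operator (a fixed literal, never user input)."""
    sql = f"""
        SELECT DISTINCT a1.account_number
        FROM accounts a1
        JOIN accounts a2
          ON a1.account_number {comparison} a2.account_number
         AND a1.address = a2.address
        ORDER BY a1.account_number
    """
    return [row[0] for row in conn.execute(sql)]


greater_than = shared_address_accounts(">")
not_equal = shared_address_accounts("<>")
print(greater_than)  # only the higher account number of each pair
print(not_equal)     # every account that shares an address
```

With `>`, the lowest-numbered account at each address never appears as `a1`, so it is missing from the output. If all four accounts from the question are wanted, `<>` together with `distinct` appears to be the safer choice.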

+1

This should do the trick:

    SELECT *
    FROM Account A1
    WHERE EXISTS (
        SELECT *
        FROM Account A2
        WHERE A1.AccountNumber <> A2.AccountNumber
          AND A1.Address = A2.Address
    )

In plain English: select each account for which there exists a different account (A1.AccountNumber <> A2.AccountNumber) with the same address (A1.Address = A2.Address).
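To see the EXISTS query produce the result the question asks for, here is a small sketch on the question's sample data. It assumes in-memory SQLite as a stand-in for PostgreSQL and uses the table name `Account` from this answer.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (AccountNumber INTEGER, Address TEXT)")
conn.executemany(
    "INSERT INTO Account VALUES (?, ?)",
    [
        (12345, "55 Bee St"),
        (23456, "94 Water way"),
        (34567, "15 Beagle Drive"),
        (45678, "55 Bee St"),
        (56789, "94 Water way"),
        (67890, "12 Green St"),
    ],
)

# Keep an account only if some other account has the same address.
exists_rows = conn.execute(
    """
    SELECT *
    FROM Account A1
    WHERE EXISTS (
        SELECT *
        FROM Account A2
        WHERE A1.AccountNumber <> A2.AccountNumber
          AND A1.Address = A2.Address
    )
    ORDER BY A1.AccountNumber
    """
).fetchall()

for row in exists_rows:
    print(row)
```

The ORDER BY is added only to make the printed output deterministic; it is not part of the answer's query.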

+1

Here is a test comparing the performance of the three valid answers.
EXISTS clearly outperforms LEFT JOIN / GROUP BY:

Test setup

A table with 100,000 rows and 1000 distinct values for b.
The performance difference grows with more rows; fewer duplicates mean a smaller difference.
No indexes.

    CREATE TABLE tbl (a text, b text);
    INSERT INTO tbl
    SELECT (random()*10000)::int::text
         , (random()*1000)::int || ' some more text here'
    FROM   generate_series(1, 100000) g;

1. @Guffa : LEFT JOIN / GROUP BY / HAVING

    EXPLAIN ANALYZE
    SELECT t.a, t.b
    FROM   tbl t
    LEFT   JOIN tbl t2 ON t2.b = t.b AND t2.a <> t.a
    GROUP  BY t.a, t.b
    HAVING count(t2.a) >= 1;

2. The same, simplified to JOIN / GROUP BY

    EXPLAIN ANALYZE
    SELECT t.a, t.b
    FROM   tbl t
    JOIN   tbl t2 ON t2.b = t.b AND t2.a <> t.a
    GROUP  BY t.a, t.b;

3. @Branko : EXISTS

    EXPLAIN ANALYZE
    SELECT *
    FROM   tbl t
    WHERE  EXISTS (
        SELECT *
        FROM   tbl t2
        WHERE  t2.a <> t.a
        AND    t2.b = t.b
    );

4. @Bohemian : DISTINCT

    EXPLAIN ANALYZE
    SELECT DISTINCT t.a, t.b
    FROM   tbl t
    JOIN   tbl t2 ON t2.b = t.b AND t2.a <> t.a;

→ SQLfiddle displaying EXPLAIN ANALYZE output for the queries.

  1. Total runtime: 12208.954 ms
  2. Total runtime: 11504.460 ms
  3. Total runtime: 272.508 ms (!), about 45 times faster than 1.
  4. Total runtime: 11540.627 ms

After adding a multicolumn index (SQLfiddle):

 CREATE INDEX a_b_idx ON tbl(b, a); 

... the runtime does not change. Postgres does not use the index. Apparently, it expects a sequential table scan to be faster, since the whole table has to be read anyway.

Besides the runtime, also note the row counts, which prove my point:
JOIN creates many intermediate duplicates that the EXISTS version avoids:

EXPLAIN ANALYZE output for 1.:

    HashAggregate  (cost=230601.26..230726.26 rows=10000 width=31) (actual time=12127.090..12183.087 rows=99476 loops=1)
      Filter: (count(t2.a) >= 1)
      ->  Hash Left Join  (cost=3670.00..154661.89 rows=10125250 width=31) (actual time=99.591..5897.744 rows=9991102 loops=1)
            Hash Cond: (t.b = t2.b)
            Join Filter: (t2.a <> t.a)
            Rows Removed by Join Filter: 101052
            ->  Seq Scan on tbl t  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.036..36.197 rows=100000 loops=1)
            ->  Hash  (cost=1736.00..1736.00 rows=100000 width=27) (actual time=99.141..99.141 rows=100000 loops=1)
                  Buckets: 2048  Batches: 8  Memory Usage: 784kB
                  ->  Seq Scan on tbl t2  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.004..44.899 rows=100000 loops=1)
    Total runtime: 12208.954 ms

EXPLAIN ANALYZE output for 3.:

    Hash Semi Join  (cost=3670.00..7783.00 rows=1 width=27) (actual time=81.630..247.371 rows=100000 loops=1)
      Hash Cond: (t.b = t2.b)
      Join Filter: (t2.a <> t.a)
      Rows Removed by Join Filter: 1009
      ->  Seq Scan on tbl t  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.010..32.758 rows=100000 loops=1)
      ->  Hash  (cost=1736.00..1736.00 rows=100000 width=27) (actual time=81.388..81.388 rows=100000 loops=1)
            Buckets: 2048  Batches: 8  Memory Usage: 784kB
            ->  Seq Scan on tbl t2  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.003..32.114 rows=100000 loops=1)
    Total runtime: 272.508 ms
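The row-count effect can be reproduced on a much smaller scale. The following sketch is not the benchmark above: it assumes in-memory SQLite rather than PostgreSQL and a hypothetical, deterministic data set of 1000 rows with 10 distinct values of b (100 rows per value). It only compares how many rows each strategy has to carry, not timings.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (a TEXT, b TEXT)")

# Hypothetical data: unique a, and 10 distinct b values with 100 rows each.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(str(i), f"{i % 10} some more text here") for i in range(1000)],
)

# Rows produced by the plain self-join, before any GROUP BY or DISTINCT.
join_rows = conn.execute(
    "SELECT count(*) FROM tbl t JOIN tbl t2 ON t2.b = t.b AND t2.a <> t.a"
).fetchone()[0]

# Rows produced by the EXISTS version: at most one per row of tbl.
exists_rows = conn.execute(
    """
    SELECT count(*)
    FROM   tbl t
    WHERE  EXISTS (SELECT 1 FROM tbl t2 WHERE t2.a <> t.a AND t2.b = t.b)
    """
).fetchone()[0]

print(join_rows, exists_rows)
```

Each of the 1000 rows joins to the 99 other rows sharing its b value, so the join carries 99 times as many rows as the EXISTS version before they are collapsed again.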
+1

You need a HAVING clause:

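    -- SELECT * with GROUP BY address is rejected by PostgreSQL, so filter through a subquery
    SELECT * FROM accounts
    WHERE address IN (SELECT address FROM accounts GROUP BY address HAVING COUNT(*) > 1);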
0

I believe you are looking for a HAVING clause:

    -- lists each shared address with its number of accounts
    select address, count(accountnumber)
    from accounts
    group by address
    having count(accountnumber) > 1
0
