Quickly remove duplicates from a large MySQL database

I have a large (> 1M rows) MySQL database corrupted by duplicates. I estimate duplicates make up 1/4 to 1/2 of the total db. I need to get rid of them quickly (I mean the query execution time). Here's what it looks like:
id (index) | text1 | text2 | text3
text1 and text2 must be unique as a pair; if there are any duplicates, only one combination should remain, with a NOT NULL text3 where possible. Example:

 1 | abc | def | NULL
 2 | abc | def | ghi
 3 | abc | def | jkl
 4 | aaa | bbb | NULL
 5 | aaa | bbb | NULL

... becomes:

 1 | abc | def | ghi    # (doesn't really matter whether id 2 or id 3 survives)
 2 | aaa | bbb | NULL   # (if there is no NOT NULL text3, NULL will do)

The new ids can be anything; they do not depend on the old table's ids.
I tried things like:

 CREATE TABLE tmp SELECT text1, text2, text3
   FROM my_tbl
  GROUP BY text1, text2;

 DROP TABLE my_tbl;
 ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, on mine the execution time is simply huge (it never actually finished; > 20 minutes).

Is there a faster way to do this? Please help me solve this problem.

+69
sql mysql duplicates
Oct 30 '09 at 20:01
9 answers

I believe this will do it using ON DUPLICATE KEY UPDATE plus IFNULL():

 create table tmp like yourtable;

 alter table tmp add unique (text1, text2);

 insert into tmp select * from yourtable
     on duplicate key update text3 = ifnull(text3, values(text3));

 rename table yourtable to deleteme, tmp to yourtable;

 drop table deleteme;

It should be much faster than anything that requires GROUP BY, DISTINCT, a subquery, or even ORDER BY. It does not even require a filesort, which would kill performance on a large temporary table. A full scan of the source table is still required, but that cannot be avoided.
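The keep-the-non-NULL-text3 upsert logic can be sketched in Python with SQLite's analogous `ON CONFLICT ... DO UPDATE` clause (a minimal sketch, not MySQL syntax: MySQL's `VALUES(text3)` becomes SQLite's `excluded.text3`; table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source table with duplicates, as in the question.
cur.execute("CREATE TABLE my_tbl (id INTEGER, text1 TEXT, text2 TEXT, text3 TEXT)")
cur.executemany("INSERT INTO my_tbl VALUES (?, ?, ?, ?)", [
    (1, "abc", "def", None),
    (2, "abc", "def", "ghi"),
    (3, "abc", "def", "jkl"),
    (4, "aaa", "bbb", None),
    (5, "aaa", "bbb", None),
])

# Deduplicated copy: unique key on (text1, text2); on conflict, keep
# whichever text3 is NOT NULL (the first non-NULL value wins).
# Requires SQLite >= 3.24 for UPSERT support.
cur.execute("CREATE TABLE tmp (text1 TEXT, text2 TEXT, text3 TEXT, UNIQUE (text1, text2))")
cur.execute("""
    INSERT INTO tmp
    SELECT text1, text2, text3 FROM my_tbl WHERE true
        ON CONFLICT (text1, text2)
        DO UPDATE SET text3 = ifnull(text3, excluded.text3)
""")

# One row per (text1, text2) pair; abc/def keeps a non-NULL text3.
print(cur.execute("SELECT * FROM tmp ORDER BY text1").fetchall())
```

As in the MySQL version, no GROUP BY or DISTINCT is needed; the unique key does the deduplication as rows arrive.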

+146
Oct 30 '09 at 21:26

Found this simple one-liner that does what I need:

 ALTER IGNORE TABLE dupTest ADD UNIQUE INDEX(a,b); 

Taken from: http://mediakey.dk/~cc/mysql-remove-duplicate-entries/

+95
 DELETE FROM dups
  WHERE id NOT IN (
     SELECT id FROM (
         SELECT DISTINCT id, text1, text2
           FROM dups
          GROUP BY text1, text2
          ORDER BY text3 DESC
     ) AS tmp
 );

This queries all the records, groups by the distinguishing fields, and orders by text3 DESC (meaning we pick the first non-NULL text3 record per group). Then we select the ids from that result (these are the good ids; they will not be deleted) and delete all the ids that are NOT in that set.

Any such query touching the entire table will be slow. You just need to run it once and let it finish, so that you can prevent duplicates in the future.

After you have made this "fix", I would apply a UNIQUE INDEX (text1, text2) to that table to prevent the possibility of duplicates in the future.

If you want to go the "create a new table and replace the old one" route, you can use the inner SELECT statement itself to build the INSERT statement.

MySQL-specific (it assumes the new table is named my_tbl2 and has exactly the same structure):

 INSERT INTO my_tbl2
 SELECT DISTINCT id, text1, text2, text3
   FROM dups
  GROUP BY text1, text2
  ORDER BY text3 DESC;

See MySQL INSERT ... SELECT for more details.
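An ORDER BY inside a grouped subquery does not reliably control which id GROUP BY returns, so a deterministic variant of the same keep-list idea is to pick each group's survivor explicitly, e.g. the lowest id with a non-NULL text3. A sketch in Python with SQLite, assuming the table layout from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dups (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT)")
cur.executemany("INSERT INTO dups VALUES (?, ?, ?, ?)", [
    (1, "abc", "def", None),
    (2, "abc", "def", "ghi"),
    (3, "abc", "def", "jkl"),
    (4, "aaa", "bbb", None),
    (5, "aaa", "bbb", None),
])

# Keep-list: one id per (text1, text2), preferring the lowest id whose
# text3 is NOT NULL, falling back to the lowest id overall.  COALESCE
# guarantees the keep-list never contains NULL, which would make NOT IN
# match nothing.
cur.execute("""
    DELETE FROM dups
     WHERE id NOT IN (
        SELECT COALESCE(MIN(CASE WHEN text3 IS NOT NULL THEN id END), MIN(id))
          FROM dups
         GROUP BY text1, text2
    )
""")
print(cur.execute("SELECT * FROM dups ORDER BY id").fetchall())
```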

+12
Oct 30 '09 at 20:15

Remove duplicates without breaking foreign keys (the surviving rows keep their original ids):

 create table tmp like mytable;

 ALTER TABLE tmp ADD UNIQUE INDEX (text1, text2, text3, text4, text5, text6);

 insert IGNORE into tmp select * from mytable;

 delete from mytable where id not in (select id from tmp);
+8
Jun 10 '13 at 16:06

If you can create a new table, do it with a unique key on the text1 + text2 fields. Then insert into it, ignoring errors (using the INSERT IGNORE syntax):

 insert ignore into new_tbl
 select * from my_tbl order by text3 desc;
  • I think ORDER BY text3 DESC will put NULLs last, but double-check this.

Indexes on all of those columns could help a lot, but creating them now could be quite slow.
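A runnable sketch of this approach in Python with SQLite, where `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`; the extra `text3 IS NULL` sort key makes NULLs lose explicitly instead of relying on DESC ordering, and `new_tbl` is an assumed name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE my_tbl (id INTEGER, text1 TEXT, text2 TEXT, text3 TEXT)")
cur.executemany("INSERT INTO my_tbl VALUES (?, ?, ?, ?)", [
    (1, "abc", "def", None),
    (2, "abc", "def", "ghi"),
    (3, "abc", "def", "jkl"),
    (4, "aaa", "bbb", None),
    (5, "aaa", "bbb", None),
])

# New table with the unique key; the first row inserted for each
# (text1, text2) pair wins, and the ORDER BY feeds rows with a
# non-NULL text3 first, so they are the survivors.
cur.execute("CREATE TABLE new_tbl (text1 TEXT, text2 TEXT, text3 TEXT, UNIQUE (text1, text2))")
cur.execute("""
    INSERT OR IGNORE INTO new_tbl
    SELECT text1, text2, text3
      FROM my_tbl
     ORDER BY text3 IS NULL, text3 DESC
""")
print(cur.execute("SELECT * FROM new_tbl ORDER BY text1").fetchall())
```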

+3
Oct 30 '09 at 20:08

For large tables with relatively few duplicates, you may want to avoid copying the whole table to another place. One way is to create a temporary table containing the rows you want to keep (one per key that has duplicates), and then delete the duplicates from the original table.

An example is given here.

+1
Aug 14 '13 at 23:57

I do not have much experience with MySQL. If it has analytic functions, try:

 delete from my_tbl
  where id in (
      select id
        from (select id,
                     row_number()
                         over (partition by text1, text2 order by text3 desc) as rn
                from my_tbl
                /* optional: where text1 like 'a%' */
             ) as t2
       where rn > 1
      );

The optional WHERE clause lets you run it in several passes, one per leading letter, etc. Maybe create an index on text1?

Before you run this, verify that "text3 desc" will sort the NULLs last in MySQL.
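MySQL has supported window functions since 8.0, so this now works there too. The same idea can be sketched in Python with SQLite (window functions require SQLite >= 3.25), with an explicit `text3 IS NULL` sort key so NULL rows always rank below non-NULL ones regardless of the engine's NULL ordering:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE my_tbl (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT)")
cur.executemany("INSERT INTO my_tbl VALUES (?, ?, ?, ?)", [
    (1, "abc", "def", None),
    (2, "abc", "def", "ghi"),
    (3, "abc", "def", "jkl"),
    (4, "aaa", "bbb", None),
    (5, "aaa", "bbb", None),
])

# Rank rows inside each (text1, text2) group, non-NULL text3 first,
# then delete everything except rank 1.
cur.execute("""
    DELETE FROM my_tbl
     WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY text1, text2
                                          ORDER BY text3 IS NULL, text3 DESC) AS rn
              FROM my_tbl
        )
         WHERE rn > 1
    )
""")
print(cur.execute("SELECT text1, text2, text3 FROM my_tbl ORDER BY text1").fetchall())
```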

0
Oct 30 '09 at 20:59

I know this is an old thread, but I have a somewhat dirty method that is much faster and more configurable; in terms of speed I would say 10 seconds instead of 100 seconds (10:1).

My method uses all the dirty stuff you were trying to avoid:

  • GROUP BY (and HAVING)
  • GROUP_CONCAT with ORDER BY
  • 2 temporary tables
  • using files on disk!
  • somehow (php?) deleting the file afterwards

But when you are talking about MILLIONS (or, in my case, tens of millions) of rows, it is worth it.

Anyway, here is my example:

EDIT : if I get comments, I will explain further how it works :)

 START TRANSACTION;

 DROP temporary table if exists to_delete;

 CREATE temporary table to_delete as (
     SELECT
         -- pick all the duplicate ids except the ones that stay in the DB;
         -- which id stays is decided by "ORDER BY campos_ordenacao DESC": the first one is kept
         right(
             group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','),
             length(group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','))
               - locate(',', group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','))
         ) as ids,
         count(*) as c
     -- table to deduplicate
     FROM teste_dup
     -- columns used to identify duplicates
     group by test_campo1, test_campo2, teste_campoN
     having count(*) > 1 -- it is a duplicate
 );

 -- raise this system variable's limit to the max
 SET SESSION group_concat_max_len=4294967295;

 -- write all the ids to delete to a file
 select group_concat(ids SEPARATOR ',') from to_delete INTO OUTFILE 'sql.dat';

 DROP temporary table if exists del3;
 create temporary table del3 as (select CAST(1 as signed) as ix LIMIT 0);

 -- load the ids to delete from the file into a temporary table
 load data infile 'sql.dat' INTO TABLE del3 LINES TERMINATED BY ',';

 alter table del3 add index(ix);

 -- delete the selected ids
 DELETE teste_dup -- table
   from teste_dup -- table
   join del3 on id=ix;

 COMMIT;
0
Jul 16 '14 at 18:40

You can delete all duplicate entries with this simple query: a self-join that, for each duplicated customer_invoice_id, deletes every row except the one with the highest id.

  DELETE i1
    FROM my_table i1
    JOIN my_table i2
      ON i1.customer_invoice_id = i2.customer_invoice_id
     AND i1.id < i2.id;
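SQLite has no multi-table DELETE, but the same keep-the-highest-id rule can be expressed with a correlated EXISTS; a small sketch (table name and sample data are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_invoice_id TEXT)")
cur.executemany("INSERT INTO invoices VALUES (?, ?)", [
    (1, "A"), (2, "A"), (3, "B"),
])

# Delete every row for which a duplicate with a higher id exists,
# i.e. keep only the highest-id row per customer_invoice_id.
cur.execute("""
    DELETE FROM invoices
     WHERE EXISTS (
        SELECT 1 FROM invoices i2
         WHERE i2.customer_invoice_id = invoices.customer_invoice_id
           AND i2.id > invoices.id
    )
""")
print(cur.execute("SELECT * FROM invoices ORDER BY id").fetchall())
```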
0
Apr 02 '18 at 12:29


