How to efficiently find duplicate blob rows in MySQL?

I have a table of the form:

CREATE TABLE data
(
   pk INT PRIMARY KEY AUTO_INCREMENT,
   dt BLOB
);

It contains about 160,000 rows and about 2 GB of data in the blob column (about 14 kB per blob on average). Another table has foreign keys into this table.

Something like 3,000 of the blobs are identical. What I want is a query that produces a remap table (duplicate pk to the pk that should be kept), so that I can remove the duplicates.
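For illustration, here is a minimal sketch of how such a remap table might be used once it exists. The table name remap, its columns old_pk and new_pk, and the referencing table refs with column data_fk are all hypothetical; the post does not name the table that holds the foreign keys.

```sql
-- Hypothetical remap table: old_pk is a duplicate row, new_pk is the row to keep.
CREATE TABLE remap
(
   old_pk INT PRIMARY KEY,
   new_pk INT NOT NULL
);

-- Repoint the referencing table (hypothetical name refs, column data_fk)
-- from each duplicate to the row that is kept.
UPDATE refs
  JOIN remap ON refs.data_fk = remap.old_pk
   SET refs.data_fk = remap.new_pk;

-- Remove the duplicates, which are no longer referenced.
DELETE data
  FROM data
  JOIN remap ON data.pk = remap.old_pk;
```

The result of a duplicate-finding query could be loaded into remap with INSERT ... SELECT before running the two statements.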

The naive approach took about an hour on 30-40 thousand rows:

SELECT a.pk, MIN(b.pk) 
    FROM data AS a 
    JOIN data AS b
  ON a.dt=b.dt
  WHERE b.pk < a.pk
  GROUP BY a.pk;

For other reasons, I already have a table holding the blob sizes:

CREATE TABLE sizes
(
   fk INT,  -- note: non-unique
   sz INT
   -- other cols
);

With indexes on both fk and sz, a direct query using this table takes about 24 seconds on 50k rows:

SELECT da.pk,MIN(db.pk) 
  FROM data AS da
  JOIN data AS db
  JOIN sizes AS sa
  JOIN sizes AS sb
  ON
        sa.sz=sb.sz
    AND da.pk=sa.fk
    AND db.pk=sb.fk
  WHERE
        sb.fk<sa.fk
    AND da.dt=db.dt 
  GROUP BY da.pk;



Xgc #mysql@irc.freenode.net , , , fk, . , , .


You could hash each blob (MD5 or SHA1) and then compare the hashes instead of the blobs themselves.
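A minimal sketch of the hashing idea, assuming an extra column can be added to data. The column name dt_md5 and the index name idx_dt_md5 are hypothetical; MD5() and UNHEX() are built-in MySQL functions. The final comparison on dt is kept so that a hash collision cannot produce a false match.

```sql
-- Hypothetical column holding the 16-byte MD5 digest of each blob.
ALTER TABLE data ADD COLUMN dt_md5 BINARY(16);

-- MD5() returns 32 hex characters; UNHEX() packs them into 16 bytes.
UPDATE data SET dt_md5 = UNHEX(MD5(dt));

-- Index the digest so the self-join can use index lookups
-- instead of comparing 14 kB blobs pairwise.
CREATE INDEX idx_dt_md5 ON data (dt_md5);

-- Remap query: each duplicate pk paired with the lowest pk holding the same blob.
SELECT a.pk, MIN(b.pk) AS keep_pk
  FROM data AS a
  JOIN data AS b
    ON a.dt_md5 = b.dt_md5
   AND b.pk < a.pk
 WHERE a.dt = b.dt      -- guard against hash collisions
 GROUP BY a.pk;
```

The one-time UPDATE still has to read every blob, but after that only rows whose digests match are compared in full.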

