I have a table of the form:
CREATE TABLE data
(
pk INT PRIMARY KEY AUTO_INCREMENT,
dt BLOB
);
It contains about 160,000 rows and about 2 GB of data in the blob column (about 14 kB per blob on average). Another table has foreign keys into this table.
About 3,000 of the blobs are identical. So what I want is a query that gives me a remap table that will allow me to remove the duplicates.
The naive approach took about an hour on 30-40 thousand rows:
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b
ON a.dt=b.dt
WHERE b.pk < a.pk
GROUP BY a.pk;
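To make the goal concrete, here is a minimal sketch of what the remap table looks like on toy data and how it could be applied. It assumes SQLite's in-memory engine as a stand-in for MySQL, and the `refs` table with its `data_fk` column is a hypothetical placeholder for the real referencing table, which is not shown here.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (pk INTEGER PRIMARY KEY AUTOINCREMENT, dt BLOB);
    -- Hypothetical referencing table; the real one is not shown in the post.
    CREATE TABLE refs (id INTEGER PRIMARY KEY, data_fk INT);
""")

# Six toy blobs get pks 1..6; pks 3 and 6 duplicate pk 1, pk 5 duplicates pk 2.
blobs = [b"aaa", b"bbb", b"aaa", b"ccc", b"bbb", b"aaa"]
conn.executemany("INSERT INTO data (dt) VALUES (?)", [(b,) for b in blobs])
conn.executemany("INSERT INTO refs (data_fk) VALUES (?)",
                 [(pk,) for pk in range(1, 7)])

# The naive self-join: each duplicate pk maps to the lowest pk with the same blob.
remap = conn.execute("""
    SELECT a.pk, MIN(b.pk)
    FROM data AS a
    JOIN data AS b ON a.dt = b.dt
    WHERE b.pk < a.pk
    GROUP BY a.pk
    ORDER BY a.pk
""").fetchall()
print(remap)  # [(3, 1), (5, 2), (6, 1)]

# Apply the remap: repoint the references first, then delete the duplicates.
conn.executemany("UPDATE refs SET data_fk = ? WHERE data_fk = ?",
                 [(keep, dup) for dup, keep in remap])
conn.executemany("DELETE FROM data WHERE pk = ?", [(dup,) for dup, _ in remap])

remaining = [row[0] for row in conn.execute("SELECT pk FROM data ORDER BY pk")]
print(remaining)  # [1, 2, 4]
```

Rows that are not duplicates (or are the lowest pk of their group) do not appear in the remap at all, so only the 3,000 or so duplicate rows would need to be touched.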
For other reasons, I happen to have a table with the blob sizes:
CREATE TABLE sizes
(
fk INT, -- note: non-unique
sz INT
-- other cols
);
With indexes on both fk and sz, the direct query using this table takes about 24 seconds on 50k rows:
SELECT da.pk,MIN(db.pk)
FROM data AS da
JOIN data AS db
JOIN sizes AS sa
JOIN sizes AS sb
ON
sa.sz=sb.sz
AND da.pk=sa.fk
AND db.pk=sb.fk
WHERE
sb.fk<sa.fk
AND da.dt=db.dt
GROUP BY da.pk;
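As a sanity check on the size-prefiltered query, the following sketch runs it on toy data and compares its result with the naive self-join. It again assumes SQLite in memory as a stand-in for MySQL; the index names are made up, and the joins are written with explicit ON clauses, which should be logically equivalent to the form above. Timings on toy data say nothing about the real 2 GB table.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (pk INTEGER PRIMARY KEY AUTOINCREMENT, dt BLOB);
    CREATE TABLE sizes (fk INT, sz INT);      -- fk is non-unique, as in the post
    CREATE INDEX sizes_fk ON sizes (fk);      -- hypothetical index names
    CREATE INDEX sizes_sz ON sizes (sz);
""")

blobs = [b"aaa", b"bb", b"aaa", b"ccc", b"bb", b"aaa"]  # pks 1..6
conn.executemany("INSERT INTO data (dt) VALUES (?)", [(b,) for b in blobs])

# Fill sizes from the blobs; pk 1 is inserted twice to mimic the non-unique fk.
conn.execute("INSERT INTO sizes (fk, sz) SELECT pk, LENGTH(dt) FROM data")
conn.execute("INSERT INTO sizes (fk, sz) SELECT pk, LENGTH(dt) FROM data WHERE pk = 1")

# Only blobs of equal size are compared byte for byte.
by_size = conn.execute("""
    SELECT da.pk, MIN(db.pk)
    FROM sizes AS sa
    JOIN sizes AS sb ON sb.sz = sa.sz AND sb.fk < sa.fk
    JOIN data AS da ON da.pk = sa.fk
    JOIN data AS db ON db.pk = sb.fk
    WHERE da.dt = db.dt
    GROUP BY da.pk
    ORDER BY da.pk
""").fetchall()

# The naive self-join, for comparison.
naive = conn.execute("""
    SELECT a.pk, MIN(b.pk)
    FROM data AS a
    JOIN data AS b ON a.dt = b.dt
    WHERE b.pk < a.pk
    GROUP BY a.pk
    ORDER BY a.pk
""").fetchall()

print(by_size)           # [(3, 1), (5, 2), (6, 1)]
print(by_size == naive)  # True
```

The repeated fk row produces extra join rows, but GROUP BY with MIN collapses them, so the non-unique fk does not change the result.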
da ( ). , , , . , 3- 5- , , 3 .
, : ? , ?
: , , , , vs ?
Xgc #mysql@irc.freenode.net , , , fk, . , , .