MySQL Duplicate Rows

I have a table with some repeating information: Id, Name, Lastname, Birth, PersonalKey, Personal Info, Direction, Source.

Where the source tells me where the information came from.

Repeated information has a unique identifier, and I need to delete the deceived information. But , I have a priority over some sources of information, which I must be the one who remains, and the other is deleted.

Another thing is that another Information History has some information that I want to stay, so I do not need to replenish PersonalKey for those who stay and erase duplicates.

Table named Pruebas

---Id, Name, Firstname, Lastname, Birth, RFC, Source, PersonalKey--- ---2,Juan,Garcia,Escobeddo,1983-08-04,GAED87393, DRV484930, 34233-- ---3,Juan,Garcia,Escobedo,1987-08-04,GAED87393, FIN484930, -- ---4,Juan,Garcia,Escobedo,1987-08-04,GAED87393, SA484930, -- 

How do you see:

  • Identifiers are unique
  • Repeated first name, first name and last name
  • id 2 is PersonalKey, but 3 and 4 are not
    • I want the one that had the source of FIN% to remain and the others to be erased, but first I need to make sure that the remaining line gets the value of PersonalKey (IOW, I don't want to lose the value of PersonalKey).

Thanks in advance.

+4
source share
4 answers

I would run the cursor (with the MySQL SP, Java, Python, .NET programming language) at this request:

 select Name, Firstname, Lastname, count(1) from Pruebas group by Name, Firstname, Lastname having count(1) > 1 

Then, in the returned lines from the cursor, just do whatever you need: check the FIN% instance, check for the presence of PersonalKey, and update accordingly.

For each line of the cursor, you can open another cursor with:

 select * from Pruebas where Name = the_Name and Firstname = the_Firstname and Lastname = the_Lastname 

And now you will have an internal cursor with all the lines that you change. If this is the one you need, save it and update it with the specified KEY value. Otherwise, delete it.

In Oracle, you can accomplish what you want in a single query, but I don’t think you will get the same level of performance as with this approach.

Hope this helps.

+3
source

The simplest solution I can think of is to copy PersonalKey to other duplicate lines and then delete all lines that don't match 'FIN%' .

 UPDATE Pruebas p1 JOIN Pruebas p2 ON (SOUNDEX(CONCAT(p1.Name, p2.Firstname, p3.Lastname)) = SOUNDEX(CONCAT(p2.Name, p2.Firstname, p2.Lastname))) SET p1.PersonalKey = p2.PersonalKey WHERE p2.PersonalKey IS NOT NULL; DELETE FROM Pruebas WHERE Source NOT LIKE 'FIN%'; 

I am showing an example match expression for a join using SOUNDEX() .


I see from other comments that you left that you have many variations and uncertainties. In this case, there is no way to automate cleaning and deduplication - or at least automatic cleaning will be a more complicated and complicated task than just doing it manually.


Repeat your comment that the request takes many hours: yes, it really is not expected to be effective. The JOIN expression cannot be matched, i.e. It cannot use an index. You can make it more efficient by adding an extra column to physically store the SOUNDEX() value of the name, name, last name. Then create an index in this column.

But SOUNDEX() does not guarantee that all possible spelling errors will be found. You are faced with the task of data cleaning, which cannot be fully automated. Any data cleaning solution requires manual work.

+3
source

I would do something like this:

CREATE TABLE Pruebas_new
SELECT * FROM Pruebas
GROUP BY first name, last name
with a source like "FIN%";

It could be rewritten using a temporary table and overwriting what is in the original table, if you need it to be faster, but which gets the data you need in the simplest way.

+2
source

Sorry for the delay in reply. I have been a little busy the last couple of days.

Below my answer is based on the following assumptions:

1) You will fix problems with the spelling of the name using some other mechanism (you mentioned that you clean it with regular expressions in the comments to the original question).

2) The DUP set can be identified using First Name, Lastname and Birth (you mentioned this in a comment on the original question).

3) Name, Surname and Birth cannot be NULL.

4) You cannot have more than one FIN entry in a DUP set (you mentioned this in a comment on the original question).

If any of the above assumptions is invalid, my answer should be changed.

The following steps are listed below:

1) Update all FIN records to copy the PersonalKey file from a record other than FIN:

  UPDATE Pruebas p1 INNER JOIN Pruebas p2 ON p1.Firstname = p2.Firstname AND p1.Lastname = p2.Lastname AND p1.Birth = p2.Birth SET p1.PersonalKey = p2.PersonalKey WHERE p1.Source like 'FIN%' AND p1.PersonalKey is null AND p2.PersonalKey is not null; 

2) Delete all entries other than FIN, where we have the FIN entry:

  DELETE p2 FROM Pruebas p1 INNER JOIN Pruebas p2 ON p1.Firstname = p2.Firstname AND p1.Lastname = p2.Lastname AND p1.Birth = p2.Birth WHERE p1.Source like 'FIN%' AND p2.Source not like 'FIN%'; 

At this point, all DUPs with the FIN entry have been cleared, so that only the FIN entry will remain.

3) If we decide to stay with a DRV record for all other cases. We will need to copy PersonalKey from another record to a DRV record:

  UPDATE Pruebas p1 INNER JOIN Pruebas p2 ON p1.Firstname = p2.Firstname AND p1.Lastname = p2.Lastname AND p1.Birth = p2.Birth SET p1.PersonalKey = p2.PersonalKey WHERE p1.Source like 'DRV%' AND p1.PersonalKey is null AND p2.PersonalKey is not null; 

4) Delete all entries other than DRV, where we have a DRV entry:

  DELETE p2 FROM Pruebas p1 INNER JOIN Pruebas p2 ON p1.Firstname = p2.Firstname AND p1.Lastname = p2.Lastname AND p1.Birth = p2.Birth WHERE p1.Source like 'DRV%' AND p2.Source not like 'DRV%'; 

At this point, all the DUPs with the DRV record have been cleared, so that only the DRV record remains.

If the only record type is SA, then more DUP is left, and we are done.

5) If we want to select the record with the most information filled out or if we have finished 3 and 4, and there are more than one type of records that cause DUP. We need to copy the PersonalKey from any record in the DUP collection that has it for any record that does not have it for all records other than FIN:

  UPDATE Pruebas p1 INNER JOIN Pruebas p2 ON p1.Firstname = p2.Firstname AND p1.Lastname = p2.Lastname AND p1.Birth = p2.Birth SET p1.PersonalKey = p2.PersonalKey WHERE p1.Source not like 'FIN%' AND p1.PersonalKey is null AND p2.PersonalKey is not null; 

6) Delete all entries except the one that has the most information (as determined by the calculated info_score column):

  DELETE p5 FROM Pruebas p5 INNER JOIN (SELECT p3.Firstname , p3.Lastname , p3.Birth , MIN(p3.Id) AS min_id FROM Pruebas p3 INNER JOIN (SELECT p1.Firstname , p1.Lastname , p1.Birth , count(*) AS c , MAX((p1.Name is not null) + (p1.RFC is not null) + (p1.Source is not null) + (p1.PersonalKey is not null)) AS info_score FROM Pruebas p1 GROUP BY p1.Firstname , p1.Lastname , p1.Birth HAVING count(*) > 1) p2 ON p3.Firstname = p2.Firstname AND p3.Lastname = p2.Lastname AND p3.Birth = p2.Birth AND ((p3.Name is not null) + (p3.RFC is not null) + (p3.Source is not null) + (p3.PersonalKey is not null)) = p2.info_score GROUP BY p3.Firstname , p3.Lastname , p3.Birth) p4 ON p4.Firstname = p5.Firstname AND p4.Lastname = p5.Lastname AND p4.Birth = p5.Birth AND p4.min_id <> p5.Id; 

At this point, all DUPs were minimized while retaining PersonalKey if it was available, and FIN records were saved if they existed differently, or a DRV record or the record with the most information was saved.

Let me know if you have any questions with any of the above questions.

Hope this helps,

-Dipin

+1
source

All Articles