How to delete duplicates faster?

I have a DB table with 2.5 billion records. About 11 million of them are duplicates. What is the fastest way to delete these 11 million records?

+6
performance database oracle plsql
5 answers

Deleting one duplicate out of many is tricky business, and with that many records you have a problem.

One option is to turn the task on its head and copy the records you want to keep into a new table. You can use CREATE TABLE AS SELECT DISTINCT ... NOLOGGING syntax, which will copy your de-duplicated records without using the transaction log, which is much faster. Once your new table is filled, delete/rename the old one and rename the new one into its place.

See http://www.databasejournal.com/features/oracle/article.php/3631361/Managing-Tables-Logging-versus-Nologging.htm

Oh, and don't forget to put a UNIQUE index on the new table so that it cannot happen again.
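A minimal sketch of that approach, assuming whole rows are duplicated and using hypothetical names (big_table for the source table, dup_field for the deduplication key); note that grants, triggers, and other dependent objects would need recreating too:

 -- Copy the de-duplicated rows without generating redo (hypothetical names).
 CREATE TABLE big_table_dedup NOLOGGING
 AS SELECT DISTINCT * FROM big_table;

 -- Make sure the duplicates cannot come back.
 CREATE UNIQUE INDEX big_table_dedup_ux ON big_table_dedup (dup_field);

 -- Swap the new table into place.
 DROP TABLE big_table;
 ALTER TABLE big_table_dedup RENAME TO big_table;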

The moral of this story... never use DELETE to get rid of large numbers of records; it is terribly slow, because it has to record all the deleted rows in the redo and undo logs. Either copy and switch, or TRUNCATE.

+20
 DELETE FROM mytable
  WHERE rowid IN
        ( SELECT rid
            FROM ( SELECT rowid AS rid,
                          -- number the rows within each set of duplicates
                          ROW_NUMBER() OVER (PARTITION BY dup_field ORDER BY dup_field) rn
                     FROM mytable )
           WHERE rn > 1 )

or perhaps even this:

 DELETE FROM mytable mo
  WHERE EXISTS
        ( SELECT NULL
            FROM mytable mi
           WHERE mi.dup_field = mo.dup_field
             AND mi.rowid > mo.rowid )  -- keeps the duplicate with the highest rowid

Both of these queries will use a quite efficient HASH SEMI JOIN, the latter being faster if there is no index on dup_field.
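Before running either statement against 2.5 billion rows, it is worth confirming the optimizer really picks the semi-join; a quick check with standard Oracle tooling, using the same hypothetical table and column names as above:

 EXPLAIN PLAN FOR
 DELETE FROM mytable mo
  WHERE EXISTS ( SELECT NULL
                   FROM mytable mi
                  WHERE mi.dup_field = mo.dup_field
                    AND mi.rowid > mo.rowid );

 -- Look for a HASH JOIN SEMI step in the output.
 SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);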

You may be tempted to copy the rows instead, but note that much more REDO and UNDO will be generated when copying 2.5 billion rows than when deleting 11 million.

+3

Whether it is faster to delete the existing rows or to create a new table and drop the old one depends on a lot of factors. 11 million rows is a lot, but it is only 0.5% of the total number of rows in the table. It is quite possible that the recreate-and-drop could be much slower than the delete, depending on how many indexes exist on the source table, and also on where the rows to be deleted sit in the data pages.

The question then becomes whether the source table is live or not. If inserts and updates are happening while this cleanup is going on, the copy-and-drop approach will not work without a fair amount of extra code to synchronize the table after the fact.

Finally, why does this operation need to be "fast"? Is it because the system has to be down while the process runs? You could write a procedure that deletes the duplicates while the system is live, without disrupting the rest of the system. We have solved this problem in the past by first writing a query that collects the primary keys of the rows to be removed into a second table, like this:

 INSERT INTO RowsToDeleteTable
   SELECT PKColumn
     FROM SourceTable
    WHERE <conditions used to find rows to remove>;

 CREATE UNIQUE INDEX PK_RowsToDelete ON RowsToDeleteTable (PKColumn);

Then we have a PL/SQL block that either loops over the rows in a cursor, like this:

 BEGIN
   FOR theRow IN (SELECT PKColumn FROM RowsToDeleteTable ORDER BY 1)
   LOOP
     DELETE FROM SourceTable WHERE PKColumn = theRow.PKColumn;
     -- optionally wait a bit here, e.g. DBMS_LOCK.SLEEP(0.5)
     COMMIT;
   END LOOP;
 END;

or does something like this:

 DECLARE
   thePK RowsToDeleteTable.PKColumn%TYPE;
 BEGIN
   LOOP
     SELECT MIN(PKColumn) INTO thePK FROM RowsToDeleteTable;
     EXIT WHEN thePK IS NULL;  -- nothing left to delete
     DELETE FROM SourceTable WHERE PKColumn = thePK;
     -- optionally wait a bit here, e.g. DBMS_LOCK.SLEEP(0.5)
     DELETE FROM RowsToDeleteTable WHERE PKColumn = thePK;
     COMMIT;
   END LOOP;
 END;

The loop with SELECT MIN is obviously less efficient, but it has the advantage of letting you monitor the progress of the delete operation. We put some wait logic in the loop so we could control how aggressively the delete runs.

The initial creation of RowsToDeleteTable is very fast, and you have the advantage of being able to let the process take as long as you want. In a case like this, the "holes" left in your extents by the deletes will not be too bad, since you are deleting such a small percentage of the total data.

+2

First, create an index on the column or columns that define and contain the duplicate values.

Then, assuming the table has a primary key (PK):

 DELETE FROM MyTable T
  WHERE PK <> ( SELECT MIN(PK)
                  FROM MyTable
                 WHERE ColA = T.ColA
                   AND ColB = T.ColB )  -- ...and so on for each column in the set defined above

NOTE: you could just as easily use MAX(PK); all you are doing is identifying a single record to keep out of each set of duplicates.

EDIT: To avoid heavy use of the redo log and UNDO segments, you could store the duplicate values in a temp table, and then delete the duplicates for each value in a transaction of its own...

Assuming that just one column (call it ColA, a number) defines the duplicates...

 CREATE TABLE Dupes (ColA NUMBER);

 INSERT INTO Dupes (ColA)
   SELECT ColA
     FROM MyTable
    GROUP BY ColA
   HAVING COUNT(*) > 1;

 DECLARE
   ColAValue NUMBER;
 BEGIN
   LOOP
     SELECT MAX(ColA) INTO ColAValue FROM Dupes;
     EXIT WHEN ColAValue IS NULL;  -- no duplicates left
     DELETE FROM MyTable
      WHERE ColA = ColAValue
        AND PK <> (SELECT MIN(PK) FROM MyTable WHERE ColA = ColAValue);
     DELETE FROM Dupes WHERE ColA = ColAValue;
     COMMIT;  -- one transaction per duplicate value
   END LOOP;
 END;

Not tested, so the syntax may need massaging...

+1
source share

If you are sure you will not break data integrity (referential integrity), disable the constraints (and the indexes and other constraints), perform the delete, then re-enable the constraints. You should test first to make sure that rebuilding the indexes when they are re-enabled takes less time than deleting with them in place.
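A minimal sketch of that approach, with hypothetical object names (big_table, big_table_fk, big_table_ix); try it on a test copy first:

 -- Disable referential integrity and take secondary indexes offline.
 ALTER TABLE big_table DISABLE CONSTRAINT big_table_fk;
 ALTER INDEX big_table_ix UNUSABLE;
 ALTER SESSION SET skip_unusable_indexes = TRUE;

 -- ...run the DELETE of the 11 million duplicates here...

 -- Rebuild and re-enable once the delete is committed.
 ALTER INDEX big_table_ix REBUILD NOLOGGING;
 ALTER TABLE big_table ENABLE CONSTRAINT big_table_fk;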

Some query optimization might also help, but without knowing more details we are only talking theory.

0
