A faster way to delete matching rows?

I'm a relative newbie when it comes to databases. We are using MySQL, and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.

The goal is to delete all rows in table A that have the corresponding identifier in table B.

I am currently doing the following:

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id); 

Table a contains about 100K rows and table b about 22K rows. The "id" column is the PK of both tables.

This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB RAM, MySQL 5.0.67. That seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?




EDIT:

Some additional information that might be helpful. Tables a and b have the same structure, since I did the following to create table b:

 CREATE TABLE b LIKE a; 

Table a (and therefore table b) carries several indexes to speed up the queries that are run against it. Again, I'm a relative newbie at database work and still learning. I don't know how much effect, if any, this has on things. I assume it does have an effect, since the indexes also have to be updated, right? I was also wondering whether there are any other database settings that might affect the speed.

In addition, I am using InnoDB.




Here is some additional information you might find helpful.

Table a has a structure similar to this (I have sanitized it a bit):

    DROP TABLE IF EXISTS `frobozz`.`a`;
    CREATE TABLE `frobozz`.`a` (
      `id` bigint(20) unsigned NOT NULL auto_increment,
      `fk_g` varchar(30) NOT NULL,
      `h` int(10) unsigned default NULL,
      `i` longtext,
      `j` bigint(20) NOT NULL,
      `k` bigint(20) default NULL,
      `l` varchar(45) NOT NULL,
      `m` int(10) unsigned default NULL,
      `n` varchar(20) default NULL,
      `o` bigint(20) NOT NULL,
      `p` tinyint(1) NOT NULL,
      PRIMARY KEY USING BTREE (`id`),
      KEY `idx_l` (`l`),
      KEY `idx_h` USING BTREE (`h`),
      KEY `idx_m` USING BTREE (`m`),
      KEY `idx_fk_g` USING BTREE (`fk_g`),
      KEY `fk_g_frobozz` (`id`,`fk_g`),
      CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
    ) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

I suspect that part of the problem is the number of indexes on this table. Table b is similar to table a, though it contains only the columns id and h.

In addition, profiling results are as follows:

    starting                        0.000018
    checking query cache for query  0.000044
    checking permissions            0.000005
    Opening tables                  0.000009
    init                            0.000019
    optimizing                      0.000004
    executing                       0.000043
    end                             0.000005
    end                             0.000002
    query end                       0.000003
    freeing items                   0.000007
    logging slow query              0.000002
    cleaning up                     0.000002



SOLVED

Thanks to everyone for the answers and comments - they certainly made me think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"

The problem was a DELETE TRIGGER defined on table A which called a stored procedure to update two other tables, C and D. Table C had an FK back to a.id and, after performing some housekeeping related to that id, the stored procedure had the statement

 DELETE FROM c WHERE c.id = theId; 
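For context, the setup was roughly of this shape (a hypothetical sketch - the actual trigger and procedure definitions are not shown here, and all names are invented):

    -- Hypothetical names; only the structure described above.
    CREATE TRIGGER a_before_delete BEFORE DELETE ON a
    FOR EACH ROW
      CALL update_c_and_d(OLD.id);  -- this procedure ran the DELETE shown above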

I turned to the EXPLAIN statement, rewriting the query as

 EXPLAIN SELECT * FROM c WHERE c.other_id = 12345; 

so that I could EXPLAIN it (in MySQL 5.0, EXPLAIN only works on SELECT statements), and it gave me the following information:

    id             1
    select_type    SIMPLE
    table          c
    type           ALL
    possible_keys  NULL
    key            NULL
    key_len        NULL
    ref            NULL
    rows           2633
    Extra          using where

This told me it was an expensive operation, and since it was going to be called 22,500 times (the number of deletes for this data set), that was the problem. Once I created an INDEX on the other_id column and re-ran the EXPLAIN, I got:

    id             1
    select_type    SIMPLE
    table          c
    type           ref
    possible_keys  Index_1
    key            Index_1
    key_len        8
    ref            const
    rows           1
    Extra

Much better - in fact, great.
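For reference, creating that index is a one-liner (a sketch, using the index and column names shown above):

    CREATE INDEX Index_1 ON c (other_id);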

With that Index_1 added, my delete times are now in line with the times reported by mattkemp. It was a really subtle error on my part, caused by shoehorning in some additional functionality at the last minute. As Daniel pointed out, most of the suggested alternative DELETE/SELECT statements ended up taking essentially the same amount of time, and as soulmerge noted, the statement was pretty much the best I could construct based on what I needed to do. Once I provided the index on that other table C, my DELETEs were fast.

Postmortem:
Two lessons came out of this exercise. First, it's clear that I wasn't leveraging the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up over it; I'll learn from the error. Second, the offending code was the result of a "get it done quick" mentality, and inadequate design/testing meant the problem didn't show up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd not have wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has. Now I've got a chance to improve that.

Ref: EXPLAIN statement

+53
performance mysql sql-delete sql-execution-plan
May 01 '09 at 18:12
14 answers

Deleting data from InnoDB is the most expensive operation you can request of it. As you have already discovered, the query itself is not the problem - most of the alternatives will be optimized to the same execution plan anyway.

While it may be hard to understand why DELETEs of all things are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query were aborted halfway through, all records would still be in place as if nothing had happened. Once it completes, all of them are gone in the same instant. During the DELETE, other clients connecting to the server will still see the records until your DELETE completes.

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). Basically, it gives each connection a snapshot view of the whole database as it was when its transaction started. To achieve that, every record in InnoDB can internally carry multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that moment.

For your DELETE transaction, each record identified by your query's conditions is marked for deletion. Since other clients may be accessing the data at the same time, InnoDB cannot immediately remove the rows from the table pages, because the other clients must still see their respective snapshots to guarantee the atomicity of the deletion.

Once all records have been marked for deletion, the transaction is successfully completed. Even then, they cannot be purged from the actual data pages right away - not before all other transactions that were working with a snapshot value from before your DELETE transaction have ended as well.

So in fact your 3 minutes are not that slow, considering that all records have to be modified in order to prepare them for removal in a transactionally safe way. You will probably "hear" your hard disk working while the statement runs; this is caused by accessing all the rows. To improve performance, you can try increasing the InnoDB buffer pool size for your server and limiting other access to the database while you delete, thereby also reducing the number of historical versions InnoDB has to maintain per record. With the additional memory, InnoDB might be able to read your table (mostly) into memory and avoid some disk seek time.
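For example, the buffer pool can be enlarged in my.cnf - the figure below is only an illustration for a 2GB machine, and on MySQL 5.0 the server has to be restarted for it to take effect:

    [mysqld]
    # illustrative value - leave room for the OS and other buffers
    innodb_buffer_pool_size = 1G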

+67
May 6 '09 at 16:51

Your three-minute time seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using, that would be helpful.

I created a simple python script to produce test data and ran several different versions of the delete query against the same data set. Here are my table definitions:

    drop table if exists a;
    create table a (id bigint unsigned not null primary key, data varchar(255) not null) engine=InnoDB;
    drop table if exists b;
    create table b like a;

I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here are the results of the various delete commands. I dropped and repopulated the table between runs, by the way.

    mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
    Query OK, 22500 rows affected (1.14 sec)

    mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
    Query OK, 22500 rows affected (0.81 sec)

    mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
    Query OK, 22500 rows affected (0.97 sec)

    mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
    Query OK, 22500 rows affected (0.81 sec)

All tests were run on a 2.5GHz quad-core Intel Core2 with 2GB of RAM, Ubuntu 8.10 and MySQL 5.0. Note that the execution of a single SQL statement is still single-threaded.




Update:

I updated my tests to use the schema from the question. I modified it slightly by removing the auto-increment (I'm generating synthetic data) and the character set encoding (it wasn't working - I didn't dig into it).

Here are my new table definitions:

    drop table if exists a;
    drop table if exists b;
    drop table if exists c;
    create table c (id varchar(30) not null primary key) engine=InnoDB;
    create table a (
      id bigint(20) unsigned not null primary key,
      c_id varchar(30) not null,
      h int(10) unsigned default null,
      i longtext,
      j bigint(20) not null,
      k bigint(20) default null,
      l varchar(45) not null,
      m int(10) unsigned default null,
      n varchar(20) default null,
      o bigint(20) not null,
      p tinyint(1) not null,
      key l_idx (l),
      key h_idx (h),
      key m_idx (m),
      key c_id_idx (id, c_id),
      key c_id_fk (c_id),
      constraint c_id_fk foreign key (c_id) references c(id)
    ) engine=InnoDB row_format=dynamic;
    create table b like a;

Then I re-ran the same tests with 100k rows in a and 25k rows in b (and repopulated between runs).

    mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
    Query OK, 22500 rows affected (11.90 sec)

    mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
    Query OK, 22500 rows affected (11.48 sec)

    mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
    Query OK, 22500 rows affected (12.21 sec)

    mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
    Query OK, 22500 rows affected (12.33 sec)

As you can see, this is quite a bit slower than before, most likely due to the multiple indexes. However, it is nowhere near the three-minute mark.

Something else you might want to look at is moving the longtext field to the end of the schema. I seem to recall that MySQL performs better if all the size-restricted fields come first and text, blob, etc. come at the end.
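If you want to try that with the schema from the question, something along these lines should move the column (a sketch - untested, and note that ALTER TABLE rebuilds the whole table):

    -- i is the longtext column, p the current last column (see the schema above)
    ALTER TABLE a MODIFY COLUMN i LONGTEXT AFTER p;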

+8
May 6 '09 at 3:08

Try the following:

 DELETE a FROM a INNER JOIN b on a.id = b.id 

Using subqueries tends to be slower than joins, because the subquery is run once for every record in the outer query.

+7
May 01 '09 at 18:15

This is what I always do when I have to operate on extra-large data (here: a sample test table with 150,000 rows):

    drop table if exists employees_bak;
    create table employees_bak like employees;
    insert into employees_bak
      select * from employees where emp_no > 100000;
    rename table employees to employees_todelete;
    rename table employees_bak to employees;

In this case the SQL filters 50,000 rows into the backup table. The query cascade runs on my slow machine in 5 seconds. You can replace the INSERT INTO ... SELECT with your own filter query.

That's the trick to perform mass deletion on big databases! ;=)
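Applied to the tables from the question, the same pattern might look like this (a sketch; note that CREATE TABLE ... LIKE does not copy foreign key constraints, so any FKs would have to be re-created):

    DROP TABLE IF EXISTS a_bak;
    CREATE TABLE a_bak LIKE a;
    -- keep only the rows of a that have no counterpart in b
    INSERT INTO a_bak
      SELECT a.* FROM a LEFT JOIN b ON a.id = b.id WHERE b.id IS NULL;
    RENAME TABLE a TO a_todelete, a_bak TO a;
    DROP TABLE a_todelete;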

+4
May 07 '09 at 11:52 a.m.

You're doing your subquery on 'b' for every row in 'a'.

Try:

 DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL; 
+3
May 01 '09 at 18:17

Try the following:

 DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID 

This is much faster than regular queries.

Refer to the syntax: http://dev.mysql.com/doc/refman/5.0/en/delete.html

+3
May 01, '09 at 19:27
 DELETE FROM a WHERE id IN (SELECT id FROM b) 
+2
May 01 '09 at 18:20

You may need to rebuild the indexes before running such a huge query. Well, you should rebuild them periodically anyway.

    REPAIR TABLE a QUICK;
    REPAIR TABLE b QUICK;

and then run any of the above queries, e.g.

 DELETE FROM a WHERE id IN (SELECT id FROM b) 
+2
May 6 '09 at 10:09

The query itself is already in optimal form; updating the indexes is what makes the whole operation take that long. You could disable the keys on that table before the operation, which should speed things up. You can turn them back on at a later point, if you don't need them immediately.

Another approach would be to add a deleted flag column to your table and adjust your other queries so they take that value into account. The fastest boolean type in MySQL is CHAR(0) NULL (true = '', false = NULL). That would be a fast operation, and you can delete the flagged rows afterwards.

The same thoughts expressed in sql statements:

    ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;

    -- The following query should be faster than the delete statement:
    UPDATE a INNER JOIN b ON a.id = b.id SET a.deleted = '';

    -- This is the catch: you need to alter the rest
    -- of your queries to take the new column into account:
    SELECT * FROM a WHERE deleted IS NULL;

    -- You can then issue the following query in a cronjob
    -- to clean up the table:
    DELETE FROM a WHERE deleted IS NOT NULL;

If that, too, is not what you want, you can take a look at what the MySQL documentation has to say about the speed of DELETE statements.

+2
May 6 '09 at 10:31

I know this question has been pretty much solved thanks to the OP's indexing omission, but I would like to offer this additional advice, which is valid for a more general case of this problem.

I have personally dealt with having to delete many rows from one table that also exist in another, and in my experience it's best to do the following, especially if you expect a lot of rows to be deleted. Most importantly, this technique will improve replication lag on your slaves, because the longer each single mutator query runs, the worse the lag gets (replication is single-threaded).

So, here it is: first, do a SELECT as a separate query, remembering the ids returned in your script/application, then continue deleting in batches (say, 50,000 rows at a time); a sketch follows the list below. This achieves the following:

  • each one of the delete statements will not lock the table for too long, thus not letting replication lag get out of control. It is especially important if you rely on your replication to provide you with relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can adjust the batch size down without touching any DB structures.
  • another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't, for whatever reason, use the best DB indexes. If the SELECT were inline in the DELETE statement, when the whole statement migrated to the slaves, they would have to re-run the long SELECT all over again, potentially lagging behind. Slave lag, again, suffers badly. If you use a separate SELECT query, this problem goes away, because all you pass on is a list of ids.
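A minimal sketch of the idea in SQL (the holding table and the batch size are illustrative assumptions; in practice the list of ids would live in your script/application):

    -- 1. A separate SELECT materializes the ids to delete.
    CREATE TABLE ids_to_delete (id BIGINT UNSIGNED NOT NULL PRIMARY KEY) ENGINE=InnoDB;
    INSERT INTO ids_to_delete (id) SELECT id FROM b;

    -- 2. Delete in batches; re-run until zero rows are affected.
    DELETE FROM a
     WHERE id IN (SELECT id FROM ids_to_delete)
     LIMIT 50000;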

Let me know if there is any kind of error in my logic.

For a more detailed discussion of replication lag and ways to deal with it, similar to this, see MySQL Slave Lag (Delay) and 7 ways to deal with it.

P.S. One thing to be careful about, of course, is potential modifications to the table between the time the SELECT finishes and the DELETEs begin. I'll leave you to handle such details by using transactions and/or logic pertinent to your application.

+2
May 10, '09 at 17:28

By the way, some time after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his Maatkit already has a tool for just this purpose - mk-archiver: http://www.maatkit.org/doc/mk-archiver.html

This is most likely your best tool for the job.

+2
May 11, '09 at

Apparently the SELECT query that forms the foundation of your DELETE operation is quite fast, so I would think that either the foreign key constraint or the indexes are the reason for the extremely slow query.

Try

    SET foreign_key_checks = 0;
    /* ... your query ... */
    SET foreign_key_checks = 1;

This will disable the foreign key checks. Unfortunately, you cannot disable (at least I don't know how) key updates on an InnoDB table. With a MyISAM table you could do something like

    ALTER TABLE a DISABLE KEYS;
    /* ... your query ... */
    ALTER TABLE a ENABLE KEYS;

I have not actually tested whether these settings affect the query duration, but it's worth a try.

+1
May 6 '09 at 16:19

Connect to the database from a terminal and execute the commands below, noting the time each one takes. You will find that the times to delete 10, 100, 1000, 10000, 100000 records do not scale linearly.

    DELETE FROM #{$table_name} WHERE id < 10;
    DELETE FROM #{$table_name} WHERE id < 100;
    DELETE FROM #{$table_name} WHERE id < 1000;
    DELETE FROM #{$table_name} WHERE id < 10000;
    DELETE FROM #{$table_name} WHERE id < 100000;

Deleting 100 thousand records does not take ten times as long as deleting 10 thousand records. So, besides looking for ways to make deletion itself faster, there are some indirect methods.

1. We can rename table_name to table_name_bak, and then select the records we want to keep from table_name_bak back into table_name.

2. To delete 10,000 records, we can delete 1,000 records ten times. Here is an example Ruby script to do this:

    #!/usr/bin/env ruby
    require 'mysql2'

    $client = Mysql2::Client.new(
      :as       => :array,
      :host     => '10.0.0.250',
      :username => 'mysql',
      :password => '123456',
      :database => 'test'
    )
    $ids = (1..1000000).to_a
    $table_name = "test"

    until $ids.empty?
      ids = $ids.shift(1000).join(", ")
      puts "delete =================="
      $client.query("
        DELETE FROM #{$table_name} WHERE id IN ( #{ids} )
      ")
    end
0
Mar 03 '14 at 8:37

The basic way to delete multiple rows from a single MySQL table is via the id field:

    DELETE FROM tbl_name WHERE id >= 100 AND id <= 200;

This query deletes all rows whose id lies between 100 and 200 from the given table.

-1
Dec 12 '16 at 4:28


