MySQL: is it possible to make this query faster?

I have a test table containing millions of records. Each row holds a floating-point feature value and a count of how often that feature occurs in the item identified by id. The primary key for this table is the combination of "id" and "feature", i.e. each item can have several features. Typically there are from a few hundred to several thousand feature records per item id.

 create table test (
     id      int    not null,
     feature double not null,
     count   int    not null,
     primary key (id, feature)
 );

The task is to find the 500 items most similar to a given reference item. Similarity is measured as the number of matching feature values present in both items. My query is below, but despite proper use of indexes its execution plan still shows "Using temporary" and "Using filesort", which gives unacceptable performance for my use case.

 select t1.id, t2.id, sum(least(t1.count, t2.count)) as priority
 from test as t1
 inner join test as t2 on t2.feature = t1.feature
 where t1.id = {some user supplied id value}
 group by t1.id, t2.id
 order by priority desc
 limit 500;
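For reference, the secondary index assumed by "proper use of indexes" would be something along these lines (the index name is my own; the "Using temporary; Using filesort" notes show up when running EXPLAIN on the query above even with it in place):

 -- assumed secondary index on feature, which the self-join needs
 ALTER TABLE test ADD INDEX idx_feature (feature);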

Any ideas on how to improve this? The schema can be changed, and indexes added as needed.

+7
optimization sql mysql query-optimization
5 answers

With the current schema, this query can hardly be improved.

You already have an index on feature, and that is the best you can do with the current schema.

The problem is that "more similar than" is not an ordering relation. If a is more like b than like c, it does not mean that c is less like a than b is. Therefore you cannot build a single index describing this relationship; you would have to keep a separate ordering for each item, which makes the index N^2 entries long, where N is the number of items.

If you only ever want the top 500 matches, you can cap each item's stored list at that number (in which case the index will contain 500 * N entries).

MySQL does not support indexed or materialized views, so you have to do it yourself:

  • Create the table as follows:

     CREATE TABLE similarity (
         id1        INT NOT NULL,
         id2        INT NOT NULL,
         similarity DOUBLE NOT NULL,
         PRIMARY KEY (id1, id2),
         KEY (id1, similarity)
     );
  • Whenever you insert a new feature into test, reflect the change in similarity (a stored-procedure sketch tying the steps together follows this list):

     INSERT INTO similarity (id1, id2, similarity)
     SELECT @newid, id, LEAST(@newcount, count)
     FROM   test
     WHERE  feature = @newfeature
            AND id <> @newid
     ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);
  • From time to time, prune similarities that fall outside the top 500:

     DELETE s
     FROM   (
            SELECT id1,
                   (
                   SELECT similarity
                   FROM   similarity si
                   WHERE  si.id1 = s.id1
                   ORDER BY si.id1 DESC, si.similarity DESC
                   LIMIT 499, 1
                   ) AS cs
            FROM   (SELECT DISTINCT id1 FROM similarity) s
            ) q
     JOIN   similarity s
     ON     s.id1 = q.id1
            AND s.similarity < q.cs;
  • Query the data:

     SELECT id2
     FROM   similarity
     WHERE  id1 = @myid
     ORDER BY similarity DESC
     LIMIT  500;
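To tie the maintenance steps together, here is a minimal stored-procedure sketch; the procedure name and parameters are my own, and it also mirrors each pair, since similarity is symmetric and the query above only looks at id1:

 DELIMITER //
 CREATE PROCEDURE add_feature(IN p_id INT, IN p_feature DOUBLE, IN p_count INT)
 BEGIN
     INSERT INTO test (id, feature, count)
     VALUES (p_id, p_feature, p_count);

     -- bump similarities between the new row's item and all others
     INSERT INTO similarity (id1, id2, similarity)
     SELECT p_id, id, LEAST(p_count, count)
     FROM   test
     WHERE  feature = p_feature AND id <> p_id
     ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);

     -- mirror the pairs so lookups by either id work
     INSERT INTO similarity (id1, id2, similarity)
     SELECT id, p_id, LEAST(p_count, count)
     FROM   test
     WHERE  feature = p_feature AND id <> p_id
     ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);
 END//
 DELIMITER ;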
+4

Having a floating-point number as part of the primary key (PK) is a killer. For that matter, it should not be part of any constraint at all: unique key (UK), foreign key (FK), etc.

To improve the performance of your SQL query many times over, try changing the schema as shown below:

 CREATE TABLE test (
     item_id    INTEGER,
     feature_id INTEGER,
     count      INTEGER
 );

 CREATE TABLE features (
     id            INTEGER PRIMARY KEY,
     feature_value DOUBLE NOT NULL
 );

 CREATE TABLE items (
     id               INTEGER PRIMARY KEY,
     item_description VARCHAR(100) NOT NULL
 );

 ALTER TABLE test ADD CONSTRAINT fk_test_item_id
     FOREIGN KEY (item_id) REFERENCES items (id);
 ALTER TABLE test ADD CONSTRAINT fk_test_feature_id
     FOREIGN KEY (feature_id) REFERENCES features (id);

With the test table normalized as above, the items and features are split out into tables of their own, and test becomes a pure mapping table with a count for each mapping.
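A one-time backfill from the original single-table layout might look like the sketch below; the old_test name, the AUTO_INCREMENT feature ids, and the placeholder descriptions are all assumptions:

 -- assumes features.id is declared AUTO_INCREMENT for this sketch
 INSERT INTO features (feature_value)
 SELECT DISTINCT feature FROM old_test;

 INSERT INTO items (id, item_description)
 SELECT DISTINCT id, CONCAT('item ', id) FROM old_test;

 -- exact equality on the doubles is safe here: each stored value
 -- is compared with itself, not with a computed result
 INSERT INTO test (item_id, feature_id, count)
 SELECT o.id, f.id, o.count
 FROM   old_test o
 JOIN   features f ON f.feature_value = o.feature;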

If you now run the SQL query you posted earlier, with the few changes shown below, you should see a significant improvement in performance.

 select t1.item_id, t2.item_id, sum(least(t1.count, t2.count)) as priority
 from test as t1
 inner join test as t2 on t2.feature_id = t1.feature_id
 where t1.item_id = {some user supplied id value}
 group by t1.item_id, t2.item_id
 order by priority desc
 limit 500;

Cheers!

+3

One optimization would be to exclude the item itself from the self-join:

 inner join test as t2 on t2.feature = t1.feature and t2.id <> t1.id
                                                   ^^^^^^^^^^^^^^^^^

To speed things up further, create a covering index on (feature, id, count).
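In MySQL that could be created like this, for example (the index name is my own):

 ALTER TABLE test ADD INDEX idx_feature_id_count (feature, id, count);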

+2

I would start with this... and would love to hear about the performance you get. I don't think you need the LEAST(t1.count, t2.count). If you first filter on id = {some value}, you obviously get all of that item's features. Then, by self-joining only on the matching features, you get a count. Since you group by ID1 and ID2, each matching feature is counted once. At the top of the result, since I do not exclude t2.id equal to {some user value}, the item will match itself with the EXACT number of features in t1, and everything below it will be your other closest matches.

I would make sure I have an index on (id, feature).

 select STRAIGHT_JOIN
        t1.id,
        t2.id,
        count(*) as MatchedInBoth
 from   test as t1,
        test as t2
 where  t1.id = {some user value}
   and  t1.feature = t2.feature
 group by t1.id, t2.id
 order by MatchedInBoth desc
 limit 500;

The result may give something like

 t1            t2            MatchedInBoth
 {user value}  {user value}  275
 {user value}  Other ID 1    270
 {user value}  Other ID 2    241
 {user value}  Other ID 3    218
 {user value}  Other ID 4    197
 {user value}  Other ID 5    163
 etc.
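The index mentioned above could be created like this (the name is my own; with the composite primary key from the question it may already be covered):

 ALTER TABLE test ADD INDEX idx_id_feature (id, feature);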
0

Can you get it down to just one table? Using subqueries, you can avoid the join, and that will be a win if the subqueries are fast, indexed, and executed exactly once. Something like this (untested):

 select t2.id,
        sum(t2.count) as priority
 from   test as t2
 where  t2.id <> {some user supplied id value}
   and  t2.count > (select min(count) from test t1
                    where id = {some user supplied value})
   and  t2.feature in (select feature from test t1
                       where id = {some user supplied value})
 group by t2.id
 order by priority desc
 limit 500;

If this does not work: MySQL is terrible at realizing that the inner selects are constant tables, and it will re-evaluate them for every row. Wrapping each one in another SELECT forces a single constant-table lookup. Here's the hack:


 select t2.id,
        sum(t2.count) as priority
 from   test as t2
 where  t2.id <> {some user supplied id value}
   and  t2.count > (
            select * from (
                select min(count) from test t1
                where id = {some user supplied value}
            ) as const )
   and  t2.feature in (
            select * from (
                select feature from test t1
                where id = {some user supplied value}
            ) as const )
 group by t2.id
 order by priority desc
 limit 500;

-1
