Optimize Top N Queries

I'm having a hard time optimizing a query like

SELECT RESULT_ID FROM RESULTS WHERE SOURCE = 1 AND GROUP=2 AND SCORE1 BETWEEN 20 AND 100 ORDER BY SCORE2 LIMIT 450; 

on a 40-million-row InnoDB table. The query may have to sort up to 15 million rows just to return at most 450. So far I have tried:

  • Defining indexes, but they are not used for sorting, since MySQL ignores any index columns after the one used in a range condition. Since we have several score columns, there can be range conditions on any number of them, followed by ordering on one particular score and limiting the result set to 450.
  • Using MEMORY tables, but they do not handle sorting result sets this large very well.
  • Sphinx, but I'm not sure whether it helps with this type of query.

Also, is there any OLAP cube implementation that can optimize such queries?

3 answers

I would suggest maintaining a separate table containing those 450 rows, recalculated each time a new row is added to, or an old one updated in, the main table.

That way, your query will not need to scan through all the rows each time.


You can precompute buckets over common score ranges. For example, you could create several bucket widths:

  RANGE_50  = { 0..50, 51..100, 101..150, 151..200 }
  RANGE_100 = { 0..100, 101..200 }
  RANGE_200 = { 0..200 }

These bucket values can be stored as columns in your table and must be kept up to date whenever score1 changes.

Then you can use the following queries:

 SELECT RESULT_ID FROM RESULTS WHERE SOURCE = 1 AND GROUP=2 AND RANGE_100 = 2 ORDER BY SCORE2 LIMIT 450; 
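As a minimal sketch of how such bucket columns could be derived from score1 (the widths follow the example above; the helper function itself is hypothetical, and assumes upper bounds are inclusive as in 0..50, 51..100):

```python
def range_bucket(score1, width):
    """Map a score to its 1-based bucket index for a given range width.

    For width=100: scores 0..100 fall in bucket 1, 101..200 in bucket 2,
    matching the RANGE_100 column in the example above.
    """
    if score1 <= 0:
        return 1
    # Upper bounds are inclusive, so subtract 1 before dividing.
    return (score1 - 1) // width + 1

range_bucket(75, 100)   # -> 1  (RANGE_100 bucket for 0..100)
range_bucket(150, 100)  # -> 2  (101..200)
range_bucket(160, 50)   # -> 4  (RANGE_50 bucket for 151..200)
```

A row's bucket columns would be set from range_bucket(score1, 50), range_bucket(score1, 100), and so on, whenever score1 is written.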

What you are looking for, IMHO, is a way to get the top K elements of a (theoretically) infinite stream of elements.

I would not try to solve this directly in MySQL, since your input is a stream, not a fixed dataset. Moreover, given the size of the dataset, recomputing the top K from scratch on each insert is out of the question.

What I would do is keep a compact representation of the top K that you update as new items arrive: for each incoming element, look at its score and maintain a heap of the top K elements discovered so far.

A bit more formally: given the stream q1, ..., qn, add qj to the heap if score(qj) is greater than the smallest score currently on the heap; in that case, the element with the smallest score is evicted from the heap.
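The update rule above can be sketched with a min-heap, where the root is always the smallest score among the current top K (the item/score values here are made up for illustration):

```python
import heapq

def update_top_k(heap, item, score, k=450):
    """Maintain a min-heap of the k highest-scoring items seen so far.

    The heap root holds the smallest score among the current top k,
    so a new item only enters the heap if it beats that score.
    """
    if len(heap) < k:
        heapq.heappush(heap, (score, item))
    elif score > heap[0][0]:
        # Evict the current minimum and insert the new item in one step.
        heapq.heapreplace(heap, (score, item))
    return heap

# Usage: stream items through the heap, k=3 for brevity.
heap = []
for item, score in [("a", 10), ("b", 5), ("c", 7), ("d", 12)]:
    update_top_k(heap, item, score, k=3)
sorted(heap, reverse=True)  # -> [(12, 'd'), (10, 'a'), (7, 'c')]
```

Each insert costs O(log K), independent of how many rows the table has accumulated.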

Specific solution

You have several score columns, and the user can ask for up to 450 results for any combination of columns using range queries.

What I would do is conceptually:

  • keep the top 450 in a heap for each score column separately, using the streaming method above
  • at query time, fetch the items matching the query's columns
  • aggregate and sort the lists as needed, and cut to 450
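The aggregation step might look like the following sketch, which merges several per-column top lists into one overall top K (the helper is hypothetical and assumes higher scores are better and that lists may contain the same item):

```python
def merge_top_k(per_column_tops, k=450):
    """Merge per-column top lists (each a list of (score, item) tuples)
    and return the overall k best, sorted by descending score."""
    best = {}
    for tops in per_column_tops:
        for score, item in tops:
            # Keep each item's best score across all columns.
            if item not in best or score > best[item]:
                best[item] = score
    merged = sorted(((s, i) for i, s in best.items()), reverse=True)
    return merged[:k]

# Example with two small per-column lists, k=3 for brevity.
a = [(12, "d"), (10, "a")]
b = [(11, "e"), (10, "a")]
merge_top_k([a, b], k=3)  # -> [(12, 'd'), (11, 'e'), (10, 'a')]
```

Since each per-column list is at most 450 entries, this merge is cheap regardless of table size.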

Hope this helps.

