MySQL random rows with gaps in a 40+ million row table

Question

I held off on asking because this question gets asked so often, yet it still has no definitive answer:

objects table: 40M+ rows, with UPC, EIN, and ISBN values as the obj_id primary key. The IDs have gaps.

obj_cat table: maps objects to categories. Columns: obj_id, cat_id

Question: What is the best way to return 5 random obj_id values? Is there a better way than the ones I list below?

Solution 1: SELECT objects.obj_id FROM objects LEFT JOIN obj_cat ON objects.obj_id = obj_cat.obj_id WHERE obj_cat.cat_id = $cat_id ORDER BY RAND() LIMIT 1; Run 5 times.

Very slow with large tables.

Solution 2: SELECT obj_id FROM objects WHERE obj_id >= (SELECT FLOOR(MAX(obj_id) * RAND()) FROM objects) LIMIT 1; Run 5 times. (The obj_cat join is left out to keep the query easy to read.)

The best solution if your IDs are contiguous or have only minor gaps. Very fast.
It does not work with categories, because restricting to a category inevitably leaves gaps in the numbering.
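To make the gap problem concrete, here is a small sketch in Python (not MySQL) that enumerates every value FLOOR(MAX(obj_id) * RAND()) can take for a toy ID list and counts which obj_id the `obj_id >= r` lookup would land on. It assumes RAND() is uniform on [0, 1) and that the query returns the smallest matching obj_id; the function name and the sample IDs are made up for illustration.

```python
from bisect import bisect_left
from collections import Counter
from fractions import Fraction


def pick_probabilities(ids):
    """Exact chance of each id being chosen by 'first id >= r' sampling."""
    ids = sorted(ids)
    max_id = ids[-1]
    hits = Counter()
    # r stands for FLOOR(MAX(obj_id) * RAND()), i.e. an integer in 0 .. max_id - 1
    for r in range(max_id):
        # bisect_left finds the first id that is >= r
        hits[ids[bisect_left(ids, r)]] += 1
    return {i: Fraction(hits[i], max_id) for i in ids}


# Toy data: ids 4..9 are missing, so id 10 sits right after a large gap
probs = pick_probabilities([1, 2, 3, 10])
for obj_id, p in probs.items():
    print(obj_id, p)
```

In this toy case the ID that follows the gap absorbs every random value that falls inside the gap, so it is chosen far more often than its neighbours. Filtering by category creates exactly this kind of gap.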

Solution 3: SELECT FLOOR(RAND() * COUNT(*)) AS offset FROM objects, obj_cat WHERE objects.obj_id = obj_cat.obj_id AND obj_cat.cat_id = $cat_id; SELECT objects.obj_id FROM objects, obj_cat WHERE objects.obj_id = obj_cat.obj_id AND obj_cat.cat_id = $cat_id LIMIT $offset, 1; Run 5 times.

Very flexible. Much faster than solution 1. Works with gaps. But with 40M+ rows, a single "LIMIT $offset, 1" may take 1 minute.

I was using solution 3, but it is slow. My current solution is to use Solr's RandomSortField, because it is easy to specify my category in fq.

Solr solution: ?q=*&fl=obj_id&fq=cat:(cat_id)&sort=random_* desc&rows=5

Pretty fast: it takes about 45 seconds for each category, but returns 5 inconsistent results on each run.

Have people found a better way when working with large data sets? I know this looks like a duplicate question, but I thought I would add my experience with a 40M+ row table.

mysql random solr

Anthony Lin, May 12 '12 at 2:33

1 answer

Peter Hanneman · Answer 1 · 2012-05-18T14:43:34+0000

With a data set this large, you cannot do calculations like this on the fly. You need to make a space-time tradeoff. Add a new indexed unsigned integer column to the obj_cat table, wide enough to exceed the maximum number of rows, and fill each row with a random number. Then you can simply generate a random number and immediately select the closest match, five times. This will be several orders of magnitude faster than trying to use ORDER BY RAND().
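A minimal sketch of this idea follows, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. Everything beyond the obj_cat, obj_id and cat_id names is an assumption made for illustration: the column name rnd, the index name idx_cat_rnd, the 32-bit range, the toy data, and the helper functions. "Closest match" is approximated here as "first row at or above the random value, wrapping around to the smallest value", which lets the composite index on (cat_id, rnd) answer each lookup directly.

```python
import random
import sqlite3

RND_MAX = 2**32  # assumed range of the precomputed random column


def build_demo(conn, n_rows=10_000, n_cats=10, seed=0):
    """Create a toy obj_cat table with a precomputed random column."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE obj_cat (obj_id INTEGER, cat_id INTEGER, rnd INTEGER)"
    )
    conn.executemany(
        "INSERT INTO obj_cat VALUES (?, ?, ?)",
        # obj_id = i * 7 leaves gaps in the ids; categories are assigned round-robin
        ((i * 7, i % n_cats, rng.randrange(RND_MAX)) for i in range(n_rows)),
    )
    # The composite index lets each lookup seek straight to (cat_id, rnd >= r)
    conn.execute("CREATE INDEX idx_cat_rnd ON obj_cat (cat_id, rnd)")


def random_obj_ids(conn, cat_id, k=5, rng=random):
    """Return up to k distinct obj_id values from one category."""
    picked = set()
    attempts = 0
    while len(picked) < k and attempts < 20 * k:
        attempts += 1
        r = rng.randrange(RND_MAX)
        row = conn.execute(
            "SELECT obj_id FROM obj_cat "
            "WHERE cat_id = ? AND rnd >= ? ORDER BY rnd LIMIT 1",
            (cat_id, r),
        ).fetchone()
        if row is None:
            # r was above every stored value: wrap around to the smallest one
            row = conn.execute(
                "SELECT obj_id FROM obj_cat WHERE cat_id = ? ORDER BY rnd LIMIT 1",
                (cat_id,),
            ).fetchone()
        if row is None:
            break  # the category has no rows at all
        picked.add(row[0])
    return sorted(picked)


conn = sqlite3.connect(":memory:")
build_demo(conn)
print(random_obj_ids(conn, cat_id=3))
```

The selection is only approximately uniform, because a row whose stored random value follows a wide interval is slightly more likely to be hit, and the stored values need to be refreshed from time to time if the same rows should not keep appearing together. In MySQL the same shape would presumably be an indexed INT UNSIGNED column queried with WHERE cat_id = ... AND rnd >= ... ORDER BY rnd LIMIT 1.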
