I held off on asking this because I feel the question gets asked a lot, but it still has no definitive answer:
objects table: 40M+ rows, with UPC, EIN, and ISBN values as the obj_id primary key. The IDs have gaps.
obj_cat table: binds objects to categories. Columns: obj_id, cat_id
Question: What is the best way to return 5 random obj_id values? Is there a better way than the ones I list below?
Solution1: SELECT objects.obj_id FROM objects left join obj_cat on objects.obj_id=obj_cat.obj_id WHERE obj_cat.cat_id=cat_id ORDER BY RAND() LIMIT 1;
Run 5 times
- Very slow with large tables.
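For reference, a minimal sketch of the same idea done in one pass instead of five. It assumes only obj_id is needed, so the join to objects can be dropped and obj_cat queried directly; the category value 42 is a made-up example.

```sql
-- Hypothetical example category; substitute a real cat_id.
SET @cat_id = 42;

-- One query returns all 5 rows, so nothing has to be run 5 times.
-- ORDER BY RAND() still sorts every row in the category, so this
-- only helps when the category is much smaller than the full table.
SELECT obj_id
FROM obj_cat
WHERE cat_id = @cat_id
ORDER BY RAND()
LIMIT 5;
```

Assuming each (obj_id, cat_id) pair appears once in obj_cat, the 5 rows are distinct, whereas five separate LIMIT 1 runs can return the same obj_id more than once.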
Solution2: SELECT obj_id FROM objects WHERE obj_id >= (SELECT FLOOR( MAX(obj_id) * RAND()) FROM objects ) LIMIT 1;
Run 5 times (the obj_cat join is left out to keep it easy to read)
- The best solution if your IDs are sequential or have only minor gaps. Very fast.
- It does not work with categories, as there will inevitably be gaps in the numbering within a category.
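To make the gap problem concrete, here is a rough sketch of what a category-scoped version of this approach could look like. It assumes obj_id is a numeric column and that an index such as (cat_id, obj_id) exists on obj_cat; the category value 42 is a made-up example.

```sql
-- Hypothetical example category; substitute a real cat_id.
SET @cat_id = 42;

-- Smallest and largest obj_id inside the category.
SET @lo = (SELECT MIN(obj_id) FROM obj_cat WHERE cat_id = @cat_id);
SET @hi = (SELECT MAX(obj_id) FROM obj_cat WHERE cat_id = @cat_id);

-- Random pivot somewhere in [@lo, @hi].
SET @pivot = @lo + FLOOR(RAND() * (@hi - @lo + 1));

-- First obj_id in the category at or above the pivot.
SELECT obj_id
FROM obj_cat
WHERE cat_id = @cat_id
  AND obj_id >= @pivot
ORDER BY obj_id
LIMIT 1;
```

This always returns a row for a non-empty category, but the pick is not uniform: an obj_id that sits right after a large gap is chosen far more often than one that follows its neighbour closely, which is exactly the issue with sparse per-category IDs.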
Solution3: SELECT FLOOR(RAND() * COUNT(*)) AS offset FROM objects, obj_cat WHERE objects.obj_id=obj_cat.obj_id AND obj_cat.cat_id=cat_id; SELECT obj_id FROM objects LIMIT $offset, 1
Execute 5 times
- Very flexible. Much faster than solution 1. Works with gaps. But with 40M+ rows, a single "LIMIT $offset, 1" may take 1 minute.
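A sketch of how both steps could stay inside the category and inside MySQL, so the offset only has to skip rows of that category rather than rows of the whole table. It assumes an index such as (cat_id, obj_id) on obj_cat; the category value 42 and the statement name pick are made up for illustration.

```sql
-- Hypothetical example category; substitute a real cat_id.
SET @cat_id = 42;

-- Step 1: random offset in [0, number of rows in the category).
-- The CAST keeps the value an integer, which LIMIT requires.
SET @off = (
  SELECT CAST(FLOOR(RAND() * COUNT(*)) AS UNSIGNED)
  FROM obj_cat
  WHERE cat_id = @cat_id
);

-- Step 2: LIMIT does not accept expressions or user variables in a
-- plain query, so the offset is bound through a prepared statement.
PREPARE pick FROM
  'SELECT obj_id FROM obj_cat WHERE cat_id = ? ORDER BY obj_id LIMIT ?, 1';
EXECUTE pick USING @cat_id, @off;
DEALLOCATE PREPARE pick;
```

MySQL still has to walk past @off index entries, so a very large category can remain slow; the sketch only narrows the scan, it does not remove it.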
I have been using solution 3, but it is slow. My current solution is to use Solr's RandomSortField, since it is easy to specify my category in fq.
Solr solution ?q=*&fl=obj_id&fq=cat:(cat_id)&sort=random_* desc&rows=5
- Fairly fast: it takes about 45 seconds for each category, but it returns 5 inconsistent results per pass.
Is there a better way that people have found when working with large data sets? I know this looks like a duplicate question, but I thought I would add my experience with a 40M+ row table.