MySQL random rows with gaps on a table of 40+ million rows

I held off on asking because I feel this question gets asked a lot, yet it still has no definitive answer:

Objects table: 40M+ rows, with UPC, EIN, and ISBN values as the obj_id primary key. Has gaps.

Obj_Cat table: maps objects to categories. Columns: | obj_id | cat_id |

Question: What is the best way to return 5 random obj_id values? Is there a better way than the ones I list below?

Solution1: SELECT objects.obj_id FROM objects left join obj_cat on objects.obj_id=obj_cat.obj_id WHERE obj_cat.cat_id=cat_id ORDER BY RAND() LIMIT 1; Run 5 times

  • Very slow with large tables.

Solution2: SELECT obj_id FROM objects WHERE obj_id >= (SELECT FLOOR( MAX(obj_id) * RAND()) FROM objects ) LIMIT 1; Run 5 times (the obj_cat join is left out to keep the query easy to read)

  • The best solution if your IDs are sequential or have only minor gaps. Very fast.

  • It does not work with categories, as there will inevitably be gaps in the numbering.

Solution3: SELECT FLOOR(RAND() * COUNT(*)) AS offset FROM objects, obj_cat WHERE objects.obj_id=obj_cat.obj_id AND obj_cat.cat_id=cat_id; SELECT obj_id FROM objects LIMIT $offset, 1; Run 5 times

  • Very flexible. Much faster than solution 1. Works with gaps. But with 40M+ rows, a single "LIMIT $offset, 1" can take 1 minute.
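
A minimal sketch of the count-then-offset pattern from Solution 3, written in Python against an in-memory SQLite table so it can be run as-is. It is not the exact MySQL setup: the helper name random_obj_id_by_offset is hypothetical, and unlike the query above it applies the category filter to both statements, which is an assumption about the intent. The second statement still has to walk past every skipped row, which is why a large offset is slow on 40M+ rows.

```python
import random
import sqlite3


def random_obj_id_by_offset(conn, cat_id, rng=random):
    """Pick one obj_id from a category via COUNT(*) plus a random OFFSET.

    Hypothetical helper: the table and column names follow the question,
    everything else is an assumption made for illustration.
    """
    # Step 1: count the rows in the category.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM obj_cat WHERE cat_id = ?", (cat_id,)
    ).fetchone()
    if count == 0:
        return None  # empty category, nothing to pick

    # Step 2: jump to a random position inside that category.
    # The database still reads and discards `offset` rows to get there.
    offset = rng.randrange(count)
    row = conn.execute(
        "SELECT obj_id FROM obj_cat WHERE cat_id = ? "
        "ORDER BY obj_id LIMIT 1 OFFSET ?",
        (cat_id, offset),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obj_cat (obj_id INTEGER, cat_id INTEGER)")
    # obj_id values with deliberate gaps, spread over two categories.
    conn.executemany(
        "INSERT INTO obj_cat VALUES (?, ?)",
        [(10, 1), (25, 1), (90, 1), (14, 2), (300, 2)],
    )
    print([random_obj_id_by_offset(conn, 1) for _ in range(5)])
```

Because the position is drawn from the row count rather than from the ID range, gaps in obj_id do not skew the pick, but repeated calls can return the same row twice.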

I have been using solution 3, but it is slow. My current workaround is Solr's RandomSortField, since it is easy to specify my category in fq.

Solr solution ?q=*&fl=obj_id&fq=cat:(cat_id)&sort=random_* desc&rows=5

  • Pretty fast: it takes about 45 seconds per category, but returns 5 non-sequential results in one pass.

Is there a better way that people have found when working with large data sets? I know this looks like a duplicate question, but I thought I would add my experience with a 40M+ row table.

1 answer

With a data set this large, you cannot do calculations like this on the fly. You need to take advantage of a space-time tradeoff. Create a new indexed, unsigned column in the obj_cat table, wide enough to exceed the maximum number of rows, and fill each row with a random number. Then you can simply generate a random number and immediately select the closest match, five times. This will be several orders of magnitude faster than trying to use ORDER BY RAND().
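
The idea can be sketched roughly as follows, using Python with an in-memory SQLite table as a runnable stand-in for MySQL. The column name rand_key, the index name, the MAX_KEY width, and both helper functions are assumptions for illustration, not part of the original schema. The sketch takes the next key at or above the random probe rather than the strictly closest one, and wraps around to the smallest key when the probe overshoots.

```python
import random
import sqlite3

MAX_KEY = 2**31  # assumed key space, far larger than the row count


def build_demo(conn, n_objects=1000, n_cats=5, seed=0):
    """Create a small obj_cat table with a precomputed random key per row."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE obj_cat (obj_id INTEGER, cat_id INTEGER, rand_key INTEGER)"
    )
    # obj_id = i * 7 leaves gaps in the IDs on purpose.
    rows = [
        (i * 7, rng.randrange(n_cats), rng.randrange(MAX_KEY))
        for i in range(n_objects)
    ]
    conn.executemany("INSERT INTO obj_cat VALUES (?, ?, ?)", rows)
    # The composite index lets each lookup seek directly to the probe value.
    conn.execute("CREATE INDEX idx_cat_rand ON obj_cat (cat_id, rand_key)")


def random_obj_ids(conn, cat_id, k=5, rng=random, max_tries=100):
    """Return up to k distinct obj_ids from a category via indexed probes."""
    picked = set()
    for _ in range(max_tries):
        if len(picked) == k:
            break
        probe = rng.randrange(MAX_KEY)
        row = conn.execute(
            "SELECT obj_id FROM obj_cat WHERE cat_id = ? AND rand_key >= ? "
            "ORDER BY rand_key LIMIT 1",
            (cat_id, probe),
        ).fetchone()
        if row is None:
            # Probe is above every key in this category: wrap to the smallest.
            row = conn.execute(
                "SELECT obj_id FROM obj_cat WHERE cat_id = ? "
                "ORDER BY rand_key LIMIT 1",
                (cat_id,),
            ).fetchone()
        if row is not None:
            picked.add(row[0])
    return sorted(picked)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo(conn)
    print(random_obj_ids(conn, cat_id=2))
```

Each probe is a single index seek, so the cost does not grow with the offset. The pick is only approximately uniform: a row whose key follows a wide gap in rand_key is selected more often, and the stored keys need to be refreshed from time to time if the same rows should not keep being favored.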


Source: https://habr.com/ru/post/1412104/

