Choose a random row from MySQL (with probability)

I have a MySQL table with a column called cur_odds, which holds the probability that the row should be selected. How can I write a query that actually selects rows at about that frequency when it is executed, say, 100 times?

I tried the following, but a row that has a probability of 0.35 ends up being selected in about 60-70% of cases.

SELECT * FROM table ORDER BY RAND()*cur_odds DESC 

All the cur_odds values in the table add up to exactly 1.

+6
mysql probability
2 answers

If cur_odds rarely changes, you can implement the following algorithm:

1) Create another prob_sum column for which

prob_sum[0] := cur_odds[0]

for 1 <= i <= row_count - 1:

    prob_sum[i] := prob_sum[i - 1] + cur_odds[i]

2) Generate a random number between 0 and 1:

rnd := rand(0, 1)

3) Find the first row for which prob_sum > rnd (if you create a BTREE index on prob_sum, the query should run much faster):

CREATE INDEX prob_sum_ind ON <table> (prob_sum);

SET @rnd := RAND();

SELECT MIN(prob_sum) FROM <table> WHERE prob_sum > @rnd;
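
The three steps above can also be tried outside the database. The following Python sketch is only an illustration under assumptions: the sample rows and the names build_prob_sum and pick_row are hypothetical, and bisect_right stands in for the indexed lookup of the first prob_sum greater than rnd.

```python
import bisect
import random
from itertools import accumulate

# Hypothetical sample data: (id, cur_odds); the odds add up to 1.
rows = [(1, 0.35), (2, 0.25), (3, 0.25), (4, 0.15)]

def build_prob_sum(rows):
    """Step 1: running total of cur_odds, one entry per row."""
    return list(accumulate(odds for _, odds in rows))

def pick_row(rows, prob_sum, rnd=None):
    """Steps 2 and 3: draw rnd in [0, 1), return the first row with prob_sum > rnd."""
    if rnd is None:
        rnd = random.random()
    # bisect_right gives the index of the first entry strictly greater than rnd.
    i = bisect.bisect_right(prob_sum, rnd)
    # Guard: floating-point rounding may leave the last sum slightly below 1.
    i = min(i, len(rows) - 1)
    return rows[i]

prob_sum = build_prob_sum(rows)
counts = {row_id: 0 for row_id, _ in rows}
for _ in range(100_000):
    counts[pick_row(rows, prob_sum)[0]] += 1
print(counts)  # each count divided by 100000 should be close to that row's cur_odds
```

The guard on the last index reflects a practical detail: if the running total ends up marginally below 1 because of rounding, a draw very close to 1 would otherwise match no row.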

+4

Given your SQL statement above, the numbers you have in cur_odds are not the probabilities with which each row is selected. They are simply arbitrary weights (relative to the "weights" of all the other rows), best interpreted as a relative tendency to float to the top of the sorted table. The absolute value in each row is not meaningful (for example, you could have 4 rows with values of 0.35, 0.5, 0.75 and 0.99, or you could have values of 35, 50, 75 and 99, and the results would be the same).

Update: Here is what happens with your query. You have one row with a cur_odds value of 0.35. To illustrate, I will assume that the remaining 9 rows all have the same value (0.072). Also to illustrate, suppose that RAND() returns a value between 0.0 and 1.0 (which may actually be the case).

Each time you run this SELECT statement, each row is assigned a sort value by multiplying its cur_odds value by a RAND() value between 0.0 and 1.0. This means that the row with 0.35 will have a sort value between 0.0 and 0.35.

Every other row (with a value of 0.072) will have a sort value between 0.0 and 0.072. This means there is approximately an 80% chance that your single row will have a sort value greater than 0.072, in which case no other row can be sorted higher. This is why your row with a cur_odds value of 0.35 appears first more often than you expect.
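
The effect described above can be checked empirically. The sketch below is a Python simulation, not MySQL. It assumes the illustrative values used in this answer (one row at 0.35, nine rows at 0.072) and a uniform generator on [0, 1); the function name simulate_order_by_rand is made up for the illustration.

```python
import random

def simulate_order_by_rand(odds, trials, seed=0):
    """Mimic ORDER BY RAND()*cur_odds DESC and record which row sorts first."""
    rng = random.Random(seed)
    wins = [0] * len(odds)
    for _ in range(trials):
        # Each row gets its own random draw, scaled by its cur_odds.
        keys = [rng.random() * w for w in odds]
        wins[keys.index(max(keys))] += 1
    # Convert win counts to observed frequencies.
    return [w / trials for w in wins]

# Assumed values from the illustration: one row at 0.35, nine rows at 0.072.
odds = [0.35] + [0.072] * 9
freq = simulate_order_by_rand(odds, trials=50_000)
print(f"row with cur_odds 0.35 sorted first in {freq[0]:.1%} of trials")
```

Under these assumptions the 0.35 row should come out first in roughly 81% of trials: about 79% from the draws where its sort value exceeds 0.072, plus the small share of the remaining draws that it still wins.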

I incorrectly described the cur_odds values as relative weights. Each one actually functions as a maximum relative weight, and working out the actual selection probabilities from them would involve some complex math.

I'm not sure that what you need can be done with straight T-SQL. I have implemented a weighted-probability picker many times (ironically, I was even going to ask a question this morning about the best methods for it), but always in code.
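
For reference, a weighted pick in application code can be quite short. This is a sketch in Python, not the code this answer refers to; the row list and the name pick_weighted are hypothetical, and random.choices from the standard library (Python 3.6 or later) does the weighted draw.

```python
import random
from collections import Counter

# Hypothetical rows as fetched from the table: (id, cur_odds).
rows = [(1, 0.35), (2, 0.25), (3, 0.25), (4, 0.15)]

def pick_weighted(rows, rng=random):
    """Return one row, chosen with probability proportional to its cur_odds."""
    weights = [odds for _, odds in rows]
    return rng.choices(rows, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded only to make the demo repeatable
tally = Counter(pick_weighted(rows, rng)[0] for _ in range(100_000))
for row_id, odds in rows:
    # Print the configured odds next to the observed selection frequency.
    print(row_id, odds, tally[row_id] / 100_000)
```

Because random.choices normalizes the weights itself, the cur_odds values do not need to add up to exactly 1 for this to work.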

+3