Weighted Random Sampling in Elasticsearch

Question

I need to get a random sample from an Elasticsearch index, that is, issue a query that retrieves documents from the index with weighted probability Wj/ΣWi (where Wj is the weight of document j and ΣWi is the sum of the weights of all documents matched by the query).

I currently have the following query:

    GET products/_search?pretty=true
    {
      "size": 5,
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": {
                "term": {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}
              }
            }
          },
          "functions": [{"random_score": {}}]
        }
      },
      "sort": [{"_score": {"order": "desc"}}]
    }

It returns 5 elements from the selected category, randomly. Each item has a weight field. So I probably should use

 "script_score": { "script": "weight = data['weight'].value / SUM; if (_score.doubleValue() > weight) {return 1;} else {return 0;}" }

as described here.

I have the following questions:

What is the right way to do this?
Do I need to enable dynamic scripting?
How do I calculate the sum of the weights for the query?

Many thanks for your help!

elasticsearch random-sample

dpaluy Dec 7 '15 at 7:54

2 answers

Vermeer grange · Answer 1 · 2018-01-19T10:25:18+0000

In case this helps someone, here's how I recently implemented a weighted shuffle.

In this example, we are shuffling companies. Each company has a "company_score" from 0 to 100. With this simple weighted shuffle, a company with a score of 100 appears 5 times more often on the first page than a company with a score of 20.

    json_body = {
        "sort": ["_score"],
        "query": {
            "function_score": {
                "query": main_query,  # put your main query here
                "functions": [
                    {
                        "random_score": {},
                    },
                    {
                        "field_value_factor": {
                            "field": "company_score",
                            "modifier": "none",
                            "missing": 0,
                        }
                    },
                ],
                # How to combine the result of the two functions 'random_score' and 'field_value_factor'.
                # This way, on average the combined _score of a company having score 100 will be 5 times as much
                # as the combined _score of a company having score 20, and thus will be 5 times more likely
                # to appear on first page.
                "score_mode": "multiply",
                # How to combine the result of function_score with the original _score from the query.
                # We overwrite it as our combined _score (random x company_score) is all we need.
                "boost_mode": "replace",
            }
        }
    }
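As a usage illustration, here is a minimal sketch of how such a body could be assembled and serialized before being sent to the `_search` endpoint. The helper name `build_weighted_shuffle_body`, its parameters, and the example `main_query` (borrowed from the term filter in the question) are assumptions for illustration only. The sketch deliberately stops at producing the JSON payload and does not call any Elasticsearch client.

```python
import json


def build_weighted_shuffle_body(main_query, weight_field="company_score", size=10):
    """Hypothetical helper: wrap main_query in a random-times-weight function_score."""
    return {
        "size": size,
        "sort": ["_score"],
        "query": {
            "function_score": {
                "query": main_query,
                "functions": [
                    # Random component of the score.
                    {"random_score": {}},
                    # Weight component, read from a numeric field of the document.
                    {
                        "field_value_factor": {
                            "field": weight_field,
                            "modifier": "none",
                            "missing": 0,
                        }
                    },
                ],
                # Multiply the two function results together.
                "score_mode": "multiply",
                # Discard the relevance score of main_query.
                "boost_mode": "replace",
            }
        },
    }


# Example filter taken from the question; any query clause could be used here.
main_query = {"term": {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}}
body = build_weighted_shuffle_body(main_query, size=5)
payload = json.dumps(body, indent=2)
print(payload)
```

The resulting `payload` string is what would go in the request body of a search call.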

Brent axthelm · Answer 2 · 2017-02-22T17:43:01+0000

I know this question is old, but I'm answering for anyone who finds it through a search in the future.

The comment before yours on the GitHub thread seems to have an answer. If each of your documents has a relative weight, you can pick a random score for each document and multiply it by the weight to get a new weighted random score. This has the added bonus of not requiring the sum of the weights.

E.g. if two documents have weights 1 and 2, you expect the second to be more likely to be chosen than the first. Give each document a random score between 0 and 1 (which you already do with "random_score"). Multiply the random score by the weight and the first document gets a score between 0 and 1, while the second gets a score between 0 and 2, so its expected score is twice as high and it is correspondingly more likely to be selected.
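To see how the multiplied scores behave, the two-document case can be simulated outside Elasticsearch. This is only a sketch: it assumes the random score is uniform on [0, 1) and independent per document, and the function name `top_pick_frequency` is made up for this illustration. For weights 1 and 2 the heavier document ends up on top about 75% of the time (the lighter one wins only when its score exceeds twice the other's, which has probability 1/4). An exact Wj/ΣWi draw would give 2/3, so the trick is best read as a simple approximation that favours heavier documents, not as an exact weighted sample.

```python
import random


def top_pick_frequency(weights, trials=100_000, seed=0):
    """Estimate how often each document gets the highest random * weight score."""
    rng = random.Random(seed)
    wins = [0] * len(weights)
    for _ in range(trials):
        # Mimic a uniform random score in [0, 1) multiplied by the document weight.
        scores = [rng.random() * w for w in weights]
        wins[scores.index(max(scores))] += 1
    return [count / trials for count in wins]


freqs = top_pick_frequency([1, 2])
print(freqs)  # the second entry should be close to 0.75
```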
