Elasticsearch: match each position only once

In my Elasticsearch index, I have documents with multiple tokens in the same position.

I want to return a document only when the query matches at least one token in each position. The order of the tokens is not important. How can I do this? I am using Elasticsearch 0.90.5.

Example:

I index a document like this:

{ "field":"red car" } 

I use a synonym token filter, which adds synonyms at the same position as the original token. So there are now 2 positions in the field:

  • Position 1: "red"
  • Position 2: "car", "automobile"

My solution at the moment:

To make sure that all positions are matched, I also index the maximum position.

 { "field":"red car", "max_position": 2 } 

I have a custom similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is the number of matching terms in the field.

Query:

 { "custom_score": { "query": { "match": { "field": "a car is an automobile" } }, "_script": "_score*100/doc[\"max_position\"]+_score" }, "min_score":"100" } 

The problem with my solution:

The search above should not match the document, because the query string does not contain the token "red". But it does match, because Elasticsearch counts the matches for "car" and "automobile" as two matches. That gives a score of 2, which leads to a script score of 102, which satisfies the "min_score".
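To make the arithmetic concrete, here is a minimal sketch that recomputes the script score outside Elasticsearch. It assumes the synonym added at position 2 is "automobile" (as the query string suggests) and that `_score` equals the number of matched indexed terms; `script_score`, `positions` and `query_tokens` are hypothetical names used only for illustration.

```python
# Hypothetical re-creation of the scoring arithmetic from the question.
# With tf(), idf() and lengthNorm() all returning 1, _score is assumed to
# equal the number of indexed terms that the query matches.

def script_score(score, max_position):
    # Mirrors the script: _score*100/doc["max_position"]+_score
    return score * 100 / max_position + score

# Indexed tokens per position for "red car" after synonym expansion
# (assumption: the synonym for "car" is "automobile").
positions = {1: {"red"}, 2: {"car", "automobile"}}
query_tokens = {"a", "car", "is", "an", "automobile"}

# Count every matching term, regardless of the position it sits in.
matched_terms = sum(len(tokens & query_tokens) for tokens in positions.values())
final_score = script_score(matched_terms, max_position=len(positions))

print(matched_terms, final_score)
```

Both matches come from position 2, yet the formula cannot tell them apart from one match in each of two positions, so the result still clears the threshold of 100.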

1 answer

If you need to guarantee that 100% of the query terms match, you can use minimum_should_match. This is the more common case.
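As a rough illustration of that more common case, the request body might look like the sketch below, built here as a Python dict and printed as JSON. The field name and query text are taken from the question; placing `minimum_should_match` inside the `match` clause is my assumption about how it would be written and should be checked against the documentation for your version.

```python
import json

# Sketch of a request body; "field" and the query text come from the question.
query_body = {
    "query": {
        "match": {
            "field": {
                "query": "a car is an automobile",
                # Require every analyzed query term to match.
                "minimum_should_match": "100%",
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```

Note that this constrains the query terms, not the indexed terms, which is why it does not solve the problem in the question.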


Unfortunately, in your case you want to guarantee that 100% of the indexed terms match. To do this, you will need to drop down to the Lucene level and write a custom Similarity class (in Java; here is a boilerplate that you can extend), because you need access to low-level index information that is not exposed in the Query DSL:

For each document/field scanned by the query scorer, that means the number of matched terms and the total number of indexed terms.

Your custom Similarity (perhaps even one that extends the default) then needs to detect queries where matched terms < total terms and multiply their score by zero.

Since both query-time and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, and so will the query terms, which avoids the "a car is an automobile" false positive above.
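The rule can be sketched in plain Python to show its effect. This is only a simulation of the idea, not Lucene code: `SYNONYMS`, `analyze` and `gated_score` are hypothetical stand-ins, and the synonym pair car/automobile is an assumption based on the query in the question.

```python
# Hypothetical stand-ins: SYNONYMS and analyze() imitate a synonym token
# filter; they are not Elasticsearch or Lucene APIs.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def analyze(text):
    """Lowercase, split on whitespace, and add synonyms for each token."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms |= SYNONYMS.get(token, set())
    return terms

def gated_score(query_text, doc_text):
    """Return the matched-term count, or 0 if any indexed term is unmatched."""
    indexed = analyze(doc_text)
    matched = indexed & analyze(query_text)
    # The rule from the answer: matched terms < total terms -> multiply by zero.
    return len(matched) if len(matched) == len(indexed) else 0

print(gated_score("a car is an automobile", "red car"))  # 0: "red" is unmatched
print(gated_score("red automobile", "red car"))          # 3: all indexed terms matched
```

Because the same expansion is applied to both sides, a query that uses only the synonym still matches, while a query that misses "red" is zeroed out.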
