Elasticsearch: matches each position only once

Question


In my Elasticsearch index, I have documents with multiple tokens in the same position.

I want to return a document only when the query matches at least one token in each position. The order of the tokens does not matter. How can I do this? I am using Elasticsearch 0.90.5.

Example:

I index a document like this:

{ "field":"red car" }

I use a synonym token filter, which adds synonyms at the same position as the original token. So the field now has 2 positions:

Position 1: "red"
Position 2: "car", "automobile"
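
To make the position layout concrete, here is a minimal Python sketch of a synonym filter that stacks synonyms on the position of the original token. The `SYNONYMS` map and the `analyze` helper are hypothetical stand-ins for the real analyzer, and the car/automobile pair is an assumption taken from the query used later in the question.

```python
# Hypothetical synonym map; the car/automobile pair is an assumption.
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}

def analyze(text):
    """Return {position: set of tokens}; synonyms share the original token's position."""
    positions = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        positions[pos] = {token, *SYNONYMS.get(token, [])}
    return positions

doc_positions = analyze("red car")
for pos in sorted(doc_positions):
    print(pos, sorted(doc_positions[pos]))
```

The highest key of the returned mapping is the value stored as "max_position" in the approach described next.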

My solution at the moment:

To make sure that every position is matched, I also index the maximum position.

 { "field":"red car", "max_position": 2 }

I have a custom Similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is the number of matching terms in the field.

Query:

 { "custom_score": { "query": { "match": { "field": "a car is an automobile" } }, "script": "_score*100/doc[\"max_position\"].value+_score" }, "min_score":"100" }

The problem with my solution:

The search above should not match the document, because the query string does not contain "red". But it does match, because Elasticsearch counts the matches for "car" and "automobile" as two matches. That gives a score of 2, which leads to a script score of 102 and satisfies "min_score".
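
The arithmetic behind this false positive can be replayed in a few lines of Python. This is only a sketch: `script_score` mirrors the script from the question, the token layout is hard-coded, and "automobile" as the synonym of "car" is an assumption based on the query string.

```python
# Indexed tokens per position for "red car" (synonym pair assumed).
doc_positions = {1: {"red"}, 2: {"car", "automobile"}}
max_position = 2
query_terms = "a car is an automobile".split()

def script_score(score, max_position):
    # Same arithmetic as the script in the question.
    return score * 100 / max_position + score

indexed = set().union(*doc_positions.values())

# What the similarity effectively counts: every matching term.
term_matches = sum(1 for term in query_terms if term in indexed)

# What the question wants: each position counted at most once.
position_matches = sum(
    1 for tokens in doc_positions.values() if tokens & set(query_terms)
)

print(script_score(term_matches, max_position))      # 102.0, passes min_score 100
print(script_score(position_matches, max_position))  # 51.0, would be filtered out
```

Counting per term gives 2 matches even though only position 2 is covered, which is why the document slips past the threshold.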


position elasticsearch lucene

Danyg Jan 16 '14 at 14:01


1 answer

Peter Dixon-Moses · Answer 1 · 2015-08-11T21:48:25+0000

If you needed to guarantee that 100% of the query terms match, you could use minimum_should_match. That is the more common case.
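
For reference, the common case could look like the sketch below, which builds a request body as a Python dict. The helper name `match_all_query_terms` is hypothetical, the field name is taken from the question, and it assumes the match query in your Elasticsearch version accepts the `minimum_should_match` parameter.

```python
import json

def match_all_query_terms(field, text):
    """Build a match query that requires every analyzed query term to match."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # Assumed to be supported by the match query in your version.
                    "minimum_should_match": "100%",
                }
            }
        }
    }

body = match_all_query_terms("field", "red car")
print(json.dumps(body, indent=2))
```

Note that this constrains the query side only: it says nothing about how many of the indexed terms or positions were covered.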

Unfortunately, in your case you want to guarantee that 100% of the indexed terms match. To do this, you will need to drop down to the Lucene level and write a custom Similarity class (in Java; there is a template you can build on), because you need access to low-level index information that is not exposed in the query DSL:

Per document/field scanned by the query scorer:

The number of matched terms ("overlap" in Lucene terminology, exposed through the coord() method of the DefaultSimilarity class)
The total number of analyzed terms in the field. See this thread for a couple of different ways to get this information: How do I calculate the number of terms for each document in a Lucene index?

Your custom Similarity (perhaps even one extending DefaultSimilarity) will then need to detect queries where matched terms < total terms and multiply their score by zero.

Since query-time and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, and so should the query terms, which avoids the false positive from "a car is an automobile" above.
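
The rule can be illustrated with a small simulation. This is plain Python, not the Lucene Similarity API: `expand` and `gated_score` are hypothetical helpers, and the car/automobile synonym pair is an assumption based on the query in the question.

```python
# Assumed synonym pair, applied at both index time and query time.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def expand(text):
    """Analyze text into a set of terms, including synonyms."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms |= SYNONYMS.get(token, set())
    return terms

def gated_score(doc_text, query_text):
    indexed = expand(doc_text)               # total analyzed terms in the field
    matched = indexed & expand(query_text)   # the "overlap"
    score = float(len(matched))
    # Multiply by zero unless every indexed term was matched.
    return score if len(matched) == len(indexed) else 0.0

print(gated_score("red car", "a car is an automobile"))  # 0.0
print(gated_score("red car", "a red automobile"))        # 3.0
```

Because synonyms are expanded on both sides, a query that covers every position also covers every indexed term, so comparing term counts is enough in this sketch.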
