Elasticsearch highlighting on an ngram filter behaves strangely when min_gram is set to 1

So I have this index

    {
      "settings": {
        "index": {
          "number_of_replicas": 0,
          "analysis": {
            "analyzer": {
              "default": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": [ "lowercase", "my_ngram" ]
              }
            },
            "filter": {
              "my_ngram": {
                "type": "nGram",
                "min_gram": 2,
                "max_gram": 20
              }
            }
          }
        }
      }
    }

and I run this search through the tire gem

    {
      "query": {
        "query_string": {
          "query": "xyz",
          "default_operator": "AND"
        }
      },
      "sort": [ { "count": "desc" } ],
      "filter": {
        "term": { "active": true, "_type": null }
      },
      "highlight": {
        "fields": { "name": { } },
        "pre_tags": [ "<strong>" ],
        "post_tags": [ "</strong>" ]
      }
    }

and I have two entries that should match, named "xyz post" and "xyz question". When I run this search, the highlighted fields come back correctly:

    <strong>xyz</strong> question
    <strong>xyz</strong> post

Now here's the thing: as soon as I change min_gram to 1 in my index and reindex, the highlighted fields start coming back like this

    <strong>x</strong><strong>y</strong><strong>z</strong> pos<strong>xyz</strong>t
    <strong>x</strong><strong>y</strong><strong>z</strong> questio<strong>xyz</strong>n

I just don't understand why.

1 answer

Short answer

You need to check your mapping and see whether you are using the fast vector highlighter. Even then, you still need to be careful with your queries.

Detailed response

Suppose you are using a fresh instance of ES 0.20.4 on localhost.

Building on top of your example, let's add explicit mappings. Note that I am setting up two different analyses for the code field. The only difference is "term_vector":"with_positions_offsets".

    curl -X PUT localhost:9200/myindex -d '
    {
      "settings": {
        "index": {
          "number_of_replicas": 0,
          "number_of_shards": 1,
          "analysis": {
            "analyzer": {
              "default": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": [ "lowercase", "my_ngram" ]
              }
            },
            "filter": {
              "my_ngram": {
                "type": "nGram",
                "min_gram": 1,
                "max_gram": 20
              }
            }
          }
        }
      },
      "mappings": {
        "product": {
          "properties": {
            "code": {
              "type": "multi_field",
              "fields": {
                "code": {
                  "type": "string",
                  "analyzer": "default",
                  "store": "yes"
                },
                "code.ngram": {
                  "type": "string",
                  "analyzer": "default",
                  "store": "yes",
                  "term_vector": "with_positions_offsets"
                }
              }
            }
          }
        }
      }
    }'

Index some documents.

    curl -X POST 'localhost:9200/myindex/product' -d '{ "code" : "Samsung Galaxy i7500" }'
    curl -X POST 'localhost:9200/myindex/product' -d '{ "code" : "Samsung Galaxy 5 Europa" }'
    curl -X POST 'localhost:9200/myindex/product' -d '{ "code" : "Samsung Galaxy Mini" }'

And now we can run queries.

1) Search for "i" to see that a single-character search works with highlighting

    curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
      "fields": [ "code" ],
      "query": { "term": { "code": "i" } },
      "highlight": {
        "number_of_fragments": 0,
        "fields": { "code": {}, "code.ngram": {} }
      }
    }'

This returns two hits:

    # 1
    ...
    "fields" : { "code" : "Samsung Galaxy Mini" },
    "highlight" : {
      "code.ngram" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ],
      "code" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ]
    }

    # 2
    ...
    "fields" : { "code" : "Samsung Galaxy i7500" },
    "highlight" : {
      "code.ngram" : [ "Samsung Galaxy <em>i</em>7500" ],
      "code" : [ "Samsung Galaxy <em>i</em>7500" ]
    }

Both code and code.ngram were highlighted correctly this time. But with a longer query, things change quickly:

2) Search for 'y m'

    curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
      "fields": [ "code" ],
      "query": { "term": { "code": "y m" } },
      "highlight": {
        "number_of_fragments": 0,
        "fields": { "code": {}, "code.ngram": {} }
      }
    }'

This gives:

    "fields" : { "code" : "Samsung Galaxy Mini" },
    "highlight" : {
      "code.ngram" : [ "Samsung Galax<em>y M</em>ini" ],
      "code" : [ "Samsung Galaxy Min<em>y M</em>i" ]
    }

The code field is not highlighted correctly (similar to your case).

One important thing to note is that a term query is used here instead of query_string.
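
As a side note on why min_gram matters so much: here is a rough shell sketch that lists the character n-grams of a string for a given min_gram and max_gram. The ngrams function is a hypothetical helper written for this illustration, not part of Elasticsearch or Lucene, and the real nGram filter may emit its tokens in a different order and also records offsets, which this sketch ignores.

```shell
# Print every character n-gram of a string, shortest grams first.
# Usage: ngrams STRING MIN_GRAM MAX_GRAM
# Illustration only: this is not the Lucene nGram filter.
ngrams() {
  s=$1
  min=$2
  max=$3
  len=${#s}
  n=$min
  while [ "$n" -le "$max" ] && [ "$n" -le "$len" ]; do
    start=1
    while [ $((start + n - 1)) -le "$len" ]; do
      # cut -c takes 1-based, inclusive character positions
      printf '%s\n' "$s" | cut -c"$start"-"$((start + n - 1))"
      start=$((start + 1))
    done
    n=$((n + 1))
  done
}

# min_gram 2: prints xy, yz, xyz (one per line)
ngrams "xyz" 2 20

# min_gram 1: prints x, y, z, xy, yz, xyz (one per line)
ngrams "xyz" 1 20
```

With min_gram set to 1, every single character of "xyz" is a gram in its own right, which would be consistent with the per-letter <strong> tags in the question.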
