Phrase Match Edge NGram

I need to autocomplete phrases. For example, when I search for "dementia in alz", I want to get "dementia in Alzheimer's".

To do this, I configured the Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the request body. However, I get no results when I try to match a phrase.

What am I doing wrong?

My request:

 { "query":{ "multi_match":{ "query":"dementia in alz", "type":"phrase", "analyzer":"edge_ngram_analyzer", "fields":["_all"] } } } 

My mappings:

 ... "type" : { "_all" : { "analyzer" : "edge_ngram_analyzer", "search_analyzer" : "standard" }, "properties" : { "field" : { "type" : "string", "analyzer" : "edge_ngram_analyzer", "search_analyzer" : "standard" }, ... "settings" : { ... "analysis" : { "filter" : { "stem_possessive_filter" : { "name" : "possessive_english", "type" : "stemmer" } }, "analyzer" : { "edge_ngram_analyzer" : { "filter" : [ "lowercase" ], "tokenizer" : "edge_ngram_tokenizer" } }, "tokenizer" : { "edge_ngram_tokenizer" : { "token_chars" : [ "letter", "digit", "whitespace" ], "min_gram" : "2", "type" : "edgeNGram", "max_gram" : "25" } } } ... 

My documents:

 { "_score": 1.1152233, "_type": "Diagnosis", "_id": "AVZLfHfBE5CzEm8aJ3Xp", "_source": { "@timestamp": "2016-08-02T13:40:48.665Z", "type": "Diagnosis", "Document_ID": "Diagnosis_1400541", "Diagnosis": "F00.0 - Dementia in Alzheimer disease with early onset", "@version": "1", }, "_index": "carenotes" }, { "_score": 1.1152233, "_type": "Diagnosis", "_id": "AVZLfICrE5CzEm8aJ4Dc", "_source": { "@timestamp": "2016-08-02T13:40:51.240Z", "type": "Diagnosis", "Document_ID": "Diagnosis_1424351", "Diagnosis": "F00.1 - Dementia in Alzheimer disease with late onset", "@version": "1", }, "_index": "carenotes" } 

Analysis of the phrase "dementia in Alzheimer's disease":

 { "tokens": [ { "end_offset": 2, "token": "de", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 3, "token": "dem", "type": "word", "start_offset": 0, "position": 1 }, { "end_offset": 4, "token": "deme", "type": "word", "start_offset": 0, "position": 2 }, { "end_offset": 5, "token": "demen", "type": "word", "start_offset": 0, "position": 3 }, { "end_offset": 6, "token": "dement", "type": "word", "start_offset": 0, "position": 4 }, { "end_offset": 7, "token": "dementi", "type": "word", "start_offset": 0, "position": 5 }, { "end_offset": 8, "token": "dementia", "type": "word", "start_offset": 0, "position": 6 }, { "end_offset": 9, "token": "dementia ", "type": "word", "start_offset": 0, "position": 7 }, { "end_offset": 10, "token": "dementia i", "type": "word", "start_offset": 0, "position": 8 }, { "end_offset": 11, "token": "dementia in", "type": "word", "start_offset": 0, "position": 9 }, { "end_offset": 12, "token": "dementia in ", "type": "word", "start_offset": 0, "position": 10 }, { "end_offset": 13, "token": "dementia in a", "type": "word", "start_offset": 0, "position": 11 }, { "end_offset": 14, "token": "dementia in al", "type": "word", "start_offset": 0, "position": 12 }, { "end_offset": 15, "token": "dementia in alz", "type": "word", "start_offset": 0, "position": 13 }, { "end_offset": 16, "token": "dementia in alzh", "type": "word", "start_offset": 0, "position": 14 }, { "end_offset": 17, "token": "dementia in alzhe", "type": "word", "start_offset": 0, "position": 15 }, { "end_offset": 18, "token": "dementia in alzhei", "type": "word", "start_offset": 0, "position": 16 }, { "end_offset": 19, "token": "dementia in alzheim", "type": "word", "start_offset": 0, "position": 17 }, { "end_offset": 20, "token": "dementia in alzheime", "type": "word", "start_offset": 0, "position": 18 }, { "end_offset": 21, "token": "dementia in alzheimer", "type": "word", "start_offset": 0, "position": 19 } ] } 
2 answers

Many thanks to rendel who helped me find the right solution!

Andrey Stefan's solution (the second answer below) is not optimal.

Why? First, the absence of a lowercase filter in the search analyzer makes the search case-sensitive: the case of the query has to match the indexed text exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword" .
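If one wanted to keep a keyword-style search analyzer, a custom definition along these lines would at least make matching case-insensitive. This is only a sketch; the name lowercase_keyword_analyzer is illustrative and not part of the original settings:

 # sketch: lowercase_keyword_analyzer is an illustrative name, not part of the original settings
 "analyzer": {
   "lowercase_keyword_analyzer": {
     "tokenizer": "keyword",
     "filter": ["lowercase"]
   }
 }

The settings further below take a different route and use the standard tokenizer on both the index and the search side instead.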

Second, the analysis part is wrong! At index time, the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer . With this analyzer, we get the following array of tokens for the analyzed string:

 { "tokens": [ { "end_offset": 2, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 3, "token": "f00", "type": "word", "start_offset": 0, "position": 1 }, { "end_offset": 6, "token": "0 ", "type": "word", "start_offset": 4, "position": 2 }, { "end_offset": 9, "token": " ", "type": "word", "start_offset": 7, "position": 3 }, { "end_offset": 10, "token": " d", "type": "word", "start_offset": 7, "position": 4 }, { "end_offset": 11, "token": " de", "type": "word", "start_offset": 7, "position": 5 }, { "end_offset": 12, "token": " dem", "type": "word", "start_offset": 7, "position": 6 }, { "end_offset": 13, "token": " deme", "type": "word", "start_offset": 7, "position": 7 }, { "end_offset": 14, "token": " demen", "type": "word", "start_offset": 7, "position": 8 }, { "end_offset": 15, "token": " dement", "type": "word", "start_offset": 7, "position": 9 }, { "end_offset": 16, "token": " dementi", "type": "word", "start_offset": 7, "position": 10 }, { "end_offset": 17, "token": " dementia", "type": "word", "start_offset": 7, "position": 11 }, { "end_offset": 18, "token": " dementia ", "type": "word", "start_offset": 7, "position": 12 }, { "end_offset": 19, "token": " dementia i", "type": "word", "start_offset": 7, "position": 13 }, { "end_offset": 20, "token": " dementia in", "type": "word", "start_offset": 7, "position": 14 }, { "end_offset": 21, "token": " dementia in ", "type": "word", "start_offset": 7, "position": 15 }, { "end_offset": 22, "token": " dementia in a", "type": "word", "start_offset": 7, "position": 16 }, { "end_offset": 23, "token": " dementia in al", "type": "word", "start_offset": 7, "position": 17 }, { "end_offset": 24, "token": " dementia in alz", "type": "word", "start_offset": 7, "position": 18 }, { "end_offset": 25, "token": " dementia in alzh", "type": "word", "start_offset": 7, "position": 19 }, { "end_offset": 26, "token": " dementia in alzhe", "type": "word", "start_offset": 7, "position": 20 }, { "end_offset": 27, "token": " dementia in alzhei", "type": "word", "start_offset": 7, "position": 21 }, { "end_offset": 28, "token": " dementia in alzheim", "type": "word", "start_offset": 7, "position": 22 }, { "end_offset": 29, "token": " dementia in alzheime", "type": "word", "start_offset": 7, "position": 23 }, { "end_offset": 30, "token": " dementia in alzheimer", "type": "word", "start_offset": 7, "position": 24 }, { "end_offset": 33, "token": "s ", "type": "word", "start_offset": 31, "position": 25 }, { "end_offset": 34, "token": "sd", "type": "word", "start_offset": 31, "position": 26 }, { "end_offset": 35, "token": "s di", "type": "word", "start_offset": 31, "position": 27 }, { "end_offset": 36, "token": "s dis", "type": "word", "start_offset": 31, "position": 28 }, { "end_offset": 37, "token": "s dise", "type": "word", "start_offset": 31, "position": 29 }, { "end_offset": 38, "token": "s disea", "type": "word", "start_offset": 31, "position": 30 }, { "end_offset": 39, "token": "s diseas", "type": "word", "start_offset": 31, "position": 31 }, { "end_offset": 40, "token": "s disease", "type": "word", "start_offset": 31, "position": 32 }, { "end_offset": 41, "token": "s disease ", "type": "word", "start_offset": 31, "position": 33 }, { "end_offset": 42, "token": "s disease w", "type": "word", "start_offset": 31, "position": 34 }, { "end_offset": 43, "token": "s disease wi", "type": "word", "start_offset": 31, "position": 35 }, { "end_offset": 44, "token": "s disease wit", "type": "word", "start_offset": 31, "position": 36 }, { 
"end_offset": 45, "token": "s disease with", "type": "word", "start_offset": 31, "position": 37 }, { "end_offset": 46, "token": "s disease with ", "type": "word", "start_offset": 31, "position": 38 }, { "end_offset": 47, "token": "s disease with e", "type": "word", "start_offset": 31, "position": 39 }, { "end_offset": 48, "token": "s disease with ea", "type": "word", "start_offset": 31, "position": 40 }, { "end_offset": 49, "token": "s disease with ear", "type": "word", "start_offset": 31, "position": 41 }, { "end_offset": 50, "token": "s disease with earl", "type": "word", "start_offset": 31, "position": 42 }, { "end_offset": 51, "token": "s disease with early", "type": "word", "start_offset": 31, "position": 43 }, { "end_offset": 52, "token": "s disease with early ", "type": "word", "start_offset": 31, "position": 44 }, { "end_offset": 53, "token": "s disease with early o", "type": "word", "start_offset": 31, "position": 45 }, { "end_offset": 54, "token": "s disease with early on", "type": "word", "start_offset": 31, "position": 46 }, { "end_offset": 55, "token": "s disease with early ons", "type": "word", "start_offset": 31, "position": 47 }, { "end_offset": 56, "token": "s disease with early onse", "type": "word", "start_offset": 31, "position": 48 } ] } 

As you can see, the whole string is tokenized into tokens from 2 to 25 characters long. The string is tokenized linearly, together with all the whitespace, and the position is incremented by one for every new token.

There are several problems with it:

  • edge_ngram_analyzer produced unusable tokens that will never be searched for, for example: "0 ", " ", " d", "s ", "sd", etc.
  • It also did not create many useful tokens that could be searched for, for example: "disease", "early onset", etc. You get 0 results if you try to search for any of these words.
  • Note that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we "lost" some text in all fields. You can no longer search for that text because there are no tokens for it.
  • The trim filter only obfuscates the problem by filtering out extra whitespace, when this could be done by the tokenizer.
  • edge_ngram_analyzer increments the position of every token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram token filter instead, which preserves the position of the token when producing the ngrams.

Optimal solution.

The mapping settings to use:

 ... "mappings": { "Type": { "_all":{ "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" }, "properties": { "Field": { "search_analyzer": "keyword_analyzer", "type": "string", "analyzer": "edge_ngram_analyzer" }, ... ... "settings": { "analysis": { "filter": { "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" }, "edge_ngram": { "type": "edgeNGram", "min_gram": "2", "max_gram": "25", "token_chars": ["letter", "digit"] } }, "analyzer": { "edge_ngram_analyzer": { "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"], "tokenizer": "standard" }, "keyword_analyzer": { "filter": ["lowercase", "english_poss_stemmer"], "tokenizer": "standard" } } } } ... 

Look at the analysis of the same string:

 { "tokens": [ { "end_offset": 5, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 17, "token": "de", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dem", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "deme", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "demen", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dement", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementi", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementia", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 20, "token": "in", "type": "word", "start_offset": 18, "position": 3 }, { "end_offset": 32, "token": "al", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alz", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzh", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhe", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhei", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheim", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheime", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheimer", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 40, "token": "di", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dis", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dise", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disea", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "diseas", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disease", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 45, "token": "wi", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "wit", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "with", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 51, "token": "ea", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "ear", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "earl", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "early", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 57, "token": "on", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "ons", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onse", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onset", "type": "word", "start_offset": 52, "position": 8 } ] } 

At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase , possessive_english and edge_ngram filters. Tokens are produced only for words. At search time the text is tokenized by the same standard tokenizer, then the separate words are filtered by lowercase and possessive_english . The resulting words are matched against the tokens that were created at index time.
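To see what happens on the search side, the query text can be run through keyword_analyzer with the _analyze API (again only a sketch, assuming the carenotes index from above). It yields the plain tokens dementia, in and alz at consecutive positions, which line up with the position-preserving ngrams created at index time:

 # sketch: assumes the index is called carenotes, as in the documents above
 POST /carenotes/_analyze
 {
   "analyzer": "keyword_analyzer",
   "text": "dementia in alz"
 }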

Thus, we make the incremental search possible!

Now, because we create ngrams from the separate words, we can even execute queries like

 {
   "query": {
     "multi_match": {
       "query": "dem in alzh",
       "type": "phrase",
       "fields": ["_all"]
     }
   }
 }

and get the right results.
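The same mechanism works against a concrete field instead of _all . For example, assuming the Diagnosis field from the documents above is mapped with the same analyzer pair, a plain match_phrase query behaves the same way:

 # sketch: assumes the Diagnosis field uses edge_ngram_analyzer / keyword_analyzer
 {
   "query": {
     "match_phrase": {
       "Diagnosis": "dem in alzh"
     }
   }
 }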

The text is not "lost", everything is searchable, and there is no need to handle spaces with the trim filter.


I believe your query is incorrect: while you need nGrams at indexing time, you do not need them at search time. At search time you want the text to be kept as close to the original as possible. Try this query instead:

 { "query": { "multi_match": { "query": " dementia in alz", "analyzer": "keyword", "fields": [ "_all" ] } } } 

Notice the two spaces before dementia . They come from the way your analyzer tokenizes the text. To get rid of them, you need the trim token_filter:

  "edge_ngram_analyzer": { "filter": [ "lowercase","trim" ], "tokenizer": "edge_ngram_tokenizer" } 

And then this query will work (without the spaces before dementia ):

 { "query": { "multi_match": { "query": "dementia in alz", "analyzer": "keyword", "fields": [ "_all" ] } } } 
