Phrase Match Edge NGram

I need to autocomplete phrases. For example, when I search for "dementia in alz", I want to get "dementia in Alzheimer's".

To do this, I configured the Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the request body. However, I get no results when I try to match a phrase.

What am I doing wrong?

My request:

 { "query":{ "multi_match":{ "query":"dementia in alz", "type":"phrase", "analyzer":"edge_ngram_analyzer", "fields":["_all"] } } } 

My mappings:

 ... "type" : { "_all" : { "analyzer" : "edge_ngram_analyzer", "search_analyzer" : "standard" }, "properties" : { "field" : { "type" : "string", "analyzer" : "edge_ngram_analyzer", "search_analyzer" : "standard" }, ... "settings" : { ... "analysis" : { "filter" : { "stem_possessive_filter" : { "name" : "possessive_english", "type" : "stemmer" } }, "analyzer" : { "edge_ngram_analyzer" : { "filter" : [ "lowercase" ], "tokenizer" : "edge_ngram_tokenizer" } }, "tokenizer" : { "edge_ngram_tokenizer" : { "token_chars" : [ "letter", "digit", "whitespace" ], "min_gram" : "2", "type" : "edgeNGram", "max_gram" : "25" } } } ... 

My documents:

 { "_score": 1.1152233, "_type": "Diagnosis", "_id": "AVZLfHfBE5CzEm8aJ3Xp", "_source": { "@timestamp": "2016-08-02T13:40:48.665Z", "type": "Diagnosis", "Document_ID": "Diagnosis_1400541", "Diagnosis": "F00.0 - Dementia in Alzheimer disease with early onset", "@version": "1", }, "_index": "carenotes" }, { "_score": 1.1152233, "_type": "Diagnosis", "_id": "AVZLfICrE5CzEm8aJ4Dc", "_source": { "@timestamp": "2016-08-02T13:40:51.240Z", "type": "Diagnosis", "Document_ID": "Diagnosis_1424351", "Diagnosis": "F00.1 - Dementia in Alzheimer disease with late onset", "@version": "1", }, "_index": "carenotes" } 

Analysis of the phrase "dementia in Alzheimer's disease":

 { "tokens": [ { "end_offset": 2, "token": "de", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 3, "token": "dem", "type": "word", "start_offset": 0, "position": 1 }, { "end_offset": 4, "token": "deme", "type": "word", "start_offset": 0, "position": 2 }, { "end_offset": 5, "token": "demen", "type": "word", "start_offset": 0, "position": 3 }, { "end_offset": 6, "token": "dement", "type": "word", "start_offset": 0, "position": 4 }, { "end_offset": 7, "token": "dementi", "type": "word", "start_offset": 0, "position": 5 }, { "end_offset": 8, "token": "dementia", "type": "word", "start_offset": 0, "position": 6 }, { "end_offset": 9, "token": "dementia ", "type": "word", "start_offset": 0, "position": 7 }, { "end_offset": 10, "token": "dementia i", "type": "word", "start_offset": 0, "position": 8 }, { "end_offset": 11, "token": "dementia in", "type": "word", "start_offset": 0, "position": 9 }, { "end_offset": 12, "token": "dementia in ", "type": "word", "start_offset": 0, "position": 10 }, { "end_offset": 13, "token": "dementia in a", "type": "word", "start_offset": 0, "position": 11 }, { "end_offset": 14, "token": "dementia in al", "type": "word", "start_offset": 0, "position": 12 }, { "end_offset": 15, "token": "dementia in alz", "type": "word", "start_offset": 0, "position": 13 }, { "end_offset": 16, "token": "dementia in alzh", "type": "word", "start_offset": 0, "position": 14 }, { "end_offset": 17, "token": "dementia in alzhe", "type": "word", "start_offset": 0, "position": 15 }, { "end_offset": 18, "token": "dementia in alzhei", "type": "word", "start_offset": 0, "position": 16 }, { "end_offset": 19, "token": "dementia in alzheim", "type": "word", "start_offset": 0, "position": 17 }, { "end_offset": 20, "token": "dementia in alzheime", "type": "word", "start_offset": 0, "position": 18 }, { "end_offset": 21, "token": "dementia in alzheimer", "type": "word", "start_offset": 0, "position": 19 } ] } 
2 answers

Many thanks to rendel who helped me find the right solution!

Andrey Stefan's solution (the second answer below) is not optimal.

Why? First, the absence of a lowercase filter in the search analyzer makes the search case-sensitive: the case of the query has to match the indexed text exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword" .
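If one wanted to keep a keyword-style search analyzer, a custom definition along these lines would at least make matching case-insensitive. This is only a sketch; the name lowercase_keyword_analyzer is illustrative and not part of the original settings:

 # sketch: lowercase_keyword_analyzer is an illustrative name, not part of the original settings
 "analyzer": {
   "lowercase_keyword_analyzer": {
     "tokenizer": "keyword",
     "filter": ["lowercase"]
   }
 }

The settings further below take a different route and use the standard tokenizer on both the index and the search side instead.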

Second, the analysis part is wrong! At index time, the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer . With this analyzer, we get the following array of tokens for the analyzed string:

 { "tokens": [ { "end_offset": 2, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 3, "token": "f00", "type": "word", "start_offset": 0, "position": 1 }, { "end_offset": 6, "token": "0 ", "type": "word", "start_offset": 4, "position": 2 }, { "end_offset": 9, "token": " ", "type": "word", "start_offset": 7, "position": 3 }, { "end_offset": 10, "token": " d", "type": "word", "start_offset": 7, "position": 4 }, { "end_offset": 11, "token": " de", "type": "word", "start_offset": 7, "position": 5 }, { "end_offset": 12, "token": " dem", "type": "word", "start_offset": 7, "position": 6 }, { "end_offset": 13, "token": " deme", "type": "word", "start_offset": 7, "position": 7 }, { "end_offset": 14, "token": " demen", "type": "word", "start_offset": 7, "position": 8 }, { "end_offset": 15, "token": " dement", "type": "word", "start_offset": 7, "position": 9 }, { "end_offset": 16, "token": " dementi", "type": "word", "start_offset": 7, "position": 10 }, { "end_offset": 17, "token": " dementia", "type": "word", "start_offset": 7, "position": 11 }, { "end_offset": 18, "token": " dementia ", "type": "word", "start_offset": 7, "position": 12 }, { "end_offset": 19, "token": " dementia i", "type": "word", "start_offset": 7, "position": 13 }, { "end_offset": 20, "token": " dementia in", "type": "word", "start_offset": 7, "position": 14 }, { "end_offset": 21, "token": " dementia in ", "type": "word", "start_offset": 7, "position": 15 }, { "end_offset": 22, "token": " dementia in a", "type": "word", "start_offset": 7, "position": 16 }, { "end_offset": 23, "token": " dementia in al", "type": "word", "start_offset": 7, "position": 17 }, { "end_offset": 24, "token": " dementia in alz", "type": "word", "start_offset": 7, "position": 18 }, { "end_offset": 25, "token": " dementia in alzh", "type": "word", "start_offset": 7, "position": 19 }, { "end_offset": 26, "token": " dementia in alzhe", "type": "word", "start_offset": 7, "position": 20 }, { "end_offset": 27, "token": " dementia in alzhei", "type": "word", "start_offset": 7, "position": 21 }, { "end_offset": 28, "token": " dementia in alzheim", "type": "word", "start_offset": 7, "position": 22 }, { "end_offset": 29, "token": " dementia in alzheime", "type": "word", "start_offset": 7, "position": 23 }, { "end_offset": 30, "token": " dementia in alzheimer", "type": "word", "start_offset": 7, "position": 24 }, { "end_offset": 33, "token": "s ", "type": "word", "start_offset": 31, "position": 25 }, { "end_offset": 34, "token": "sd", "type": "word", "start_offset": 31, "position": 26 }, { "end_offset": 35, "token": "s di", "type": "word", "start_offset": 31, "position": 27 }, { "end_offset": 36, "token": "s dis", "type": "word", "start_offset": 31, "position": 28 }, { "end_offset": 37, "token": "s dise", "type": "word", "start_offset": 31, "position": 29 }, { "end_offset": 38, "token": "s disea", "type": "word", "start_offset": 31, "position": 30 }, { "end_offset": 39, "token": "s diseas", "type": "word", "start_offset": 31, "position": 31 }, { "end_offset": 40, "token": "s disease", "type": "word", "start_offset": 31, "position": 32 }, { "end_offset": 41, "token": "s disease ", "type": "word", "start_offset": 31, "position": 33 }, { "end_offset": 42, "token": "s disease w", "type": "word", "start_offset": 31, "position": 34 }, { "end_offset": 43, "token": "s disease wi", "type": "word", "start_offset": 31, "position": 35 }, { "end_offset": 44, "token": "s disease wit", "type": "word", "start_offset": 31, "position": 36 }, { 
"end_offset": 45, "token": "s disease with", "type": "word", "start_offset": 31, "position": 37 }, { "end_offset": 46, "token": "s disease with ", "type": "word", "start_offset": 31, "position": 38 }, { "end_offset": 47, "token": "s disease with e", "type": "word", "start_offset": 31, "position": 39 }, { "end_offset": 48, "token": "s disease with ea", "type": "word", "start_offset": 31, "position": 40 }, { "end_offset": 49, "token": "s disease with ear", "type": "word", "start_offset": 31, "position": 41 }, { "end_offset": 50, "token": "s disease with earl", "type": "word", "start_offset": 31, "position": 42 }, { "end_offset": 51, "token": "s disease with early", "type": "word", "start_offset": 31, "position": 43 }, { "end_offset": 52, "token": "s disease with early ", "type": "word", "start_offset": 31, "position": 44 }, { "end_offset": 53, "token": "s disease with early o", "type": "word", "start_offset": 31, "position": 45 }, { "end_offset": 54, "token": "s disease with early on", "type": "word", "start_offset": 31, "position": 46 }, { "end_offset": 55, "token": "s disease with early ons", "type": "word", "start_offset": 31, "position": 47 }, { "end_offset": 56, "token": "s disease with early onse", "type": "word", "start_offset": 31, "position": 48 } ] } 

As you can see, the whole string is tokenized into tokens from 2 to 25 characters long. The string is tokenized linearly, together with all the whitespace, and the position is incremented by one for every new token.

There are several problems with it:

  • edge_ngram_analyzer produced unusable tokens that will never be searched for, for example: "0 ", " ", " d", "s ", "sd", etc.
  • It also did not create many useful tokens that could be searched for, for example: "disease", "early onset", etc. You get 0 results if you try to search for any of these words.
  • Note that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we "lost" some text in all fields. You can no longer search for that text because there are no tokens for it.
  • The trim filter only obfuscates the problem by filtering out extra whitespace, when this could be done by the tokenizer.
  • edge_ngram_analyzer increments the position of every token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram token filter instead, which preserves the position of the token when producing the ngrams.

Optimal solution.

The mapping settings to use:

 ... "mappings": { "Type": { "_all":{ "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" }, "properties": { "Field": { "search_analyzer": "keyword_analyzer", "type": "string", "analyzer": "edge_ngram_analyzer" }, ... ... "settings": { "analysis": { "filter": { "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" }, "edge_ngram": { "type": "edgeNGram", "min_gram": "2", "max_gram": "25", "token_chars": ["letter", "digit"] } }, "analyzer": { "edge_ngram_analyzer": { "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"], "tokenizer": "standard" }, "keyword_analyzer": { "filter": ["lowercase", "english_poss_stemmer"], "tokenizer": "standard" } } } } ... 

Look at the analysis of the same string:

 { "tokens": [ { "end_offset": 5, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 17, "token": "de", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dem", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "deme", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "demen", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dement", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementi", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementia", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 20, "token": "in", "type": "word", "start_offset": 18, "position": 3 }, { "end_offset": 32, "token": "al", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alz", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzh", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhe", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhei", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheim", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheime", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheimer", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 40, "token": "di", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dis", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dise", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disea", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "diseas", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disease", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 45, "token": "wi", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "wit", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "with", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 51, "token": "ea", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "ear", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "earl", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "early", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 57, "token": "on", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "ons", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onse", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onset", "type": "word", "start_offset": 52, "position": 8 } ] } 

At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase , possessive_english and edge_ngram filters. Tokens are produced only for words. At search time the text is tokenized by the same standard tokenizer, then the separate words are filtered by lowercase and possessive_english . The resulting words are matched against the tokens that were created at index time.
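To see what happens on the search side, the query text can be run through keyword_analyzer with the _analyze API (again only a sketch, assuming the carenotes index from above). It yields the plain tokens dementia, in and alz at consecutive positions, which line up with the position-preserving ngrams created at index time:

 # sketch: assumes the index is called carenotes, as in the documents above
 POST /carenotes/_analyze
 {
   "analyzer": "keyword_analyzer",
   "text": "dementia in alz"
 }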

Thus, we make the incremental search possible!

Now, because we create ngrams from the separate words, we can even execute queries like

 {
   "query": {
     "multi_match": {
       "query": "dem in alzh",
       "type": "phrase",
       "fields": ["_all"]
     }
   }
 }

and get the right results.
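The same mechanism works against a concrete field instead of _all . For example, assuming the Diagnosis field from the documents above is mapped with the same analyzer pair, a plain match_phrase query behaves the same way:

 # sketch: assumes the Diagnosis field uses edge_ngram_analyzer / keyword_analyzer
 {
   "query": {
     "match_phrase": {
       "Diagnosis": "dem in alzh"
     }
   }
 }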

The text is not "lost", everything is searchable, and there is no need to handle spaces with the trim filter.


I believe your query is incorrect: while you need nGrams at indexing time, you do not need them at search time. At search time you want the text to be kept as close to the original as possible. Try this query instead:

 { "query": { "multi_match": { "query": " dementia in alz", "analyzer": "keyword", "fields": [ "_all" ] } } } 

Notice the two spaces before dementia . They come from the way your analyzer tokenizes the text. To get rid of them, you need the trim token_filter:

  "edge_ngram_analyzer": { "filter": [ "lowercase","trim" ], "tokenizer": "edge_ngram_tokenizer" } 

And then this query will work (without the spaces before dementia ):

 { "query": { "multi_match": { "query": "dementia in alz", "analyzer": "keyword", "fields": [ "_all" ] } } } 
