Many thanks to rendel who helped me find the right solution!
Andrey Stefan's solution is not optimal.
Why? First, the absence of a lowercase filter in the search analyzer makes searching inconvenient: the query has to match the case of the indexed text exactly. A custom analyzer with a lowercase filter is required instead of "analyzer": "keyword".
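Purely as an illustration (this snippet is mine, not part of the original question or answer; the analyzer name is a placeholder), such a custom analyzer could keep the keyword tokenizer and simply add lowercasing:

"settings": {
    "analysis": {
        "analyzer": {
            "keyword_lowercase": {
                "tokenizer": "keyword",
                "filter": ["lowercase"]
            }
        }
    }
}

The full solution below goes further and uses the standard tokenizer, so that individual words are tokenized as well.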
Second, the analysis part is wrong! At index time, the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer, we get the following array of dictionaries as the analyzed string:
{ "tokens": [ { "end_offset": 2, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 3, "token": "f00", "type": "word", "start_offset": 0, "position": 1 }, { "end_offset": 6, "token": "0 ", "type": "word", "start_offset": 4, "position": 2 }, { "end_offset": 9, "token": " ", "type": "word", "start_offset": 7, "position": 3 }, { "end_offset": 10, "token": " d", "type": "word", "start_offset": 7, "position": 4 }, { "end_offset": 11, "token": " de", "type": "word", "start_offset": 7, "position": 5 }, { "end_offset": 12, "token": " dem", "type": "word", "start_offset": 7, "position": 6 }, { "end_offset": 13, "token": " deme", "type": "word", "start_offset": 7, "position": 7 }, { "end_offset": 14, "token": " demen", "type": "word", "start_offset": 7, "position": 8 }, { "end_offset": 15, "token": " dement", "type": "word", "start_offset": 7, "position": 9 }, { "end_offset": 16, "token": " dementi", "type": "word", "start_offset": 7, "position": 10 }, { "end_offset": 17, "token": " dementia", "type": "word", "start_offset": 7, "position": 11 }, { "end_offset": 18, "token": " dementia ", "type": "word", "start_offset": 7, "position": 12 }, { "end_offset": 19, "token": " dementia i", "type": "word", "start_offset": 7, "position": 13 }, { "end_offset": 20, "token": " dementia in", "type": "word", "start_offset": 7, "position": 14 }, { "end_offset": 21, "token": " dementia in ", "type": "word", "start_offset": 7, "position": 15 }, { "end_offset": 22, "token": " dementia in a", "type": "word", "start_offset": 7, "position": 16 }, { "end_offset": 23, "token": " dementia in al", "type": "word", "start_offset": 7, "position": 17 }, { "end_offset": 24, "token": " dementia in alz", "type": "word", "start_offset": 7, "position": 18 }, { "end_offset": 25, "token": " dementia in alzh", "type": "word", "start_offset": 7, "position": 19 }, { "end_offset": 26, "token": " dementia in alzhe", "type": "word", "start_offset": 7, "position": 20 }, { "end_offset": 27, "token": " dementia in alzhei", "type": "word", "start_offset": 7, "position": 21 }, { "end_offset": 28, "token": " dementia in alzheim", "type": "word", "start_offset": 7, "position": 22 }, { "end_offset": 29, "token": " dementia in alzheime", "type": "word", "start_offset": 7, "position": 23 }, { "end_offset": 30, "token": " dementia in alzheimer", "type": "word", "start_offset": 7, "position": 24 }, { "end_offset": 33, "token": "s ", "type": "word", "start_offset": 31, "position": 25 }, { "end_offset": 34, "token": "sd", "type": "word", "start_offset": 31, "position": 26 }, { "end_offset": 35, "token": "s di", "type": "word", "start_offset": 31, "position": 27 }, { "end_offset": 36, "token": "s dis", "type": "word", "start_offset": 31, "position": 28 }, { "end_offset": 37, "token": "s dise", "type": "word", "start_offset": 31, "position": 29 }, { "end_offset": 38, "token": "s disea", "type": "word", "start_offset": 31, "position": 30 }, { "end_offset": 39, "token": "s diseas", "type": "word", "start_offset": 31, "position": 31 }, { "end_offset": 40, "token": "s disease", "type": "word", "start_offset": 31, "position": 32 }, { "end_offset": 41, "token": "s disease ", "type": "word", "start_offset": 31, "position": 33 }, { "end_offset": 42, "token": "s disease w", "type": "word", "start_offset": 31, "position": 34 }, { "end_offset": 43, "token": "s disease wi", "type": "word", "start_offset": 31, "position": 35 }, { "end_offset": 44, "token": "s disease wit", "type": "word", "start_offset": 31, "position": 36 }, { 
"end_offset": 45, "token": "s disease with", "type": "word", "start_offset": 31, "position": 37 }, { "end_offset": 46, "token": "s disease with ", "type": "word", "start_offset": 31, "position": 38 }, { "end_offset": 47, "token": "s disease with e", "type": "word", "start_offset": 31, "position": 39 }, { "end_offset": 48, "token": "s disease with ea", "type": "word", "start_offset": 31, "position": 40 }, { "end_offset": 49, "token": "s disease with ear", "type": "word", "start_offset": 31, "position": 41 }, { "end_offset": 50, "token": "s disease with earl", "type": "word", "start_offset": 31, "position": 42 }, { "end_offset": 51, "token": "s disease with early", "type": "word", "start_offset": 31, "position": 43 }, { "end_offset": 52, "token": "s disease with early ", "type": "word", "start_offset": 31, "position": 44 }, { "end_offset": 53, "token": "s disease with early o", "type": "word", "start_offset": 31, "position": 45 }, { "end_offset": 54, "token": "s disease with early on", "type": "word", "start_offset": 31, "position": 46 }, { "end_offset": 55, "token": "s disease with early ons", "type": "word", "start_offset": 31, "position": 47 }, { "end_offset": 56, "token": "s disease with early onse", "type": "word", "start_offset": 31, "position": 48 } ] }
As you can see, the whole string is tokenized into tokens from 2 to 25 characters in size. The string is tokenized in a linear way, together with all the spaces, and the position is incremented by one for every new token.
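For reference, output like the one above can be reproduced with the _analyze API. This is only a sketch: the index name my_index is a placeholder, and the request-body form shown here is the one accepted by recent Elasticsearch versions:

GET /my_index/_analyze
{
    "analyzer": "edge_ngram_analyzer",
    "text": "F00.0 - Dementia in Alzheimer's disease with early onset"
}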
There are several problems with it:
- The edge_ngram_analyzer produced useless tokens that will never be searched for, for example: "0 ", " ", " d", "sd", "s disease w", etc.
- It also did not produce many useful tokens that could be used, for example: "disease", "early onset", etc. A search for any of these words yields 0 results (see the example query right after this list).
- Note that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram": "25" we "lost" some text in all fields. You can no longer search for that text because there are no tokens for it.
- The trim filter only obscures the problem by filtering out extra spaces, when this could have been done by the tokenizer.
- The edge_ngram_analyzer increments the position of every token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram filter instead, which preserves the position of the token when generating the ngrams.
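To illustrate the "0 results" point, here is a sketch of such a search (my_index is a placeholder, and "Field" is assumed to be indexed with the problematic edge_ngram_analyzer and searched with the keyword analyzer). Since no "disease" token was ever produced at index time, this query matches nothing:

GET /my_index/_search
{
    "query": {
        "match": {
            "Field": "disease"
        }
    }
}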
The optimal solution.
The mapping settings to use:
... "mappings": { "Type": { "_all":{ "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" }, "properties": { "Field": { "search_analyzer": "keyword_analyzer", "type": "string", "analyzer": "edge_ngram_analyzer" }, ... ... "settings": { "analysis": { "filter": { "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" }, "edge_ngram": { "type": "edgeNGram", "min_gram": "2", "max_gram": "25", "token_chars": ["letter", "digit"] } }, "analyzer": { "edge_ngram_analyzer": { "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"], "tokenizer": "standard" }, "keyword_analyzer": { "filter": ["lowercase", "english_poss_stemmer"], "tokenizer": "standard" } } } } ...
Look at the analysis:
{ "tokens": [ { "end_offset": 5, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 17, "token": "de", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dem", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "deme", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "demen", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dement", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementi", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementia", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 20, "token": "in", "type": "word", "start_offset": 18, "position": 3 }, { "end_offset": 32, "token": "al", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alz", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzh", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhe", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhei", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheim", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheime", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheimer", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 40, "token": "di", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dis", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dise", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disea", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "diseas", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disease", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 45, "token": "wi", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "wit", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "with", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 51, "token": "ea", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "ear", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "earl", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "early", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 57, "token": "on", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "ons", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onse", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onset", "type": "word", "start_offset": 52, "position": 8 } ] }
At index time, the text is tokenized by the standard tokenizer, then the individual words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for words. At search time, the text is tokenized by the standard tokenizer, then the individual words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
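To see the search-time side concretely, running a query text such as "dem in alzh" (used in the phrase query below) through keyword_analyzer splits it into whole, lowercased words with no ngrams. Again just a sketch, with my_index as a placeholder:

GET /my_index/_analyze
{
    "analyzer": "keyword_analyzer",
    "text": "dem in alzh"
}

This yields the tokens "dem", "in" and "alzh", each of which matches one of the edge-ngram tokens produced at index time, at consecutive positions.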
Thus we make incremental search possible!
Now, because we build ngrams on separate words, we can even execute queries like
{
    'query': {
        'multi_match': {
            'query': 'dem in alzh',
            'type': 'phrase',
            'fields': ['_all']
        }
    }
}
and get the right results.
The text is not "lost", everything is searchable, and there is no need to handle spaces with the trim filter.