Exclude words from CamelCase tokenizer in Elasticsearch

Struggling to get iPhone matched when searching for iphone in Elasticsearch.

Since I have some source code, I definitely need a CamelCase tokenizer, but it seems to break iPhone into two terms, so iphone may not be found.

Does anyone know how to add exceptions to the splitting of camelCase words into tokens (camel + case)?

UPDATE: to be clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Any other solution?

UPDATE 2: @ChintanShah's answer offers a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is much more useful from the searcher's point of view. Indexing is also faster! The price to pay is index size, but it is an excellent solution.
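For reference, since that answer is not quoted here: one way to reproduce that token stream (just a sketch I pieced together, the camel_concat_analyzer and camel_shingles names are mine) is to chain a shingle filter with an empty token_separator after the word_delimiter split:

 "analysis": {
   "analyzer": {
     "camel_concat_analyzer": {
       "tokenizer": "whitespace",
       "filter": ["camel_filter", "lowercase", "camel_shingles"]
     }
   },
   "filter": {
     "camel_filter": {
       "type": "word_delimiter",
       "split_on_numerics": false,
       "protected_words": ["iPhone", "WiFi"]
     },
     "camel_shingles": {
       "type": "shingle",
       "min_shingle_size": 2,
       "max_shingle_size": 3,
       "token_separator": "",
       "output_unigrams": true
     }
   }
 }

Note that max_shingle_size caps how many parts get concatenated, so identifiers with more than three parts only get partial concatenations.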

1 answer

You can achieve your requirements with the word_delimiter token filter. This is my setup:

{ "settings": { "analysis": { "analyzer": { "camel_analyzer": { "tokenizer": "whitespace", "filter": [ "camel_filter", "lowercase", "asciifolding" ] } }, "filter": { "camel_filter": { "type": "word_delimiter", "generate_number_parts": false, "stem_english_possessive": false, "split_on_numerics": false, "protected_words": [ "iPhone", "WiFi" ] } } } }, "mappings": { } } 

This will split words on case changes, so NullPointerException will be tokenized as null, pointer, and exception, but iPhone and WiFi will remain as they are, since they are protected words. word_delimiter has many options for flexibility. You can also use preserve_original, which will help you a lot.
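For example, if you also want the full original token kept alongside the split parts, this is the same filter as above with preserve_original switched on (a minimal variation, not something your setup requires):

 "camel_filter": {
   "type": "word_delimiter",
   "generate_number_parts": false,
   "stem_english_possessive": false,
   "split_on_numerics": false,
   "preserve_original": true,
   "protected_words": ["iPhone", "WiFi"]
 }

With this, NullPointerException would come out as [nullpointerexception, null, pointer, exception] after lowercasing.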

 GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer 

Result

 { "tokens": [ { "token": "iphone", "start_offset": 0, "end_offset": 6, "type": "word", "position": 1 } ] } 

Now

 GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer 

Result

 { "tokens": [ { "token": "null", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 }, { "token": "pointer", "start_offset": 4, "end_offset": 11, "type": "word", "position": 2 }, { "token": "exception", "start_offset": 11, "end_offset": 20, "type": "word", "position": 3 } ] } 

Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
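If you do want the two-analyzer route, a rough sketch using multi-fields (field names are illustrative, and the exact mapping syntax varies between Elasticsearch versions):

 "mappings": {
   "properties": {
     "message": {
       "type": "text",
       "fields": {
         "camel": {
           "type": "text",
           "analyzer": "camel_analyzer"
         }
       }
     }
   }
 }

Queries can then target both message and message.camel and combine the scores.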

Hope this helps!
