Line breaks or punctuation marks as position gaps in Elasticsearch

Is there a way in Elasticsearch to configure an analyzer that creates position gaps between tokens at line breaks or punctuation marks?

Let's say I index an object with the following nonsensical string (with a line break) as one of its fields:

The quick brown fox runs after the rabbit.
Then comes the jumpy frog.

The standard analyzer produces the following tokens with the corresponding positions:

 0 the 1 quick 2 brown 3 fox 4 runs 5 after 6 the 7 rabbit 8 then 9 comes 10 the 11 jumpy 12 frog 

This means that a match_phrase query for "the rabbit then comes" will match this document. Is there a way to introduce a position gap between rabbit and then, so that the phrase does not match unless a slop is specified?
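The behaviour described here can be reproduced outside Elasticsearch with a rough stand-in for the standard analyzer. This is only a minimal Python sketch: `tokenize` and `phrase_matches` are hypothetical helpers, and the regex tokenizer merely approximates the standard analyzer for this simple input. It shows that rabbit and then end up at adjacent positions, so an exact phrase spanning the line break is found.

```python
import re

def tokenize(text):
    """Lowercase and split on non-word characters (an approximation only)."""
    return re.findall(r"\w+", text.lower())

def phrase_matches(tokens, phrase):
    """True if the phrase tokens occur at consecutive positions (slop 0)."""
    wanted = tokenize(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

text = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = tokenize(text)

# Print position/token pairs, 0-based as in the listing above.
for position, token in enumerate(tokens):
    print(position, token)

# The period and the line break leave no trace in the positions,
# so the phrase across the sentence boundary is matched.
print(phrase_matches(tokens, "the rabbit then comes"))
```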

Of course, a workaround would be to split the string into an array (one element per line) and use position_offset_gap in the field mapping, but I would much prefer to keep a single string with newline characters (and the final solution should allow larger position gaps for newlines than for, say, punctuation marks).

1 answer

I finally figured out a solution using char_filter to introduce additional tokens for line breaks and punctuation marks:

    PUT /index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_mapping": {
              "type": "mapping",
              "mappings": [
                ".=>\\n_PERIOD_\\n",
                "\\n=>\\n_NEWLINE_\\n"
              ]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": ["my_mapping"],
              "filter": ["lowercase"]
            }
          }
        }
      }
    }

Testing with the example string:

    POST /index/_analyze?analyzer=my_analyzer&pretty
    The quick brown fox runs after the rabbit.
    Then comes the jumpy frog.

gives the following result:

    {
      "tokens" : [ {
        "token" : "the",
        "start_offset" : 0,
        "end_offset" : 3,
        "type" : "<ALPHANUM>",
        "position" : 1
      }, {
        ... snip ...
        "token" : "rabbit",
        "start_offset" : 35,
        "end_offset" : 41,
        "type" : "<ALPHANUM>",
        "position" : 8
      }, {
        "token" : "_period_",
        "start_offset" : 41,
        "end_offset" : 41,
        "type" : "<ALPHANUM>",
        "position" : 9
      }, {
        "token" : "_newline_",
        "start_offset" : 42,
        "end_offset" : 42,
        "type" : "<ALPHANUM>",
        "position" : 10
      }, {
        "token" : "then",
        "start_offset" : 43,
        "end_offset" : 47,
        "type" : "<ALPHANUM>",
        "position" : 11
        ... snip ...
      } ]
    }
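
The effect of the extra marker tokens on phrase matching can be illustrated with a small simulation. This is a hedged Python sketch, not Elasticsearch code: `analyze` and `phrase_matches` are hypothetical helpers, and the regex tokenizer only approximates the standard tokenizer for this input. The replacements mirror the my_mapping char filter and are applied in a single pass, so inserted newlines are not mapped again.

```python
import re

# Same replacements as the "my_mapping" char filter above.
MAPPINGS = {".": "\n_PERIOD_\n", "\n": "\n_NEWLINE_\n"}

def analyze(text):
    """Apply the mapping in one pass, then lowercase and split into tokens."""
    mapped = re.sub(r"[.\n]", lambda m: MAPPINGS[m.group()], text)
    return re.findall(r"\w+", mapped.lower())

def phrase_matches(tokens, phrase):
    """True if the analyzed phrase occurs at consecutive positions (slop 0)."""
    wanted = analyze(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

text = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(text)

# Positions are printed 1-based to line up with the _analyze output above.
for position, token in enumerate(tokens, start=1):
    print(position, token)

# Two marker tokens now sit between "rabbit" and "then".
gap = tokens.index("then") - tokens.index("rabbit")
print(gap)

# The phrase across the sentence boundary no longer matches exactly,
# while a phrase inside one sentence still does.
print(phrase_matches(tokens, "the rabbit then comes"))
print(phrase_matches(tokens, "then comes the jumpy frog"))
```

In this simulation, as in the output above, rabbit and then are no longer adjacent, so an exact phrase query across the sentence boundary would need a slop to match.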