I finally figured out a solution using a `char_filter` to introduce additional tokens for line breaks and punctuation marks:
```json
PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            ".=>\\n_PERIOD_\\n",
            "\\n=>\\n_NEWLINE_\\n"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
Testing with an example string:

```
POST /index/_analyze?analyzer=my_analyzer&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
```
gives the following result:
```json
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    ... snip ...
    {
      "token" : "rabbit",
      "start_offset" : 35,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "_period_",
      "start_offset" : 41,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "_newline_",
      "start_offset" : 42,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "then",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    ... snip ...
  ]
}
```
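To see why this produces the token stream above, here is a minimal Python sketch that simulates the analyzer offline. The `mapping` char filter rewrites characters in a single pass (replacement text is not re-mapped), and the `standard` tokenizer keeps `_PERIOD_` and `_NEWLINE_` as single tokens; the whitespace split below is an approximation of that tokenizer, not Elasticsearch's actual implementation.

```python
# Simulated mapping char_filter: "." and "\n" become standalone marker tokens,
# mirroring the mappings [".=>\n_PERIOD_\n", "\n=>\n_NEWLINE_\n"] from the index settings.
MAPPINGS = {
    ".": "\n_PERIOD_\n",
    "\n": "\n_NEWLINE_\n",
}

def apply_char_filter(text: str, mappings: dict) -> str:
    # Single pass over the input so inserted "\n" characters are never
    # re-mapped, matching the behavior of the mapping char filter.
    return "".join(mappings.get(ch, ch) for ch in text)

def analyze(text: str) -> list:
    # char_filter -> tokenizer (approximated by a whitespace split) -> lowercase filter
    filtered = apply_char_filter(text, MAPPINGS)
    return [tok.lower() for tok in filtered.split()]

tokens = analyze("The quick brown fox runs after the rabbit.\n"
                 "Then comes the jumpy frog.")
# Produces: the, quick, ..., rabbit, _period_, _newline_, then, comes, ...
```

Because the markers are surrounded by newlines before tokenization, they always come out as separate tokens adjacent to the words they follow, which is what makes them usable for position-aware queries.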
Shadocko