ElasticSearch incorrectly indexes and queries non-alphanumeric characters

My ElasticSearch index incorrectly indexes and queries non-alphanumeric characters. In particular, dots and dashes create problems.

If I index a document with the name O.K. Corral, it should match queries for OK Corral. Similarly, if I index Whiskey A Go-Go, I would like it to match Whiskey A GoGo and Whiskey A Go Go.

Right now, only queries with the correct dots and dashes will return these documents.

I hope the solution also solves any potential problems with other non-alphanumeric characters such as commas and apostrophes.

This seems like a job for ElasticSearch token filters, but I could not find one that does what I'm looking for. In addition, I would like to do this within ElasticSearch: I do not want to write custom string manipulation to normalize the data before it goes into my ES index.

Thanks for your help!

1 answer

You might want to take a look at the Word Delimiter Token Filter. It will at least do what you want with Whiskey A GoGo and Whiskey A Go-Go. You can check its behavior in advance using the Analyze API.
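As a rough sketch of what that could look like, the snippet below builds index settings that use the `word_delimiter` token filter (assumed here to be the filter this answer refers to) and a request body for the `_analyze` endpoint. The names `punct_delimiter` and `name_analyzer` are made up for illustration, and the exact options and request format may differ between ElasticSearch versions, so check the output of the Analyze API against your own cluster before relying on it.

```python
import json

# Hypothetical names for illustration: "punct_delimiter" and "name_analyzer".
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "punct_delimiter": {
                    "type": "word_delimiter",
                    "generate_word_parts": True,  # "Go-Go" -> "Go", "Go"
                    "catenate_words": True,       # "Go-Go" -> "GoGo"
                    "preserve_original": True,    # also keep "Go-Go" itself
                }
            },
            "analyzer": {
                "name_analyzer": {
                    "type": "custom",
                    # The whitespace tokenizer leaves dots and dashes inside
                    # tokens, so the filter above gets a chance to split them.
                    "tokenizer": "whitespace",
                    "filter": ["punct_delimiter", "lowercase"],
                }
            },
        }
    }
}


def analyze_request(text, analyzer="name_analyzer"):
    """Build a JSON body to send to <index>/_analyze."""
    return {"analyzer": analyzer, "text": text}


if __name__ == "__main__":
    # Body for creating the index (e.g. PUT /<index>).
    print(json.dumps(INDEX_SETTINGS, indent=2))
    # Body for inspecting the resulting tokens (e.g. GET /<index>/_analyze).
    print(json.dumps(analyze_request("Whiskey A Go-Go")))
```

The same analyzer would need to be applied to the name field in the mapping so that both indexed text and query text pass through it; whether it also handles commas and apostrophes the way you want is something the Analyze API output should confirm.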


Source: https://habr.com/ru/post/924015/

