For an Elasticsearch query, we want to treat words (i.e. tokens consisting only of letters) and non-words differently. To do this, we want to define two analyzers: one that emits only the word tokens and one that emits only the non-word tokens.
For example, we have documents describing products for an equipment store:
{
  "name": "Torx drive T9",
  "category": "screws",
  "size": 2.5
}
The user will then search for “Torx T9” and expect to find this document. A search for “T9” alone would be too broad and return too many irrelevant products, so we only want the term “T9” to match if “Torx” has already matched.
We are trying to build a query along these lines:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": {
            "query": "Torx T9",
            "analyzer": "words"
          }
        }
      },
      "should": {
        "match": {
          "name": {
            "query": "Torx T9",
            "analyzer": "nonwords"
          }
        }
      }
    }
  }
}
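To make the intent of the must/should split concrete, here is a toy sketch in Python. This is not how Elasticsearch actually scores documents; the `score` function and token lists are made up purely to illustrate the idea that word terms select documents while non-word terms only boost them:

```python
def score(doc_tokens, query_tokens):
    """Toy model of the must/should split above (not real ES scoring)."""
    words = [t for t in query_tokens if t.isalpha()]
    nonwords = [t for t in query_tokens if not t.isalpha()]
    # must clause: the words-analyzed match must hit at least one term,
    # so "T9" alone can never select a document
    if not any(w in doc_tokens for w in words):
        return None
    # should clause: non-word hits only raise the score of selected docs
    return sum(1 for t in words + nonwords if t in doc_tokens)

print(score(["Torx", "drive", "T9"], ["Torx", "T9"]))     # → 2
print(score(["Philips", "drive", "T9"], ["Torx", "T9"]))  # → None
```

With this model, a document matching only “T9” is never returned, which is exactly the behavior we want from the query above.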
The idea is that it would be easy to create token filters for this. For example, something like:
"settings": {
  "analysis": {
    "filter": {
      "words": {
        "type": "pattern",
        "pattern": "\\A\\p{L}*\\Z"
      },
      "nonwords": {
        "type": "pattern",
        "pattern": "\\P{L}"
      }
    }
  }
}
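What the two patterns are meant to select can be checked in Python. The stdlib `re` module does not support `\p{L}`, so `[^\W\d_]` is used below as the usual stand-in for “one Unicode letter”; the token list is an assumed output of a standard, lowercasing tokenizer:

```python
import re

# Stdlib re lacks \p{L}; [^\W\d_] matches exactly one Unicode letter.
IS_WORD = re.compile(r"\A[^\W\d_]*\Z")   # mirrors \A\p{L}*\Z
HAS_NONLETTER = re.compile(r"[\W\d_]")   # mirrors \P{L}

tokens = ["torx", "drive", "t9"]  # assumed lowercased tokenizer output
words = [t for t in tokens if IS_WORD.fullmatch(t)]
nonwords = [t for t in tokens if HAS_NONLETTER.search(t)]
print(words)     # → ['torx', 'drive']
print(nonwords)  # → ['t9']
```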
Unfortunately, no such "pattern" token filter type exists, so instead I have to (ab)use pattern_replace together with a length filter:
"settings": {
  "analysis": {
    "filter": {
      "words": {
        "type": "pattern_replace",
        "pattern": "\\A((?=.*\\P{L}).*)",
        "replacement": ""
      },
      "nonwords": {
        "type": "pattern_replace",
        "pattern": "\\A((?!.*\\P{L}).*)",
        "replacement": ""
      },
      "nonempty": {
        "type": "length",
        "min": 1
      }
    }
  }
}
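As a sanity check, the pattern_replace-plus-length pipeline can be simulated in Python. Again `[\W\d_]` stands in for `\P{L}`, since stdlib `re` has no Unicode property classes; the token list is assumed tokenizer output:

```python
import re

# [\W\d_] is a stdlib stand-in for \P{L} (any non-letter character).
ERASE_NONWORDS = re.compile(r"\A((?=.*[\W\d_]).*)", re.DOTALL)  # the "words" filter
ERASE_WORDS = re.compile(r"\A((?!.*[\W\d_]).*)", re.DOTALL)     # the "nonwords" filter

def apply_filters(tokens, erase):
    # pattern_replace blanks out the matching tokens ...
    replaced = [erase.sub("", t) for t in tokens]
    # ... and the length filter (min: 1) drops the now-empty ones
    return [t for t in replaced if len(t) >= 1]

tokens = ["torx", "drive", "t9"]
print(apply_filters(tokens, ERASE_NONWORDS))  # words survive: ['torx', 'drive']
print(apply_filters(tokens, ERASE_WORDS))     # non-words survive: ['t9']
```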
This works, but it feels like a hack. Is there a simpler, more idiomatic way to split tokens into words and non-words?