For an Elasticsearch query, we want to treat words (i.e. tokens consisting only of letters) and non-words differently. To do this, we want to define two analyzers: one that emits only the word tokens and one that emits only the non-word tokens.
For example, we have documents describing products for an equipment store:
{
  "name": "Torx drive T9",
  "category": "screws",
  "size": 2.5
}
The user will then search for “Torx T9” and expect to find this document. A search for “T9” alone would be too broad and return too many irrelevant products, so we only want the term “T9” to match if “Torx” has already matched.
We are trying to build a query along these lines:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": {
            "query": "Torx T9",
            "analyzer": "words"
          }
        }
      },
      "should": {
        "match": {
          "name": {
            "query": "Torx T9",
            "analyzer": "nonwords"
          }
        }
      }
    }
  }
}
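To make the intent of the must/should split concrete, here is a toy sketch in Python. This is not how Elasticsearch actually scores documents; the `score` function and token lists are made up purely to illustrate the idea that word terms select documents while non-word terms only boost them:

```python
def score(doc_tokens, query_tokens):
    """Toy model of the must/should split above (not real ES scoring)."""
    words = [t for t in query_tokens if t.isalpha()]
    nonwords = [t for t in query_tokens if not t.isalpha()]
    # must clause: the words-analyzed match must hit at least one term,
    # so "T9" alone can never select a document
    if not any(w in doc_tokens for w in words):
        return None
    # should clause: non-word hits only raise the score of selected docs
    return sum(1 for t in words + nonwords if t in doc_tokens)

print(score(["Torx", "drive", "T9"], ["Torx", "T9"]))     # → 2
print(score(["Philips", "drive", "T9"], ["Torx", "T9"]))  # → None
```

With this model, a document matching only “T9” is never returned, which is exactly the behavior we want from the query above.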
The idea is that it would be easy to create token filters for this. For example, something like:
"settings": {
  "analysis": {
    "filter": {
      "words": {
        "type": "pattern",
        "pattern": "\\A\\p{L}*\\Z"
      },
      "nonwords": {
        "type": "pattern",
        "pattern": "\\P{L}"
      }
    }
  }
}
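What the two patterns are meant to select can be checked in Python. The stdlib `re` module does not support `\p{L}`, so `[^\W\d_]` is used below as the usual stand-in for “one Unicode letter”; the token list is an assumed output of a standard, lowercasing tokenizer:

```python
import re

# Stdlib re lacks \p{L}; [^\W\d_] matches exactly one Unicode letter.
IS_WORD = re.compile(r"\A[^\W\d_]*\Z")   # mirrors \A\p{L}*\Z
HAS_NONLETTER = re.compile(r"[\W\d_]")   # mirrors \P{L}

tokens = ["torx", "drive", "t9"]  # assumed lowercased tokenizer output
words = [t for t in tokens if IS_WORD.fullmatch(t)]
nonwords = [t for t in tokens if HAS_NONLETTER.search(t)]
print(words)     # → ['torx', 'drive']
print(nonwords)  # → ['t9']
```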
Unfortunately, no such "pattern" token filter type exists, so instead I have to (ab)use pattern_replace together with a length filter:
"settings": {
  "analysis": {
    "filter": {
      "words": {
        "type": "pattern_replace",
        "pattern": "\\A((?=.*\\P{L}).*)",
        "replacement": ""
      },
      "nonwords": {
        "type": "pattern_replace",
        "pattern": "\\A((?!.*\\P{L}).*)",
        "replacement": ""
      },
      "nonempty": {
        "type": "length",
        "min": 1
      }
    }
  }
}
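As a sanity check, the pattern_replace-plus-length pipeline can be simulated in Python. Again `[\W\d_]` stands in for `\P{L}`, since stdlib `re` has no Unicode property classes; the token list is assumed tokenizer output:

```python
import re

# [\W\d_] is a stdlib stand-in for \P{L} (any non-letter character).
ERASE_NONWORDS = re.compile(r"\A((?=.*[\W\d_]).*)", re.DOTALL)  # the "words" filter
ERASE_WORDS = re.compile(r"\A((?!.*[\W\d_]).*)", re.DOTALL)     # the "nonwords" filter

def apply_filters(tokens, erase):
    # pattern_replace blanks out the matching tokens ...
    replaced = [erase.sub("", t) for t in tokens]
    # ... and the length filter (min: 1) drops the now-empty ones
    return [t for t in replaced if len(t) >= 1]

tokens = ["torx", "drive", "t9"]
print(apply_filters(tokens, ERASE_NONWORDS))  # words survive: ['torx', 'drive']
print(apply_filters(tokens, ERASE_WORDS))     # non-words survive: ['t9']
```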
This works, but it feels like a hack. Is there a simpler, more idiomatic way to split tokens into words and non-words?