Elasticsearch aggregation request with multiple exclusions

I have a bunch of company data in an ES database. I am looking for the number of documents submitted by each company, but I am having problems with the aggregation request. I want to exclude terms such as "Corporation" or "Inc." So far, I have been able to do this successfully for one term at a time, using the request below.

 { "aggs" : { "companies" : { "terms" : { "field" : "Companies.name", "exclude" : "corporation" } } } } 

Which returns

 "aggregations": { "assignee": { "buckets": [ { "key": "inc", "doc_count": 375 }, { "key": "company", "doc_count": 252 } ] } } 

Ideally, I would like to be able to do something like

 { "aggs" : { "companies" : { "terms" : { "field" : "Companies.name", "exclude" : ["corporation", "inc.", "inc", "co", "company", "the", "industries", "incorporated", "international"] } } } } 

But I could not find a way to do this that does not throw an error.

I reviewed the "Terms" section of the Aggregations chapter in the ES documentation and could only find an example with a single exclusion. Is it possible to exclude multiple terms, and if so, what is the correct syntax?

Note: I know that I could set the field to "not_analyzed" and get groupings for full company names rather than for the individual tokens. However, I hesitate to do this, because analysis makes the bucketing more tolerant of name variations (for example, Microsoft Corp and Microsoft Corporation).

2 answers

The exclude parameter is a regular expression, so you can use a regular expression that exhaustively lists all the options:

 "exclude" : "corporation|inc\\.|inc|co|company|the|industries|incorporated|international" 

When generating such a pattern programmatically, it is important to escape the values (e.g., the "." in "inc."). If the pattern is not generated, you can simplify parts of it by grouping (for example, inc\\.? covers inc\\.|inc, or, more complicated, co(mpany|rporation)? covers co, company, and corporation). If the list is long, it is probably worth checking how the added complexity affects performance.
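As a rough illustration of the escaping step, here is a minimal Python sketch that builds the alternation from a list of literal terms. The helper names (escape_term, build_exclude_pattern) are made up for this example, and the set of reserved characters is an assumption based on the Lucene-style regular expression syntax that Elasticsearch uses; check it against the documentation for your version.

```python
import json

# Characters assumed to be reserved in Lucene-style regular expressions.
# This list is an assumption for illustration; verify it for your ES version.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_term(term):
    """Backslash-escape every reserved character in a literal term."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in term)


def build_exclude_pattern(terms):
    """Join the escaped literal terms into a single alternation pattern."""
    return "|".join(escape_term(t) for t in terms)


terms = ["corporation", "inc.", "inc", "co", "company"]
pattern = build_exclude_pattern(terms)

# Same request shape as in the question, with the generated pattern plugged in.
request_body = {
    "aggs": {
        "companies": {
            "terms": {"field": "Companies.name", "exclude": pattern}
        }
    }
}

# json.dumps doubles the backslash, so "inc\." is serialized as "inc\\."
print(json.dumps(request_body, indent=2))
```

Serializing with json.dumps takes care of the JSON-level backslash doubling, so only the regex-level escaping has to be done by hand.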

Flags can also be added; these are the options that exist in the Java Pattern class. One that may come in handy is CASE_INSENSITIVE.

 "exclude" : { "pattern" : "...expression as before...", "flags" : "CASE_INSENSITIVE" } 

This is an old question, but here is a newer answer: an array is now supported for excluding exact matches of the listed items,

so the array syntax in the OP is now valid and works as expected (in addition to the regular-expression answer above).

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_exact_values
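For completeness, a small Python sketch of the array form, assuming a version of Elasticsearch that accepts an array for exclude as described in the linked documentation. The function name build_request is hypothetical; the field name and terms are taken from the question. Building the body as a dict and serializing it also avoids hand-written JSON slips such as a trailing comma.

```python
import json

# Terms to drop from the buckets; each entry is matched exactly, not as a regex.
EXCLUDED_TERMS = [
    "corporation", "inc.", "inc", "co", "company", "the",
    "industries", "incorporated", "international",
]


def build_request(field, excluded):
    """Return a terms-aggregation body that excludes the given exact terms."""
    return {
        "aggs": {
            "companies": {
                "terms": {"field": field, "exclude": list(excluded)}
            }
        }
    }


body = build_request("Companies.name", EXCLUDED_TERMS)

# The serialized JSON can be sent as the body of a search request.
print(json.dumps(body, indent=2))
```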
