Elasticsearch: group by multiple fields

I am looking for a better way to group data in Elasticsearch. Elasticsearch does not support anything like SQL's GROUP BY.

Let's say I have 1k categories and millions of products. What do you think is the best way to build a complete category tree? You need some metadata (icon, link target, SEO title, ...) and custom sorting for the categories.

For example, I could build the category tree with one of these three unsatisfying solutions. Solution 1 may work (ES 1 is not stable right now). Solution 2 does not work. Solution 3 is a pain because it feels ugly, you need to prepare a lot of data, and the facets blow up.

One alternative might be to store no category data in ES at all, only the ID: https://found.no/play/gist/a53e46c91e2bf077f2e1

Then you could fetch the related category from another system such as Redis, memcache, or a database.

This would result in clean code, but performance could become a problem. For example, loading 1k categories from memcache, Redis, or a database could be slow. Another problem is that keeping two databases in sync is harder than keeping one in sync.
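The ID-only approach described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested integration: the bucket list stands in for a terms aggregation on a category ID field, and `fetch_many` is a hypothetical batched lookup (for example a single Redis MGET or one SQL `IN` query). Fetching all categories in one round trip, rather than one call per ID, is what would keep the 1k-category case manageable.

```python
from typing import Callable, Dict, Iterable, List


def attach_category_metadata(
    buckets: List[dict],
    fetch_many: Callable[[Iterable[str]], Dict[str, dict]],
) -> List[dict]:
    """Merge terms-aggregation buckets with metadata kept outside Elasticsearch.

    `buckets` has the shape of a terms aggregation result:
    [{"key": <category id>, "doc_count": <count>}, ...].
    `fetch_many` is a hypothetical batched lookup (one round trip for all IDs).
    """
    ids = [str(bucket["key"]) for bucket in buckets]
    metadata = fetch_many(ids)  # single batched call, not one call per ID

    categories = []
    for bucket in buckets:
        meta = metadata.get(str(bucket["key"]), {})
        categories.append({
            "id": bucket["key"],
            "doc_count": bucket["doc_count"],
            "title": meta.get("title"),
            "icon": meta.get("icon"),
            "sort": meta.get("sort", 0),
        })
    # Custom ordering comes from the metadata store, not from Elasticsearch.
    categories.sort(key=lambda c: c["sort"])
    return categories


# In-memory stand-in for Redis/memcache/a database, for illustration only.
FAKE_STORE = {
    "10": {"title": "Shoes", "icon": "shoe.png", "sort": 2},
    "20": {"title": "Hats", "icon": "hat.png", "sort": 1},
}


def fake_fetch_many(ids):
    return {i: FAKE_STORE[i] for i in ids if i in FAKE_STORE}


example_buckets = [
    {"key": "10", "doc_count": 1200},
    {"key": "20", "doc_count": 300},
    {"key": "30", "doc_count": 5},
]
tree = attach_category_metadata(example_buckets, fake_fetch_many)
```

Categories without metadata (ID "30" here) keep their count but get empty fields, which makes out-of-sync data between the two stores visible instead of silently dropping it.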

How do you deal with such problems?

Sorry about the links, but I can't post more than two links in one post.

+6
3 answers

The aggregations API allows you to group by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2 and field3:

    {
      "aggs": {
        "agg1": {
          "terms": { "field": "field1" },
          "aggs": {
            "agg2": {
              "terms": { "field": "field2" },
              "aggs": {
                "agg3": {
                  "terms": { "field": "field3" }
                }
              }
            }
          }
        }
      }
    }

Of course, this can go on for as many fields as you like.

Update:
For completeness, here is what the result of the above query looks like. Below it is Python code for generating the aggregation request and flattening the result into a list of dictionaries.

    {
      "aggregations": {
        "agg1": {
          "buckets": [
            {
              "doc_count": <count>,
              "key": <value of field1>,
              "agg2": {
                "buckets": [
                  {
                    "doc_count": <count>,
                    "key": <value of field2>,
                    "agg3": {
                      "buckets": [
                        { "doc_count": <count>, "key": <value of field3> },
                        { "doc_count": <count>, "key": <value of field3> },
                        ...
                      ]
                    }
                  },
                  {
                    "doc_count": <count>,
                    "key": <value of field2>,
                    "agg3": {
                      "buckets": [
                        { "doc_count": <count>, "key": <value of field3> },
                        { "doc_count": <count>, "key": <value of field3> },
                        ...
                      ]
                    }
                  },
                  ...
                ]
              }
            },
            {
              "doc_count": <count>,
              "key": <value of field1>,
              "agg2": {
                "buckets": [
                  {
                    "doc_count": <count>,
                    "key": <value of field2>,
                    "agg3": {
                      "buckets": [
                        { "doc_count": <count>, "key": <value of field3> },
                        { "doc_count": <count>, "key": <value of field3> },
                        ...
                      ]
                    }
                  },
                  {
                    "doc_count": <count>,
                    "key": <value of field2>,
                    "agg3": {
                      "buckets": [
                        { "doc_count": <count>, "key": <value of field3> },
                        { "doc_count": <count>, "key": <value of field3> },
                        ...
                      ]
                    }
                  },
                  ...
                ]
              }
            },
            ...
          ]
        }
      }
    }

The following Python code performs the group-by given a list of fields. If you specify include_missing=True, it also includes combinations of values in which some of the fields are missing (you do not need this with Elasticsearch 2.0 or later, thanks to a feature added in that version).

    def group_by(es, fields, include_missing):
        # Outermost level: a terms aggregation on the first field.
        current_level_terms = {'terms': {'field': fields[0]}}
        agg_spec = {fields[0]: current_level_terms}

        if include_missing:
            current_level_missing = {'missing': {'field': fields[0]}}
            agg_spec[fields[0] + '_missing'] = current_level_missing

        # Nest one terms aggregation (and optionally a missing aggregation)
        # per remaining field.
        for field in fields[1:]:
            next_level_terms = {'terms': {'field': field}}
            current_level_terms['aggs'] = {
                field: next_level_terms,
            }

            if include_missing:
                next_level_missing = {'missing': {'field': field}}
                current_level_terms['aggs'][field + '_missing'] = next_level_missing
                current_level_missing['aggs'] = {
                    field: next_level_terms,
                    field + '_missing': next_level_missing,
                }
                current_level_missing = next_level_missing

            current_level_terms = next_level_terms

        agg_result = es.search(body={'aggs': agg_spec})['aggregations']
        return get_docs_from_agg_result(agg_result, fields, include_missing)


    def get_docs_from_agg_result(agg_result, fields, include_missing):
        current_field = fields[0]
        buckets = agg_result[current_field]['buckets']
        if include_missing:
            # The missing aggregation is a single bucket without a 'key'.
            buckets.append(agg_result[(current_field + '_missing')])

        # Innermost level: emit one record per non-empty bucket.
        if len(fields) == 1:
            return [
                {
                    current_field: bucket.get('key'),
                    'doc_count': bucket['doc_count'],
                }
                for bucket in buckets if bucket['doc_count'] > 0
            ]

        # Otherwise recurse and tag each nested record with this level's value.
        result = []
        for bucket in buckets:
            records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
            value = bucket.get('key')
            for record in records:
                record[current_field] = value
            result.extend(records)

        return result
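
For Elasticsearch 2.0 or later, the remark above about not needing include_missing could look something like the sketch below. The assumption here, which the answer does not confirm, is that the feature in question is the `missing` parameter of the terms aggregation, which counts documents lacking the field under a placeholder key. The function name, the placeholder "N/A" and the `size` default are illustrative only.

```python
def build_group_by_body(fields, missing_placeholder=None, size=10):
    """Build a nested terms-aggregation body, one level per field."""
    body = {}
    level = body
    for field in fields:
        terms = {"field": field, "size": size}
        if missing_placeholder is not None:
            # Documents lacking this field are counted under the placeholder key.
            terms["missing"] = missing_placeholder
        level["aggs"] = {field: {"terms": terms}}
        # Descend so the next field is nested under this one.
        level = level["aggs"][field]
    return body


body = build_group_by_body(["field1", "field2"], missing_placeholder="N/A")
```

The placeholder has to be compatible with the field's type, so a string such as "N/A" would only suit string fields; a numeric field would need a numeric placeholder.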
+13

I think some developers will be looking for the same implementation using the Spring Data Elasticsearch and Java ES APIs.

Here it is:

    List<FieldObject> fieldObjectList = Lists.newArrayList();

    SearchQuery aSearchQuery = new NativeSearchQueryBuilder()
            .withQuery(matchAllQuery())
            .withIndices(indexName)
            .withTypes(type)
            .addAggregation(
                    terms("ByField1").field("field1")
                            .subAggregation(AggregationBuilders.terms("ByField2").field("field2")
                                    .subAggregation(AggregationBuilders.terms("ByField3").field("field3"))))
            .build();

    Aggregations aField1Aggregations = elasticsearchTemplate.query(aSearchQuery,
            new ResultsExtractor<Aggregations>() {
                @Override
                public Aggregations extract(SearchResponse aResponse) {
                    return aResponse.getAggregations();
                }
            });

    Terms aField1Terms = aField1Aggregations.get("ByField1");
    aField1Terms.getBuckets().stream().forEach(aField1Bucket -> {
        String field1Value = aField1Bucket.getKey();
        Terms aField2Terms = aField1Bucket.getAggregations().get("ByField2");

        aField2Terms.getBuckets().stream().forEach(aField2Bucket -> {
            String field2Value = aField2Bucket.getKey();
            Terms aField3Terms = aField2Bucket.getAggregations().get("ByField3");

            aField3Terms.getBuckets().stream().forEach(aField3Bucket -> {
                String field3Value = aField3Bucket.getKey();
                Long count = aField3Bucket.getDocCount();

                FieldObject fieldObject = new FieldObject();
                fieldObject.setField1(field1Value);
                fieldObject.setField2(field2Value);
                fieldObject.setField3(field3Value);
                fieldObject.setCount(count);
                fieldObjectList.add(fieldObject);
            });
        });
    });

The following imports are needed:

    import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
    import static org.elasticsearch.search.aggregations.AggregationBuilders.terms;

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.common.collect.Lists;
    import org.elasticsearch.index.query.FilterBuilder;
    import org.elasticsearch.index.query.FilterBuilders;
    import org.elasticsearch.index.query.TermFilterBuilder;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.Aggregations;
    import org.elasticsearch.search.aggregations.bucket.filter.InternalFilter;
    import org.elasticsearch.search.aggregations.bucket.terms.Terms;
    import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
    import org.springframework.data.elasticsearch.core.ResultsExtractor;
    import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
    import org.springframework.data.elasticsearch.core.query.SearchQuery;
+3
Sub-aggregations are what you need. Although this is never explicitly stated in the docs, it can be found implicitly in the section on structuring aggregations.

The sub-aggregation is computed as if the query had been filtered by the result of the higher-level aggregation. In fact, it looks as if that is what happens internally.

    {
      "aggregations": {
        "VALUE1AGG": {
          "terms": {
            "field": "VALUE1"
          },
          "aggregations": {
            "VALUE2AGG": {
              "terms": {
                "field": "VALUE2"
              }
            }
          }
        }
      }
    }
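
The "filtered by the higher aggregation" reading can be modelled in plain Python on a handful of in-memory documents. This sketch only illustrates the semantics of the bucket counts; it says nothing about how Elasticsearch computes sub-aggregations internally, and the sample documents and function names are made up for the example.

```python
from collections import Counter

docs = [
    {"VALUE1": "a", "VALUE2": "x"},
    {"VALUE1": "a", "VALUE2": "y"},
    {"VALUE1": "a", "VALUE2": "x"},
    {"VALUE1": "b", "VALUE2": "x"},
]


def terms(documents, field):
    """Count documents per distinct value of `field` (like a terms aggregation)."""
    return Counter(doc[field] for doc in documents if field in doc)


def nested_terms(documents, outer, inner):
    """For each outer value, filter to that value, then count inner values."""
    result = {}
    for outer_value, count in terms(documents, outer).items():
        filtered = [d for d in documents if d.get(outer) == outer_value]
        result[outer_value] = {
            "doc_count": count,
            inner: dict(terms(filtered, inner)),
        }
    return result


grouped = nested_terms(docs, "VALUE1", "VALUE2")
```

Each inner count is taken only over the documents that fell into the enclosing outer bucket, which is the same relationship the nested "buckets" arrays express in an aggregation response.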
0