Aggregation by filtered inner_hits query in ElasticSearch

Question

I’m only a few days new to ElasticSearch, and as a learning exercise I wrote a rudimentary job scraper that aggregates jobs from several job-listing sites and populates an index with some data for me.

My index contains one document for each website that lists jobs. One property of each document is a jobs array that holds an object for each job listed on that site. I am considering indexing each job as its own document instead (especially since the ElasticSearch documentation states that inner_hits is an experimental feature), but for now I am trying to find out whether I can do what I want with the inner_hits and nested features of ElasticSearch.

I am able to query, filter, and return only the relevant jobs. However, I'm not sure how to apply the same inner_hits constraints to an aggregation.

This is my mapping:

    {
      "jobsitesIdx": {
        "mappings": {
          "sites": {
            "properties": {
              "createdAt": { "type": "date", "format": "dateOptionalTime" },
              "jobs": {
                "type": "nested",
                "properties": {
                  "company": { "type": "string" },
                  "engagement": { "type": "string" },
                  "link": { "type": "string", "index": "not_analyzed" },
                  "location": {
                    "type": "string",
                    "fields": {
                      "raw": { "type": "string", "index": "not_analyzed" }
                    }
                  },
                  "title": { "type": "string" }
                }
              },
              "jobscount": { "type": "long" },
              "sitename": { "type": "string" },
              "url": { "type": "string" }
            }
          }
        }
      }
    }

This is the query and aggregation that I am trying (from Node.js):

    client.search({
      "index": 'jobsitesIdx',
      "type": 'sites',
      "body": {
        "aggs": {
          "jobs": {
            "nested": { "path": "jobs" },
            "aggs": {
              "location": { "terms": { "field": "jobs.location.raw", "size": 25 } },
              "company": { "terms": { "field": "jobs.company.raw", "size": 25 } }
            }
          }
        },
        "query": {
          "filtered": {
            "query": { "match_all": {} },
            "filter": {
              "nested": {
                "inner_hits": { "size": 1000 },
                "path": "jobs",
                "query": {
                  "filtered": {
                    "query": { "match_all": {} },
                    "filter": {
                      "and": [
                        { "term": { "jobs.location": "york" } },
                        { "term": { "jobs.location": "new" } }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      }
    }, function (error, response) {
      // Print each matching nested job, then the location buckets.
      response.hits.hits.forEach(function (jobsite) {
        var jobs = jobsite.inner_hits.jobs.hits.hits;
        jobs.forEach(function (job) {
          console.log(job);
        });
      });
      console.log(response.aggregations.jobs.location.buckets);
    });

This returns all the inner_hits for jobs in New York, but the aggregation shows counts for every location and company, not just for the jobs that match the inner_hits.

Any suggestions on how to aggregate over only the data contained in the matching inner_hits?

Edit: I am updating this to include an export of the mapping and index data, as requested. I exported it using Taskrabbit's elasticdump utility, found here: https://github.com/taskrabbit/elasticsearch-dump

Index: http://pastebin.com/WaZwBwn4 Mapping: http://pastebin.com/ZkGnYN94

The data linked above differs from the sample code in my original question in that the index is named jobsites6 in the data instead of jobsitesIdx, as in the question. In addition, the type in the data is “job”, while in the code above it is “sites”.

I filled in the callback in the code above to display the response data. I only see jobs in New York from the forEach loop over inner_hits, as expected, but I see this aggregation for the location:

    [ { key: 'New York, NY', doc_count: 243 },
      { key: 'San Francisco, CA', doc_count: 92 },
      { key: 'Chicago, IL', doc_count: 43 },
      { key: 'Boston, MA', doc_count: 39 },
      { key: 'Berlin, Germany', doc_count: 22 },
      { key: 'Seattle, WA', doc_count: 22 },
      { key: 'Los Angeles, CA', doc_count: 20 },
      { key: 'Austin, TX', doc_count: 18 },
      { key: 'Anywhere', doc_count: 16 },
      { key: 'Cupertino, CA', doc_count: 15 },
      { key: 'Washington DC', doc_count: 14 },
      { key: 'United States', doc_count: 11 },
      { key: 'Atlanta, GA', doc_count: 10 },
      { key: 'London, UK', doc_count: 10 },
      { key: 'Ulm, Deutschland', doc_count: 10 },
      { key: 'Riverton, UT', doc_count: 9 },
      { key: 'San Diego, CA', doc_count: 9 },
      { key: 'Charlotte, NC', doc_count: 8 },
      { key: 'Irvine, CA', doc_count: 8 },
      { key: 'London', doc_count: 8 },
      { key: 'San Mateo, CA', doc_count: 8 },
      { key: 'Boulder, CO', doc_count: 7 },
      { key: 'Houston, TX', doc_count: 7 },
      { key: 'Palo Alto, CA', doc_count: 7 },
      { key: 'Sydney, Australia', doc_count: 7 } ]

Since my inner_hits are limited to jobs in New York, I can tell that the aggregation is not running over my inner_hits, because it gives me counts for all locations.

elasticsearch

mmccaff Sep 05 '15 at 15:57

1 answer

Val · Accepted Answer · 2015-09-06T03:27:06+0000

You can achieve this by adding the same filter to your aggregation, so that it includes only jobs in New York. Also note that your second aggregation used jobs.company.raw, but in your mapping the jobs.company field does not have a not_analyzed sub-field named raw, so you probably need to add one if you want to aggregate on the non-analyzed company name.

    {
      "_source": [ "sitename" ],
      "query": {
        "filtered": {
          "filter": {
            "nested": {
              "inner_hits": { "size": 1000 },
              "path": "jobs",
              "query": {
                "filtered": {
                  "filter": {
                    "terms": { "jobs.location": [ "new", "york" ] }
                  }
                }
              }
            }
          }
        }
      },
      "aggs": {
        "jobs": {
          "nested": { "path": "jobs" },
          "aggs": {
            "only_loc": {
              "filter": {                   <----- add this filter
                "terms": { "jobs.location": [ "new", "york" ] }
              },
              "aggs": {
                "location": {
                  "terms": { "field": "jobs.location.raw", "size": 25 }
                },
                "company": {
                  "terms": { "field": "jobs.company", "size": 25 }
                }
              }
            }
          }
        }
      }
    }
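Since the point is that the query and the aggregation must use the same filter, one way to keep them from drifting apart in Node.js is to build the body from a single filter object. This is only a minimal sketch: buildSearchBody and its parameters are illustrative names, not part of the Elasticsearch client, and it assumes the same ES 1.x "filtered" syntax used above.

```javascript
// Sketch: build the search body so the nested query and the aggregation
// share one filter. buildSearchBody is a hypothetical helper name.
function buildSearchBody(locationTerms, companyField) {
  // Same terms filter as in the answer above (ES 1.x syntax).
  var locationFilter = { terms: { 'jobs.location': locationTerms } };

  return {
    _source: ['sitename'],
    query: {
      filtered: {
        filter: {
          nested: {
            path: 'jobs',
            inner_hits: { size: 1000 },
            query: { filtered: { filter: locationFilter } }
          }
        }
      }
    },
    aggs: {
      jobs: {
        nested: { path: 'jobs' },
        aggs: {
          // Filter aggregation: restricts the buckets below to the
          // nested jobs that also match the query's filter.
          only_loc: {
            filter: locationFilter,
            aggs: {
              location: { terms: { field: 'jobs.location.raw', size: 25 } },
              company: { terms: { field: companyField, size: 25 } }
            }
          }
        }
      }
    }
  };
}

var body = buildSearchBody(['new', 'york'], 'jobs.company');
```

The resulting body can be passed to client.search as in the question, and the filtered buckets should then be found under response.aggregations.jobs.only_loc.location.buckets rather than response.aggregations.jobs.location.buckets. Note that a terms filter matches jobs containing any of the listed terms, whereas the "and" of two term filters in the question requires both.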
