How to highlight nested fields in Elasticsearch

Despite the way Lucene structures nested documents internally, I am trying to get my nested fields highlighted whenever a search matches their content.

Here is the explanation from the Elasticsearch documentation (nested type mapping):

Internal implementation

Internally, nested objects are indexed as additional documents, but since they are guaranteed to be indexed within the same "block", they can be joined with their parent documents extremely quickly.

These internal nested documents are automatically masked away when performing operations against the index (for example, a search with a match_all query), and they only surface when a nested query is used.

Because nested documents are always masked to the parent document, they can never be accessed outside the scope of the nested query. For example, stored fields can be enabled on fields inside nested objects, but there is no way to retrieve them, because stored fields are fetched outside the scope of the nested query.

0. In my case

I have an Elasticsearch index containing the following mapping:

    {
      "my_documents": {
        "dynamic_date_formats": [ "dd.MM.yyyy", "yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss" ],
        "index_analyzer": "Analyzer2_index",
        "search_analyzer": "Analyzer2_search_decompound",
        "_timestamp": { "enabled": true },
        "properties": {
          "identifier": { "type": "string" },
          "description": {
            "type": "multi_field",
            "fields": {
              "sort": { "type": "string", "index": "not_analyzed" },
              "description": { "type": "string" }
            }
          },
          "files": {
            "type": "nested",
            "include_in_root": true,
            "properties": {
              "content": { "type": "string", "include_in_root": true }
            }
          },
          "and then some other": "normal string fields"
        }
      }
    }

I am trying to execute a query like this:

    {
      "size": 100,
      "query": {
        "bool": {
          "should": [
            {
              "nested": {
                "path": "files",
                "query": {
                  "bool": {
                    "should": {
                      "match": {
                        "content": { "query": "burpcontrol", "minimum_should_match": "85%" }
                      }
                    }
                  }
                }
              }
            },
            { "match": { "description": { "query": "burpcontrol", "minimum_should_match": "85%" } } },
            { "match": { "identifier": { "query": "burpcontrol", "minimum_should_match": "85%" } } }
          ]
        }
      },
      "highlight": {
        "pre_tags": [ "<span style=\"background-color: yellow\">" ],
        "post_tags": [ "</span>" ],
        "order": "score",
        "no_match_size": 100,
        "fragment_size": 50,
        "number_of_fragments": 3,
        "require_field_match": true,
        "fields": { "files.content": {}, "description": {}, "identifier": {} }
      }
    }

I have the following problems:

1. require_field_match

If I set "require_field_match": false, the nested field does get highlighted, even though highlighting does not otherwise work on nested fields, because the search term is then highlighted in every field. This is the workaround I am currently using, but the performance is terrible: the query takes about 5 seconds for 10 documents, 25 seconds for 50 documents, and about 50 seconds for 100 documents (roughly half a second per document). If I remove the nested field from the highlighting, everything is lightning fast!

2. include_in_root

I would like to have a flattened version of my nested fields (that is, to have them stored as regular object fields as well). For this I should specify

    "files": { "type": "nested", "include_in_root": true, ...

but, for reasons I don't understand, after reindexing I don't see any additional flattened field at the root of the document (I was expecting something like "files.content": ["content1", "content2", "..."]).

If this worked, it would be possible to access the content of the nested field through the flattened field instead, and to perform the highlighting on that.

Do you know whether it is possible to get good (and performant) highlighting on nested fields, or can you at least suggest why my query is so slow? (I have already optimized the fragments.)

1 answer

There are a number of things you could do here with a parent/child relationship. I will go through a few, and hopefully that will lead you in the right direction; it will still take a lot of testing to find out whether this solution performs better for you. Also, I left out some details of your setup, for clarity. Please forgive the long post.

I set up a parent/child mapping as follows:

    DELETE /test_index

    PUT /test_index
    {
      "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
      "mappings": {
        "parent_doc": {
          "properties": {
            "identifier": { "type": "string" },
            "description": { "type": "string" }
          }
        },
        "child_doc": {
          "_parent": { "type": "parent_doc" },
          "properties": {
            "content": { "type": "string" }
          }
        }
      }
    }

Then I added some test documents:

    POST /test_index/_bulk
    {"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
    {"identifier": "first", "description":"some special text"}
    {"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
    {"content":"text that is special"}
    {"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
    {"content":"text that is not"}
    {"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
    {"identifier": "second", "description":"some different text"}
    {"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
    {"content":"different child text, but special"}
    {"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
    {"identifier": "third", "description":"we don't want this parent"}
    {"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
    {"content":"or this child"}

If I understand your requirements correctly, we want a query for "special" to return every one of these documents except the last two (correct me if I am wrong). That is, we want documents that match the text, have a child that matches the text, or have a parent that matches the text.
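
The three conditions above can be sketched as three should clauses of a bool query. The Python below only builds the request body and does not talk to a cluster; build_family_query is a hypothetical helper name, and the type and field names (parent_doc, child_doc, description, content) are taken from the test mapping above.

```python
import json


def build_family_query(term):
    """Build a request body matching a document, its children, or its parent.

    Hypothetical helper for illustration; type and field names come from
    the test mapping (parent_doc/description, child_doc/content).
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # 1. the document itself matches the text
                    {"multi_match": {"query": term,
                                     "fields": ["description", "content"]}},
                    # 2. a parent that has at least one matching child
                    {"has_child": {"type": "child_doc",
                                   "query": {"match": {"content": term}}}},
                    # 3. a child whose parent matches
                    {"has_parent": {"type": "parent_doc",
                                    "query": {"match": {"description": term}}}},
                ]
            }
        },
        "highlight": {
            "fields": {"description": {}, "identifier": {}, "content": {}}
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a console request.
    print(json.dumps(build_family_query("special"), indent=2))
```

Keeping the clauses in one should list means a document only needs to satisfy one of them to be returned, while documents that satisfy several tend to score higher.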

We can get back the parents that match the query like this:

    POST /test_index/parent_doc/_search
    {
      "query": { "match": { "description": "special" } },
      "highlight": { "fields": { "description": {}, "identifier": {} } }
    }
    ...
    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "failed": 0 },
      "hits": {
        "total": 1,
        "max_score": 1.1263815,
        "hits": [
          {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": { "identifier": "first", "description": "some special text" },
            "highlight": { "description": [ "some <em>special</em> text" ] }
          }
        ]
      }
    }

And we can get back the children that match the query like this:

    POST /test_index/child_doc/_search
    {
      "query": { "match": { "content": "special" } },
      "highlight": { "fields": { "content": {} } }
    }
    ...
    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "failed": 0 },
      "hits": {
        "total": 2,
        "max_score": 0.92364895,
        "hits": [
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.92364895,
            "_source": { "content": "text that is special" },
            "highlight": { "content": [ "text that is <em>special</em>" ] }
          },
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.80819285,
            "_source": { "content": "different child text, but special" },
            "highlight": { "content": [ "different child text, but <em>special</em>" ] }
          }
        ]
      }
    }

We can get back both the parents that match the text and the children that match the text:

    POST /test_index/parent_doc,child_doc/_search
    {
      "query": {
        "multi_match": { "query": "special", "fields": [ "description", "content" ] }
      },
      "highlight": { "fields": { "description": {}, "identifier": {}, "content": {} } }
    }
    ...
    {
      "took": 3,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "failed": 0 },
      "hits": {
        "total": 3,
        "max_score": 1.1263815,
        "hits": [
          {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": { "identifier": "first", "description": "some special text" },
            "highlight": { "description": [ "some <em>special</em> text" ] }
          },
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.75740534,
            "_source": { "content": "text that is special" },
            "highlight": { "content": [ "text that is <em>special</em>" ] }
          },
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.6627297,
            "_source": { "content": "different child text, but special" },
            "highlight": { "content": [ "different child text, but <em>special</em>" ] }
          }
        ]
      }
    }

However, to get back all the documents related to this query, we need to use a bool query:

    POST /test_index/parent_doc,child_doc/_search
    {
      "query": {
        "bool": {
          "should": [
            { "multi_match": { "query": "special", "fields": [ "description", "content" ] } },
            { "has_child": { "type": "child_doc", "query": { "match": { "content": "special" } } } },
            { "has_parent": { "type": "parent_doc", "query": { "match": { "description": "special" } } } }
          ]
        }
      },
      "highlight": { "fields": { "description": {}, "identifier": {}, "content": {} } },
      "fields": [ "_parent", "_source" ]
    }
    ...
    {
      "took": 5,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "failed": 0 },
      "hits": {
        "total": 5,
        "max_score": 0.8866254,
        "hits": [
          {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 0.8866254,
            "_source": { "identifier": "first", "description": "some special text" },
            "highlight": { "description": [ "some <em>special</em> text" ] }
          },
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.67829096,
            "_source": { "content": "text that is special" },
            "fields": { "_parent": "1" },
            "highlight": { "content": [ "text that is <em>special</em>" ] }
          },
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.18709806,
            "_source": { "content": "different child text, but special" },
            "fields": { "_parent": "2" },
            "highlight": { "content": [ "different child text, but <em>special</em>" ] }
          },
          {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "NiwsP2VEQBKjqu1M4AdjCg",
            "_score": 0.12531912,
            "_source": { "content": "text that is not" },
            "fields": { "_parent": "1" }
          },
          {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "2",
            "_score": 0.12531912,
            "_source": { "identifier": "second", "description": "some different text" }
          }
        ]
      }
    }

(I included the "_parent" field in the results to make it easier to see why each document was returned.)
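
Since every child hit carries its parent id in fields._parent, the highlights can be regrouped per parent on the client side. This is a minimal sketch, assuming the response shape shown above (with _parent returned as a plain string); group_highlights_by_parent is a hypothetical helper, not part of any Elasticsearch client.

```python
from collections import defaultdict


def group_highlights_by_parent(hits):
    """Collect highlight fragments per parent id.

    `hits` is the list found under response["hits"]["hits"]. Child hits are
    filed under the parent named in fields._parent; parent hits are filed
    under their own _id. A hit without highlights still creates an entry.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for hit in hits:
        if hit["_type"] == "child_doc":
            parent_id = hit["fields"]["_parent"]
        else:
            parent_id = hit["_id"]
        entry = grouped[parent_id]  # touch the key so the parent is listed
        for field, fragments in hit.get("highlight", {}).items():
            entry[field].extend(fragments)
    # Convert the nested defaultdicts into plain dicts for the caller.
    return {pid: dict(fields) for pid, fields in grouped.items()}


# Trimmed copy of the hits from the bool query response above.
SAMPLE_HITS = [
    {"_type": "parent_doc", "_id": "1",
     "highlight": {"description": ["some <em>special</em> text"]}},
    {"_type": "child_doc", "_id": "geUFenxITZSL7epvB568uA",
     "fields": {"_parent": "1"},
     "highlight": {"content": ["text that is <em>special</em>"]}},
    {"_type": "child_doc", "_id": "IMHXhM3VRsCLGkshx52uAQ",
     "fields": {"_parent": "2"},
     "highlight": {"content": ["different child text, but <em>special</em>"]}},
    {"_type": "child_doc", "_id": "NiwsP2VEQBKjqu1M4AdjCg",
     "fields": {"_parent": "1"}},
    {"_type": "parent_doc", "_id": "2"},
]

if __name__ == "__main__":
    for parent_id, fields in group_highlights_by_parent(SAMPLE_HITS).items():
        print(parent_id, fields)
```

With this grouping, each parent ends up with the highlighted fragments of its own fields plus those of its matching children, which is close to what highlighting on the nested field was meant to provide.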

Let me know if this helps.

Here is the code I used:

http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6
