Elasticsearch: html_strip char filter does not strip HTML tags before indexing documents

Given I have specified an html_strip char filter in my custom analyzer

When I index a document with html content

Then I expect html to be excluded from the indexed content

And when retrieving the document from the index, it should not contain html

ACTUAL: The indexed document contains html, and the retrieved document contains html.

Out of desperation I tried specifying the analyzer as index_analyzer, as search_analyzer, and as plain analyzer. None of these seem to affect indexing or document retrieval.

Testing document indexing into the html_strip-analyzed field:

REQUEST: an example POST of a document with html content

 POST /html_poc_v2/html_poc_type/02
 {
   "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
   "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
   "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
 }

EXPECTED: the indexed data is parsed by the html strip char filter. ACTUAL: the data is indexed with the html intact.

RESPONSE

 {
   "_index": "html_poc_v2",
   "_type": "html_poc_type",
   "_id": "02",
   …
   "_source": {
     "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
     "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
     "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
   }
 }

Index settings and document mapping:

 PUT /html_poc_v2
 {
   "settings": {
     "analysis": {
       "analyzer": {
         "my_html_analyzer": {
           "type": "custom",
           "tokenizer": "standard",
           "char_filter": [ "html_strip" ]
         }
       }
     }
   },
   "mappings": {
     "html_poc_type": {
       "properties": {
         "body": { "type": "string", "analyzer": "my_html_analyzer" },
         "description": { "type": "string", "analyzer": "my_html_analyzer" },
         "title": { "type": "string", "search_analyzer": "my_html_analyzer" },
         "urlTitle": { "type": "string" }
       }
     }
   }
 }

Testing the custom analyzer itself works fine:

REQUEST

 GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
 <p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>

RESPONSE

 {
   "tokens": [
     { "token": "Some", … "position": 1 },
     { "token": "déjà", … "position": 2 },
     { "token": "vu", … "position": 3 },
     { "token": "website", … "position": 4 }
   ]
 }
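For intuition, what html_strip plus the standard tokenizer do to this input can be roughly approximated off-cluster. The sketch below is my own naive approximation, not Lucene's actual HTMLStripCharFilter (which handles malformed markup, such as the stray ">" inside the href above, more carefully):

```python
import html
import re

def approx_html_strip_analyze(raw):
    """Rough approximation of ES's html_strip char filter followed by
    the standard tokenizer: drop tags, decode entities, split into words."""
    # Naive tag removal; Lucene's filter is smarter about '>' inside
    # quoted attribute values, but the final tokens match here anyway.
    no_tags = re.sub(r"<[^>]*>", " ", raw)
    decoded = html.unescape(no_tags)       # &eacute; -> é, &agrave; -> à
    return re.findall(r"\w+", decoded)     # standard-tokenizer-ish word split

print(approx_html_strip_analyze(
    '<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>'))
# → ['Some', 'déjà', 'vu', 'website']
```

This reproduces the four tokens returned by the _analyze call above, which is exactly what indexing the title field should have produced.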

Under the hood

Looking under the hood with a script field proves that my html analyzer is being skipped:

REQUEST

 GET /html_poc_v2/html_poc_type/_search?pretty=true
 {
   "query": { "match_all": {} },
   "script_fields": {
     "terms": {
       "script": "doc[field].values",
       "params": { "field": "title" }
     }
   }
 }

RESPONSE

 {
   …
   "hits": {
     …
     "hits": [
       {
         "_index": "html_poc_v2",
         "_type": "html_poc_type",
         …
         "fields": {
           "terms": [
             [ "a", "agrave", "d", "eacute", "href", "http", "j", "p", "some", "somedomain.com", "title", "vu", "website" ]
           ]
         }
       }
     ]
   }
 }

Similar to this question: Why is the HTML tag searchable even though it was filtered out in Elasticsearch?

I also read this documentation page: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

ES Version: 1.7.2

Please help.

1 answer

You are confusing the "_source" field in the response with what is analyzed and indexed. It looks like your expectation is that the _source field in the response returns the analyzed document. This is not the case.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.

Ideally, in a case like the above where you want to strip html from the source data for presentation purposes, this should be done on the client side.
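For instance, a minimal client-side sketch in Python (assuming nothing about your actual client stack) can strip the tags and decode the entities from a field fetched out of _source before displaying it, using only the standard library:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only text nodes; convert_charrefs=True (the default)
    also decodes entities such as &eacute; along the way."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_data(self, data):
        self._parts.append(data)

    def text(self):
        return "".join(self._parts)

def strip_html(raw):
    """Strip html tags from a _source field value for presentation."""
    stripper = TagStripper()
    stripper.feed(raw)
    stripper.close()
    return stripper.text()

print(strip_html('Description <p>Some d&eacute;j&agrave; vu '
                 '<a href="http://somedomain.com>">website</a>'))
# → Description Some déjà vu website
```

Unlike a naive regex, a real html parser copes with the malformed markup in the example (the ">" inside the quoted href), which is one reason to do this at presentation time rather than fighting the index.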

That said, one way to achieve this for the above use case is to use script fields together with the keyword tokenizer, as follows:

 PUT test
 {
   "settings": {
     "analysis": {
       "analyzer": {
         "my_html_analyzer": {
           "type": "custom",
           "tokenizer": "standard",
           "char_filter": [ "html_strip" ]
         },
         "parsed_analyzer": {
           "type": "custom",
           "tokenizer": "keyword",
           "char_filter": [ "html_strip" ]
         }
       }
     }
   },
   "mappings": {
     "test": {
       "properties": {
         "body": {
           "type": "string",
           "analyzer": "my_html_analyzer",
           "fields": {
             "parsed": {
               "type": "string",
               "analyzer": "parsed_analyzer"
             }
           }
         }
       }
     }
   }
 }

 PUT test/test/1
 {
   "body": "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
 }

 GET test/_search
 {
   "query": { "match_all": {} },
   "script_fields": {
     "terms": {
       "script": "doc[field].values",
       "params": { "field": "body.parsed" }
     }
   }
 }

Result:

 {
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
     "terms": [ "Title \n Some déjà vu website this is inline \n " ]
   }
 }

Note: I believe the above is a bad idea, since stripping html tags can easily be achieved on the client side, where you have much more control over formatting than with such a workaround. More importantly, this is presentation logic that belongs on the client side.

