Given that I have specified the html_strip char filter in my custom analyzer
When I index a document with html content
Then I expect html to be excluded from indexed content
And when retrieving the document from the index, it should not contain HTML
ACTUAL: The indexed document contains HTML. The retrieved document contains HTML.
I tried specifying the analyzer as index_analyzer, as expected, and, out of desperation, several other ways, such as search_analyzer and analyzer. None of this seems to affect indexing or document retrieval.
Test: indexing a document into the fields analyzed with html_strip:
REQUEST: an example POST of a document with HTML content
POST /html_poc_v2/html_poc_type/02
{
  "description": "Description <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>"
}
Expected: the indexed data has been passed through the HTML strip filter. Actual: the data is indexed with its HTML intact.
RESPONSE
{
  "_index": "html_poc_v2",
  "_type": "html_poc_type",
  "_id": "02",
  ...
  "_source": {
    "description": "Description <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
    "title": "Title <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
    "body": "Body <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>"
  }
}
Settings and document mapping
PUT /html_poc_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ]
        }
      }
    },
    "mappings": {
      "html_poc_type": {
        "properties": {
          "body": { "type": "string", "analyzer": "my_html_analyzer" },
          "description": { "type": "string", "analyzer": "my_html_analyzer" },
          "title": { "type": "string", "search_analyser": "my_html_analyzer" },
          "urlTitle": { "type": "string" }
        }
      }
    }
  }
}
Testing the custom analyzer directly works fine:
REQUEST
GET /html_poc_v2/_analyze?analyzer=my_html_analyzer {<p>Some déjà vu <a href="http://somedomain.com>">website</a>}
RESPONSE
{
  "tokens": [
    { "token": "Some",… "position": 1 },
    { "token": "déjà",… "position": 2 },
    { "token": "vu",… "position": 3 },
    { "token": "website",… "position": 4 }
  ]
}
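For comparison, here is a rough Python sketch of the result I expect from the analyzer: drop the tags, then split the remaining text into words. It only approximates the behaviour and is not how Elasticsearch implements html_strip or the standard tokenizer; the names `TextOnly` and `approx_tokens` are made up for this illustration.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and drops tags (a rough stand-in for html_strip)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def approx_tokens(html):
    """Strip markup, then split on word characters (approximation only)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = unicodedata.normalize("NFC", " ".join(parser.chunks))
    return re.findall(r"\w+", text)


# Same input as in the _analyze request above.
sample = '<p>Some déjà vu <a href="http://somedomain.com>">website</a>'
print(approx_tokens(sample))
```

If the analyzer were applied at index time, I would expect the indexed terms for each field to look like this list plus the leading word (Title, Body, Description), with no tag or attribute names.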
Under the hood
Looking under the hood with an inline script field shows that my HTML strip filter is being skipped at index time:
REQUEST
GET /html_poc_v2/html_poc_type/_search?pretty=true
{
  "query": { "match_all": { } },
  "script_fields": {
    "terms": {
      "script": "doc[field].values",
      "params": { "field": "title" }
    }
  }
}
RESPONSE
{
  …
  "hits": {
    ..
    "hits": [
      {
        "_index": "html_poc_v2",
        "_type": "html_poc_type",
        …
        "fields": {
          "terms": [
            [ "a", "agrave", "d", "eacute", "href", "http", "j", "p", "some", "somedomain.com", "title", "vu", "website" ]
          ]
        }
      }
    ]
  }
}
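To make the leak explicit, this small Python sketch compares the terms returned above with the terms I would expect if the markup had been stripped. The `expected` set is my own assumption (the visible words of the title, lowercased to match the casing of the returned terms), not something Elasticsearch reported.

```python
# Terms returned by the script field for "title" (copied from the response above).
observed = [
    "a", "agrave", "d", "eacute", "href", "http", "j", "p",
    "some", "somedomain.com", "title", "vu", "website",
]

# Assumption: what should remain once tags and attributes are stripped.
expected = {"title", "some", "déjà", "vu", "website"}

# Terms that can only come from markup or from split-up entities.
unexpected = sorted(set(observed) - expected)
# Expected terms that never made it into the index.
missing = sorted(expected - set(observed))

print("unexpected:", unexpected)
print("missing:", missing)
```

Tag names (a, p), an attribute name (href), the URL pieces, and entity fragments (eacute, agrave) all show up as terms, which is why I believe the char filter is not applied when this field is indexed.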
This is similar to the question "Why is the HTML tag searchable even though it was filtered out in Elasticsearch".
I also read this wonderful document: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
ES Version: 1.7.2
Please help.