HTML Strip Char Filter
The html_strip character filter strips HTML elements from the text and
replaces HTML entities with their decoded value (e.g. replacing &amp; with
&).
Example output
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
The keyword tokenizer returns a single term.
The above example returns the term:
[ \nI'm so happy!\n ]
The same example with the standard tokenizer would return the following terms:
[ I'm, so, happy ]
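For comparison, the standard-tokenizer variant of that request could look like the sketch below. It changes only the tokenizer value and assumes the input text is the HTML snippet <p>I&apos;m so <b>happy</b>!</p>:

```console
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
```

Character filters run before the tokenizer, so the standard tokenizer receives the already stripped text and splits it into the terms listed above.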
Configuration
The html_strip character filter accepts the following parameter:
escaped_tags
An array of HTML tags which should not be stripped from the original text.
Example configuration
In this example, we configure the html_strip character filter to leave
<b> tags in place:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
The above example produces the following term:
[ \nI'm so <b>happy</b>!\n ]