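Solution 1
Now, if you want to strip the HTML out completely before the content is indexed and stored, you can use the mapper attachment plugin. When you define the mapping, you can set the content_type to "html".
The mapper attachment is useful for many things, especially if you handle multiple document types, but I believe using it just to strip out the HTML tags is reason enough (this is something you cannot do with the html_strip char filter).
A forewarning, though: none of the HTML tags will be stored. If you do need those tags, I would suggest defining another field to store the original content.
Another note: you cannot specify multifields for mapper attachment documents, so you would need to store that field outside of the mapper attachment document. See my working example below.
You will need to end up with this mapping: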
{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { /* insert your own settings here */ },
    "warmers" : { }
  }
}
In NEST, I then assemble the content like this:
// The attachment field expects the file content as a Base64 string.
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";

// Keep the original text in a separate field, outside the attachment.
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
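For readers not using NEST, the same assembly step can be sketched in Python. This is only an illustration under assumptions: the helper name `build_document` and the sample HTML are hypothetical, while the field names `_content`, `_content_type` and `raw_content` are taken from the mapping and the result shown in this answer.

```python
import base64
import json


def build_document(html_bytes, raw_text):
    """Build a JSON-ready document body (hypothetical helper).

    The attachment content is Base64-encoded; the original text goes
    into a separate field outside the attachment.
    """
    return {
        "file": {
            "_content": base64.b64encode(html_bytes).decode("ascii"),
            "_content_type": "html",
        },
        "raw_content": raw_text,
    }


# Sample input, made up for illustration only.
html = b"<html><body><h1>Topic10</h1><p>jog</p></body></html>"
doc = build_document(html, html.decode("utf-8"))

# Serialized body that an indexing request could carry.
body = json.dumps(doc)
```

Decoding `doc["file"]["_content"]` with `base64.b64decode` gives back the original bytes, which is a quick way to check that the encoding step did not alter the file.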
I tested this in Sense. The results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "
Topic10
Delete this text and replace it with your own content. Check your mailbox.
asdf
10
Lavender.
10/6 12:03
5 09
11 47
Halloween is in October.
jog
"},
"highlight": {
"file.content": [
"\n Topic10\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n \n\n asdf\n\n \n\n 10\n\n \n\n Lavender.\n\n \n\n 10/6 12:03\n\n \n\n 5 09\n\n \n\n 11 47\n\n \n\n Halloween is in October.\n\n \n\n jog\n\n "
]
}
Solution 2
You will need to set up an NGram analyzer for indexing your content and use the standard analyzer for searching it.
"analyzer" : {
  "standard" : {
    "type" : "standard"
  },
  "autocomplete" : {
    "filter" : [ "standard", "lowercase" ],
    "char_filter" : [ "html_strip" ],
    "type" : "custom",
    "tokenizer" : "ngram"
  }
}
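An example of this:
Input: "brown"
NGram analyzer:
[b], [br], [bro], [brow], [brown]
[r], [ro], [row], [rown]
[o], [ow], [own]
[w], [wn]
[n]
So when you run an autocomplete search, it will match any of these indexed fragments. However, it is important to search (for the results page) using only the standard analyzer, so that the query is not itself broken up and matched against any of these random fragments.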
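The fragment listing and the index-versus-search distinction can be reproduced with a small Python sketch. It is a simplified model, not Elasticsearch itself: `ngrams` assumes a minimum length of 1 and a maximum of 5 (chosen to match the listing above; the tokenizer's own defaults may differ), and `standard_terms` is a crude stand-in for the standard analyzer that only lowercases and splits on whitespace.

```python
def ngrams(text, min_gram=1, max_gram=5):
    """Return every substring of text with length in [min_gram, max_gram]."""
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def standard_terms(query):
    """Crude stand-in for the standard analyzer: lowercase whole words."""
    return query.lower().split()


# Index time: the word is stored as all of its fragments.
indexed = set(ngrams("brown"))


def matches(query_terms):
    """A query matches if any of its terms is an indexed fragment."""
    return any(term in indexed for term in query_terms)


# Search time with whole terms: only real prefixes/substrings match.
whole_hit = matches(standard_terms("bro"))
whole_miss = matches(standard_terms("bxyz"))

# Search time with n-grams applied to the query too: "bxyz" is split
# into fragments, and the single letter "b" is enough to match.
fragment_hit = matches(ngrams("bxyz"))
```

Here `whole_hit` is true and `whole_miss` is false, while `fragment_hit` is true, which is the over-matching that searching with the standard analyzer avoids.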