Displaying HTML with Elasticsearch: the right strategy for indexing the content of HTML files

Latest recommended article published on 2024-03-21 13:11:56

weixin_40003283


Solution 1

If you want to strip the HTML completely before indexing and storing the content, you can use the mapper attachment plugin: when you define the mapping, you can set content_type to "html".

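The mapper attachment plugin is useful for many things, especially if you handle multiple document types. Most notably, I believe using it just to strip out the HTML tags is sufficient, which is something you cannot do with the html_strip char filter (that filter only affects the analyzed tokens; the stored content keeps its tags).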
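To make "stripping the HTML before storing" concrete, here is a minimal Python sketch. It uses the standard library's html.parser as a stand-in for the extraction the plugin performs; the plugin does not run this code, and the names TextExtractor and strip_html are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never recorded.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_html(html):
    """Return the visible text of `html`, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = "<html><body><h1>Topic10</h1><p>Halloween is in October.</p></body></html>"
print(strip_html(sample))
```

What ends up stored in this approach is text of this kind, with no tags left to recover.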

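One forewarning, though: none of the HTML tags will be stored, so if you do need them, I would suggest defining another field to store the original content. Another note: you cannot specify multi-fields for mapper attachment documents, so you would need to store that field outside of the mapper attachment document. See my working example below: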

You will need to end up with this mapping:

{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { // insert your own settings here
    },
    "warmers" : { }
  }
}

With that in place, in NEST I assemble the content like this:

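using System;      // Convert
using System.IO;   // File
using Nest;        // Attachment

// The attachment content must be the Base64-encoded bytes of the file.
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";

// Document is the application's own class matching the mapping above;
// RawContent keeps the original text outside the attachment.
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);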
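For readers not using NEST, the same document body can be sketched in plain Python. This is only an illustration under assumptions: build_document is a hypothetical helper, the sample HTML is made up, and the field names _content and _content_type are taken from the result shown in this answer.

```python
import base64
import json


def build_document(html_bytes, raw_text):
    """Build the JSON body for one document (hypothetical helper)."""
    return {
        "file": {
            # The attachment type expects the file content Base64-encoded.
            "_content": base64.b64encode(html_bytes).decode("ascii"),
            "_content_type": "html",
        },
        # Kept outside the attachment so the original text stays available.
        "raw_content": raw_text,
    }


html = "<h1>Topic10</h1><p>jog</p>"
body = build_document(html.encode("utf-8"), html)
print(json.dumps(body, indent=2))
```

Keeping raw_content as a sibling of file, rather than inside it, follows the note that multi-fields cannot be declared on the attachment itself.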

I tested this in Sense. The results are as follows:

"file": {
  "_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
  "_content_length": 0,
  "_content_type": "html",
  "_date": "0001-01-01T00:00:00",
  "_title": "Topic10"
},
"delete": false,
"raw_content": "
Topic10
Delete this text and replace it with your own content. Check your mailbox.
asdf
Lavender.
10/6 12:03
5 09
11 47
Halloween is in October.
jog
"
},
"highlight": {
  "file.content": [
    "\n Topic10\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n \n\n asdf\n\n \n\n 10\n\n \n\n Lavender.\n\n \n\n 10/6 12:03\n\n \n\n 5 09\n\n \n\n 11 47\n\n \n\n Halloween is in October.\n\n \n\n jog\n\n "
  ]
}
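The result above includes a highlight section for file.content. The request that produced it is not shown, so the following is only a sketch of a request body of roughly that shape, built in Python. The search term and the helper name highlight_query are assumptions; the query and highlight keys follow the standard Elasticsearch search DSL.

```python
import json


def highlight_query(term, field="file.content"):
    """Build a match query that also requests highlights on the same field."""
    return {
        "query": {"match": {field: term}},
        "highlight": {"fields": {field: {}}},
    }


print(json.dumps(highlight_query("Halloween"), indent=2))
```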

Solution 2

You will need to set up an NGram analyzer for indexing your content, and search with the standard analyzer.

"analyzer" : {
  "standard" : {
    "type" : "standard"
  },
  "autocomplete" : {
    "filter" : [ "standard", "lowercase" ],
    "char_filter" : [ "html_strip" ],
    "type" : "custom",
    "tokenizer" : "ngram"
  }
}
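The effect of indexing with n-grams while searching with the standard analyzer can be illustrated with a small Python sketch. It is a simplification, not the Elasticsearch implementation: the ngrams function is a hypothetical helper, and the gram sizes 2 to 3 are assumed for illustration, since the settings above do not specify them.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Return character n-grams of `text` for every start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


# Index time: lowercase the text, then split it into n-grams.
indexed = set(ngrams("Lavender".lower(), min_gram=2, max_gram=3))

# Search time: the query stays a whole lowercased word and is not split further.
query = "ave"
print(query in indexed)  # True: a fragment of the word matches an indexed gram
```

Because only the indexed side is split, a partial word typed by the user can match, while the query itself is not broken into many small, noisy terms.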