Displaying HTML in Elasticsearch: highlighting after stripping HTML tags

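If you want to strip the HTML completely before the content is indexed and stored, you can use the mapper attachments plugin: when defining the mapping, you can classify the content_type as "html". You can then highlight without any HTML tags.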

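The mapper attachments plugin is useful for many things, especially if you handle multiple document types. Most notably, I believe it is worth using just for stripping the HTML tags (something the html_strip char filter cannot do, since it only changes the analyzed tokens and not the stored text).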
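For contrast, the html_strip approach is configured as a char filter inside a custom analyzer. The index settings below are only a sketch of that kind of setup; the analyzer name html_text is a hypothetical placeholder and is not part of the original example. Because a char filter runs only during analysis, the stored field value keeps its tags, so highlights built from it still contain markup.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
```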

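A word of warning: none of the HTML tags are stored. If you do need the tags, I suggest defining another field to store the original content. Another note: you cannot specify multi-fields for a mapper attachment document, so you need to store that field outside the mapper attachment document. See the working example below.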

You need to end up with the following mapping. The settings object is left empty here; insert your own settings, which must define the autocomplete and raw analyzers that the mapping refers to:

{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { },
    "warmers" : { }
  }
}

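In NEST, I then assemble the content like this:

using System;
using System.IO;

// The attachment content is sent as a Base64-encoded string.
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
// Mark the content as HTML so that the tags are stripped.
attachment.ContentType = "html";

// RawContent is stored outside the attachment, next to it on the document.
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);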

I tested this in Sense; the result is as follows:

"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "
Topic10
Delete this text and replace it with your own content. Check your mailbox.
asdf
10
Lavender.
10/6 12:03
5 09
11 47
Halloween is in October.
jog
"
},
"highlight": {
"file.content": [
"\n Topic10\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n "
]
}
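
A search request along the following lines should return a highlight section like the one above. This is only a sketch, not the exact request used for the test: the query term "Halloween" is an assumption taken from the sample text, and the body would be sent to the _search endpoint of the html5-es index from the mapping.

```json
{
  "query": {
    "match": { "file.content": "Halloween" }
  },
  "highlight": {
    "fields": {
      "file.content": { }
    }
  }
}
```

Highlighting on file.content works here because that field is stored inside the attachment with "store": true and "term_vector": "with_positions_offsets", and what is stored is the tag-free text.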
