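Business scenario:
A customer needs to search chat records, but does not want the plaintext of those records to be retrievable through the search engine.
Business analysis:
From the search engine's point of view, this requirement means building an index on the chat content field only, without keeping the original text.
Solutions:
Option 1: Create a test index with three fields: itemid (unique id of the chat record), date (time the record was posted), and content (the chat text). The idea: leave the mapping without any special configuration, and at query time force content out of the returned fields: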
PUT test
{
  "mappings": {
    "test": {
      "properties": {
        "itemid": {
          "type": "keyword"
        },
        "date": {
          "type": "long"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}
Insert one record:
PUT test/test/1
{
  "itemid": "1",
  "date": 1000,
  "content": "a very large content"
}
At query time, exclude content with a _source filter:
GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "_source": ["itemid", "date"]
}
The result is shown below. The _source returned by the search no longer contains content:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.26742277,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.26742277,
        "_source": {
          "date": 1000,
          "itemid": "1"
        }
      }
    ]
  }
}
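The advantage of this option is that the custom development is simple and needs little extra handling. The drawback is that it does not really meet the customer's requirement: the customer wants nobody to be able to see the original field content once the index has been built.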
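The "forced" exclusion in option 1 lives entirely in application code. A minimal Python sketch of such a wrapper follows; the names `HIDDEN_FIELDS` and `build_search_body` are hypothetical, and the query keeps the `_all` match used in the example above. It only builds the request body and does not talk to a cluster.

```python
# Fields that must never be returned to callers (assumption: only "content").
HIDDEN_FIELDS = {"content"}


def build_search_body(term, requested_fields):
    """Build a search body whose _source filter never returns hidden fields."""
    allowed = [f for f in requested_fields if f not in HIDDEN_FIELDS]
    return {
        "query": {"match": {"_all": term}},
        # "excludes" is always set, so content stays hidden even when
        # "includes" ends up empty.
        "_source": {"includes": allowed, "excludes": sorted(HIDDEN_FIELDS)},
    }


body = build_search_body("very", ["itemid", "date", "content"])
print(body)
```

Anyone who can send requests to ES directly bypasses this helper and can still ask for content, which is exactly the limitation described above.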
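Option 2: Rely on store. The idea is to configure per field whether it is stored, then fetch fields at query time with stored_fields. When creating the index mapping, set store to true for itemid and date, so that ES keeps the original values of these two fields as stored fields and keeps no stored copy of content (store defaults to false in ES):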
PUT test
{
  "mappings": {
    "test": {
      "properties": {
        "itemid": {
          "type": "keyword",
          "store": true
        },
        "date": {
          "type": "long",
          "store": true
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}
Use stored_fields to choose the returned fields:
GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "stored_fields": ["itemid", "date"]
}
The search result is:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.26742277,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.26742277,
        "fields": {
          "date": [
            1000
          ],
          "itemid": [
            "1"
          ]
        }
      }
    ]
  }
}
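Note that stored fields come back under "fields" and each value is wrapped in an array, unlike the plain values in _source. A small Python sketch of how a client might flatten such a response; the helper name `extract_stored_fields` is hypothetical, and the sample is a trimmed copy of the response above.

```python
def extract_stored_fields(response):
    """Return one flat dict per hit, built from its "fields" section."""
    rows = []
    for hit in response["hits"]["hits"]:
        row = {"_id": hit["_id"]}
        # A hit may carry no "fields" key when none of the requested
        # fields is stored, so fall back to an empty dict.
        for name, values in hit.get("fields", {}).items():
            # Stored field values arrive as arrays; unwrap single values.
            row[name] = values[0] if len(values) == 1 else values
        rows.append(row)
    return rows


sample = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "test",
                "_id": "1",
                "_score": 0.26742277,
                "fields": {"date": [1000], "itemid": ["1"]},
            }
        ],
    }
}
rows = extract_stored_fields(sample)
print(rows)
```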
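The two options above are similar: both restrict which fields ES returns, but neither truly meets the requirement that, once the index is built, nobody can see the original content. Because ES by default saves the JSON document submitted at index time in _source, the key is to disable _source.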
**Final solution:**
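Mapping configuration: setting "_source": {"enabled": false} stops ES from saving the original JSON, and store for content is also false in the mapping, so ES keeps no original copy of content and only builds the index for it.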
PUT test
{
  "mappings": {
    "test": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "itemid": {
          "type": "keyword",
          "store": true
        },
        "date": {
          "type": "long",
          "store": true
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}
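If more fields are added later, it is easy to forget which ones should be returnable. As an illustration only, the Python sketch below generates the same mapping body from a field list; `build_mapping` and its parameters are hypothetical names, and the resulting dict would still have to be sent as the body of `PUT test`.

```python
def build_mapping(doc_type, field_types, returnable):
    """Build a mapping with _source disabled and store=true only on returnable fields."""
    properties = {}
    for name, es_type in field_types.items():
        prop = {"type": es_type}
        if name in returnable:
            # Only these fields can be fetched later through stored_fields.
            prop["store"] = True
        properties[name] = prop
    return {
        "mappings": {
            doc_type: {
                "_source": {"enabled": False},
                "properties": properties,
            }
        }
    }


mapping = build_mapping(
    "test",
    {"itemid": "keyword", "date": "long", "content": "text"},
    returnable={"itemid", "date"},
)
print(mapping)
```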
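Now neither _source nor stored_fields can retrieve the original content; searches only match against the inverted index. We index two documents and search, trying to fetch content through stored_fields. Because its store is false, nothing is returned for it, and we can only get the _id of the matching documents.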
PUT test/test/1
{
  "itemid": "1",
  "date": 1000,
  "content": "a very large content"
}

PUT test/test/2
{
  "itemid": "2",
  "date": 1001,
  "content": "a large content"
}

GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "stored_fields": ["content"]
}
If we ask for the content field through _source instead, the request reports an error, because _source has been disabled:
GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "_source": "content"
}
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "failed": 1,
    "failures": [
      {
        "shard": 3,
        "index": "test",
        "node": "hLgp8P3LQxqGDv65X2CXEg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "unable to fetch fields from _source field: _source is disabled in the mappings for index [test]"
        }
      }
    ]
  },
  "hits": {
    "total": 1,
    "max_score": 0.26742277,
    "hits": []
  }
}
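In this response the error is not at the top level: it sits inside `_shards.failures`, next to a `hits` section that reports one match but an empty hit list. A client that only looks at `hits` could miss it. The Python sketch below shows one way to surface such failures; `shard_failure_reasons` is a hypothetical helper, and the sample is a trimmed copy of the response above.

```python
def shard_failure_reasons(response):
    """Collect the human-readable reason of every shard failure in a search response."""
    failures = response.get("_shards", {}).get("failures", [])
    return [failure["reason"]["reason"] for failure in failures]


response = {
    "_shards": {
        "total": 5,
        "successful": 4,
        "failed": 1,
        "failures": [
            {
                "shard": 3,
                "index": "test",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "unable to fetch fields from _source field: "
                              "_source is disabled in the mappings for index [test]",
                },
            }
        ],
    },
    "hits": {"total": 1, "max_score": 0.26742277, "hits": []},
}

reasons = shard_failure_reasons(response)
for reason in reasons:
    print("shard failure:", reason)
```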
Of course, disabling _source brings other problems. Below are the affected features listed in the official ES documentation:
Think before disabling the _source field
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn’t available then a number of features are not supported:
- The update, update_by_query, and reindex APIs.
- On the fly highlighting.
- The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
- The ability to debug queries or aggregations by viewing the original document used at index time.
- Potentially in the future, the ability to repair index corruption automatically.