Configuring Private Fields in an Elasticsearch Index

gogoout123

Category: Elasticsearch. Tags: index configuration

Article link: https://blog.csdn.net/Vancl_Wang/article/details/85316012

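Business scenario:

A customer wants to search chat records, but does not want the plaintext of those records to be retrievable through the search engine.

Business analysis:

From the search engine's point of view, this means the chat content field should be used only to build the index, and the original text should not be kept.

Candidate solutions:

Approach 1: Create a test index with three fields: itemid (unique ID of the chat record), date (time the record was posted), and content (the chat text). The mapping needs no special configuration; instead, content is excluded from the returned fields at query time: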

PUT test
{
  "mappings": {
    "test":{
      "properties": {
        "itemid": {
          "type": "keyword"
        },
        "date": {
          "type": "long"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Insert one record:

PUT test/test/1
{
  "itemid": "1",
  "date": 1000,
  "content": "a very large content"
}

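At query time, exclude content with a _source filter. The query targets the _all field, which these examples assume is enabled, as it is by default in Elasticsearch 5.x: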

GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "_source": ["itemid", "date"]
}

The result is shown below. The _source returned by the search no longer contains content.

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.26742277,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.26742277,
        "_source": {
          "date": 1000,
          "itemid": "1"
        }
      }
    ]
  }
}
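
To see why this approach only hides the field instead of removing it, the following toy model mimics source filtering in plain Python. It is an illustrative sketch, not Elasticsearch code: stored_doc and filter_source are hypothetical names, and features of the real filter such as wildcards and excludes are left out.

```python
# Toy model of _source filtering: the full document stays stored,
# and the filter only trims the copy handed back to the caller.

stored_doc = {"itemid": "1", "date": 1000, "content": "a very large content"}


def filter_source(source, includes):
    """Return a copy of `source` containing only the keys in `includes`."""
    return {key: value for key, value in source.items() if key in includes}


filtered = filter_source(stored_doc, ["itemid", "date"])
print(filtered)    # the filtered view has no "content" key
print(stored_doc)  # the stored document is unchanged
```

Any caller who omits the filter, or asks for content explicitly, still receives the plaintext.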

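The advantage of this approach is that it is simple to implement and needs little custom work. The drawback is that it does not really meet the requirement: the customer wants nobody to be able to see the field's original content once the index is built.

Approach 2: Use stored fields. Here each field's store setting decides what can be returned, and the query retrieves values through stored_fields. When creating the mapping, set store to true for itemid and date so that Elasticsearch stores their original values as separate fields. content keeps the default store of false, so it is not stored as a separate field: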

PUT test
{
  "mappings": {
    "test":{
      "properties": {
        "itemid": {
          "type": "keyword",
          "store": true
        },
        "date": {
          "type": "long",
          "store": true
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Use stored_fields to specify the fields to return:

GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "stored_fields": ["itemid", "date"]
}

The search result is:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.26742277,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.26742277,
        "fields": {
          "date": [
            1000
          ],
          "itemid": [
            "1"
          ]
        }
      }
    ]
  }
}
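
Stored fields come back under "fields" as lists, even for single values, as the response above shows. A client usually wants plain values, so a small helper along these lines may be convenient. This is a minimal sketch; flatten_stored_fields is a hypothetical name, and the hit is copied from the response above.

```python
def flatten_stored_fields(hit):
    """Collapse single-element value lists from a hit's "fields" section."""
    flat = {}
    for name, values in hit.get("fields", {}).items():
        # Unwrap only when exactly one value is present; keep real lists as-is.
        flat[name] = values[0] if len(values) == 1 else values
    return flat


hit = {
    "_index": "test",
    "_type": "test",
    "_id": "1",
    "_score": 0.26742277,
    "fields": {"date": [1000], "itemid": ["1"]},
}

record = flatten_stored_fields(hit)
print(record)
```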

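The two approaches are similar: both only restrict which fields Elasticsearch returns, so neither truly meets the requirement that nobody can see the original content once the index is built. Because Elasticsearch by default keeps the JSON document submitted at index time in _source, the key is to disable _source.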

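Final solution:

In the mapping, set "_source": {"enabled": false} so that the original JSON is not kept. content also has store set to false (the default), so Elasticsearch does not keep the original text of content and only builds the index for it.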

PUT test
{
  "mappings": {
    "test":{
      "_source": {
        "enabled": false
      }, 
      "properties": {
        "itemid": {
          "type": "keyword",
          "store": true
        },
        "date": {
          "type": "long",
          "store": true
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}
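
When several indices need the same treatment, the mapping above can be generated instead of written by hand. The sketch below is one possible way to do that. build_private_mapping and its parameters are hypothetical names, and the "test" mapping type is assumed only because the example above uses it.

```python
import json


def build_private_mapping(field_types, returned_fields, doc_type="test"):
    """Build a mapping with _source disabled.

    Only fields listed in `returned_fields` get "store": true, so only
    those can be fetched back; every other field is searchable only.
    """
    properties = {}
    for name, es_type in field_types.items():
        field = {"type": es_type}
        if name in returned_fields:
            field["store"] = True
        properties[name] = field
    return {
        "mappings": {
            doc_type: {
                "_source": {"enabled": False},
                "properties": properties,
            }
        }
    }


mapping = build_private_mapping(
    {"itemid": "keyword", "date": "long", "content": "text"},
    returned_fields={"itemid", "date"},
)
print(json.dumps(mapping, indent=2))
```

The printed JSON can be used as the body of the PUT test request.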

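After this, neither _source nor stored_fields can retrieve the original content; searches match only against the inverted index. We index two documents and search, trying to retrieve content through stored_fields. Because its store is false, nothing is returned for it, and only the _id of each matching document is available.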

PUT test/test/1
{
  "itemid": "1",
  "date": 1000,
  "content": "a very large content"
}

PUT test/test/2
{
  "itemid": "2",
  "date": 1001,
  "content": "a large content"
}

GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "stored_fields": ["content"]
}
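
The idea of matching against an inverted index without keeping the original text can be illustrated with a toy model. This is a deliberately simplified sketch and not how Elasticsearch is implemented: build_inverted_index and search are hypothetical names, and the "analysis" is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted IDs of documents containing `term`."""
    return sorted(index.get(term.lower(), set()))


docs = {"1": "a very large content", "2": "a large content"}
index = build_inverted_index(docs)
del docs  # discard the originals; only the term-to-ID mapping remains

print(search(index, "very"))
print(search(index, "large"))
```

A search yields document IDs only. Note that the individual terms do remain in the index, so this hides the original sentences from retrieval and does not make the vocabulary itself secret.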

If the request uses _source to ask for content, it fails, because _source has been disabled:

GET test/_search
{
  "query": {
    "match": {
      "_all": "very"
    }
  },
  "_source": "content"
}

The response:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "failed": 1,
    "failures": [
      {
        "shard": 3,
        "index": "test",
        "node": "hLgp8P3LQxqGDv65X2CXEg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "unable to fetch fields from _source field: _source is disabled in the mappings for index [test]"
        }
      }
    ]
  },
  "hits": {
    "total": 1,
    "max_score": 0.26742277,
    "hits": []
  }
}
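
In the response above the HTTP request itself succeeds and the error is reported per shard, so client code should check the "_shards" section instead of relying on an empty "hits" list. A minimal sketch of such a check follows; shard_failure_reasons is a hypothetical helper, and the response dict is abbreviated from the one above.

```python
def shard_failure_reasons(response):
    """Collect the human-readable reason of every shard failure."""
    shards = response.get("_shards", {})
    return [failure["reason"]["reason"] for failure in shards.get("failures", [])]


response = {
    "timed_out": False,
    "_shards": {
        "total": 5,
        "successful": 4,
        "failed": 1,
        "failures": [
            {
                "shard": 3,
                "index": "test",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "unable to fetch fields from _source field: "
                              "_source is disabled in the mappings for index [test]",
                },
            }
        ],
    },
    "hits": {"total": 1, "hits": []},
}

reasons = shard_failure_reasons(response)
for reason in reasons:
    print("shard failure:", reason)
```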

Disabling _source also has side effects. The Elasticsearch documentation lists the features that are affected:
Think before disabling the _source field
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn’t available then a number of features are not supported:

The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
