Elasticsearch: structured search with the term filter, and the underlying bitset and caching mechanics

Article link: https://blog.csdn.net/u011262847/article/details/77937451

Test data:

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

I. The keyword index type

Checking the mapping shows that ES has inferred the data types automatically:

{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "hidden": { "type": "boolean" },
          "postDate": { "type": "date" },
          "userID": { "type": "long" } }
      }
    }
  }
}

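In newer versions of ES (since 5.0, which split the old string type into text and keyword), a dynamically mapped string field gets type=text and is indexed twice by default: once as the field itself, which is analyzed (tokenized), and once as field.keyword, which is not analyzed. Because of ignore_above: 256, the keyword sub-field only indexes values of up to 256 characters; longer values are skipped rather than truncated.

Given this query: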
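The double indexing described above can be modeled in a few lines of Python. This is only an illustrative sketch, not how ES is implemented: `index_string_field` is a hypothetical helper, whitespace splitting stands in for real analysis, and it assumes the documented `ignore_above` behavior (values longer than the limit are not indexed into the keyword sub-field at all).

```python
def index_string_field(value, ignore_above=256):
    """Toy model of a dynamically mapped string field (hypothetical helper)."""
    # Crude stand-in for analysis; the real standard analyzer is more involved.
    analyzed_terms = value.lower().split()
    # Assumption: values longer than ignore_above are skipped, not truncated.
    keyword_terms = [value] if len(value) <= ignore_above else []
    return {"field": analyzed_terms, "field.keyword": keyword_terms}


short_value = index_string_field("Hello World")
long_value = index_string_field("x" * 300)

print(short_value["field"])          # ['hello', 'world']
print(short_value["field.keyword"])  # ['Hello World']
print(long_value["field.keyword"])   # []
```

In this model a 300-character value is still searchable through the analyzed field, but an exact match on field.keyword can never find it.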

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID": "KDKE-B-9947-#kL5"
        }
      }
    }
  }
}

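No results are returned. A term query does not analyze the search string, whereas the articleID field of each document was analyzed at index time: with the default standard analyzer, "KDKE-B-9947-#kL5" is stored as the lowercase tokens kdke, b, 9947 and kl5, so the full string matches no term in the inverted index. Searching on field.keyword instead gives the result we want: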

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID.keyword": "KDKE-B-9947-#kL5"
        }
      }
    }
  }
}

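This returns the exact, un-analyzed match we were looking for.

So when applying a term filter to a text field, consider matching against the built-in field.keyword sub-field. The catch is that by default it only indexes values of up to 256 characters, so where possible it is still better to define the mapping by hand and make the field non-analyzed (not_analyzed in older versions). In newer versions of ES you can simply set type=keyword. Note that the mapping of an existing field cannot be changed, so the existing index and its documents have to be deleted before creating the mapping manually: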
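The difference between the two searches can be reproduced with a small Python sketch. It is a simplification under stated assumptions: `analyze` only approximates the standard analyzer with a regular expression (split on non-alphanumeric characters, then lowercase), and `text_index`, `keyword_index` and `term_lookup` are hypothetical names for two toy inverted indexes, not ES APIs.

```python
import re


def analyze(text):
    """Rough approximation of the standard analyzer (assumption, not the real one)."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


docs = {
    1: "XHDK-A-1293-#fJ3",
    2: "KDKE-B-9947-#kL5",
    3: "JODL-X-1937-#pV7",
    4: "QQPX-R-3956-#aD8",
}

text_index = {}     # analyzed articleID: token -> doc ids
keyword_index = {}  # articleID.keyword: whole value -> doc ids
for doc_id, article_id in docs.items():
    for token in analyze(article_id):
        text_index.setdefault(token, set()).add(doc_id)
    keyword_index.setdefault(article_id, set()).add(doc_id)


def term_lookup(index, term):
    """A term lookup compares the search string as-is, without analyzing it."""
    return sorted(index.get(term, set()))


print(analyze("KDKE-B-9947-#kL5"))                     # ['kdke', 'b', '9947', 'kl5']
print(term_lookup(text_index, "KDKE-B-9947-#kL5"))     # []
print(term_lookup(keyword_index, "KDKE-B-9947-#kL5"))  # [2]
```

The analyzed index only contains the individual lowercase tokens, so the full ID is found only in the keyword index.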

PUT forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": {
          "type": "keyword"
        }
      }
    }
  }
}

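II. How bitsets and caching work under the hood

(1). Look up the search term in the inverted index to get the list of matching documents. Taking dates as an example: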

word	doc1	doc2	doc3
2017-01-01	*	*
2017-02-02		*	*
2017-03-03	*	*	*

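For the filter condition filter: 2017-02-02, the lookup returns doc2 and doc3.

(2). Build a bitset for the results found in the inverted index, something like [0,0,0,1,0,1].
Each filter gets its own binary array, the simplest possible data structure doing a complex job: 1 means the document matches, 0 means it does not. For the documents above, the bitset is [0, 1, 1].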

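(3). Iterate over the bitsets of all the filter conditions, starting with the sparsest one (the one with the most 0s), so that a large number of documents is eliminated right away, which improves performance. The remaining bitsets are then applied in turn; the documents left at the end satisfy every condition and are returned to the client.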
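Steps (1) to (3) can be sketched in Python as follows. This is a toy model only: Python lists stand in for real bitsets, the inverted index is the small date table from step (1), and `user_bits` is a made-up bitset for a second, hypothetical filter, added so that there is something to intersect.

```python
doc_ids = ["doc1", "doc2", "doc3"]

# Toy inverted index taken from the date table in step (1).
inverted_index = {
    "2017-01-01": {"doc1", "doc2"},
    "2017-02-02": {"doc2", "doc3"},
    "2017-03-03": {"doc1", "doc2", "doc3"},
}


def build_bitset(term):
    """Step (2): one bit per document, 1 = match, 0 = no match."""
    postings = inverted_index.get(term, set())
    return [1 if doc in postings else 0 for doc in doc_ids]


def intersect(bitsets):
    """Step (3): AND a non-empty list of bitsets together, sparsest first."""
    ordered = sorted(bitsets, key=sum)  # fewest 1s first
    result = list(ordered[0])
    for bits in ordered[1:]:
        if not any(result):  # nothing left to match, stop early
            break
        result = [a & b for a, b in zip(result, bits)]
    return result


date_bits = build_bitset("2017-02-02")
user_bits = [0, 1, 0]  # hypothetical bitset from some other filter

print(date_bits)  # [0, 1, 1]
final = intersect([build_bitset("2017-03-03"), date_bits, user_bits])
print([doc for doc, bit in zip(doc_ids, final) if bit])  # ['doc2']
```

Starting from the sparsest bitset keeps the intermediate result small, and the loop can stop as soon as no candidate document remains.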

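(4). Caching bitsets: ES tracks queries, and if a filter has been used more than a certain number of times (the threshold is not fixed) within the most recent 256 filters, its bitset is cached automatically. Results from small segments are not cached, however: a segment with fewer than 10,000 documents, or one smaller than 3% of the total index size, is skipped. Such a segment holds so little data that even scanning it is fast. In addition, segments are merged automatically in the background, so a small segment is soon merged into a larger one and disappears, which makes caching it pointless.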

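(5). In most cases filters are executed before the query, so that as much data as possible is filtered out first. A query computes each document's relevance score against the search condition and sorts by that score, whereas a filter simply selects the matching documents: it computes no relevance score and does no sorting.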

(6). When documents are later added or modified, the cached bitset is updated automatically.

(7). Subsequent uses of the same filter then read the cached bitset directly instead of recomputing it.
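The caching behavior in steps (4) and (7) can be illustrated with a minimal Python sketch. `FilterCache` and all of its parameters are hypothetical names. `min_uses` is an assumed threshold, since the real one is not fixed, and the demo values are toy numbers chosen to keep the example small. Document counts stand in for segment size, and the automatic update from step (6) is not modeled.

```python
from collections import deque


class FilterCache:
    """Toy usage-tracking cache for filter bitsets (illustrative only)."""

    def __init__(self, history_size, min_uses, min_docs, min_ratio):
        self.history = deque(maxlen=history_size)  # most recent filters
        self.min_uses = min_uses    # assumed threshold for "used often enough"
        self.min_docs = min_docs    # segments below this doc count are skipped
        self.min_ratio = min_ratio  # segments below this share of the index too
        self.cache = {}             # (filter_key, segment_id) -> bitset

    def segment_is_cacheable(self, seg_docs, total_docs):
        return seg_docs >= self.min_docs and seg_docs >= self.min_ratio * total_docs

    def get_bitset(self, filter_key, segment_id, seg_docs, total_docs, compute):
        """Return (bitset, cache_hit); compute() builds the bitset on a miss."""
        self.history.append(filter_key)
        key = (filter_key, segment_id)
        if key in self.cache:
            return self.cache[key], True
        bits = compute()
        used_often = self.history.count(filter_key) >= self.min_uses
        if used_often and self.segment_is_cacheable(seg_docs, total_docs):
            self.cache[key] = bits
        return bits, False


cache = FilterCache(history_size=256, min_uses=3, min_docs=100, min_ratio=0.03)


def compute():
    return [0, 1, 1]  # pretend this was built from the inverted index


big = [cache.get_bitset("postDate:2017-02-02", "seg_big", 5000, 10000, compute)[1]
       for _ in range(4)]
small = [cache.get_bitset("postDate:2017-02-02", "seg_small", 50, 10000, compute)[1]
         for _ in range(4)]
print(big)    # [False, False, False, True]
print(small)  # [False, False, False, False]
```

The bitset for the large segment is cached on its third use and served from the cache afterwards, while the small segment is recomputed every time.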

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-filter-context.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/keyword.html