Elasticsearch notes: exact-value search with filter (4)

-yanhui-

Article link: https://blog.csdn.net/xyh930929/article/details/71642153

Filter search (filter)

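Structured search is the process of querying data that has inherent structure. Dates, times and numbers are all structured: they have a precise format, and logical operations can be performed on that format. Common operations include comparing ranges of numbers or times, or deciding which of two values is larger.

Note: for structured data, a value either matches or it does not.

Because these are structured queries, each one below is shown next to the equivalent SQL statement: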

1. Querying numbers with term

select document from products where price = 20

For an exact match, use term:

{
    "term" : {
        "price" : 20
    }
}

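When scoring is not wanted (skipping scoring improves efficiency), turn the term lookup into a filter. constant_score runs the term query in a non-scoring way (strictly speaking with a uniform score: the _score of every hit is 1.0):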

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "price" : 20 }
            }
        }
    }
}
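
As a small illustration, the request body above can also be assembled in code. This is only a sketch: the helper names `term` and `constant_score` are made up for this example and are not part of any Elasticsearch client library.

```python
import json

def term(field, value):
    """Build an exact-value term clause (hypothetical helper)."""
    return {"term": {field: value}}

def constant_score(filter_clause):
    """Wrap a filter clause so that it runs without relevance scoring."""
    return {"query": {"constant_score": {"filter": filter_clause}}}

# Same structure as the request body shown above.
body = constant_score(term("price", 20))
print(json.dumps(body, indent=4))
```

The resulting dictionary can be serialized and sent to the _search endpoint with whatever HTTP client is at hand.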

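2. Querying text with term

select product from products where productID="XHDK-A-1293-#fJ3"

Using the same term lookup here does not return the expected result. The reason lies in the inverted index: when the document is indexed, Elasticsearch analyzes the text value and splits it into terms, so the complete string is never stored as a single term, while term looks for exactly that complete string. The analyze API shows what the value is split into: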

GET /my_store/_analyze
{
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
}

Result:
{
  "tokens" : [ {
    "token" :        "xhdk",
    "start_offset" : 0,
    "end_offset" :   4,
    "type" :         "<ALPHANUM>",
    "position" :     1
  }, {
    "token" :        "a",
    "start_offset" : 5,
    "end_offset" :   6,
    "type" :         "<ALPHANUM>",
    "position" :     2
  }, {
    "token" :        "1293",
    "start_offset" : 7,
    "end_offset" :   11,
    "type" :         "<NUM>",
    "position" :     3
  }, {
    "token" :        "fj3",
    "start_offset" : 13,
    "end_offset" :   16,
    "type" :         "<ALPHANUM>",
    "position" :     4
  } ]
}
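
To see why the term lookup misses, the splitting can be imitated in a few lines. This is a rough approximation that only splits on non-alphanumeric characters and lowercases; it is not the real analyzer, and the function name `rough_analyze` is made up for this sketch.

```python
import re

def rough_analyze(text):
    """Approximate the analysis shown above: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

tokens = rough_analyze("XHDK-A-1293-#fJ3")
print(tokens)  # ['xhdk', 'a', '1293', 'fj3']

# term looks for the exact string, which is not among the indexed tokens.
print("XHDK-A-1293-#fJ3" in tokens)  # False
```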

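To keep the value from being analyzed, Elasticsearch has to be told that productID is not analyzed, i.e. not_analyzed. That means changing the index mapping (note: the mapping of an existing field cannot be modified once it is in place; the index has to be deleted and recreated).
Changing the index:

Delete the existing index: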
DELETE /my_store 

Then create a new index with a custom mapping:
PUT /my_store 
{
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed" }
            }
        }
    }

}

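Once the documents have been indexed again, the earlier term query returns the expected result: the document whose productID matches exactly.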

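3. Elasticsearch performs several operations when it runs a non-scoring query:

(1) Find the matching documents. The term XHDK-A-1293-#fJ3 is looked up in the inverted index, which yields all documents containing that term.

(2) Build a bitset. A bitset is an array that holds only 0s and 1s and describes which documents the term matched (for example, with 4 documents of which document 2 matches, the bitset is [0,1,0,0]).

(3) Iterate over the bitset(s). Elasticsearch builds one bitset per query clause and then iterates over all of them to find the set of documents that satisfy every filter condition. Internally each one is represented as a "roaring bitmap", which encodes both sparse and dense sets efficiently. The sparsest bitset is usually iterated first, because that excludes a large number of documents straight away.

(4) Increment the usage counter. Elasticsearch caches non-scoring queries: a query that has been used within the last 256 queries is cached in memory.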
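
Steps (1) to (3) can be sketched with plain Python lists. This is a toy model under stated assumptions: the four sample documents are invented for the illustration, the bitsets are ordinary lists rather than roaring bitmaps, and none of the function names come from Elasticsearch.

```python
def build_inverted_index(docs):
    """Map each (field, value) pair to the positions of the documents containing it."""
    index = {}
    for pos, doc in enumerate(docs):
        for field, value in doc.items():
            index.setdefault((field, value), []).append(pos)
    return index

def bitset_for(index, field, value, num_docs):
    """Steps 1 and 2: look the term up and mark the matching documents with 1."""
    bits = [0] * num_docs
    for pos in index.get((field, value), []):
        bits[pos] = 1
    return bits

def intersect(bitsets):
    """Step 3: combine the bitsets, starting from the sparsest one."""
    ordered = sorted(bitsets, key=sum)
    result = ordered[0][:]
    for bits in ordered[1:]:
        result = [a & b for a, b in zip(result, bits)]
    return result

# Assumed sample data, not taken from a real index.
docs = [
    {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    {"price": 30, "productID": "JODL-X-1937-#pV7"},
    {"price": 30, "productID": "QQPX-R-3956-#aD8"},
]
index = build_inverted_index(docs)

print(bitset_for(index, "price", 20, len(docs)))  # [0, 1, 0, 0]

price_bits = bitset_for(index, "price", 30, len(docs))
id_bits = bitset_for(index, "productID", "JODL-X-1937-#pV7", len(docs))
print(intersect([price_bits, id_bits]))  # [0, 0, 1, 0]
```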

4. Combining filters

select product from products where (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)

This calls for the bool filter. Its clauses correspond to the SQL keywords (AND -> must; OR -> should; NOT -> must_not):

GET /my_store/products/_search
{
   "query" : {
      "filtered" : { 
         "filter" : {
            "bool" : {
              "should" : [ { "term" : {"price" : 20}}, { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} ],
              "must_not" : { "term" : {"price" : 30} } }
         }
      }
   }
}

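Note: in this syntax the combined bool filter has to be wrapped in a filtered query. (filtered was deprecated in Elasticsearch 2.0 and removed in 5.0; from 2.0 on, the same bool can be placed under the filter of a constant_score query instead.)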

5. Nested filters

select document from products where productID="KDKE-B-9947-#kL5" or (productID="JODL-X-1937-#pV7" and price=30)

GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
              "should" : [
                { "term" : {"productID" : "KDKE-B-9947-#kL5"}},
                { "bool" : {
                  "must" : [
                    { "term" : {"productID" : "JODL-X-1937-#pV7"}},
                    { "term" : {"price" : 30}}
                  ]
                }}
              ]
            }
         }
      }
   }
}
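
The must / should / must_not logic of sections 4 and 5 can be checked against a handful of documents with a tiny evaluator. This is a simplified sketch, not how Elasticsearch executes filters: it only understands term and bool, it treats should as "at least one clause must match" (the OR reading used in this post), and the four sample documents are assumed for the illustration.

```python
def as_list(clauses):
    """Accept either a single clause or a list of clauses."""
    return clauses if isinstance(clauses, list) else [clauses]

def matches(clause, doc):
    """Evaluate a term or bool filter clause against one document (a dict)."""
    if "term" in clause:
        field, value = next(iter(clause["term"].items()))
        return doc.get(field) == value
    if "bool" in clause:
        b = clause["bool"]
        must = as_list(b.get("must", []))
        should = as_list(b.get("should", []))
        must_not = as_list(b.get("must_not", []))
        if not all(matches(c, doc) for c in must):       # AND
            return False
        if any(matches(c, doc) for c in must_not):       # NOT
            return False
        if should:                                        # OR
            return any(matches(c, doc) for c in should)
        return True
    raise ValueError("unsupported clause: %r" % (clause,))

# Assumed sample data.
docs = [
    {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    {"price": 30, "productID": "JODL-X-1937-#pV7"},
    {"price": 30, "productID": "QQPX-R-3956-#aD8"},
]

# Section 4: (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND price != 30
combined = {"bool": {
    "should": [{"term": {"price": 20}},
               {"term": {"productID": "XHDK-A-1293-#fJ3"}}],
    "must_not": {"term": {"price": 30}},
}}

# Section 5: productID = "KDKE-B-9947-#kL5" OR (productID = "JODL-X-1937-#pV7" AND price = 30)
nested = {"bool": {
    "should": [
        {"term": {"productID": "KDKE-B-9947-#kL5"}},
        {"bool": {"must": [{"term": {"productID": "JODL-X-1937-#pV7"}},
                           {"term": {"price": 30}}]}},
    ],
}}

combined_hits = [i for i, d in enumerate(docs) if matches(combined, d)]
nested_hits = [i for i, d in enumerate(docs) if matches(nested, d)]
print(combined_hits)  # [0, 1]
print(nested_hits)    # [1, 2]
```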

6. Finding multiple exact values

select document from products where price in (20,30);

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "terms" : { 
                    "price" : [20, 30] }
            }
        }
    }
}

Note: the query above uses terms. Mind the difference between term and terms: term corresponds to "=" in SQL, terms corresponds to IN.

7. Range queries

select document from products where price >= 20 and price < 40;

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "price" : { "gte" : 20, "lt" : 40 } }
            }
        }
    }
}

Meaning of the operators:
gt : > greater than
lt : < less than
gte : >= greater than or equal to
lte : <= less than or equal to

To query a date field, replace the range in the filter above with one of the following:

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}

"range" : {
    "timestamp" : {
        "gt" : "now-1h" // the past hour
    }
}

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-01 00:00:00||+1M" // one month after the given time
    }
}
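
What the two date-math expressions resolve to can be worked out by hand. The sketch below only mimics `now-1h` and `||+1M` with the Python standard library; it is not Elasticsearch's date-math parser, the helper names are made up, and clamping the day to the end of a shorter month is a choice made for this sketch.

```python
from datetime import datetime, timedelta
import calendar

FMT = "%Y-%m-%d %H:%M:%S"

def one_hour_before(now):
    """Equivalent of "now-1h" for a given "now"."""
    return now - timedelta(hours=1)

def add_months(dt, months):
    """Equivalent of "||+1M": add calendar months, clamping the day if needed."""
    total = dt.year * 12 + (dt.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)

anchor = datetime.strptime("2014-01-01 00:00:00", FMT)
upper = add_months(anchor, 1)
print(upper.strftime(FMT))  # 2014-02-01 00:00:00

lower = one_hour_before(datetime(2014, 1, 1, 12, 0, 0))
print(lower.strftime(FMT))  # 2014-01-01 11:00:00
```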

Range queries also work on strings (in lexicographic order):

"range" : {
    "title" : {
        "gte" : "a",
        "lt" :  "b"
    }
}

Note: Elasticsearch is in effect running a term filter for every term that falls inside the range, which is much slower than a range filter on dates or numbers.
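
A rough picture of why string ranges cost more: the range first has to be expanded into the terms it covers, and each of those terms is then looked up. In the sketch below a sorted Python list stands in for the index's term dictionary; the terms are invented and `terms_in_range` is a hypothetical helper.

```python
from bisect import bisect_left

def terms_in_range(sorted_terms, gte, lt):
    """Return every term t with gte <= t < lt, using lexicographic order."""
    low = bisect_left(sorted_terms, gte)
    high = bisect_left(sorted_terms, lt)
    return sorted_terms[low:high]

# Assumed set of unique terms in the title field.
terms = sorted(["5", "50", "6", "B", "C", "a", "ab", "abb", "abc", "b"])

covered = terms_in_range(terms, "a", "b")
print(covered)       # ['a', 'ab', 'abb', 'abc']
print(len(covered))  # 4 term lookups for this one range
```

The more unique terms a field has inside the range, the more lookups are needed.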

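8. About null values

In Elasticsearch, a field that does not exist holds no tokens (terms in the inverted index). null, [] (an empty array) and [null] are all equivalent: none of them can be stored in the inverted index.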

select tags from posts where tags is not null;

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}

select tags from posts where tags is null;

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {
            "filter": {
                "missing" : { "field" : "tags" }
            }
        }
    }
}
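
The equivalence of null, [] and [null] can be illustrated with a small model of exists and missing. It is a sketch under assumptions: the sample posts are invented, and the functions only imitate the behaviour described above rather than call Elasticsearch.

```python
def indexable_values(value):
    """Values that would produce tokens; None, [] and [None] yield nothing."""
    values = value if isinstance(value, list) else [value]
    return [v for v in values if v is not None]

def exists(doc, field):
    """True when the field holds at least one non-null value."""
    return len(indexable_values(doc.get(field))) > 0

def missing(doc, field):
    """The inverse of exists."""
    return not exists(doc, field)

# Assumed sample documents.
posts = [
    {"tags": ["search"]},
    {"tags": ["search", "open_source"]},
    {"other_field": "some data"},
    {"tags": None},
    {"tags": ["search", None]},
]

with_tags = [i for i, p in enumerate(posts) if exists(p, "tags")]
without_tags = [i for i, p in enumerate(posts) if missing(p, "tags")]
print(with_tags)     # [0, 1, 4]
print(without_tags)  # [2, 3]
```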

null on object fields:

{
   "name" : {
      "first" : "John",
      "last" :  "Smith"
   }
}

is actually stored like this:

{
   "name.first" : "John",
   "name.last"  : "Smith"
}

So when the name field is queried with exists or missing:

{
    "exists" : { "field" : "name" }
}

what is actually executed is:

{
    "bool": {
        "should": [
            { "exists": { "field": "name.first" }},
            { "exists": { "field": "name.last" }}
        ]
    }
}

Only when both first and last are empty is the name namespace considered not to exist.
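
The flattening and the resulting exists behaviour can be sketched as follows. The functions `flatten` and `object_exists` are hypothetical helpers written for this illustration, not Elasticsearch internals.

```python
def flatten(obj, prefix=""):
    """Turn nested objects into dotted field names, e.g. name.first."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat

def object_exists(doc, field):
    """exists on an object field: true if any of its inner fields has a value."""
    flat = flatten(doc)
    return any(
        value is not None
        for path, value in flat.items()
        if path == field or path.startswith(field + ".")
    )

doc = {"name": {"first": "John", "last": "Smith"}}
print(flatten(doc))  # {'name.first': 'John', 'name.last': 'Smith'}

print(object_exists(doc, "name"))                                        # True
print(object_exists({"name": {"first": None, "last": "Smith"}}, "name"))  # True
print(object_exists({"name": {"first": None, "last": None}}, "name"))     # False
```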

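9. About caching

Elasticsearch caches non-scoring queries. The cached value is a bitset, and it is updated incrementally (for example, when new documents are added, only those new documents have to be added to the existing bitset, instead of recomputing the whole cache over and over).

As a result, the two term filters on folder = inbox inside the bool query below share the same bitset.

Take the query in the following example, which finds emails that satisfy either of these conditions:

(1) In the inbox and not yet read.
(2) Not in the inbox but marked as important.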

GET /inbox/emails/_search
{
  "query": {
      "constant_score": {
          "filter": {
              "bool": {
                 "should": [
                    { "bool": {
                          "must": [
                             { "term": { "folder": "inbox" }},
                             { "term": { "read": false }}
                          ]
                    }},
                    { "bool": {
                          "must_not": {
                             "term": { "folder": "inbox" }
                          },
                          "must": {
                             "term": { "important": true }
                          }
                    }}
                 ]
              }
            }
        }
    }
}

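If a non-scoring query has been used within the last 256 queries (how many times is required depends on the query type), it becomes a candidate for caching. Not every segment is guaranteed to cache its bitset, though. Only segments holding more than 10,000 documents (or more than 3% of the total number of documents) cache the bitset. Small segments can be searched and merged quickly, so caching them brings little benefit.

Once cached, the bitset of a non-scoring query stays in the cache until it is evicted. Eviction is LRU-based: when the cache is full, the least recently used filter is evicted.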
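
The two mechanisms described here, a usage window over recent queries and LRU eviction, can be modelled in a few lines. This is a toy sketch, not Elasticsearch's cache: the class names, the `min_uses` threshold and the capacity of 2 are all assumptions made for the illustration.

```python
from collections import OrderedDict, deque

class UsageTracker:
    """Remember the most recent queries and decide which ones are cache candidates."""
    def __init__(self, window=256):
        self.history = deque(maxlen=window)  # older entries fall out automatically

    def record(self, key):
        self.history.append(key)

    def is_candidate(self, key, min_uses):
        # min_uses is hypothetical; the post only says it depends on the query type.
        return self.history.count(key) >= min_uses

class LruFilterCache:
    """Keep at most `capacity` bitsets, evicting the least recently used one."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, bitset):
        self._entries[key] = bitset
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used

tracker = UsageTracker()
for key in ["folder:inbox", "read:false", "folder:inbox"]:
    tracker.record(key)
print(tracker.is_candidate("folder:inbox", min_uses=2))  # True
print(tracker.is_candidate("read:false", min_uses=2))    # False

cache = LruFilterCache(capacity=2)
cache.put("folder:inbox", [1, 0, 1])
cache.put("read:false", [0, 1, 1])
cache.get("folder:inbox")               # touch: now the most recently used
cache.put("important:true", [1, 1, 0])  # cache is full, "read:false" is evicted
print(cache.get("read:false"))    # None
print(cache.get("folder:inbox"))  # [1, 0, 1]
```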