Elasticsearch Optimization

Latest recommended article published on 2022-04-17 11:40:59

九指码农

Category: es学习 (ES learning)

Article link: https://blog.csdn.net/qq_14950717/article/details/79311929

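I. The Elasticsearch query process

An Elasticsearch search runs in two phases: the query phase and the fetch phase.
Query phase
The client sends a search request to a node in the cluster (call it node 1). Node 1 acts as the coordinating node and creates a priority queue of size from + size (from: the offset; size: the number of documents to return).
Node 1 forwards the request to one copy (primary or replica) of every shard of the index. Each shard runs the query locally, builds its own priority queue of size from + size, and adds its hits to it.
Each shard sends only the document IDs and sort values of its queued hits back to node 1, which merges and sorts them into its own queue.
Fetch phase
Node 1 sends requests by document ID to the relevant shards; those shards load the documents and return them to node 1.
Node 1 returns the results to the client.
As the steps show, a large from offset (size is usually left at the default page size) puts heavy pressure on a single node: the coordinating node has to merge number_of_shards * (from + size) entries. Avoid this kind of deep pagination at the application level wherever possible.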
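
The cost of deep pagination can be illustrated with a small simulation. This is only a sketch of the scatter-gather idea, not Elasticsearch code: the function names, the three in-memory "shards", and the integer scores are all made up for the example.

```python
import heapq

def shard_top_hits(hits, from_, size):
    """Query phase on one shard: keep only the best from_ + size (score, doc_id) pairs."""
    return heapq.nlargest(from_ + size, hits)

def coordinate(shards, from_, size):
    """Coordinating node: merge the per-shard queues, then cut out the requested page."""
    candidates = []
    for hits in shards:
        candidates.extend(shard_top_hits(hits, from_, size))
    merged = sorted(candidates, reverse=True)
    # Return the page plus the number of entries that had to be merged
    return merged[from_:from_ + size], len(candidates)

# Hypothetical data: 3 shards with 1000 documents each and unique scores
shards = [[(i * 3 + s, f"s{s}-d{i}") for i in range(1000)] for s in range(3)]

page, merged_count = coordinate(shards, from_=500, size=10)
print(merged_count)  # entries merged = shards * (from + size)
```

With from=500 and size=10 the coordinating node merges 3 * 510 entries just to return 10 documents, which is why the merge cost grows with the offset rather than with the page size.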

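II. Queries

1. An Elasticsearch clause can run in one of two ways: as a query or as a filter.
How a query runs:
1. Match each document against the search condition;
2. Compute a relevance score, then return the matching documents.
This mode suits full-text search.
How a filter runs:
1. Decide whether each document satisfies the condition; if it does not, that outcome is still cached (the document is recorded as a non-match);
2. If it does, the match is cached directly.
This mode suits exact-value matching.
In short, filters are faster for two reasons:
1. Their results are cached;
2. They skip relevance scoring.
When you do not care about the score of the results, or are not doing full-text search, use a filter for better performance.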

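2. Types of filter.
Note that filters themselves come in two kinds: post_filter and filtered.
post_filter (query first, then filter): this kind of filter does not improve performance.
filtered (filter first, then query): fast. Note that the filtered query was deprecated in Elasticsearch 2.0 and removed in 5.0; on later versions the same effect comes from a bool query with a filter clause.
post_filter (query first, then filter)
{
  "query": {
    "match": { "title": "cat" }
  },
  "post_filter": {
    "term": { "year": 1999 }
  }
}
The request above first runs the match query "match": { "title": "cat" } and then filters its hits, so this kind of filter does not improve performance.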

filtered (filter first, then query: fast)
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "cat"
        }
      },
      "filter": {
        "term": {
          "year": 1999
        }
      }
    }
  }
}
The request above runs as follows:
1. First filter by
"filter": {
  "term": {
    "year": 1999
  }
}
The result of this filter is cached, and no relevance score is computed for it.
2. Then run
"query": {
  "match": {
    "title": "cat"
  }
}
as a query over the documents that passed the filter in step 1.
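Given how involved this looks, why not put everything into the more efficient filter, as in the request below?
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "title": "cat"
              }
            },
            {
              "term": {
                "year": 1999
              }
            }
          ]
        }
      }
    }
  }
}
The reason for combining a query with a filter is:
1. It keeps what a query offers for the title condition (analysis of the search text and relevance scoring);
2. It does not waste the filter cache on that condition. Of course, if you need neither scoring nor full-text matching for a condition, there is no need to make it a query.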
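
Because the filtered query no longer exists on Elasticsearch 5.0 and later, the requests above have to be rewritten as a bool query there. The sketch below shows that rewrite on plain Python dictionaries; the helper name filtered_to_bool is hypothetical and is not part of any client library.

```python
import json

def filtered_to_bool(query):
    """Rewrite {"filtered": {"query": ..., "filter": ...}} as an equivalent bool query."""
    inner = query["filtered"]
    bool_body = {}
    if "query" in inner:
        bool_body["must"] = inner["query"]      # scored part
    if "filter" in inner:
        bool_body["filter"] = inner["filter"]   # non-scoring, cacheable part
    return {"bool": bool_body}

old_query = {
    "filtered": {
        "query": {"match": {"title": "cat"}},
        "filter": {"term": {"year": 1999}},
    }
}

new_query = filtered_to_bool(old_query)
print(json.dumps({"query": new_query}, indent=2))
```

The match clause stays in must so it still contributes to _score, while the term clause moves to filter, where it runs without scoring.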

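Full-text search draws on word proximity, partial matching, fuzzy matching, and language awareness.

Understanding how each query contributes to the relevance score _score helps us debug our queries: making sure the documents we consider the best matches appear on the first page of results, and trimming the "long tail" of barely relevant results.

Search is not only full-text search: a large part of the data is structured, such as dates and numbers.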

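Structured search
Structured search: querying data that has inherent structure, such as dates, times, and numbers. Such data has a precise format and supports logical operations, for example range and greater-than/less-than comparisons on numbers and times.

Text can be structured too, for example a color field limited to red, green, and so on.

The result of a structured query is yes or no; there is no notion of relevance or score.

1 Finding Exact Values
Use filters: no scoring step, fast to execute, high performance.

term query: works on number, boolean, date, and text.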

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
// Basic version (level 1)
{
  "query": {
    "term": {
      "price": 20
    }
  }
}
// Improved version (level 2):
{
  "query" : {
    "constant_score" : { // run the term query in non-scoring mode, i.e. turn it into a filter
      "filter" : { // constant_score must be followed by filter
        "term" : {
          "price" : 20
        }
      }
    }
  }
}
// A query placed inside a filter clause is not scored and no relevance is computed; every hit gets the default score of 1.

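term queries on text

// The analyze API shows that the value has been tokenized
GET my_store/_analyze
{
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
}

Result (truncated; only the first token is shown):

{
  "tokens" : [ {
    "token" : "xhdk",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "",
    "position" : 1
  }
After analysis:
1. The field is represented by several tokens instead of a single one
2. All letters have been lowercased
3. The hyphens and the hash sign are lost

So a term lookup for the exact original value returns nothing: the term being searched for is not in the inverted index!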
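
The failure can be reproduced with a toy inverted index. This is a sketch under stated assumptions: analyze below is only a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and build_index and term_lookup are hypothetical helpers, not Elasticsearch APIs.

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough approximation of the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs, analyzed=True):
    """Map each indexed term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index

def term_lookup(index, term):
    """A term query does no analysis: it looks the term up exactly as given."""
    return sorted(index.get(term, set()))

docs = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}
analyzed_index = build_index(docs, analyzed=True)
exact_index = build_index(docs, analyzed=False)

print(analyze("XHDK-A-1293-#fJ3"))
print(term_lookup(analyzed_index, "XHDK-A-1293-#fJ3"))  # no match: term not in the index
print(term_lookup(exact_index, "XHDK-A-1293-#fJ3"))     # matches when stored unanalyzed
```

Storing the value as a single unanalyzed term is what makes the exact lookup succeed.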

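The type of an existing field cannot be changed, so the only option is to (1) delete the index, (2) recreate it, (3) reindex the data, and (4) run the term query again.
DELETE my_store
PUT my_store
{
  "mappings" : {
    "products" : {
      "properties" : {
        "productID" : {
          "type" : "string",
          "index" : "not_analyzed" // do not analyze; on 5.x and later use "type": "keyword" instead
        }
      }
    }
  }
}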

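How filters (non-scoring queries) work internally
A non-scoring query is executed in several steps:

Find matching docs. Look the term up in the inverted index to find all documents that contain it.

Build a bitset.
The filter builds a bitset (an array of 0s and 1s) for each non-scoring query, describing which documents contain the term; a matching document has its bit set to 1. For example, if only document 1 matches among documents 1 to 4, then bitset = [1,0,0,0]. Internally it is represented as a "roaring bitmap", which encodes both sparse and dense sets efficiently.

Iterate over the bitset(s)
Once a bitset has been generated for each query, Elasticsearch iterates over the bitsets to find the set of documents that satisfy all filter conditions. The order of execution is decided heuristically, but in general the sparsest bitset is iterated first (because it rules out the largest number of documents).

Increment the usage counter.
Elasticsearch can cache non-scoring queries for faster access, but caching something that is rarely used is a waste of resources.
For this reason Elasticsearch tracks the history of query usage per index. If a query has been used more than a few times in the last 256 queries, it is cached in memory. Even then, the bitset is not cached on segments that hold fewer than 10,000 documents (or less than 3% of the total index size), because such small segments will soon be merged away and are not worth cache space.

Conceptually, think of the non-scoring part as being executed first; keeping this in mind helps you write efficient, fast search requests.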
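
The bitset steps can be sketched in a few lines. The example assumes plain Python lists in place of roaring bitmaps and uses made-up posting lists; it only illustrates why starting from the sparsest bitset keeps the work small.

```python
def term_bitset(postings, num_docs):
    """Build a 0/1 list with a 1 for every document ID in the posting list."""
    bits = [0] * num_docs
    for doc in postings:
        bits[doc] = 1
    return bits

def intersect(bitsets):
    """AND several bitsets, walking only the documents set in the sparsest one."""
    if not bitsets:
        return []
    ordered = sorted(bitsets, key=sum)  # fewest set bits first
    sparsest, rest = ordered[0], ordered[1:]
    return [doc for doc, bit in enumerate(sparsest)
            if bit and all(other[doc] for other in rest)]

# Hypothetical posting lists over 8 documents (IDs 0..7)
price_30 = term_bitset([2, 3, 5, 6, 7], num_docs=8)   # dense filter
product_x = term_bitset([3], num_docs=8)              # sparse filter

print(intersect([price_30, product_x]))
```

Only the single document set in product_x is checked against price_30, instead of all five candidates of the dense filter.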

2 Combining Filters
Scenario: filtering on several fields or values.
Tool: the bool filter
Shape: a bool query is composed of four sections

{
  "bool" : { // every section is optional; each section holds one query or a list of queries
    "must" : [],     // = AND: all yes, every clause must match
    "should" : [],   // = OR: any yes, at least one clause must match
    "must_not" : [], // = NOT: all no, no clause may match
    "filter" : []    // must match (AND), but runs in non-scoring, filtering mode; inside constant_score the distinction is moot because everything is already non-scoring
  }
}

GET /my_store/products/_search
{
  "query" : {
    "constant_score" : { // since the 2.x upgrade the old filtered wrapper is replaced by constant_score; constant_score means non-scoring
      "filter" : {
        "bool" : {
          "should" : [ // several clauses go in an array [...]
            { "term" : {"price" : 20}},
            { "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
          ],
          "must_not" : { // a single clause can be an object {...}
            "term" : {"price" : 30}
          }
        }
      }
    }
  }
}

Nesting Boolean Queries
GET /my_store/products/_search
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "should" : [ // one of the should clauses is itself a bool query; either clause may match
            { "term" : {"productID" : "KDKE-B-9947-#kL5"}},
            { "bool" : {
              "must" : [
                { "term" : {"productID" : "JODL-X-1937-#pV7"}},
                { "term" : {"price" : 30}}
              ]
            }}
          ]
        }
      }
    }
  }
}

Ref https://www.elastic.co/guide/en/elasticsearch/reference/2.1/query-dsl-bool-query.html
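
To check which documents the two bool requests select, the boolean logic can be replayed in memory. The evaluator below is a hypothetical teaching aid that understands only term and bool; it assumes that at least one should clause must match when there is no must or filter clause, and that productID is stored unanalyzed.

```python
def as_list(clauses):
    """Accept either a single clause object or a list of clauses."""
    return clauses if isinstance(clauses, list) else [clauses]

def matches(clause, doc):
    """Evaluate a tiny subset of the query DSL (term, bool) against one document."""
    if "term" in clause:
        field, value = next(iter(clause["term"].items()))
        return doc.get(field) == value
    if "bool" in clause:
        body = clause["bool"]
        must = as_list(body.get("must", [])) + as_list(body.get("filter", []))
        should = as_list(body.get("should", []))
        must_not = as_list(body.get("must_not", []))
        if not all(matches(c, doc) for c in must):
            return False
        if any(matches(c, doc) for c in must_not):
            return False
        if should and not must:  # assumption: should is required only without must/filter
            return any(matches(c, doc) for c in should)
        return True
    raise ValueError(f"unsupported clause: {clause}")

def search(clause, docs):
    return sorted(doc_id for doc_id, doc in docs.items() if matches(clause, doc))

docs = {
    1: {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "productID": "QQPX-R-3956-#aD8"},
}

combined = {"bool": {
    "should": [{"term": {"price": 20}}, {"term": {"productID": "XHDK-A-1293-#fJ3"}}],
    "must_not": {"term": {"price": 30}},
}}
nested = {"bool": {"should": [
    {"term": {"productID": "KDKE-B-9947-#kL5"}},
    {"bool": {"must": [{"term": {"productID": "JODL-X-1937-#pV7"}}, {"term": {"price": 30}}]}},
]}}

print(search(combined, docs), search(nested, docs))
```

The first request selects documents 1 and 2; the nested one selects documents 2 and 3.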

3 Finding Multiple Exact Values
term: a single value.
terms: multiple values.
Both mean "contains any", not "equals"; the nature of an inverted index also means that entire field equality is rather difficult to calculate.

{
  "terms" : { // note: contains any, not equals
    "price" : [20, 30]
  }
}
{
  "query" : {
    "constant_score" : { // terms goes inside the filter clause
      "filter" : {
        "terms" : {
          "price" : [20, 30]
        }
      }
    }
  }
}
Equals Exactly: add a field that stores the number of terms

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }

...

{
  "query": {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "term" : { "tags" : "search" } },
            { "term" : { "tag_count" : 1 } }
          ]
        }
      }
    }
  }
}
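
The difference between "contains" and "equals" is easy to see on the two sample documents. The helper names below are made up for the illustration, and the documents are plain dictionaries rather than indexed data.

```python
docs = {
    1: {"tags": ["search"], "tag_count": 1},
    2: {"tags": ["search", "open_source"], "tag_count": 2},
}

def contains_term(docs, field, value):
    """term semantics: the field contains the value, possibly among others."""
    return sorted(i for i, d in docs.items() if value in d[field])

def equals_exactly(docs, field, value, count_field):
    """Emulate equality: the field contains the value AND holds exactly one term."""
    return sorted(i for i, d in docs.items()
                  if value in d[field] and d[count_field] == 1)

print(contains_term(docs, "tags", "search"))
print(equals_exactly(docs, "tags", "search", "tag_count"))
```

The plain term match returns both documents, while the extra tag_count condition narrows the result to the document whose only tag is search.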
4 Ranges
gt, gte: greater than, greater than or equal to
lt, lte: less than, less than or equal to
// query -> constant_score -> filter
"range" : {
  "price" : {
    "gte" : 20,
    "lt" : 40
  }
}
// date fields support date math:
"gt" : "now-1h" // relative to the current time
"lt" : "2014-01-01 00:00:00||+1M" // an absolute date plus one month
String comparison
Strings are compared in lexicographic (alphabetical) order.
Terms in the inverted index are sorted in lexicographical order, which is why string ranges use this order.

Be Careful of Cardinality:
Numeric and date fields are indexed in such a way that ranges are efficient to calculate.
For a string field, Elasticsearch is effectively performing a term filter for every term that falls in the range. This is much slower than a date or numeric range. String ranges are fine on a field with low cardinality, that is, a small number of unique terms.
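
A short sketch makes both points concrete: string ranges follow lexicographic order, and their cost grows with the number of distinct terms inside the range. The term list is made up, and string_range is a hypothetical helper that mimics a range over a sorted term dictionary.

```python
from bisect import bisect_left

def string_range(sorted_terms, gte, lt):
    """Return every term t with gte <= t < lt from a lexicographically sorted term list."""
    return sorted_terms[bisect_left(sorted_terms, gte):bisect_left(sorted_terms, lt)]

# Hypothetical term dictionary of a string field
terms = sorted(["1", "10", "2", "9", "apple", "b", "banana"])

in_range = string_range(terms, gte="1", lt="2")
print(in_range)          # "10" sorts between "1" and "2"
print("9" > "10")        # True as strings, although 9 < 10 as numbers
print(len(in_range))     # one term lookup per matching term
```

On a high-cardinality field the list returned by string_range can be very long, and each entry corresponds to one term filter.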

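5 Dealing with Null Values
exists: the equivalent of NOT NULL. Used as query -> constant_score -> filter ("qcf"); it matches documents that have any value in the specified field.
"exists" : { "field" : "tags" } A field that holds only null, or no value at all, is excluded; a field that holds null together with other values is not excluded.
missing: the opposite of exists. (The missing query was removed in Elasticsearch 5.0; on later versions wrap exists in a bool must_not clause.)
"missing" : { "field" : "tags" }
Distinguishing an explicit null from a field that has no value at all
For a string, numeric, Boolean, or date field, you can also set a null_value that will be used whenever an explicit null value is encountered.

Make sure the null_value has the same type as the field.
Make sure the null_value is unique and cannot collide with a real non-null business value, to avoid confusion.
exists/missing on Objects
The exists and missing queries also work on inner objects, not just core types.

{
  "name" : {
    "first" : "John",
    "last" : "Smith"
  }
}
Internally this is stored as:
{
  "name.first" : "John",
  "name.last" : "Smith"
}
Query

{
  "exists" : { "field" : "name" }
}
which is effectively executed as:
{
  "bool": {
    "should": [
      { "exists": { "field": "name.first" }},
      { "exists": { "field": "name.last" }}
    ]
  }
}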
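
The rules for exists can be summarized as "at least one non-null value". The function below is a minimal sketch of that rule on plain dictionaries; has_value is a made-up name, and the sample documents are illustrative only.

```python
def has_value(doc, field):
    """Mimic exists: true if the field holds at least one non-null value."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)

samples = [
    {"tags": ["search"]},        # has a value
    {"tags": ["search", None]},  # null next to a real value still counts
    {"tags": None},              # explicit null
    {"tags": [None]},            # only null inside the array
    {"tags": []},                # empty array
    {},                          # field absent
]

print([has_value(doc, "tags") for doc in samples])
```

Only the first two sample documents would be matched by exists; the other four would be matched by its negation.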
6 About caching
At the core of a filter is a bitset representing which documents match the filter.
Once cached, these bitsets can be reused wherever the same query is used, without having to reevaluate the entire query again.

Bitsets are "smart": they are updated incrementally.

Independent Query Caching
Once cached, a query can be reused in multiple search requests.

It is not dependent on the "context" of the surrounding query. This allows caching to accelerate the most frequently used portions of your queries, without wasting overhead on the less frequent / more volatile portions.

For example, a search for emails that are either:
In the inbox and have not been read
Not in the inbox but have been marked as important
GET /inbox/emails/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "bool": {
              "must": [
                { "term": { "folder": "inbox" }}, // 1
                { "term": { "read": false }}
              ]
            }},
            { "bool": {
              "must_not": {
                "term": { "folder": "inbox" } // 2: although one sits in must and the other in must_not, clauses 1 and 2 are the same query and reuse one bitset
              },
              "must": {
                "term": { "important": true }
              }
            }}
          ]
        }
      }
    }
  }
}
Autocaching Behavior
Being inside a filter does not guarantee that a query is cached.
Early versions of Elasticsearch cached everything that was cacheable.

Many filters are very fast to evaluate, but substantially slower to cache (and reuse from cache). These filters don't make sense to cache, since you'd be better off just re-executing the filter again.

Inspecting the inverted index is very fast, and most query components are rare.
Consider a term filter on a "user_id" field: if you have millions of users, any particular user ID will only occur rarely.

Elasticsearch caches queries automatically based on usage frequency. If a non-scoring query has been used a few times (dependent on the query type) in the last 256 queries, the query is a candidate for caching. However, not all segments are guaranteed to cache the bitset. Only segments that hold more than 10,000 documents (or 3% of the total documents, whichever is larger) will cache the bitset. Because small segments are fast to search and merged out quickly, it doesn't make sense to cache bitsets here.

Once cached, a non-scoring bitset will remain in the cache until it is evicted. Eviction is done on an LRU basis.
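
The caching rules described above can be modeled roughly as follows. This is a simplified sketch, not the real query cache: the class name, the min_uses value standing in for "a few times", and the tiny max_entries limit are assumptions chosen for the example; only the 256-query history, the segment thresholds, and LRU eviction come from the text.

```python
from collections import Counter, OrderedDict, deque

class FilterCachePolicy:
    """Toy model: usage history + segment-size check + LRU eviction."""

    def __init__(self, history_size=256, min_uses=5, max_entries=2):
        self.history = deque(maxlen=history_size)  # last N queries seen
        self.min_uses = min_uses                   # assumed meaning of "a few times"
        self.max_entries = max_entries
        self.cache = OrderedDict()                 # least recently used entry first

    def record_use(self, query_key):
        self.history.append(query_key)

    def is_candidate(self, query_key):
        return Counter(self.history)[query_key] >= self.min_uses

    @staticmethod
    def segment_is_cacheable(segment_docs, total_docs):
        # More than 10,000 docs or 3% of all docs, whichever is larger
        return segment_docs > max(10_000, 0.03 * total_docs)

    def get(self, query_key):
        if query_key in self.cache:
            self.cache.move_to_end(query_key)      # mark as recently used
            return self.cache[query_key]
        return None

    def put(self, query_key, bitset):
        self.cache[query_key] = bitset
        self.cache.move_to_end(query_key)
        while len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)         # evict the least recently used

policy = FilterCachePolicy(min_uses=3, max_entries=2)
for _ in range(3):
    policy.record_use("folder:inbox")
policy.record_use("user_id:42")
print(policy.is_candidate("folder:inbox"), policy.is_candidate("user_id:42"))
```

A frequently repeated filter such as folder:inbox becomes a caching candidate, while a one-off user_id lookup does not, which matches the reasoning about rare terms above.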