ES Learning Notes

Latest recommended article published 2023-04-07 12:04:23

Author: sinat_32176267

Category: Big Data

Article link: https://blog.csdn.net/sinat_32176267/article/details/119211343

1. ES write and query principles

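Why is search near real-time?
How does Elasticsearch make sure updates are persisted so that no data is lost on a power failure?
Why doesn't deleting a document free its space immediately?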

1.1 ES read flow (retrieving a document)

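The sequence of steps for retrieving a document from either a primary or a replica shard is as follows:

1. The client sends a get request to Node 1.

2. The node uses the document's _id to determine that the document belongs to shard 0. Copies of shard 0 exist on all three nodes. On this occasion, it forwards the request to Node 2.

3. Node 2 returns the document to Node 1, which returns it to the client.

For read requests, the coordinating node picks a different shard copy on each request, cycling through all the copies in round-robin fashion to balance the load.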
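
The read path above can be modelled in a few lines of Python. This is only an illustrative sketch: `pick_shard` uses `zlib.crc32` as a stand-in for the hash Elasticsearch actually applies to the routing value (the `_id` by default), and `ShardCopyPicker`, the node names, and the shard count are hypothetical.

```python
import itertools
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document _id to a shard: hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in for the real hash function (assumption).
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


class ShardCopyPicker:
    """Round-robin over the nodes holding a copy (primary or replica) of a shard."""

    def __init__(self, nodes_with_copy):
        self._cycle = itertools.cycle(nodes_with_copy)

    def next_node(self) -> str:
        return next(self._cycle)


# Hypothetical cluster: copies of the shard live on all three nodes.
picker = ShardCopyPicker(["node1", "node2", "node3"])
shard = pick_shard("1", num_primary_shards=2)
targets = [picker.next_node() for _ in range(4)]
print(shard, targets)
```

Four consecutive reads go to node1, node2, node3, and then node1 again, which is how the load gets spread across the copies.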

2. Bool queries

2.1 The bool query

A bool filter is made up of the following parts:

{
  "bool" : {
    "must" : [],
    "should" : [],
    "must_not" : []
  }
}

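must: all of these clauses must match; equivalent to AND.
must_not: none of these clauses may match; equivalent to NOT.
should: at least one of these clauses should match; equivalent to OR.

Note: every must clause has to match and every must_not clause has to not match, but how many should clauses need to match? By default none of them is required, with one exception: when the bool has no must (or filter) clause, at least one should clause must match.

A bool filter accepts multiple other filters as arguments and combines them into all kinds of boolean (logical) combinations.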
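
To make the should rule concrete, here is a small Python model of how the clauses combine. It is a sketch only: `matches` and `as_list` are hypothetical helpers, documents are flat dicts, only term clauses are supported, and scoring and `minimum_should_match` overrides are ignored.

```python
def as_list(clauses):
    """A bool clause may hold a single condition ({}) or several ([])."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def matches(doc, query):
    """Evaluate a {"term": ...} or {"bool": ...} query against a flat dict."""
    if "term" in query:
        field, value = next(iter(query["term"].items()))
        return doc.get(field) == value

    body = query["bool"]
    required = as_list(body.get("must")) + as_list(body.get("filter"))
    optional = as_list(body.get("should"))
    excluded = as_list(body.get("must_not"))

    if not all(matches(doc, q) for q in required):
        return False
    if any(matches(doc, q) for q in excluded):
        return False
    # should is only mandatory when there is no must/filter clause.
    if optional and not required:
        return any(matches(doc, q) for q in optional)
    return True


only_should = {"bool": {"should": [
    {"term": {"price": 20}},
    {"term": {"productID": "a"}},
]}}
with_must = {"bool": {
    "must": {"term": {"productID": "a"}},
    "should": [{"term": {"price": 20}}],
}}

doc = {"productID": "a", "price": 10}
other = {"productID": "b", "price": 10}
print(matches(doc, only_should))    # productID satisfies one should clause
print(matches(other, only_should))  # no should clause is satisfied
print(matches(doc, with_must))      # must is satisfied, should is optional
```

In `with_must` the should clause does not match, yet the document is still a hit, because should only affects relevance once a must clause is present.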

2.1.1 A simple bool filter

SELECT product
FROM   products
WHERE  (price = 20 OR productID = "a")
  AND  (price != 30)

{
   "query" : {
      "bool" : {
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}},
                 { "term" : {"productID" : "a"}}
              ],
              "must_not" : {
                 "term" : {"price" : 30}
              }
           }
         }
      }
   }
}

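// A filter clause wraps everything. The two term filters inside the should block are children of the inner bool filter, and a document only needs to match one of them.
A product priced at 30 is excluded automatically, because it matches the must_not clause.
If a should clause sits at the same level as a must clause, the should conditions are optional.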

2.1.2 Nested bool filters

SELECT document
FROM products
WHERE productID = "a"
OR ( productID = "c"
AND price = 30 )

GET localhost:9200/my_store/_search

{
   "query" : {
      "bool" : {
         "filter" : {
            "bool" : {
              "should" : [
                { "term" : {"productID" : "a"}}, 
                { "bool" : { (1)
                  "must" : [
                    { "term" : {"productID" : "c"}}, 
                    { "term" : {"price" : 30}} (2)
                  ]
                }}
              ]
           }
         }
      }
   }
}
(1) The term and bool filters are siblings: both sit inside the outer should clause, so a hit must match at least one of them.
(2) These two term clauses are siblings inside the must clause, so a hit must match both of them.

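Under a bool query, must, should, and must_not can appear side by side at the same level, and each of them can contain term queries, range queries, and so on.

To nest further boolean logic inside one of these clauses, wrap it in another bool. A clause that holds a single condition can take an object ({}); a clause that holds several parallel conditions takes an array ([]).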

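2.2 Nested objects

Because creating, updating, and deleting a single document are atomic operations in Elasticsearch, it makes sense to store closely related entities in the same document.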

SELECT *
FROM   user
WHERE  user.first='Alice' and user.last = 'Smith'

2.2.1 Querying a plain object field

Create the index:
PUT localhost:9200/nested_test
 
Index a document:
PUT localhost:9200/nested_test/_doc/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
 
Query:
GET localhost:9200/nested_test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

 {
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "nested_test",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "group": "fans",
                    "user": [
                        {
                            "first": "John",
                            "last": "Smith"
                        },
                        {
                            "first": "Alice",
                            "last": "White"
                        }
                    ]
                }
            }
        ]
    }
}
// The query asks for Alice Smith, which no single user object satisfies, yet the document is returned.

2.2.2 The nested object type

Create the index:
PUT localhost:9200/nested_test2
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}
Index a document:
PUT localhost:9200/nested_test2/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
Query:
GET localhost:9200/nested_test2/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}
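// This query returns no hits, because each nested object is indexed as a separate hidden document, and none of them is named Alice Smith.
// The next query asks for Alice White instead: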
{
    "query": {
        "nested": {
            "path": "user",
            "query": {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "user.first": "Alice"
                            }
                        },
                        {
                            "match": {
                                "user.last": "White"
                            }
                        }
                    ]
                }
            }
        }
    }
}

This query matches the document and returns the following:

 {
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.3862942,
        "hits": [
            {
                "_index": "nested_test2",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.3862942,
                "_source": {
                    "group": "fans",
                    "user": [
                        {
                            "first": "John",
                            "last": "Smith"
                        },
                        {
                            "first": "Alice",
                            "last": "White"
                        }
                    ]
                }
            }
        ]
    }
}

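The unexpected match in the plain object case happens because the JSON document is flattened into key-value pairs, which loses the association between values that belonged to the same inner object. For a blog post with an array of comments, for example, the flattened form looks like this: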

{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ],
  "comments.name":    [ alice, john, smith, white ],
  "comments.comment": [ article, great, like, more, please, this ],
  "comments.age":     [ 28, 31 ],
  "comments.stars":   [ 4, 5 ],
  "comments.date":    [ 2014-09-01, 2014-10-22 ]
}

To query nested objects, you must use a nested query and set its path parameter.
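
The difference between the two mappings can be reproduced with a small Python model. It is a simplified sketch, not how Lucene stores data: `flatten_object_array`, `match_flattened`, and `match_nested` are hypothetical helpers, and lowercasing stands in for text analysis.

```python
def flatten_object_array(doc, field):
    """Model a plain object array: one multi-valued entry per leaf field."""
    flat = {}
    for obj in doc[field]:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value.lower())
    return flat


def match_flattened(doc, field, conditions):
    """Each condition is checked against the merged values, so pairings are lost."""
    flat = flatten_object_array(doc, field)
    return all(
        value.lower() in flat.get(f"{field}.{key}", [])
        for key, value in conditions.items()
    )


def match_nested(doc, field, conditions):
    """Nested model: all conditions must hold inside the same inner object."""
    return any(
        all(obj.get(key, "").lower() == value.lower()
            for key, value in conditions.items())
        for obj in doc[field]
    )


doc = {"group": "fans", "user": [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]}
alice_smith = {"first": "Alice", "last": "Smith"}
alice_white = {"first": "Alice", "last": "White"}

print(flatten_object_array(doc, "user"))
print(match_flattened(doc, "user", alice_smith))  # plain object behaviour
print(match_nested(doc, "user", alice_smith))     # nested behaviour
print(match_nested(doc, "user", alice_white))
```

The flattened model matches Alice Smith because "alice" and "smith" both appear somewhere in the document, while the nested model requires them to appear in the same user object.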

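2.2.3 Sorting by nested fields

Scenario: find blog posts that received comments in October, sorted in ascending order by the lowest stars value each post received. The query looks like this: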

SELECT article
FROM   t
WHERE  comments.date between '2014-10-01' and '2014-11-01'  order by stars

GET /_search
{
  "query": {
    "nested": { 
      "path": "comments",
      "query": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  },
  "sort": {
    "comments.stars": { 
      "order": "asc",   
      "mode":  "min",   
      "nested_path": "comments", 
      "nested_filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  }
}
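The nested query here restricts the results to blog posts that received a comment in October.
The results are sorted in ascending (asc) order by the minimum (min) value of the comments.stars field among the matching comments.

Why repeat the query conditions in nested_path and nested_filter? Because sorting happens after the query has run. The query selects blog posts that received a comment in October, but what it returns is the blog posts themselves. Without the nested_filter in the sort clause, the posts would be sorted on all of their comments rather than only on those received in October. Newer Elasticsearch versions express the same thing with a nested object holding path and filter keys inside the sort clause; nested_path and nested_filter are deprecated.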
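
The effect of the filter in the sort clause is easy to see in a small simulation. The posts below are hypothetical, and `in_october` and `sort_key` are illustrative helpers that mimic the range filter and the min sort mode.

```python
from datetime import date


def in_october(comment):
    """The range filter: 2014-10-01 <= comments.date < 2014-11-01."""
    return date(2014, 10, 1) <= comment["date"] < date(2014, 11, 1)


def sort_key(post, use_nested_filter):
    """Minimum stars over the comments considered for sorting (mode: min)."""
    comments = post["comments"]
    if use_nested_filter:
        comments = [c for c in comments if in_october(c)]
    return min(c["stars"] for c in comments)


# Hypothetical posts; both received at least one comment in October 2014.
posts = [
    {"title": "A", "comments": [
        {"date": date(2014, 9, 1), "stars": 1},
        {"date": date(2014, 10, 22), "stars": 5},
    ]},
    {"title": "B", "comments": [
        {"date": date(2014, 10, 5), "stars": 3},
    ]},
]

# The query part: keep posts with at least one October comment.
hits = [p for p in posts if any(in_october(c) for c in p["comments"])]

with_filter = [p["title"] for p in sorted(hits, key=lambda p: sort_key(p, True))]
without_filter = [p["title"] for p in sorted(hits, key=lambda p: sort_key(p, False))]
print(with_filter, without_filter)
```

With the filter, post B (3 stars in October) sorts before post A (5 stars in October). Without it, the 1-star September comment pulls post A to the front even though that comment is outside the requested month.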

2.2.4 Nested aggregations

SELECT min(price)
FROM   products
WHERE  name = "led tv"

GET /products/_search
{
  "query": {
    "match": { "name": "led tv" }
  },
  "aggs": {
    "resellers": {
      "nested": {
        "path": "resellers"
      },
      "aggs": {
        "min_price": { "min": { "field": "resellers.price" } }
      }
    }
  }
}
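// The nested aggregation "steps into" the nested resellers objects.
// The min aggregation then computes the lowest price within that bucket.

Response: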

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "products",
                "_type": "_doc",
                "_id": "0",
                "_score": 0.5753642,
                "_source": {
                    "name": "LED TV",
                    "resellers": [
                        {
                            "reseller": "companyA",
                            "price": 350
                        },
                        {
                            "reseller": "companyB",
                            "price": 500
                        }
                    ]
                }
            }
        ]
    },
    "aggregations": {
        "resellers": {
            "doc_count": 2,
            "min_price": {
                "value": 350.0
            }
        }
    }
}
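
The aggregations part of this response can be reproduced with a short Python model. `nested_min` is a hypothetical helper that collects every nested object under the given path from the matching hits and then takes the minimum of one field.

```python
def nested_min(hits, path, field):
    """Model of a nested aggregation followed by a min metric."""
    nested_docs = [obj for hit in hits for obj in hit.get(path, [])]
    values = [obj[field] for obj in nested_docs if field in obj]
    return {
        "doc_count": len(nested_docs),  # number of nested objects visited
        "min_price": {"value": float(min(values)) if values else None},
    }


# The single hit returned by the match query on "led tv".
hits = [{"name": "LED TV", "resellers": [
    {"reseller": "companyA", "price": 350},
    {"reseller": "companyB", "price": 500},
]}]

result = nested_min(hits, "resellers", "price")
print(result)
```

Note that doc_count is 2 even though only one product matched: it counts the nested reseller objects, not the parent documents.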

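3. Aggregations

ES aggregations come in four families. Bucket, Metric, and Pipeline are the ones used most often (the fourth is Matrix); mastering these three covers most day-to-day aggregation and analysis needs.

The aggs syntax:

A simple example for learning the aggs syntax: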

{
  "size": 0,
  "aggs": {
    "first_agg_name": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "sub_agg_name1": {
          "avg": {
            "field": "price"
          }
        },
        "sub_agg_name2": {
          "terms": {
            "field": "make"
          }
        }
      }
    }
  }
}
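// size is set to 0 so that no search hits are returned, only the aggregation results, which speeds up the search.
// Aggregations go under the top-level aggs parameter (the long form, aggregations, works as well).
// Each aggregation gets a name of our choosing, here first_agg_name. The request computes the average price of cars of each color, plus the number of cars per make within each color.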
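
What this request computes can be sketched in plain Python. The car records are made up for illustration, `color_buckets` is a hypothetical helper, and the inner terms aggregation is simplified to a dict of counts rather than a list of buckets.

```python
from collections import Counter, defaultdict

# Hypothetical car sales.
cars = [
    {"color": "red", "make": "honda", "price": 10000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "red", "make": "bmw", "price": 30000},
    {"color": "blue", "make": "ford", "price": 15000},
]


def color_buckets(docs):
    """terms on color, with an avg(price) and a terms(make) sub-aggregation."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["color"]].append(doc)

    buckets = []
    for color, members in groups.items():
        buckets.append({
            "key": color,
            "doc_count": len(members),
            "sub_agg_name1": {"value": sum(d["price"] for d in members) / len(members)},
            "sub_agg_name2": dict(Counter(d["make"] for d in members)),
        })
    # terms buckets are ordered by doc_count descending by default.
    return sorted(buckets, key=lambda b: b["doc_count"], reverse=True)


buckets = color_buckets(cars)
for bucket in buckets:
    print(bucket)
```

Each sub-aggregation runs once per parent bucket, which is why the average and the per-make counts are reported separately for every color.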

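3.1 Bucket Aggregations

Bucket aggregations assign documents to buckets according to some rule, so that the data can be analyzed by category.

Test data: the sample data set that ships with Kibana, under Home -> Add data -> Sample data -> Sample eCommerce orders.

3.1.1 Terms aggregation

Aggregations are executed in the context of search results, which means an aggregation is just another top-level parameter of a search request (for example, one sent to the /_search endpoint). Aggregations can be paired with queries.

select color, count(*) from cars group by color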

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : { 
        "popular_colors" : { 
            "terms" : { 
              "field" : "color"
            }
        }
    }
}
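size is set to 0 so that no search hits are returned, only the aggregation results, which speeds up the search.
Aggregations go under the top-level aggs parameter (the long form, aggregations, works as well).
The aggregation is given a name of our choosing, here popular_colors. It aggregates on color, much like group by color.

Response: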
{
    "aggregations": {
        "popular_colors": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4
                },
                {
                    "key": "blue",
                    "doc_count": 2
                },
                {
                    "key": "green",
                    "doc_count": 2
                }
            ]
        }
    }
}
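sum_other_doc_count is the total document count of the buckets that were left out of the response.
Adding another aggregation level makes it possible to nest an avg metric inside the terms buckets, as in the syntax example at the start of section 3, which effectively produces an average price for each color.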
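
The relationship between the returned buckets and sum_other_doc_count can be sketched as follows. `terms_agg` is a hypothetical single-shard model: it ignores doc_count_error_upper_bound and the merging of per-shard results, and its `size` argument mirrors the size parameter of the terms aggregation (10 by default).

```python
from collections import Counter


def terms_agg(values, size=10):
    """Return the top `size` terms plus the count of documents left out."""
    counts = Counter(values)
    # most_common sorts by count descending, like the default terms ordering.
    top = counts.most_common(size)
    buckets = [{"key": key, "doc_count": n} for key, n in top]
    other = sum(counts.values()) - sum(n for _, n in top)
    return {"sum_other_doc_count": other, "buckets": buckets}


colors = ["red"] * 4 + ["blue"] * 2 + ["green"] * 2

full = terms_agg(colors)
top1 = terms_agg(colors, size=1)
print(full)
print(top1)
```

With the default size all three colors fit, so nothing is left over. With size 1 only red is returned and the four blue and green documents are reported in sum_other_doc_count.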

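3.1.2 Rare Terms aggregation

A terms aggregation sorts its buckets by doc_count in descending order by default, but sometimes we want them in ascending order instead. That is the job of rare_terms. (The reason for not simply reversing the sort order of a terms aggregation is accuracy: ordering terms by ascending doc_count can produce unbounded errors when per-shard results are merged.)

select count(*) cnt from cars group by color order by cnt asc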

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : { 
        "popular_colors" : { 
            "rare_terms" : { 
              "field" : "color",
              "max_doc_count": 10,
              "include": ["blue", "green", "red"]
            }
        }
    }
}

{
    "aggregations": {
        "popular_colors": {
            "buckets": [
                {
                    "key": "blue",
                    "doc_count": 2
                },
                {
                    "key": "green",
                    "doc_count": 2
                },
                {
                    "key": "red",
                    "doc_count": 4
                }
            ]
        }
    }
}
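
The role of max_doc_count can be illustrated with a small exact model. This is only a sketch: `rare_terms` here is a hypothetical Python function that counts exactly on a single list, whereas the real aggregation works across shards and is approximate. The default of 1 for max_doc_count matches the aggregation's documented default.

```python
from collections import Counter


def rare_terms(values, max_doc_count=1):
    """Keep terms that occur at most max_doc_count times, rarest first."""
    counts = Counter(values)
    rare = [(key, n) for key, n in counts.items() if n <= max_doc_count]
    rare.sort(key=lambda item: item[1])  # ascending doc_count
    return [{"key": key, "doc_count": n} for key, n in rare]


colors = ["red"] * 4 + ["blue"] * 2 + ["green"] * 2

print(rare_terms(colors, max_doc_count=10))
print(rare_terms(colors, max_doc_count=2))
print(rare_terms(colors))
```

With max_doc_count set to 10 every color qualifies and the buckets come back in ascending order of doc_count. Lowering it to 2 drops red, and with the default of 1 no color is rare enough to be returned.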

3.1.3 Histogram Aggregation

A multi-bucket aggregation that groups the values of a numeric field into buckets of a fixed interval.

 GET /kibana_sample_data_ecommerce/_search?size=0
{
  "aggs": {
    "base_price": {
      "histogram": {
        "field": "products.base_price",
        "interval": 5
      }
    }
  }
}
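
Each value is assigned to a bucket by rounding it down to a multiple of the interval: bucket_key = floor((value - offset) / interval) * interval + offset. The Python sketch below applies that formula to a few made-up prices. `histogram` is a hypothetical helper, and unlike the real aggregation it does not emit empty buckets for gaps in the data.

```python
import math
from collections import Counter


def histogram(values, interval, offset=0.0):
    """Bucket numeric values by fixed interval and count them per bucket key."""
    keys = (
        math.floor((value - offset) / interval) * interval + offset
        for value in values
    )
    counts = Counter(keys)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


# Hypothetical base prices.
prices = [3.5, 7.0, 9.99, 12.0, 24.99]
buckets = histogram(prices, interval=5)
for bucket in buckets:
    print(bucket)
```

With an interval of 5, both 7.0 and 9.99 fall into the bucket whose key is 5.0, and 24.99 falls into the bucket whose key is 20.0.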
