An Introduction to Elasticsearch Aggregations: Bucket and Metric

程大帅气

Last modified 2022-02-14 11:21:06

First published 2022-02-13 17:21:40

Article link: https://blog.csdn.net/weixin_44692700/article/details/122910382

I. Aggregation
II. Types of Aggregation
III. Bucket and Metric

I. Aggregation

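Besides search, Elasticsearch (ES) also provides statistical analysis over the data it stores.
- Results are available in near real time.
- Statistical analysis on Hadoop, by contrast, typically has T+1 latency (results arrive the next day).
Aggregations give an overview of the data: they analyze and summarize the whole data set rather than locating an individual document.
Performance is high: a single request returns the analysis result from Elasticsearch, with no need to implement the analysis logic on the client.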

The visualizations and reports in Kibana are all built on the ES aggregation feature.

II. Types of Aggregation

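Bucket Aggregation: a collection of documents that meet specific criteria
Metric Aggregation: mathematical computations that produce statistics over a field
Pipeline Aggregation: a second round of aggregation over the results of other aggregations
Matrix Aggregation: operates on multiple fields and produces a result matrix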

III. Bucket and Metric

For example, consider the following SQL statement in MySQL:
select count(1) from people group by sex;
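count(1) can be thought of as a Metric: a statistical computation applied to a group.
group by sex can be thought of as a Bucket: a group of documents that satisfy a condition.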
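
To make the analogy concrete, here is a minimal Python sketch of what the SQL statement computes: the grouping step plays the role of the bucket, and the count applied to each group plays the role of the metric. The rows and the helper name `count_by` are made up for illustration and are not part of Elasticsearch or MySQL.

```python
from collections import defaultdict

# Hypothetical rows standing in for the `people` table.
people = [
    {"name": "a", "sex": "0"},
    {"name": "b", "sex": "0"},
    {"name": "c", "sex": "1"},
]


def count_by(rows, field):
    """Group rows by `field` (the bucket), then count each group (the metric)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[field]].append(row)  # GROUP BY field
    return {key: len(docs) for key, docs in buckets.items()}  # COUNT(1)


print(count_by(people, "sex"))  # {'0': 2, '1': 1}
```

Keeping the two steps separate mirrors how Elasticsearch lets any metric be attached to any bucket.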

Bucket
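For example, products can be divided into high, medium, and low tiers. Each tier corresponds to a bucket, and the bucket holds the matching products.
The high-tier bucket can in turn be split into further buckets by other rules, such as rating, reviews, or price range.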

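Metric
A metric is a value computed from the data. Besides computing over a field, a metric can also be computed over values produced by a script.

Most metrics are mathematical computations that output a single value: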
min/max/sum/avg/cardinality

Some metrics output multiple values:
stats/percentiles/percentile_ranks
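
As a rough illustration of the difference, the following Python sketch computes the single-value metrics and a stats-style multi-value result over a plain list. The ages are made-up sample values and the function name `stats` is only a local helper. Note also that the cardinality metric in Elasticsearch is an approximate distinct count, whereas the set-based count here is exact.

```python
ages = [17, 18, 13, 18]  # made-up sample values

# Single-value metrics: each one reduces the values to one number.
single_value = {
    "min": min(ages),
    "max": max(ages),
    "sum": sum(ages),
    "avg": sum(ages) / len(ages),
    "cardinality": len(set(ages)),  # exact here; approximate in Elasticsearch
}


def stats(values):
    """Multi-value metric: several numbers returned together, like `stats`."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


print(single_value)
print(stats(ages))
```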

1. Bucket example

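First, prepare an index named people with three fields. The request below updates the mapping of an existing index, so the people index must be created first (for example with PUT people) if it does not exist yet.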

PUT people/_mapping
{
  "properties": {
    "name": {
      "type": "keyword"
    },
    "age":{
      "type":"long"
    },
    "sex":{
      "type":"keyword"
    }
  }
}

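Now run a bucket aggregation that groups the documents in the people index by sex and by age.
Each key directly under aggs is a user-defined name for the aggregation, and terms specifies the field to bucket on.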

GET people/_search
{
  "size":0,
  "aggs":{
    "性别":{
      "terms":{
        "field": "sex"
      }
    },
    "年龄":{
      "terms":{
        "field": "age"
      }
    }
  }
}

The result is as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "年龄" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 18,
          "doc_count" : 2
        },
        {
          "key" : 13,
          "doc_count" : 1
        },
        {
          "key" : 17,
          "doc_count" : 1
        }
      ]
    },
    "性别" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "0",
          "doc_count" : 2
        },
        {
          "key" : "1",
          "doc_count" : 2
        }
      ]
    }
  }
}

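From the result we can see:
Bucketing by age, there are 2 documents with age 18, and 1 each with ages 13 and 17.
Bucketing by sex, 0 and 1 each have 2 documents.
The buckets are reported under the aggregation names we defined (性别 for sex, 年龄 for age).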

2. Metric example

Next, within each sex bucket, we compute the average, maximum, and minimum age.

GET people/_search
{
  "size":0,
  "aggs":{
    "性别":{
      "terms":{
        "field": "sex"
      },
      "aggs":{
        "平均年龄":{
          "avg":{
            "field":"age"
          }
        },
        "最大年龄":{
          "max":{
            "field":"age"
          }
        },
        "最小":{
          "min":{
            "field":"age"
          }
        }
      }
    }
  }
}

The result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "性别" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "0",
          "doc_count" : 2,
          "最小" : {
            "value" : 17.0
          },
          "平均年龄" : {
            "value" : 17.5
          },
          "最大年龄" : {
            "value" : 18.0
          }
        },
        {
          "key" : "1",
          "doc_count" : 2,
          "最小" : {
            "value" : 13.0
          },
          "平均年龄" : {
            "value" : 15.5
          },
          "最大年龄" : {
            "value" : 18.0
          }
        }
      ]
    }
  }
}

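The result shows:
In the bucket for sex 0, the maximum age is 18, the average age is 17.5, and the minimum age is 17.
In the bucket for sex 1, the maximum age is 18, the average age is 15.5, and the minimum age is 13.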
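
The same computation can be sketched in plain Python to check the numbers. The article does not list the four indexed documents, so the data below is reconstructed from the response shown above (an assumption), and `metrics_per_bucket` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import defaultdict

# Ages per sex as implied by the response above; the actual documents
# are not listed in the article, so this data set is a reconstruction.
people = [
    {"sex": "0", "age": 17},
    {"sex": "0", "age": 18},
    {"sex": "1", "age": 13},
    {"sex": "1", "age": 18},
]


def metrics_per_bucket(docs, bucket_field, metric_field):
    """Bucket docs by one field, then compute avg/max/min of another field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {
            "avg": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
        }
        for key, values in buckets.items()
    }


result = metrics_per_bucket(people, "sex", "age")
print(result)
```

Each inner dictionary corresponds to the three metric sub-aggregations attached to one bucket in the response.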

3. Nesting

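Besides bucketing data and computing metrics, we can also split each bucket into further buckets.

For example, suppose we first bucket by sex, compute the average age in each bucket, and then get the age distribution within each bucket.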

GET people/_search
{
  "size":0,
  "aggs":{
    "性别":{
      "terms":{
        "field": "sex"
      },
      "aggs":{
        "平均年龄":{
          "avg":{
            "field":"age"
          }
        },
        "年龄分布":{
          "terms":{
            "field":"age"
          }
        }
      }
    }
  }
}

The result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "性别" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "0",
          "doc_count" : 2,
          "平均年龄" : {
            "value" : 17.5
          },
          "年龄分布" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 17,
                "doc_count" : 1
              },
              {
                "key" : 18,
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "1",
          "doc_count" : 2,
          "平均年龄" : {
            "value" : 15.5
          },
          "年龄分布" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 13,
                "doc_count" : 1
              },
              {
                "key" : 18,
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}

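From the result above, the documents were bucketed by sex, the average age was computed for each bucket, and the documents in each bucket were then bucketed again by age. This is the nested use of aggregations.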
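
In the request body, nesting appears as an aggs object placed next to the bucket aggregation it refines. The following Python sketch assembles the same request body from a small helper to make that structure explicit. `agg` is a hypothetical helper written for this example and is not part of any Elasticsearch client.

```python
import json


def agg(kind, field, sub_aggs=None):
    """Build one aggregation clause, optionally with nested sub-aggregations."""
    clause = {kind: {"field": field}}
    if sub_aggs:
        clause["aggs"] = sub_aggs  # nested aggregations sit beside the clause
    return clause


# Same structure as the nested request shown earlier in this section.
body = {
    "size": 0,
    "aggs": {
        "性别": agg("terms", "sex", {
            "平均年龄": agg("avg", "age"),
            "年龄分布": agg("terms", "age"),
        })
    },
}

print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because a sub-aggregation is itself an ordinary aggregation clause, the helper can be applied at any depth.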

4. stats

stats bundles several metric computations so they do not have to be requested one by one. Specifying stats returns count, min, max, avg, and sum.

GET people/_search
{
  "size":0,
  "aggs":{
    "性别":{
      "terms":{
        "field": "sex"
      },
      "aggs":{
        "年龄分布":{
          "terms":{
            "field":"age"
          }
        },
        "stats计算age":{
          "stats":{
            "field":"age"
          }
        }
      }
    }
  }
}

The result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "性别" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "0",
          "doc_count" : 2,
          "stats计算age" : {
            "count" : 2,
            "min" : 17.0,
            "max" : 18.0,
            "avg" : 17.5,
            "sum" : 35.0
          },
          "年龄分布" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 17,
                "doc_count" : 1
              },
              {
                "key" : 18,
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "1",
          "doc_count" : 2,
          "stats计算age" : {
            "count" : 2,
            "min" : 13.0,
            "max" : 18.0,
            "avg" : 15.5,
            "sum" : 31.0
          },
          "年龄分布" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 13,
                "doc_count" : 1
              },
              {
                "key" : 18,
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}