7. Elasticsearch Series: A Deep Dive into Aggregations

How aggregations work under the hood, and where their accuracy limits come from

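1. Metric Aggregation
  • Single-value analysis: returns exactly one result
    • min, max, avg, sum
    • cardinality (similar to a distinct count)
  • Multi-value analysis: returns several results
    • stats, extended stats
    • percentile, percentile rank
    • top hits (the top-ranked sample documents)
# Count the distinct values of the type field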
POST kibana_sample_data_ecommerce/_search
{
  "size": 0, 
  "aggs": {
    "type": {
      "cardinality": {
        "field": "type"
      }
    }
  }
}
# Median ticket price (50th percentile)
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "AvgTicketPrice": {
      "percentiles": {
        "field": "AvgTicketPrice",
        "percents": [
          50
        ]
      }
    }
  }
}
# Aggregate price inside the nested field detail_info
GET poi/_search
{
  "size": 0,
  "aggs": {
    "detail_info": {
      "nested": {
        "path": "detail_info"
      },
      "aggs": {
        "price": {
          "stats": {
            "field": "detail_info.price"
          }
        }
      }
    }
  }
}
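The metric types listed in this section can be illustrated locally. The following is a minimal pure-Python sketch, not Elasticsearch code: the sample prices and the helper names are made up for illustration. It computes exact values, whereas Elasticsearch computes cardinality and percentiles with approximate algorithms (HyperLogLog++ and, by default, TDigest), so results on large data sets can differ slightly.

```python
import statistics

# Hypothetical sample values standing in for a numeric field such as AvgTicketPrice.
prices = [120.0, 80.0, 200.0, 80.0, 150.0]


def single_value_metrics(values):
    """Each entry mirrors one single-value metric aggregation."""
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
        # Exact distinct count; the cardinality aggregation approximates this.
        "cardinality": len(set(values)),
    }


def stats(values):
    """Mirrors the multi-value stats aggregation: several numbers in one result."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def median(values):
    """Exact 50th percentile; the percentiles aggregation estimates this."""
    return statistics.median(values)


print(single_value_metrics(prices))
print(stats(prices))
print(median(prices))
```

A single-value metric maps to one number per aggregation name, while stats bundles several numbers under one name, which is the only structural difference between the two groups.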
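2. Bucket Aggregation
  • Documents are assigned to buckets according to a rule. Elasticsearch provides the common bucket aggregations
    • Terms
    • Numeric types
      • Range / Date Range
      • Histogram / Date Histogram
  • Nesting is supported: a bucket can be split into further buckets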
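The bucketing rules above can be sketched in plain Python. This is an illustrative model only: the documents, field names, and helper functions are hypothetical, and unlike a real histogram aggregation the sketch does not emit empty buckets for gaps.

```python
from collections import Counter, defaultdict

# Hypothetical documents; the field names are illustrative only.
docs = [
    {"job": "Dev", "age": 25},
    {"job": "Dev", "age": 41},
    {"job": "QA", "age": 33},
    {"job": "Dev", "age": 38},
    {"job": "QA", "age": 52},
]


def terms_buckets(docs, field):
    """One bucket per distinct value, ordered by document count (descending)."""
    counts = Counter(d[field] for d in docs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def histogram_buckets(docs, field, interval):
    """Fixed-width numeric buckets keyed by the lower bound of each interval."""
    counts = Counter((d[field] // interval) * interval for d in docs)
    return sorted(counts.items())


def nested_buckets(docs, outer_field, inner_field, interval):
    """Bucket by term first, then split every term bucket by histogram."""
    groups = defaultdict(list)
    for d in docs:
        groups[d[outer_field]].append(d)
    return {
        key: histogram_buckets(group, inner_field, interval)
        for key, group in groups.items()
    }


print(terms_buckets(docs, "job"))
print(histogram_buckets(docs, "age", 10))
print(nested_buckets(docs, "job", "age", 10))
```

Nesting is simply the same bucketing step applied again to the documents of each parent bucket, which is why sub-aggregations can be stacked to any depth.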

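A more complex aggregation example: computing the moving median price of a city over a time range

# `es.elastic_client` is the author's own wrapper; it is assumed here to expose an
# elasticsearch.Elasticsearch client that is configured elsewhere.
def get_city_median(city, start_time, end_time):
    return es.elastic_client.search(body={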
            "query": {
                "bool": {
                    "must": [
                        {
                            "term": {
                                "city": city
                            }
                        }
                    ],
                    "filter": [
                        {
                            "range": {
                                "publish_date": {
                                    "gt": start_time,
                                    "lte": end_time
                                }
                            }
                        }
                    ]
                }
            },
            "size": 0,
            "aggs": {
                "group_by_city": {
                    "terms": {
                        "field": "city"
                    },
                    "aggs": {
                        "group_by_date": {
                            "date_histogram": {
                                "field": "publish_date",
                                "calendar_interval": "month",
                                "format": "yyyy-MM"
                            },
                            "aggs": {
                                "avg_price_percentile": {
                                    "percentiles": {
                                        "field": "avg_price",
                                        "percents": [50]
                                    }
                                },
                                # Pipeline aggregation (see section 3 below): moving median over the monthly buckets
                                "the_movperc": {
                                    "moving_percentiles": {
                                        "buckets_path": "avg_price_percentile",
                                        "window": 3
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }, index='xxxx')
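3. Pipeline Aggregation
  • The pipeline concept: take the results of other aggregations as input and aggregate them again
  • A pipeline's result is written into the existing output; depending on where it lands, there are two kinds
    • Sibling: the result sits at the same level as the aggregation it reads from
      • Max, Min, Avg, Sum Bucket
      • Stats, Extended Stats Bucket
      • Percentiles Bucket
    • Parent: the result is embedded inside the buckets of the existing aggregation
      • Derivative
      • Cumulative Sum
      • Moving Function (sliding window)
# Find the destination with the lowest average ticket price (among the 10 terms buckets returned)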
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "Dest",
        "size": 10
      },
      "aggs": {
        "price": {
          "avg": {
            "field": "AvgTicketPrice"
          }
        }
      }
    },
    "min_dest_price": {
      "min_bucket": {
        "buckets_path": "dest>price"
      }
    }
  }
}
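The sibling/parent distinction in this section comes down to where the pipeline result lands in the output. A minimal Python sketch, assuming a hypothetical list of buckets that already carry an averaged "price" value (the data and function names are illustrative, not an Elasticsearch API):

```python
from itertools import accumulate

# Hypothetical buckets, e.g. from a date_histogram with an avg sub-aggregation "price".
buckets = [
    {"key": "2022-01", "price": 100.0},
    {"key": "2022-02", "price": 130.0},
    {"key": "2022-03", "price": 90.0},
]


def min_bucket(buckets, metric):
    """Sibling pipeline: a single result placed next to the bucket list."""
    best = min(buckets, key=lambda b: b[metric])
    return {"value": best[metric], "keys": [best["key"]]}


def with_derivative(buckets, metric):
    """Parent pipeline: a value added inside each bucket (none for the first one)."""
    out = []
    prev = None
    for b in buckets:
        enriched = dict(b)
        if prev is not None:
            enriched["derivative"] = b[metric] - prev
        prev = b[metric]
        out.append(enriched)
    return out


def with_cumulative_sum(buckets, metric):
    """Parent pipeline: running total added inside each bucket."""
    totals = accumulate(b[metric] for b in buckets)
    return [dict(b, cumulative_sum=t) for b, t in zip(buckets, totals)]


print(min_bucket(buckets, "price"))
print(with_derivative(buckets, "price"))
print(with_cumulative_sum(buckets, "price"))
```

The sibling function returns one object for the whole series, while the parent functions return the same number of buckets they received, each enriched with an extra field.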
4. Aggregation Scope
# filter
POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter": { // this filter applies only inside the older_person aggregation
        "range": {
          "age": {
            "gte": 35
          }
        }
      },
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword"
          }
        }
      }
    },
    "all_jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    }
  }
}
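# post_filter
POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": { // filters the returned hits only; the jobs aggregation above is still computed over all matched documents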
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}
# global
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    },
    "all":{
      "global":{}, // ignores the query above, so the average covers employees of all ages
      "aggs":{
        "salary_avg":{
          "avg":{
            "field":"salary"
          }
        }
      }
    }
  }
}
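The three scoping mechanisms shown in this section (filter aggregation, post_filter, global) can be modeled in a few lines of Python. This is a simplified sketch with hypothetical employee records and a made-up `search` helper; it only illustrates which documents each part sees.

```python
from collections import Counter

# Hypothetical contents of the employees index.
employees = [
    {"name": "A", "age": 30, "job": "Dev", "salary": 100},
    {"name": "B", "age": 45, "job": "Dev Manager", "salary": 300},
    {"name": "C", "age": 50, "job": "Dev", "salary": 200},
    {"name": "D", "age": 28, "job": "QA", "salary": 80},
]


def job_counts(docs):
    return dict(Counter(d["job"] for d in docs))


def search(docs, query=None, agg_filter=None, post_filter=None):
    """Return hits plus three aggregations, each with a different scope."""
    matched = [d for d in docs if query is None or query(d)]
    aggs = {
        # Default scope: the documents matched by the query.
        "jobs": job_counts(matched),
        # filter aggregation: narrows the scope for this aggregation only.
        "filtered_jobs": job_counts(
            [d for d in matched if agg_filter is None or agg_filter(d)]
        ),
        # global aggregation: ignores the query and sees every document.
        "all_salary_avg": sum(d["salary"] for d in docs) / len(docs),
    }
    # post_filter: applied to the hits after the aggregations were computed.
    hits = [d for d in matched if post_filter is None or post_filter(d)]
    return hits, aggs


hits, aggs = search(
    employees,
    query=lambda d: d["age"] >= 40,
    agg_filter=lambda d: d["age"] >= 50,
    post_filter=lambda d: d["job"] == "Dev Manager",
)
print(hits)
print(aggs)
```

In this run the hits contain only the Dev Manager, yet the jobs aggregation still counts both employees matched by the query, and the global average is taken over all four records.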

5. How Aggregations Work and Their Accuracy Issues

Approximate statistics in a distributed system

http://shenjianblog.oss-cn-shanghai.aliyuncs.com/pic/20220906/680ffd414e1141b6968e85e38a2b7cf2-ES4.png

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "Dest",
        "size": 10
      }
    }
  }
}
// Response (excerpt)
"aggregations" : {
    "dest" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 8898
    }
}

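The Terms Aggregation response contains two special values:

  • doc_count_error_upper_bound: an upper bound on the document count of a term that was left out of the result, i.e. the worst-case error
  • sum_other_doc_count: the total document count of all terms that are not in the returned buckets (total documents minus the documents in the returned buckets)

When doc_count_error_upper_bound is greater than 0, the result may be inaccurate.

  • Cause: the data is spread across several shards, each shard returns only its own top terms, and the Coordinating Node never sees the full picture
  • Solution 1: when the data volume is small, set the number of Primary Shards to 1 so the counts are exact
  • Solution 2: on distributed data, set the shard_size parameter to improve accuracy
    • How it works: each shard returns more candidate terms than size asks for, which improves accuracy
    • Increasing shard_size lowers doc_count_error_upper_bound and so raises accuracy
      • The price is more overall computation, so accuracy goes up but response time gets longer
    • The default is shard_size = size * 1.5 + 10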
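The effect described above can be reproduced with a small simulation. This is a simplified model of the scatter/gather step, not Elasticsearch's implementation: the three shards and their term counts are invented, and the error bound is modeled as the sum of the count of the last term each shard returned whenever that shard had to cut terms off.

```python
from collections import Counter


def default_shard_size(size):
    """Default stated in the text: shard_size = size * 1.5 + 10."""
    return int(size * 1.5 + 10)


def top_terms(counts, n):
    """Top n terms by count; ties are broken by term name for determinism."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def terms_agg(shards, size, shard_size):
    """Each shard returns its top shard_size terms; the coordinator merges them."""
    merged = Counter()
    error_bound = 0
    total_docs = 0
    for shard in shards:
        total_docs += sum(shard.values())
        top = top_terms(shard, shard_size)
        merged.update(dict(top))
        # A shard that cut terms off may hide up to the count of its last returned term.
        if len(shard) > shard_size:
            error_bound += top[-1][1]
    buckets = top_terms(merged, size)
    return {
        "buckets": buckets,
        "doc_count_error_upper_bound": error_bound,
        "sum_other_doc_count": total_docs - sum(c for _, c in buckets),
    }


# Hypothetical per-shard term counts. The true totals are A=15, B=12, C=27, D=10,
# so C is the real top term although it is never first on any single shard.
shards = [
    {"A": 12, "C": 9, "B": 1},
    {"B": 11, "C": 9, "A": 1},
    {"D": 10, "C": 9, "A": 2},
]

for shard_size in (1, 2, 3):
    print(shard_size, terms_agg(shards, size=1, shard_size=shard_size))
print(default_shard_size(10))
```

With shard_size 1 the merged result reports A with 12 documents, which is the wrong term and the wrong count. With shard_size 2 the correct term C appears but the bound is still above zero, and with shard_size 3 every shard returns all of its terms, so the bound drops to 0.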
DELETE my_flights
PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings" : {
      "properties" : {
        "AvgTicketPrice" : {
          "type" : "float"
        },
        "Cancelled" : {
          "type" : "boolean"
        },
        "Carrier" : {
          "type" : "keyword"
        },
        "Dest" : {
          "type" : "keyword"
        },
        "DestAirportID" : {
          "type" : "keyword"
        },
        "DestCityName" : {
          "type" : "keyword"
        },
        "DestCountry" : {
          "type" : "keyword"
        },
        "DestLocation" : {
          "type" : "geo_point"
        },
        "DestRegion" : {
          "type" : "keyword"
        },
        "DestWeather" : {
          "type" : "keyword"
        },
        "DistanceKilometers" : {
          "type" : "float"
        },
        "DistanceMiles" : {
          "type" : "float"
        },
        "FlightDelay" : {
          "type" : "boolean"
        },
        "FlightDelayMin" : {
          "type" : "integer"
        },
        "FlightDelayType" : {
          "type" : "keyword"
        },
        "FlightNum" : {
          "type" : "keyword"
        },
        "FlightTimeHour" : {
          "type" : "keyword"
        },
        "FlightTimeMin" : {
          "type" : "float"
        },
        "Origin" : {
          "type" : "keyword"
        },
        "OriginAirportID" : {
          "type" : "keyword"
        },
        "OriginCityName" : {
          "type" : "keyword"
        },
        "OriginCountry" : {
          "type" : "keyword"
        },
        "OriginLocation" : {
          "type" : "geo_point"
        },
        "OriginRegion" : {
          "type" : "keyword"
        },
        "OriginWeather" : {
          "type" : "keyword"
        },
        "dayOfWeek" : {
          "type" : "integer"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    }
}

POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "my_flights"
  }
}

GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field": "OriginWeather",
        "size": 1,
        "shard_size": 10 // with shard_size 5 the returned doc_count_error_upper_bound is greater than 0; with 10 it is 0
      }
    }
  }
}

You are welcome to follow the official account 算法小生 (Shen Jian's technical blog).
