7. Elasticsearch Series: Aggregations in Depth

算法小生

Last modified 2024-06-23 11:49:36

Category: NOSQL. Tags: elasticsearch, big data, search engine

First published 2022-10-20 21:13:05

Article link: https://blog.csdn.net/SJshenjian/article/details/127435297

Understanding how aggregations work and where their accuracy breaks down

1. Metric Aggregation

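Single-value analysis: outputs a single result
- min max avg sum
- cardinality (similar to distinct count)
Multi-value analysis: outputs multiple results
- stats, extended stats
- percentile, percentile rank
- top hits (the top-ranked sample documents)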

# Aggregate over the type field and count its distinct values
POST kibana_sample_data_ecommerce/_search
{
  "size": 0, 
  "aggs": {
    "type": {
      "cardinality": {
        "field": "type"
      }
    }
  }
}
# Median ticket price
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "AvgTicketPrice": {
      "percentiles": {
        "field": "AvgTicketPrice",
        "percents": [
          50
        ]
      }
    }
  }
}
# Aggregate on price inside the nested field detail_info
GET poi/_search
{
  "size": 0,
  "aggs": {
    "detail_info": {
      "nested": {
        "path": "detail_info"
      },
      "aggs": {
        "price": {
          "stats": {
            "field": "detail_info.price"
          }
        }
      }
    }
  }
}
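
To make the single-value versus multi-value distinction concrete, here is a minimal in-memory sketch in Python. It does not call Elasticsearch; the function names and the `prices` list are made up for illustration. Note that Elasticsearch computes cardinality and percentiles approximately, while this sketch computes them exactly.

```python
import statistics


def single_value_metrics(values):
    """Each entry reduces the field to one number, like min/max/avg/sum/cardinality."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
        "cardinality": len(set(values)),  # exact here; Elasticsearch approximates it
    }


def stats(values):
    """One aggregation that returns several numbers at once, like `stats`."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


prices = [10, 20, 20, 30, 40]  # hypothetical field values
single = single_value_metrics(prices)
multi = stats(prices)
median = statistics.median(prices)  # the 50th percentile
```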

2. Bucket Aggregation

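Documents are assigned to different buckets according to a rule. ES provides the common Bucket Aggregations:
- Terms
- Numeric types
  - Range / Date Range
  - Histogram / Date Histogram
Nesting is supported: a bucket can be split further into sub-buckets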
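
The bucketing rules above can be sketched in plain Python. This is only an in-memory illustration of the idea, not how Elasticsearch implements it; the helper names and the sample documents are hypothetical.

```python
from collections import defaultdict


def terms_buckets(docs, field):
    """Group documents by the exact value of `field`, like a terms aggregation."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def histogram_buckets(docs, field, interval):
    """Group documents into fixed-width numeric buckets keyed by their lower bound."""
    buckets = defaultdict(list)
    for doc in docs:
        key = (doc[field] // interval) * interval
        buckets[key].append(doc)
    return dict(buckets)


docs = [  # hypothetical documents
    {"city": "A", "price": 120},
    {"city": "A", "price": 180},
    {"city": "A", "price": 230},
    {"city": "B", "price": 90},
]

# Nesting: split each city bucket again by price range and count the documents.
nested = {
    city: {
        low: len(group)
        for low, group in histogram_buckets(city_docs, "price", 100).items()
    }
    for city, city_docs in terms_buckets(docs, "city").items()
}
```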

A more complex aggregation example: get the monthly moving median of avg_price for a city within a time range

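def get_city_median(city, start_time, end_time):
    # Assumption: `es` is the author's own module and `es.elastic_client`
    # is an elasticsearch-py client instance.
    return es.elastic_client.search(body={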
            "query": {
                "bool": {
                    "must": [
                        {
                            "term": {
                                "city": city
                            }
                        }
                    ],
                    "filter": [
                        {
                            "range": {
                                "publish_date": {
                                    "gt": start_time,
                                    "lte": end_time
                                }
                            }
                        }
                    ]
                }
            },
            "size": 0,
            "aggs": {
                "group_by_city": {
                    "terms": {
                        "field": "city"
                    },
                    "aggs": {
                        "group_by_date": {
                            "date_histogram": {
                                "field": "publish_date",
                                "calendar_interval": "month",
                                "format": "yyyy-MM"
                            },
                            "aggs": {
                                "avg_price_percentile": {
                                    "percentiles": {
                                        "field": "avg_price",
                                        "percents": [50]
                                    }
                                },
                                "the_movperc": {  # uses the pipeline concept described below
                                    "moving_percentiles": {
                                        "buckets_path": "avg_price_percentile",
                                        "window": 3
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }, index='xxxx')
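
The response of the query above is deeply nested, so a small helper for reading it may be useful. The sketch below assumes the percentile values come back keyed as `"50.0"` (the default keyed format) and that some buckets may carry no `the_movperc` entry; the helper name and the sample response are made up for illustration.

```python
def extract_moving_medians(response):
    """Return {city: [(month, moving_median_or_None), ...]} from the search response."""
    result = {}
    for city_bucket in response["aggregations"]["group_by_city"]["buckets"]:
        series = []
        for month_bucket in city_bucket["group_by_date"]["buckets"]:
            # A bucket may lack the pipeline value, for example at the start of the series.
            values = month_bucket.get("the_movperc", {}).get("values", {})
            series.append((month_bucket["key_as_string"], values.get("50.0")))
        result[city_bucket["key"]] = series
    return result


# Hand-written sample in the assumed response shape, not real data.
sample_response = {
    "aggregations": {
        "group_by_city": {
            "buckets": [
                {
                    "key": "shanghai",
                    "group_by_date": {
                        "buckets": [
                            {
                                "key_as_string": "2022-01",
                                "avg_price_percentile": {"values": {"50.0": 100.0}},
                            },
                            {
                                "key_as_string": "2022-02",
                                "avg_price_percentile": {"values": {"50.0": 120.0}},
                                "the_movperc": {"values": {"50.0": 100.0}},
                            },
                        ]
                    },
                }
            ]
        }
    }
}

moving_medians = extract_moving_medians(sample_response)
```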

3. Pipeline Aggregation

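Pipeline concept: take the results of other aggregations as input and aggregate them again
The result of a pipeline is written into the original response; depending on where it appears, there are two kinds
- Sibling: the result sits at the same level as the existing aggregation result
  - Max, Min, Avg, Sum Bucket
  - Stats, Extended Stats Bucket
  - Percentiles Bucket
- Parent: the result is embedded inside the existing aggregation result
  - Derivative
  - Cumulative Sum
  - Moving Function (sliding window)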

# Find the destination with the lowest average ticket price
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "Dest",
        "size": 10
      },
      "aggs": {
        "price": {
          "avg": {
            "field": "AvgTicketPrice"
          }
        }
      }
    },
    "min_dest_price": {
      "min_bucket": {
        "buckets_path": "dest>price"
      }
    }
  }
}
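
The difference between sibling and parent pipelines is mostly about where the result lands. The following in-memory sketch illustrates it on a list of already-computed buckets; the bucket data and function names are hypothetical, and it only mimics the output placement, not the Elasticsearch implementation.

```python
from itertools import accumulate

buckets = [  # hypothetical buckets produced by an earlier aggregation
    {"key": "2024-01", "sales": 100},
    {"key": "2024-02", "sales": 150},
    {"key": "2024-03", "sales": 120},
]


def min_bucket(buckets, metric):
    """Sibling pipeline: one result placed next to the bucket list."""
    value = min(b[metric] for b in buckets)
    return {"value": value, "keys": [b["key"] for b in buckets if b[metric] == value]}


def with_parent_pipelines(buckets, metric):
    """Parent pipelines: a derivative and a cumulative sum added inside each bucket."""
    totals = list(accumulate(b[metric] for b in buckets))
    enriched = []
    for i, bucket in enumerate(buckets):
        row = dict(bucket, cumulative_sum=totals[i])
        if i > 0:  # the first bucket has no previous bucket to compare against
            row["derivative"] = bucket[metric] - buckets[i - 1][metric]
        enriched.append(row)
    return enriched


sibling = min_bucket(buckets, "sales")
enriched = with_parent_pipelines(buckets, "sales")
```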

4. Aggregation Scope

#Filter
POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter":{ // the filter applies only to this aggregation
        "range":{
          "age":{
            "gte":35
          }
        }
      },
      "aggs":{
        "jobs":{
          "terms":{
            "field":"job.keyword"
          }
        }
      }
    },
    "all_jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    }
  }
}

POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": { // filters the hits after the aggregations run; the aggregation result is not affected
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}

#global
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    },
    "all":{
      "global":{}, // ignores the query's filter and aggregates over documents of all ages
      "aggs":{
        "salary_avg":{
          "avg":{
            "field":"salary"
          }
        }
      }
    }
  }
}
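
The four scopes shown in this section (query, filter aggregation, post_filter, global) can be compared side by side with an in-memory sketch. The employee records are invented for illustration; the point is only which documents each aggregation gets to see.

```python
from collections import Counter

employees = [  # hypothetical documents
    {"name": "Ann", "age": 45, "job": "Dev Manager", "salary": 50000},
    {"name": "Bob", "age": 30, "job": "Java Programmer", "salary": 20000},
    {"name": "Cid", "age": 41, "job": "Java Programmer", "salary": 30000},
    {"name": "Dee", "age": 25, "job": "QA", "salary": 10000},
]


def job_counts(docs):
    """Count documents per job, like a terms aggregation on job.keyword."""
    return dict(Counter(d["job"] for d in docs))


# query: narrows the hits and every aggregation, except global ones
matched = [d for d in employees if d["age"] >= 40]
jobs_in_query_scope = job_counts(matched)

# filter aggregation: narrows only the aggregation it wraps
older_jobs = job_counts([d for d in employees if d["age"] >= 35])
all_jobs = job_counts(employees)

# post_filter: aggregations still see every matched document, only the hits shrink
hits_after_post_filter = [d for d in employees if d["job"] == "Dev Manager"]

# global: ignores the query and aggregates over every document
global_salary_avg = sum(d["salary"] for d in employees) / len(employees)
```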

5. How Aggregations Work and Their Accuracy Issues

Approximate statistics algorithms in distributed systems

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "Dest",
        "size": 10
      }
    }
  }
}
// Response (buckets omitted)
"aggregations" : {
    "dest" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 8898
    }
}

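The Terms Aggregation response contains two special values:

doc_count_error_upper_bound: the maximum possible number of documents in a term bucket that was left out of the result
sum_other_doc_count: the total document count of the terms that are not in the returned buckets (total document count minus the document count of the returned buckets)

When doc_count_error_upper_bound is greater than 0, the result may be inaccurate.

The cause: the data is spread across multiple shards and each shard returns only its own top terms, so the Coordinating Node never sees the full picture
Solution 1: when the data volume is small, set the number of Primary Shards to 1 so the result is exact
Solution 2: on distributed data, set the shard_size parameter to improve accuracy
- How it works: fetch extra terms from each shard to improve accuracy
- Tune shard_size to lower doc_count_error_upper_bound and improve accuracy
  - This increases the overall computation: accuracy improves, but responses get slower
- The default shard_size is size*1.5+10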
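
Why a small shard_size gives wrong top terms is easier to see in a toy simulation. The sketch below is a simplified model, not the Elasticsearch implementation: each shard is a hypothetical dict of term counts, each shard sends only its top `shard_size` terms, and the error bound is modelled as the sum of the last returned count on every shard that left terms out. Truncating the default to an integer is also an assumption of this sketch.

```python
from collections import Counter


def default_shard_size(size):
    """Default stated in the text above: size * 1.5 + 10 (integer truncation assumed)."""
    return int(size * 1.5 + 10)


def terms_agg(shards, size, shard_size):
    """Simulate a terms aggregation where each shard sends only its top terms."""
    merged = Counter()
    error_upper_bound = 0
    for shard in shards:
        top = Counter(shard).most_common(shard_size)
        merged.update(dict(top))
        if len(shard) > shard_size:
            # A term this shard left out has at most the count of its last returned term.
            error_upper_bound += top[-1][1]
    buckets = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return buckets, error_upper_bound


shards = [  # hypothetical per-shard term counts; B is the true top term (27 docs)
    {"A": 10, "B": 9, "C": 1},
    {"C": 10, "B": 9, "A": 1},
    {"D": 10, "B": 9},
]

too_small = terms_agg(shards, size=1, shard_size=1)
better = terms_agg(shards, size=1, shard_size=2)
exact = terms_agg(shards, size=1, shard_size=3)
```

With shard_size=1 no shard reports B, so the merged top term is wrong; once every shard sends enough terms, B wins and the modelled error bound drops to 0.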

DELETE my_flights
PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings" : {
      "properties" : {
        "AvgTicketPrice" : {
          "type" : "float"
        },
        "Cancelled" : {
          "type" : "boolean"
        },
        "Carrier" : {
          "type" : "keyword"
        },
        "Dest" : {
          "type" : "keyword"
        },
        "DestAirportID" : {
          "type" : "keyword"
        },
        "DestCityName" : {
          "type" : "keyword"
        },
        "DestCountry" : {
          "type" : "keyword"
        },
        "DestLocation" : {
          "type" : "geo_point"
        },
        "DestRegion" : {
          "type" : "keyword"
        },
        "DestWeather" : {
          "type" : "keyword"
        },
        "DistanceKilometers" : {
          "type" : "float"
        },
        "DistanceMiles" : {
          "type" : "float"
        },
        "FlightDelay" : {
          "type" : "boolean"
        },
        "FlightDelayMin" : {
          "type" : "integer"
        },
        "FlightDelayType" : {
          "type" : "keyword"
        },
        "FlightNum" : {
          "type" : "keyword"
        },
        "FlightTimeHour" : {
          "type" : "keyword"
        },
        "FlightTimeMin" : {
          "type" : "float"
        },
        "Origin" : {
          "type" : "keyword"
        },
        "OriginAirportID" : {
          "type" : "keyword"
        },
        "OriginCityName" : {
          "type" : "keyword"
        },
        "OriginCountry" : {
          "type" : "keyword"
        },
        "OriginLocation" : {
          "type" : "geo_point"
        },
        "OriginRegion" : {
          "type" : "keyword"
        },
        "OriginWeather" : {
          "type" : "keyword"
        },
        "dayOfWeek" : {
          "type" : "integer"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    }
}

POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "my_flights"
  }
}

GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":1,
        "shard_size":10 // with 5, the returned doc_count_error_upper_bound is greater than 0; with 10 it is 0
      }
    }
  }
}