ElasticSearch Cardinality Aggregation聚合超出40000存在误差

Precision controledit

This aggregation also supports the precision_threshold option:

The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100 
            }
        }
    }
}

The precision_threshold options allows to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000, thresholds above this number will have the same effect as a threshold of 40000. Default value depends on the number of parent aggregations that multiple create buckets (such as terms or histograms).

Counts are approximateedit

Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.

This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

  • configurable precision, which decides on how to trade memory for accuracy,
  • excellent accuracy on low-cardinality sets,
  • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.

The following chart shows how the error varies before and after the threshold:

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.

 

 

  其意思就是:聚合查询存在误差,在5%范围之内,通过调整“precision_threshold”参数进行调整。

  于是翻阅查询代码:加入如下部分问题得到解决。该参数在查询时未设置的情况下,默认值为3000。

  

 private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {
        // 设置聚合条件
        TermsBuilder agg = AggregationBuilders.terms(aggreName).field(XXX.XXX).size(Integer.MAX_VALUE);

        // 查询条件构建
        BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();
        FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);
       
        packAgg.subAggregation(AggregationBuilders.cardinality(xxx).field(ZZZZ.XXX).precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));//指定精度值
        agg.subAggregation(packAgg);


        nativeSearchQueryBuilder.addAggregation(agg);
    }

官网文档:

Elasticsearch Guide:  https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

https://www.elastic.co/guide/en/elasticsearch/reference/8.0/search-aggregations-metrics-cardinality-aggregation.html 

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值