BKD vs. doc values in ES

chuanyangwang

Category: ES. Tags: elasticsearch, deep learning

Article link: https://blog.csdn.net/chuanyangwang/article/details/120578467

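Numeric range queries backed by a BKD tree perform very well. However, the docIds inside a BKD-Tree are not stored in sorted order, so the query cannot skip forward the way a skipList allows. To intersect it with other queries, a FixedBitSet must be built first, and that step can be very expensive. Lucene uses IndexOrDocValuesQuery to optimize some of these cases.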

The FixedBitSet object is built in org.apache.lucene.search.join.PointInSetIncludingScoreQuery:score:

        PointValues values = reader.getPointValues(field);
        if (values == null) {
          return null;
        }

        FixedBitSet result = new FixedBitSet(reader.maxDoc());
        float[] scores = new float[reader.maxDoc()];
        values.intersect(new MergePointVisitor(sortedPackedPoints, result, scores));
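
To make the trade-off concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene code: the class RangeIntersectionSketch and its methods are hypothetical, java.util.BitSet stands in for FixedBitSet, and a plain array stands in for doc values. It contrasts materializing every range match into a bitset with checking only the documents that the other (leading) clause produces.

```java
import java.util.BitSet;

/** Toy model (not Lucene code) of two ways to intersect a range filter with a sorted lead iterator. */
final class RangeIntersectionSketch {

    /** Points-like path: matches arrive in arbitrary docId order, so all of them are materialized first. */
    static int viaBitSet(int[] leadDocs, int[] unorderedRangeMatches, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);       // stands in for FixedBitSet
        for (int doc : unorderedRangeMatches) {
            bits.set(doc);                      // work grows with the number of range matches
        }
        int hits = 0;
        for (int doc : leadDocs) {
            if (bits.get(doc)) {
                hits++;
            }
        }
        return hits;
    }

    /** Doc-values-like path: only the lead documents are checked, one comparison per document. */
    static int viaPerDocCheck(int[] leadDocs, long[] valueByDoc, long min, long max) {
        int hits = 0;
        for (int doc : leadDocs) {
            long value = valueByDoc[doc];       // work grows with the number of lead documents
            if (value >= min && value <= max) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] valueByDoc = {5, 50, 7, 90, 8, 60, 3, 70};
        int[] rangeMatches = {4, 0, 2};         // docs with a value in [5, 10], in non-sorted order
        int[] leadDocs = {2, 3, 4};             // docs produced by the other clause, sorted
        System.out.println(viaBitSet(leadDocs, rangeMatches, valueByDoc.length));
        System.out.println(viaPerDocCheck(leadDocs, valueByDoc, 5, 10));
    }
}
```

Both methods return the same count. The difference is where the work goes: the first scales with the number of range matches, the second with the number of lead documents.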

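IndexOrDocValuesQuery wraps a points-based query (indexWeight) and a doc-values query (dvWeight), and its scorerSupplier picks one of them based on the lead cost:

      @Override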
      public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
        final ScorerSupplier indexScorerSupplier = indexWeight.scorerSupplier(context);
        final ScorerSupplier dvScorerSupplier = dvWeight.scorerSupplier(context);
        if (indexScorerSupplier == null || dvScorerSupplier == null) {
          return null;
        }
        return new ScorerSupplier() {
          @Override
          public Scorer get(long leadCost) throws IOException {
            // At equal costs, doc values tend to be worse than points since they
            // still need to perform one comparison per document while points can
            // do much better than that given how values are organized. So we give
            // an arbitrary 8x penalty to doc values.
            final long threshold = cost() >>> 3;
            if (threshold <= leadCost) {
              return indexScorerSupplier.get(leadCost);
            } else {
              return dvScorerSupplier.get(leadCost);
            }
          }

          @Override
          public long cost() {
            return indexScorerSupplier.cost();
          }
        };
      }
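
The decision in get(leadCost) can be isolated into a few lines. The sketch below is an illustration only: the class ScorerChoiceSketch and the enum Choice are hypothetical names, and the costs passed in are made-up numbers rather than values measured on a real index. Only the comparison itself mirrors the snippet above.

```java
/** Standalone illustration of the threshold rule; the names here are hypothetical. */
final class ScorerChoiceSketch {

    enum Choice { POINTS, DOC_VALUES }

    /**
     * @param indexCost estimated number of documents matched by the points (BKD) query
     * @param leadCost  estimated number of documents produced by the clause leading the iteration
     */
    static Choice choose(long indexCost, long leadCost) {
        // Same rule as above: doc values get an 8x penalty (shift right by 3).
        long threshold = indexCost >>> 3;
        return threshold <= leadCost ? Choice.POINTS : Choice.DOC_VALUES;
    }

    public static void main(String[] args) {
        // The range matches few documents relative to the lead clause: use points.
        System.out.println(choose(500, 100));
        // The range matches far more than 8x the lead clause: verify lead docs with doc values.
        System.out.println(choose(1_000_000, 100));
    }
}
```

With a lead cost of 1000, a range estimated at 8000 matches still uses points, while one estimated at 8008 switches to doc values, so a small change in the range can flip the execution path.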

The cost() used above is declared in ScorerSupplier:

  /**
   * Get an estimate of the {@link Scorer} that would be returned by {@link #get}.
   * This may be a costly operation, so it should only be called if necessary.
   * @see DocIdSetIterator#cost
   */
  public abstract long cost();

Something that is interesting to notice here is that this query planning optimization does not only depend on the fields that are used and their cardinalities, it goes further and estimates the total number of matches for each node of the query tree in order to make good decisions. This means that taking a query and slightly changing the range of values might completely change how the query is executed under the hood.

How the cost is computed for a boolean query:

  private long computeCost() {
    OptionalLong minRequiredCost = Stream.concat(
        subs.get(Occur.MUST).stream(),
        subs.get(Occur.FILTER).stream())
        .mapToLong(ScorerSupplier::cost)
        .min();
    if (minRequiredCost.isPresent() && minShouldMatch == 0) {
      return minRequiredCost.getAsLong();
    } else {
      final Collection<ScorerSupplier> optionalScorers = subs.get(Occur.SHOULD);
      final long shouldCost = MinShouldMatchSumScorer.cost(
          optionalScorers.stream().mapToLong(ScorerSupplier::cost),
          optionalScorers.size(), minShouldMatch);
      return Math.min(minRequiredCost.orElse(Long.MAX_VALUE), shouldCost);
    }
  }
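
A worked example may help when reading computeCost. The sketch below re-implements the same rule on plain lists of numbers. The class BooleanCostSketch is hypothetical, and the handling of optional clauses is an assumption: it sums the (n - minShouldMatch + 1) cheapest SHOULD clauses, which is how I understand MinShouldMatchSumScorer.cost to behave, so check it against the Lucene source before relying on it.

```java
import java.util.List;

/** Toy re-implementation of the cost rule on plain numbers; not Lucene code. */
final class BooleanCostSketch {

    /**
     * @param requiredCosts  costs of the MUST and FILTER clauses
     * @param shouldCosts    costs of the SHOULD clauses
     * @param minShouldMatch minimum number of SHOULD clauses that must match
     */
    static long cost(List<Long> requiredCosts, List<Long> shouldCosts, int minShouldMatch) {
        // A conjunction can match at most as many docs as its cheapest required clause.
        long minRequired = requiredCosts.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MAX_VALUE);
        if (!requiredCosts.isEmpty() && minShouldMatch == 0) {
            return minRequired;
        }
        // Assumption: the cost of the optional part is the sum of the
        // (n - minShouldMatch + 1) cheapest SHOULD clauses.
        int n = shouldCosts.size();
        long shouldCost = shouldCosts.stream()
                .mapToLong(Long::longValue)
                .sorted()
                .limit(Math.max(0, n - minShouldMatch + 1))
                .sum();
        return Math.min(minRequired, shouldCost);
    }

    public static void main(String[] args) {
        // Two required clauses, optional clauses ignored because minShouldMatch is 0.
        System.out.println(cost(List.of(1000L, 50L), List.of(10L, 20L), 0));
        // No required clauses, at least 2 of 3 optional clauses must match.
        System.out.println(cost(List.of(), List.of(30L, 10L, 20L), 2));
    }
}
```

The first call returns the cheapest required cost (50). The second sums the two cheapest optional costs (10 + 20 = 30).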

References

1. A summary of pitfalls our team ran into when using elasticsearch at work - AIQ

2. https://www.elastic.co/cn/blog/better-query-planning-for-range-queries-in-elasticsearch

3. [LUCENE-7055] Better execution path for costly queries - ASF JIRA

4. IndexOrDocValuesQuery-html