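Numeric range queries backed by the BKD tree perform very well. However, because the doc IDs inside a BKD-Tree are not sorted, the query cannot skip forward the way a skip list does. To intersect it with other queries, a FixedBitSet must be built first, and that step can be very time-consuming. Lucene optimizes some of these cases with IndexOrDocValuesQuery.
The FixedBitSet object is constructed in org.apache.lucene.search.join.PointInSetIncludingScoreQuery:score: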
PointValues values = reader.getPointValues(field);
if (values == null) {
  return null;
}
FixedBitSet result = new FixedBitSet(reader.maxDoc());
float[] scores = new float[reader.maxDoc()];
values.intersect(new MergePointVisitor(sortedPackedPoints, result, scores));
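The need for the bitset can be illustrated with a small self-contained sketch. It does not use Lucene: java.util.BitSet stands in for FixedBitSet, and the class and method names (PointsBitSetSketch, collect, intersect) are hypothetical. The assumption is that the tree traversal hands back doc IDs in value order, so every match has to be collected before the result can be consumed in doc ID order.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: models why an unsorted stream of doc IDs must be
// materialized into a bitset before it can be intersected with another query.
final class PointsBitSetSketch {

    // Stand-in for the tree traversal: doc IDs arrive in value order,
    // not doc ID order, so each one is simply marked in the bitset.
    static BitSet collect(int maxDoc, int[] docIdsInValueOrder) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId : docIdsInValueOrder) {
            bits.set(docId);
        }
        return bits;
    }

    // Intersection with a sorted "lead" iterator: probe the bitset per doc.
    static int[] intersect(BitSet rangeMatches, int[] sortedLeadDocs) {
        return Arrays.stream(sortedLeadDocs).filter(rangeMatches::get).toArray();
    }

    public static void main(String[] args) {
        int[] fromTree = {42, 7, 19, 3, 88};   // unsorted by doc ID
        BitSet bits = collect(100, fromTree);
        int[] lead = {3, 5, 42, 90};           // sorted, from a selective query
        System.out.println(Arrays.toString(intersect(bits, lead)));
    }
}
```

Note that the cost of collect grows with the number of range matches, even when the lead query only has a handful of documents. That is the case the optimization below targets.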
IndexOrDocValuesQuery chooses between the index (points) and doc values when the scorer is created, in its scorerSupplier:
@Override
public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
  final ScorerSupplier indexScorerSupplier = indexWeight.scorerSupplier(context);
  final ScorerSupplier dvScorerSupplier = dvWeight.scorerSupplier(context);
  if (indexScorerSupplier == null || dvScorerSupplier == null) {
    return null;
  }
  return new ScorerSupplier() {
    @Override
    public Scorer get(long leadCost) throws IOException {
      // At equal costs, doc values tend to be worse than points since they
      // still need to perform one comparison per document while points can
      // do much better than that given how values are organized. So we give
      // an arbitrary 8x penalty to doc values.
      final long threshold = cost() >>> 3;
      if (threshold <= leadCost) {
        return indexScorerSupplier.get(leadCost);
      } else {
        return dvScorerSupplier.get(leadCost);
      }
    }

    @Override
    public long cost() {
      return indexScorerSupplier.cost();
    }
  };
}
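The threshold rule in get(leadCost) can be tried out with plain numbers. The sketch below is a standalone model, not Lucene code: the class IndexOrDocValuesChoice, the enum Path and the method choose are hypothetical names, and the costs are made-up values. Only the comparison itself (cost() >>> 3 against leadCost) is taken from the excerpt above.

```java
// Illustrative only: reproduces the "cost >>> 3 <= leadCost" decision
// with plain longs instead of Lucene ScorerSupplier objects.
final class IndexOrDocValuesChoice {

    enum Path { POINTS, DOC_VALUES }

    // indexCost: estimated number of docs matching the range via points.
    // leadCost:  estimated number of docs the leading clause will visit.
    static Path choose(long indexCost, long leadCost) {
        long threshold = indexCost >>> 3;   // the 8x penalty on doc values
        return threshold <= leadCost ? Path.POINTS : Path.DOC_VALUES;
    }

    public static void main(String[] args) {
        // Wide range, very selective lead: verify per doc with doc values.
        System.out.println(choose(1_000_000L, 100L));      // DOC_VALUES
        // Same range, lead visits many docs: building the bitset pays off.
        System.out.println(choose(1_000_000L, 200_000L));  // POINTS
        // Narrow range: 500 >>> 3 = 62 <= 100, points are cheap enough.
        System.out.println(choose(500L, 100L));            // POINTS
    }
}
```

In other words, doc values are used only when the range is estimated to match more than roughly 8 times as many documents as the lead clause will visit.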
The cost() used for this threshold is declared on ScorerSupplier:
/**
 * Get an estimate of the {@link Scorer} that would be returned by {@link #get}.
 * This may be a costly operation, so it should only be called if necessary.
 * @see DocIdSetIterator#cost
 */
public abstract long cost();
Something that is interesting to notice here is that this query planning optimization does not only depend on the fields that are used and their cardinalities, it goes further and estimates the total number of matches for each node of the query tree in order to make good decisions. This means that taking a query and slightly changing the range of values might completely change how the query is executed under the hood.
How the cost is computed:
private long computeCost() {
  OptionalLong minRequiredCost = Stream.concat(
      subs.get(Occur.MUST).stream(),
      subs.get(Occur.FILTER).stream())
      .mapToLong(ScorerSupplier::cost)
      .min();
  if (minRequiredCost.isPresent() && minShouldMatch == 0) {
    return minRequiredCost.getAsLong();
  } else {
    final Collection<ScorerSupplier> optionalScorers = subs.get(Occur.SHOULD);
    final long shouldCost = MinShouldMatchSumScorer.cost(
        optionalScorers.stream().mapToLong(ScorerSupplier::cost),
        optionalScorers.size(), minShouldMatch);
    return Math.min(minRequiredCost.orElse(Long.MAX_VALUE), shouldCost);
  }
}
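A simplified model of computeCost makes the rule easier to check by hand. The sketch below replaces ScorerSupplier objects with plain long costs, and the names BooleanCostSketch and shouldCost are hypothetical. The body of shouldCost is an assumption about what MinShouldMatchSumScorer.cost does (summing the n - minShouldMatch + 1 cheapest optional clauses); it is not copied from Lucene and should be checked against the source for the version in use.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.stream.Stream;

// Illustrative only: the same branching as computeCost above, on plain longs.
final class BooleanCostSketch {

    // ASSUMPTION: models MinShouldMatchSumScorer.cost as the sum of the
    // (n - minShouldMatch + 1) least costly SHOULD clauses.
    static long shouldCost(List<Long> costs, int minShouldMatch) {
        int keep = Math.max(0, costs.size() - minShouldMatch + 1);
        return costs.stream().mapToLong(Long::longValue).sorted().limit(keep).sum();
    }

    static long computeCost(List<Long> must, List<Long> filter,
                            List<Long> should, int minShouldMatch) {
        // A conjunction can never match more docs than its cheapest clause.
        OptionalLong minRequired = Stream.concat(must.stream(), filter.stream())
                .mapToLong(Long::longValue)
                .min();
        if (minRequired.isPresent() && minShouldMatch == 0) {
            return minRequired.getAsLong();
        }
        return Math.min(minRequired.orElse(Long.MAX_VALUE),
                        shouldCost(should, minShouldMatch));
    }

    public static void main(String[] args) {
        // MUST 1000 and 50, FILTER 200, optional SHOULD 10: cheapest required wins.
        System.out.println(computeCost(List.of(1000L, 50L), List.of(200L), List.of(10L), 0));
        // Pure disjunction: costs add up.
        System.out.println(computeCost(List.of(), List.of(), List.of(10L, 20L, 30L), 0));
        // MUST 100 plus "at least 2 of 3" SHOULD clauses.
        System.out.println(computeCost(List.of(100L), List.of(), List.of(10L, 20L, 30L), 2));
    }
}
```

Under these assumptions the three calls evaluate to 50, 60 and 30. The value returned for a boolean clause is what a sibling IndexOrDocValuesQuery would see as its leadCost.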
References
1. A summary of pitfalls our team ran into when using elasticsearch at work (工作中组内遇到的 elasticsearch 使用上的踩坑总结) - AIQ
2. https://www.elastic.co/cn/blog/better-query-planning-for-range-queries-in-elasticsearch
3. [LUCENE-7055] Better execution path for costly queries - ASF JIRA