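Lucene Source Code Analysis (11)

Lucene Source Code Analysis: The BooleanQuery Scoring Process

Earlier chapters walked through how a BooleanQuery is executed but only touched briefly on scoring. This chapter goes back to the BooleanQuery scoring process, starting from the score function of BooleanScorer.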

BooleanScorer::score

  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {

    ...

    BulkScorerAndDoc top = advance(min);
    while (top.next < max) {
      top = scoreWindow(top, collector, singleClauseCollector, acceptDocs, min, max);
    }

    return top.next;
  }

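The advance function first obtains the BulkScorerAndDoc for the first matching document at or after min; its next field holds that document number. The score function then loops over scoreWindow to process the matching documents. By default, scoreWindow handles at most 2048 documents per call.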

BooleanScorer::score->advance

  private BulkScorerAndDoc advance(int min) throws IOException {
    final HeadPriorityQueue head = this.head;
    final TailPriorityQueue tail = this.tail;
    BulkScorerAndDoc headTop = head.top();
    BulkScorerAndDoc tailTop = tail.top();
    while (headTop.next < min) {

        ...

        headTop.advance(min);
        headTop = head.updateTop();

        ...

    }
    return headTop;
  }

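head is a HeadPriorityQueue whose top function returns a BulkScorerAndDoc. After the top entry has been advanced, updateTop reorders the queue so that the BulkScorerAndDoc with the smallest next document number is at the front again, and returns it.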
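The ordering described above can be sketched with java.util.PriorityQueue. This is only an illustration, not Lucene's own queue class: Entry is a hypothetical stand-in for BulkScorerAndDoc, and advanceTop emulates "change the top entry, then call updateTop" by removing and re-inserting it.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class HeadQueueSketch {
    // Hypothetical stand-in for BulkScorerAndDoc: only the fields needed here.
    static final class Entry {
        final String name; // label of the clause, for illustration only
        int next;          // next matching doc ID of this clause
        Entry(String name, int next) { this.name = name; this.next = next; }
    }

    // A queue ordered by the next doc ID, smallest first.
    static PriorityQueue<Entry> newHead() {
        return new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.next));
    }

    // Label of the clause whose next doc ID is currently the smallest.
    static String topName(PriorityQueue<Entry> head) {
        return head.peek().name;
    }

    // Emulates "mutate the top, then updateTop()": remove, change, re-insert.
    static void advanceTop(PriorityQueue<Entry> head, int newNext) {
        Entry top = head.poll();
        top.next = newNext;
        head.add(top);
    }

    public static void main(String[] args) {
        PriorityQueue<Entry> head = newHead();
        head.add(new Entry("a", 7));
        head.add(new Entry("b", 3));
        System.out.println(topName(head)); // b
        advanceTop(head, 10);              // clause b moves on to doc 10
        System.out.println(topName(head)); // a
    }
}
```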

BooleanScorer::score->advance->BulkScorerAndDoc::advance

    void advance(int min) throws IOException {
      score(orCollector, null, min, min);
    }

    void score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
      next = scorer.score(collector, acceptDocs, min, max);
    }

    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
      collector.setScorer(scorer);
      if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
        ...
      } else {
        int doc = scorer.docID();
        if (doc < min) {
            doc = iterator.advance(min);
        }
        return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
      }
    }

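The first two functions above belong to BulkScorerAndDoc; the third is the score function of DefaultBulkScorer that they delegate to. On the first call, docID returns -1 and min is 0, so doc < min holds and the iterator's advance function is called to fetch the first document ID. Because BulkScorerAndDoc's advance passes min as both bounds, the range is empty: nothing is collected, and the call only records the first document at or after min in next. The iterator here is a BlockDocsEnum, whose advance function reads document information from the corresponding .doc file.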
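The range contract used here (collect everything in [min, max), return the first document at or beyond max) can be sketched over an in-memory array. This is a simplified model, not the Lucene implementation: the sorted int array stands in for a clause's postings, and the linear skip stands in for iterator.advance.

```java
import java.util.ArrayList;
import java.util.List;

class RangeScoreSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // docs: sorted doc IDs of one clause (hypothetical in-memory postings).
    // Adds every doc in [min, max) to collected and returns the first doc >= max,
    // or NO_MORE_DOCS when the postings are exhausted.
    static int scoreRange(int[] docs, int min, int max, List<Integer> collected) {
        int i = 0;
        while (i < docs.length && docs[i] < min) {
            i++; // plays the role of iterator.advance(min)
        }
        while (i < docs.length && docs[i] < max) {
            collected.add(docs[i++]);
        }
        return i < docs.length ? docs[i] : NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        int[] docs = {3, 9, 2050};
        List<Integer> out = new ArrayList<>();
        // Empty range (min == max): nothing is collected, only the position is reported.
        System.out.println(scoreRange(docs, 5, 5, out) + " " + out);    // 9 []
        // A full first window: docs 3 and 9 are collected, 2050 is the next one.
        System.out.println(scoreRange(docs, 0, 2048, out) + " " + out); // 2050 [3, 9]
    }
}
```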

BooleanScorer::score->advance->BulkScorerAndDoc::advance->score->DefaultBulkScorer::score->BlockDocsEnum::advance

    public int advance(int target) throws IOException {

      if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {
        ...
      }

      if (docUpto == docFreq) {
        return doc = NO_MORE_DOCS;
      }

      if (docBufferUpto == BLOCK_SIZE) {
        refillDocs();
      }

      while (true) {
        accum += docDeltaBuffer[docBufferUpto];
        docUpto++;

        if (accum >= target) {
          break;
        }
        docBufferUpto++;
        if (docUpto == docFreq) {
          return doc = NO_MORE_DOCS;
        }
      }

      freq = freqBuffer[docBufferUpto];
      docBufferUpto++;
      return doc = accum;
    }

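docUpto counts how many documents of the posting list have been consumed so far, while docBufferUpto is the read position inside the currently buffered block. BLOCK_SIZE is the size of that buffer; once docBufferUpto reaches BLOCK_SIZE the buffered block has been used up, and refillDocs reads the next block from the .doc file. docDeltaBuffer holds the document IDs in delta form (each entry is the gap from the previous document ID, summed into accum), and freqBuffer holds the corresponding term frequencies. The loop stops at the first document ID that is not smaller than target and returns it.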
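A minimal sketch of the delta decoding in this loop, under simplifying assumptions: the whole posting list sits in one in-memory array, so there are no blocks, no refill and no skip data. The class and field names are illustrative, not Lucene's.

```java
class DeltaAdvanceSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    final int[] deltas; // gap between consecutive doc IDs
    final int[] freqs;  // term frequency of each posting
    int upto = 0;       // number of postings consumed so far
    int accum = 0;      // running sum of the deltas, i.e. the current doc ID
    int doc = -1;
    int freq = 0;

    DeltaAdvanceSketch(int[] deltas, int[] freqs) {
        this.deltas = deltas;
        this.freqs = freqs;
    }

    // Moves to the first doc ID >= target and returns it.
    int advance(int target) {
        while (upto < deltas.length) {
            accum += deltas[upto];
            int f = freqs[upto];
            upto++;
            if (accum >= target) {
                freq = f;
                return doc = accum;
            }
        }
        return doc = NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        // Doc IDs 3, 9, 20 stored as the deltas 3, 6, 11.
        DeltaAdvanceSketch e = new DeltaAdvanceSketch(new int[] {3, 6, 11}, new int[] {1, 4, 2});
        System.out.println(e.advance(5) + " freq=" + e.freq); // 9 freq=4
        System.out.println(e.advance(10));                    // 20
        System.out.println(e.advance(21) == NO_MORE_DOCS);    // true
    }
}
```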

Once the first document ID is available, the score function of BooleanScorer goes on to process the matching documents through scoreWindow.

BooleanScorer::score->scoreWindow

  private BulkScorerAndDoc scoreWindow(BulkScorerAndDoc top, LeafCollector collector,
      LeafCollector singleClauseCollector, Bits acceptDocs, int min, int max) throws IOException {
    final int windowBase = top.next & ~MASK;
    final int windowMin = Math.max(min, windowBase);
    final int windowMax = Math.min(max, windowBase + SIZE);

    leads[0] = head.pop();
    int maxFreq = 1;
    while (head.size() > 0 && head.top().next < windowMax) {
      leads[maxFreq++] = head.pop();
    }

    if (minShouldMatch == 1 && maxFreq == 1) {

      ...

    } else {
      scoreWindowMultipleScorers(collector, acceptDocs, windowBase, windowMin, windowMax, maxFreq);
      return head.top();
    }
  }

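scoreWindow handles at most SIZE documents per call; windowMin and windowMax are the lower bound (inclusive) and upper bound (exclusive) of the document numbers in the current window. It then pops every BulkScorerAndDoc whose next document falls inside the window into the leads array, and finally calls scoreWindowMultipleScorers to continue.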
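The window arithmetic can be checked with a few numbers. The sketch assumes SIZE is 2048, as stated earlier in the text, and that MASK is SIZE - 1; the bounds helper is made up for illustration.

```java
class WindowBoundsSketch {
    static final int SIZE = 2048;     // assumed window size, as stated in the text
    static final int MASK = SIZE - 1; // assumed: low 11 bits set

    // Returns {windowBase, windowMin, windowMax} for the window containing next.
    static int[] bounds(int next, int min, int max) {
        int windowBase = next & ~MASK;                    // round next down to a multiple of SIZE
        int windowMin = Math.max(min, windowBase);        // never start before the requested min
        int windowMax = Math.min(max, windowBase + SIZE); // never run past the requested max
        return new int[] {windowBase, windowMin, windowMax};
    }

    public static void main(String[] args) {
        int[] b = bounds(5000, 0, Integer.MAX_VALUE);
        System.out.println(b[0] + " " + b[1] + " " + b[2]); // 4096 4096 6144
        int[] c = bounds(100, 50, 1000);
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // 0 50 1000
    }
}
```

Because every window starts at a multiple of SIZE, a document's slot inside its window is simply doc & MASK.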

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers

  private void scoreWindowMultipleScorers(LeafCollector collector, Bits acceptDocs, int windowBase, int windowMin, int windowMax, int maxFreq) throws IOException {

    ...

    if (maxFreq >= minShouldMatch) {

      ...

      scoreWindowIntoBitSetAndReplay(collector, acceptDocs, windowBase, windowMin, windowMax, leads, maxFreq);
    }

    ...
  }

scoreWindowMultipleScorers in turn calls scoreWindowIntoBitSetAndReplay to do the work.

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay

  private void scoreWindowIntoBitSetAndReplay(LeafCollector collector, Bits acceptDocs,
      int base, int min, int max, BulkScorerAndDoc[] scorers, int numScorers) throws IOException {
    for (int i = 0; i < numScorers; ++i) {
      final BulkScorerAndDoc scorer = scorers[i];
      scorer.score(orCollector, acceptDocs, min, max);
    }

    scoreMatches(collector, base);
    Arrays.fill(matching, 0L);
  }

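scoreWindowIntoBitSetAndReplay iterates over the current BulkScorerAndDoc array and calls score on each entry to compute scores; that call eventually reaches the collect function of OrCollector. scoreMatches then finalizes the results of this window. Finally the matching array is cleared, ready for the next window of 2048 documents.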

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->BulkScorerAndDoc::score->DefaultBulkScorer::score->scoreRange->OrCollector::collect

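    public void collect(int doc) throws IOException {
      final int i = doc & MASK;          // slot of doc inside the current window
      final int idx = i >>> 6;           // index of the 64-bit word holding that slot
      matching[idx] |= 1L << i;          // long shifts use only the low 6 bits of i
      final Bucket bucket = buckets[i];
      bucket.freq++;                     // one more clause matched this doc
      bucket.score += scorer.score();    // add this clause's score
    }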

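collect is called once per matching document, and every call falls inside the current window of at most 2048 documents. The matching field records, one bit per document, which documents in the window matched. buckets holds, for each of those 2048 slots, the number of clauses that matched and the accumulated score, with each clause adding its own score through scorer.score(). The 2048 slots are split into 32 groups, each stored as one long whose 64 bits mark which documents matched.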
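A standalone sketch of this bookkeeping, assuming a window of 2048 slots. Plain arrays replace the Bucket objects, and the clause score is passed in directly instead of coming from a scorer; both are simplifications for illustration.

```java
class OrCollectSketch {
    static final int SIZE = 2048;     // assumed window size
    static final int MASK = SIZE - 1;

    final long[] matching = new long[SIZE / 64]; // 32 words, one bit per slot
    final int[] freq = new int[SIZE];            // how many clauses matched each slot
    final double[] score = new double[SIZE];     // accumulated score of each slot

    void collect(int doc, float clauseScore) {
        int i = doc & MASK;       // slot inside the window
        int idx = i >>> 6;        // which 64-bit word
        matching[idx] |= 1L << i; // Java masks the shift count of a long to its low 6 bits
        freq[i]++;
        score[i] += clauseScore;
    }

    public static void main(String[] args) {
        OrCollectSketch c = new OrCollectSketch();
        int doc = 4096 + 70;      // slot 70 of the window that starts at 4096
        c.collect(doc, 1.5f);     // first clause matches
        c.collect(doc, 2.0f);     // second clause matches the same doc
        System.out.println(c.matching[1] == (1L << 6)); // true: bit 6 of word 1
        System.out.println(c.freq[70] + " " + c.score[70]); // 2 3.5
    }
}
```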

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->scoreMatches

  private void scoreMatches(LeafCollector collector, int base) throws IOException {
    long matching[] = this.matching;
    for (int idx = 0; idx < matching.length; idx++) {
      long bits = matching[idx];
      while (bits != 0L) {
        int ntz = Long.numberOfTrailingZeros(bits);
        int doc = idx << 6 | ntz;
        scoreDocument(collector, base, doc);
        bits ^= 1L << ntz;
      }
    }
  }

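scoreMatches walks the set bits to recover which document slots matched, then calls scoreDocument for each of them to compute the final score and rank the document.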
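The bit-walking idiom can be tried in isolation. The sketch below lists the set bit positions of a long array in ascending order, using the same numberOfTrailingZeros loop; the setBits helper is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BitReplaySketch {
    // Returns the positions of all set bits, in ascending order.
    static List<Integer> setBits(long[] matching) {
        List<Integer> out = new ArrayList<>();
        for (int idx = 0; idx < matching.length; idx++) {
            long bits = matching[idx];
            while (bits != 0L) {
                int ntz = Long.numberOfTrailingZeros(bits); // lowest set bit of this word
                out.add(idx << 6 | ntz);                    // word index * 64 + bit index
                bits ^= 1L << ntz;                          // clear that bit and continue
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] matching = new long[32];
        matching[0] = 0b1001L;  // slots 0 and 3
        matching[1] = 1L << 6;  // slot 64 + 6 = 70
        System.out.println(setBits(matching)); // [0, 3, 70]
    }
}
```

The cost of the loop is proportional to the number of matches plus the number of words, so sparse windows are replayed quickly.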

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->scoreMatches->scoreDocument

  private void scoreDocument(LeafCollector collector, int base, int i) throws IOException {
    final FakeScorer fakeScorer = this.fakeScorer;
    final Bucket bucket = buckets[i];
    if (bucket.freq >= minShouldMatch) {
      fakeScorer.freq = bucket.freq;
      fakeScorer.score = (float) bucket.score * coordFactors[bucket.freq];
      final int doc = base | i;
      fakeScorer.doc = doc;
      collector.collect(doc);
    }
    bucket.freq = 0;
    bucket.score = 0;
  }

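This function computes the document number, takes the previously accumulated result out of the buckets array, and then calls collect to handle the final result. The collector here is the SimpleTopScoreDocCollector created at the very beginning; its collect function compares scores and ranks the documents that will finally be returned.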
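How a collector can keep only the best hits while documents arrive one at a time can be sketched with a bounded min-heap. This is a generic top-N technique built on java.util.PriorityQueue, offered as an assumption about the general idea rather than as the code of SimpleTopScoreDocCollector; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class TopHitsSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    final int n; // number of hits to keep
    // Min-heap on score: the weakest of the kept hits is always at the top.
    final PriorityQueue<Hit> heap =
        new PriorityQueue<Hit>((Hit a, Hit b) -> Float.compare(a.score, b.score));

    TopHitsSketch(int n) { this.n = n; }

    void collect(int doc, float score) {
        if (n <= 0) {
            return;
        }
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();                   // drop the weakest kept hit
            heap.add(new Hit(doc, score));
        }
    }

    // Doc IDs of the kept hits, best score first.
    List<Integer> topDocs() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort((Hit a, Hit b) -> Float.compare(b.score, a.score));
        List<Integer> docs = new ArrayList<>();
        for (Hit h : hits) {
            docs.add(h.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        TopHitsSketch top = new TopHitsSketch(2);
        top.collect(1, 0.5f);
        top.collect(2, 2.0f);
        top.collect(3, 1.0f);
        top.collect(4, 0.1f);
        System.out.println(top.topDocs()); // [2, 3]
    }
}
```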
