Lucene Source Code Analysis 8

Lucene Source Code Analysis: The Query Process

This chapter begins our look at Lucene's query process, namely IndexSearcher's search function.

IndexSearcher::search

  public TopDocs search(Query query, int n)
    throws IOException {
    return searchAfter(null, query, n);
  }
  public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOException {
    final int limit = Math.max(1, reader.maxDoc());
    final int cappedNumHits = Math.min(numHits, limit);

    final CollectorManager<TopScoreDocCollector, TopDocs> manager = new CollectorManager<TopScoreDocCollector, TopDocs>() {

      @Override
      public TopScoreDocCollector newCollector() throws IOException {
        ...
      }

      @Override
      public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
        ...
      }

    };

    return search(query, manager);
  }

The query parameter encapsulates the query statement, and n means we want the top n results. The calculation at the beginning of searchAfter caps the number of hits so that it cannot exceed the total number of documents in the index. Next a CollectorManager is created, and the overloaded search function is called to continue.

IndexSearcher::search->searchAfter->search

  public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
    if (executor == null) {
      final C collector = collectorManager.newCollector();
      search(query, collector);
      return collectorManager.reduce(Collections.singletonList(collector));
    } else {

      ...

    }
  }

Assume the query runs single-threaded, so executor is null. First, the CollectorManager's newCollector creates a TopScoreDocCollector; each TopScoreDocCollector encapsulates its own final query results, and in a multi-threaded search the multiple TopScoreDocCollectors would have to be merged at the end.

IndexSearcher::search->searchAfter->search->CollectorManager::newCollector

  public TopScoreDocCollector newCollector() throws IOException {
    return TopScoreDocCollector.create(cappedNumHits, after);
  }

  public static TopScoreDocCollector create(int numHits, ScoreDoc after) {
    if (after == null) {
      return new SimpleTopScoreDocCollector(numHits);
    } else {
      return new PagingTopScoreDocCollector(numHits, after);
    }
  }

The after parameter implements pagination-like behavior; assume it is null here, so newCollector ultimately returns a SimpleTopScoreDocCollector. Once the TopScoreDocCollector has been created, the overloaded search function is called to continue. (A paging sketch follows below.)

IndexSearcher::search->searchAfter->search->search

  public void search(Query query, Collector results)
    throws IOException {
    search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
  }

leafContexts is the leaves member of the CompositeReaderContext: a list of LeafReaderContexts, each wrapping one segment's SegmentReader, through which all of that segment's information and data can be read; see the sketch below.

  public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
    query = rewrite(query);
    Weight weight = createWeight(query, needsScores);
    float v = weight.getValueForNormalization();
    float norm = getSimilarity(needsScores).queryNorm(v);
    if (Float.isInfinite(norm) || Float.isNaN(norm)) {
      norm = 1.0f;
    }
    weight.normalize(norm, 1.0f);
    return weight;
  }

First, the rewrite function rewrites the Query, for example removing unnecessary clauses and converting non-atomic queries into atomic ones.

rewrite

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->rewrite

  public Query rewrite(Query original) throws IOException {
    Query query = original;
    for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query;
         rewrittenQuery = query.rewrite(reader)) {
      query = rewrittenQuery;
    }
    return query;
  }

Here each Query's rewrite function is called in a loop; the loop is needed because a rewrite that changes the query's structure may expose further parts that can themselves be rewritten. From here on, assume the query is a BooleanQuery. A BooleanQuery contains no actual query statement of its own; it holds multiple sub-queries, each of which is either an indivisible Query such as a TermQuery or another BooleanQuery (see the sketch below).
Since BooleanQuery's rewrite function is long, we will walk through it in parts.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 1

  public Query rewrite(IndexReader reader) throws IOException {

    if (clauses.size() == 1) {
      BooleanClause c = clauses.get(0);
      Query query = c.getQuery();
      if (minimumNumberShouldMatch == 1 && c.getOccur() == Occur.SHOULD) {
        return query;
      } else if (minimumNumberShouldMatch == 0) {
        switch (c.getOccur()) {
          case SHOULD:
          case MUST:
            return query;
          case FILTER:
            return new BoostQuery(new ConstantScoreQuery(query), 0);
          case MUST_NOT:
            return new MatchNoDocsQuery();
          default:
            throw new AssertionError();
        }
      }
    }

    ...

  }

If the BooleanQuery contains only a single clause, the wrapper is unnecessary and the Query inside that clause can be returned directly.
The minimumNumberShouldMatch member is the minimum number of SHOULD clauses that must match. If the only clause is a SHOULD and exactly one match is required, its Query is returned directly. If minimumNumberShouldMatch is 0: for SHOULD or MUST the inner Query is likewise returned directly; for FILTER the Query is wrapped in a ConstantScoreQuery inside a BoostQuery with boost 0, so it matches documents but contributes no score; and for MUST_NOT the single clause can never match any document, so a MatchNoDocsQuery is returned. (See the collapse sketch below.)

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 2

  public Query rewrite(IndexReader reader) throws IOException {

    ...

    {
      BooleanQuery.Builder builder = new BooleanQuery.Builder();
      builder.setDisableCoord(isCoordDisabled());
      builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
      boolean actuallyRewritten = false;
      for (BooleanClause clause : this) {
        Query query = clause.getQuery();
        Query rewritten = query.rewrite(reader);
        if (rewritten != query) {
          actuallyRewritten = true;
        }
        builder.add(rewritten, clause.getOccur());
      }
      if (actuallyRewritten) {
        return builder.build();
      }
    }

    ...

  }

This part of rewrite iterates over all of the BooleanQuery's clauses and recursively calls rewrite on each. If any call returns a Query different from the original, that sub-query was rewritten, and a new BooleanQuery is produced via BooleanQuery.Builder's build function.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 3

  public Query rewrite(IndexReader reader) throws IOException {

    ...

    {
      int clauseCount = 0;
      for (Collection<Query> queries : clauseSets.values()) {
        clauseCount += queries.size();
      }
      if (clauseCount != clauses.size()) {
        BooleanQuery.Builder rewritten = new BooleanQuery.Builder();
        rewritten.setDisableCoord(disableCoord);
        rewritten.setMinimumNumberShouldMatch(minimumNumberShouldMatch);
        for (Map.Entry<Occur, Collection<Query>> entry : clauseSets.entrySet()) {
          final Occur occur = entry.getKey();
          for (Query query : entry.getValue()) {
            rewritten.add(query, occur);
          }
        }
        return rewritten.build();
      }
    }

    ...

  }

clauseSets stores the sub-queries of the MUST_NOT and FILTER clauses in HashSets. The HashSet structure removes duplicate MUST_NOT and FILTER sub-queries; see the sketch below.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 4

  public Query rewrite(IndexReader reader) throws IOException {

    ...

    if (clauseSets.get(Occur.MUST).size() > 0 && clauseSets.get(Occur.FILTER).size() > 0) {
      final Set<Query> filters = new HashSet<Query>(clauseSets.get(Occur.FILTER));
      boolean modified = filters.remove(new MatchAllDocsQuery());
      modified |= filters.removeAll(clauseSets.get(Occur.MUST));
      if (modified) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.setDisableCoord(isCoordDisabled());
        builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
        for (BooleanClause clause : clauses) {
          if (clause.getOccur() != Occur.FILTER) {
            builder.add(clause);
          }
        }
        for (Query filter : filters) {
          builder.add(filter, Occur.FILTER);
        }
        return builder.build();
      }
    }

    ...

  }

This removes FILTER sub-queries that also appear as MUST clauses, and also removes a match-all-documents FILTER (the clause count is certainly greater than 1 at this point), since the result set of a match-all query contains the result set of any other query.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 5

    {
      final Collection<Query> musts = clauseSets.get(Occur.MUST);
      final Collection<Query> filters = clauseSets.get(Occur.FILTER);
      if (musts.size() == 1
          && filters.size() > 0) {
        Query must = musts.iterator().next();
        float boost = 1f;
        if (must instanceof BoostQuery) {
          BoostQuery boostQuery = (BoostQuery) must;
          must = boostQuery.getQuery();
          boost = boostQuery.getBoost();
        }
        if (must.getClass() == MatchAllDocsQuery.class) {
          BooleanQuery.Builder builder = new BooleanQuery.Builder();
          for (BooleanClause clause : clauses) {
            switch (clause.getOccur()) {
              case FILTER:
              case MUST_NOT:
                builder.add(clause);
                break;
              default:
                break;
            }
          }
          Query rewritten = builder.build();
          rewritten = new ConstantScoreQuery(rewritten);

          builder = new BooleanQuery.Builder()
            .setDisableCoord(isCoordDisabled())
            .setMinimumNumberShouldMatch(getMinimumNumberShouldMatch())
            .add(rewritten, Occur.MUST);
          for (Query query : clauseSets.get(Occur.SHOULD)) {
            builder.add(query, Occur.SHOULD);
          }
          rewritten = builder.build();
          return rewritten;
        }
      }
    }

    return super.rewrite(reader);

If a MatchAllDocsQuery is the only MUST clause, the query is rewritten so that the FILTER and MUST_NOT clauses are wrapped in a ConstantScoreQuery. Finally, if nothing was rewritten, the parent class Query's rewrite is called, which simply returns the query itself.

Having covered BooleanQuery's rewrite, here is a brief look at the rewrite functions of other Query types.
TermQuery's rewrite returns the query itself. SynonymQuery's rewrite checks whether it contains only one Query, and if so converts it into a TermQuery. WildcardQuery, PrefixQuery, RegexpQuery, and FuzzyQuery all extend MultiTermQuery. WildcardQuery's rewrite returns a MultiTermQueryConstantScoreWrapper wrapping the original query; PrefixQuery's rewrite likewise returns a MultiTermQueryConstantScoreWrapper, and RegexpQuery behaves like PrefixQuery. FuzzyQuery may ultimately return a BlendedTermQuery, depending on the situation. (See the short sketch below.)

Back in createNormalizedWeight: with the Query rewritten, createWeight next performs the matching and computes the weights.

createWeight

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight

  public Weight createWeight(Query query, boolean needsScores) throws IOException {
    final QueryCache queryCache = this.queryCache;
    Weight weight = query.createWeight(this, needsScores);
    if (needsScores == false && queryCache != null) {
      weight = queryCache.doCache(weight, queryCachingPolicy);
    }
    return weight;
  }

IndexSearcher's queryCache member is initialized to an LRUQueryCache. createWeight calls the createWeight function of the concrete Query; assume here a BooleanQuery.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight

  public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
    BooleanQuery query = this;
    if (needsScores == false) {
      query = rewriteNoScoring();
    }
    return new BooleanWeight(query, searcher, needsScores, disableCoord);
  }

needsScores returns true by default for SimpleTopScoreDocCollector. createWeight creates and returns a BooleanWeight.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight->BooleanWeight::BooleanWeight

  BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, boolean disableCoord) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    int i = 0;
    int maxCoord = 0;
    for (BooleanClause c : query) {
      Weight w = searcher.createWeight(c.getQuery(), needsScores && c.isScoring());
      weights.add(w);
      if (c.isScoring()) {
        maxCoord++;
      }
      i += 1;
    }
    this.maxCoord = maxCoord;

    coords = new float[maxCoord+1];
    Arrays.fill(coords, 1F);
    coords[0] = 0f;
    if (maxCoord > 0 && needsScores && disableCoord == false) {
      boolean seenActualCoord = false;
      for (i = 1; i < coords.length; i++) {
        coords[i] = coord(i, maxCoord);
        seenActualCoord |= (coords[i] != 1F);
      }
      this.disableCoord = seenActualCoord == false;
    } else {
      this.disableCoord = true;
    }
  }

getSimilarity by default returns the IndexSearcher's BM25Similarity. The BooleanWeight constructor recursively calls createWeight to obtain each sub-query's Weight; assume the sub-query is a TermQuery, whose createWeight we look at next. maxCoord counts the scoring sub-queries. The coords array at the end can scale a matching document's score: the classic coordination formula is coord(overlap, maxOverlap) = overlap / maxOverlap, i.e. the fraction of the clauses the document matches. Note that BM25Similarity does not override coord, so it always returns 1; seenActualCoord then stays false and coordination is effectively disabled. A small sketch follows.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight

  public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
        || perReaderTermState.topReaderContext != context) {
      termState = TermContext.build(context, term);
    } else {
      termState = this.perReaderTermState;
    }

    return new TermWeight(searcher, needsScores, termState);
  }

getTopReaderContext returns the CompositeReaderContext wrapping the SegmentReaders.
perReaderTermState is null by default, so TermContext.build is called to look the Term up in the term dictionary and gather its per-segment information; a TermWeight is then created from the resulting TermContext and returned.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build

  public static TermContext build(IndexReaderContext context, Term term)
      throws IOException {
    final String field = term.field();
    final BytesRef bytes = term.bytes();
    final TermContext perReaderTermState = new TermContext(context);
    for (final LeafReaderContext ctx : context.leaves()) {
      final Terms terms = ctx.reader().terms(field);
      if (terms != null) {
        final TermsEnum termsEnum = terms.iterator();
        if (termsEnum.seekExact(bytes)) { 
          final TermState termState = termsEnum.termState();
          perReaderTermState.register(termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
        }
      }
    }
    return perReaderTermState;
  }

Term's bytes function returns the query term's bytes, UTF-8 encoded by default. LeafReaderContext's reader function returns the SegmentReader, whose terms function returns a FieldReader for reading the term dictionary files.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->SegmentReader::terms

  // LeafReader
  public final Terms terms(String field) throws IOException {
    return fields().terms(field);
  }

  // CodecReader (SegmentReader's base class)
  public final Fields fields() {
    return getPostingsReader();
  }

  // SegmentReader
  public FieldsProducer getPostingsReader() {
    ensureOpen();
    return core.fields;
  }

  // PerFieldPostingsFormat.FieldsReader
  public Terms terms(String field) throws IOException {
    FieldsProducer fieldsProducer = fields.get(field);
    return fieldsProducer == null ? null : fieldsProducer.terms(field);
  }

core is created as a SegmentCoreReaders in the SegmentReader constructor, and its fields member is the PerFieldPostingsFormat's reader. fields.get ultimately returns the BlockTreeTermsReader that was configured when the index was created, and BlockTreeTermsReader's terms returns the FieldReader for the requested field.

Back in TermContext's build: the iterator function returns a SegmentTermsEnum, and seekExact performs the lookup. On a match, SegmentTermsEnum's termState returns an IntBlockTermState encapsulating the Term's various information (seekExact is analyzed in the next chapter). Finally, build saves the obtained IntBlockTermState via TermContext's register function.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->register

  public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
    register(state, ord);
    accumulateStatistics(docFreq, totalTermFreq);
  }

  public void register(TermState state, final int ord) {
    states[ord] = state;
  }

  public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
    this.docFreq += docFreq;
    if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
      this.totalTermFreq += totalTermFreq;
    else
      this.totalTermFreq = -1;
  }

The ord parameter identifies a unique IndexReaderContext, i.e. one segment. register stores the TermState (actually an IntBlockTermState) into the states array, and accumulateStatistics then updates the aggregate statistics.
Back in TermQuery's createWeight, a TermWeight is finally created and returned.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight

    public TermWeight(IndexSearcher searcher, boolean needsScores, TermContext termStates)
        throws IOException {
      super(TermQuery.this);
      this.needsScores = needsScores;
      this.termStates = termStates;
      this.similarity = searcher.getSimilarity(needsScores);

      final CollectionStatistics collectionStats;
      final TermStatistics termStats;
      if (needsScores) {
        collectionStats = searcher.collectionStatistics(term.field());
        termStats = searcher.termStatistics(term, termStates);
      } else {
        ...
      }

      this.stats = similarity.computeWeight(collectionStats, termStats);
    }

At a high level, collectionStatistics gathers statistics about a field, and termStatistics gathers statistics about a term.
computeWeight then computes the weight from these two pieces of information. Let us look at each in turn.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::collectionStatistics

  public CollectionStatistics collectionStatistics(String field) throws IOException {
    final int docCount;
    final long sumTotalTermFreq;
    final long sumDocFreq;

    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
      docCount = 0;
      sumTotalTermFreq = 0;
      sumDocFreq = 0;
    } else {
      docCount = terms.getDocCount();
      sumTotalTermFreq = terms.getSumTotalTermFreq();
      sumDocFreq = terms.getSumDocFreq();
    }
    return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
  }

getTerms works like the lookups analyzed above and ultimately returns a FieldReader. From it we obtain docCount (the number of documents), sumTotalTermFreq (the sum of all termFreq values, i.e. how many term occurrences each document contributes), and sumDocFreq (the sum of all docFreq values, i.e. how many documents contain each term). A CollectionStatistics wrapping this information is created and returned.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::termStatistics

  public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
    return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq());
  }

docFreq is the number of documents containing the term, and totalTermFreq the total number of occurrences of the term across those documents; a TermStatistics (whose constructor is trivial) wraps them and is returned.

Back in the TermWeight constructor, similarity defaults to BM25Similarity, whose computeWeight function is shown below.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight

  public final SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);

    float avgdl = avgFieldLength(collectionStats);

    float cache[] = new float[256];
    for (int i = 0; i < cache.length; i++) {
      cache[i] = k1 * ((1 - b) + b * decodeNormValue((byte)i) / avgdl);
    }
    return new BM25Stats(collectionStats.field(), idf, avgdl, cache);
  }

idfExplain computes the idf, the inverse document frequency.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->idfExplain

  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    final float idf = idf(df, docCount);
    return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
  }

df is the number of documents containing the term and docCount the total number of documents. BM25Similarity's idf function computes idf = log(1 + (docCount - df + 0.5) / (df + 0.5)); the intuition is that the more documents a term appears in, the less that term distinguishes documents, so its idf is lower. (The classic TF-IDF formula, 1 + log(numDocs / (docFreq + 1)), expresses the same idea.)
Back in computeWeight, avgFieldLength computes the average number of term occurrences per document.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->avgFieldLength

  protected float avgFieldLength(CollectionStatistics collectionStats) {
    final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
    if (sumTotalTermFreq <= 0) {
      return 1f;
    } else {
      final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
      return (float) (sumTotalTermFreq / (double) docCount);
    }
  }

avgFieldLength divides the total term-frequency sum by the document count, yielding the average document length in terms. Back in computeWeight, the BM25 coefficients are then precomputed into the cache array (BM25 is the ranking function Lucene uses by default), and a BM25Stats is created and returned; the sketch below shows where the cached factor fits into the BM25 formula.

Back in createNormalizedWeight, getValueForNormalization next computes the value used for query normalization.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization

  public float getValueForNormalization() throws IOException {
    float sum = 0.0f;
    int i = 0;
    for (BooleanClause clause : query) {
      float s = weights.get(i).getValueForNormalization();
      if (clause.isScoring()) {
        sum += s;
      }
      i += 1;
    }

    return sum ;
  }

BooleanWeight's getValueForNormalization accumulates the values returned by the scoring sub-clauses' getValueForNormalization functions. Assuming a TermQuery sub-query, the corresponding Weight is a TermWeight, whose getValueForNormalization is shown below.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization->TermWeight::getValueForNormalization

    // TermWeight
    public float getValueForNormalization() {
      return stats.getValueForNormalization();
    }

    // BM25Similarity.BM25Stats
    public float getValueForNormalization() {
      return weight * weight;
    }

    // BM25Similarity.BM25Stats
    public void normalize(float queryNorm, float boost) {
      this.boost = boost;
      this.weight = idf.getValue() * boost;
    }

stats is the BM25Stats; its getValueForNormalization ultimately returns the square of the idf value multiplied by the boost.

Back in createNormalizedWeight: BM25Similarity's queryNorm simply returns 1, and normalize then recomputes the weights from that norm. First, BooleanWeight's normalize:

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::normalize

  public void normalize(float norm, float boost) {
    for (Weight w : weights) {
      w.normalize(norm, boost);
    }
  }

Assume the sub-query's Weight is a TermWeight.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->TermWeight::normalize

    // TermWeight
    public void normalize(float queryNorm, float boost) {
      stats.normalize(queryNorm, boost);
    }

    // BM25Similarity.BM25Stats
    public void normalize(float queryNorm, float boost) {
      this.boost = boost;
      this.weight = idf.getValue() * boost;
    }

Back in IndexSearcher's search function: once createNormalizedWeight has returned the Weight, the overloaded search function defined below is called.

IndexSearcher::search->searchAfter->search->search->search

  protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {

    for (LeafReaderContext ctx : leaves) {
      final LeafCollector leafCollector;
      try {
        leafCollector = collector.getLeafCollector(ctx);
      } catch (CollectionTerminatedException e) {
        // there are no matches in this segment, move on to the next one
        continue;
      }
      BulkScorer scorer = weight.bulkScorer(ctx);
      if (scorer != null) {
        try {
          scorer.score(leafCollector, ctx.reader().getLiveDocs());
        } catch (CollectionTerminatedException e) {
          // collection was terminated prematurely: the hits gathered so far are valid
        }
      }
    }
  }

As covered in part 6 of this series, leaves is the list of LeafReaderContexts wrapping the SegmentReaders, and collector is the SimpleTopScoreDocCollector.

IndexSearcher::search->searchAfter->search->search->search->SimpleTopScoreDocCollector::getLeafCollector

    public LeafCollector getLeafCollector(LeafReaderContext context)
        throws IOException {
      final int docBase = context.docBase;
      return new ScorerLeafCollector() {

        @Override
        public void collect(int doc) throws IOException {
          float score = scorer.score();
          totalHits++;
          if (score <= pqTop.score) {
            return;
          }
          pqTop.doc = doc + docBase;
          pqTop.score = score;
          pqTop = pq.updateTop();
        }

      };
    }

getLeafCollector creates and returns a ScorerLeafCollector; its collect function scores each document and keeps the best hits in the collector's priority queue. A contrasting sketch of a custom collector follows below.

Back in search, the Weight's bulkScorer function is then called to obtain a BulkScorer, which drives the scoring.

bulkScorer

Assume the Weight created by createNormalizedWeight is a BooleanWeight; its bulkScorer function is shown below.

IndexSearcher::search->searchAfter->search->search->search->BooleanWeight::bulkScorer

  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    final BulkScorer bulkScorer = booleanScorer(context);
    if (bulkScorer != null) {
      return bulkScorer;
    } else {
      return super.bulkScorer(context);
    }
  }

bulkScorer first tries booleanScorer; assume it returns null, so the parent class Weight's bulkScorer is called and its result returned.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer

  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

    Scorer scorer = scorer(context);
    if (scorer == null) {
      return null;
    }

    return new DefaultBulkScorer(scorer);
  }

The scorer function is overridden in BooleanWeight:

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer

  public Scorer scorer(LeafReaderContext context) throws IOException {
    int minShouldMatch = query.getMinimumNumberShouldMatch();

    List<Scorer> required = new ArrayList<>();
    List<Scorer> requiredScoring = new ArrayList<>();
    List<Scorer> prohibited = new ArrayList<>();
    List<Scorer> optional = new ArrayList<>();
    Iterator<BooleanClause> cIter = query.iterator();
    for (Weight w  : weights) {
      BooleanClause c =  cIter.next();
      Scorer subScorer = w.scorer(context);
      if (subScorer == null) {
        if (c.isRequired()) {
          return null;
        }
      } else if (c.isRequired()) {
        required.add(subScorer);
        if (c.isScoring()) {
          requiredScoring.add(subScorer);
        }
      } else if (c.isProhibited()) {
        prohibited.add(subScorer);
      } else {
        optional.add(subScorer);
      }
    }

    if (optional.size() == minShouldMatch) {
      required.addAll(optional);
      requiredScoring.addAll(optional);
      optional.clear();
      minShouldMatch = 0;
    }

    if (required.isEmpty() && optional.isEmpty()) {
      return null;
    } else if (optional.size() < minShouldMatch) {
      return null;
    }

    if (!needsScores && minShouldMatch == 0 && required.size() > 0) {
      optional.clear();
    }

    if (optional.isEmpty()) {
      return excl(req(required, requiredScoring, disableCoord), prohibited);
    }

    if (required.isEmpty()) {
      return excl(opt(optional, minShouldMatch, disableCoord), prohibited);
    }

    Scorer req = excl(req(required, requiredScoring, true), prohibited);
    Scorer opt = opt(optional, minShouldMatch, true);

    if (disableCoord) {
      if (minShouldMatch > 0) {
        return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), 1F);
      } else {
        return new ReqOptSumScorer(req, opt);          
      }
    } else if (optional.size() == 1) {
      if (minShouldMatch > 0) {
        return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), coord(requiredScoring.size()+1, maxCoord));
      } else {
        float coordReq = coord(requiredScoring.size(), maxCoord);
        float coordBoth = coord(requiredScoring.size() + 1, maxCoord);
        return new BooleanTopLevelScorers.ReqSingleOptScorer(req, opt, coordReq, coordBoth);
      }
    } else {
      if (minShouldMatch > 0) {
        return new BooleanTopLevelScorers.CoordinatingConjunctionScorer(this, coords, req, requiredScoring.size(), opt);
      } else {
        return new BooleanTopLevelScorers.ReqMultiOptScorer(req, opt, requiredScoring.size(), coords); 
      }
    }
  }

BooleanWeight's scorer loops over the sub-queries and calls each corresponding Weight's scorer function; assume a TermWeight.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer

    public Scorer scorer(LeafReaderContext context) throws IOException {
      final TermsEnum termsEnum = getTermsEnum(context);
      PostingsEnum docs = termsEnum.postings(null, needsScores ? PostingsEnum.FREQS : PostingsEnum.NONE);
      return new TermScorer(this, docs, similarity.simScorer(stats, context));
    }

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->getTermsEnum

    private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
      final TermState state = termStates.get(context.ord);
      final TermsEnum termsEnum = context.reader().terms(term.field())
          .iterator();
      termsEnum.seekExact(term.bytes(), state);
      return termsEnum;
    }

First, the TermState obtained during the earlier lookup is fetched; the iterator function returns a SegmentTermsEnum. SegmentTermsEnum's seekExact essentially installs that previously computed TermState; the details are left to the next chapter.

Back in TermWeight's scorer, SegmentTermsEnum's postings function is called next, which ends up in Lucene50PostingsReader's postings function.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->Lucene50PostingsReader::postings

  public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, PostingsEnum reuse, int flags) throws IOException {

    boolean indexHasPositions = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    boolean indexHasOffsets = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
    boolean indexHasPayloads = fieldInfo.hasPayloads();

    if (indexHasPositions == false || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false) {
      BlockDocsEnum docsEnum;
      if (reuse instanceof BlockDocsEnum) {
        ...
      } else {
        docsEnum = new BlockDocsEnum(fieldInfo);
      }
      return docsEnum.reset((IntBlockTermState) termState, flags);
    } else if ((indexHasOffsets == false || PostingsEnum.featureRequested(flags, PostingsEnum.OFFSETS) == false) &&
               (indexHasPayloads == false || PostingsEnum.featureRequested(flags, PostingsEnum.PAYLOADS) == false)) {
        ...
    } else {
        ...
    }
  }

First, the index options recorded for the field are examined. Assume the first if branch is taken; since the reuse parameter is null by default, a new BlockDocsEnum is created and initialized via reset.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->BM25Similarity::simScorer

  // BM25Similarity
  public final SimScorer simScorer(SimWeight stats, LeafReaderContext context) throws IOException {
    BM25Stats bm25stats = (BM25Stats) stats;
    return new BM25DocScorer(bm25stats, context.reader().getNormValues(bm25stats.field));
  }

  // CodecReader (SegmentReader's base class)
  public final NumericDocValues getNormValues(String field) throws IOException {
    ensureOpen();
    Map<String,NumericDocValues> normFields = normsLocal.get();

    NumericDocValues norms = normFields.get(field);
    if (norms != null) {
      return norms;
    } else {
      FieldInfo fi = getFieldInfos().fieldInfo(field);
      if (fi == null || !fi.hasNorms()) {
        return null;
      }
      norms = getNormsReader().getNorms(fi);
      normFields.put(field, norms);
      return norms;
    }
  }

getNormsReader returns the Lucene53NormsProducer, whose getNorms function uses the field info plus the .nvd and .nvm files to create and return a NumericDocValues. simScorer finally creates and returns a BM25DocScorer.

Back in TermWeight's scorer, a TermScorer is finally created and returned.

Back in BooleanWeight's scorer: if the number of SHOULD (optional) Scorers equals minShouldMatch, then even though the clauses are SHOULD they must all be satisfied, so they are moved into the required lists. Next, if both required and optional are empty there are no query conditions at all and null is returned; likewise, if there are fewer optional Scorers than minShouldMatch, the SHOULD clauses cannot possibly be satisfied and null is returned. Further down, if optional is empty there are no SHOULD Scorers, so req wraps the MUST Scorers and excl excludes the MUST_NOT Scorers; conversely, if required is empty there are no MUST Scorers, and opt wraps the SHOULD Scorers.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->req

  private Scorer req(List<Scorer> required, List<Scorer> requiredScoring, boolean disableCoord) {
    if (required.size() == 1) {
      Scorer req = required.get(0);

      if (needsScores == false) {
        return req;
      }

      if (requiredScoring.isEmpty()) {
        return new FilterScorer(req) {
          @Override
          public float score() throws IOException {
            return 0f;
          }
          @Override
          public int freq() throws IOException {
            return 0;
          }
        };
      }

      float boost = 1f;
      if (disableCoord == false) {
        boost = coord(1, maxCoord);
      }
      if (boost == 1f) {
        return req;
      }
      return new BooleanTopLevelScorers.BoostedScorer(req, boost);
    } else {
      return new ConjunctionScorer(this, required, requiredScoring,
                                   disableCoord ? 1.0F : coord(requiredScoring.size(), maxCoord));
    }
  }

If there is more than one required Scorer, a ConjunctionScorer is created and returned. Otherwise, with a single required Scorer: if requiredScoring is empty, the clause does not participate in scoring and a FilterScorer returning score 0 is used; if a coordination boost applies, a BoostedScorer is returned; in the remaining cases, the Scorer itself is returned directly.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->excl

  private Scorer excl(Scorer main, List<Scorer> prohibited) throws IOException {
    if (prohibited.isEmpty()) {
      return main;
    } else if (prohibited.size() == 1) {
      return new ReqExclScorer(main, prohibited.get(0));
    } else {
      float coords[] = new float[prohibited.size()+1];
      Arrays.fill(coords, 1F);
      return new ReqExclScorer(main, new DisjunctionSumScorer(this, prohibited, coords, false));
    }
  }

Depending on whether there are MUST_NOT Scorers, excl either wraps the main Scorer in a ReqExclScorer (meaning the prohibited Scorers' matches must be excluded) or returns it unchanged.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->opt

  private Scorer opt(List<Scorer> optional, int minShouldMatch, boolean disableCoord) throws IOException {
    if (optional.size() == 1) {
      Scorer opt = optional.get(0);
      if (!disableCoord && maxCoord > 1) {
        return new BooleanTopLevelScorers.BoostedScorer(opt, coord(1, maxCoord));
      } else {
        return opt;
      }
    } else {
      float coords[];
      if (disableCoord) {
        coords = new float[optional.size()+1];
        Arrays.fill(coords, 1F);
      } else {
        coords = this.coords;
      }
      if (minShouldMatch > 1) {
        return new MinShouldMatchSumScorer(this, optional, minShouldMatch, coords);
      } else {
        return new DisjunctionSumScorer(this, optional, coords, needsScores);
      }
    }
  }

Analogously to req, opt wraps the Scorers into a BoostedScorer, MinShouldMatchSumScorer, or DisjunctionSumScorer depending on the situation, or returns the single Scorer directly.

Back in BooleanWeight's scorer, the Scorers are then further wrapped, based on disableCoord, minShouldMatch, and optional, into a ConjunctionScorer, ReqOptSumScorer, ReqSingleOptScorer, CoordinatingConjunctionScorer, or ReqMultiOptScorer.

Back in Weight's bulkScorer (and BooleanWeight's bulkScorer), the result of the scorer function is wrapped in a DefaultBulkScorer and returned.

Going back up to IndexSearcher's search function, the score function of the freshly created DefaultBulkScorer is called.

IndexSearcher::search->searchAfter->search->search->search->DefaultBulkScorer::score

  public void score(LeafCollector collector, Bits acceptDocs) throws IOException {
    final int next = score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS);
    assert next == DocIdSetIterator.NO_MORE_DOCS;
  }

  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    collector.setScorer(scorer);
    if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
      scoreAll(collector, iterator, twoPhase, acceptDocs);
      return DocIdSetIterator.NO_MORE_DOCS;
    } else {
      ...
    }
  }

  static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
    if (twoPhase == null) {
      for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
        if (acceptDocs == null || acceptDocs.get(doc)) {
          collector.collect(doc);
        }
      }
    } else {
      ...
    }
  }

By default every matching document needs to be scored, so we follow scoreAll: it iterates over the DocIdSetIterator and, for each live document, calls ScorerLeafCollector's collect function to score and collect it.

IndexSearcher::search->searchAfter->search->search->search->DefaultBulkScorer::score->score->scoreAll->ScorerLeafCollector::collect

        public void collect(int doc) throws IOException {
          float score = scorer.score();

          totalHits++;
          if (score <= pqTop.score) {
            return;
          }
          pqTop.doc = doc + docBase;
          pqTop.score = score;
          pqTop = pq.updateTop();
        }

Assume the scorer member is a ReqOptSumScorer; its score function computes the document's score (this post does not cover Lucene's scoring algorithm, so we do not descend further). The priority queue pq is pre-filled with sentinel entries, so pqTop always holds the currently weakest slot: a score no better than pqTop.score is discarded; otherwise the doc (offset by docBase to make it index-global) and score overwrite pqTop, and updateTop restores the queue ordering.

Returning one level up in search, we now look at the CollectorManager's reduce function.

IndexSearcher::search->searchAfter->search->CollectorManager::reduce

      public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
        final TopDocs[] topDocs = new TopDocs[collectors.size()];
        int i = 0;
        for (TopScoreDocCollector collector : collectors) {
          topDocs[i++] = collector.topDocs();
        }
        return TopDocs.merge(cappedNumHits, topDocs);
      }

Here the collectors are iterated, and each SimpleTopScoreDocCollector's topDocs function is called to extract its top documents.

IndexSearcher::search->searchAfter->search->CollectorManager::reduce->SimpleTopScoreDocCollector::topDocs

  public TopDocs topDocs() {
    return topDocs(0, topDocsSize());
  }

  public TopDocs topDocs(int start, int howMany) {

    int size = topDocsSize();
    if (start < 0 || start >= size || howMany <= 0) {
      return newTopDocs(null, start);
    }

    howMany = Math.min(size - start, howMany);
    ScoreDoc[] results = new ScoreDoc[howMany];

    for (int i = pq.size() - start - howMany; i > 0; i--) { pq.pop(); }

    populateResults(results, howMany);

    return newTopDocs(results, start);
  }

topDocsSize returns the number of collected hits; pop discards the ScoreDocs that are not needed, populateResults copies the remaining ScoreDocs into results, and newTopDocs finally wraps the ScoreDoc array in a TopDocs. (A consumption sketch follows below.)

Back in the CollectorManager's reduce function, merge finally combines the TopDocs produced by the different TopScoreDocCollectors.

IndexSearcher::search->searchAfter->search->CollectorManager::reduce->TopDocs::merge

  public static TopDocs merge(int topN, TopDocs[] shardHits) throws IOException {
    return merge(0, topN, shardHits);
  }

  public static TopDocs merge(int start, int topN, TopDocs[] shardHits) throws IOException {
    return mergeAux(null, start, topN, shardHits);
  }

  private static TopDocs mergeAux(Sort sort, int start, int size, TopDocs[] shardHits) throws IOException {
    final PriorityQueue<ShardRef> queue;
    if (sort == null) {
      queue = new ScoreMergeSortQueue(shardHits);
    } else {
      queue = new MergeSortQueue(sort, shardHits);
    }

    int totalHitCount = 0;
    int availHitCount = 0;
    float maxScore = Float.MIN_VALUE;
    for(int shardIDX=0;shardIDX<shardHits.length;shardIDX++) {
      final TopDocs shard = shardHits[shardIDX];
      totalHitCount += shard.totalHits;
      if (shard.scoreDocs != null && shard.scoreDocs.length > 0) {
        availHitCount += shard.scoreDocs.length;
        queue.add(new ShardRef(shardIDX));
        maxScore = Math.max(maxScore, shard.getMaxScore());
      }
    }

    if (availHitCount == 0) {
      maxScore = Float.NaN;
    }

    final ScoreDoc[] hits;
    if (availHitCount <= start) {
      hits = new ScoreDoc[0];
    } else {
      hits = new ScoreDoc[Math.min(size, availHitCount - start)];
      int requestedResultWindow = start + size;
      int numIterOnHits = Math.min(availHitCount, requestedResultWindow);
      int hitUpto = 0;
      while (hitUpto < numIterOnHits) {
        ShardRef ref = queue.top();
        final ScoreDoc hit = shardHits[ref.shardIndex].scoreDocs[ref.hitIndex++];
        hit.shardIndex = ref.shardIndex;
        if (hitUpto >= start) {
          hits[hitUpto - start] = hit;
        }
        hitUpto++;
        if (ref.hitIndex < shardHits[ref.shardIndex].scoreDocs.length) {
          queue.updateTop();
        } else {
          queue.pop();
        }
      }
    }

    if (sort == null) {
      return new TopDocs(totalHitCount, hits, maxScore);
    } else {
      return new TopFieldDocs(totalHitCount, hits, sort.getSort(), maxScore);
    }
  }

In short, mergeAux merges multiple TopDocs into a single TopDocs, using a priority queue over the per-shard hits, and returns it.

This concludes the walkthrough of the overall flow of IndexSearcher's search function. The next chapter begins analyzing the functions that read Lucene's index files.
