lucene源码分析---12

最新推荐文章于 2023-04-19 17:18:00 发布

二侠

最新推荐文章于 2023-04-19 17:18:00 发布

阅读量2.7k

点赞数

分类专栏： lucene-6.1.0源码分析

本文链接：https://blog.csdn.net/conansonic/article/details/53138111

版权

lucene-6.1.0源码分析专栏收录该内容

15 篇文章 12 订阅

订阅专栏

lucene源码分析—PhraseQuery

本章开始查看PhraseQuery的源码，PhraseQuery顾名思义是短语查询，先来看PhraseQuery是如何构造的，

        PhraseQuery.Builder queryBuilder = new PhraseQuery.Builder();
        queryBuilder.add(new Term("body","hello"));
        queryBuilder.add(new Term("body","world"));
        queryBuilder.setSlop(100);
        PhraseQuery query = queryBuilder.build();

首先创建PhraseQuery.Builder，然后向其添加短语中的每个词Term，PhraseQuery要求Term的数量至少为2个，这很容易理解，例子中的“body”表示域名，“hello”和“world”代表词的内容。然后通过setSlop设置词的最大间距，最后通过build函数创建PhraseQuery 。

创建完PhraseQuery后，就调用IndexSearcher的search函数进行查询，最终会执行到如下代码，

  public void search(Query query, Collector results)
    throws IOException {
    search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
  }

createNormalizedWeight函数找出短语所在文档的整体信息，然后通过search函数对每个文档进行评分，筛选出最后的文档。

IndexSearcher::createNormalizedWeight

  public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
    query = rewrite(query);
    Weight weight = createWeight(query, needsScores);

    ...

    return weight;
  }

对PhraseQuery而言，rewrite函数什么也不做，即PhraseQuery不需要改写。createNormalizedWeight函数接下来通过createWeight函数创建Weight，封装短语的整体信息。

IndexSearcher::createNormalizedWeight->createWeight

  public Weight createWeight(Query query, boolean needsScores) throws IOException {

    ...

    Weight weight = query.createWeight(this, needsScores);

    ...

    return weight;
  }

  public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
    return new PhraseWeight(searcher, needsScores);
  }

createWeight函数最终会创建PhraseWeight。

IndexSearcher::createNormalizedWeight->createWeight->PhraseQuery::createWeight->PhraseWeight::PhraseWeight

  public PhraseWeight(IndexSearcher searcher, boolean needsScores)
    throws IOException {
    final int[] positions = PhraseQuery.this.getPositions();
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    final IndexReaderContext context = searcher.getTopReaderContext();
    states = new TermContext[terms.length];
    TermStatistics termStats[] = new TermStatistics[terms.length];
    for (int i = 0; i < terms.length; i++) {
      final Term term = terms[i];
      states[i] = TermContext.build(context, term);
      termStats[i] = searcher.termStatistics(term, states[i]);
    }
    stats = similarity.computeWeight(searcher.collectionStatistics(field), termStats);
  }

getPositions函数返回查询的每个词在查询短语中的位置，例如查询“hello world”，则getPositions返回[0，1]。
getSimilarity默认返回BM25Similarity，用于评分，开发者可以定义自己的Similarity并设置进IndexerSearch中，这里就会返回自定义的Similarity。
getTopReaderContext默认返回CompositeReaderContext。再往下，遍历每个查询的词Term，通过TermContext的build函数获取每个词的整体信息，然后通过termStatistics函数将部分信息封装成TermStatistics。
最后通过Similarity的computeWeight函数计算所有词的整体信息保存在stats中。
这些函数在前面的章节已经分析过了，和PhraseQuery并无关系，这里就不往下看了。

回到一开始的search函数中，接下来继续通过search函数进行搜索。

IndexSearcher::search

  protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {
    for (LeafReaderContext ctx : leaves) {
      final LeafCollector leafCollector = collector.getLeafCollector(ctx);
      BulkScorer scorer = weight.bulkScorer(ctx);
      scorer.score(leafCollector, ctx.reader().getLiveDocs());
    }
  }

getLeafCollector返回的LeafCollector用来打分并对评分后的文档进行排序。
bulkScorer返回的BulkScorer用来为多个文档打分。
最终通过BulkScorer的scorer函数进行打分并获得最终的结果。

IndexSearcher::search->PhraseWeight::bulkScorer

  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    Scorer scorer = scorer(context);
    return new DefaultBulkScorer(scorer);
  }

scorer函数最终会调用PhraseWeight的scorer函数获得Scorer，然后通过DefaultBulkScorer进行封装和处理。

IndexSearcher::search->PhraseWeight::bulkScorer->scorer

    public Scorer scorer(LeafReaderContext context) throws IOException {
      final LeafReader reader = context.reader();
      PostingsAndFreq[] postingsFreqs = new PostingsAndFreq[terms.length];

      final Terms fieldTerms = reader.terms(field);
      final TermsEnum te = fieldTerms.iterator();
      float totalMatchCost = 0;

      for (int i = 0; i < terms.length; i++) {
        final Term t = terms[i];
        final TermState state = states[i].get(context.ord);
        te.seekExact(t.bytes(), state);
        PostingsEnum postingsEnum = te.postings(null, PostingsEnum.POSITIONS);
        postingsFreqs[i] = new PostingsAndFreq(postingsEnum, positions[i], t);
        totalMatchCost += termPositionsCost(te);
      }

      return new SloppyPhraseScorer(this, postingsFreqs, slop,
                                        similarity.simScorer(stats, context),
                                        needsScores, totalMatchCost);
    }

terms获得FieldReader，并通过其iterator函数创建SegmentTermsEnum，该过程在前面的章节有叙述。
接下来遍历待搜索的词Term，针对每个Term，通过SegmentTermsEnum的postings函数返回倒排列表对应的.doc、.pos和.pay文件的读取信息，并根据.tip文件中的索引数据设置这三个文件的输出流指针。封装成PostingsAndFreq。
假设slop>0，scorer函数最后创建SloppyPhraseScorer并返回，SloppyPhraseScorer用于计算短语查询的最终得分。

IndexSearcher::search->PhraseWeight::bulkScorer->DefaultBulkScorer

    public DefaultBulkScorer(Scorer scorer) {
      this.scorer = scorer;
      this.iterator = scorer.iterator();
      this.twoPhase = scorer.twoPhaseIterator();
    }

DefaultBulkScorer构造函数中的iterator和twoPhaseIterator函数都是创建迭代器，用于获取到下一个文档ID，还可以计算词位置登信息。

回到IndexSearcher函数中，接下来通过刚刚创建的DefaultBulkScorer的score函数计算最终的文档得分并排序，DefaultBulkScorer的score函数最终会调用到scoreAll函数。

IndexSearcher::search->DefaultBulkScorer::score->score->scoreAll

    static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
      final DocIdSetIterator approximation = twoPhase.approximation();
      for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
        if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
          collector.collect(doc);
        }
      }
    }

approximation是前面在DefaultBulkScorer的构造函数中创建的继承自DocIdSetIterator的ConjunctionDISI，通过其nextDoc函数获取下一个文档ID，然后调用TwoPhaseIterator的matches信息获取词的位置信息，最后通过LeafCollector的collect函数完成最终的打分并排序。

IndexSearcher::search->DefaultBulkScorer::score->score->scoreAll->ConjunctionDISI::nextDoc

  public int nextDoc() throws IOException {
    return doNext(lead.nextDoc());
  }

这里的变量lead实质上未BlockPostingsEnum，其nextDoc获得.doc文件中对应的倒排列表中的下一个文档ID。然后通过doNext函数查找下一个文档ID，对应的文档包含了查询短语中的所有词，如果不包含，则要继续查找下一个文档。

IndexSearcher::search->DefaultBulkScorer::score->score->scoreAll->ConjunctionDISI::nextDoc->doNext

  private int doNext(int doc) throws IOException {
    for(;;) {
      advanceHead: for(;;) {
        for (DocIdSetIterator other : others) {
          if (other.docID() < doc) {
            final int next = other.advance(doc);

            if (next > doc) {
              doc = lead.advance(next);
              break advanceHead;
            }
          }
        }
        return doc;
      }
    }
  }

others是除了lead外的所有词对应的DocIdSetIterator列表，例如查找“hello world champion”，则假设此时lead为“hello”，others对应的DocIdSetIterator列表则分别包含“world”和“champion”，advance函数从other中找到一个等于或大于当前文档ID的文档，这里有一个默认条件，就是lead中的文档ID是最小的，因此当others中的所有文档ID不大于当前文档ID，就说明该文档ID包含了所有的词，返回当前文档ID。如果某个other的文档ID大于了当前的文档ID，则要重新计算当前文档ID，并继续循环直到找到为止。

获取到文档ID后，就要通过phraseFreq计算词的位置信息了。

IndexSearcher::search->DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq

  private float phraseFreq() throws IOException {
    if (!initPhrasePositions()) {
      return 0.0f;
    }
    float freq = 0.0f;
    numMatches = 0;
    PhrasePositions pp = pq.pop();
    int matchLength = end - pp.position;
    int next = pq.top().position; 
    while (advancePP(pp)) {
      if (pp.position > next) {
        if (matchLength <= slop) {
          freq += docScorer.computeSlopFactor(matchLength);
          numMatches++;
        }      
        pq.add(pp);
        pp = pq.pop();
        next = pq.top().position;
        matchLength = end - pp.position;
      } else {
        int matchLength2 = end - pp.position;
        if (matchLength2 < matchLength) {
          matchLength = matchLength2;
        }
      }
    }
    if (matchLength <= slop) {
      freq += docScorer.computeSlopFactor(matchLength);
      numMatches++;
    }    
    return freq;
  }

省略的部分代码用于处理重复的词，例如当查询“hello hello world”时，就会出现两个重复的“hello”。initPhrasePositions用来初始化。matchLength表示计算得到的两个词的距离，当matchLength小于slop时，调用computeSlopFactor计算得分并累加，公式为1/(distance+1)。
这里不逐行逐字地分析这段代码了，举个例子就明白了，假设一篇文档如下
“…hello(5)…world(10)…world(13)…hello(21)…”
查询短语为“hello world”。
则首先找到第一个hello和第一个world，计算得到matchLength为10-5=5，

IndexSearcher::search->DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions

  private boolean initPhrasePositions() throws IOException {
    end = Integer.MIN_VALUE;
    if (!checkedRpts) {
      return initFirstTime();
    }
    if (!hasRpts) {
      initSimple();
      return true;
    }
    return initComplex();
  }

DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions->initFirstTime

  private boolean initFirstTime() throws IOException {
    checkedRpts = true;
    placeFirstPositions();

    LinkedHashMap<Term,Integer> rptTerms = repeatingTerms(); 
    hasRpts = !rptTerms.isEmpty();

    if (hasRpts) {
      rptStack = new PhrasePositions[numPostings];
      ArrayList<ArrayList<PhrasePositions>> rgs = gatherRptGroups(rptTerms);
      sortRptGroups(rgs);
      if (!advanceRepeatGroups()) {
        return false;
      }
    }

    fillQueue();
    return true;
  }

checkedRpts设置为true，表示初始化过了。
placeFirstPositions函数初始化位置信息。
repeatingTerms函数获取重复的词。
如果存在重复的词，则此时hasRpts为true，则要进行相应的处理，本章不考虑这种情况。
最后调用fillQueue函数设置end成员变量，end表示当前最大的位置。

DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions->initFirstTime->placeFirstPositions

  private void placeFirstPositions() throws IOException {
    for (PhrasePositions pp : phrasePositions) {
      pp.firstPosition();
    }
  }

phrasePositions对应每个Term的PhrasePositions，内部封装了倒排列表信息，firstPosition函数初始化位置信息，内部会读取.pos文件。

DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions->initFirstTime->placeFirstPositions->PhrasePositions::firstPosition

  final void firstPosition() throws IOException {
    count = postings.freq();
    nextPosition();
  }

  final boolean nextPosition() throws IOException {
    if (count-- > 0) {
      position = postings.nextPosition() - offset;
      return true;
    } else
      return false;
  }

freq表示当前有多少可以读取的位置信息。
然后调用nextPosition获得下一个位置信息，进而调用BlockPostingsEnum的nextPosition函数获取下一个位置信息，offset是起始的位移，例如查询“abc def”，当获得abc所在文档的位置时，offset为0，当获得def所在文档的位置时，offset为1。

DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions->initFirstTime->placeFirstPositions->PhrasePositions::firstPosition->nextPosition->BlockPostingsEnum::nextPosition

    public int nextPosition() throws IOException {

      if (posPendingFP != -1) {
        posIn.seek(posPendingFP);
        posPendingFP = -1;
        posBufferUpto = BLOCK_SIZE;
      }

      if (posPendingCount > freq) {
        skipPositions();
        posPendingCount = freq;
      }

      if (posBufferUpto == BLOCK_SIZE) {
        refillPositions();
        posBufferUpto = 0;
      }
      position += posDeltaBuffer[posBufferUpto++];
      posPendingCount--;
      return position;
    }

posPendingFP是从.tip文件中读取得到的.pos文件的指针。
通过seek函数移动到读取位置posPendingFP，然后将posPendingFP指针重置。
设置posBufferUpto为BLOCK_SIZE后面会调用refillPositions将.pos文件中的信息读入缓存中。
posPendingCount变量表示待读取的词位置的个数，假设单词“abc”在文档1中的频率为2，在文档2中的频率为3，当读取完文档1中的频率2而没有处理，再读取文档2中的频率3，此时posPendingCount为2+3=5。
当posPendingCount大于当前文档的频率freq时，则调用skipPositions跳过之前读取的信息。
posBufferUpto等于BLOCK_SIZE表示缓存已经读到末尾，或者表示首次读取缓存，此时通过refillPositions函数将.pos文件中的信息读取到缓存中，并将缓存的指针posBufferUpto设置为0。
最后返回的position是从缓存posDeltaBuffer读取到的最终的位置，并将待读取的个数posPendingCount减1。

    private void skipPositions() throws IOException {
      int toSkip = posPendingCount - freq;

      final int leftInBlock = BLOCK_SIZE - posBufferUpto;
      if (toSkip < leftInBlock) {
        posBufferUpto += toSkip;
      } else {
        toSkip -= leftInBlock;
        while(toSkip >= BLOCK_SIZE) {
          forUtil.skipBlock(posIn);
          toSkip -= BLOCK_SIZE;
        }
        refillPositions();
        posBufferUpto = toSkip;
      }

      position = 0;
    }

toSkip计算需要跳过多少。
leftInBlock表示当前缓存中的剩余空间。
如果toSkip小于leftInBlock则表示缓存中还有富余空间，此时直接更新缓存的指针posBufferUpto即可。
如果toSkip大于等于leftInBlock，则先减去当前缓存的剩余空间，然后判断剩余的toSkip是否大于缓存的最大空间BLOCK_SIZE，如果大于，就循环将文件指针向前移动BLOCK_SIZE，再将toSkip减去BLOCK_SIZE，知道toSkip小于BLOCK_SIZE，此时通过refillPositions函数再向缓存填充BLOCK_SIZE大小的数据，设置文件指针为toSkip。
因为位置信息存储的是差值，因此最后将位置position重置为0。

DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions->initFirstTime->repeatingTerms

  private LinkedHashMap<Term,Integer> repeatingTerms() {
    LinkedHashMap<Term,Integer> tord = new LinkedHashMap<>();
    HashMap<Term,Integer> tcnt = new HashMap<>();
    for (PhrasePositions pp : phrasePositions) {
      for (Term t : pp.terms) {
        Integer cnt0 = tcnt.get(t);
        Integer cnt = cnt0==null ? new Integer(1) : new Integer(1+cnt0.intValue());
        tcnt.put(t, cnt);
        if (cnt==2) {
          tord.put(t,tord.size());
        }
      }
    }
    return tord;
  }

PhrasePositions的terms成员变量保存了该PhrasePositions的词，这里其实是遍历搜索的所有词，将重复的词放入最终返回的HashMap中。

DefaultBulkScorer::score->score->scoreAll->TwoPhaseIterator::matches->phraseFreq->initPhrasePositions->initFirstTime->fillQueue

  private void fillQueue() {
    pq.clear();
    for (PhrasePositions pp : phrasePositions) {
      if (pp.position > end) {
        end = pp.position;
      }
      pq.add(pp);
    }
  }

fillQueue依次读取每个PhrasePositions，将第一个Term在文档中出现位置的最大值保存在end中。

二侠

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
lucene源码分析---12

lucene源码分析—PhraseQuery本章开始查看PhraseQuery的源码，PhraseQuery顾名思义是短语查询，先来看PhraseQuery是如何构造的， PhraseQuery.Builder queryBuilder = new PhraseQuery.Builder(); queryBuilder.add(new Term("body",
复制链接

扫一扫