Lucene Index Source Code Analysis, Part 2

Latest recommended article published 2023-10-24 23:12:21

jj380382856

Category: lucene. Tags: indexing, lucene

Article link: https://blog.csdn.net/jj380382856/article/details/52688245

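This article takes a close look at how terms are handled during Lucene indexing, analyzing how TermsHashPerField and the in-memory buffers (bytePool, intPool) are used to build the index. Fields within a document are processed serially, while different documents are processed in parallel. The key is understanding the relationship among postingsArray, intPool, and bytePool, which together determine the segment information that gets written. The code analysis covers the addTerm() and newTerm() methods and their roles in FreqProxTermsWriter and TermVectorsConsumer.

(Summary generated automatically by CSDN.)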

The previous article outlined how indexing flows from IndexWriter to DefaultIndexingChain and walked through the basic flow of DefaultIndexingChain.

The main steps are:

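Process the documents received by a DWPT one at a time ---> within each document, process the fields in order ---> handle each field according to whether it is tokenized, whether it is stored, and whether it has doc values.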
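The flow above can be pictured with a small, self-contained sketch. This is not Lucene code: `FieldDispatchSketch`, `SketchField`, and the three boolean flags are hypothetical names invented for illustration, and the recorded step strings merely stand in for the real inversion, stored-field, and doc-values paths.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldDispatchSketch {

    // Hypothetical stand-in for a field and its type flags (not Lucene's IndexableField).
    public static final class SketchField {
        final String name;
        final boolean indexed;
        final boolean stored;
        final boolean hasDocValues;

        public SketchField(String name, boolean indexed, boolean stored, boolean hasDocValues) {
            this.name = name;
            this.indexed = indexed;
            this.stored = stored;
            this.hasDocValues = hasDocValues;
        }
    }

    // Walks the fields of one document in order and records which handling steps apply.
    public static List<String> processDocument(List<SketchField> fields) {
        List<String> steps = new ArrayList<>();
        for (SketchField f : fields) {
            if (f.indexed) {
                steps.add("invert:" + f.name);      // build postings from the field's tokens
            }
            if (f.stored) {
                steps.add("store:" + f.name);       // keep the original value for retrieval
            }
            if (f.hasDocValues) {
                steps.add("docvalues:" + f.name);   // column-style per-document value
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        List<SketchField> doc = List.of(
            new SketchField("title", true, true, false),
            new SketchField("price", false, false, true));
        System.out.println(processDocument(doc));
    }
}
```

A single field can take more than one path: in this sketch `title` is both inverted and stored, while `price` only contributes a doc value.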

So the DWPTs work in parallel with one another, while the work inside each DWPT is done serially.
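As a rough illustration of "parallel across DWPTs, serial inside each", here is a minimal sketch under stated assumptions. `DwptSketch` and `Worker` are hypothetical names, a plain list stands in for the in-memory index buffer, and Lucene's real thread-to-DWPT assignment is more involved than one task per batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DwptSketch {

    // Hypothetical stand-in for a DWPT: it owns a private buffer and
    // consumes its documents strictly one after another.
    static final class Worker {
        final List<String> buffer = new ArrayList<>();

        void addDocuments(List<String> docs) {
            for (String doc : docs) {
                buffer.add(doc); // serial: no other thread touches this buffer
            }
        }
    }

    // Runs one worker per batch concurrently and returns each worker's buffer.
    public static List<List<String>> indexInParallel(List<List<String>> batches) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, batches.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> batch : batches) {
                futures.add(pool.submit(() -> {
                    Worker worker = new Worker();
                    worker.addDocuments(batch);
                    return worker.buffer;
                }));
            }
            List<List<String>> buffers = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                buffers.add(future.get()); // wait for each worker to finish
            }
            return buffers;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexInParallel(List.of(List.of("d1", "d2"), List.of("d3"))));
    }
}
```

Because each worker writes only to its own buffer, no locking is needed on the per-document path in this sketch; coordination is only needed when the buffers are collected.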

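Processing a field comes down to writing its termsHashPerField data, which is managed centrally by termsHash. You can think of termsHash as a buffer shared within one DWPT: it is used to build the index in memory, and its contents are written to disk when a flush is needed. Each DWPT that is flushed to disk becomes, in effect, one segment; depending on how segment sizes are configured, segment merging may be needed afterwards as well. That is a topic for later, when we analyze flush in its own article. Here we pick up where the previous article left off and continue with pf.invert(). The code is as follows: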

 public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
      if (first) {
        // First time we're seeing this field (indexed) in
        // this document:
        invertState.reset();      // first occurrence of this field in the document: reset the invert state
      }
      
      IndexableFieldType fieldType = field.fieldType();
      
      IndexOptions indexOptions = fieldType.indexOptions();
      fieldInfo.setIndexOptions(indexOptions);
      
      if (fieldType.omitNorms()) {
        fieldInfo.setOmitsNorms();
      }
      
      final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
      
      // only bother checking offsets if something will consume them.
      // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
      final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
      
      /*
       * To assist people in tracking down problems in analysis components, we wish to write the field name to the
       * infostream when we fail. We expect some caller to eventually deal with the real exception, so we don't want any
       * 'catch' clauses, but rather a finally that takes note of the problem.
       */
      boolean succeededInProcessingField = false;
      try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {    // obtain the token stream for the field's content
        // reset the TokenStream to the first token
        stream.reset();
        invertState.setAttributeSource(stream);
        termsHashPerField.start(field, first);
        
        while (stream.incrementToken()) {   // process each term produced from the field's content
          
          // If we hit an exception in stream.next below
          // (which is fairly common, e.g. if analyzer
          // chokes on a given document), then it's
          // non-aborting and (above) this one document
          // will be marked as deleted, but still
          // consume a docID
          
          int posIncr = invertState.posIncrAttribute.getPositionIncrement();
          invertState.position += posIncr;
          if (invertState.position < invertState.lastPosition) {