From the earlier analysis of how Lucene adds a document, we already know that adding a document breaks down into adding its fields, and that adding a field is where the inverted index is built. This article therefore takes field addition as the entry point for analyzing the inverted-indexing process. Let's start with the entry method for processing a field:
private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
  String fieldName = field.name();
  IndexableFieldType fieldType = field.fieldType();
  PerField fp = null;

  if (fieldType.indexOptions() == null) {
    throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
  }

  // Invert indexed fields:
  if (fieldType.indexOptions() != IndexOptions.NONE) {
    // if the field omits norms, the boost cannot be indexed.
    if (fieldType.omitNorms() && field.boost() != 1.0f) {
      throw new UnsupportedOperationException("You cannot set an index-time boost: norms are omitted for field '" + field.name() + "'");
    }
    fp = getOrAddField(fieldName, fieldType, true);
    boolean first = fp.fieldGen != fieldGen;
    fp.invert(field, first);
    if (first) {
      fields[fieldCount++] = fp;
      fp.fieldGen = fieldGen;
    }
  } else {
    verifyUnIndexedFieldType(fieldName, fieldType);
  }

  // Add stored fields:
  if (fieldType.stored()) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    if (fieldType.stored()) {
      String value = field.stringValue();
      if (value != null && value.length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
        throw new IllegalArgumentException("stored field \"" + field.name() + "\" is too large (" + value.length() + " characters) to store");
      }
      try {
        storedFieldsConsumer.writeField(fp.fieldInfo, field);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }
  }

  DocValuesType dvType = fieldType.docValuesType();
  if (dvType == null) {
    throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")");
  }

  if (dvType != DocValuesType.NONE) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    indexDocValue(fp, dvType, field);
  }

  if (fieldType.pointDimensionCount() != 0) {
    if (fp == null) {
      fp = getOrAddField(fieldName, fieldType, false);
    }
    indexPoint(fp, field);
  }

  return fieldCount;
}
1. Get the field name fieldName and the field type fieldType. fieldType carries the index option IndexOptions; there are five index options in total (a short configuration sketch follows this list):
a) NONE: do not index the field
b) DOCS: index document IDs only
c) DOCS_AND_FREQS: index document IDs and term frequencies
d) DOCS_AND_FREQS_AND_POSITIONS: index document IDs, term frequencies, and term positions
e) DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: index document IDs, term frequencies, term positions, and term offsets
2. If the index option is NONE, the field is not indexed at all, so it can never be used as a search condition to find documents. Otherwise the field must be inverted: the PerField for this fieldName is looked up in the PerField hash array (and created first if it does not exist yet); PerField holds the field's metadata together with its indexing state. fp.invert(field, first) is then called to build the inverted index, and the inverted data is kept inside the PerField. Inverted indexing is the focus of this article and is analyzed in detail below.
3. fieldType.stored() decides whether the field value should be stored. If so, the PerField for this fieldName is again fetched (or created) from the PerField hash array, and storedFieldsConsumer.writeField(fp.fieldInfo, field) writes the field value into the in-memory stored-fields buffer.
4. After all fields have been processed as above, finishStoredFields() is called so that the buffered stored-field data can be written out to the stored-fields files (the .fdt data file and its .fdx index).
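To make steps 1-3 concrete, here is a minimal, hypothetical usage sketch (not part of processField itself) showing how IndexOptions and stored() are typically configured on a field before it ever reaches this method. It assumes an already opened IndexWriter named writer and uses the Lucene 6.x field API; the field names and values are made up:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;

void addSampleDoc(IndexWriter writer) throws IOException {
  FieldType ft = new FieldType();
  ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); // steps 1-2: this field will be inverted
  ft.setTokenized(true);                                         // run the value through the analyzer
  ft.setStored(true);                                            // step 3: storedFieldsConsumer keeps the raw value
  ft.freeze();

  Document doc = new Document();
  doc.add(new Field("title", "Lucene in Action", ft));
  doc.add(new StoredField("isbn", "978-1933988177"));            // stored only: its IndexOptions is NONE
  writer.addDocument(doc);
}

The point is simply that every branch in processField is driven by flags on the field's FieldType.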
Next, let's analyze the inverted-indexing process itself:
public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
  if (first) {
    // First time we're seeing this field (indexed) in
    // this document:
    invertState.reset();
  }

  IndexableFieldType fieldType = field.fieldType();
  IndexOptions indexOptions = fieldType.indexOptions();
  fieldInfo.setIndexOptions(indexOptions);

  if (fieldType.omitNorms()) {
    fieldInfo.setOmitsNorms();
  }

  final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;

  // only bother checking offsets if something will consume them.
  // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
  final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

  /*
   * To assist people in tracking down problems in analysis components, we wish to write the field name to the infostream
   * when we fail. We expect some caller to eventually deal with the real exception, so we don't want any 'catch' clauses,
   * but rather a finally that takes note of the problem.
   */
  boolean succeededInProcessingField = false;
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
    // reset the TokenStream to the first token
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);

    while (stream.incrementToken()) {

      // If we hit an exception in stream.next below
      // (which is fairly common, e.g. if analyzer
      // chokes on a given document), then it's
      // non-aborting and (above) this one document
      // will be marked as deleted, but still
      // consume a docID

      int posIncr = invertState.posIncrAttribute.getPositionIncrement();
      invertState.position += posIncr;
      if (invertState.position < invertState.lastPosition) {
        if (posIncr == 0) {
          throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '" + field.name() + "'");
        } else if (posIncr < 0) {
          throw new IllegalArgumentException("position increment must be >= 0 (got " + posIncr + ") for field '" + field.name() + "'");
        } else {
          throw new IllegalArgumentException("position overflowed Integer.MAX_VALUE (got posIncr=" + posIncr + " lastPosition=" + invertState.lastPosition + " position=" + invertState.position + ") for field '" + field.name() + "'");
        }
      } else if (invertState.position > IndexWriter.MAX_POSITION) {
        throw new IllegalArgumentException("position " + invertState.position + " is too large for field '" + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
      }
      invertState.lastPosition = invertState.position;
      if (posIncr == 0) {
        invertState.numOverlap++;
      }

      if (checkOffsets) {
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
          throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
              + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset=" + invertState.lastStartOffset + " for field '" + field.name() + "'");
        }
        invertState.lastStartOffset = startOffset;
      }

      invertState.length++;
      if (invertState.length < 0) {
        throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
      }
      //System.out.println(" term=" + invertState.termAttribute);

      // If we hit an exception in here, we abort
      // all buffered documents since the last
      // flush, on the likelihood that the
      // internal state of the terms hash is now
      // corrupt and should not be flushed to a
      // new segment:
      try {
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        byte[] prefix = new byte[30];
        BytesRef bigTerm = invertState.termAttribute.getBytesRef();
        System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
        String msg = "Document contains at least one immense term in field=\"" + fieldInfo.name + "\" (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
        if (docState.infoStream.isEnabled("IW")) {
          docState.infoStream.message("IW", "ERROR: " + msg);
        }
        // Document will be deleted above:
        throw new IllegalArgumentException(msg, e);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }

    // trigger streams to perform end-of-stream operations
    stream.end();

    // TODO: maybe add some safety? then again, it's already checked
    // when we come back around to the field...
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();
    invertState.offset += invertState.offsetAttribute.endOffset();

    /* if there is an exception coming through, we won't set this to true here:*/
    succeededInProcessingField = true;

  } finally {
    if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
      docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
    }
  }

  if (analyzed) {
    invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
    invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
  }

  invertState.boost *= field.boost();
}
1. tokenStream = field.tokenStream(docState.analyzer, tokenStream) uses the analyzer to turn the field's value into a token stream. The TokenStream behaves much like an enumeration: each call to stream.incrementToken() advances to the next token. (A standalone sketch of this loop is given at the end of this section.)
2. FieldInvertState invertState and TermsHashPerField termsHashPerField are objects held by the PerField; invertState records how many tokens the field produced and their position information.
invertState.posIncrAttribute carries the position increment between the previous token and the current one. It is normally 1, meaning there is no gap between the two tokens. A value greater than 1 means there is a gap, typically left where the analyzer removed stop words. A value of 0 usually indicates a synonym stacked on the previous token; in that case invertState.numOverlap, the counter of overlapping tokens, is incremented. invertState.length counts the total number of tokens.
3. termsHashPerField.add() performs the actual term-indexing work, storing the indexed data in in-memory buffers.
a) bytesHash.add stores the term's bytes and their length in bytesHash and returns a termID. A negative termID means this term has been seen before (it is not being added for the first time).
b) newTerm(termID) creates the entry for a brand-new term: its term frequency, document ID, offset, and position are recorded in FreqProxPostingsArray, which maintains parallel arrays for exactly these statistics; the slot at index termID in each array holds the information for that one term. The FreqProxPostingsArray code is shown below, followed by a simplified model of the parallel-array idea:
static final class FreqProxPostingsArray extends ParallelPostingsArray {
  public FreqProxPostingsArray(int size, boolean writeFreqs, boolean writeProx, boolean writeOffsets) {
    super(size);
    if (writeFreqs) {
      termFreqs = new int[size];
    }
    lastDocIDs = new int[size];
    lastDocCodes = new int[size];
    if (writeProx) {
      lastPositions = new int[size];
      if (writeOffsets) {
        lastOffsets = new int[size];
      }
    } else {
      assert !writeOffsets;
    }
    //System.out.println("PA init freqs=" + writeFreqs + " pos=" + writeProx + " offs=" + writeOffsets);
  }

  int termFreqs[];      // # times this term occurs in the current doc
  int lastDocIDs[];     // Last docID where this term occurred
  int lastDocCodes[];   // Code for prior doc
  int lastPositions[];  // Last position where this term occurred
  int lastOffsets[];    // Last endOffset where this term occurred

  // ... (remaining members of FreqProxPostingsArray omitted)
}
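To make the parallel-array idea concrete, here is a deliberately simplified, hypothetical model (this is not Lucene code; a plain HashMap stands in for BytesRefHash) of how a termID is assigned the first time a term is seen and how per-term statistics live in arrays indexed by that termID:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ToyPostings {
  private final Map<String, Integer> termIds = new HashMap<>(); // stands in for bytesHash
  private int nextTermId = 0;
  private int[] termFreqs = new int[8];   // frequency of the term in the current doc
  private int[] lastDocIDs = new int[8];  // last docID in which the term occurred

  void add(String term, int docId) {
    Integer id = termIds.get(term);
    if (id == null) {                      // first occurrence ever -> analogous to newTerm(termID)
      id = nextTermId++;
      termIds.put(term, id);
      if (id == termFreqs.length) {        // grow the parallel arrays together
        termFreqs = Arrays.copyOf(termFreqs, id * 2);
        lastDocIDs = Arrays.copyOf(lastDocIDs, id * 2);
      }
      termFreqs[id] = 1;
      lastDocIDs[id] = docId;
    } else if (lastDocIDs[id] != docId) {  // first occurrence in this document
      termFreqs[id] = 1;
      lastDocIDs[id] = docId;
    } else {                               // repeated occurrence in the same document
      termFreqs[id]++;
    }
  }
}

The real FreqProxPostingsArray additionally tracks lastDocCodes, lastPositions and lastOffsets, and the encoded postings bytes live in separate byte pools, but the termID-indexed parallel-array layout is the same idea.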
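Finally, connecting back to points 1 and 2 above, the following standalone sketch uses the public analysis API (not the invert() internals) to run the same incrementToken() loop and print the position-increment and offset attributes that invertState tracks. The field name "body" and the sample text are made up, and it assumes Lucene 6.x, where StandardAnalyzer removes English stop words by default:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

static void showTokens() throws IOException {
  Analyzer analyzer = new StandardAnalyzer();
  try (TokenStream stream = analyzer.tokenStream("body", "The quick brown fox")) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncrAtt = stream.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
    stream.reset();
    int position = -1;
    while (stream.incrementToken()) {
      int posIncr = posIncrAtt.getPositionIncrement(); // 1 = adjacent, >1 = gap left by a removed stop word, 0 = overlapping token such as a synonym
      position += posIncr;
      System.out.println(term + " posIncr=" + posIncr + " position=" + position
          + " offsets=[" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");
    }
    stream.end();
  }
}

With the stop word "The" removed, the first emitted token "quick" carries posIncr=2, which is exactly the kind of gap described in point 2 above.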
Reference: Lucene索引过程分析 (Analysis of the Lucene Indexing Process)