Building Lucene's Inverted Index in Memory (Based on 7.3.1)

Latest recommended article published on 2024-07-08 06:53:03

低调的JVM

Category: lucene  Tags: lucene, elasticsearch, search engine

Article link: https://blog.csdn.net/qq_27529917/article/details/90754355

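This article covers only how the inverted index is built in memory; flushing the data to disk is out of scope. In memory, all of a Field's inverted index data shares a single byte pool, whereas after a flush it is written to different data files depending on the kind of data.

Depending on the Field's IndexOptions level and on whether the Field supports payloads, Lucene stores different pieces of data: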

public enum IndexOptions { 
  // NOTE: order is important here; FieldInfo uses this
  // order to merge two conflicting IndexOptions (always
  // "downgrades" by picking the lowest).
  /** Not indexed */
  NONE,
  /** 
   * Only documents are indexed: term frequencies and positions are omitted.
   * Phrase and other positional queries on the field will throw an exception, and scoring
   * will behave as if any term in the document appears only once.
   */
  DOCS,
  /** 
   * Only documents and term frequencies are indexed: positions are omitted. 
   * This enables normal scoring, except Phrase and other positional queries
   * will throw an exception.
   */  
  DOCS_AND_FREQS,
  /** 
   * Indexes documents, frequencies and positions.
   * This is a typical default for full-text search: full scoring is enabled
   * and positional queries are supported.
   */
  DOCS_AND_FREQS_AND_POSITIONS,
  /** 
   * Indexes documents, frequencies, positions and offsets.
   * Character offsets are encoded alongside the positions. 
   */
  DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
}

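The comments above are mostly self-explanatory:

freqs is the term frequency, which feeds weighting, scoring and ranking;
position is the term's ordinal position in the field. Each token advances the position by an increment relative to the previous token, which is 1 in most cases and larger only when the preceding words were stop words that got dropped. Positions are what phrase queries (several consecutive words) filter on;
offset is the term's character offset in the original text, stored as a delta, and is used for highlighting.

Having offsets implies that docID, freqs and positions are all present as well.

If the Field supports payloads, the payload data is stored too.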
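The "each level includes everything below it" rule can be sketched with plain enum ordering. This is only an illustration: the enum is a local copy so the snippet compiles without Lucene, `StoredPostingData` and `storedData` are hypothetical names, and the rule that payloads are kept only when positions are indexed is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class StoredPostingData {
    // Local copy of the enum shown above, so the sketch needs no Lucene jar.
    enum IndexOptions {
        NONE, DOCS, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    /** Lists what gets recorded for a field; relies on the enum's declaration order. */
    static List<String> storedData(IndexOptions options, boolean hasPayloads) {
        List<String> stored = new ArrayList<>();
        if (options.compareTo(IndexOptions.DOCS) >= 0) {
            stored.add("docID");
        }
        if (options.compareTo(IndexOptions.DOCS_AND_FREQS) >= 0) {
            stored.add("freq");
        }
        boolean hasPositions =
            options.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
        if (hasPositions) {
            stored.add("position");
        }
        if (options.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0) {
            stored.add("offset");
        }
        // Assumption of this sketch: payloads travel with positions.
        if (hasPayloads && hasPositions) {
            stored.add("payload");
        }
        return stored;
    }

    public static void main(String[] args) {
        for (IndexOptions o : IndexOptions.values()) {
            System.out.println(o + " -> " + storedData(o, true));
        }
    }
}
```

Comparing enum constants by order is also why the NOTE in the enum says the declaration order matters.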

In memory, the inverted index is held by the bytePool of TermsHashPerField:

abstract class TermsHashPerField implements Comparable<TermsHashPerField> {
    ......
    // Stores pointers (addresses) into bytePool
    final IntBlockPool intPool;
    // Stores the inverted (postings) data
    final ByteBlockPool bytePool;

    /**
     * Points at intPool.buffer, the current buffer of {@link #intPool}
     *
     * @see #add()
     */
    int[] intUptos;
    /**
     * Start of the current term's slots in intPool.buffer:
     * intUptoStart+0 holds the write position for freq, intUptoStart+1 the write position for prox and offset.
     * After each byte written, the value at intUptos[intUptoStart + stream] is incremented by 1,
     * i.e. the bytePool position it points at moves forward by one.
     *
     * @see #writeByte(int, byte) (last line)
     */
    int intUptoStart;

    /**
     * Stores, per termID, where that term's data lives in {@link #intPool} and {@link #bytePool}
     */
    ParallelPostingsArray postingsArray;

}

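Each Field has its own TermsHashPerField instance.

ByteBlockPool is, literally, a pool of byte blocks. Each term's inverted data lives in its own byte blocks (Lucene calls them slices), and each term has one or two of them:

The first holds docID and freqs. If IndexOptions is NONE, i.e. the field is not indexed, no inverted data is stored at all.
The second holds position and offset, so it exists only when IndexOptions is DOCS_AND_FREQS_AND_POSITIONS or DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.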

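Each slice starts out at 5 bytes, and its last byte holds the value 16 as an end marker. If a term has two slices, the second is allocated immediately after the first (prox means position here, matching Lucene's own naming).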
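Allocating slices inside a single array will of course run past its end sooner or later, so ByteBlockPool maintains a two-dimensional array, a pointer to the newest (current) array, and the next allocatable position inside that array: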

public final class ByteBlockPool {
    /**
     * array of buffers currently used in the pool. Buffers are allocated if
     * needed don't modify this outside of this class.
     */
    public byte[][] buffers = new byte[10][];

    /**
     * index into the buffers array pointing to the current buffer used as the head,
     * i.e. the position of buffer within buffers
     */
    private int bufferUpto = -1;
    /**
     * Where we are in head buffer: the next free position in buffer; nextBuffer() resets it to 0
     */
    public int byteUpto = BYTE_BLOCK_SIZE;

    /**
     * Current head buffer, one of the elements of buffers
     */
    public byte[] buffer;
    /**
     * Current head offset: the global offset at which the current buffer starts.
     * For the 3rd buffer that is (3-1) * BYTE_BLOCK_SIZE, with BYTE_BLOCK_SIZE = 32768.
     * byteUpto + byteOffset is the current global position across all buffers.
     */
    public int byteOffset = -BYTE_BLOCK_SIZE;
    ......
}
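The relation between a global address (byteUpto + byteOffset) and the pair (buffer index, position inside that buffer) can be illustrated as follows. This is a minimal sketch: `PoolAddress` and its method names are hypothetical, and the shift value of 15 is taken to match ByteBlockPool's BYTE_BLOCK_SHIFT.

```java
final class PoolAddress {
    static final int BYTE_BLOCK_SHIFT = 15;                   // assumed to match ByteBlockPool
    static final int BYTE_BLOCK_SIZE = 1 << BYTE_BLOCK_SHIFT; // 32768 bytes per buffer
    static final int BYTE_BLOCK_MASK = BYTE_BLOCK_SIZE - 1;

    /** Global address = start offset of the buffer + position inside the buffer. */
    static int globalAddress(int bufferUpto, int byteUpto) {
        return bufferUpto * BYTE_BLOCK_SIZE + byteUpto;
    }

    /** Which element of buffers[][] a global address falls into. */
    static int bufferIndex(int globalAddress) {
        return globalAddress >> BYTE_BLOCK_SHIFT;
    }

    /** Position of a global address inside its buffer. */
    static int localOffset(int globalAddress) {
        return globalAddress & BYTE_BLOCK_MASK;
    }

    public static void main(String[] args) {
        int address = globalAddress(2, 100); // 3rd buffer, position 100
        System.out.println(address + " -> buffer " + bufferIndex(address)
            + ", offset " + localOffset(address));
    }
}
```

Because the buffer size is a power of two, splitting an address needs only a shift and a mask, which is why a single int can stand in for a pointer into the whole pool.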

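If allocating one or two slices would run past the end of the current array, a new byte[] is created and the slices are allocated there instead. The initial slice is also often too small: when a term occurs many times, its docID and freqs slice fills up quickly. In that case the slice is grown, and ByteBlockPool defines the length for each slice level: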

public final class ByteBlockPool {

    /**
     * Slice levels: for each level, the level to use for the next slice.
     * An array holding the offset into the {@link ByteBlockPool#LEVEL_SIZE_ARRAY}
     * to quickly navigate to the next slice level.
     */
    public final static int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

    /**
     * Slice size for each level. The end marker of a slice is (16 | level),
     * with levels counted from 0, so 16, 17, 18, ...
     * An array holding the level sizes for byte slices.
     */
    public final static int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

}

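The first allocation uses the first level, i.e. 5 bytes. When that is not enough, the next allocation uses the second level, 14 bytes, and so on for every further growth step. On each growth, the last 4 bytes of the previous slice are overwritten with the address of the next slice. Because slices are allocated ahead of use, other slices usually sit right behind the current one, so the new slice is in most cases not contiguous with it.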
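To get a feel for how quickly a chain of slices grows, the sketch below adds up the usable bytes of a chain. It assumes the layout described above: every slice except the last gives up 4 bytes to the forwarding address, and the last one gives up 1 byte to its end marker. `SliceGrowth` and `capacityAfter` are hypothetical names.

```java
final class SliceGrowth {
    static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    /** Data bytes that fit into a chain of the given number of slices. */
    static int capacityAfter(int slices) {
        int level = 0;
        int capacity = 0;
        for (int i = 1; i <= slices; i++) {
            int size = LEVEL_SIZE_ARRAY[level];
            // Last slice: 1 byte for the end marker. Earlier slices: 4 bytes for the pointer.
            capacity += (i == slices) ? size - 1 : size - 4;
            level = NEXT_LEVEL_ARRAY[level];
        }
        return capacity;
    }

    public static void main(String[] args) {
        for (int slices = 1; slices <= 10; slices++) {
            System.out.println(slices + " slice(s) hold " + capacityAfter(slices) + " data bytes");
        }
    }
}
```

A single 5-byte slice holds 4 data bytes; after the first growth step the chain holds 1 + 13 = 14, because three of the original four bytes move into the new slice.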

The growth code in ByteBlockPool looks like this:

    /**
     * Called only when upto already sits on the end marker of the current slice,
     * in order to allocate the next slice.
     * Creates a new byte slice with the given starting size and
     * returns the slices offset in the pool.
     */
    public int allocSlice(final byte[] slice, final int upto) {
        // The end marker encodes the slice level: level 1 ends with 16, level 2 with 17,
        // level 3 with 18, and so on, so the low 4 bits give the current level index.
        final int level = slice[upto] & 15;
        final int newLevel = NEXT_LEVEL_ARRAY[level];
        // Look up the size of the next-level slice.
        final int newSize = LEVEL_SIZE_ARRAY[newLevel];

        // Maybe allocate another block
        // If the current buffer has no room for the new slice, switch to the next buffer.
        if (byteUpto > BYTE_BLOCK_SIZE - newSize) {
            nextBuffer();
        }

        final int newUpto = byteUpto;
        // Global address of the new slice
        final int offset = newUpto + byteOffset;
        byteUpto += newSize;

        // Copy forward the past 3 bytes (which we are about
        // to overwrite with the forwarding address):
        // The old slice needs a pointer to the new one so that a reader can continue there.
        // The pointer takes 4 bytes: the end marker plus the 3 data bytes before it,
        // so those 3 data bytes are moved into the new slice first.
        buffer[newUpto] = slice[upto - 3];
        buffer[newUpto + 1] = slice[upto - 2];
        buffer[newUpto + 2] = slice[upto - 1];

        // Write forwarding address at end of last slice:
        // the offset (the pointer) goes into the 4 bytes ending at the old end marker.
        // bits 24-31 of the int
        slice[upto - 3] = (byte)(offset >>> 24);
        // bits 16-23
        slice[upto - 2] = (byte)(offset >>> 16);
        // bits 8-15
        slice[upto - 1] = (byte)(offset >>> 8);
        // bits 0-7, written where the old end marker (e.g. 16) used to be
        slice[upto] = (byte)offset;
        // Together these 4 bytes form one int: the start address of the continuation slice.

        // Write new level:
        // the end marker of the new slice: 17, 18, 19, ...
        buffer[byteUpto - 1] = (byte)(16 | newLevel);

        return newUpto + 3;
    }
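The whole write, grow, and read-back cycle can be tried out with a stripped-down pool. The sketch below is not Lucene's class: `MiniSlicePool` is a hypothetical name, it uses one fixed buffer (so there is no nextBuffer() branch and a local offset doubles as the global address), it does no bounds checking, and its `read` method is a simplified stand-in for a slice reader that must be told where the stream ends.

```java
import java.io.ByteArrayOutputStream;

final class MiniSlicePool {
    static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    // Simplification: a single buffer instead of byte[][] buffers.
    final byte[] buffer = new byte[1 << 15];
    int byteUpto = 0;

    /** Reserves a first-level (5-byte) slice and returns its start address. */
    int newSlice() {
        int start = byteUpto;
        byteUpto += LEVEL_SIZE_ARRAY[0];
        buffer[byteUpto - 1] = 16; // end marker for the first level
        return start;
    }

    /** Grows the chain whose end marker sits at upto; returns the next write address. */
    int allocSlice(int upto) {
        int newLevel = NEXT_LEVEL_ARRAY[buffer[upto] & 15];
        int newUpto = byteUpto;
        byteUpto += LEVEL_SIZE_ARRAY[newLevel];
        // Move the last 3 data bytes forward; their old home becomes part of the pointer.
        for (int i = 0; i < 3; i++) {
            buffer[newUpto + i] = buffer[upto - 3 + i];
        }
        // Big-endian forwarding address over the 3 freed bytes plus the old end marker.
        buffer[upto - 3] = (byte) (newUpto >>> 24);
        buffer[upto - 2] = (byte) (newUpto >>> 16);
        buffer[upto - 1] = (byte) (newUpto >>> 8);
        buffer[upto] = (byte) newUpto;
        buffer[byteUpto - 1] = (byte) (16 | newLevel); // end marker of the new slice
        return newUpto + 3;
    }

    /** Appends b at upto and returns the address for the next write. */
    int writeByte(int upto, byte b) {
        if (buffer[upto] != 0) { // unwritten bytes are 0, so non-zero means end marker
            upto = allocSlice(upto);
        }
        buffer[upto] = b;
        return upto + 1;
    }

    /** Reads a stream from its first slice up to end (the address after the last written byte). */
    byte[] read(int start, int end) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int level = 0;
        int pos = start;
        while (true) {
            int size = LEVEL_SIZE_ARRAY[level];
            if (pos + size > end) { // end lies inside this slice, so it is the last one
                out.write(buffer, pos, end - pos);
                return out.toByteArray();
            }
            out.write(buffer, pos, size - 4); // data part of a slice that was grown
            int p = pos + size - 4;           // the 4-byte forwarding address
            pos = ((buffer[p] & 0xFF) << 24) | ((buffer[p + 1] & 0xFF) << 16)
                | ((buffer[p + 2] & 0xFF) << 8) | (buffer[p + 3] & 0xFF);
            level = NEXT_LEVEL_ARRAY[level];
        }
    }

    public static void main(String[] args) {
        MiniSlicePool pool = new MiniSlicePool();
        int start = pool.newSlice();
        int upto = start;
        for (int i = 1; i <= 6; i++) {
            upto = pool.writeByte(upto, (byte) i);
        }
        System.out.println(java.util.Arrays.toString(pool.read(start, upto)));
    }
}
```

Two streams written alternately end up with interleaved, non-contiguous slices, yet each reads back in order because the reader only follows the forwarding addresses.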

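A search must always specify which Field a term belongs to, because only the term data under the same Field is aggregated together.

When a Field adds a term, it first uses the hash of the term's bytes to determine whether this term has been added before:

Not added before: the Field determines the term's ordinal, i.e. how many unique terms the Field already has. This ordinal is the termID, starting from 0.
Added before: the previous ordinal is incremented by 1 and then negated, so the sign tells whether the term is a repeat.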
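The sign convention can be sketched with an ordinary HashMap standing in for the hash table. This is only an illustration of the return values; `TermIdAssigner` and `decode` are hypothetical names, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

final class TermIdAssigner {
    private final Map<String, Integer> ids = new HashMap<>();

    /** Returns a new termID (>= 0) for an unseen term, or -(termID + 1) for a repeat. */
    int add(String term) {
        Integer existing = ids.get(term);
        if (existing != null) {
            return -(existing + 1);
        }
        int termID = ids.size(); // ordinal among the unique terms seen so far
        ids.put(term, termID);
        return termID;
    }

    /** Recovers the real termID from either kind of return value. */
    static int decode(int result) {
        return result >= 0 ? result : (-result) - 1;
    }

    public static void main(String[] args) {
        TermIdAssigner field = new TermIdAssigner();
        for (String term : new String[] {"lucene", "index", "lucene"}) {
            int result = field.add(term);
            System.out.println(term + " -> " + result + " (termID " + decode(result) + ")");
        }
    }
}
```

The +1 before negating is what keeps termID 0 distinguishable: a repeat of term 0 comes back as -1 rather than 0.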

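If the term occurs for the first time, the following happens, depending on whether prox and offset need to be stored:

One or two slices of 5 bytes each are allocated in bytePool. bytePool tracks how far it has already been allocated, so if the next free byte is 280, the new slices are 280-284 and 285-289.
The start addresses of the slices (280, 285) are stored at the next free positions of intPool. If intPool was last allocated up to 64, then intPool.buffer[65]=280 and intPool.buffer[66]=285.
TermsHashPerField has a postingsArray property that records the positions in both bytePool and intPool: ParallelPostingsArray.intStarts[termID] = 65 and ParallelPostingsArray.byteStarts[termID] = 280.

From then on the termID alone is enough to find where the term's docID, freqs, prox and offset data is stored, so it can be retrieved easily.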
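The numbers from this walkthrough (280, 285, 65, 66) can be replayed with a few plain arrays. This is a simplified sketch with hypothetical names (`FirstOccurrence`, `newTerm`): single flat arrays replace the real pools, and end markers are left out.

```java
final class FirstOccurrence {
    static final int FIRST_LEVEL_SIZE = 5;

    final int[] intBuffer = new int[8192]; // stands in for intPool.buffer
    final int[] intStarts = new int[16];   // per termID: first slot in intBuffer
    final int[] byteStarts = new int[16];  // per termID: start address of its first slice
    int intUpto;                           // next free slot in intBuffer
    int byteUpto;                          // next free byte in the byte pool

    /** Bookkeeping for a term seen for the first time; streamCount is 1 or 2. */
    void newTerm(int termID, int streamCount) {
        intStarts[termID] = intUpto;
        for (int i = 0; i < streamCount; i++) {
            intBuffer[intUpto + i] = byteUpto; // each slot holds the start of a 5-byte slice
            byteUpto += FIRST_LEVEL_SIZE;
        }
        byteStarts[termID] = intBuffer[intUpto];
        intUpto += streamCount;
    }

    public static void main(String[] args) {
        FirstOccurrence state = new FirstOccurrence();
        state.intUpto = 65;   // slots up to 64 already taken
        state.byteUpto = 280; // bytes up to 279 already taken
        state.newTerm(0, 2);
        System.out.println("intBuffer[65]=" + state.intBuffer[65]
            + " intBuffer[66]=" + state.intBuffer[66]
            + " intStarts[0]=" + state.intStarts[0]
            + " byteStarts[0]=" + state.byteStarts[0]);
    }
}
```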

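If the term has been added before, postingsArray gives the position where its data was last written, and new data is appended there. When a slice runs out of room, it is grown according to the rules above.

Below is the relevant source of postingsArray and of adding a term, with comments: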

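class ParallelPostingsArray {
    ......
    /**
     * Per termID: where the term's own bytes start in the term byte pool.
     * add() passes this value on to the next TermsHashPerField in the chain.
     */
    final int[] textStarts;
    /**
     * Per termID: the global start position of the term's slots in IntBlockPool#buffers
     */
    final int[] intStarts;
    /**
     * Per termID: the global start address in bytePool of the term's first slice.
     * This is the value the term's first intPool slot is initialized with.
     */
    final int[] byteStarts;
}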

abstract class TermsHashPerField implements Comparable<TermsHashPerField> {

    /**
     * Called once per inverted token.  This is the primary
     * entry point (for first TermsHash); postings use this
     * API.
     *
     * In ByteBlockPool, docID and term frequency (freq) go into one slice, encoded with the
     * optional-follow rule (the low bit of the doc code tells whether a freq value follows),
     * while position data (prox) goes into another slice.
     * For one term, the addresses of these two slices are kept in IntBlockPool, so each term owns two ints:
     * 0: the address of docID + freq in ByteBlockPool,
     * 1: the address of prox in ByteBlockPool.
     * docID + freq is written with termsHashPerField.writeVInt(0, p.lastDocCode),
     * where the first argument selects stream 0; prox is written with
     * termsHashPerField.writeVInt(1, (proxCode<<1)|1), where the first argument selects stream 1.
     */
    void add() throws IOException {
        // We are first in the chain so we must "intern" the
        // term text into textStart address
        // Get the text & hash of this term.
        // termID: the ordinal of this term within the current field; termAtt.getBytesRef(): the term value as bytes.
        // termID normally increases, but if the term was stored in this Field before, the result is -(first termID + 1).
        // bytesHash stores the term's byte length and bytes: length (1 or 2 bytes) + body
        int termID = bytesHash.add(termAtt.getBytesRef());
        // Debug print
        System.out.println("add term=" + termAtt.getBytesRef().utf8ToString() + " doc=" + docState.docID + " termID=" + termID);
        // New posting: the term is written to this field for the first time
        if (termID >= 0) {
            bytesHash.byteStart(termID);
            // Init stream slices: if the ints to be added would overflow the current int buffer, move to a new buffer
            if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
                intPool.nextBuffer();
            }
            // One term owns 1 or 2 ints, and each int maps to a 5-byte slice
            if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt * ByteBlockPool.FIRST_LEVEL_SIZE) {
                bytePool.nextBuffer();
            }
            // Point at the newest int buffer
            intUptos = intPool.buffer;
            // Point at the next free position in that buffer
            intUptoStart = intPool.intUpto;
            // Advance by 1 or 2 slots: one for docID + freq, one for prox and offset
            intPool.intUpto += streamCount;

            // Record the global start position of this term's slots in IntBlockPool#buffers
            postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset;

            // Fill the 1 or 2 slots in intPool with start addresses in bytePool; each slot maps
            // to a 5-byte slice whose 5th byte holds 16 (0x10) as the end marker
            for (int i = 0; i < streamCount; i++) {
                // Allocate 5 bytes in bytePool and get the position of the first byte
                final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
                // Slot intUptoStart + i holds the global address of that slice in bytePool
                intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
            }
            // byteStarts records, per term, the bytePool address held by the term's first intPool slot
            postingsArray.byteStarts[termID] = intUptos[intUptoStart];

            newTerm(termID);

        }
        // The term has occurred in this field before
        else {
            termID = (-termID) - 1;
            int intStart = postingsArray.intStarts[termID];
            // Locate the intPool slots that were assigned when the term was first stored
            intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
            intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
            addTerm(termID);
        }

        if (doNextCall) {
            nextPerField.add(postingsArray.textStarts[termID]);
        }
    }

}
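The Javadoc of add() mentions that docID and freq are written as VInts with an optional-follow rule. A minimal sketch of that encoding follows. `DocFreqEncoding` and `encode` are hypothetical names, the output goes to a ByteArrayOutputStream rather than a slice, and the exact rule (the low bit set to 1 means "freq is 1 and is not written") is my reading of the scheme, not something the listing above shows.

```java
import java.io.ByteArrayOutputStream;

final class DocFreqEncoding {
    /** Variable-length int: 7 data bits per byte, high bit set while more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Encodes one (docID delta, freq) pair; assumes the low bit flags freq == 1. */
    static byte[] encode(int docDelta, int freq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (freq == 1) {
            writeVInt(out, (docDelta << 1) | 1); // freq is implied, nothing follows
        } else {
            writeVInt(out, docDelta << 1);       // low bit 0: an explicit freq follows
            writeVInt(out, freq);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(encode(3, 1)));
        System.out.println(java.util.Arrays.toString(encode(3, 2)));
    }
}
```

Since most terms occur once per document, folding freq == 1 into the doc code saves a byte in the common case, which matters when slices start at only 5 bytes.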