Analysis of how Lucene builds the inverted index in memory

Article link: https://blog.csdn.net/gs_albb/article/details/117607278

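This article examines how the inverted index is written in Lucene 8.8.2. Using sample code, it explains in detail how a term's posting list is built from document content, including how the term's frequency and position information within a document are stored and how repeated terms are handled.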


Background

For the basics, see the article 倒排表上 (Posting Lists, Part 1).

Example and analysis

The test source code is on GitHub.

    /**
     * Tests adding only an inverted index, with the index options set to
     * {@link org.apache.lucene.index.IndexOptions#DOCS_AND_FREQS_AND_POSITIONS}.
     * Offsets are not included.
     */
    @Test
    public void testCreateInvertFieldDocIndex() throws IOException, URISyntaxException {
        IndexWriter writer = getIndexWriter();
        Document doc = new Document();
        doc.add(new TextField("info", "study play football ! study", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new TextField("info", "hi, every one, good play study", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new TextField("info", "play basketball is one good interest", Field.Store.YES));
        writer.addDocument(doc);

        writer.commit();
        writer.close();
    }

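In the Lucene source, the content of each Field of each Doc is analyzed into individual tokens (terms). For every term, DefaultIndexingChain.PerField#invert then calls TermsHashPerField#add(BytesRef termBytes, final int docID), which is the foundation for building the term's inverted index. Stepping through this call in a debugger is strongly recommended.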

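Writing doc0 term0 (study)

This section shows the internal state of the index after the first term of the first Doc, study, has been written.
[Figure: index state after writing doc0 term0 (study)]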

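PostingArray:
It holds several int arrays whose index is the termID. A field can contain many terms; termIDs are assigned 0, 1, 2, … in the order in which the terms first appear.

textStarts: the start position of the term's text in bytePool.
byteStarts: the start position in bytePool of the term's doc ID and term frequency information.
addressOffset: the offset of the term's metadata in intPool. Each term occupies two int elements in intPool. The first initially points to the position in bytePool where the term's doc ID and frequency are written; the second initially points to the position in bytePool where pos, payload, and offset are written. The source code calls these intPool elements streams.
Tip: variables with "address" in their name are always related to pointers (array indexes).

lastDocIDs: the docId in which the term last appeared.
lastDocCodes: the encoded value of the docId in which the term last appeared. On the first occurrence it is docId << 1; on later occurrences it is (docId - lastDocId) << 1.

termFreqs: the term's frequency in the current field of the current document.
lastPositions: the position of the term's most recent occurrence in the current field's text.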

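Write process
1. textStarts[0] is set to 0, meaning the term's length and text are written to bytePool starting at that position.
2. bytePool[0] through bytePool[5] record the length of study (5) and its five characters s, t, u, d, y.
3. addressOffset[0] is set to 0, pointing to index 0 of the intPool array.
4. bytePool allocates a first-level slice, bytePool[6] through bytePool[10] (5 bytes). bytePool[10] = 16 marks the end of the first-level slice.
5. intPool[0] = 6, pointing to the first byte of the slice allocated in the previous step.
6. Steps 4 and 5 run once more, because a term needs two streams (two elements of intPool). Afterwards intPool[1] = 11, pointing to the first byte of the second first-level slice allocated in bytePool.
7. byteStarts[0] = intPool[0] = 6, pointing to the position where the doc ID and frequency of study are written.
8. lastDocIDs[0] = 0, recording that the last document in which study appeared is doc 0.
9. lastDocCodes[0] = docId << 1 = 0, the encoded value of the docId to be stored.
10. termFreqs[0] = 1, a frequency of 1 (one occurrence).
11. bytePool[11] = prox << 1. There is no payload, so the position prox (0) is shifted left by one bit and written where the position data belongs (the location that the term's stream[1] points to, i.e. bytePool[11]). The value written is 0.
12. intPool[1] = 12. Because step 11 wrote the term's prox information to bytePool[11], the term's stream[1] (i.e. intPool[1]) advances its pointer by 1; payload and similar data (if any) would be written to the following bytes of bytePool.
13. lastPositions[0] = 0. The last position of study is 0, because it is the first word of "study play football ! study".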
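The thirteen steps above can be replayed with a small toy model. This is only a sketch under stated assumptions: NewTermSketch and its methods are hypothetical names, not Lucene classes; the pools are fixed-size arrays; and every length or position is assumed to fit in a single byte, whereas Lucene itself uses variable-length ints and growable slices.

```java
import java.nio.charset.StandardCharsets;

/** Toy model of the pools and parallel arrays (hypothetical, not Lucene code). */
class NewTermSketch {
    static final int FIRST_SLICE_SIZE = 5;   // bytes in a first-level slice
    static final byte SLICE_END_MARKER = 16; // value stored in the slice's last byte

    final byte[] bytePool = new byte[128];
    final int[] intPool = new int[32];
    int byteUpto = 0;  // next free byte in bytePool
    int intUpto = 0;   // next free slot in intPool
    int termCount = 0; // next termID to hand out

    // Parallel arrays, all indexed by termID.
    final int[] textStarts = new int[16];
    final int[] addressOffset = new int[16];
    final int[] byteStarts = new int[16];
    final int[] lastDocIDs = new int[16];
    final int[] lastDocCodes = new int[16];
    final int[] termFreqs = new int[16];
    final int[] lastPositions = new int[16];

    /** Reserves a first-level slice and returns the index of its first byte. */
    int newSlice() {
        int start = byteUpto;
        bytePool[start + FIRST_SLICE_SIZE - 1] = SLICE_END_MARKER;
        byteUpto += FIRST_SLICE_SIZE;
        return start;
    }

    /** Replays steps 1-13 for a term seen for the first time; returns its termID. */
    int addNewTerm(String text, int docID, int position) {
        int termID = termCount++;
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        textStarts[termID] = byteUpto;                        // step 1
        bytePool[byteUpto++] = (byte) utf8.length;            // step 2: length prefix
        System.arraycopy(utf8, 0, bytePool, byteUpto, utf8.length);
        byteUpto += utf8.length;

        addressOffset[termID] = intUpto;                      // step 3
        intPool[intUpto++] = newSlice();                      // steps 4-5: doc/freq stream
        intPool[intUpto++] = newSlice();                      // step 6: position stream
        byteStarts[termID] = intPool[addressOffset[termID]];  // step 7

        lastDocIDs[termID] = docID;                           // step 8
        lastDocCodes[termID] = docID << 1;                    // step 9
        termFreqs[termID] = 1;                                // step 10

        int proxStream = addressOffset[termID] + 1;
        bytePool[intPool[proxStream]] = (byte) (position << 1); // step 11
        intPool[proxStream]++;                                // step 12
        lastPositions[termID] = position;                     // step 13
        return termID;
    }

    public static void main(String[] args) {
        NewTermSketch s = new NewTermSketch();
        int id = s.addNewTerm("study", 0, 0);
        // prints "byteStarts=6 intPool[0]=6 intPool[1]=12"
        System.out.println("byteStarts=" + s.byteStarts[id]
                + " intPool[0]=" + s.intPool[0] + " intPool[1]=" + s.intPool[1]);
    }
}
```

Note that the doc ID and frequency slice (bytePool[6] to bytePool[10]) stays empty at this point; only the position stream has received a byte.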

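Writing doc0 term1 (play)

This section shows the internal state of the index after the second term of the first Doc, play, has been written.
[Figure: index state after writing doc0 term1 (play)]

postingArray
textStarts[1] = 16: play is written starting at bytePool[16]. bytePool[16] = 4 records the length of play, and bytePool[17] through bytePool[20] record its 4 characters.

byteStarts[1] = 21: the doc ID and frequency information of play is written starting at bytePool[21].

addressOffset[1] = 2: play occupies two streams, the first of which starts at intPool[2].

lastDocIDs[1] = 0: the last document in which play appeared is doc 0.

lastDocCodes[1] = 0: the encoded value of that doc ID is 0 << 1 = 0.

termFreqs[1] = 1: the frequency of play is 1.

lastPositions[1] = 1: play appears at position 1.

intPool
intPool[2] = 21: the doc ID and frequency information of the term play should be written at bytePool[21].
intPool[3] = 27: the position of play has already been written to bytePool[26], so any remaining data such as payloads starts at bytePool[27].

bytePool
bytePool[26] = position (1) << 1 = 2.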

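Writing doc0 term2 (football)

This section shows the internal state of the index after the third term of the first Doc, football, has been written.
[Figure: index state after writing doc0 term2 (football)]

postingArray
textStarts[2] = 31: football is written starting at bytePool[31]. bytePool[31] = 8 records the length of football, and bytePool[32] through bytePool[39] record its 8 characters.

byteStarts[2] = 40: the doc ID and frequency information of football is written starting at bytePool[40].

addressOffset[2] = 4: football occupies two streams, the first of which starts at intPool[4].

lastDocIDs[2] = 0: the last document in which football appeared is doc 0.

lastDocCodes[2] = 0: the encoded value of that doc ID is 0 << 1 = 0.

termFreqs[2] = 1: the frequency of football is 1.

lastPositions[2] = 2: football appears at position 2.

intPool
intPool[4] = 40: the doc ID and frequency information of the term football should be written at bytePool[40].
intPool[5] = 46: the position of football has already been written to bytePool[45], so any remaining data such as payloads starts at bytePool[46].

bytePool
bytePool[45] = position (2) << 1 = 4.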
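The offsets quoted for study, play, and football follow from simple arithmetic: each new term consumes one length byte, its text, and two 5-byte first-level slices. The sketch below recomputes them. SliceLayoutSketch and layout are hypothetical names used only for illustration, and the terms are assumed to be short ASCII strings seen for the first time, one after another, in an otherwise empty bytePool.

```java
/** Recomputes the bytePool offsets of consecutive new terms (hypothetical helper). */
class SliceLayoutSketch {
    static final int FIRST_SLICE_SIZE = 5; // bytes in a first-level slice

    /** Returns {textStart, docFreqSliceStart, proxSliceStart} for each term. */
    static int[][] layout(String[] terms) {
        int[][] rows = new int[terms.length][];
        int next = 0; // next free byte in bytePool
        for (int i = 0; i < terms.length; i++) {
            int textStart = next;
            // One length byte plus one byte per character (ASCII assumed).
            int docFreqSlice = textStart + 1 + terms[i].length();
            int proxSlice = docFreqSlice + FIRST_SLICE_SIZE;
            rows[i] = new int[] {textStart, docFreqSlice, proxSlice};
            next = proxSlice + FIRST_SLICE_SIZE;
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] terms = {"study", "play", "football"};
        int[][] rows = layout(terms);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": textStarts=" + rows[i][0]
                    + " byteStarts=" + rows[i][1] + " proxSlice=" + rows[i][2]);
        }
        // study: textStarts=0 byteStarts=6 proxSlice=11
        // play: textStarts=16 byteStarts=21 proxSlice=26
        // football: textStarts=31 byteStarts=40 proxSlice=45
    }
}
```

The third column is where each term's first position byte lands, which matches bytePool[11], bytePool[26], and bytePool[45] in the walkthrough.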

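Writing doc0 term0 (study) again

This section covers the fourth term of the first doc, which is study again. Because study was the first term and has already been written, only its frequency and related information need to be updated. The resulting internal state of the index:
[Figure: index state after writing study in doc0 a second time]
1. First, update termFreqs[0] = 2.
2. Update bytePool[12] = 3 << 1 = 6 (a new pos is appended because the term is repeated). For a repeat within the same document the stored value is the gap from lastPositions[0], which is 3 - 0 = 3 here.
3. Update intPool[1] = 13, the next position in bytePool where this term's pos will be written.
4. Update lastPositions[0] = 3.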
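The four updates above amount to the following sketch. RepeatedTermSketch is a hypothetical class; its initial state is copied from the walkthrough (study after its first occurrence in doc0), and it assumes the position gap fits in one byte and that the slice still has room, so slice growth is not modeled.

```java
/** Repeated occurrence of a term within the same document (hypothetical sketch). */
class RepeatedTermSketch {
    // State of termID 0 (study) after its first occurrence in doc0.
    final byte[] bytePool = new byte[16];
    final int[] intPool = {6, 12};   // stream 0 and stream 1 write pointers
    final int[] addressOffset = {0};
    final int[] termFreqs = {1};
    final int[] lastPositions = {0};

    void addOccurrenceInSameDoc(int termID, int position) {
        termFreqs[termID]++;                                 // update 1
        int delta = position - lastPositions[termID];        // gap from the previous position
        int proxStream = addressOffset[termID] + 1;
        bytePool[intPool[proxStream]] = (byte) (delta << 1); // update 2
        intPool[proxStream]++;                               // update 3
        lastPositions[termID] = position;                    // update 4
    }

    public static void main(String[] args) {
        RepeatedTermSketch s = new RepeatedTermSketch();
        s.addOccurrenceInSameDoc(0, 3);
        // prints "freq=2 bytePool[12]=6 intPool[1]=13"
        System.out.println("freq=" + s.termFreqs[0]
                + " bytePool[12]=" + s.bytePool[12] + " intPool[1]=" + s.intPool[1]);
    }
}
```

Nothing is written to the doc ID and frequency stream here; only the in-memory termFreqs counter changes.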

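Writing doc1 term1 (play)

The internal state of the index after this write:
[Figure: index state after writing play in doc1]

1. bytePool[21] = (0 << 1) | 1 = 1. play already appeared in the info field of doc0, so the doc ID and frequency of its previous occurrence must now be written to bytePool.
Case A: the term's frequency in the previous doc is 1, so postings.lastDocCodes[termID] | 1 is written (the doc ID shifted left by one bit, OR 1).
Case B: the term's frequency in the previous doc is greater than 1, so lastDocCodes[termID] and termFreqs[termID] are stored separately.

2. intPool[2] = 22 (the doc ID and frequency have been written to bytePool[21]).

3. termFreqs[1] = 1, re-initialized to record the term's frequency in the current doc.

4. lastDocCodes[1] = (currentDocId - lastDocId) << 1 = (1 - 0) << 1 = 2.

5. lastDocIDs[1] = currentDocId = 1.

6. bytePool[27] = pos << 1 = 4 << 1 = 8 (play is at position 4 in "hi, every one, good play study").

7. intPool[3] = intPool[3] + 1 = 27 + 1 = 28.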
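Steps 1 through 7, including both Case A and Case B, can be sketched for a single term as follows. NewDocSketch is a hypothetical class that keeps the two stream pointers in plain fields instead of intPool; it assumes every value fits in one byte and does not model slice growth.

```java
/** An existing term appears in a new document (hypothetical single-term sketch). */
class NewDocSketch {
    final byte[] bytePool = new byte[64];
    int docFreqUpto; // write pointer of stream 0 (doc ID and frequency)
    int proxUpto;    // write pointer of stream 1 (positions)
    int lastDocID, lastDocCode, termFreq, lastPosition;

    NewDocSketch(int docFreqUpto, int proxUpto, int lastDocID, int lastDocCode, int termFreq) {
        this.docFreqUpto = docFreqUpto;
        this.proxUpto = proxUpto;
        this.lastDocID = lastDocID;
        this.lastDocCode = lastDocCode;
        this.termFreq = termFreq;
    }

    void startNewDoc(int docID, int position) {
        // Steps 1-2: flush the previous document's doc code and frequency.
        if (termFreq == 1) {
            bytePool[docFreqUpto++] = (byte) (lastDocCode | 1); // Case A: one byte
        } else {
            bytePool[docFreqUpto++] = (byte) lastDocCode;       // Case B: doc code,
            bytePool[docFreqUpto++] = (byte) termFreq;          // then the frequency
        }
        termFreq = 1;                            // step 3
        lastDocCode = (docID - lastDocID) << 1;  // step 4
        lastDocID = docID;                       // step 5
        bytePool[proxUpto++] = (byte) (position << 1); // steps 6-7: position restarts per doc
        lastPosition = position;
    }

    public static void main(String[] args) {
        // play: stream pointers 21 and 27, last seen in doc0 with frequency 1.
        NewDocSketch play = new NewDocSketch(21, 27, 0, 0, 1);
        play.startNewDoc(1, 4);
        // prints "bytePool[21]=1 bytePool[27]=8 lastDocCode=2"
        System.out.println("bytePool[21]=" + play.bytePool[21]
                + " bytePool[27]=" + play.bytePool[27] + " lastDocCode=" + play.lastDocCode);
    }
}
```

Calling `new NewDocSketch(6, 13, 0, 0, 2).startNewDoc(1, 5)` exercises Case B with the state of study (frequency 2 in doc0): it writes 0 and then 2 into bytePool[6] and bytePool[7].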

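Summary

bytePool and intPool are shared across different fields, but each field has its own postingArray, so the postingArray is the entry point to the relationship between these three objects. In other words, reads start from the postingArray.
For a given field and a given term, the doc ID and frequency information is written to bytePool only when the term appears again in a later document, so the write lags behind. Put differently: while processing a term, if the document in which it last appeared is not the current document, the previous doc ID and frequency are written to bytePool at that point.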
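Because of this lag, a reader has to combine the bytes already flushed to bytePool with the pending entry that still lives only in lastDocIDs and termFreqs. The following is a minimal sketch of that decoding, assuming single-byte values and a doc/freq stream already copied into its own array; DocFreqDecodeSketch and decode are hypothetical names, not Lucene's reader API.

```java
import java.util.ArrayList;
import java.util.List;

/** Decodes one term's doc/freq stream plus its pending entry (hypothetical sketch). */
class DocFreqDecodeSketch {
    /**
     * flushed:     bytes already written to the term's doc/freq stream
     * lastDocID:   lastDocIDs[termID], the document not yet flushed
     * pendingFreq: termFreqs[termID], that document's frequency so far
     * Returns {docID, freq} pairs in document order.
     */
    static List<int[]> decode(byte[] flushed, int lastDocID, int pendingFreq) {
        List<int[]> postings = new ArrayList<>();
        int docID = 0;
        int i = 0;
        while (i < flushed.length) {
            int code = flushed[i++];
            docID += code >>> 1;                             // doc IDs are delta-encoded
            int freq = (code & 1) != 0 ? 1 : flushed[i++];   // low bit set means freq == 1
            postings.add(new int[] {docID, freq});
        }
        // The most recent document exists only in the parallel arrays.
        postings.add(new int[] {lastDocID, pendingFreq});
        return postings;
    }

    public static void main(String[] args) {
        // play after doc1: bytePool[21] = 1 is flushed, doc1 is still pending.
        for (int[] p : decode(new byte[] {1}, 1, 1)) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
        // doc=0 freq=1
        // doc=1 freq=1
    }
}
```

For study after doc1 the flushed bytes are {0, 2} with a pending entry of doc 1 and frequency 1, which decodes to (doc 0, freq 2) followed by (doc 1, freq 1).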