Lucene Index Notes

todaylxp

Article link: https://blog.csdn.net/todaylxp/article/details/4242068

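I. Adding a Document

1. Adding a new document

Adding a document amounts to adding a new segment. The segment name is an underscore followed by a number (the segment counter). (Collecting the field statistics and writing them to disk on every added document is wasteful, since all documents have the same fields and attributes.)

The field information is then written to the .fnm file.

File structure:

1. Number of fields
2. Length of the field name string
3. Field name string
4. Attributes (stored, indexed, etc.)

Items 2 to 4 repeat for each field.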
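The layout above can be sketched in a few lines. This is only an illustration and not CLucene's own writer: `FieldEntry`, `writeVInt` and `writeFieldInfos` are hypothetical names, field names are assumed to be ASCII, and the meaning of the attribute bits is left open.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory description of one field (not a CLucene type).
struct FieldEntry {
    std::string name;  // field name, assumed to be ASCII in this sketch
    uint8_t bits;      // attribute flags; the bit meanings are illustrative only
};

// Append v as a variable-length integer: 7 data bits per byte, low-order
// group first, high bit set on every byte except the last one.
void writeVInt(std::vector<uint8_t>& out, uint32_t v) {
    while (v > 0x7F) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Serialize: field count, then (name length, name bytes, attribute byte) per field.
std::vector<uint8_t> writeFieldInfos(const std::vector<FieldEntry>& fields) {
    std::vector<uint8_t> out;
    writeVInt(out, static_cast<uint32_t>(fields.size()));
    for (const FieldEntry& f : fields) {
        writeVInt(out, static_cast<uint32_t>(f.name.size()));
        out.insert(out.end(), f.name.begin(), f.name.end());
        out.push_back(f.bits);
    }
    return out;
}
```

Because every document in the batch has the same fields, a block like this only needs to be produced once per segment rather than once per document.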

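.fdx file: written with the offset of each document's field data (in the .fdt file).

.fdt file structure:

1. Number of fields to be stored
2. For each field that needs to be stored, the following is recorded:
   1. Field number (int)
   2. Field attributes (tokenized / binary / compressed), as bit flags
   3. The original text (the forward document). Cases for the original text:
      1. Compressed text: not supported
      2. Binary: the value is already a byte stream and is written as is
      3. The value is empty
      4. The value is non-empty and none of the above applies: the text is written directly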

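.tii file

1. The term, stored with shared-prefix compression:
   1. Length of the prefix shared with the previous term (the offset at which the suffix starts)
   2. Length of the remaining suffix
   3. The suffix string left after removing the shared prefix
   4. Number of the field the term belongs to
2. Document frequency (df); for a single document (one segment) this is always 1
3. Pointer into the frequency file
4. Pointer into the position file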
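To make the shared-prefix idea concrete, here is a minimal sketch that compresses a sorted term list and rebuilds it. `TermEntry`, `sharedPrefix`, `encodeTerms` and `decodeTerms` are names made up for this example; the field number and the file pointers are left out, so this only shows the prefix handling, not the real on-disk record.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one prefix-compressed term (not a CLucene type).
struct TermEntry {
    std::size_t prefixLen;  // characters shared with the previous term
    std::string suffix;     // what remains after the shared prefix
};

// Length of the common prefix of a and b.
std::size_t sharedPrefix(const std::string& a, const std::string& b) {
    std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// Encode a sorted term list as (prefixLen, suffix) pairs.
std::vector<TermEntry> encodeTerms(const std::vector<std::string>& sorted) {
    std::vector<TermEntry> out;
    std::string last;  // the previous term; empty before the first one
    for (const std::string& t : sorted) {
        std::size_t p = sharedPrefix(last, t);
        out.push_back(TermEntry{p, t.substr(p)});
        last = t;
    }
    return out;
}

// Rebuild the original terms from the compressed entries.
std::vector<std::string> decodeTerms(const std::vector<TermEntry>& entries) {
    std::vector<std::string> out;
    std::string last;
    for (const TermEntry& e : entries) {
        last = last.substr(0, e.prefixLen) + e.suffix;
        out.push_back(last);
    }
    return out;
}
```

Since the dictionary is sorted, neighbouring terms tend to share long prefixes, which is why storing only the suffix saves space.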

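.frq file

Writes the frequency of the term within the document (tf):

1. If the frequency is 1, write 1
2. If the frequency is greater than 1, write 0 and then the frequency
3. Either way, this is equivalent to having written the default document number 0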

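In detail:

1. If the frequency is 1, the variable-length integer 1 is written, which encodes to a single byte (byte value = 1).

When decoding, reading one byte (byte value = 1) returns the variable-length integer 1 (only one byte is read); this is docCode.

_doc += docCode >> 1: the document number is the decoded delta (a variable-length integer) plus the previous document number (currently 0), so _doc = 0 + (1 >> 1) = 0;

The decoded document number is 0, and docCode & 1 = 1 is non-zero, so the frequency is 1. Decoding is complete: doc = 0; freq = 1;

2. If the frequency is greater than 1, two variable-length integers are written, 0 and freq. The byte stream is one 8-bit byte with value zero, followed by the variable-length encoding of freq.

When decoding, reading one byte (byte value = 0) returns the variable-length integer 0 (again only one byte is read); this is docCode.

_doc += docCode >> 1: the document number is the decoded delta (a variable-length integer) plus the previous document number (currently 0), so _doc = 0 + (0 >> 1) = 0;

The decoded document number is 0, but since docCode & 1 = 0, one more variable-length integer has to be read, and that is the frequency freq.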
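The decoding steps just described can be written out as a small sketch. `readVInt`, `Posting` and `decodeFreqs` are illustrative names rather than CLucene functions, and the input is assumed to be a well-formed byte stream holding exactly `count` entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read one variable-length integer starting at pos and advance pos.
// Each byte carries 7 data bits; a set high bit means another byte follows.
uint32_t readVInt(const std::vector<uint8_t>& in, std::size_t& pos) {
    uint8_t b = in[pos++];
    uint32_t v = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = in[pos++];
        v |= static_cast<uint32_t>(b & 0x7F) << shift;
    }
    return v;
}

// Hypothetical decoded entry: document number and term frequency.
struct Posting {
    uint32_t doc;
    uint32_t freq;
};

// Decode `count` entries. The low bit of docCode says "freq is 1";
// the remaining bits are the delta to the previous document number.
std::vector<Posting> decodeFreqs(const std::vector<uint8_t>& in, std::size_t count) {
    std::vector<Posting> out;
    std::size_t pos = 0;
    uint32_t doc = 0;  // previous document number, starts at 0
    for (std::size_t i = 0; i < count; ++i) {
        uint32_t docCode = readVInt(in, pos);
        doc += docCode >> 1;
        uint32_t freq = (docCode & 1) ? 1 : readVInt(in, pos);
        out.push_back(Posting{doc, freq});
    }
    return out;
}
```

Feeding it the single byte 1 yields doc = 0, freq = 1, and feeding it the bytes 0, 3 yields doc = 0, freq = 3, matching the two cases above.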

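When a document is added, one segment is generated for it. As I understand it, the document's number is not assigned externally (in fact it cannot be controlled from outside) but is assigned internally, with numbering starting from 0.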

SegmentInfo* si = _CLNEW SegmentInfo(segmentName, 1, ramDirectory);

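II. The Merge Process

1. Merging fields

Loop over all segments to be merged and read each segment's field information with the following function:

FieldInfos::read(IndexInput* input)

The field information is then merged into a new field set.

2. Merging terms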

The following function, SegmentMerger::mergeTermInfos(), merges the terms:

void SegmentMerger::mergeTermInfos(){
    //Func - Merges all TermInfos into a single segment
    //Pre  - true
    //Post - All TermInfos have been merged into a single segment

    //Condition check to see if queue points to a valid instance
    CND_CONDITION(queue != NULL, "Memory allocation for queue failed");

    //base is the id of the first document in a segment
    int32_t base = 0;

    IndexReader* reader = NULL;
    SegmentMergeInfo* smi = NULL;

    //iterate through all the readers
    for (uint32_t i = 0; i < readers.size(); i++){
        //Get the i-th reader
        reader = readers[i];

        //Condition check to see if reader points to a valid instance
        CND_CONDITION(reader != NULL, "No IndexReader found");

        //Get the term enumeration of the reader
        TermEnum* termEnum = reader->terms();

        //Instantiate a new SegmentMergeInfo for the current reader and enumeration
        smi = _CLNEW SegmentMergeInfo(base, termEnum, reader);

        //Condition check to see if smi points to a valid instance
        CND_CONDITION(smi != NULL, "Memory allocation for smi failed");

        //Increase the base by the number of documents that have not been marked deleted
        //so base will contain a new value for the first document of the next iteration
        base += reader->numDocs();

        //Get the next current term
        if (smi->next()){
            //Store the SegmentMergeInfo smi with the initialized SegmentTermEnum TermEnum
            //into the queue
            queue->put(smi);
        }else{
            //Apparently the end of the TermEnum of the SegmentTerm has been reached so
            //close the SegmentMergeInfo smi
            smi->close();
            //And destroy the instance and set smi to NULL (It will be used later in this method)
            _CLDELETE(smi);
        }
    }

    //Instantiate an array of SegmentMergeInfo instances called match
    SegmentMergeInfo** match = _CL_NEWARRAY(SegmentMergeInfo*,readers.size()+1);

    //Condition check to see if match points to a valid instance
    CND_CONDITION(match != NULL, "Memory allocation for match failed");

    SegmentMergeInfo* top = NULL;

    //As long as there are SegmentMergeInfo instances stored in the queue
    while (queue->size() > 0) {
        int32_t matchSize = 0;

        // pop matching terms

        //Pop the first SegmentMergeInfo from the queue
        match[matchSize++] = queue->pop();

        //Get the Term of match[0]; this is the smallest term, just popped
        Term* term = match[0]->term;

        //Condition check to see if term points to a valid instance
        CND_CONDITION(term != NULL,"term is NULL");

        //Get the current top of the queue
        top = queue->top();

        //For each SegmentMergeInfo still in the queue
        //Check if term matches the term of the SegmentMergeInfo instances in the queue
        //(collect every segment whose current term is the smallest term into match, ready to be merged)
        while (top != NULL && term->equals(top->term) ){
            //A match has been found so add the matching SegmentMergeInfo to the match array
            match[matchSize++] = queue->pop();
            //Get the next SegmentMergeInfo
            top = queue->top();
        }
        match[matchSize]=NULL;

        //add new TermInfo (merge the collected entries)
        mergeTermInfo(match); //matchSize

        //Restore the SegmentMergeInfo instances in the match array back into the queue
        while (matchSize > 0){
            smi = match[--matchSize];

            //Condition check to see if smi points to a valid instance
            CND_CONDITION(smi != NULL,"smi is NULL");

            //Move to the next term in the enumeration of SegmentMergeInfo smi
            //(this segment's next smallest term)
            if (smi->next()){
                //There still are some terms so restore smi in the queue
                queue->put(smi);
            }else{
                //Done with a segment
                //No terms anymore so close this SegmentMergeInfo instance
                smi->close();
                _CLDELETE( smi );
            }
        }
    }

    _CLDELETE_ARRAY(match);
}

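After merging, the .frq entries are written in the general form (the single-document case described earlier is the special case where the delta is 0). Here docCode is the document number delta shifted left by one bit:

1. If the frequency (tf) is 1, write (docCode | 1)
2. If the frequency is greater than 1, write docCode (which carries the document number delta), then write the frequency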
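A minimal sketch of this encoding, assuming the postings of one term are already sorted by document number. `Posting`, `writeVInt` and `encodeFreqs` are names chosen for the example and are not taken from CLucene.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical posting: document number and term frequency in that document.
struct Posting {
    uint32_t doc;
    uint32_t freq;
};

// Append v as a variable-length integer (7 data bits per byte, low group first).
void writeVInt(std::vector<uint8_t>& out, uint32_t v) {
    while (v > 0x7F) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Encode postings sorted by ascending document number.
std::vector<uint8_t> encodeFreqs(const std::vector<Posting>& postings) {
    std::vector<uint8_t> out;
    uint32_t lastDoc = 0;
    for (const Posting& p : postings) {
        assert(p.doc >= lastDoc && p.freq >= 1);
        uint32_t docCode = (p.doc - lastDoc) << 1;  // delta, shifted to free the low bit
        lastDoc = p.doc;
        if (p.freq == 1) {
            writeVInt(out, docCode | 1);  // low bit set: frequency is 1, nothing else follows
        } else {
            writeVInt(out, docCode);      // low bit clear: the frequency follows
            writeVInt(out, p.freq);
        }
    }
    return out;
}
```

For postings (doc 0, freq 1), (doc 3, freq 2), (doc 4, freq 1) this produces the bytes 1, 6, 2, 3.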

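To summarize the term merge:

1. Following the segment order, each segment is given a document number offset, normally via base += reader->numDocs(); that is, the current segment's offset plus the current segment's document count is the next segment's offset.
2. The terms of the dictionaries are combined with a multi-way merge:
   1. Go through all segments and take the term with the smallest value.
   2. Go through all segments and take out the segments that contain that smallest term.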
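The two steps can be sketched with the standard library instead of CLucene's own queue. This is a simplified stand-in, not the author's or CLucene's code: `docBases`, `MergedTerm` and `mergeTerms` are hypothetical names, each segment is modelled as a sorted, duplicate-free list of strings, and a segment is advanced as soon as it is popped rather than after the merge step.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Document number offset of each segment: base += numDocs of the previous segment.
std::vector<int> docBases(const std::vector<int>& numDocs) {
    std::vector<int> bases;
    int base = 0;
    for (int n : numDocs) {
        bases.push_back(base);
        base += n;
    }
    return bases;
}

// One merged dictionary entry: a term and the indices of the segments containing it.
struct MergedTerm {
    std::string term;
    std::vector<std::size_t> segments;
};

// Multi-way merge of per-segment sorted term lists.
std::vector<MergedTerm> mergeTerms(const std::vector<std::vector<std::string> >& segs) {
    // Heap entry: (current term, segment index); the smallest term is on top.
    typedef std::pair<std::string, std::size_t> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > queue;
    std::vector<std::size_t> cursor(segs.size(), 0);

    for (std::size_t s = 0; s < segs.size(); ++s)
        if (!segs[s].empty()) queue.push(Entry(segs[s][0], s));

    std::vector<MergedTerm> out;
    while (!queue.empty()) {
        MergedTerm m;
        m.term = queue.top().first;  // the smallest term overall
        // Pop every segment whose current term equals the smallest term.
        while (!queue.empty() && queue.top().first == m.term) {
            std::size_t s = queue.top().second;
            queue.pop();
            m.segments.push_back(s);
            // Advance that segment and put it back if it has more terms.
            if (++cursor[s] < segs[s].size()) queue.push(Entry(segs[s][cursor[s]], s));
        }
        out.push_back(m);
    }
    return out;
}
```

In the real merge, the segment indices collected for a term are where the document offsets come in: each segment's postings for that term are appended with its base added to the document numbers.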

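III. Merge Policy

Lucene uses an immediate merge policy: as soon as there are indexes that satisfy the merge condition, they are merged.

minMergeDocs is the minimum number of documents for a merge. The default is 10, meaning that indexes holding fewer than 10 documents stay in memory.

Once the count reaches 10, the in-memory indexes are merged into a new segment on disk.

mergeFactor is the merge factor, a multiplier. Briefly, it works as follows (for example, with a merge factor of 10):

1. When the candidate set holds 10 segments of 1 document each, they are merged into 1 segment of 10 documents, which is written to disk.
2. The scan continues: if there are 10 segments of 10 documents each, they are merged into 1 segment of 100 documents.

If there are none, this merge pass ends.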
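The cascade can be simulated with plain document counts. This is a deliberately simplified model and not Lucene's actual policy code: `addDocument` is a hypothetical function, each segment is represented only by its document count, and a merge fires only when the newest `mergeFactor` segments all have the same size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulate adding one document. `segments` holds the document count of each
// segment, oldest first.
void addDocument(std::vector<long>& segments, std::size_t mergeFactor) {
    assert(mergeFactor >= 2);
    segments.push_back(1);  // a new document starts as its own one-document segment
    // Cascade upward: while the newest mergeFactor segments have equal size,
    // replace them with one segment holding all their documents.
    while (segments.size() >= mergeFactor) {
        std::size_t first = segments.size() - mergeFactor;
        long size = segments.back();
        bool same = true;
        for (std::size_t i = first; i < segments.size(); ++i) {
            if (segments[i] != size) { same = false; break; }
        }
        if (!same) break;  // nothing to merge at this level: the pass ends
        segments.resize(first);
        segments.push_back(size * static_cast<long>(mergeFactor));
    }
}
```

With a merge factor of 10, the 10th document collapses ten one-document segments into a single segment of 10, and the 100th document triggers two merges in a row, leaving a single segment of 100.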

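Notes:

Normally, adding a new document triggers the immediate merge pass, which works upward from the lowest level and merges segments of the same size into a new segment.

However, it can happen that a large index is merged with a small one, in which case the small index cannot work its way up level by level.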
