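I. Adding a document
1. Adding a new document
Adding a document amounts to creating a new segment. The segment name consists of an underscore followed by a number (the segment counter).
(Collecting field information and writing it to disk every time a document is added is wasteful, because all documents have the same fields and attributes.)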
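The field information is then written to the fnm file.
File structure:
1. Number of fields
2. Length of the field name string
3. Field name string
4. Attributes (stored, indexed, etc.)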
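The layout above can be sketched as a small in-memory encoder. This is only an illustration, not CLucene code: `FieldEntry`, `writeVInt` and `encodeFieldInfos` are hypothetical names, the count and the name length are assumed to be variable-length integers, and the bit assignment inside the attribute byte is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one field; not a CLucene type.
struct FieldEntry {
    std::string name;   // field name, assumed to be plain ASCII here
    bool stored;        // attribute: the field value is stored
    bool indexed;       // attribute: the field is indexed
};

// Append a variable-length integer: 7 data bits per byte, low-order bits
// first, high bit set on every byte except the last.
void writeVInt(std::vector<uint8_t>& out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Serialize the field table in the order listed above:
// field count, then per field: name length, name bytes, attribute byte.
std::vector<uint8_t> encodeFieldInfos(const std::vector<FieldEntry>& fields) {
    std::vector<uint8_t> out;
    writeVInt(out, static_cast<uint32_t>(fields.size()));
    for (const FieldEntry& f : fields) {
        assert(!f.name.empty());  // a field must have a name
        writeVInt(out, static_cast<uint32_t>(f.name.size()));
        out.insert(out.end(), f.name.begin(), f.name.end());
        // Assumed bit layout for this sketch: bit 0 = indexed, bit 1 = stored.
        uint8_t bits = 0;
        if (f.indexed) bits |= 0x1;
        if (f.stored)  bits |= 0x2;
        out.push_back(bits);
    }
    return out;
}
```

Because every document in these notes shares the same fields, a table like this only needs to be built once per segment rather than once per document.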
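fdx file: stores the offset of the document's field data.
fdt file structure:
1. Number of fields to be stored
2. For each field that needs to be stored, the following is recorded:
   1. Field number (int)
   2. Field attributes (tokenized / binary / compressed), as bits
   3. The original text (the forward document). How it is written depends on the text:
      1. Compressed text: not supported
      2. Binary: the text is already a byte stream and is written as is
      3. The text is empty
      4. The text is non-empty and none of the above applies: it is written directly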
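A minimal sketch of how the fdx and fdt writes fit together, under stated assumptions: `StoredField`, `addDocument` and the `FLAG_*` values are hypothetical, both files are modelled as in-memory vectors, every value is written with a length prefix, and an empty text is simply written with length 0 (the notes do not say what happens in that case).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flag bits for this sketch (tokenized / binary / compressed).
const uint8_t FLAG_TOKENIZED  = 0x1;
const uint8_t FLAG_BINARY     = 0x2;
const uint8_t FLAG_COMPRESSED = 0x4;

struct StoredField {
    uint32_t number;     // field number from the field table
    uint8_t flags;       // combination of the FLAG_* bits
    std::string value;   // text, or raw bytes when FLAG_BINARY is set
};

// Variable-length integer: 7 data bits per byte, high bit marks continuation.
void writeVInt(std::vector<uint8_t>& out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Append one document's stored fields to the fdt buffer and record in fdx
// the offset at which that document's record starts.
void addDocument(const std::vector<StoredField>& fields,
                 std::vector<uint8_t>& fdt,
                 std::vector<uint64_t>& fdx) {
    fdx.push_back(fdt.size());                              // offset of this record
    writeVInt(fdt, static_cast<uint32_t>(fields.size()));   // number of stored fields
    for (const StoredField& f : fields) {
        // Compressed values are rejected, matching "not supported" above.
        assert((f.flags & FLAG_COMPRESSED) == 0);
        writeVInt(fdt, f.number);
        fdt.push_back(f.flags);
        // Binary and text values are both written as length + bytes here.
        writeVInt(fdt, static_cast<uint32_t>(f.value.size()));
        fdt.insert(fdt.end(), f.value.begin(), f.value.end());
    }
}
```

With this arrangement a reader can jump straight to document n by looking up entry n in fdx and seeking to that offset in fdt.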
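tii file:
1. Write the term, using shared-prefix storage:
   1. Start of the shared prefix
   2. Length of the shared prefix
   3. The string that remains after the shared prefix is removed
   4. Number of the field the term belongs to
2. Term frequency (df); for a single document (one segment) this is actually tf
3. Pointer into the frequency file
4. Pointer into the position file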
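Shared-prefix storage can be illustrated with a short round-trip sketch. It assumes the terms arrive in sorted order and that each entry keeps only the length of the prefix shared with the previous term plus the remaining suffix; `PrefixEntry`, `encodeTerms` and `decodeTerms` are hypothetical names, and the field number and file pointers are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One encoded entry: how many leading characters are shared with the
// previous term, plus the suffix that differs.
struct PrefixEntry {
    std::size_t prefixLength;
    std::string suffix;
};

// Length of the common prefix of two strings.
std::size_t sharedPrefixLength(const std::string& a, const std::string& b) {
    std::size_t limit = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < limit && a[i] == b[i]) ++i;
    return i;
}

// Encode a sorted term list; sorting is what makes neighbours share prefixes.
std::vector<PrefixEntry> encodeTerms(const std::vector<std::string>& sortedTerms) {
    assert(std::is_sorted(sortedTerms.begin(), sortedTerms.end()));
    std::vector<PrefixEntry> out;
    std::string previous;
    for (const std::string& term : sortedTerms) {
        std::size_t n = sharedPrefixLength(previous, term);
        out.push_back(PrefixEntry{n, term.substr(n)});
        previous = term;
    }
    return out;
}

// Rebuild the terms: keep the shared prefix of the previous term, add the suffix.
std::vector<std::string> decodeTerms(const std::vector<PrefixEntry>& entries) {
    std::vector<std::string> out;
    std::string previous;
    for (const PrefixEntry& e : entries) {
        previous = previous.substr(0, e.prefixLength) + e.suffix;
        out.push_back(previous);
    }
    return out;
}
```

For the terms "apple", "apply", "banana" the encoder keeps "apple" in full, stores "apply" as a shared length of 4 plus the suffix "y", and starts over with "banana".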
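frq file:
Write the term's frequency within the document (tf):
1. If the frequency is 1, write 1
2. If the frequency is greater than 1, write 0 and then the frequency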
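The two rules above are consistent with the general Lucene frq encoding applied to a segment that holds a single document numbered 0: the document-number delta is shifted left by one bit, and the low bit is set when the frequency is 1. A minimal sketch, assuming variable-length integers and using the hypothetical names `writeVInt` and `writeDocFreq`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Variable-length integer: 7 data bits per byte, high bit marks continuation.
void writeVInt(std::vector<uint8_t>& out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// docDelta: gap between this document number and the previous one.
// freq: number of occurrences of the term in this document (tf).
void writeDocFreq(std::vector<uint8_t>& out, uint32_t docDelta, uint32_t freq) {
    assert(freq >= 1);
    uint32_t docCode = docDelta << 1;
    if (freq == 1) {
        writeVInt(out, docCode | 1);   // low bit set: frequency is exactly 1
    } else {
        writeVInt(out, docCode);       // low bit clear: frequency follows
        writeVInt(out, freq);
    }
}
```

With `docDelta` equal to 0 this writes exactly 1 for a frequency of 1, and 0 followed by the frequency otherwise, which is the behaviour described in the list.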
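When a document is added, one segment is generated for it. I believe the segment corresponds to a document whose number is not assigned externally (in fact it cannot be controlled from outside) but is assigned internally, with numbering starting from 0: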
SegmentInfo* si = _CLNEW SegmentInfo(segmentName, 1, ramDirectory);
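II. The merge process
1. Field merging
Loop over all segments to be merged and read each segment's field information with the following function: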
FieldInfos::read(IndexInput* input)
The field information is then merged into a new field collection.
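The idea of folding several per-segment field lists into one collection can be sketched as follows. This is not the CLucene `FieldInfos` class: `FieldAttr` and `MergedFields` are hypothetical, only one attribute is tracked, and the rule that a field becomes indexed as soon as any segment indexes it is an assumption of the sketch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-field record read from one segment.
struct FieldAttr {
    std::string name;
    bool indexed;
};

// Collects fields from many segments; each distinct name gets one number,
// assigned in order of first appearance.
class MergedFields {
public:
    // Add a field, or widen the attributes of a field already present.
    int add(const FieldAttr& f) {
        assert(!f.name.empty());
        std::map<std::string, int>::iterator it = numberByName_.find(f.name);
        if (it == numberByName_.end()) {
            int number = static_cast<int>(fields_.size());
            numberByName_[f.name] = number;
            fields_.push_back(f);
            return number;
        }
        // Assumption: once any segment indexes the field, it stays indexed.
        fields_[it->second].indexed = fields_[it->second].indexed || f.indexed;
        return it->second;
    }

    // Merge every field of one segment.
    void addAll(const std::vector<FieldAttr>& segmentFields) {
        for (const FieldAttr& f : segmentFields) add(f);
    }

    // Field number in the merged collection, or -1 if unknown.
    int numberOf(const std::string& name) const {
        std::map<std::string, int>::const_iterator it = numberByName_.find(name);
        return it == numberByName_.end() ? -1 : it->second;
    }

    const std::vector<FieldAttr>& fields() const { return fields_; }

private:
    std::map<std::string, int> numberByName_;
    std::vector<FieldAttr> fields_;
};
```

Calling `addAll` once per segment, in segment order, yields a single numbering that the later merge steps can use to refer to fields.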
2. Term merging
This function merges the terms:
void SegmentMerger::mergeTermInfos()
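Before reading the full listing, it may help to see the same pattern in isolation: a k-way merge over sorted term lists driven by a priority queue, where all cursors sitting on the current smallest term are popped together, merged, advanced and put back. The sketch below uses only the standard library. `SegmentCursor` is a hypothetical stand-in for `SegmentMergeInfo`, the "merge" step just records which segment bases contain the term, and `docCounts` plays the role of `reader->numDocs()`.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cursor over one segment's sorted terms.
struct SegmentCursor {
    const std::vector<std::string>* terms;
    std::size_t pos;
    int base;   // number of the segment's first document in the merged index
    const std::string& term() const { return (*terms)[pos]; }
    bool next() { return ++pos < terms->size(); }
};

// Puts the smallest term on top of the queue; ties go to the lower base.
struct CursorGreater {
    bool operator()(const SegmentCursor& a, const SegmentCursor& b) const {
        if (a.term() != b.term()) return a.term() > b.term();
        return a.base > b.base;
    }
};

// Each distinct term paired with the bases of the segments that contain it.
typedef std::pair<std::string, std::vector<int> > MergedTerm;

std::vector<MergedTerm> mergeTerms(
        const std::vector<std::vector<std::string> >& segments,
        const std::vector<int>& docCounts) {
    assert(segments.size() == docCounts.size());
    std::priority_queue<SegmentCursor, std::vector<SegmentCursor>, CursorGreater> queue;
    int base = 0;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        // Segments without terms never enter the queue.
        if (!segments[i].empty()) queue.push(SegmentCursor{&segments[i], 0, base});
        base += docCounts[i];   // first document number of the next segment
    }
    std::vector<MergedTerm> merged;
    while (!queue.empty()) {
        // Pop every cursor positioned on the current smallest term.
        std::vector<SegmentCursor> match;
        match.push_back(queue.top());
        queue.pop();
        const std::string term = match[0].term();
        while (!queue.empty() && queue.top().term() == term) {
            match.push_back(queue.top());
            queue.pop();
        }
        // "Merge" step of this sketch: remember which segments had the term.
        MergedTerm entry;
        entry.first = term;
        for (const SegmentCursor& c : match) entry.second.push_back(c.base);
        merged.push_back(entry);
        // Advance each matched cursor; re-queue it if terms remain.
        for (SegmentCursor& c : match) {
            if (c.next()) queue.push(c);
        }
    }
    return merged;
}
```

Each term is visited once, in sorted order, and only as many cursors as there are segments are ever held in the queue.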
void SegmentMerger::mergeTermInfos(){
    //Func - Merges all TermInfos into a single segment
    //Pre  - true
    //Post - All TermInfos have been merged into a single segment

    //Condition check to see if queue points to a valid instance
    CND_CONDITION(queue != NULL, "Memory allocation for queue failed");

    //base is the id of the first document in a segment
    int32_t base = 0;

    IndexReader* reader = NULL;
    SegmentMergeInfo* smi = NULL;

    //iterate through all the readers
    for (uint32_t i = 0; i < readers.size(); i++)
    {
        //Get the i-th reader
        reader = readers[i];

        //Condition check to see if reader points to a valid instance
        CND_CONDITION(reader != NULL, "No IndexReader found");

        //Get the term enumeration of the reader
        TermEnum* termEnum = reader->terms();
        //Instantiate a new SegmentMergeInfo for the current reader and enumeration
        smi = _CLNEW SegmentMergeInfo(base, termEnum, reader);

        //Condition check to see if smi points to a valid instance
        CND_CONDITION(smi != NULL, "Memory allocation for smi failed");

        //Increase the base by the number of documents that have not been marked deleted
        //so base will contain a new value for the first document of the next iteration
        base += reader->numDocs();
        //Get the next current term
        if (smi->next()){
            //Store the SegmentMergeInfo smi with the initialized SegmentTermEnum TermEnum
            //into the queue
            queue->put(smi);
        }else{
            //Apparently the end of the TermEnum of the SegmentTerm has been reached so
            //close the SegmentMergeInfo smi
            smi->close();
            //And destroy the instance and set smi to NULL (It will be used later in this method)
            _CLDELETE(smi);
        }
    }

    //Instantiate an array of SegmentMergeInfo instances called match
    SegmentMergeInfo** match = _CL_NEWARRAY(SegmentMergeInfo*,readers.size()+1);

    //Condition check to see if match points to a valid instance
    CND_CONDITION(match != NULL, "Memory allocation for match failed");

    SegmentMergeInfo* top = NULL;

    //As long as there are SegmentMergeInfo instances stored in the queue
    while (queue->size() > 0) {
        int32_t matchSize = 0;

        // pop matching terms

        //Pop the first SegmentMergeInfo from the queue
        match[matchSize++] = queue->pop();
        //Get the Term of match[0]
        Term* term = match[0]->term; //the smallest term, which was just popped

        //Condition check to see if term points to a valid instance
        CND_CONDITION(term != NULL,"term is NULL");

        //Get the current top of the queue
        top = queue->top();

        //For each SegmentMergeInfo still in the queue
        //Check if term matches the term of the SegmentMergeInfo instances in the queue
        //(walk through the segments, pop every one that contains the smallest term,
        //and collect it in the temporary match array so it can be merged)
        while (top != NULL && term->equals(top->term) ){
            //A match has been found so add the matching SegmentMergeInfo to the match array
            match[matchSize++] = queue->pop();
            //Get the next SegmentMergeInfo
            top = queue->top();
        }
        match[matchSize]=NULL;

        //add new TermInfo
        mergeTermInfo(match); //matchSize //merge the collected entries

        //Restore the SegmentTermInfo instances in the match array back into the queue
        while (matchSize > 0){
            smi = match[--matchSize];

            //Condition check to see if smi points to a valid instance
            CND_CONDITION(smi != NULL,"smi is NULL");

            //Move to the next term in the enumeration of SegmentMergeInfo smi
            if (smi->next()){ //advance this segment to its next smallest term
                //There still are some terms so restore smi in the queue
                queue->put(smi);
            }else{
                //Done with a segment
                //No terms anymore so close this SegmentMergeInfo instance
                smi->close();
                _CLDELETE( smi );
            }
        }
    }

    _CLDELETE_ARRAY(match);