1.5 Details of Closing the IndexWriter
The previous articles covered the overall flow of how the IndexWriter indexer builds the in-memory index. Once the in-memory index tables have been built with IndexWriter, the only work left is to close the IndexWriter. Besides cleaning up the objects held in memory, closing the IndexWriter performs one more very important task: writing the buffered information (the Fields that need storing, the inverted index tables, and so on) into Lucene's on-disk index files. Each of Lucene's on-disk index files will be covered in its own dedicated series of articles; here we only walk through the flow of writing the files.
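For orientation, here is the usage pattern this walkthrough assumes, a minimal sketch against the 2.9/3.0-era API used throughout this series (the index path and field contents are made up for illustration):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("testindex")),   //illustrative index directory
            new StandardAnalyzer(Version.LUCENE_30),
            true,                                      //create a new index
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "lucene index flush", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);  //builds the in-memory index
        writer.optimize();        //flushes the in-memory index to disk, then merges segments
        writer.close();           //closes the files and releases the write lock
    }
}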
◆ IndexWriter.optimize()
In the indexing code of Section 1.1 of "Index Creation (1): The IndexWriter Indexer", once all the Document objects have been indexed, we call IndexWriter.optimize() to close down the indexer. The call flow through the source is as follows:
IndexWriter.optimize()
---> IndexWriter.optimize(boolean doWait)
---> IndexWriter.optimize(int maxNumSegments, boolean doWait)
public void optimize(int maxNumSegments, boolean doWait){
//.......
flush(true, false, true); //write the in-memory index data to the disk files
//......
}
IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes)
--> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes)
--> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes)
private synchronized final boolean doFlushInternal(boolean flushDocStores, boolean flushDeletes) {
//number of documents currently buffered in memory
final int numDocs = docWriter.getNumDocsInRAM();
//name of the segment the stored fields and term vectors are written to, e.g. "_0"
String docStoreSegment = docWriter.getDocStoreSegment();
//offset of this flush within that doc-store segment
int docStoreOffset = docWriter.getDocStoreOffset();
//whether the doc store is kept in a compound index file
boolean docStoreIsCompoundFile = false;
//name of the segment to flush to, e.g. "_0"
String segment = docWriter.getSegment();
//flush the buffered index data into the segment
flushedDocCount = docWriter.flush(flushDocStores);
//......
}
◆ DocumentsWriter.flush()
Continuing from the source above, IndexWriter calls DocumentsWriter's flush() method to push the shutdown work further along. Just as DocumentsWriter drives an indexing chain (the "processing workshop") to build the index, it also walks that chain step by step when closing, writing the data each stage produced during indexing into the corresponding disk files.
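Putting the two tasks together first, the skeleton of DocumentsWriter.flush(boolean closeDocStore) looks roughly like this (a simplified sketch assembled only from the calls discussed below, not the verbatim source; bookkeeping and error handling are omitted):

synchronized int flush(boolean closeDocStore) throws IOException {
    if (closeDocStore) {
        //task 1 below: close the stored-field and term-vector doc-store files
        closeDocStore();
    }
    //task 2 below: walk the indexing chain and write the segment files
    //(threads holds one DocConsumerPerThread per indexing thread)
    consumer.flush(threads, flushState);
    return flushState.numDocs;
}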
Let's first look at the main work done by DocumentsWriter.flush(boolean closeDocStore):
1. closeDocStore(): close the stored-field and term-vector doc stores along the basic indexing chain.
Following the chain, each consumer closes its share of the doc-store files:
consumer(DocFieldProcessor).closeDocStore(flushState);
consumer(DocInverter).closeDocStore(state);
consumer(TermsHash).closeDocStore(state);
consumer(FreqProxTermsWriter).closeDocStore(state);
if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
consumer(TermVectorsTermsWriter).closeDocStore(state);
endConsumer(NormsWriter).closeDocStore(state);
fieldsWriter(StoredFieldsWriter).closeDocStore(state);
Only two of these closeDocStore implementations do real work:
(1) Closing the term vectors: TermVectorsTermsWriter.closeDocStore(SegmentWriteState)
void closeDocStore(final SegmentWriteState state) throws IOException {
if (tvx != null) {
//write zeros into the tvd file for documents that store no term vectors; even without term vectors, every document keeps a slot in tvx and tvd
fill(state.numDocsInStore - docWriter.getDocStoreOffset());
//close the tvx, tvf, tvd output streams
tvx.close();
tvf.close();
tvd.close();
tvx = null;
//record the flushed file names so that they can later be combined into a single compound (cfs) file
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
//remove them from DocumentsWriter's openFiles; IndexFileDeleter may delete them later
docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
lastDocID = 0;
}
}
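As a reminder, term vectors exist only for fields that requested them when the document was indexed; a field added as below (standard 2.9/3.x API; the field name and text are illustrative) is what produces the entries in tvx, tvd and tvf:

doc.add(new Field("content", "lucene term vector demo",
    Field.Store.YES, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS_OFFSETS)); //store terms plus their positions and offsets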
(2) Closing the stored fields: StoredFieldsWriter.closeDocStore(SegmentWriteState)
public void closeDocStore(SegmentWriteState state) throws IOException {
//close the fdx and fdt output streams
fieldsWriter.close();
--> fieldsStream.close();
--> indexStream.close();
fieldsWriter = null;
lastDocID = 0;
//record the flushed file names
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
}
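Likewise, only fields added with Field.Store.YES ever reach the fdt file, and fdx is essentially a table of one 64-bit pointer per document marking where that document's stored fields begin inside fdt. For example (field names and values are illustrative):

doc.add(new Field("title", "Lucene in Action",
    Field.Store.YES, Field.Index.NOT_ANALYZED)); //value is written verbatim into .fdt
doc.add(new Field("body", "a long body text ...",
    Field.Store.NO, Field.Index.ANALYZED));      //indexed only; leaves no trace in .fdt/.fdx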
2. consumer.flush(threads, flushState): follow the basic indexing chain and write the indexing results into the disk files of the named segment.
The order of writing to disk also follows the basic indexing chain:
DocFieldProcessor.flush //writes the stored Fields into .fdx and .fdt
--> DocInverter.flush //writes the inverted index tables
--> TermsHash.flush
--> FreqProxTermsWriter.flush
--> TermVectorsTermsWriter.flush
--> NormsWriter.flush
Step 1: First, DocFieldProcessor.flush(Collection<DocConsumerPerThread> threads, SegmentWriteState state) is called to write the Fields data into .fdx and .fdt.
public void flush(Collection<DocConsumerPerThread> threads, SegmentWriteState state){
//trim fieldHash so it can be reused for the next round of indexing; for efficiency, the objects in the indexing chain are recycled
Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> childThreadsAndFields = new HashMap<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>>();
for ( DocConsumerPerThread thread : threads) {
DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread;
childThreadsAndFields.put(perThread.consumer, perThread.fields());
perThread.trimFields(state);
}
//write the stored Fields into the .fdx and .fdt files
fieldsWriter.flush(state);
//step two of the indexing chain: DocInverter writes the inverted index tables to the disk files
consumer.flush(childThreadsAndFields, state);
//write the field metadata (.fnm) and record the flushed file name for the later cfs generation
final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);
fieldInfos.write(state.directory, fileName);
state.flushedFiles.add(fileName);
}
The main job of this step is to have StoredFieldsWriter write the Fields of each Document that need storing into the .fdx and .fdt files (see "Index File Formats (3): Field data [.fdx/.fdt/.fnm]"). The call flow is as follows:
StoredFieldsWriter.flush(SegmentWriteState state)
---> FieldsWriter.flush()
void flush() throws IOException {
indexStream.flush(); //flush the .fdx file
fieldsStream.flush(); //flush the .fdt file
}
--->BufferedIndexOutput.flush()
--->BufferedIndexOutput.flushBuffer(byte[] b, int len)
--->SimpleFSDirectory.flushBuffer(byte[] b, int offset, int size)
public void flushBuffer(byte[] b, int offset, int size) throws IOException {
file.write(b, offset, size); //JDK RandomAccessFile.write
}
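BufferedIndexOutput merely drains its internal buffer; paraphrased from the 2.9-era source (not a verbatim quote), its flush() is essentially:

public void flush() throws IOException {
    flushBuffer(buffer, bufferPosition); //hand the buffered bytes to the concrete Directory
    bufferStart += bufferPosition;       //advance the logical file position
    bufferPosition = 0;                  //the buffer is empty again
}

so every byte Lucene writes ultimately funnels through a single RandomAccessFile.write call, as shown above.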
Step 2: Next, DocInverter.flush(Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> threadsAndFields, SegmentWriteState state) is called to write the inverted index tables into the .tii, .tis, .frq and .prx files.
//DocInverter.flush()
void flush(Map<DocFieldConsumerPerThread, Collection<DocFieldConsumerPerField>> threadsAndFields, SegmentWriteState state) {
//write the postings and term vectors (TermsHash.flush())
consumer.flush(childThreadsAndFields, state);
//write the normalization factors (NormsWriter.flush())
endConsumer.flush(endChildThreadsAndFields, state);
}
DocInverter.flush --> TermsHash.flush
//TermsHash.flush()
void flush(Map<InvertedDocConsumerPerThread,Collection<InvertedDocConsumerPerField>> threadsAndFields, SegmentWriteState state) {
//write the postings (FreqProxTermsWriter.flush())
consumer.flush(childThreadsAndFields, state);
//recycle the RawPostingList objects
shrinkFreePostings(threadsAndFields, state);
//write the term vectors (TermVectorsTermsWriter.flush())
if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);
}
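The reason one TermsHash.flush fans out to both postings and term vectors is the two-level TermsHash chain wired up when the indexing chain is built. The wiring is roughly the following (paraphrased from the 2.9-era default indexing chain; treat the exact constructor arguments as an assumption):

//the primary TermsHash feeds FreqProxTermsWriter (.frq/.prx/.tii/.tis);
//its nextTermsHash feeds TermVectorsTermsWriter (.tvx/.tvd/.tvf)
final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriter);
final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();
final InvertedDocConsumer termsHash =
    new TermsHash(documentsWriter, true, freqProxWriter,
        new TermsHash(documentsWriter, false, termVectorsWriter, null));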
--> (1) FreqProxTermsWriter.flush() writes the term dictionary of the inverted index into the tii and tis files (see "Index File Formats (4): Terms data [.tii/.tis]"), and writes each posting's docIDs and positions into the frq and prx files respectively.
public void flush(Map<TermsHashConsumerPerThread,Collection<TermsHashConsumerPerField>> threadsAndFields, final SegmentWriteState state) {
// gather the Fields that have been indexed by all threads
List<FreqProxTermsWriterPerField> allFields = new ArrayList<FreqProxTermsWriterPerField>();
for(Map.Entry<TermsHashConsumerPerThread,Collection<TermsHashConsumerPerField>> entry : threadsAndFields.entrySet()) {
Collection<TermsHashConsumerPerField> fields = entry.getValue();
for (final TermsHashConsumerPerField i : fields) {
final FreqProxTermsWriterPerField perField = (FreqProxTermsWriterPerField) i;
//this Field has postings, i.e. an index structure was actually built for it
if (perField.termsHashPerField.numPostings > 0)
allFields.add(perField);
}
}
//sort all Fields by name so that same-named fields can be processed together (with multithreaded indexing, every Document carries the same set of fields)
Collections.sort(allFields);
final int numAllFields = allFields.size();
//① create the writer for the postings
final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);
int start = 0;
//for each field
while(start < numAllFields) {
//collect the same-named fields handled by all threads into a FreqProxTermsWriterPerField[] array, ready to be written to the files together
final FieldInfo fieldInfo = allFields.get(start).fieldInfo;
final String fieldName = fieldInfo.name;
int end = start+1;
while(end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName))
end++;
FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end-start];
for(int i=start;i<end;i++) {
fields[i-start] = allFields.get(i);
fieldInfo.storePayloads |= fields[i-start].hasPayloads;
}
// ② if an index was built for this field, append its inverted index to the files
appendPostings(fields, consumer);
// free the memory
for(int i=0;i<fields.length;i++) {
TermsHashPerField perField = fields[i].termsHashPerField;
int numPostings = perField.numPostings;
perField.reset();
perField.shrinkHash(numPostings);
fields[i].reset();
}
start = end;
}
for (Map.Entry<TermsHashConsumerPerThread,Collection<TermsHashConsumerPerField>> entry : threadsAndFields.entrySet()) {
FreqProxTermsWriterPerThread perThread = (FreqProxTermsWriterPerThread) entry.getKey();
perThread.termsHashPerThread.reset(true);
}
consumer.finish();
}
A few points in the code above deserve attention:
① Creating the writer for the postings: its basic job is to create the .tii, .tis, .frq and .prx files that will hold the inverted index tables. At this point no data has been written yet.
public FormatPostingsFieldsWriter(SegmentWriteState state, FieldInfos fieldInfos) throws IOException {
dir = state.directory;
segment = state.segmentName;
totalNumDocs = state.numDocs;
this.fieldInfos = fieldInfos;
//TermInfosWriter writes .tii and .tis
termsOut = new TermInfosWriter(dir, segment, fieldInfos, state.termIndexInterval);
//DefaultSkipListWriter writes the skip lists for .frq and .prx
skipListWriter = new DefaultSkipListWriter(termsOut.skipInterval, termsOut.maxSkipLevels, totalNumDocs, null, null);
//record the flushed file names
state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_EXTENSION));
state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_INDEX_EXTENSION));
//with the two writers above, the data is written into the segment in the prescribed format
termsWriter = new FormatPostingsTermsWriter(state, this);
}
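The termIndexInterval handed to TermInfosWriter controls how tis and tii divide the work: .tis holds every term, while .tii holds only every termIndexInterval-th term (128 by default) so that the term index can stay in memory. It can be tuned on the writer:

writer.setTermIndexInterval(128); //the default; larger values shrink .tii but slow down term lookup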
② Appending the inverted index of same-named fields to the files:
void appendPostings(FreqProxTermsWriterPerField[] fields,
FormatPostingsFieldsConsumer consumer)
throws CorruptIndexException, IOException {
int numFields = fields.length;
final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields];
for(int i=0;i<numFields;i++) {
FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);
assert fms.field.fieldInfo == fields[0].fieldInfo;
boolean result = fms.nextTerm();//advance every field to its first term (Term)
assert result;
}
//add this field: although there are multiple per-thread instances, they are all the same named field, so the first one's info suffices. The returned object is used to add the terms of this field.
final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo);
//FreqProxFieldMergeState is used by DocumentsWriter to merge the postings of multiple threads into one segment
FreqProxFieldMergeState[] termStates = new FreqProxFieldMergeState[numFields];
final boolean currentFieldOmitTermFreqAndPositions = fields[0].fieldInfo.omitTermFreqAndPositions;
//this while loop iterates over the fields that still have unprocessed terms, handling their terms in dictionary order. Once every term of a field has been processed, numFields is decremented and the field is removed from mergeStates. The loop only exits when all terms of all fields have been processed.
while(numFields > 0) {
//find the next term, in dictionary order, across all fields. Several of the same-named (per-thread) fields may contain the same term, so all numFields entries are traversed to get each field's next term; numToMerge ends up as the number of fields that contain this term.
termStates[0] = mergeStates[0];
int numToMerge = 1;
for(int i=1;i<numFields;i++) {
final char[] text = mergeStates[i].text;
final int textOffset = mergeStates[i].textOffset;
final int cmp = compareText(text, textOffset, termStates[0].text, termStates[0].textOffset);
if (cmp < 0) {
termStates[0] = mergeStates[i];
numToMerge = 1;
} else if (cmp == 0)
termStates[numToMerge++] = mergeStates[i];
}
//add the term; the returned FormatPostingsDocsConsumer is used to add the document numbers (docIDs) and term frequencies (freq)
final FormatPostingsDocsConsumer docConsumer = termsConsumer.addTerm(termStates[0].text, termStates[0].textOffset);
//numToMerge fields contain this term, and each holds a list of the document numbers of the documents containing it. This loop walks every field containing the term and appends the document numbers and term frequencies in ascending order. Once all document numbers of a field for this term are handled, numToMerge is decremented and the field is removed from termStates. The loop ends when every document number from every field containing this term has been processed.
while(numToMerge > 0) {
//find the smallest document number
FreqProxFieldMergeState minState = termStates[0];
for(int i=1;i<numToMerge;i++)
if (termStates[i].docID < minState.docID)
minState = termStates[i];
final int termDocFreq = minState.termFreq;
//add the document number and term frequency, building the skip list along the way; the returned FormatPostingsPositionsConsumer is used to add the position (prox) data
final FormatPostingsPositionsConsumer posConsumer = docConsumer.addDoc(minState.docID, termDocFreq);
//ByteSliceReader reads the prox data out of the byte pool.
final ByteSliceReader prox = minState.prox;
if (!currentFieldOmitTermFreqAndPositions) {
int position = 0;
// this loop adds the position data of the current document containing the term
for(int j=0;j<termDocFreq;j++) {
final int code = prox.readVInt();
position += code >> 1;
final int payloadLength;
// if this position carries a payload, read it from the byte pool; otherwise the payload length is zero.
if ((code & 1) != 0) {
payloadLength = prox.readVInt();
if (payloadBuffer == null || payloadBuffer.length < payloadLength)
payloadBuffer = new byte[payloadLength];
prox.readBytes(payloadBuffer, 0, payloadLength);
} else
payloadLength = 0;
//add the position (prox) data
posConsumer.addPosition(position, payloadBuffer, 0, payloadLength);
} //End for
posConsumer.finish();
}
//exit bookkeeping: the field just consumed fetches its next document number; if there is none, its documents for this term are finished, so it is removed from termStates and numToMerge is decremented. That field then fetches its next term, so the next pass of the outer loop starts processing that term; if there is no next term either, all of the field's terms are finished, so it is removed from mergeStates and numFields is decremented. When numFields reaches 0, the outer loop ends as well.
if (!minState.nextDoc()) {
int upto = 0;
for(int i=0;i<numToMerge;i++)
if (termStates[i] != minState)
termStates[upto++] = termStates[i];
numToMerge--;
assert upto == numToMerge;
if (!minState.nextTerm()) {
upto = 0;
for(int i=0;i<numFields;i++)
if (mergeStates[i] != minState)
mergeStates[upto++] = mergeStates[i];
numFields--;
assert upto == numFields;
}
}
}
//by this point the docID and freq data has been written to the segment files, but the skip data is still sitting in the skip buffer; it is actually written to the file here, and the term dictionary entries (tii, tis) are written as well.
docConsumer.finish();
}
termsConsumer.finish();
}
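The readVInt / code >> 1 dance in the inner loop decodes Lucene's position encoding: each position is stored as a delta against the previous position in the same document, shifted left one bit, with the low bit flagging whether a payload follows. A self-contained illustration (demo code, not Lucene API):

//encoding side: positions 3 and 7 without payloads become the VInts 6 (3<<1) and 8 ((7-3)<<1)
static int encodeProx(int positionDelta, boolean hasPayload) {
    return (positionDelta << 1) | (hasPayload ? 1 : 0);
}
//decoding side, mirroring the loop above:
//  position += code >> 1;        recovers the cumulative position
//  if ((code & 1) != 0) { ... }  a payload (preceded by its VInt length) follows in the stream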
--> (2) TermVectorsTermsWriter.flush() writes the tvx, tvd and tvf files.
TermVectorsTermsWriter.flush(Map<TermsHashConsumerPerThread,Collection<TermsHashConsumerPerField>> threadsAndFields, final SegmentWriteState state) {
if (tvx != null) {
if (state.numDocsInStore > 0)
fill(state.numDocsInStore - docWriter.getDocStoreOffset());
tvx.flush();
tvd.flush();
tvf.flush();
}
for (Map.Entry<TermsHashConsumerPerThread,Collection<TermsHashConsumerPerField>> entry :
threadsAndFields.entrySet()) {
for (final TermsHashConsumerPerField field : entry.getValue() ) {
TermVectorsTermsWriterPerField perField = (TermVectorsTermsWriterPerField) field;
perField.termsHashPerField.reset();
perField.shrinkHash();
}
TermVectorsTermsWriterPerThread perThread = (TermVectorsTermsWriterPerThread) entry.getKey();
perThread.termsHashPerThread.reset(true);
}
}
Step 3: Write the normalization factors (norms)
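The excerpted walkthrough stops at this heading. The endConsumer in DocInverter.flush above is NormsWriter, which produces the .nrm file: for every indexed field that keeps norms, one byte per document, namely the field's boost and length factor encoded by Similarity.encodeNorm(float). A rough sketch of what NormsWriter.flush does, assuming 2.9-era internals (paraphrased, not verbatim):

final String normsFileName = state.segmentFileName(IndexFileNames.NORMS_EXTENSION);
//record the written file name, just as the other flush methods do
state.flushedFiles.add(normsFileName);
IndexOutput normsOut = state.directory.createOutput(normsFileName);
try {
  normsOut.writeBytes(SegmentMerger.NORMS_HEADER, 0, SegmentMerger.NORMS_HEADER.length);
  //for each indexed field with norms, write Similarity.encodeNorm(boost * lengthNorm)
  //for every document, padding a default byte for documents that lack the field
  //......
} finally {
  normsOut.close();
}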