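Lucene Source Code Analysis: Deleting from the Index
This chapter covers how Lucene deletes documents from an index, focusing on IndexWriter's deleteDocuments function. The function accepts two kinds of arguments: a Term, in which case every document containing that Term is deleted, or a Query, in which case every document matched by that Query is deleted. Only the first case is covered here, starting from the example below. As a reminder, nearly all the code in these posts has been trimmed or lightly modified for readability without changing its overall behavior.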
indexWriter.deleteDocuments(new Term("id", "value"));
indexWriter.commit();
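The example first creates a Term whose field name is "id" and whose value is "value", then registers the delete through IndexWriter's deleteDocuments function, and finally calls commit, which is what actually carries out the deletion.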
IndexWriter::deleteDocuments
public void deleteDocuments(Term... terms) throws IOException {
if (docWriter.deleteTerms(terms)) {
processEvents(true, false);
}
}
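IndexWriter's member docWriter is the DocumentsWriter created in its constructor. deleteDocuments first does the main work through DocumentsWriter's deleteTerms function, then calls processEvents to handle any events raised along the way, such as ApplyDeletesEvent, MergePendingEvent, and ForcedPurgeEvent.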
1. DocumentsWriter::deleteTerms
IndexWriter::deleteDocuments->DocumentsWriter::deleteTerms
synchronized boolean deleteTerms(final Term... terms) throws IOException {
final DocumentsWriterDeleteQueue deleteQueue = this.deleteQueue;
deleteQueue.addDelete(terms);
flushControl.doOnDelete();
return applyAllDeletes(deleteQueue);
}
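DocumentsWriter's member deleteQueue is a DocumentsWriterDeleteQueue, and addDelete appends the array of Terms to be deleted to that queue. Depending on certain conditions, doOnDelete and applyAllDeletes may push the queued delete information straight into the buffer; this article does not cover that case. Ordering also matters: if a term is deleted first and a document containing it is added afterwards, the delete has no effect on the later document.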
IndexWriter::deleteDocuments->DocumentsWriter::deleteTerms->DocumentsWriterDeleteQueue::addDelete
void addDelete(Term... terms) {
add(new TermArrayNode(terms));
tryApplyGlobalSlice();
}
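The Term array is first wrapped in a TermArrayNode, which extends Node and so supports the linked-list operations.
The add function appends the node to the linked list using an atomic operation. tryApplyGlobalSlice then writes the newly added node into the globalBufferedUpdates buffer of DocumentsWriterDeleteQueue.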
IndexWriter::deleteDocuments->DocumentsWriter::deleteTerms->DocumentsWriterDeleteQueue::addDelete->add
void add(Node<?> item) {
final Node<?> currentTail = this.tail;
if (currentTail.casNext(null, item)) {
tailUpdater.compareAndSet(this, currentTail, item);
return;
}
}
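The member tail is the last node of the linked list, read here into currentTail. casNext atomically installs item as the successor of currentTail, and then the tail of DocumentsWriterDeleteQueue is moved to the newly inserted item. Note that the simplified listing attempts the CAS only once; the original code wraps this in a retry loop so that a thread that loses the race tries again.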
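The append described above can be sketched as a small self-contained program. This is only an illustration of a CAS-based tail append with a retry loop, not Lucene's implementation: the class CasAppendSketch, its Node type, and the demo method are hypothetical names, and AtomicReference stands in for the field updaters used in the listing.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a CAS-based linked-list append; not Lucene's class.
final class CasAppendSketch {
    static final class Node {
        final String item;
        final AtomicReference<Node> next = new AtomicReference<>();
        Node(String item) { this.item = item; }
    }

    private final Node head = new Node(null); // sentinel node
    private final AtomicReference<Node> tail = new AtomicReference<>(head);

    void add(String item) {
        Node node = new Node(item);
        while (true) {
            Node currentTail = tail.get();
            if (currentTail.next.compareAndSet(null, node)) {
                // Linked successfully; moving tail may fail if another thread helped.
                tail.compareAndSet(currentTail, node);
                return;
            }
            // Lost the race: help move tail forward, then retry.
            Node next = currentTail.next.get();
            if (next != null) {
                tail.compareAndSet(currentTail, next);
            }
        }
    }

    int size() {
        int count = 0;
        for (Node cur = head.next.get(); cur != null; cur = cur.next.get()) {
            count++;
        }
        return count;
    }

    // Four threads append 1000 items each; returns the final list length.
    static int demo() {
        CasAppendSketch queue = new CasAppendSketch();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    queue.add("term");
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println("nodes appended: " + demo());
    }
}
```

The retry loop is what makes the append safe under contention: a thread whose CAS fails re-reads the tail rather than dropping its node.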
IndexWriter::deleteDocuments->DocumentsWriter::deleteTerms->DocumentsWriterDeleteQueue::addDelete->tryApplyGlobalSlice
void tryApplyGlobalSlice() {
if (updateSlice(globalSlice)) {
globalSlice.apply(globalBufferedUpdates, BufferedUpdates.MAX_INT);
}
}
boolean updateSlice(DeleteSlice slice) {
if (slice.sliceTail != tail) {
slice.sliceTail = tail;
return true;
}
return false;
}
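DocumentsWriterDeleteQueue's member globalSlice is a DeleteSlice, whose members sliceHead and sliceTail point to the first and last node of the current slice. updateSlice advances sliceTail to the most recently inserted node, that is, the item appended by add above. DeleteSlice's apply function then writes the data into globalBufferedUpdates.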
IndexWriter::deleteDocuments->DocumentsWriter::deleteTerms->DocumentsWriterDeleteQueue::addDelete->tryApplyGlobalSlice->DeleteSlice::apply
void apply(BufferedUpdates del, int docIDUpto) {
Node<?> current = sliceHead;
do {
current = current.next;
current.apply(del, docIDUpto);
} while (current != sliceTail);
reset();
}
void reset() {
sliceHead = sliceTail;
}
void apply(BufferedUpdates bufferedUpdates, int docIDUpto) {
for (Term term : item) {
bufferedUpdates.addTerm(term, docIDUpto);
}
}
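apply starts from the head of the slice, sliceHead, and walks forward to sliceTail, calling each node's apply function to record the Terms to delete in del, which here is globalBufferedUpdates. Finally reset resets the slice so that its head and tail coincide.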
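The slice mechanism can be sketched in isolation as follows. This is a single-threaded illustration under simplifying assumptions, not Lucene's code: DeleteSliceSketch and its fields are hypothetical, terms are plain strings, and a LinkedHashMap plays the role of the buffered updates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical single-threaded sketch of a delete slice over a linked list.
final class DeleteSliceSketch {
    static final class Node {
        final String[] terms;
        Node next;
        Node(String... terms) { this.terms = terms; }
    }

    Node tail = new Node();   // sentinel: the list starts empty
    Node sliceHead = tail;    // everything after sliceHead is not yet applied
    Node sliceTail = tail;
    final Map<String, Integer> buffered = new LinkedHashMap<>();

    void addDelete(String... terms) {
        Node node = new Node(terms);
        tail.next = node;     // plain append; the real queue uses CAS here
        tail = node;
        if (sliceTail != tail) {          // updateSlice: new nodes arrived
            sliceTail = tail;
            apply(Integer.MAX_VALUE);
        }
    }

    void apply(int docIDUpto) {
        if (sliceHead == sliceTail) {
            return;                       // empty slice, nothing to do
        }
        Node current = sliceHead;
        do {
            current = current.next;
            for (String term : current.terms) {
                buffered.put(term, docIDUpto);
            }
        } while (current != sliceTail);
        sliceHead = sliceTail;            // reset: the slice is consumed
    }

    public static void main(String[] args) {
        DeleteSliceSketch queue = new DeleteSliceSketch();
        queue.addDelete("id:1", "id:2");
        queue.addDelete("id:3");
        System.out.println(queue.buffered.keySet());
    }
}
```

Because sliceHead always points at the last node already applied, each node is applied exactly once no matter how many times apply runs.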
2. IndexWriter::commit
As the analysis of deleteTerms showed, the Terms to delete end up in the globalBufferedUpdates of DocumentsWriter's DocumentsWriterDeleteQueue. IndexWriter's commit function now has to take that information out and perform the deletion.
IndexWriter::commit->commitInternal
public final void commit() throws IOException {
commitInternal(config.getMergePolicy());
}
private final void commitInternal(MergePolicy mergePolicy) throws IOException {
prepareCommitInternal(mergePolicy);
finishCommit();
}
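IndexWriter's member config is the IndexWriterConfig set up in its constructor. getMergePolicy returns the default merge policy, TieredMergePolicy, whose functions are covered later.
commitInternal first calls prepareCommitInternal, which does most of the work of deleting documents, and then calls finishCommit for the final cleanup. Each is examined in turn below.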
IndexWriter::commit->commitInternal->prepareCommitInternal
private void prepareCommitInternal(MergePolicy mergePolicy) throws IOException {
boolean anySegmentsFlushed = docWriter.flushAllThreads();
processEvents(false, true);
maybeApplyDeletes(true);
SegmentInfos toCommit = segmentInfos.clone();
...
if (anySegmentsFlushed) {
maybeMerge(mergePolicy, MergeTrigger.FULL_FLUSH, UNBOUNDED_MAX_MERGE_SEGMENTS);
}
startCommit(toCommit);
}
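flushAllThreads performs the main flush, processEvents handles the events the flush produces, and maybeApplyDeletes carries out the actual deletion, which ultimately creates a .liv file for every segment that has deletions. The elided part does some initialization.
The anySegmentsFlushed flag indicates that the flush produced new segments; in that case maybeMerge finds the segments that should be merged and merges them. Lucene's merge operation is covered in the next chapter.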
2.1 flushAllThreads
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads
boolean flushAllThreads() {
flushControl.markForFullFlush();
boolean anythingFlushed = false;
DocumentsWriterPerThread flushingDWPT;
while ((flushingDWPT = flushControl.nextPendingFlush()) != null) {
anythingFlushed |= doFlush(flushingDWPT);
}
ticketQueue.forcePurge(writer);
return anythingFlushed;
}
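DocumentsWriter's member flushControl is a DocumentsWriterFlushControl. Its markForFullFlush function selects the matching DocumentsWriterPerThread instances from the thread pool and adds them to DocumentsWriterFlushControl's flushQueue to await flushing.
flushAllThreads then uses nextPendingFlush to walk that flushQueue, taking out each DocumentsWriterPerThread added by markForFullFlush and passing it to doFlush.
doFlush is the main flush function. It writes the added documents to Lucene's index files, and it also wraps the delete information associated with the DocumentsWriterPerThread in a SegmentFlushTicket and adds it to the DocumentsWriterFlushQueue for later processing.
Finally, forcePurge takes each SegmentFlushTicket out of the DocumentsWriterFlushQueue and adds it to IndexWriter's BufferedUpdatesStream, where it waits for final processing.
Passing the delete information back and forth between these classes and structures is deliberate: a flush adds and updates segments, which can easily conflict with delete operations.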
2.1.1 markForFullFlush
The main job of markForFullFlush is to pick, from the DocumentsWriterPerThreadPool, the ThreadStates that belong to the current DocumentsWriterDeleteQueue and add their DocumentsWriterPerThread instances to flushQueue for processing.
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->DocumentsWriterFlushControl::markForFullFlush
void markForFullFlush() {
final DocumentsWriterDeleteQueue flushingQueue = documentsWriter.deleteQueue;
DocumentsWriterDeleteQueue newQueue = new DocumentsWriterDeleteQueue(flushingQueue.generation+1);
documentsWriter.deleteQueue = newQueue;
final int limit = perThreadPool.getActiveThreadStateCount();
for (int i = 0; i < limit; i++) {
final ThreadState next = perThreadPool.getThreadState(i);
if (next.dwpt.deleteQueue != flushingQueue) {
continue;
}
addFlushableState(next);
}
flushQueue.addAll(fullFlushBuffer);
fullFlushBuffer.clear();
}
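markForFullFlush first grabs the existing DocumentsWriterDeleteQueue, which holds the Terms to delete, then creates a new DocumentsWriterDeleteQueue with the next generation and installs it as DocumentsWriter's deleteQueue.
The member perThreadPool is by default a DocumentsWriterPerThreadPool. getActiveThreadStateCount returns the number of active ThreadStates in the pool, and getThreadState fetches each one. A ThreadState belongs to this flush only if the deleteQueue of its dwpt (its DocumentsWriterPerThread) is the queue obtained above; otherwise the loop moves on to the next ThreadState. For a match, addFlushableState adds the DocumentsWriterPerThread to DocumentsWriterFlushControl's fullFlushBuffer and flushingWriters and resets the ThreadState.
Finally, every DocumentsWriterPerThread just collected in fullFlushBuffer is added to flushQueue, and fullFlushBuffer is cleared.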
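The selection logic can be sketched on its own. This is a simplified model under stated assumptions, not Lucene's code: FullFlushSketch, DeleteQueue, Writer, and addState are hypothetical names, the intermediate fullFlushBuffer is folded into flushQueue, and there is no locking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of swapping the delete queue and selecting writers to flush.
final class FullFlushSketch {
    static final class DeleteQueue {
        final long generation;
        DeleteQueue(long generation) { this.generation = generation; }
    }

    static final class Writer {
        final DeleteQueue deleteQueue;
        final int numDocsInRAM;
        Writer(DeleteQueue deleteQueue, int numDocsInRAM) {
            this.deleteQueue = deleteQueue;
            this.numDocsInRAM = numDocsInRAM;
        }
    }

    static final class ThreadState {
        Writer dwpt;
    }

    DeleteQueue deleteQueue = new DeleteQueue(0);
    final List<ThreadState> pool = new ArrayList<>();
    final List<Writer> flushQueue = new ArrayList<>();

    // Registers a thread state whose writer is bound to the current delete queue.
    ThreadState addState(int numDocsInRAM) {
        ThreadState state = new ThreadState();
        state.dwpt = new Writer(deleteQueue, numDocsInRAM);
        pool.add(state);
        return state;
    }

    void markForFullFlush() {
        DeleteQueue flushingQueue = deleteQueue;
        // New deletes go to a fresh queue from now on.
        deleteQueue = new DeleteQueue(flushingQueue.generation + 1);
        for (ThreadState state : pool) {
            Writer writer = state.dwpt;
            if (writer == null || writer.deleteQueue != flushingQueue) {
                continue;                 // not part of this flush
            }
            if (writer.numDocsInRAM > 0) {
                state.dwpt = null;        // reset the state (check the writer out)
                flushQueue.add(writer);
            }
        }
    }

    public static void main(String[] args) {
        FullFlushSketch control = new FullFlushSketch();
        control.addState(3);
        control.addState(0);
        control.markForFullFlush();
        System.out.println("writers queued: " + control.flushQueue.size());
    }
}
```

Comparing queue identity rather than contents is the point: a writer created after the swap is bound to the new queue and is left alone by the flush in progress.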
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->DocumentsWriterFlushControl::markForFullFlush->addFlushableState
void addFlushableState(ThreadState perThread) {
final DocumentsWriterPerThread dwpt = perThread.dwpt;
if (dwpt.getNumDocsInRAM() > 0) {
if (!perThread.flushPending) {
setFlushPending(perThread);
}
final DocumentsWriterPerThread flushingDWPT = internalTryCheckOutForFlush(perThread);
fullFlushBuffer.add(flushingDWPT);
}
}
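If the DocumentsWriterPerThread holds any documents in memory, that is, getNumDocsInRAM() is greater than 0, setFlushPending sets the ThreadState's flushPending to true (unless it is already set) to mark the DocumentsWriterPerThread as needing a flush.
internalTryCheckOutForFlush stores the ThreadState's DocumentsWriterPerThread in the member flushingWriters and returns it, resetting the ThreadState in the process.
Finally the DocumentsWriterPerThread is added to fullFlushBuffer, which is a list of DocumentsWriterPerThread.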
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->DocumentsWriterFlushControl::markForFullFlush->addFlushableState->internalTryCheckOutForFlush
private DocumentsWriterPerThread internalTryCheckOutForFlush(ThreadState perThread) {
final long bytes = perThread.bytesUsed;
DocumentsWriterPerThread dwpt = perThreadPool.reset(perThread);
flushingWriters.put(dwpt, Long.valueOf(bytes));
return dwpt;
}
DocumentsWriterPerThread reset(ThreadState threadState) {
final DocumentsWriterPerThread dwpt = threadState.dwpt;
threadState.reset();
return dwpt;
}
internalTryCheckOutForFlush calls DocumentsWriterPerThreadPool's reset function, which resets the ThreadState and returns the DocumentsWriterPerThread it held. That DocumentsWriterPerThread is then recorded in flushingWriters, a map from each writer to the number of bytes it was using.
2.1.2 doFlush
doFlush was covered in detail in part 5 of this series. It has no direct connection to deletion, so for completeness only the parts relevant to this chapter are shown here.
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->doFlush
private boolean doFlush(DocumentsWriterPerThread flushingDWPT) throws IOException, AbortingException {
boolean hasEvents = false;
while (flushingDWPT != null) {
hasEvents = true;
SegmentFlushTicket ticket = ticketQueue.addFlushTicket(flushingDWPT);
final int flushingDocsInRam = flushingDWPT.getNumDocsInRAM();
final FlushedSegment newSegment = flushingDWPT.flush();
ticketQueue.addSegment(ticket, newSegment);
subtractFlushedNumDocs(flushingDocsInRam);
flushControl.doAfterFlush(flushingDWPT);
flushingDWPT = flushControl.nextPendingFlush();
}
return hasEvents;
}
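ticketQueue is a DocumentsWriterFlushQueue, the queue of pending flushes. Its addFlushTicket function wraps the pending deletes of the DocumentsWriterPerThread in a SegmentFlushTicket, stores it in the queue list, and returns it.
getNumDocsInRAM returns the number of documents held by the DocumentsWriterPerThread, that is, the newly added documents; the counter is incremented in its finishDocument function.
DocumentsWriterPerThread's flush function does the main flush work: it writes the updated document data to Lucene's index files and returns the newly created segment as a FlushedSegment.
addSegment attaches the new FlushedSegment to the SegmentFlushTicket.
subtractFlushedNumDocs subtracts the number of documents just written by flush from the count of documents awaiting flush, updating the numDocsInRAM variable.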
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->doFlush->DocumentsWriterFlushQueue::addFlushTicket
synchronized SegmentFlushTicket addFlushTicket(DocumentsWriterPerThread dwpt) {
final SegmentFlushTicket ticket = new SegmentFlushTicket(dwpt.prepareFlush());
queue.add(ticket);
return ticket;
}
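DocumentsWriterPerThread's prepareFlush function wraps the pending delete information in a FrozenBufferedUpdates and returns it. That FrozenBufferedUpdates is then wrapped in a SegmentFlushTicket, which is added to the queue list and returned.
FrozenBufferedUpdates is a more compact, read-only representation of the delete information.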
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->doFlush->DocumentsWriterFlushQueue::addFlushTicket->DocumentsWriterPerThread::prepareFlush
FrozenBufferedUpdates prepareFlush() {
final FrozenBufferedUpdates globalUpdates = deleteQueue.freezeGlobalBuffer(deleteSlice);
return globalUpdates;
}
FrozenBufferedUpdates freezeGlobalBuffer(DeleteSlice callerSlice) {
...
final FrozenBufferedUpdates packet = new FrozenBufferedUpdates(globalBufferedUpdates, false);
globalBufferedUpdates.clear();
return packet;
}
The main job of freezeGlobalBuffer is to wrap globalBufferedUpdates, the delete information held by the DocumentsWriterDeleteQueue, in a FrozenBufferedUpdates, clear globalBufferedUpdates, and return the frozen packet.
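The freeze step amounts to "snapshot, then clear". A minimal sketch follows, assuming string terms and a sorted array as the frozen form; FreezeSketch and its methods are hypothetical and do not mirror the real FrozenBufferedUpdates layout.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of freezing a mutable delete buffer into a read-only snapshot.
final class FreezeSketch {
    private final Map<String, Integer> globalBufferedUpdates = new HashMap<>();

    void addTerm(String term, int docIDUpto) {
        globalBufferedUpdates.put(term, docIDUpto);
    }

    boolean any() {
        return !globalBufferedUpdates.isEmpty();
    }

    // Copies the buffered terms into a sorted array, then empties the buffer.
    String[] freeze() {
        String[] terms = globalBufferedUpdates.keySet().toArray(new String[0]);
        Arrays.sort(terms); // sorted order is this sketch's choice, handy for later merging
        globalBufferedUpdates.clear();
        return terms;
    }

    public static void main(String[] args) {
        FreezeSketch buffer = new FreezeSketch();
        buffer.addTerm("id:b", Integer.MAX_VALUE);
        buffer.addTerm("id:a", Integer.MAX_VALUE);
        System.out.println(Arrays.toString(buffer.freeze()));
    }
}
```

After freeze returns, the buffer is empty and can keep accepting new deletes while the snapshot travels through the flush queue.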
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->doFlush->DocumentsWriterPerThread::flush
FlushedSegment flush() throws IOException, AbortingException {
segmentInfo.setMaxDoc(numDocsInRAM);
final SegmentWriteState flushState = new SegmentWriteState(infoStream, directory, segmentInfo, fieldInfos.finish(), pendingUpdates, new IOContext(new FlushInfo(numDocsInRAM, bytesUsed())));
final double startMBUsed = bytesUsed() / 1024. / 1024.;
consumer.flush(flushState);
pendingUpdates.terms.clear();
segmentInfo.setFiles(new HashSet<>(directory.getCreatedFiles()));
final SegmentCommitInfo segmentInfoPerCommit = new SegmentCommitInfo(segmentInfo, 0, -1L, -1L, -1L);
pendingUpdates.clear();
BufferedUpdates segmentDeletes= null;
FlushedSegment fs = new FlushedSegment(segmentInfoPerCommit, flushState.fieldInfos, segmentDeletes, flushState.liveDocs, flushState.delCountOnFlush);
sealFlushedSegment(fs);
return fs;
}
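flush first records the number of documents in segmentInfo. FieldInfos.Builder's finish function returns a FieldInfos that holds the information for every Field. Next a FlushInfo is created from the document count and the bytes used, then an IOContext, and from those the SegmentWriteState.
The member consumer is a DefaultIndexingChain, whose flush function ultimately writes to the .doc, .tim, .nvd, .pos, .fdx, .nvm, .fnm, .fdt, and .tip files.
Once the data is written, the terms in pendingUpdates are cleared, the names of the newly created files are set on segmentInfo, and pendingUpdates itself is cleared.
Finally flush creates the FlushedSegment and calls sealFlushedSegment, which creates the new .si file and writes the segment information to it.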
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->doFlush->DocumentsWriterFlushControl::doAfterFlush
synchronized void doAfterFlush(DocumentsWriterPerThread dwpt) {
Long bytes = flushingWriters.remove(dwpt);
flushBytes -= bytes.longValue();
perThreadPool.recycle(dwpt);
}
doAfterFlush removes the DocumentsWriterPerThread that was just processed from flushingWriters and subtracts the flushed bytes from flushBytes. recycle then returns the DocumentsWriterPerThread to the pool for reuse.
2.1.3 forcePurge
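Back in flushAllThreads: markForFullFlush adds the pool's DocumentsWriterPerThread instances to flushQueue; doFlush processes each of them in turn, flushing the updated documents to the index files and wrapping the delete information from the DocumentsWriterDeleteQueue in a SegmentFlushTicket that goes into the DocumentsWriterFlushQueue. The forcePurge function analyzed in this section finally moves the delete information from the DocumentsWriterFlushQueue into IndexWriter's BufferedUpdatesStream, where it waits for final processing.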
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->DocumentsWriterFlushQueue::forcePurge
int forcePurge(IndexWriter writer) throws IOException {
return innerPurge(writer);
}
private int innerPurge(IndexWriter writer) throws IOException {
int numPurged = 0;
while (true) {
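final FlushTicket head = queue.peek();
if (head == null) {
// queue drained: without this exit the simplified loop would never terminate
break;
}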
numPurged++;
head.publish(writer);
queue.poll();
ticketCount.decrementAndGet();
}
return numPurged;
}
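doFlush wrapped the Terms to delete in SegmentFlushTickets and added them to the queue member of DocumentsWriterFlushQueue. innerPurge takes each SegmentFlushTicket from that queue in order and calls its publish function, which writes it into IndexWriter's BufferedUpdatesStream. After a successful publish, poll removes the ticket from the queue.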
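The in-order publishing can be sketched as a FIFO of tickets where only the head may be published. This is a simplified single-threaded model, not Lucene's code: TicketQueueSketch, its Ticket type, and the string payloads are hypothetical, and the rule that a ticket is publishable once its segment has been attached is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: tickets are published strictly in the order they were created.
final class TicketQueueSketch {
    static final class Ticket {
        final String frozenDeletes;
        String segment; // attached once the flush has finished

        Ticket(String frozenDeletes) { this.frozenDeletes = frozenDeletes; }

        boolean canPublish() { return segment != null; }
    }

    private final Queue<Ticket> queue = new ArrayDeque<>();
    final List<String> published = new ArrayList<>();

    Ticket addFlushTicket(String frozenDeletes) {
        Ticket ticket = new Ticket(frozenDeletes);
        queue.add(ticket);
        return ticket;
    }

    void addSegment(Ticket ticket, String segment) {
        ticket.segment = segment;
    }

    int purge() {
        int numPurged = 0;
        while (true) {
            Ticket head = queue.peek();
            if (head == null || !head.canPublish()) {
                break; // empty queue, or the oldest flush is still running
            }
            published.add(head.segment + "+" + head.frozenDeletes);
            queue.poll();
            numPurged++;
        }
        return numPurged;
    }

    public static void main(String[] args) {
        TicketQueueSketch tickets = new TicketQueueSketch();
        Ticket first = tickets.addFlushTicket("deletes-1");
        tickets.addSegment(first, "_0");
        System.out.println("purged: " + tickets.purge());
    }
}
```

Stopping at the first unpublishable head keeps segments and their delete packets in the order the flushes started, even if a later flush finishes first.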
IndexWriter::commit->commitInternal->prepareCommitInternal->DocumentsWriter::flushAllThreads->DocumentsWriterFlushQueue::forcePurge->innerPurge->SegmentFlushTicket::publish
protected void publish(IndexWriter writer) throws IOException {
finishFlush(writer, segment, frozenUpdates);
}
protected final void finishFlush(IndexWriter indexWriter, FlushedSegment newSegment, FrozenBufferedUpdates bufferedUpdates) throws IOException {
publishFlushedSegment(indexWriter, newSegment, bufferedUpdates);
}
protected final void publishFlushedSegment(IndexWriter indexWriter, FlushedSegment newSegment, FrozenBufferedUpdates globalPacket) throws IOException {
indexWriter.publishFlushedSegment(newSegment.segmentInfo, segmentUpdates, globalPacket);
}
void publishFlushedSegment(SegmentCommitInfo newSegment, FrozenBufferedUpdates packet, FrozenBufferedUpdates globalPacket) throws IOException {
bufferedUpdatesStream.push(globalPacket);
nextGen = bufferedUpdatesStream.getNextGen();
newSegment.setBufferedDeletesGen(nextGen);
segmentInfos.add(newSegment);
checkpoint();
}
public synchronized long push(FrozenBufferedUpdates packet) {
packet.setDelGen(nextGen++);
updates.add(packet);
numTerms.addAndGet(packet.numTermDeletes);
bytesUsed.addAndGet(packet.bytesUsed);
return packet.delGen();
}
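SegmentFlushTicket's publish function ends up calling IndexWriter's publishFlushedSegment. The first argument is the new segment, the second is null for Term deletes and can be ignored, and the last is the FrozenBufferedUpdates created earlier, which holds the Term delete information.
publishFlushedSegment calls BufferedUpdatesStream's push function with that packet (the FrozenBufferedUpdates); push assigns it a delGen, appends it to the updates list of BufferedUpdatesStream, and updates the associated counters.
publishFlushedSegment then adds the new segment newSegment to SegmentInfos, and finally calls checkpoint, which updates file reference counts and deletes files when necessary. That function is analyzed at the end of this chapter.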
2.2 processEvents
Back in prepareCommitInternal, here is a brief look at processEvents. flushAllThreads can raise various events, and processEvents drains them and performs the corresponding actions.
IndexWriter::commit->commitInternal->prepareCommitInternal->processEvents
private boolean processEvents(boolean triggerMerge, boolean forcePurge) throws IOException {
return processEvents(eventQueue, triggerMerge, forcePurge);
}
private boolean processEvents(Queue<Event> queue, boolean triggerMerge, boolean forcePurge) throws IOException {
boolean processed = false;
if (tragedy == null) {
Event event;
while((event = queue.poll()) != null) {
processed = true;
event.process(this, triggerMerge, forcePurge);
}
}
return processed;
}
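The events Lucene implements by default include ApplyDeletesEvent, MergePendingEvent, and ForcedPurgeEvent. processEvents takes them from the event queue one by one and calls each event's process function. They are not directly related to this chapter, so they are not followed any further here.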
2.3 maybeApplyDeletes
maybeApplyDeletes is the main function through which Lucene performs deletions, so it is analyzed in detail below.
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes
final synchronized boolean maybeApplyDeletes(boolean applyAllDeletes) throws IOException {
return applyAllDeletesAndUpdates();
}
final synchronized boolean applyAllDeletesAndUpdates() throws IOException {
final BufferedUpdatesStream.ApplyDeletesResult result;
result = bufferedUpdatesStream.applyDeletesAndUpdates(readerPool, segmentInfos.asList());
if (result.anyDeletes) {
checkpoint();
}
if (!keepFullyDeletedSegments && result.allDeleted != null) {
for (SegmentCommitInfo info : result.allDeleted) {
if (!mergingSegments.contains(info)) {
segmentInfos.remove(info);
pendingNumDocs.addAndGet(-info.info.maxDoc());
readerPool.drop(info);
}
}
checkpoint();
}
bufferedUpdatesStream.prune(segmentInfos);
return result.anyDeletes;
}
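BufferedUpdatesStream's applyDeletesAndUpdates does the main deletion work, ultimately marking the deleted document IDs in the .liv file of the corresponding segment.
If any document was deleted, checkpoint is called to update the reference counts of the affected files; a file whose reference count reaches 0 is deleted.
The keepFullyDeletedSegments flag says whether a segment should be kept once all of its documents have been deleted. If it is false and some segments are now fully deleted, the loop goes over them: each one not currently being merged is removed from segmentInfos, its maxDoc is subtracted from pendingNumDocs, and it is dropped from the ReaderPool.
Finally, BufferedUpdatesStream's prune function finishes up by discarding the FrozenBufferedUpdates packets that are no longer needed.
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes->applyAllDeletesAndUpdates->BufferedUpdatesStream::applyDeletesAndUpdates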
public synchronized ApplyDeletesResult applyDeletesAndUpdates(IndexWriter.ReaderPool pool, List<SegmentCommitInfo> infos) throws IOException {
SegmentState[] segStates = null;
long totDelCount = 0;
long totTermVisitedCount = 0;
boolean success = false;
ApplyDeletesResult result = null;
infos = sortByDelGen(infos);
CoalescedUpdates coalescedUpdates = null;
int infosIDX = infos.size()-1;
int delIDX = updates.size()-1;
while (infosIDX >= 0) {
final FrozenBufferedUpdates packet = delIDX >= 0 ? updates.get(delIDX) : null;
final SegmentCommitInfo info = infos.get(infosIDX);
final long segGen = info.getBufferedDeletesGen();
if (packet != null && segGen < packet.delGen()) {
if (!packet.isSegmentPrivate && packet.any()) {
if (coalescedUpdates == null) {
coalescedUpdates = new CoalescedUpdates();
}
coalescedUpdates.update(packet);
}
delIDX--;
} else if (packet != null && segGen == packet.delGen()) {
...
} else {
if (coalescedUpdates != null) {
segStates = openSegmentStates(pool, infos);
SegmentState segState = segStates[infosIDX];
int delCount = 0;
delCount += applyQueryDeletes(coalescedUpdates.queriesIterable(), segState);
DocValuesFieldUpdates.Container dvUpdates = new DocValuesFieldUpdates.Container();
applyDocValuesUpdatesList(coalescedUpdates.numericDVUpdates, segState, dvUpdates);
applyDocValuesUpdatesList(coalescedUpdates.binaryDVUpdates, segState, dvUpdates);
if (dvUpdates.any()) {
segState.rld.writeFieldUpdates(info.info.dir, dvUpdates);
}
totDelCount += delCount;
}
infosIDX--;
}
}
if (coalescedUpdates != null && coalescedUpdates.totalTermCount != 0) {
if (segStates == null) {
segStates = openSegmentStates(pool, infos);
}
totTermVisitedCount += applyTermDeletes(coalescedUpdates, segStates);
}
result = closeSegmentStates(pool, segStates, success, gen);
return result;
}
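applyDeletesAndUpdates loops over all segments, from the newest to the oldest. In each iteration it takes from the updates list a FrozenBufferedUpdates that forcePurge added, takes the segment's SegmentCommitInfo, and reads the segment's bufferedDeletesGen through getBufferedDeletesGen. That value records the order of operations in time and is called the generation in what follows.
There are three branches. The first if handles the case where a FrozenBufferedUpdates packet exists and the segment's generation is lower than the packet's, meaning the packet applies to this segment. A CoalescedUpdates is created (if needed) to collect the delete information, since some deletes are by Term and some by Query, and the packet is merged into it.
The second branch handles equal generations and is not considered here.
The third branch (the else) handles a segment to which the collected deletes apply. openSegmentStates wraps the segment information in SegmentState objects, applyQueryDeletes applies the deletes specified by Query, and applyDocValuesUpdatesList checks for doc-values updates and, if there are any, writes them with writeFieldUpdates. This walkthrough assumes there are none.
After the loop, coalescedUpdates holds the Terms to delete. If it is non-null and contains at least one term, applyTermDeletes performs the deletion.
When deletion is finished, closeSegmentStates determines whether every document of some segment has been deleted and returns the result as an ApplyDeletesResult. It also calls SegmentState's finish function, which writes the marks set in applyTermDeletes out to the .liv file; that path is not followed here.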
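The generation comparison in the loop can be illustrated with plain numbers. The sketch below is an assumption-laden simplification, not Lucene's code: DelGenWalkSketch and packetsApplying are hypothetical, both inputs are assumed sorted ascending, and every packet is treated as a global (non-private) packet. It reports, for each segment, how many packets are newer than the segment and therefore apply to it.

```java
import java.util.Arrays;

// Hypothetical sketch of the backwards walk over segments and delete packets.
final class DelGenWalkSketch {
    // segGens and packetGens must be sorted ascending.
    // result[i] = number of packets whose delGen is greater than segGens[i].
    static int[] packetsApplying(long[] segGens, long[] packetGens) {
        int[] result = new int[segGens.length];
        int coalesced = 0; // packets collected so far, all newer than the current segment
        int infosIDX = segGens.length - 1;
        int delIDX = packetGens.length - 1;
        while (infosIDX >= 0) {
            if (delIDX >= 0 && segGens[infosIDX] < packetGens[delIDX]) {
                coalesced++;   // packet is newer: it applies to this and all older segments
                delIDX--;
            } else if (delIDX >= 0 && segGens[infosIDX] == packetGens[delIDX]) {
                // Equal generations (segment-private packet): not modelled here.
                result[infosIDX] = coalesced;
                infosIDX--;
                delIDX--;
            } else {
                result[infosIDX] = coalesced; // apply everything collected so far
                infosIDX--;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] segGens = {2, 5};
        long[] packetGens = {1, 4, 6};
        System.out.println(Arrays.toString(packetsApplying(segGens, packetGens)));
    }
}
```

Walking from newest to oldest lets the collected set only grow, so each packet is examined once regardless of how many segments it affects.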
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes->applyAllDeletesAndUpdates->BufferedUpdatesStream::applyDeletesAndUpdates->CoalescedUpdates::update
void update(FrozenBufferedUpdates in) {
totalTermCount += in.terms.size();
terms.add(in.terms);
...
}
CoalescedUpdates's update function collects the delete information; deletes by Term are simply added to the member list terms.
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes->applyAllDeletesAndUpdates->BufferedUpdatesStream::applyDeletesAndUpdates->applyTermDeletes
private synchronized long applyTermDeletes(CoalescedUpdates updates, SegmentState[] segStates) throws IOException {
long startNS = System.nanoTime();
int numReaders = segStates.length;
long delTermVisitedCount = 0;
long segTermVisitedCount = 0;
FieldTermIterator iter = updates.termIterator();
String field = null;
SegmentQueue queue = null;
BytesRef term;
while ((term = iter.next()) != null) {
if (iter.field() != field) {
field = iter.field();
queue = new SegmentQueue(numReaders);
long segTermCount = 0;
for(int i=0;i<numReaders;i++) {
SegmentState state = segStates[i];
Terms terms = state.reader.fields().terms(field);
if (terms != null) {
segTermCount += terms.size();
state.termsEnum = terms.iterator();
state.term = state.termsEnum.next();
if (state.term != null) {
queue.add(state);
}
}
}
}
delTermVisitedCount++;
long delGen = iter.delGen();
while (queue.size() != 0) {
SegmentState state = queue.top();
segTermVisitedCount++;
int cmp = term.compareTo(state.term);
if (cmp < 0) {
break;
} else if (cmp == 0) {
} else {
TermsEnum.SeekStatus status = state.termsEnum.seekCeil(term);
if (status == TermsEnum.SeekStatus.FOUND) {
} else {
if (status == TermsEnum.SeekStatus.NOT_FOUND) {
state.term = state.termsEnum.term();
queue.updateTop();
} else {
queue.pop();
}
continue;
}
}
if (state.delGen < delGen) {
final Bits acceptDocs = state.rld.getLiveDocs();
state.postingsEnum = state.termsEnum.postings(state.postingsEnum, PostingsEnum.NONE);
while (true) {
final int docID = state.postingsEnum.nextDoc();
if (docID == DocIdSetIterator.NO_MORE_DOCS) {
break;
}
if (acceptDocs != null && acceptDocs.get(docID) == false) {
continue;
}
if (!state.any) {
state.rld.initWritableLiveDocs();
state.any = true;
}
state.rld.delete(docID);
}
}
state.term = state.termsEnum.next();
if (state.term == null) {
queue.pop();
} else {
queue.updateTop();
}
}
}
return delTermVisitedCount;
}
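applyTermDeletes performs the actual Term deletion. It first obtains a FieldTermIterator over the delete terms through termIterator.
The first if fires when the field changes. The field of the current delete Term is read through field(), a SegmentQueue is created to keep the segments ordered by their current term, and for each segment the Terms of that field (a FieldReader) is fetched; if the segment has at least one term in the field, its SegmentState is added to the SegmentQueue.
Next, compareTo compares the Term to delete (term) with the current term of the segment at the top of the queue (state.term). Terms within a segment are sorted, so cmp less than 0 means no remaining segment can contain the term, and the loop breaks to move on to the next delete term. If the two are equal, nothing special is needed and execution continues to the next step. If cmp is greater than 0, seekCeil searches further in that segment: if the term is found, execution also continues to the next step; if it is not found but a larger term exists, state.term is updated and updateTop reorders the SegmentQueue before the loop continues; otherwise the segment has run out of terms and is popped from the queue.
Reaching the second if means the term was found in this segment. If the segment's generation state.delGen is lower than the delete Term's delGen, the documents were indexed before the delete was issued and must be deleted. The inner while loop reads the ID of each document containing the term, skips documents that are already deleted, and records the others as deleted in the segment's live docs. If that structure has not been initialized yet, initWritableLiveDocs is called first; after that, delete marks the document as deleted in the in-memory buffer that backs the .liv file.
After that, the segment is advanced to its next term. If it has none left, pop removes it from the queue; otherwise updateTop reorders the queue for the next lookup. The inner while loop ends once every segment has been removed from the queue.
Finally the function returns delTermVisitedCount, the number of delete terms visited.
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes->applyAllDeletesAndUpdates->BufferedUpdatesStream::applyDeletesAndUpdates->applyTermDeletes->ReadersAndUpdates::initWritableLiveDocs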
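The core idea, matching sorted delete terms against each segment's sorted term dictionary and clearing live bits for segments older than the delete, can be sketched without Lucene. This is a deliberately reduced model, not the real algorithm: TermDeleteSketch and Segment are hypothetical, a TreeMap stands in for the term dictionary, ceilingKey stands in for seekCeil, and the SegmentQueue that merges all segments in one pass is replaced by a plain loop per segment.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of applying term deletes segment by segment.
final class TermDeleteSketch {
    static final class Segment {
        final long delGen;
        final TreeMap<String, int[]> postings = new TreeMap<>(); // term -> doc IDs
        final BitSet liveDocs;

        Segment(long delGen, int maxDoc) {
            this.delGen = delGen;
            this.liveDocs = new BitSet(maxDoc);
            this.liveDocs.set(0, maxDoc); // every document starts out live
        }

        Segment add(String term, int... docIDs) {
            postings.put(term, docIDs);
            return this;
        }
    }

    // sortedTerms must be ascending. Returns the number of documents newly deleted.
    static int applyTermDeletes(List<String> sortedTerms, long termDelGen, List<Segment> segments) {
        int deleted = 0;
        for (Segment segment : segments) {
            if (segment.delGen >= termDelGen) {
                continue; // segment is not older than the deletes: leave it alone
            }
            for (String term : sortedTerms) {
                String ceil = segment.postings.ceilingKey(term);
                if (ceil == null) {
                    break;    // past the last term of this segment
                }
                if (!ceil.equals(term)) {
                    continue; // term absent from this segment
                }
                for (int docID : segment.postings.get(term)) {
                    if (segment.liveDocs.get(docID)) {
                        segment.liveDocs.clear(docID);
                        deleted++;
                    }
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        Segment older = new Segment(1, 3).add("id:a", 0, 2).add("id:b", 1);
        Segment newer = new Segment(5, 2).add("id:a", 0);
        int deleted = applyTermDeletes(Arrays.asList("id:a", "id:z"), 3, Arrays.asList(older, newer));
        System.out.println("documents deleted: " + deleted);
    }
}
```

The generation check is what implements the ordering rule from the start of the chapter: a document added after a delete lives in a segment with a higher generation and is never touched by it.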
public synchronized void initWritableLiveDocs() throws IOException {
if (liveDocsShared) {
LiveDocsFormat liveDocsFormat = info.info.getCodec().liveDocsFormat();
if (liveDocs == null) {
liveDocs = liveDocsFormat.newLiveDocs(info.info.maxDoc());
} else {
liveDocs = liveDocsFormat.newLiveDocs(liveDocs);
}
liveDocsShared = false;
}
}
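liveDocsFormat returns a Lucene50LiveDocsFormat, whose newLiveDocs function performs the initialization and returns a FixedBitSet. That structure uses one bit per document ID to record whether the document is still live, that is, not deleted.
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes->applyAllDeletesAndUpdates->BufferedUpdatesStream::applyDeletesAndUpdates->applyTermDeletes->ReadersAndUpdates::delete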
public synchronized boolean delete(int docID) {
final boolean didDelete = liveDocs.get(docID);
if (didDelete) {
((MutableBits) liveDocs).clear(docID);
pendingDeleteCount++;
}
return didDelete;
}
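delete marks the document as deleted by clearing the bit for its ID in the FixedBitSet created above. Note the polarity: a bit set to 1 means the document is kept, so deleting a document clears its bit to 0, and delete returns true only if the document was still live.
IndexWriter::commit->commitInternal->prepareCommitInternal->maybeApplyDeletes->applyAllDeletesAndUpdates->BufferedUpdatesStream::applyDeletesAndUpdates->closeSegmentStates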
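The bit polarity is easy to get backwards, so here is a minimal sketch using java.util.BitSet in place of FixedBitSet. LiveDocsSketch and its method names are hypothetical; only the clear-on-delete behaviour is taken from the listing above.

```java
import java.util.BitSet;

// Hypothetical sketch: bit = 1 means the document is live, delete clears the bit.
final class LiveDocsSketch {
    private final BitSet liveDocs;
    private final int maxDoc;
    private int pendingDeleteCount;

    LiveDocsSketch(int maxDoc) {
        this.maxDoc = maxDoc;
        this.liveDocs = new BitSet(maxDoc);
        this.liveDocs.set(0, maxDoc); // all documents start out live
    }

    // Returns true only if the document was live before this call.
    boolean delete(int docID) {
        boolean didDelete = liveDocs.get(docID);
        if (didDelete) {
            liveDocs.clear(docID);
            pendingDeleteCount++;
        }
        return didDelete;
    }

    boolean isLive(int docID) {
        return liveDocs.get(docID);
    }

    int pendingDeleteCount() {
        return pendingDeleteCount;
    }

    boolean allDeleted() {
        return pendingDeleteCount == maxDoc;
    }

    public static void main(String[] args) {
        LiveDocsSketch docs = new LiveDocsSketch(3);
        docs.delete(1);
        System.out.println("doc 1 live? " + docs.isLive(1));
    }
}
```

Deleting the same document twice is harmless: the second call finds the bit already cleared and leaves the pending count unchanged.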
private ApplyDeletesResult closeSegmentStates(IndexWriter.ReaderPool pool, SegmentState[] segStates, boolean success, long gen) throws IOException {
int numReaders = segStates.length;
Throwable firstExc = null;
List<SegmentCommitInfo> allDeleted = new ArrayList<>();
long totDelCount = 0;
for (int j=0;j<numReaders;j++) {
SegmentState segState = segStates[j];
totDelCount += segState.rld.getPendingDeleteCount() - segState.startDelCount;
segState.reader.getSegmentInfo().setBufferedDeletesGen(gen);
int fullDelCount = segState.rld.info.getDelCount() + segState.rld.getPendingDeleteCount();
if (fullDelCount == segState.rld.info.info.maxDoc()) {
allDeleted.add(segState.reader.getSegmentInfo());
}
segStates[j].finish(pool);
}
return new ApplyDeletesResult(totDelCount > 0, gen, allDeleted);
}
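After the deletes have been applied, closeSegmentStates checks whether every document of any segment has been deleted; such segments are added to the allDeleted list, which is returned inside the ApplyDeletesResult. The function also goes over every segment and calls finish on the SegmentState created earlier, which writes the corresponding FixedBitSet out to the .liv file. That part involves the file format and is not followed here.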
2.4 startCommit
IndexWriter::commit->commitInternal->prepareCommitInternal->startCommit
private void startCommit(final SegmentInfos toSync) throws IOException {
...
toSync.prepareCommit(directory);
...
filesToSync = toSync.files(false);
directory.sync(filesToSync);
...
}
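By this point some segments have been newly created by the flush, some have had new data generated for them, and some have been merged, so the segments file now has to be updated.
First SegmentInfos's prepareCommit function creates the pending_segments file and writes the basic information to it.
Then files returns all the files that make up the index after these changes, and sync forces them to disk.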
2.5 finishCommit
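At this point the main deletion work is done: the deleted document IDs have been marked in the .liv files of the corresponding segments. finishCommit takes care of the remaining cleanup.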
IndexWriter::commit->commitInternal->finishCommit
private final void finishCommit() throws IOException {
pendingCommit.finishCommit(directory);
deleter.checkpoint(pendingCommit, true);
segmentInfos.updateGeneration(pendingCommit);
rollbackSegments = pendingCommit.createBackupSegmentInfos();
}
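finishCommit renames the temporary segments file created earlier. The member deleter is an IndexFileDeleter, whose checkpoint function checks for files that can now be deleted and deletes them.
updateGeneration updates the generation of the segment infos, and createBackupSegmentInfos saves a copy of the latest segment information in rollbackSegments.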
IndexWriter::commit->commitInternal->finishCommit->SegmentInfos::finishCommit
final String finishCommit(Directory dir) throws IOException {
final String src = IndexFileNames.fileNameFromGeneration(IndexFileNames.PENDING_SEGMENTS, "", generation);
String dest = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", generation);
dir.renameFile(src, dest);
pendingCommit = false;
lastGeneration = generation;
return dest;
}
This finishCommit function renames the temporary segments file in the index directory to the final segments file, for example pending_segments_1 to segments_1.
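The rename step can be tried out with java.nio.file in a temporary directory. This sketch is not Lucene's Directory API: CommitRenameSketch and its methods are hypothetical, and the file-name rule used here (base name, underscore, generation written in radix 36) is stated as an assumption about how generation suffixes are formed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of publishing a commit by renaming the pending segments file.
final class CommitRenameSketch {
    // Assumed naming rule: base + "_" + generation in radix 36.
    static String fileNameFromGeneration(String base, long generation) {
        return base + "_" + Long.toString(generation, Character.MAX_RADIX);
    }

    static String finishCommit(Path dir, long generation) throws IOException {
        String src = fileNameFromGeneration("pending_segments", generation);
        String dest = fileNameFromGeneration("segments", generation);
        // An atomic rename makes the commit visible all at once or not at all.
        Files.move(dir.resolve(src), dir.resolve(dest), StandardCopyOption.ATOMIC_MOVE);
        return dest;
    }

    // Creates a pending file in a temp directory, commits it, and checks the result.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("commit-sketch");
            Files.createFile(dir.resolve("pending_segments_1"));
            String dest = finishCommit(dir, 1);
            return dest.equals("segments_1")
                && Files.exists(dir.resolve("segments_1"))
                && !Files.exists(dir.resolve("pending_segments_1"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("commit published: " + demo());
    }
}
```

Writing everything under a pending name and renaming last means a reader never sees a half-written segments file.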
IndexWriter::commit->commitInternal->finishCommit->IndexFileDeleter::checkpoint
public void checkpoint(SegmentInfos segmentInfos, boolean isCommit) throws IOException {
incRef(segmentInfos, isCommit);
commits.add(new CommitPoint(commitsToDelete, directoryOrig, segmentInfos));
policy.onCommit(commits);
deleteCommits();
}
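checkpoint first calls incRef to increment the reference count of every file in the commit, then wraps the commit in a CommitPoint (which is also handed the commitsToDelete list) and adds it to the commits list.
policy is the KeepOnlyLastCommitDeletionPolicy that LiveIndexWriterConfig uses by default; its onCommit function keeps only the newest CommitPoint. deleteCommits then decrements file reference counts, which may result in files actually being deleted.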
IndexWriter::commit->commitInternal->finishCommit->IndexFileDeleter::checkpoint->incRef
void incRef(SegmentInfos segmentInfos, boolean isCommit) throws IOException {
for(final String fileName: segmentInfos.files(isCommit)) {
incRef(fileName);
}
}
public Collection<String> files(boolean includeSegmentsFile) throws IOException {
HashSet<String> files = new HashSet<>();
if (includeSegmentsFile) {
final String segmentFileName = getSegmentsFileName();
if (segmentFileName != null) {
files.add(segmentFileName);
}
}
final int size = size();
for(int i=0;i<size;i++) {
final SegmentCommitInfo info = info(i);
files.addAll(info.files());
}
return files;
}
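getSegmentsFileName returns the name of the current segments file, for example segments_1, which is added to the files set.
size returns the number of SegmentCommitInfo entries. info fetches the SegmentCommitInfo at a given index, and its files function returns the names of that segment's files, such as .doc, .tim, .si, .nvd, .pos, .fdx, .nvm, .fnm, .fdt, and .tip. All of these names are added to the files set, which is then returned.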
IndexWriter::commit->commitInternal->finishCommit->IndexFileDeleter::checkpoint->incRef->incRef
void incRef(String fileName) {
RefCount rc = getRefCount(fileName);
rc.IncRef();
}
private RefCount getRefCount(String fileName) {
RefCount rc;
if (!refCounts.containsKey(fileName)) {
rc = new RefCount(fileName);
refCounts.put(fileName, rc);
} else {
rc = refCounts.get(fileName);
}
return rc;
}
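getRefCount looks up the file's RefCount in IndexFileDeleter's member refCounts, creating and registering a new one if the file is not yet tracked; incRef then increments it by 1.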
IndexWriter::commit->commitInternal->finishCommit->IndexFileDeleter::checkpoint->KeepOnlyLastCommitDeletionPolicy::onCommit
public void onCommit(List<? extends IndexCommit> commits) {
int size = commits.size();
for(int i=0;i<size-1;i++) {
commits.get(i).delete();
}
}
public void delete() {
if (!deleted) {
deleted = true;
commitsToDelete.add(this);
}
}
onCommit leaves the most recently added IndexCommit in the commits list and marks all the others as deleted, which adds them to the commitsToDelete list.
IndexWriter::commit->commitInternal->finishCommit->IndexFileDeleter::checkpoint->deleteCommits
private void deleteCommits() {
int size = commitsToDelete.size();
for(int i=0;i<size;i++) {
CommitPoint commit = commitsToDelete.get(i);
decRef(commit.files);
}
commitsToDelete.clear();
size = commits.size();
int readFrom = 0;
int writeTo = 0;
while(readFrom < size) {
CommitPoint commit = commits.get(readFrom);
if (!commit.deleted) {
if (writeTo != readFrom) {
commits.set(writeTo, commits.get(readFrom));
}
writeTo++;
}
readFrom++;
}
while(size > writeTo) {
commits.remove(size-1);
size--;
}
}
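deleteCommits goes over the CommitPoints in commitsToDelete and decrements the reference counts of their files; a file whose count reaches 0 is deleted.
After that, a while loop compacts the commits list, removing the CommitPoints that were marked as deleted.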
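The policy and the in-place compaction can be sketched together. This is a simplified model, not Lucene's classes: KeepLastCommitSketch and Commit are hypothetical, and commits are identified only by a name string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only the newest commit, then compact the list in place.
final class KeepLastCommitSketch {
    static final class Commit {
        final String name;
        boolean deleted;
        Commit(String name) { this.name = name; }
    }

    // Marks every commit except the last as deleted, removes the marked ones
    // from the list, and returns the names of the removed commits.
    static List<String> onCommitAndCompact(List<Commit> commits) {
        for (int i = 0; i < commits.size() - 1; i++) {
            commits.get(i).deleted = true;
        }
        List<String> removed = new ArrayList<>();
        int writeTo = 0;
        for (int readFrom = 0; readFrom < commits.size(); readFrom++) {
            Commit commit = commits.get(readFrom);
            if (commit.deleted) {
                removed.add(commit.name);
            } else {
                commits.set(writeTo++, commit); // slide survivors to the front
            }
        }
        while (commits.size() > writeTo) {
            commits.remove(commits.size() - 1); // trim the leftover tail
        }
        return removed;
    }

    public static void main(String[] args) {
        List<Commit> commits = new ArrayList<>();
        commits.add(new Commit("segments_1"));
        commits.add(new Commit("segments_2"));
        System.out.println("removed: " + onCommitAndCompact(commits));
    }
}
```

The two-index compaction preserves the order of the surviving commits and avoids shifting elements once per removal.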
IndexWriter::commit->commitInternal->finishCommit->IndexFileDeleter::checkpoint->deleteCommits->decRef
void decRef(Collection<String> files) throws IOException {
Set<String> toDelete = new HashSet<>();
for(final String file : files) {
if (decRef(file)) {
toDelete.add(file);
}
}
deleteFiles(toDelete);
}
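decRef mirrors the incRef function described earlier: it decrements a file's reference count and returns true when the count reaches 0, meaning the file should be deleted. Such files are collected in toDelete, and deleteFiles removes them at the end.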
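The reference counting that ties incRef and decRef together can be sketched with a plain map. This is an illustration only, not IndexFileDeleter: RefCountSketch is hypothetical, and "deleting" a file just records its name in a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-file reference counting across commits.
final class RefCountSketch {
    private final Map<String, Integer> refCounts = new HashMap<>();
    final List<String> deletedFiles = new ArrayList<>();

    void incRef(Collection<String> files) {
        for (String file : files) {
            refCounts.merge(file, 1, Integer::sum); // create at 1 or add 1
        }
    }

    void decRef(Collection<String> files) {
        for (String file : files) {
            Integer count = refCounts.get(file);
            if (count == null) {
                throw new IllegalStateException("file is not referenced: " + file);
            }
            if (count == 1) {
                refCounts.remove(file);
                deletedFiles.add(file); // last reference gone: the file can be deleted
            } else {
                refCounts.put(file, count - 1);
            }
        }
    }

    public static void main(String[] args) {
        RefCountSketch deleter = new RefCountSketch();
        deleter.incRef(Arrays.asList("segments_1", "_0.si"));
        deleter.incRef(Arrays.asList("segments_2", "_0.si", "_0_1.liv"));
        deleter.decRef(Arrays.asList("segments_1", "_0.si"));
        System.out.println("deleted: " + deleter.deletedFiles);
    }
}
```

A file shared by the old and the new commit, such as _0.si in the demo, survives the removal of the old commit because the new commit still holds a reference to it.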