Lucene Index Creation Process, Part 2

chenqiang_99

Article link: https://blog.csdn.net/chenqiang_99/article/details/48506805

5 Detailed steps of DocumentsWriterPerThread.updateDocument

Once the update of the Document has been handed to a DocumentsWriterPerThread, we keep following the call downward.

 
    public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
      testPoint("DocumentsWriterPerThread addDocument start");
      assert deleteQueue != null;
      reserveOneDoc();
      docState.doc = doc;
      docState.analyzer = analyzer;
      docState.docID = numDocsInRAM;
      if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
        infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
      }
      // Even on exception, the document is still added (but marked
      // deleted), so we don't need to un-reserve at that point.
      // Aborting exceptions will actually "lose" more than one
      // document, so the counter will be "wrong" in that case, but
      // it's very hard to fix (we can't easily distinguish aborting
      // vs non-aborting exceptions):
      boolean success = false;
      try {
        try {
          consumer.processDocument();
        } finally {
          docState.clear();
        }
        success = true;
      } finally {
        if (!success) {
          // mark document as deleted
          deleteDocID(docState.docID);
          numDocsInRAM++;
        }
      }
      finishDocument(delTerm);
    }

In this per-thread method, only one line matters to us:

consumer.processDocument();

At this point things become clear: the processing of the Document is ultimately handed to a DocConsumer. That DocConsumer is obtained as follows:

 
    abstract DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) throws IOException;

Lucene provides a default DocConsumer implementation: DefaultIndexingChain. The next step is simply to see how this DocConsumer processes the Document.
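As a rough illustration of the delegation and failure handling described above, here is a minimal, self-contained sketch. It does not use Lucene: `ToyDocConsumer`, `ToyPerThreadWriter`, and `docCount` are hypothetical names invented for this example, and the counter handling is simplified. It only shows the shape of the pattern: hand the document to a consumer, and if the consumer throws, keep the reserved doc ID but mark it as deleted.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

class DwptSketch {

    /** Hypothetical stand-in for Lucene's DocConsumer (not the real class). */
    interface ToyDocConsumer {
        void processDocument(int docID, List<String> fieldValues);
    }

    /** Hypothetical stand-in for DocumentsWriterPerThread. */
    static class ToyPerThreadWriter {
        private final ToyDocConsumer consumer;
        private final BitSet deletedDocs = new BitSet();
        private int numDocsInRAM = 0;

        ToyPerThreadWriter(ToyDocConsumer consumer) {
            this.consumer = consumer;
        }

        void updateDocument(List<String> fieldValues) {
            int docID = numDocsInRAM; // the next free in-memory doc ID
            boolean success = false;
            try {
                consumer.processDocument(docID, fieldValues);
                success = true;
            } finally {
                if (!success) {
                    // The slot is already taken, so keep it but mark it deleted.
                    deletedDocs.set(docID);
                }
                // Simplification: this sketch advances the counter on both paths.
                numDocsInRAM++;
            }
        }

        int docCount() { return numDocsInRAM; }

        boolean isDeleted(int docID) { return deletedDocs.get(docID); }
    }

    public static void main(String[] args) {
        // Toy consumer that rejects documents without any fields.
        ToyPerThreadWriter writer = new ToyPerThreadWriter((docID, fields) -> {
            if (fields.isEmpty()) {
                throw new IllegalArgumentException("empty document " + docID);
            }
        });
        writer.updateDocument(Arrays.asList("title", "body"));
        try {
            writer.updateDocument(Collections.<String>emptyList());
        } catch (IllegalArgumentException expected) {
            System.out.println("failed: " + expected.getMessage());
        }
        System.out.println("docs=" + writer.docCount() + " deleted(1)=" + writer.isDeleted(1));
    }
}
```

The point of the `success` flag is that the caller never has to "un-reserve" a doc ID: a failed document still occupies its slot, and the deleted mark hides it from readers.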

6 Detailed steps of DefaultIndexingChain.processDocument

 
    @Override
    public void processDocument() throws IOException, AbortingException {
      // How many indexed field names we've seen (collapses
      // multiple field instances by the same name):
      int fieldCount = 0;
      long fieldGen = nextFieldGen++;
      // NOTE: we need two passes here, in case there are
      // multi-valued fields, because we must process all
      // instances of a given field at once, since the
      // analyzer is free to reuse TokenStream across fields
      // (i.e., we cannot have more than one TokenStream
      // running "at once"):
      termsHash.startDocument();
      fillStoredFields(docState.docID);
      startStoredFields();
      boolean aborting = false;
      try {
        for (IndexableField field : docState.doc) { // iterate over the Fields one by one; this is where the real work finally shows up
          fieldCount = processField(field, fieldGen, fieldCount);
        }
      } catch (AbortingException ae) {
        aborting = true;
        throw ae;
      } finally {
        if (aborting == false) {
          // Finish each indexed field name seen in the document:
          for (int i=0;i<fieldCount;i++) {
            fields[i].finish();
          }
          finishStoredFields();
        }
      }
      try {
        termsHash.finishDocument();
      } catch (Throwable th) {
        // Must abort, on the possibility that on-disk term
        // vectors are now corrupt:
        throw AbortingException.wrap(th);
      }
    }

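Seeing the code above, the picture gets clearer and clearer: processing a Document boils down to iterating over its Fields and processing each one. How an individual Field is processed is outside the scope of this wiki page and is recorded in depth in a separate one (see: Document Storage Details).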
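The two-pass structure of processDocument (process every field instance, then finish each distinct field name once) can be sketched without Lucene as follows. Everything here is a toy model under stated assumptions: `ToyField`, `ToyPerField`, and the whitespace token counting are hypothetical and only stand in for IndexableField, the per-field state, and real analysis.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldPassSketch {

    /** Hypothetical (name, value) pair standing in for IndexableField. */
    static class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Hypothetical per-field-name state, created once per distinct name. */
    static class ToyPerField {
        final String name;
        int tokenCount = 0;
        boolean finished = false;

        ToyPerField(String name) {
            this.name = name;
        }

        /** Counts whitespace-separated tokens; a placeholder for real analysis. */
        void invert(String value) {
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                tokenCount += trimmed.split("\\s+").length;
            }
        }

        void finish() {
            finished = true;
        }
    }

    static Map<String, ToyPerField> processDocument(List<ToyField> doc) {
        Map<String, ToyPerField> seen = new LinkedHashMap<>();
        // Pass 1: visit every field instance; instances that share a name
        // collapse into the same per-field state.
        for (ToyField field : doc) {
            seen.computeIfAbsent(field.name, ToyPerField::new).invert(field.value);
        }
        // Pass 2: finish each distinct field name exactly once.
        for (ToyPerField perField : seen.values()) {
            perField.finish();
        }
        return seen;
    }

    public static void main(String[] args) {
        List<ToyField> doc = Arrays.asList(
            new ToyField("title", "lucene indexing"),
            new ToyField("body", "how documents become segments"),
            new ToyField("title", "part two"));
        for (ToyPerField pf : processDocument(doc).values()) {
            System.out.println(pf.name + " tokens=" + pf.tokenCount + " finished=" + pf.finished);
        }
    }
}
```

In this sketch a multi-valued field such as `title` ends up with a single state object that has seen both values, which mirrors the "collapses multiple field instances by the same name" comment in the original code.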

V. Commit Document

indexWriter.commit();

A commit does the following:

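All pending changes are committed to the index, including newly added documents, documents to be deleted, and segment merges.
The operation calls Directory.sync, which flushes the file system cache to disk. This is relatively expensive because it is synchronous, but once the data is on disk, a JVM crash (or a power failure) no longer affects these updates.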

For a detailed explanation of the sync operation, see the following passage:

 
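    Traditional UNIX implementations have a buffer cache or page cache in the kernel through which most disk I/O passes. When data is written to a file, the kernel normally copies it into one of its buffers first. If that buffer is not yet full, it is not placed on the output queue; the kernel waits until the buffer fills up or until it needs to reuse the buffer for other disk block data, then places it on the output queue, and the actual I/O happens only when it reaches the head of the queue. This kind of output is called delayed write (Chapter 3 of Bach [1986] discusses the buffer cache in detail).

    Delayed write reduces the number of disk reads and writes, but it slows down the rate at which file contents are updated, so data intended for a file may not reach the disk for some time. If the system fails, this delay can cause file updates to be lost. To keep the actual file system on disk consistent with the contents of the buffer cache, UNIX provides three functions: sync, fsync, and fdatasync.

    The sync function only queues all modified block buffers for writing and then returns; it does not wait for the actual disk writes to finish.

    A system daemon, usually called update, calls sync periodically (typically every 30 seconds). This guarantees that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.

    The fsync function applies only to the single file specified by the file descriptor filedes, and it waits for the disk writes to finish before returning. fsync is intended for applications such as databases, which need to be sure that modified blocks are written to disk immediately.

    The fdatasync function is similar to fsync, but it affects only the data portion of a file. In addition to the data, fsync also synchronously updates the file's attributes.

    For a database that supports transactions, when a transaction commits, the transaction log (containing all of the transaction's modifications plus a commit record) must be completely written to disk before the commit is considered successful and control returns to the application layer.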

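After reading this passage it is clear that the sync operation flushes data cached by the file system (and even the kernel) to disk, which keeps the data safe: it is not lost if the OS crashes or the power fails. Note that this durability corresponds to the fsync behavior in the passage (waiting until the write has completed), not to the sync function, which returns as soon as the writes are queued.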
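To make the fsync versus fdatasync distinction concrete on the Java side, here is a minimal sketch using only the JDK. `FileChannel.force(true)` asks for both content and metadata to be written to storage, while `force(false)` only requires the content, which is roughly the fsync and fdatasync split described above. The class and method names (`DurableWriteSketch`, `writeDurably`, `roundTrip`) are made up for this example, and it is not a copy of what Lucene's Directory.sync does internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DurableWriteSketch {

    /**
     * Writes content to file and blocks until the OS reports it as forced to storage.
     * syncMetadata=true is the fsync-like variant, false the fdatasync-like one.
     */
    static void writeDurably(Path file, String content, boolean syncMetadata) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer); // may land only in the OS cache at this point
            }
            channel.force(syncMetadata); // returns after the flush request completes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: durable write to a temp file, then read it back. */
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("durable-write", ".txt");
            try {
                writeDurably(tmp, content, true);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("committed"));
    }
}
```

The JDK documents the guarantee of `force` only for files on local storage devices, so this is a sketch of the idea rather than a durability proof for every file system.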

So what exactly does Lucene do?

 
    private final void commitInternal(MergePolicy mergePolicy) throws IOException {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "commit: start");
      }
      synchronized(commitLock) {
        ensureOpen(false);
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "commit: enter lock");
        }
        if (pendingCommit == null) {
          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "commit: now prepare");
          }
          prepareCommitInternal(mergePolicy); // the most important line
        } else {
          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "commit: already prepared");
          }
        }
        finishCommit();
      }
    }
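The control flow of commitInternal (prepare only if nothing is pending, then always finish, all under one lock) is a small two-phase commit. The following toy model shows just that flow. It is a sketch under simplifying assumptions: `TwoPhaseCommitSketch`, its in-memory lists, and the `log` entries are hypothetical, and nothing here touches disk or mirrors Lucene's real prepareCommitInternal.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPhaseCommitSketch {
    private final Object commitLock = new Object();
    private final List<String> buffered = new ArrayList<>();  // changes not yet prepared
    private final List<String> committed = new ArrayList<>(); // changes visible after commit
    private List<String> pendingCommit = null;                // prepared but not finished
    final List<String> log = new ArrayList<>();               // trace of the steps taken

    void add(String doc) {
        synchronized (commitLock) {
            buffered.add(doc);
        }
    }

    /** Phase 1: snapshot the buffered changes as the pending commit. */
    void prepareCommit() {
        synchronized (commitLock) {
            if (pendingCommit != null) {
                throw new IllegalStateException("a commit is already prepared");
            }
            pendingCommit = new ArrayList<>(buffered);
            buffered.clear();
            log.add("prepare");
        }
    }

    /** Prepares if needed, then finishes, mirroring the branch in commitInternal. */
    void commit() {
        synchronized (commitLock) {
            if (pendingCommit == null) {
                prepareCommit(); // Java monitors are reentrant, so this is safe
            } else {
                log.add("already prepared");
            }
            finishCommit();
        }
    }

    /** Phase 2: publish the pending commit. Only called while holding commitLock. */
    private void finishCommit() {
        committed.addAll(pendingCommit);
        pendingCommit = null;
        log.add("finish");
    }

    List<String> committed() {
        synchronized (commitLock) {
            return new ArrayList<>(committed);
        }
    }

    public static void main(String[] args) {
        TwoPhaseCommitSketch writer = new TwoPhaseCommitSketch();
        writer.add("a");
        writer.commit();        // prepare + finish in one call
        writer.add("b");
        writer.prepareCommit(); // explicit phase 1
        writer.add("c");        // arrives after the snapshot
        writer.commit();        // finishes the prepared commit only
        System.out.println(writer.committed() + " " + writer.log);
    }
}
```

In this model a change added after `prepareCommit()` is not part of the prepared snapshot and only becomes visible with the following commit, which is the reason the prepare step is separated from the finish step.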