Elasticsearch Source Code Analysis (3): Index Creation

闲庭细步

Category: Search. Tags: elasticsearch, search, lucene, distributed applications, source code

Article link: https://blog.csdn.net/flashflight/article/details/51835215

Let's start with some sample code that creates an index:

	// Note: this sample targets the legacy Lucene 2.x API; several of these calls were removed in later releases.
	import java.io.File;
	import java.io.FileReader;
	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.analysis.standard.StandardAnalyzer;
	import org.apache.lucene.document.DateTools;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field;
	import org.apache.lucene.index.IndexWriter;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;

	Directory directory = FSDirectory.getDirectory("/tmp/testindex");
	// Use the standard analyzer
	Analyzer analyzer = new StandardAnalyzer();
	// Create the IndexWriter (true = create a new index or overwrite an existing one)
	IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
	iwriter.setMaxFieldLength(25000);
	// Make a new, empty document
	Document doc = new Document();
	File f = new File("/tmp/test.txt");
	// Add the path of the file as a field named "path". Use a field that is
	// indexed (i.e. searchable), but don't tokenize the field into words.
	doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
	String text = "This is the text to be indexed.";
	doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
	// Add the last modified date of the file as a field named "modified". Use
	// a field that is indexed (i.e. searchable), but don't tokenize it into words.
	doc.add(new Field("modified", DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.UN_TOKENIZED));
	// Add the contents of the file to a field named "contents". Specify a Reader,
	// so that the text of the file is tokenized and indexed, but not stored.
	// Note that FileReader expects the file to be in the system's default encoding.
	// If that's not the case, searching for special characters will fail.
	doc.add(new Field("contents", new FileReader(f)));
	iwriter.addDocument(doc);
	iwriter.optimize();
	iwriter.close();

As the code shows, index creation happens mainly inside IndexWriter. The figure below shows the IndexWriter call chain:

The end result is a set of index files.

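.fdx is the field index file and .fdt is the field data file. .nrm holds the norms, the normalization factors used when computing document scores. .tvf is one of the term vector files; it stores the term list, the term frequencies, and optionally the position and offset information. .tvx is the term vector index file; it stores the offsets into the .tvf field file and the .tvd document file. .tvd is the term vector document file: it contains the number of fields, the list of fields that have term vectors, and a list of pointers to the field information in the .tvf file. It is used to map out which fields store term vectors and where each field's information sits in .tvf. .prx is the position data file, which holds, for each term, the list of positions at which it occurs across all documents. .tii and .tis are the term info index file and the term info data file, respectively.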
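
As a quick reference, the file roles described above can be put into a small lookup table. This is only an illustrative sketch: the class `IndexFileRoles` and its `roleOf` method are hypothetical helpers, not part of Lucene, and the term info index is listed under `.tii`, the extension Lucene uses for it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: maps legacy index file
// extensions to the roles described in the text.
class IndexFileRoles {
    private static final Map<String, String> ROLES = new LinkedHashMap<>();

    static {
        ROLES.put("fdx", "field index");
        ROLES.put("fdt", "field data");
        ROLES.put("nrm", "norms");
        ROLES.put("tvx", "term vector index");
        ROLES.put("tvd", "term vector documents");
        ROLES.put("tvf", "term vector fields");
        ROLES.put("prx", "term positions");
        ROLES.put("tii", "term info index");
        ROLES.put("tis", "term info data");
    }

    // Returns the role for a file name such as "_0.fdt", or "unknown"
    // when the name has no extension or the extension is not in the table.
    static String roleOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "unknown";
        }
        return ROLES.getOrDefault(fileName.substring(dot + 1), "unknown");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"_0.fdt", "_0.tis", "_0.xyz"}) {
            System.out.println(name + " -> " + roleOf(name));
        }
    }
}
```

The file names passed in `main` are made up for the example; a real index directory would be listed from disk.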

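Now that we know the IndexWriter call chain, what does the source actually look like? Let's walk through the code involved in index creation. IndexWriter's addDocument ultimately calls the updateDocument method of DocumentsWriter. Here is that method: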

boolean updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
        // Pre-processing; this method is explained below
        boolean hasEvents = this.preUpdate();
        // Obtain a ThreadState and lock it
        ThreadState perThread = this.flushControl.obtainAndLock();

        DocumentsWriterPerThread flushingDWPT;
        try {
            // Make sure the writer has not been closed
            this.ensureOpen();
            this.ensureInitialized(perThread);

            assert perThread.isInitialized();
            // Remember how many documents this DWPT held in RAM before the update
            DocumentsWriterPerThread dwpt = perThread.dwpt;
            int dwptNumDocs = dwpt.getNumDocsInRAM();

            try {
                dwpt.updateDocument(doc, analyzer, delTerm);
            } catch (AbortingException var18) {
                this.flushControl.doOnAbort(perThread);
                dwpt.abort();
                throw var18;
            } finally {
                // Add the change in this DWPT's in-RAM document count to the global counter
                this.numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
            }

            boolean isUpdate = delTerm != null;
            // Post-processing: may hand back a DWPT that now needs flushing
            flushingDWPT = this.flushControl.doAfterDocument(perThread, isUpdate);
        } finally {
            // Release the ThreadState back to the pool
            this.perThreadPool.release(perThread);
        }
        // Post-update flush
        return this.postUpdate(flushingDWPT, hasEvents);
    }
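
The locking and counting pattern in this method can be isolated in a small sketch. It is a simplified illustration under stated assumptions, not Lucene code: `PerThreadBufferSketch` and its nested `Buffer` are hypothetical names, a single `ReentrantLock` stands in for the locked ThreadState, and an `AtomicInteger` stands in for the global numDocsInRAM counter. The point is that the counter is updated by the delta in a `finally` block, and the lock is released in an outer `finally`, so both happen even when indexing throws.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the lock / delta-count / release pattern; not Lucene code.
class PerThreadBufferSketch {
    // Stand-in for a per-thread writer guarded by its own lock.
    static class Buffer {
        final ReentrantLock lock = new ReentrantLock();
        int docsInRam = 0;

        void add(String doc) {
            if (doc == null) {
                throw new IllegalArgumentException("null document");
            }
            docsInRam++;
        }
    }

    private final Buffer buffer = new Buffer();
    private final AtomicInteger totalDocsInRam = new AtomicInteger();

    void update(String doc) {
        buffer.lock.lock();                  // obtain and lock the per-thread state
        try {
            int before = buffer.docsInRam;   // count before the update
            try {
                buffer.add(doc);
            } finally {
                // Publish only the change, whether or not add() succeeded.
                totalDocsInRam.addAndGet(buffer.docsInRam - before);
            }
        } finally {
            buffer.lock.unlock();            // always release
        }
    }

    int total() {
        return totalDocsInRam.get();
    }

    boolean isLocked() {
        return buffer.lock.isLocked();
    }

    public static void main(String[] args) {
        PerThreadBufferSketch writer = new PerThreadBufferSketch();
        writer.update("first");
        writer.update("second");
        System.out.println("docs in RAM: " + writer.total());
    }
}
```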

Next, let's look at the pre-update and post-update handling:

private boolean preUpdate() throws IOException, AbortingException {
        this.ensureOpen();
        boolean hasEvents = false;
        // If any threads are stalled or the flush queue is not empty
        if(this.flushControl.anyStalledThreads() || this.flushControl.numQueuedFlushes() > 0) {
            // Log only if the info stream is enabled for the "DW" component
            if(this.infoStream.isEnabled("DW")) {
                this.infoStream.message("DW", "DocumentsWriter has queued dwpt; will hijack this thread to flush pending segment(s)");
            }
            // Use this thread to keep flushing pending segments until the flush queue is empty
            while(true) {
                DocumentsWriterPerThread flushingDWPT;
                while((flushingDWPT = this.flushControl.nextPendingFlush()) == null) {
                    if(this.infoStream.isEnabled("DW") && this.flushControl.anyStalledThreads()) {
                        this.infoStream.message("DW", "WARNING DocumentsWriter has stalled threads; waiting");
                    }

                    this.flushControl.waitIfStalled();
                    if(this.flushControl.numQueuedFlushes() == 0) {
                        if(this.infoStream.isEnabled("DW")) {
                            this.infoStream.message("DW", "continue indexing after helping out flushing DocumentsWriter is healthy");
                        }

                        return hasEvents;
                    }
                }

                hasEvents |= this.doFlush(flushingDWPT);
            }
        } else {
            return hasEvents;
        }
    }
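
The core idea of preUpdate, letting the indexing thread help drain queued flushes before it indexes its own document, can be sketched in a few lines. This is a deliberately single-threaded simplification with hypothetical names (`HelpFlushSketch`, `queueFlush`, `flushedSegments`); it leaves out the stall handling and logging, and its `doFlush` only records the segment name instead of writing anything.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, single-threaded stand-in for the flush queue; not Lucene code.
class HelpFlushSketch {
    private final Queue<String> pendingFlushes = new ArrayDeque<>();
    private final List<String> flushed = new ArrayList<>();

    void queueFlush(String segmentName) {
        pendingFlushes.add(segmentName);
    }

    // Mirrors the shape of preUpdate(): drain queued flushes before indexing.
    // Returns true if at least one flush ran.
    boolean preUpdate() {
        boolean hasEvents = false;
        String next;
        while ((next = pendingFlushes.poll()) != null) {
            hasEvents |= doFlush(next);
        }
        return hasEvents;
    }

    // In this sketch a flush just records the name and always reports an event.
    private boolean doFlush(String segmentName) {
        flushed.add(segmentName);
        return true;
    }

    List<String> flushedSegments() {
        return flushed;
    }

    public static void main(String[] args) {
        HelpFlushSketch writer = new HelpFlushSketch();
        writer.queueFlush("_0");
        writer.queueFlush("_1");
        System.out.println(writer.preUpdate());        // drains both queued flushes
        System.out.println(writer.flushedSegments());
        System.out.println(writer.preUpdate());        // nothing left to flush
    }
}
```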

private boolean postUpdate(DocumentsWriterPerThread flushingDWPT, boolean hasEvents) throws IOException, AbortingException {
        // Apply pending deletes first
        hasEvents |= this.applyAllDeletes(this.deleteQueue);
        // Then flush: either the DWPT handed over by doAfterDocument, or the next pending one
        if(flushingDWPT != null) {
            hasEvents |= this.doFlush(flushingDWPT);
        } else {
            DocumentsWriterPerThread nextPendingFlush = this.flushControl.nextPendingFlush();
            if(nextPendingFlush != null) {
                hasEvents |= this.doFlush(nextPendingFlush);
            }
        }

        return hasEvents;
    }

// DocumentsWriterPerThread.updateDocument, called from DocumentsWriter.updateDocument above
public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
        this.testPoint("DocumentsWriterPerThread addDocument start");

        assert this.deleteQueue != null;

        this.reserveOneDoc();
        // Record the document, its analyzer, and the docID it will receive
        this.docState.doc = doc;
        this.docState.analyzer = analyzer;
        this.docState.docID = this.numDocsInRAM;
        boolean success = false;

        try {
            try {
                // Run the document through the indexing chain
                this.consumer.processDocument();
            } finally {
                this.docState.clear();
            }

            success = true;
        } finally {
            if(!success) {
                // On failure, mark the docID as deleted but still consume it
                this.deleteDocID(this.docState.docID);
                ++this.numDocsInRAM;
            }

        }

        this.finishDocument(delTerm);
    }
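
The `success` flag in the method above is worth a closer look: when processing fails, the docID is still consumed and the half-indexed document is marked as deleted, so later documents never reuse that ID. The sketch below illustrates just that bookkeeping under simplifying assumptions. `DocIdSketch` is a hypothetical class, a `BitSet` stands in for the deleted-docs tracking, and an empty string is used as an artificial way to make processing fail.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the success-flag bookkeeping; not Lucene code.
class DocIdSketch {
    private final Map<Integer, String> stored = new HashMap<>();
    private final BitSet deleted = new BitSet();
    private int numDocsInRam = 0;

    void addDocument(String text) {
        int docId = numDocsInRam;   // the ID this document will receive
        boolean success = false;
        try {
            process(docId, text);
            success = true;
        } finally {
            if (!success) {
                // Failed documents keep their ID but are marked deleted.
                deleted.set(docId);
                numDocsInRam++;
            }
        }
        numDocsInRam++;             // normal completion, as finishDocument would do
    }

    // Artificial failure: empty documents are rejected.
    private void process(int docId, String text) {
        if (text.isEmpty()) {
            throw new IllegalArgumentException("empty document");
        }
        stored.put(docId, text);
    }

    int numDocs() {
        return numDocsInRam;
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }

    String get(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        DocIdSketch writer = new DocIdSketch();
        writer.addDocument("first");
        try {
            writer.addDocument("");          // fails, but docID 1 is consumed
        } catch (IllegalArgumentException e) {
            System.out.println("failed: " + e.getMessage());
        }
        writer.addDocument("third");         // receives docID 2
        System.out.println("docs: " + writer.numDocs() + ", doc 1 deleted: " + writer.isDeleted(1));
    }
}
```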

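DocumentsWriter assigns the documents held in memory to different threads, and each one analyzes the fields of its document in turn to build the corresponding index files. That is how the index files end up on disk. For the details of how the consumer uses the analyzer to break a document's fields into terms and index them, see the previous article in this series.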
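
To make "breaking fields into terms" concrete, here is a toy inverted index. It is a rough sketch under heavy simplifying assumptions and does not reflect how Lucene's indexing chain is implemented: `TinyInvertedIndex` is a hypothetical class, "analysis" is just lowercasing and splitting on non-alphanumeric characters, and the postings are plain in-memory lists of docIDs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy inverted index; not how Lucene implements indexing.
class TinyInvertedIndex {
    // term -> ascending list of docIDs that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text and records each term under a new docID.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            List<Integer> docIds = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each docID at most once per term.
            if (docIds.isEmpty() || docIds.get(docIds.size() - 1) != docId) {
                docIds.add(docId);
            }
        }
        return docId;
    }

    // Returns the docIDs containing the term, or an empty list.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Another text, indexed later.");
        System.out.println(index.search("text"));
        System.out.println(index.search("later"));
    }
}
```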
