Lucene Indexing Source Code Analysis, Part 1

The previous posts looked mostly at how Solr handles indexing requests; the part that actually touches the index files is Lucene. Below we analyze the indexing flow based on Lucene 5.3.1.
Before we start, allow me to borrow a diagram: the figure below shows the Lucene indexing chain.

 

[Figure: Lucene indexing chain flowchart]

We typically add documents to the index through IndexWriter:

indexWriter.addDocument(doc1);

or

indexWriter.addDocuments(docs);
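For context, here is a minimal, self-contained sketch of this against the Lucene 5.x API (the class name, index path and field names are illustrative only):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddDocDemo {
  public static void main(String[] args) throws Exception {
    // open (or create) the index directory and an IndexWriter
    Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));   // illustrative path
    IndexWriter indexWriter = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    // build a document and add it
    Document doc1 = new Document();
    doc1.add(new TextField("content", "hello lucene", Field.Store.YES));
    indexWriter.addDocument(doc1);

    indexWriter.commit();   // flush the buffered document and make it visible to readers
    indexWriter.close();
  }
}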

Both end up in updateDocument (addDocument simply delegates to it with a null delete term), whose code is as follows:

 public void updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
    ensureOpen();   // make sure the IndexWriter is still open
    try {
      boolean success = false;
      try {
        if (docWriter.updateDocument(doc, analyzer, term)) {  // delegate the update to the DocumentsWriter
          processEvents(true, false);
        }
        success = true;
      } finally {
        if (!success) {
          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "hit exception updating document");
          }
        }
      }
    } catch (AbortingException | OutOfMemoryError tragedy) {
      tragicEvent(tragedy, "updateDocument");
    }
  }
As you can see, this step delegates to DocumentsWriter#updateDocument:

 boolean updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
      final Term delTerm) throws IOException, AbortingException {

    boolean hasEvents = preUpdate();    // before this operation, finish any previously queued work, such as deletes/updates

    final ThreadState perThread = flushControl.obtainAndLock(); // obtain a locked ThreadState from the pool

    final DocumentsWriterPerThread flushingDWPT;
    try {
      // This must happen after we've pulled the ThreadState because IW.close
      // waits for all ThreadStates to be released:
      ensureOpen();
      ensureInitialized(perThread);   // make sure the perThread's dwpt has been initialized
      assert perThread.isInitialized();
      final DocumentsWriterPerThread dwpt = perThread.dwpt;
      final int dwptNumDocs = dwpt.getNumDocsInRAM();
      try {
        dwpt.updateDocument(doc, analyzer, delTerm);  // let the dwpt index the document
      } catch (AbortingException ae) {
        flushControl.doOnAbort(perThread);
        dwpt.abort();
        throw ae;
      } finally {
        // We don't know whether the document actually
        // counted as being indexed, so we must subtract here to
        // accumulate our separate counter:
        numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
      }
      final boolean isUpdate = delTerm != null;
      flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
    } finally {
      perThreadPool.release(perThread);
    }

    return postUpdate(flushingDWPT, hasEvents);   // post-processing after the update (apply events, flush if needed)
  }

The important piece in the code above is ThreadState; Lucene's javadoc describes it as follows:

  /**
   * {@link ThreadState} references and guards a
   * {@link DocumentsWriterPerThread} instance that is used during indexing to
   * build a in-memory index segment. {@link ThreadState} also holds all flush
   * related per-thread data controlled by {@link DocumentsWriterFlushControl}.
   * <p>
   * A {@link ThreadState}, its methods and members should only accessed by one
   * thread a time. Users must acquire the lock via {@link ThreadState#lock()}
   * and release the lock in a finally block via {@link ThreadState#unlock()}
   * before accessing the state.
   */

Roughly speaking, the dwpt held by a ThreadState is used during indexing to build an in-memory index segment, and the ThreadState also holds the per-thread flush-related data that is controlled by DocumentsWriterFlushControl.

A ThreadState is obtained by checking the pool's free list: if the list is empty a new ThreadState is created, otherwise one is taken from it; after use it is returned to the free list. Older versions of Lucene capped the pool at 8 ThreadStates by default, but that default limit no longer exists. The free-list idea boils down to the sketch below.
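The real DocumentsWriterPerThreadPool also deals with locking and aborted states; the class and method names in this sketch are purely illustrative, not Lucene's:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// NOT Lucene code: a minimal sketch of the free-list behaviour described above.
class SimpleStatePool<T> {
  private final Deque<T> freeList = new ArrayDeque<>();
  private final Supplier<T> factory;

  SimpleStatePool(Supplier<T> factory) {
    this.factory = factory;
  }

  // if the free list is empty, create a new state; otherwise reuse one
  synchronized T obtain() {
    T state = freeList.pollFirst();
    return state != null ? state : factory.get();
  }

  // when the caller is done, the state goes back onto the free list
  synchronized void release(T state) {
    freeList.addFirst(state);
  }
}

Once a ThreadState has been obtained, DocumentsWriter indexes the document through its dwpt; DocumentsWriterPerThread#updateDocument looks like this: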

public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
    testPoint("DocumentsWriterPerThread addDocument start");
    assert deleteQueue != null;
    reserveOneDoc();
    docState.doc = doc;
    docState.analyzer = analyzer;
    docState.docID = numDocsInRAM;
    if (INFO_VERBOSE && infoStream.isEnabled("DWPT")) {
      infoStream.message("DWPT", Thread.currentThread().getName() + " update delTerm=" + delTerm + " docID=" + docState.docID + " seg=" + segmentInfo.name);
    }
    // Even on exception, the document is still added (but marked
    // deleted), so we don't need to un-reserve at that point.
    // Aborting exceptions will actually "lose" more than one
    // document, so the counter will be "wrong" in that case, but
    // it's very hard to fix (we can't easily distinguish aborting
    // vs non-aborting exceptions):
    boolean success = false;
    try {
      try {
        consumer.processDocument(); // hand the document to the indexing chain (DefaultIndexingChain by default)
      } finally {
        docState.clear();
      }
      // ... (remainder of the method omitted)
The consumer.processDocument() call above actually invokes DefaultIndexingChain#processDocument:

public void processDocument() throws IOException, AbortingException {
    
    // How many indexed field names we've seen (collapses
    // multiple field instances by the same name):
    int fieldCount = 0;
    
    long fieldGen = nextFieldGen++;
    
    // NOTE: we need two passes here, in case there are
    // multi-valued fields, because we must process all
    // instances of a given field at once, since the
    // analyzer is free to reuse TokenStream across fields
    // (i.e., we cannot have more than one TokenStream
    // running "at once"):
    
    termsHash.startDocument(); // preparation before processing: clears the per-document field state
    
    fillStoredFields(docState.docID);
    startStoredFields();
    
    boolean aborting = false;
    try {
      for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount); // process each field of the document
      }
    } catch (AbortingException ae) {
      aborting = true;
      throw ae;
    } finally {
      if (aborting == false) {
        // Finish each indexed field name seen in the document:
        for (int i = 0; i < fieldCount; i++) {
          fields[i].finish();
        }
        finishStoredFields();
      }
    }
    
    try {
      termsHash.finishDocument();
    } catch (Throwable th) {
      // Must abort, on the possibility that on-disk term
      // vectors are now corrupt:
      throw AbortingException.wrap(th);
    }
  }
  
  private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
    String fieldName = field.name();
    IndexableFieldType fieldType = field.fieldType();
    
    PerField fp = null;
    
    if (fieldType.indexOptions() == null) {
      throw new NullPointerException("IndexOptions must not be null (field: \"" + field.name() + "\")");
    }
    
    // Invert indexed fields:
    if (fieldType.indexOptions() != IndexOptions.NONE) {
      
      // if the field omits norms, the boost cannot be indexed.
      if (fieldType.omitNorms() && field.boost() != 1.0f) {
        throw new UnsupportedOperationException("You cannot set an index-time boost: norms are omitted for field '"
            + field.name() + "'");
      }
      
      fp = getOrAddField(fieldName, fieldType, true); // get the PerField; within a dwpt each field name has exactly one PerField, which builds that field's index in memory
      boolean first = fp.fieldGen != fieldGen;
      fp.invert(field, first); // run the analyzer to produce a token stream and index the resulting terms
      
      if (first) {
        fields[fieldCount++] = fp;
        fp.fieldGen = fieldGen;
      }
    } else {
      verifyUnIndexedFieldType(fieldName, fieldType);
    }
    
    // Add stored fields:
    if (fieldType.stored()) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      if (fieldType.stored()) {
        try {
          storedFieldsWriter.writeField(fp.fieldInfo, field); // stored fields are written out via the stored-fields writer
        } catch (Throwable th) {
          throw AbortingException.wrap(th);
        }
      }
    }
    
    DocValuesType dvType = fieldType.docValuesType(); // handle doc values
    if (dvType == null) {
      throw new NullPointerException("docValuesType cannot be null (field: \"" + fieldName + "\")");
    }
    if (dvType != DocValuesType.NONE) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      indexDocValue(fp, dvType, field);
    }
    
    return fieldCount;
  }
 fp.invert() is where most of the inverted-index structures are built; they are kept in an in-memory buffer and written to disk later by a flush(). We will look at those structures in detail later; this series will cover index creation over several parts. The above is the basic indexing chain; corrections are welcome.
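The tokenization loop that invert() drives can be illustrated in a very reduced, standalone form like this; it is not the actual PerField.invert code (which also feeds each term into the in-memory TermsHash and records positions, offsets and norms):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class InvertSketch {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    try (TokenStream stream = analyzer.tokenStream("content", "hello lucene index")) {
      CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posAtt = stream.addAttribute(PositionIncrementAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // in PerField.invert() each term, together with its position/offset data,
        // would be written into the in-memory posting structures at this point
        System.out.println(termAtt.toString() + " +" + posAtt.getPositionIncrement());
      }
      stream.end();
    }
  }
}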
