Lucene 6.6.1 Source Code Analysis: Index Creation in IndexWriter

Creating an Index

Creating an index with Lucene looks very simple: you construct a single IndexWriter instance. That instance represents the index and is responsible for creating and maintaining it. To understand what index creation really involves, we therefore have to start from IndexWriter's constructor.
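
Before diving into the source, here is a minimal usage sketch of the API being analyzed (the index path and the StandardAnalyzer are example choices, not anything prescribed by the source below):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Open (or create) an index in a file-system directory:
Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index")); // example path
IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(dir, conf); // the constructor analyzed below
// ... add documents, commit ...
writer.close(); // releases the write.lock

The constructor in full: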

public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
    if (d instanceof FSDirectory && ((FSDirectory) d).checkPendingDeletions()) {
      throw new IllegalArgumentException("Directory " + d + " still has pending deleted files; cannot initialize IndexWriter");
    }

    conf.setIndexWriter(this); // prevent reuse by other instances
    config = conf;
    infoStream = config.getInfoStream();

    // obtain the write.lock. If the user configured a timeout,
    // we wrap with a sleeper and this might take some time.
    writeLock = d.obtainLock(WRITE_LOCK_NAME);
    
    boolean success = false;
    try {
      directoryOrig = d;
      directory = new LockValidatingDirectoryWrapper(d, writeLock);

      analyzer = config.getAnalyzer();
      mergeScheduler = config.getMergeScheduler();
      mergeScheduler.setInfoStream(infoStream);
      codec = config.getCodec();

      bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
      poolReaders = config.getReaderPooling();

      OpenMode mode = config.getOpenMode();
      boolean create;
      if (mode == OpenMode.CREATE) {
        create = true;
      } else if (mode == OpenMode.APPEND) {
        create = false;
      } else {
        // CREATE_OR_APPEND - create only if an index does not exist
        create = !DirectoryReader.indexExists(directory);
      }

      // If index is too old, reading the segments will throw
      // IndexFormatTooOldException.

      boolean initialIndexExists = true;

      String[] files = directory.listAll();

      // Set up our initial SegmentInfos:
      IndexCommit commit = config.getIndexCommit();

      // Set up our initial SegmentInfos:
      StandardDirectoryReader reader;
      if (commit == null) {
        reader = null;
      } else {
        reader = commit.getReader();
      }

      if (create) {

        if (config.getIndexCommit() != null) {
          // We cannot both open from a commit point and create:
          if (mode == OpenMode.CREATE) {
            throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() with OpenMode.CREATE");
          } else {
            throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() when index has no commit");
          }
        }

        // Try to read first.  This is to allow create
        // against an index that's currently open for
        // searching.  In this case we write the next
        // segments_N file with no segments:
        SegmentInfos sis = null;
        try {
          sis = SegmentInfos.readLatestCommit(directory);
          sis.clear();
        } catch (IOException e) {
          // Likely this means it's a fresh directory
          initialIndexExists = false;
          sis = new SegmentInfos();
        }
        
        segmentInfos = sis;

        rollbackSegments = segmentInfos.createBackupSegmentInfos();

        // Record that we have a change (zero out all
        // segments) pending:
        changed();

      } else if (reader != null) {
        // Init from an existing already opened NRT or non-NRT reader:
      
        if (reader.directory() != commit.getDirectory()) {
          throw new IllegalArgumentException("IndexCommit's reader must have the same directory as the IndexCommit");
        }

        if (reader.directory() != directoryOrig) {
          throw new IllegalArgumentException("IndexCommit's reader must have the same directory passed to IndexWriter");
        }

        if (reader.segmentInfos.getLastGeneration() == 0) {  
          // TODO: maybe we could allow this?  It's tricky...
          throw new IllegalArgumentException("index must already have an initial commit to open from reader");
        }

        // Must clone because we don't want the incoming NRT reader to "see" any changes this writer now makes:
        segmentInfos = reader.segmentInfos.clone();

        SegmentInfos lastCommit;
        try {
          lastCommit = SegmentInfos.readCommit(directoryOrig, segmentInfos.getSegmentsFileName());
        } catch (IOException ioe) {
          throw new IllegalArgumentException("the provided reader is stale: its prior commit file \"" + segmentInfos.getSegmentsFileName() + "\" is missing from index");
        }

        if (reader.writer != null) {

          // The old writer better be closed (we have the write lock now!):
          assert reader.writer.closed;

          // In case the old writer wrote further segments (which we are now dropping),
          // update SIS metadata so we remain write-once:
          segmentInfos.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
          lastCommit.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
        }

        rollbackSegments = lastCommit.createBackupSegmentInfos();

        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init from reader " + reader);
          messageState();
        }
      } else {
        // Init from either the latest commit point, or an explicit prior commit point:

        String lastSegmentsFile = SegmentInfos.getLastCommitSegmentsFileName(files);
        if (lastSegmentsFile == null) {
          throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
        }

        // Do not use SegmentInfos.read(Directory) since the spooky
        // retrying it does is not necessary here (we hold the write lock):
        segmentInfos = SegmentInfos.readCommit(directoryOrig, lastSegmentsFile);

        if (commit != null) {
          // Swap out all segments, but, keep metadata in
          // SegmentInfos, like version & generation, to
          // preserve write-once.  This is important if
          // readers are open against the future commit
          // points.
          if (commit.getDirectory() != directoryOrig) {
            throw new IllegalArgumentException("IndexCommit's directory doesn't match my directory, expected=" + directoryOrig + ", got=" + commit.getDirectory());
          }
          
          SegmentInfos oldInfos = SegmentInfos.readCommit(directoryOrig, commit.getSegmentsFileName());
          segmentInfos.replace(oldInfos);
          changed();

          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "init: loaded commit \"" + commit.getSegmentsFileName() + "\"");
          }
        }

        rollbackSegments = segmentInfos.createBackupSegmentInfos();
      }

      commitUserData = new HashMap<String,String>(segmentInfos.getUserData()).entrySet();

      pendingNumDocs.set(segmentInfos.totalMaxDoc());

      // start with previous field numbers, but new FieldInfos
      // NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
      globalFieldNumberMap = getFieldNumberMap();

      validateIndexSort();

      config.getFlushPolicy().init(config);
      docWriter = new DocumentsWriter(this, config, directoryOrig, directory);
      eventQueue = docWriter.eventQueue();

      // Default deleter (for backwards compatibility) is
      // KeepOnlyLastCommitDeleter:

      // Sync'd is silly here, but IFD asserts we sync'd on the IW instance:
      synchronized(this) {
        deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                       config.getIndexDeletionPolicy(),
                                       segmentInfos, infoStream, this,
                                       initialIndexExists, reader != null);

        // We incRef all files when we return an NRT reader from IW, so all files must exist even in the NRT case:
        assert create || filesExist(segmentInfos);
      }

      if (deleter.startingCommitDeleted) {
        // Deletion policy deleted the "head" commit point.
        // We have to mark ourself as changed so that if we
        // are closed w/o any further changes we write a new
        // segments_N file.
        changed();
      }

      if (reader != null) {
        // Pre-enroll all segment readers into the reader pool; this is necessary so
        // any in-memory NRT live docs are correctly carried over, and so NRT readers
        // pulled from this IW share the same segment reader:
        List<LeafReaderContext> leaves = reader.leaves();
        assert segmentInfos.size() == leaves.size();

        for (int i=0;i<leaves.size();i++) {
          LeafReaderContext leaf = leaves.get(i);
          SegmentReader segReader = (SegmentReader) leaf.reader();
          SegmentReader newReader = new SegmentReader(segmentInfos.info(i), segReader, segReader.getLiveDocs(), segReader.numDocs());
          readerPool.readerMap.put(newReader.getSegmentInfo(), new ReadersAndUpdates(this, newReader));
        }

        // We always assume we are carrying over incoming changes when opening from reader:
        segmentInfos.changed();
        changed();
      }

      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "init: create=" + create);
        messageState();
      }

      success = true;

    } finally {
      if (!success) {
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init: hit exception on init; releasing write lock");
        }
        IOUtils.closeWhileHandlingException(writeLock);
        writeLock = null;
      }
    }
  }

That is a fairly long stretch of code, which gives a sense of just how much work Lucene does while creating an index. The rest of this article walks through it step by step.

Parameters

First, the constructor's two parameters, Directory d and IndexWriterConfig conf:

  • d is the directory in which the index lives
  • conf carries the configuration for the index

Checking that the files in pendingDeletes have really been deleted

public boolean checkPendingDeletions() throws IOException {
  deletePendingFiles();
  return pendingDeletes.isEmpty() == false;
}

public synchronized void deletePendingFiles() throws IOException {
  if (pendingDeletes.isEmpty() == false) {

    // TODO: we could fix IndexInputs from FSDirectory subclasses to call this when they are closed?

    // Clone the set since we mutate it in privateDeleteFile:
    for(String name : new HashSet<>(pendingDeletes)) {
      privateDeleteFile(name, true);
    }
  }
}

pendingDeletes exists essentially to serve the Windows operating system: on Windows, deleting a file can fail because another process still has it open, and Lucene adds such files to pendingDeletes. Before building an index in directory d, Lucene has to make sure that every file in that directory's pendingDeletes has actually been deleted; otherwise IndexWriter initialization fails with the IllegalArgumentException shown at the top of the constructor.

Binding the config to this IndexWriter instance

conf.setIndexWriter(this);

This is done to prevent the config from being picked up by another writer instance, because one IndexWriterConfig instance may only ever serve a single IndexWriter.
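
In practice that means reusing a config fails fast: the setIndexWriter call seen above rejects a second binding (in 6.6 with an IllegalStateException). A sketch, with example paths:

IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter w1 = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index-a")), conf);

try {
  // The same config again, even for a different directory:
  new IndexWriter(FSDirectory.open(Paths.get("/tmp/index-b")), conf);
} catch (IllegalStateException e) {
  // expected: conf is already bound to w1
}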

Obtaining the file lock

Lucene uses a file lock to guarantee that only one IndexWriter at a time, even across processes, can write to a given index. The constructor tries to obtain a lock named write.lock in directory d. If the lock is already held, another writer is operating on the index and the attempt fails (as the comment in the constructor notes, if the user configured a timeout the lock factory is wrapped in a sleeping retry; otherwise it fails immediately). If write.lock is free, Lucene acquires it and returns a Lock object representing ownership of the file.
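
A sketch of the behavior with two writers on the same directory (LockObtainFailedException lives in org.apache.lucene.store; the path is an example):

Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));
IndexWriter first = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

try {
  // write.lock is already held, so a second writer cannot be opened:
  new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
} catch (LockObtainFailedException e) {
  // expected while `first` is still open
}

first.close(); // releases write.lock; a new writer could now be created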

Reading the configuration

The constructor then pulls the following from the config (a small configuration sketch follows the list):

  • analyzer — the analyzer used to tokenize documents
  • mergeScheduler — the merge scheduler
  • codec — the codec that encodes and decodes the segment files
  • mode — the open mode, one of three values:
    1. CREATE (create a new index, overwriting any existing one)
    2. APPEND (open an already existing index)
    3. CREATE_OR_APPEND (create the index if it does not exist, otherwise open it and append)
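
All of these can be customized on the IndexWriterConfig before the writer is constructed. A hedged sketch (imports: org.apache.lucene.codecs.Codec, org.apache.lucene.index.ConcurrentMergeScheduler; the chosen values simply restate the defaults explicitly):

IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());

// CREATE wipes any existing index; APPEND requires one to exist
// (otherwise the constructor throws IndexNotFoundException, as seen above);
// CREATE_OR_APPEND, the default, decides automatically.
conf.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

conf.setMergeScheduler(new ConcurrentMergeScheduler()); // default merge scheduler
conf.setCodec(Codec.getDefault());                      // default codec for this Lucene version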

Initializing the segment infos

When the index is being created for the first time, there is no segments_N file recording segment information in the index directory yet. In that case the CREATE branch does the following four things:

  • 1. construct a fresh SegmentInfos instance
  • 2. clone it into rollbackSegments as a backup of the segment information
  • 3. mark the segment infos as changed, which bumps their version (the changed() call)
  • 4. derive the file name for the segment-information file, segments_N (see the sketch after this list)
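
For reference, segments file names encode the generation in base 36. A simplified sketch mirroring IndexFileNames.fileNameFromGeneration, not the verbatim source:

// generation -1: no commit yet; 0: plain "segments"; N > 0: "segments_" + base-36 N
static String segmentsFileName(long generation) {
  if (generation == -1) {
    return null;
  } else if (generation == 0) {
    return "segments";
  }
  return "segments_" + Long.toString(generation, Character.MAX_RADIX); // radix 36
}

// segmentsFileName(1)  -> "segments_1"
// segmentsFileName(36) -> "segments_10"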

Data pulled from the segment infos

pendingNumDocs.set(segmentInfos.totalMaxDoc());

// start with previous field numbers, but new FieldInfos
// NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
globalFieldNumberMap = getFieldNumberMap();
  • 1. pendingNumDocs is seeded with the total number of documents across all segments of the index (a conceptual sketch of totalMaxDoc follows)
  • 2. globalFieldNumberMap collects the field information, i.e. the existing field-name/number mappings
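
Conceptually, totalMaxDoc() just sums maxDoc over every segment; deleted documents still count toward maxDoc. A sketch of the idea, not the verbatim source:

// SegmentInfos is Iterable<SegmentCommitInfo>:
long totalMaxDoc = 0;
for (SegmentCommitInfo info : segmentInfos) {
  totalMaxDoc += info.info.maxDoc(); // per-segment doc count, deletions included
}
pendingNumDocs.set(totalMaxDoc);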

Initializing the DocumentsWriter

The DocumentsWriter is responsible for adding documents and writing the segment files; every document added through the writer ultimately goes through it, as the sketch below shows.
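
A small usage sketch from the caller's side (field names and values are examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
doc.add(new StringField("id", "1", Field.Store.YES));             // stored, not tokenized
doc.add(new TextField("title", "hello lucene", Field.Store.YES)); // analyzed text

// Routed internally to the DocumentsWriter, which buffers the document
// in RAM and later flushes it to disk as part of a new segment:
writer.addDocument(doc);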

Creating the IndexFileDeleter instance

deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                       config.getIndexDeletionPolicy(),
                                       segmentInfos, infoStream, this,
                                       initialIndexExists, reader != null);

IndexFileDeleter maintains the relationship between the in-memory SegmentInfos instance and the on-disk segment-information files (segments_N). The SegmentInfos lives in memory and is updated continuously as the index changes; each time a commit is performed, its contents are written out as the next segments_N file, and IndexFileDeleter then decides which older index files may be deleted.
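
That interplay is visible from the outside: only a commit produces a new segments_N generation. A sketch (the listed file names are illustrative):

writer.addDocument(doc); // in-memory change only; segments_N on disk is unchanged
writer.commit();         // flushes buffered docs and writes the next segments_N

for (String file : dir.listAll()) {
  System.out.println(file); // after a first commit: e.g. segments_1, _0.cfs, _0.si, ...
}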

Summary

To recap, Lucene creates an index through the following steps:

  • 1. check that the files in pendingDeletes have really been deleted
  • 2. obtain the file lock (write.lock)
  • 3. read the configuration
  • 4. initialize the segment infos
  • 5. pull the total document count and field information from the segment infos
  • 6. create the IndexFileDeleter instance that manages the segment files