Lucene 6.6.1 Source Code Analysis: Index Creation in IndexWriter

Creating an Index

Creating an index with Lucene looks very simple: you construct a single IndexWriter instance. That instance represents the index and is responsible for creating and maintaining it. To understand what index creation really involves, we therefore have to start from IndexWriter's constructor.
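
Before diving into the source, here is a minimal usage sketch of the API being analyzed (the index path and the StandardAnalyzer are example choices, not anything prescribed by the source below):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Open (or create) an index in a file-system directory:
Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index")); // example path
IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(dir, conf); // the constructor analyzed below
// ... add documents, commit ...
writer.close(); // releases the write.lock

The constructor in full: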

public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
    if (d instanceof FSDirectory && ((FSDirectory) d).checkPendingDeletions()) {
      throw new IllegalArgumentException("Directory " + d + " still has pending deleted files; cannot initialize IndexWriter");
    }

    conf.setIndexWriter(this); // prevent reuse by other instances
    config = conf;
    infoStream = config.getInfoStream();

    // obtain the write.lock. If the user configured a timeout,
    // we wrap with a sleeper and this might take some time.
    writeLock = d.obtainLock(WRITE_LOCK_NAME);
    
    boolean success = false;
    try {
      directoryOrig = d;
      directory = new LockValidatingDirectoryWrapper(d, writeLock);

      analyzer = config.getAnalyzer();
      mergeScheduler = config.getMergeScheduler();
      mergeScheduler.setInfoStream(infoStream);
      codec = config.getCodec();

      bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
      poolReaders = config.getReaderPooling();

      OpenMode mode = config.getOpenMode();
      boolean create;
      if (mode == OpenMode.CREATE) {
        create = true;
      } else if (mode == OpenMode.APPEND) {
        create = false;
      } else {
        // CREATE_OR_APPEND - create only if an index does not exist
        create = !DirectoryReader.indexExists(directory);
      }

      // If index is too old, reading the segments will throw
      // IndexFormatTooOldException.

      boolean initialIndexExists = true;

      String[] files = directory.listAll();

      // Set up our initial SegmentInfos:
      IndexCommit commit = config.getIndexCommit();

      // Set up our initial SegmentInfos:
      StandardDirectoryReader reader;
      if (commit == null) {
        reader = null;
      } else {
        reader = commit.getReader();
      }

      if (create) {

        if (config.getIndexCommit() != null) {
          // We cannot both open from a commit point and create:
          if (mode == OpenMode.CREATE) {
            throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() with OpenMode.CREATE");
          } else {
            throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() when index has no commit");
          }
        }

        // Try to read first.  This is to allow create
        // against an index that's currently open for
        // searching.  In this case we write the next
        // segments_N file with no segments:
        SegmentInfos sis = null;
        try {
          sis = SegmentInfos.readLatestCommit(directory);
          sis.clear();
        } catch (IOException e) {
          // Likely this means it's a fresh directory
          initialIndexExists = false;
          sis = new SegmentInfos();
        }
        
        segmentInfos = sis;

        rollbackSegments = segmentInfos.createBackupSegmentInfos();

        // Record that we have a change (zero out all
        // segments) pending:
        changed();

      } else if (reader != null) {
        // Init from an existing already opened NRT or non-NRT reader:
      
        if (reader.directory() != commit.getDirectory()) {
          throw new IllegalArgumentException("IndexCommit's reader must have the same directory as the IndexCommit");
        }

        if (reader.directory() != directoryOrig) {
          throw new IllegalArgumentException("IndexCommit's reader must have the same directory passed to IndexWriter");
        }

        if (reader.segmentInfos.getLastGeneration() == 0) {  
          // TODO: maybe we could allow this?  It's tricky...
          throw new IllegalArgumentException("index must already have an initial commit to open from reader");
        }

        // Must clone because we don't want the incoming NRT reader to "see" any changes this writer now makes:
        segmentInfos = reader.segmentInfos.clone();

        SegmentInfos lastCommit;
        try {
          lastCommit = SegmentInfos.readCommit(directoryOrig, segmentInfos.getSegmentsFileName());
        } catch (IOException ioe) {
          throw new IllegalArgumentException("the provided reader is stale: its prior commit file \"" + segmentInfos.getSegmentsFileName() + "\" is missing from index");
        }

        if (reader.writer != null) {

          // The old writer better be closed (we have the write lock now!):
          assert reader.writer.closed;

          // In case the old writer wrote further segments (which we are now dropping),
          // update SIS metadata so we remain write-once:
          segmentInfos.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
          lastCommit.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
        }

        rollbackSegments = lastCommit.createBackupSegmentInfos();

        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init from reader " + reader);
          messageState();
        }
      } else {
        // Init from either the latest commit point, or an explicit prior commit point:

        String lastSegmentsFile = SegmentInfos.getLastCommitSegmentsFileName(files);
        if (lastSegmentsFile == null) {
          throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
        }

        // Do not use SegmentInfos.read(Directory) since the spooky
        // retrying it does is not necessary here (we hold the write lock):
        segmentInfos = SegmentInfos.readCommit(directoryOrig, lastSegmentsFile);

        if (commit != null) {
          // Swap out all segments, but, keep metadata in
          // SegmentInfos, like version & generation, to
          // preserve write-once.  This is important if
          // readers are open against the future commit
          // points.
          if (commit.getDirectory() != directoryOrig) {
            throw new IllegalArgumentException("IndexCommit's directory doesn't match my directory, expected=" + directoryOrig + ", got=" + commit.getDirectory());
          }
          
          SegmentInfos oldInfos = SegmentInfos.readCommit(directoryOrig, commit.getSegmentsFileName());
          segmentInfos.replace(oldInfos);
          changed();

          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "init: loaded commit \"" + commit.getSegmentsFileName() + "\"");
          }
        }

        rollbackSegments = segmentInfos.createBackupSegmentInfos();
      }

      commitUserData = new HashMap<String,String>(segmentInfos.getUserData()).entrySet();

      pendingNumDocs.set(segmentInfos.totalMaxDoc());

      // start with previous field numbers, but new FieldInfos
      // NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
      globalFieldNumberMap = getFieldNumberMap();

      validateIndexSort();

      config.getFlushPolicy().init(config);
      docWriter = new DocumentsWriter(this, config, directoryOrig, directory);
      eventQueue = docWriter.eventQueue();

      // Default deleter (for backwards compatibility) is
      // KeepOnlyLastCommitDeleter:

      // Sync'd is silly here, but IFD asserts we sync'd on the IW instance:
      synchronized(this) {
        deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                       config.getIndexDeletionPolicy(),
                                       segmentInfos, infoStream, this,
                                       initialIndexExists, reader != null);

        // We incRef all files when we return an NRT reader from IW, so all files must exist even in the NRT case:
        assert create || filesExist(segmentInfos);
      }

      if (deleter.startingCommitDeleted) {
        // Deletion policy deleted the "head" commit point.
        // We have to mark ourself as changed so that if we
        // are closed w/o any further changes we write a new
        // segments_N file.
        changed();
      }

      if (reader != null) {
        // Pre-enroll all segment readers into the reader pool; this is necessary so
        // any in-memory NRT live docs are correctly carried over, and so NRT readers
        // pulled from this IW share the same segment reader:
        List<LeafReaderContext> leaves = reader.leaves();
        assert segmentInfos.size() == leaves.size();

        for (int i=0;i<leaves.size();i++) {
          LeafReaderContext leaf = leaves.get(i);
          SegmentReader segReader = (SegmentReader) leaf.reader();
          SegmentReader newReader = new SegmentReader(segmentInfos.info(i), segReader, segReader.getLiveDocs(), segReader.numDocs());
          readerPool.readerMap.put(newReader.getSegmentInfo(), new ReadersAndUpdates(this, newReader));
        }

        // We always assume we are carrying over incoming changes when opening from reader:
        segmentInfos.changed();
        changed();
      }

      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "init: create=" + create);
        messageState();
      }

      success = true;

    } finally {
      if (!success) {
        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init: hit exception on init; releasing write lock");
        }
        IOUtils.closeWhileHandlingException(writeLock);
        writeLock = null;
      }
    }
  }

That is a fairly long stretch of code, which gives a sense of just how much work Lucene does while creating an index. The rest of this article walks through it step by step.

Parameters

First, the constructor's two parameters, Directory d and IndexWriterConfig conf:

  • d is the directory in which the index lives
  • conf carries the configuration for the index

Checking that the files in pendingDeletes have really been deleted

public boolean checkPendingDeletions() throws IOException {
  deletePendingFiles();
  return pendingDeletes.isEmpty() == false;
}

public synchronized void deletePendingFiles() throws IOException {
  if (pendingDeletes.isEmpty() == false) {

    // TODO: we could fix IndexInputs from FSDirectory subclasses to call this when they are closed?

    // Clone the set since we mutate it in privateDeleteFile:
    for(String name : new HashSet<>(pendingDeletes)) {
      privateDeleteFile(name, true);
    }
  }
}

pendingDeletes exists essentially to serve the Windows operating system: on Windows, deleting a file can fail because another process still has it open, and Lucene adds such files to pendingDeletes. Before building an index in directory d, Lucene has to make sure that every file in that directory's pendingDeletes has actually been deleted; otherwise IndexWriter initialization fails with the IllegalArgumentException shown at the top of the constructor.

Binding the config to this IndexWriter instance

conf.setIndexWriter(this);

This is done to prevent the config from being picked up by another writer instance, because one IndexWriterConfig instance may only ever serve a single IndexWriter.
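
In practice that means reusing a config fails fast: the setIndexWriter call seen above rejects a second binding (in 6.6 with an IllegalStateException). A sketch, with example paths:

IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter w1 = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index-a")), conf);

try {
  // The same config again, even for a different directory:
  new IndexWriter(FSDirectory.open(Paths.get("/tmp/index-b")), conf);
} catch (IllegalStateException e) {
  // expected: conf is already bound to w1
}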

Obtaining the file lock

Lucene uses a file lock to guarantee that only one IndexWriter at a time, even across processes, can write to a given index. The constructor tries to obtain a lock named write.lock in directory d. If the lock is already held, another writer is operating on the index and the attempt fails (as the comment in the constructor notes, if the user configured a timeout the lock factory is wrapped in a sleeping retry; otherwise it fails immediately). If write.lock is free, Lucene acquires it and returns a Lock object representing ownership of the file.
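
A sketch of the behavior with two writers on the same directory (LockObtainFailedException lives in org.apache.lucene.store; the path is an example):

Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));
IndexWriter first = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

try {
  // write.lock is already held, so a second writer cannot be opened:
  new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
} catch (LockObtainFailedException e) {
  // expected while `first` is still open
}

first.close(); // releases write.lock; a new writer could now be created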

Reading the configuration

The constructor then pulls the following from the config (a small configuration sketch follows the list):

  • analyzer — the analyzer used to tokenize documents
  • mergeScheduler — the merge scheduler
  • codec — the codec that encodes and decodes the segment files
  • mode — the open mode, one of three values:
    1. CREATE (create a new index, overwriting any existing one)
    2. APPEND (open an already existing index)
    3. CREATE_OR_APPEND (create the index if it does not exist, otherwise open it and append)
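
All of these can be customized on the IndexWriterConfig before the writer is constructed. A hedged sketch (imports: org.apache.lucene.codecs.Codec, org.apache.lucene.index.ConcurrentMergeScheduler; the chosen values simply restate the defaults explicitly):

IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());

// CREATE wipes any existing index; APPEND requires one to exist
// (otherwise the constructor throws IndexNotFoundException, as seen above);
// CREATE_OR_APPEND, the default, decides automatically.
conf.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

conf.setMergeScheduler(new ConcurrentMergeScheduler()); // default merge scheduler
conf.setCodec(Codec.getDefault());                      // default codec for this Lucene version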

Initializing the segment infos

When the index is being created for the first time, there is no segments_N file recording segment information in the index directory yet. In that case the CREATE branch does the following four things:

  • 1. construct a fresh SegmentInfos instance
  • 2. clone it into rollbackSegments as a backup of the segment information
  • 3. mark the segment infos as changed, which bumps their version (the changed() call)
  • 4. derive the file name for the segment-information file, segments_N (see the sketch after this list)
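
For reference, segments file names encode the generation in base 36. A simplified sketch mirroring IndexFileNames.fileNameFromGeneration, not the verbatim source:

// generation -1: no commit yet; 0: plain "segments"; N > 0: "segments_" + base-36 N
static String segmentsFileName(long generation) {
  if (generation == -1) {
    return null;
  } else if (generation == 0) {
    return "segments";
  }
  return "segments_" + Long.toString(generation, Character.MAX_RADIX); // radix 36
}

// segmentsFileName(1)  -> "segments_1"
// segmentsFileName(36) -> "segments_10"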

Data pulled from the segment infos

pendingNumDocs.set(segmentInfos.totalMaxDoc());

// start with previous field numbers, but new FieldInfos
// NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
globalFieldNumberMap = getFieldNumberMap();
  • 1. pendingNumDocs is seeded with the total number of documents across all segments of the index (a conceptual sketch of totalMaxDoc follows)
  • 2. globalFieldNumberMap collects the field information, i.e. the existing field-name/number mappings
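
Conceptually, totalMaxDoc() just sums maxDoc over every segment; deleted documents still count toward maxDoc. A sketch of the idea, not the verbatim source:

// SegmentInfos is Iterable<SegmentCommitInfo>:
long totalMaxDoc = 0;
for (SegmentCommitInfo info : segmentInfos) {
  totalMaxDoc += info.info.maxDoc(); // per-segment doc count, deletions included
}
pendingNumDocs.set(totalMaxDoc);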

Initializing the DocumentsWriter

The DocumentsWriter is responsible for adding documents and writing the segment files; every document added through the writer ultimately goes through it, as the sketch below shows.
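
A small usage sketch from the caller's side (field names and values are examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
doc.add(new StringField("id", "1", Field.Store.YES));             // stored, not tokenized
doc.add(new TextField("title", "hello lucene", Field.Store.YES)); // analyzed text

// Routed internally to the DocumentsWriter, which buffers the document
// in RAM and later flushes it to disk as part of a new segment:
writer.addDocument(doc);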

Creating the IndexFileDeleter instance

deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                       config.getIndexDeletionPolicy(),
                                       segmentInfos, infoStream, this,
                                       initialIndexExists, reader != null);

IndexFileDeleter maintains the relationship between the in-memory SegmentInfos instance and the on-disk segment-information files (segments_N). The SegmentInfos lives in memory and is updated continuously as the index changes; each time a commit is performed, its contents are written out as the next segments_N file, and IndexFileDeleter then decides which older index files may be deleted.
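
That interplay is visible from the outside: only a commit produces a new segments_N generation. A sketch (the listed file names are illustrative):

writer.addDocument(doc); // in-memory change only; segments_N on disk is unchanged
writer.commit();         // flushes buffered docs and writes the next segments_N

for (String file : dir.listAll()) {
  System.out.println(file); // after a first commit: e.g. segments_1, _0.cfs, _0.si, ...
}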

Summary

To recap, Lucene creates an index through the following steps:

  • 1. check that the files in pendingDeletes have really been deleted
  • 2. obtain the file lock (write.lock)
  • 3. read the configuration
  • 4. initialize the segment infos
  • 5. pull the total document count and field information from the segment infos
  • 6. create the IndexFileDeleter instance that manages the segment files