lucene source code analysis - 2

lucene source code analysis - preparation for creating an index

For convenience, here again is the sample code from the previous chapter for building an index:

            String filePath = ...//path of the files to be indexed
            String indexPath = ...//path where the index is stored
            File fileDir = new File(filePath);    
            Directory dir = FSDirectory.open(Paths.get(indexPath));  

            Analyzer luceneAnalyzer = new SimpleAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(luceneAnalyzer);  
            iwc.setOpenMode(OpenMode.CREATE);  
            IndexWriter indexWriter = new IndexWriter(dir,iwc);    
            File[] textFiles = fileDir.listFiles();    

            for (int i = 0; i < textFiles.length; i++) {    
                if (textFiles[i].isFile()) {     
                    String temp = FileReaderAll(textFiles[i].getCanonicalPath(),    
                            "GBK");    
                    Document document = new Document();    
                    Field FieldPath = new StringField("path", textFiles[i].getPath(), Field.Store.YES);
                    Field FieldBody = new TextField("body", temp, Field.Store.YES);    
                    document.add(FieldPath);    
                    document.add(FieldBody);    
                    indexWriter.addDocument(document);    
                }    
            }    
            indexWriter.close();

First, FSDirectory's open function opens the index folder that will hold the index files generated later. Its code is as follows:

  public static FSDirectory open(Path path) throws IOException {
    return open(path, FSLockFactory.getDefault());
  }

  public static FSDirectory open(Path path, LockFactory lockFactory) throws IOException {
    if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
      return new MMapDirectory(path, lockFactory);
    } else if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
  }
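The platform-based branching above can be modeled as a pure function. This is an illustrative sketch only: the class and method names here are hypothetical, and the real open function constructs directory instances rather than returning names.

```java
// Hypothetical sketch of the selection logic inside FSDirectory.open.
public class DirectoryChooser {
    // Mirrors the branching in FSDirectory.open: 64-bit JREs that support
    // unmapping get MMapDirectory, Windows falls back to SimpleFSDirectory,
    // and everything else uses NIOFSDirectory.
    public static String choose(boolean is64Bit, boolean unmapSupported, boolean isWindows) {
        if (is64Bit && unmapSupported) {
            return "MMapDirectory";
        } else if (isWindows) {
            return "SimpleFSDirectory";
        } else {
            return "NIOFSDirectory";
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false));   // MMapDirectory
        System.out.println(choose(false, false, true));  // SimpleFSDirectory
        System.out.println(choose(false, false, false)); // NIOFSDirectory
    }
}
```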

The default LockFactory obtained from FSLockFactory is NativeFSLockFactory, a factory that produces the file lock NativeFSLock; we will look at that code in detail if we encounter it later. Assume here that FSDirectory's open function creates an NIOFSDirectory. NIOFSDirectory extends FSDirectory and directly invokes the constructor of its parent class FSDirectory:

  protected FSDirectory(Path path, LockFactory lockFactory) throws IOException {
    super(lockFactory);
    if (!Files.isDirectory(path)) {
      Files.createDirectories(path);
    }
    directory = path.toRealPath();
  }

The FSDirectory constructor creates the directory at the given Path if it does not already exist, and stores the corresponding real path. FSDirectory extends BaseDirectory, whose constructor simply stores the LockFactory, so we will not go deeper here.
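The create-if-missing and real-path steps can be reproduced with plain java.nio.file calls. This is a standalone illustrative sketch, not Lucene's class; DirInit is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// A minimal sketch of what the FSDirectory constructor does with its Path:
// create the directory if it is missing, then store the canonical real path.
public class DirInit {
    public static Path init(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            Files.createDirectories(path); // creates missing parents too
        }
        return path.toRealPath(); // resolves symlinks, like the directory field
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("lucene-demo").resolve("index");
        Path real = init(tmp);
        System.out.println(Files.isDirectory(real)); // true
    }
}
```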

Back in the example above, a SimpleAnalyzer is constructed next, and an IndexWriterConfig is created from it. Its constructor directly calls the constructor of its parent class LiveIndexWriterConfig:

  LiveIndexWriterConfig(Analyzer analyzer) {
    this.analyzer = analyzer;
    ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB;
    maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS;
    maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
    mergedSegmentWarmer = null;
    delPolicy = new KeepOnlyLastCommitDeletionPolicy();
    commit = null;
    useCompoundFile = IndexWriterConfig.DEFAULT_USE_COMPOUND_FILE_SYSTEM;
    openMode = OpenMode.CREATE_OR_APPEND;
    similarity = IndexSearcher.getDefaultSimilarity();
    mergeScheduler = new ConcurrentMergeScheduler();
    indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
    codec = Codec.getDefault();
    infoStream = InfoStream.getDefault();
    mergePolicy = new TieredMergePolicy();
    flushPolicy = new FlushByRamOrCountsPolicy();
    readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
    indexerThreadPool = new DocumentsWriterPerThreadPool();
    perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;
  }

The LiveIndexWriterConfig constructor creates and stores a series of components. They will be analyzed one by one as they come up later, so we will not go further here.

Back in the Lucene example, an IndexWriter is created next from the IndexWriterConfig just built. IndexWriter is the most central class in Lucene's index creation. Its constructor is rather long, so let us walk through it piece by piece:

  public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
    if (d instanceof FSDirectory && ((FSDirectory) d).checkPendingDeletions()) {
      throw new IllegalArgumentException();
    }

    conf.setIndexWriter(this);
    config = conf;
    infoStream = config.getInfoStream();
    writeLock = d.obtainLock(WRITE_LOCK_NAME);

    boolean success = false;
    try {
      directoryOrig = d;
      directory = new LockValidatingDirectoryWrapper(d, writeLock);
      mergeDirectory = addMergeRateLimiters(directory);

      analyzer = config.getAnalyzer();
      mergeScheduler = config.getMergeScheduler();
      mergeScheduler.setInfoStream(infoStream);
      codec = config.getCodec();

      bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
      poolReaders = config.getReaderPooling();

      OpenMode mode = config.getOpenMode();
      boolean create;
      if (mode == OpenMode.CREATE) {
        create = true;
      } else if (mode == OpenMode.APPEND) {
        create = false;
      } else {
        create = !DirectoryReader.indexExists(directory);
      }
      boolean initialIndexExists = true;
      String[] files = directory.listAll();
      IndexCommit commit = config.getIndexCommit();

      StandardDirectoryReader reader;
      if (commit == null) {
        reader = null;
      } else {
        reader = commit.getReader();
      }

      if (create) {

        if (config.getIndexCommit() != null) {
          if (mode == OpenMode.CREATE) {
            throw new IllegalArgumentException();
          } else {
            throw new IllegalArgumentException();
          }
        }

        SegmentInfos sis = null;
        try {
          sis = SegmentInfos.readLatestCommit(directory);
          sis.clear();
        } catch (IOException e) {
          initialIndexExists = false;
          sis = new SegmentInfos();
        }
        segmentInfos = sis;
        rollbackSegments = segmentInfos.createBackupSegmentInfos();
        changed();

      } else if (reader != null) {
        ...
      } else {
        ...
      }

      pendingNumDocs.set(segmentInfos.totalMaxDoc());
      globalFieldNumberMap = getFieldNumberMap();
      config.getFlushPolicy().init(config);
      docWriter = new DocumentsWriter(this, config, directoryOrig, directory);
      eventQueue = docWriter.eventQueue();
      synchronized(this) {
        deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                       config.getIndexDeletionPolicy(),
                                       segmentInfos, infoStream, this,
                                       initialIndexExists, reader != null);
        assert create || filesExist(segmentInfos);
      }

      if (deleter.startingCommitDeleted) {
        changed();
      }

      if (reader != null) {
        ...
      }

      success = true;

    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(writeLock);
        writeLock = null;
      }
    }
  }
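Before walking through the individual steps, the three-way OpenMode decision in the middle of the constructor can be sketched as a pure function. The class here is hypothetical and models only the branching: CREATE always rebuilds, APPEND never does, and CREATE_OR_APPEND rebuilds only when no index exists yet.

```java
// Illustrative model of the OpenMode decision in the IndexWriter constructor.
public class OpenModeDecision {
    enum OpenMode { CREATE, APPEND, CREATE_OR_APPEND }

    static boolean shouldCreate(OpenMode mode, boolean indexExists) {
        if (mode == OpenMode.CREATE) return true;   // always start fresh
        if (mode == OpenMode.APPEND) return false;  // always reuse the index
        return !indexExists;                        // CREATE_OR_APPEND
    }

    public static void main(String[] args) {
        System.out.println(shouldCreate(OpenMode.CREATE, true));            // true
        System.out.println(shouldCreate(OpenMode.APPEND, true));            // false
        System.out.println(shouldCreate(OpenMode.CREATE_OR_APPEND, false)); // true
        System.out.println(shouldCreate(OpenMode.CREATE_OR_APPEND, true));  // false
    }
}
```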

The IndexWriter constructor first calls checkPendingDeletions, which tries to delete files that were previously marked for deletion and reports whether any still remain; if so, the constructor throws an IllegalArgumentException. checkPendingDeletions is defined in FSDirectory as follows:

  public boolean checkPendingDeletions() throws IOException {
    deletePendingFiles();
    return pendingDeletes.isEmpty() == false;
  }

  public synchronized void deletePendingFiles() throws IOException {
    if (pendingDeletes.isEmpty() == false) {
      for(String name : new HashSet<>(pendingDeletes)) {
        privateDeleteFile(name, true);
      }
    }
  }

  private void privateDeleteFile(String name, boolean isPendingDelete) throws IOException {
    try {
      Files.delete(directory.resolve(name));
      pendingDeletes.remove(name);
    } catch (NoSuchFileException | FileNotFoundException e) {

    } catch (IOException ioe) {

    }
  }

checkPendingDeletions ultimately calls Files' delete function on each file recorded in pendingDeletes, and returns true if any of them could not be removed.
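The pending-delete bookkeeping can be modeled in a standalone sketch. PendingDeletes is a hypothetical class, and the real FSDirectory tracks more state, but the retry-and-report pattern is the same.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// A simplified model of FSDirectory's pendingDeletes bookkeeping: file names
// that could not be deleted yet are parked in a set and retried later.
public class PendingDeletes {
    private final Set<String> pending = new HashSet<>();
    private final Path dir;

    public PendingDeletes(Path dir) { this.dir = dir; }

    public void markPending(String name) { pending.add(name); }

    // Mirrors deletePendingFiles: retry every parked name.
    public void deletePendingFiles() {
        for (String name : new HashSet<>(pending)) {
            try {
                Files.deleteIfExists(dir.resolve(name));
                pending.remove(name);
            } catch (IOException ignored) {
                // still pending; will be retried on the next call
            }
        }
    }

    // Mirrors checkPendingDeletions: true if anything is still undeletable.
    public boolean checkPendingDeletions() {
        deletePendingFiles();
        return !pending.isEmpty();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pending-demo");
        Files.createFile(dir.resolve("old.cfs"));
        PendingDeletes pd = new PendingDeletes(dir);
        pd.markPending("old.cfs");
        System.out.println(pd.checkPendingDeletions()); // false: delete succeeded
    }
}
```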

Back in the IndexWriter constructor, infoStream is next set to the NoOutput instance created in the LiveIndexWriterConfig constructor; this InfoStream is used for diagnostic output. FSDirectory's obtainLock function is then called to acquire the index directory's write lock, which we will not analyze further here.

Back in the IndexWriter constructor, a series of creations and assignments follows. Assume create is true, meaning the index is being created for the first time or recreated; the segment information is then read via SegmentInfos' readLatestCommit function:

  public static final SegmentInfos readLatestCommit(Directory directory) throws IOException {
    return new FindSegmentsFile<SegmentInfos>(directory) {
      @Override
      protected SegmentInfos doBody(String segmentFileName) throws IOException {
        return readCommit(directory, segmentFileName);
      }
    }.run();
  }

SegmentInfos' readLatestCommit function creates a FindSegmentsFile and calls its run function, defined as follows:

    public T run() throws IOException {
      return run(null);
    }

    public T run(IndexCommit commit) throws IOException {
      long lastGen = -1;
      long gen = -1;
      IOException exc = null;

      for (;;) {
        lastGen = gen;
        String files[] = directory.listAll();
        String files2[] = directory.listAll();
        Arrays.sort(files);
        Arrays.sort(files2);
        if (!Arrays.equals(files, files2)) {
          continue;
        }
        gen = getLastCommitGeneration(files);
        if (gen == -1) {
          throw new IndexNotFoundException();
        } else if (gen > lastGen) {
          String segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);

          try {
            T t = doBody(segmentFileName);
            return t;
          } catch (IOException err) {

          }
        } else {
          throw exc;
        }
      }
    }

The generic type T here is SegmentInfos. The run function first lists the directory twice and retries if the two listings differ, so that it works from a stable snapshot. It then calls getLastCommitGeneration to determine the latest generation: if the index folder contains a file named segments_6, getLastCommitGeneration returns 6, which is assigned to gen. If gen is greater than lastGen, the segment information has been updated, so doBody is invoked to read that segments_6 file and return a SegmentInfos.
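The mapping from a segments file name back to its generation can be sketched as follows. To my knowledge Lucene encodes the generation suffix in base 36 (Character.MAX_RADIX), so "segments_6" parses to generation 6, while a name like "segments_a" would be generation 10. SegmentsGen is a hypothetical class for illustration.

```java
// Sketch of how segments file names map to generation numbers.
public class SegmentsGen {
    static long generationFromName(String fileName) {
        if (fileName.equals("segments")) {
            return 0; // the very first commit carries no suffix
        }
        // The suffix after "segments_" is a base-36 number.
        return Long.parseLong(fileName.substring("segments_".length()),
                              Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(generationFromName("segments_6")); // 6
        System.out.println(generationFromName("segments_a")); // 10
    }
}
```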
From the readLatestCommit code shown earlier, doBody ultimately calls readCommit, defined in SegmentInfos as follows:

  public static final SegmentInfos readCommit(Directory directory, String segmentFileName) throws IOException {
    long generation = generationFromSegmentsFileName(segmentFileName);
    try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
      return readCommit(directory, input, generation);
    }
  }

readCommit first creates a ChecksumIndexInput, then reads the segment information through the overloaded readCommit function and returns a SegmentInfos. That inner readCommit deals with the concrete format of the segments_* files, so we will not go into it here. The returned SegmentInfos holds the segment information.

Back in the IndexWriter constructor: if the SegmentInfos returned by readLatestCommit is not null, it is emptied via clear; if the index is being created for the first time, a new SegmentInfos is constructed instead (its constructor is empty). Next, SegmentInfos' createBackupSegmentInfos function backs up its list of SegmentCommitInfo, primarily so that a rollback operation can restore it. IndexWriter then calls changed to mark that the segment information has changed.

Continuing through the IndexWriter constructor: pendingNumDocs records the total number of documents in the index; globalFieldNumberMap records information about the Fields in the segments; getFlushPolicy returns the FlushByRamOrCountsPolicy created in the LiveIndexWriterConfig constructor, whose init function performs a simple assignment. Further down, a DocumentsWriter is created and its event queue is stored in eventQueue. The constructor then creates an IndexFileDeleter, which manages index files, for example by maintaining reference counts so that operations on index files remain consistent in a multithreaded environment.
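The reference counting performed by IndexFileDeleter can be modeled minimally as follows. RefCounts is a hypothetical sketch that omits Lucene's locking and deletion policies; it only illustrates the count-to-zero-then-delete idea.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal model of the reference counting IndexFileDeleter performs:
// each index file name carries a count of the commits/segments referencing
// it, and a file becomes deletable once its count drops to zero.
public class RefCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    public void incRef(String file) {
        counts.merge(file, 1, Integer::sum);
    }

    // Returns true when the file is no longer referenced and may be deleted.
    public boolean decRef(String file) {
        int c = counts.merge(file, -1, Integer::sum);
        if (c <= 0) {
            counts.remove(file);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RefCounts rc = new RefCounts();
        rc.incRef("_0.cfs");
        rc.incRef("_0.cfs");
        System.out.println(rc.decRef("_0.cfs")); // false: still referenced
        System.out.println(rc.decRef("_0.cfs")); // true: deletable now
    }
}
```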

The next chapter continues the analysis of the Lucene index-creation example.
