Lucene source code analysis - 15

Lucene source code analysis - merging segments

This chapter analyzes segment merging in Lucene. Continuing from the previous chapter, IndexWriter's prepareCommitInternal function ends by calling maybeMerge to check whether any segments need to be merged and, if so, to merge them. Let's take a look.

maybeMerge(mergePolicy, MergeTrigger.FULL_FLUSH, UNBOUNDED_MAX_MERGE_SEGMENTS);

IndexWriter::maybeMerge

  private final void maybeMerge(MergePolicy mergePolicy, MergeTrigger trigger, int maxNumSegments) throws IOException {
    boolean newMergesFound = updatePendingMerges(mergePolicy, trigger, maxNumSegments);
    mergeScheduler.merge(this, trigger, newMergesFound);
  }

The mergePolicy argument passed in is a TieredMergePolicy, created as part of the LiveIndexWriterConfig configuration; the TieredMergePolicy class is mainly responsible for deciding under which conditions a segment merge is triggered. maybeMerge first finds the segments that need merging via updatePendingMerges and then carries out the merge via the scheduler's merge function.
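
For orientation, here is a minimal sketch of configuring this policy explicitly; the values shown are the defaults that findMerges below relies on (imports and the analyzer/directory setup are omitted, and the setters assume the Lucene 6.x API used throughout this series).

  IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
  TieredMergePolicy tmp = new TieredMergePolicy();
  tmp.setSegmentsPerTier(10.0);          // segsPerTier in findMerges
  tmp.setMaxMergeAtOnce(10);             // maxMergeAtOnce in findMerges
  tmp.setMaxMergedSegmentMB(5 * 1024);   // maxMergedSegmentBytes, 5 GB by default
  tmp.setFloorSegmentMB(2.0);            // the floor applied by floorSize in findMerges
  iwc.setMergePolicy(tmp);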

1. updatePendingMerges

IndexWriter::maybeMerge->updatePendingMerges

  private synchronized boolean updatePendingMerges(MergePolicy mergePolicy, MergeTrigger trigger, int maxNumSegments) throws IOException {

    boolean newMergesFound = false;
    final MergePolicy.MergeSpecification spec;

    spec = mergePolicy.findMerges(trigger, segmentInfos, this);

    newMergesFound = spec != null;
    if (newMergesFound) {
      final int numMerges = spec.merges.size();
      for(int i=0;i<numMerges;i++) {
        registerMerge(spec.merges.get(i));
      }
    }
    return newMergesFound;
  }

updatePendingMerges first calls TieredMergePolicy's findMerges to work out which segments qualify for merging and stores the result in a MergeSpecification. It then registers each pending merge via registerMerge.

IndexWriter::maybeMerge->updatePendingMerges->TieredMergePolicy::findMerges

  public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos infos, IndexWriter writer) throws IOException {
    final Collection<SegmentCommitInfo> merging = writer.getMergingSegments();
    final Collection<SegmentCommitInfo> toBeMerged = new HashSet<>();

    final List<SegmentCommitInfo> infosSorted = new ArrayList<>(infos.asList());
    Collections.sort(infosSorted, new SegmentByteSizeDescending(writer));

    long totIndexBytes = 0;
    long minSegmentBytes = Long.MAX_VALUE;

    for(SegmentCommitInfo info : infosSorted) {
      final long segBytes = size(info, writer);
      minSegmentBytes = Math.min(segBytes, minSegmentBytes);
      totIndexBytes += segBytes;
    }

    int tooBigCount = 0;
    while (tooBigCount < infosSorted.size()) {
      long segBytes = size(infosSorted.get(tooBigCount), writer);
      if (segBytes < maxMergedSegmentBytes/2.0) {
        break;
      }
      totIndexBytes -= segBytes;
      tooBigCount++;
    }

    minSegmentBytes = floorSize(minSegmentBytes);

    long levelSize = minSegmentBytes;
    long bytesLeft = totIndexBytes;
    double allowedSegCount = 0;
    while(true) {
      final double segCountLevel = bytesLeft / (double) levelSize;
      if (segCountLevel < segsPerTier) {
        allowedSegCount += Math.ceil(segCountLevel);
        break;
      }
      allowedSegCount += segsPerTier;
      bytesLeft -= segsPerTier * levelSize;
      levelSize *= maxMergeAtOnce;
    }
    int allowedSegCountInt = (int) allowedSegCount;
    MergeSpecification spec = null;

    while(true) {
      long mergingBytes = 0;
      final List<SegmentCommitInfo> eligible = new ArrayList<>();
      for(int idx = tooBigCount; idx<infosSorted.size(); idx++) {
        final SegmentCommitInfo info = infosSorted.get(idx);
        if (merging.contains(info)) {
          mergingBytes += size(info, writer);
        } else if (!toBeMerged.contains(info)) {
          eligible.add(info);
        }
      }
      final boolean maxMergeIsRunning = mergingBytes >= maxMergedSegmentBytes;
      if (eligible.size() == 0) {
        return spec;
      }

      if (eligible.size() > allowedSegCountInt) {
        MergeScore bestScore = null;
        List<SegmentCommitInfo> best = null;
        boolean bestTooLarge = false;
        long bestMergeBytes = 0;

        for(int startIdx = 0;startIdx <= eligible.size()-maxMergeAtOnce; startIdx++) {

          long totAfterMergeBytes = 0;

          final List<SegmentCommitInfo> candidate = new ArrayList<>();
          boolean hitTooLarge = false;
          for(int idx = startIdx;idx<eligible.size() && candidate.size() < maxMergeAtOnce;idx++) {
            final SegmentCommitInfo info = eligible.get(idx);
            final long segBytes = size(info, writer);

            if (totAfterMergeBytes + segBytes > maxMergedSegmentBytes) {
              hitTooLarge = true;
              continue;
            }
            candidate.add(info);
            totAfterMergeBytes += segBytes;
          }

          final MergeScore score = score(candidate, hitTooLarge, mergingBytes, writer);
          if ((bestScore == null || score.getScore() < bestScore.getScore()) && (!hitTooLarge || !maxMergeIsRunning)) {
            best = candidate;
            bestScore = score;
            bestTooLarge = hitTooLarge;
            bestMergeBytes = totAfterMergeBytes;
          }
        }

        if (best != null) {
          if (spec == null) {
            spec = new MergeSpecification();
          }
          final OneMerge merge = new OneMerge(best);
          spec.add(merge);
          for(SegmentCommitInfo info : merge.segments) {
            toBeMerged.add(info);
          }
        } else {
          return spec;
        }
      } else {
        return spec;
      }
    }
  }

findMerges first sorts all current segments by their size in bytes, from largest to smallest.
The first for loop iterates over all segments, computing the size of the smallest segment, stored in minSegmentBytes, and the total size of all segments, stored in totIndexBytes.
The following while loop checks whether any segments are larger than half of maxMergedSegmentBytes; tooBigCount records how many such oversized segments sit at the front of the sorted infosSorted list, their sizes are subtracted from totIndexBytes, and they are skipped from then on.
Next, floorSize raises minSegmentBytes to the configured floor (2 MB by default) so that very small segments do not distort the threshold calculation that follows.
The next while loop computes allowedSegCount, the threshold on the number of segments: once the number of ordinary (neither oversized nor already merging) segments exceeds this threshold, a merge should be performed. The remaining total bytes are divided by the current level size (initially the floored minimum segment size); if the quotient is at least segsPerTier (10 by default), the threshold grows by segsPerTier, segsPerTier * levelSize bytes are taken off the remaining total, the level size is multiplied by maxMergeAtOnce (10 by default) and the loop repeats; otherwise the quotient is rounded up, added to the threshold, and the loop ends. A worked example follows.
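
As a worked example of this threshold calculation (the sizes are hypothetical):

  // 20 segments of 1 MB each, segsPerTier = 10, maxMergeAtOnce = 10;
  // floorSize raises minSegmentBytes to the 2 MB floor, totIndexBytes = 20 MB.
  // pass 1: 20 MB / 2 MB = 10, not below segsPerTier -> allowedSegCount = 10,
  //         bytesLeft = 20 MB - 10 * 2 MB = 0, levelSize = 2 MB * 10 = 20 MB
  // pass 2: 0 / 20 MB = 0, below segsPerTier -> allowedSegCount += ceil(0), loop ends
  // allowedSegCountInt = 10, so a merge is only considered once more than 10 eligible segments exist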

Entering the outer while loop, the first for loop skips the oversized SegmentCommitInfos according to the tooBigCount computed earlier and adds the remaining segments to the eligible list; if a segment is already in the merging set it is being merged right now, so only its size is added to mergingBytes and it is not made eligible.
maxMergeAtOnce is also the maximum number of segments that may be merged in one go, 10 by default. Because of this limit, when the index holds more than maxMergeAtOnce eligible segments the best maxMergeAtOnce of them have to be picked; rather than trying every combination, the loop makes a near-optimal choice over contiguous windows of the size-sorted list. For example, with twelve eligible segments numbered 0 to 11, the for loop only compares the three windows 0-9, 1-10 and 2-11 and keeps the best of them.
A merge is only considered when the number of eligible segments exceeds allowedSegCountInt; otherwise findMerges returns the MergeSpecification accumulated so far (null if nothing has been selected), meaning no further merging is needed.
totAfterMergeBytes is the combined size of the segments already chosen for the current candidate; if adding the next segment would push it past maxMergedSegmentBytes, hitTooLarge is set and that segment is skipped, otherwise the segment joins the candidate list and totAfterMergeBytes grows accordingly.
The score function then rates the candidate; the exact scoring formula is not covered here. The candidate's score is compared with the best score seen so far, and if the current candidate is better (lower scores are better) it replaces the previous best.

After the for loop exits, a MergeSpecification is created if necessary, the best candidate segments are wrapped in a OneMerge and added to the MergeSpecification, and the segments are recorded in toBeMerged so they are not selected again.

IndexWriter::maybeMerge->updatePendingMerges->registerMerge

  final synchronized boolean registerMerge(MergePolicy.OneMerge merge) throws IOException {

    boolean isExternal = false;
    for(SegmentCommitInfo info : merge.segments) {
      if (mergingSegments.contains(info)) {
        return false;
      }
      if (!segmentInfos.contains(info)) {
        return false;
      }
      if (info.info.dir != directoryOrig) {
        isExternal = true;
      }
      if (segmentsToMerge.containsKey(info)) {
        merge.maxNumSegments = mergeMaxNumSegments;
      }
    }

    ensureValidMerge(merge);

    pendingMerges.add(merge);


    merge.mergeGen = mergeGen;
    merge.isExternal = isExternal;

    for(SegmentCommitInfo info : merge.segments) {
      mergingSegments.add(info);
    }

    for(SegmentCommitInfo info : merge.segments) {
      if (info.info.maxDoc() > 0) {
        final int delCount = numDeletedDocs(info);
        final double delRatio = ((double) delCount)/info.info.maxDoc();
        merge.estimatedMergeBytes += info.sizeInBytes() * (1.0 - delRatio);
        merge.totalMergeBytes += info.sizeInBytes();
      }
    }

    merge.registerDone = true;

    return true;
  }

registerMerge iterates over the candidate segments and uses the mergingSegments set to check whether any of them is already being merged; if so it returns immediately. It also returns if a candidate has already been merged away or deleted, i.e. it is no longer in segmentInfos. If a segment comes from an external directory, isExternal is set to true.
Next, the candidate merge is added to pendingMerges.
Then the segments are walked again, and each is added to mergingSegments to mark it as being merged.

Finally the segments are walked once more: numDeletedDocs returns the number of documents marked as deleted in each segment, and estimatedMergeBytes and totalMergeBytes are accumulated. estimatedMergeBytes scales each segment's size by the fraction of live documents, i.e. it excludes deletions, while totalMergeBytes counts the full segment size, as the example below illustrates.
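
For example, with hypothetical numbers the two counters come out as follows:

  // Hypothetical segment: maxDoc = 1000, 200 deleted documents, sizeInBytes() = 10 MB
  long sizeInBytes = 10L << 20;
  int maxDoc = 1000, delCount = 200;
  double delRatio = ((double) delCount) / maxDoc;                      // 0.2
  long estimatedMergeBytes = (long) (sizeInBytes * (1.0 - delRatio));  // ~8 MB, live data only
  long totalMergeBytes = sizeInBytes;                                  // 10 MB, full segment size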

2. merge

updatePendingMerges has identified the sets of segments to merge; the merge function now performs the actual merging. Let's take a look.

IndexWriter::maybeMerge->ConcurrentMergeScheduler::merge

  public synchronized void merge(IndexWriter writer, MergeTrigger trigger, boolean newMergesFound) throws IOException {
    while (true) {
      OneMerge merge = writer.getNextMerge();
      if (merge == null) {
        return;
      }

      final MergeThread merger = getMergeThread(writer, merge);
      mergeThreads.add(merger);
      merger.start();
    }
  }

The merge function loops, calling getNextMerge to take each OneMerge registered by updatePendingMerges in turn, creates a MergeThread for it via getMergeThread, and starts the thread to carry out the merge.
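
ConcurrentMergeScheduler is the default merge scheduler; a minimal sketch of tuning it (the counts are illustrative, and iwc is the IndexWriterConfig from the earlier sketch):

  ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
  cms.setMaxMergesAndThreads(6, 2);   // at most 6 queued merges, 2 concurrent MergeThreads
  iwc.setMergeScheduler(cms);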

IndexWriter::maybeMerge->ConcurrentMergeScheduler::merge->IndexWriter::getNextMerge

  public synchronized MergePolicy.OneMerge getNextMerge() {
    if (pendingMerges.size() == 0) {
      return null;
    } else {
      MergePolicy.OneMerge merge = pendingMerges.removeFirst();
      runningMerges.add(merge);
      return merge;
    }
  }

getNextMerge removes the next registered OneMerge from pendingMerges, adds it to runningMerges, and returns it.

IndexWriter::maybeMerge->ConcurrentMergeScheduler::merge->getMergeThread

  protected synchronized MergeThread getMergeThread(IndexWriter writer, OneMerge merge) throws IOException {
    final MergeThread thread = new MergeThread(writer, merge);
    thread.setDaemon(true);
    thread.setName("Lucene Merge Thread #" + mergeThreadCount++);
    return thread;
  }

getMergeThread creates a MergeThread, configures it as a named daemon thread, and returns it.

MergeThread::run

    public void run() {
      doMerge(writer, merge);
      merge(writer, MergeTrigger.MERGE_FINISHED, true);
      removeMergeThread();
      ...
    }

The MergeThread performs the segment merge through doMerge; the subsequent merge call checks whether more merges are pending and, if so, starts further threads to run them; removeMergeThread then removes the current thread from mergeThreads.

MergeThread::run->doMerge

  protected void doMerge(IndexWriter writer, OneMerge merge) throws IOException {
    writer.merge(merge);
  }

  public void merge(MergePolicy.OneMerge merge) throws IOException {
    rateLimiters.set(merge.rateLimiter);
    final long t0 = System.currentTimeMillis();
    final MergePolicy mergePolicy = config.getMergePolicy();
    mergeInit(merge);
    mergeMiddle(merge, mergePolicy);
    mergeFinish(merge);
  }

mergeInit initializes the merge and creates the metadata for the new, merged segment; mergeMiddle is the main merge function that actually performs the merging; mergeFinish finally does some cleanup. Each is analyzed in turn below.

2.1 mergeInit

MergeThread::run->doMerge->IndexWriter::merge->mergeInit

  final synchronized void mergeInit(MergePolicy.OneMerge merge) throws IOException {
      _mergeInit(merge);
  }
  synchronized private void _mergeInit(MergePolicy.OneMerge merge) throws IOException {

    final BufferedUpdatesStream.ApplyDeletesResult result = bufferedUpdatesStream.applyDeletesAndUpdates(readerPool, merge.segments);

    ...

    final String mergeSegmentName = newSegmentName();
    SegmentInfo si = new SegmentInfo(directoryOrig, Version.LATEST, mergeSegmentName, -1, false, codec, Collections.emptyMap(), StringHelper.randomId(), new HashMap<>());
    Map<String,String> details = new HashMap<>();
    details.put("mergeMaxNumSegments", "" + merge.maxNumSegments);
    details.put("mergeFactor", Integer.toString(merge.segments.size()));
    setDiagnostics(si, SOURCE_MERGE, details);
    merge.setMergeInfo(new SegmentCommitInfo(si, 0, -1L, -1L, -1L));

    ...
  }

_mergeInit first calls BufferedUpdatesStream's applyDeletesAndUpdates to apply the deletes and updates still buffered in memory. It then obtains a new segment name via newSegmentName (for example "_3"), creates the SegmentInfo for that segment with the appropriate settings, and finally attaches it to the OneMerge.
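
For reference, newSegmentName derives segment names from a base-36 counter; a tiny sketch with illustrative counter values:

  String third = "_" + Integer.toString(3, Character.MAX_RADIX);   // "_3"
  String later = "_" + Integer.toString(36, Character.MAX_RADIX);  // "_10" (36 in base 36)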

2.2 mergeMiddle

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle

  private int mergeMiddle(MergePolicy.OneMerge merge, MergePolicy mergePolicy) throws IOException {

    List<SegmentCommitInfo> sourceSegments = merge.segments;
    IOContext context = new IOContext(merge.getStoreMergeInfo());
    final TrackingDirectoryWrapper dirWrapper = new TrackingDirectoryWrapper(mergeDirectory);
    merge.readers = new ArrayList<>(sourceSegments.size());

    boolean success = false;

    int segUpto = 0;
    while(segUpto < sourceSegments.size()) {

      final SegmentCommitInfo info = sourceSegments.get(segUpto);
      final ReadersAndUpdates rld = readerPool.get(info, true);

      SegmentReader reader;
      final Bits liveDocs;
      final int delCount;

      reader = rld.getReaderForMerge(context);
      liveDocs = rld.getReadOnlyLiveDocs();
      delCount = rld.getPendingDeleteCount() + info.getDelCount();

      if (reader.numDeletedDocs() != delCount) {
        SegmentReader newReader = new SegmentReader(info, reader, liveDocs, info.info.maxDoc() - delCount);
        rld.release(reader);
        reader = newReader;
      }

      merge.readers.add(reader);
      segUpto++;
    }

    final SegmentMerger merger = new SegmentMerger(merge.getMergeReaders(), merge.info.info, infoStream, dirWrapper, globalFieldNumberMap, context);
    if (merger.shouldMerge()) {
      merger.merge();
    }

    MergeState mergeState = merger.mergeState;
    merge.info.info.setFiles(new HashSet<>(dirWrapper.getCreatedFiles()));

    ...

    codec.segmentInfoFormat().write(directory, merge.info.info, context);
    commitMerge(merge, mergeState);
    return merge.info.info.maxDoc();
  }

getStoreMergeInfo returns a MergeInfo, which is wrapped into an IOContext. MergeInfo carries the information about this merge, such as the number of documents, the estimated size in bytes and the number of segments involved.
The merge directory mergeDirectory (by default simply the Lucene index directory) is then wrapped in a TrackingDirectoryWrapper; mergeDirectory is created in IndexWriter's constructor.
The loop then walks the segments to be merged. ReaderPool's get function creates and caches a ReadersAndUpdates for the segment and returns it; ReadersAndUpdates wraps a SegmentReader and serves delete, update and merge operations.
getReaderForMerge then creates and returns the segment's SegmentReader; the SegmentReader constructor reads the .liv file and builds from it a Bits structure recording which documents of the segment have been deleted, stored in the liveDocs member. getReadOnlyLiveDocs simply returns the liveDocs just read. delCount is the number of deleted documents in the segment, i.e. the sum of pending deletes and deletes already applied.
If delCount differs from the count returned by numDeletedDocs, other threads have deleted further documents in the meantime, so a new SegmentReader is created with the up-to-date liveDocs (part of that code is omitted here). Finally the SegmentReader is added to the OneMerge's readers list; a user-level sketch of the liveDocs mechanism follows.
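
For illustration, the same Bits mechanism as seen from a user-level reader; a minimal sketch assuming an index in the directory "index" (exception handling omitted):

  DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
  Bits liveDocs = MultiFields.getLiveDocs(reader);  // null when the index has no deletions
  for (int docID = 0; docID < reader.maxDoc(); docID++) {
    if (liveDocs != null && liveDocs.get(docID) == false) {
      continue;  // marked deleted in the .liv file; the merge loops skip such documents the same way
    }
    // ... the document is live
  }
  reader.close();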

Next a SegmentMerger is created; it encapsulates essentially all the information for one merge, and the MergeState it builds internally wraps the readers for every file of every source segment. SegmentMerger's merge function is then called to perform the merge; it is the core function of the whole merge process.
Once merging is done, TrackingDirectoryWrapper's getCreatedFiles returns all the files created during the merge, and they are recorded in the new segment's SegmentInfo.
A large block of omitted code then decides whether the new segment should be packed into a compound (.cfs) file and does so if needed.
Next, segmentInfoFormat returns the Lucene50SegmentInfoFormat, whose write function creates the new segment's .si file and writes the segment's metadata into it.
commitMerge does the wrap-up work, including carrying over deletes and updates that arrived while the merge was running; Lucene's update handling is covered in a later chapter, so it is not examined in detail here.
Finally, the maximum document count of the newly merged segment is returned.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->ReaderPool::get

    public synchronized ReadersAndUpdates get(SegmentCommitInfo info, boolean create) {
      ReadersAndUpdates rld = readerMap.get(info);
      if (rld == null) {
        rld = new ReadersAndUpdates(IndexWriter.this, info);
        readerMap.put(info, rld);
      }
      return rld;
    }

The IndexWriter and the segment's SegmentCommitInfo are wrapped into a ReadersAndUpdates, which is added to readerMap and returned (the cached instance is returned if one already exists).

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->ReadersAndUpdates::getReaderForMerge

  synchronized SegmentReader getReaderForMerge(IOContext context) throws IOException {
    isMerging = true;
    return getReader(context);
  }

  public SegmentReader getReader(IOContext context) throws IOException {
    if (reader == null) {
      reader = new SegmentReader(info, context);
      if (liveDocs == null) {
        liveDocs = reader.getLiveDocs();
      }
    }
    return reader;
  }

  public SegmentReader(SegmentCommitInfo si, IOContext context) throws IOException {
    this.si = si;
    core = new SegmentCoreReaders(si.info.dir, si, context);
    segDocValues = new SegmentDocValues();
    final Codec codec = si.info.getCodec();
    if (si.hasDeletions()) {
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
    }

    numDocs = si.info.maxDoc() - si.getDelCount();
    fieldInfos = initFieldInfos();
    docValuesProducer = initDocValuesProducer();
  }

ReadersAndUpdates' getReaderForMerge creates the SegmentReader. SegmentReader was introduced in an earlier article; it wraps the reader classes for each of the segment's files. When hasDeletions returns true the segment carries deletion marks, so readLiveDocs reads the .liv file and the deletion marks are stored in liveDocs.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::SegmentMerger->MergeState::MergeState

  MergeState(List<CodecReader> readers, SegmentInfo segmentInfo, InfoStream infoStream){
    int numReaders = readers.size();
    docMaps = new DocMap[numReaders];
    docBase = new int[numReaders];
    maxDocs = new int[numReaders];
    fieldsProducers = new FieldsProducer[numReaders];
    normsProducers = new NormsProducer[numReaders];
    storedFieldsReaders = new StoredFieldsReader[numReaders];
    termVectorsReaders = new TermVectorsReader[numReaders];
    docValuesProducers = new DocValuesProducer[numReaders];
    pointsReaders = new PointsReader[numReaders];
    fieldInfos = new FieldInfos[numReaders];
    liveDocs = new Bits[numReaders];

    for(int i=0;i<numReaders;i++) {
      final CodecReader reader = readers.get(i);

      maxDocs[i] = reader.maxDoc();
      liveDocs[i] = reader.getLiveDocs();
      fieldInfos[i] = reader.getFieldInfos();

      normsProducers[i] = reader.getNormsReader();
      if (normsProducers[i] != null) {
        normsProducers[i] = normsProducers[i].getMergeInstance();
      }

      docValuesProducers[i] = reader.getDocValuesReader();
      if (docValuesProducers[i] != null) {
        docValuesProducers[i] = docValuesProducers[i].getMergeInstance();
      }

      storedFieldsReaders[i] = reader.getFieldsReader();
      if (storedFieldsReaders[i] != null) {
        storedFieldsReaders[i] = storedFieldsReaders[i].getMergeInstance();
      }

      termVectorsReaders[i] = reader.getTermVectorsReader();
      if (termVectorsReaders[i] != null) {
        termVectorsReaders[i] = termVectorsReaders[i].getMergeInstance();
      }

      fieldsProducers[i] = reader.getPostingsReader().getMergeInstance();
      pointsReaders[i] = reader.getPointsReader();
      if (pointsReaders[i] != null) {
        pointsReaders[i] = pointsReaders[i].getMergeInstance();
      }
    }

    this.segmentInfo = segmentInfo;
    setDocMaps(readers);
  }

MergeState's constructor mainly collects the reader for every file of every segment; for example, the fieldsProducers array holds the FieldsProducer (backed by a BlockTreeTermsReader for the default codec) used to read each segment's term dictionary and postings files (.tim, .doc, .pos and so on).

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge

  MergeState merge() throws IOException {
    mergeFieldInfos();
    int numMerged = mergeFields();

    final SegmentWriteState segmentWriteState = new SegmentWriteState(mergeState.infoStream, directory, mergeState.segmentInfo, mergeState.mergeFieldInfos, null, context);
    mergeTerms(segmentWriteState);
    mergeNorms(segmentWriteState);
    codec.fieldInfosFormat().write(directory, mergeState.segmentInfo, "", mergeState.mergeFieldInfos, context);
    return mergeState;
  }

mergeFieldInfos first merges the field metadata. Then, based on the newly merged field infos, mergeFields merges the .fdt and .fdx stored-fields files of all segments.
A SegmentWriteState is then created and mergeTerms is called to merge each segment's .doc, .pos, .pay, .tip and .tim files.
mergeNorms then merges the .nvd and .nvm files.
Finally, fieldInfosFormat returns the Lucene60FieldInfosFormat, whose write function writes the new segment's field infos into a new .fnm file; a sketch of the files the merged segment ends up with follows.
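
For orientation, a sketch of what the merged segment leaves on disk; the segment name "_3" and the path are hypothetical, and the compound-file step is assumed to be skipped:

  Directory dir = FSDirectory.open(Paths.get("index"));
  for (String file : dir.listAll()) {
    if (file.startsWith("_3")) {
      System.out.println(file);
      // typically _3.fdt/_3.fdx (stored fields), _3_Lucene50_0.doc/.pos/.tim/.tip (postings),
      // _3.nvd/_3.nvm (norms), _3.fnm (field infos) and _3.si (segment info)
    }
  }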

2.2.1 mergeFieldInfos

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFieldInfos

  public void mergeFieldInfos() throws IOException {
    for (FieldInfos readerFieldInfos : mergeState.fieldInfos) {
      for (FieldInfo fi : readerFieldInfos) {
        fieldInfosBuilder.add(fi);
      }
    }
    mergeState.mergeFieldInfos = fieldInfosBuilder.finish();
  }

mergeFieldInfos iterates over every field of every segment, adding each segment's FieldInfo to fieldInfosBuilder via add; finally FieldInfos.Builder's finish consolidates all the field information and produces the FieldInfos of the new segment.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFieldInfos->FieldInfos.Builder::add

    public FieldInfo add(FieldInfo fi) {
      return addOrUpdateInternal(fi.name, fi.number, fi.hasVectors(),
                                 fi.omitsNorms(), fi.hasPayloads(),
                                 fi.getIndexOptions(), fi.getDocValuesType(),
                                 fi.getPointDimensionCount(), fi.getPointNumBytes());
    }

    private FieldInfo addOrUpdateInternal(String name, int preferredFieldNumber, boolean storeTermVector, boolean omitNorms, boolean storePayloads,IndexOptions indexOptions, DocValuesType docValues, int dimensionCount, int dimensionNumBytes) {

      FieldInfo fi = fieldInfo(name);
      if (fi == null) {
        final int fieldNumber = globalFieldNumbers.addOrGet(name, preferredFieldNumber, docValues, dimensionCount, dimensionNumBytes);
        fi = new FieldInfo(name, fieldNumber, storeTermVector, omitNorms, storePayloads, indexOptions, docValues, -1, new HashMap<>(), dimensionCount, dimensionNumBytes);
        globalFieldNumbers.verifyConsistent(Integer.valueOf(fi.number), fi.name, fi.getDocValuesType());
        byName.put(fi.name, fi);
      } else {
        fi.update(storeTermVector, omitNorms, storePayloads, indexOptions, dimensionCount, dimensionNumBytes);
        if (docValues != DocValuesType.NONE) {
          boolean updateGlobal = fi.getDocValuesType() == DocValuesType.NONE;
          if (updateGlobal) {
            globalFieldNumbers.setDocValuesType(fi.number, name, docValues);
          }
          fi.setDocValuesType(docValues);
        }
      }
      return fi;
    }

fieldInfo checks the byName member to see whether a FieldInfo for this field has already been added. If not, addOrGet obtains the field's global number, a new FieldInfo is created and put into byName; if one already exists, FieldInfo's update function merges the new values into the previously added FieldInfo.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFieldInfos->FieldInfos.Builder::add->addOrUpdateInternal->FieldInfo::update

  void update(boolean storeTermVector, boolean omitNorms, boolean storePayloads, IndexOptions indexOptions,
              int dimensionCount, int dimensionNumBytes) {
    if (this.indexOptions != indexOptions) {
      if (this.indexOptions == IndexOptions.NONE) {
        this.indexOptions = indexOptions;
      } else if (indexOptions != IndexOptions.NONE) {
        this.indexOptions = this.indexOptions.compareTo(indexOptions) < 0 ? this.indexOptions : indexOptions;
      }
    }

    if (this.pointDimensionCount == 0 && dimensionCount != 0) {
      this.pointDimensionCount = dimensionCount;
      this.pointNumBytes = dimensionNumBytes;
    }

    if (this.indexOptions != IndexOptions.NONE) {
      this.storeTermVector |= storeTermVector;
      this.storePayloads |= storePayloads;

      if (indexOptions != IndexOptions.NONE && this.omitNorms != omitNorms) {
        this.omitNorms = true;
      }
    }
    if (this.indexOptions == IndexOptions.NONE || this.indexOptions.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) < 0) {
      this.storePayloads = false;
    }
  }

The update function shows the rules for merging two FieldInfos of the same field: the weaker of two non-NONE IndexOptions wins, storeTermVector and storePayloads are OR-ed together, point dimensions are taken from whichever side defines them, norms are omitted as soon as the two sides disagree, and payloads are dropped when positions are not indexed. They are not covered in more detail here.

2.2.2 mergeFields

mergeFieldInfos built the field infos of the new segment; mergeFields now merges the .fdt and .fdx files of all segments according to them.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFields

  private int mergeFields() throws IOException {
    try (StoredFieldsWriter fieldsWriter = codec.storedFieldsFormat().fieldsWriter(directory, mergeState.segmentInfo, context)) {
      return fieldsWriter.merge(mergeState);
    }
  }

fieldsWriter ultimately returns a CompressingStoredFieldsWriter, whose merge function is then called to start the merge.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFields->CompressingStoredFieldsWriter::merge

  public int merge(MergeState mergeState) throws IOException {
    int docCount = 0;
    int numReaders = mergeState.maxDocs.length;

    MatchingReaders matching = new MatchingReaders(mergeState);

    for (int readerIndex=0;readerIndex<numReaders;readerIndex++) {
      MergeVisitor visitor = new MergeVisitor(mergeState, readerIndex);
      CompressingStoredFieldsReader matchingFieldsReader = null;
      if (matching.matchingReaders[readerIndex]) {
        final StoredFieldsReader fieldsReader = mergeState.storedFieldsReaders[readerIndex];
        matchingFieldsReader = (CompressingStoredFieldsReader) fieldsReader;
      }

      final int maxDoc = mergeState.maxDocs[readerIndex];
      final Bits liveDocs = mergeState.liveDocs[readerIndex];

      if (matchingFieldsReader == null || matchingFieldsReader.getVersion() != VERSION_CURRENT || BULK_MERGE_ENABLED == false) {

        ...

      } else if (matchingFieldsReader.getCompressionMode() == compressionMode && 
                 matchingFieldsReader.getChunkSize() == chunkSize && 
                 matchingFieldsReader.getPackedIntsVersion() == PackedInts.VERSION_CURRENT &&
                 liveDocs == null &&
                 !tooDirty(matchingFieldsReader)) {  
        matchingFieldsReader.checkIntegrity();
        IndexInput rawDocs = matchingFieldsReader.getFieldsStream();
        CompressingStoredFieldsIndexReader index = matchingFieldsReader.getIndexReader();
        rawDocs.seek(index.getStartPointer(0));
        int docID = 0;
        while (docID < maxDoc) {
          int base = rawDocs.readVInt();
          int code = rawDocs.readVInt();
          int bufferedDocs = code >>> 1;

          indexWriter.writeIndex(bufferedDocs, fieldsStream.getFilePointer());
          fieldsStream.writeVInt(docBase);
          fieldsStream.writeVInt(code);
          docID += bufferedDocs;
          docBase += bufferedDocs;
          docCount += bufferedDocs;
          final long end;
          if (docID == maxDoc) {
            end = matchingFieldsReader.getMaxPointer();
          } else {
            end = index.getStartPointer(docID);
          }
          fieldsStream.copyBytes(rawDocs, end - rawDocs.getFilePointer());
        }               
        numChunks += matchingFieldsReader.getNumChunks();
        numDirtyChunks += matchingFieldsReader.getNumDirtyChunks();
      } else {
        matchingFieldsReader.checkIntegrity();
        for (int docID = 0; docID < maxDoc; docID++) {
          if (liveDocs != null && liveDocs.get(docID) == false) {
            continue;
          }
          SerializedDocument doc = matchingFieldsReader.document(docID);
          bufferedDocs.copyBytes(doc.in, doc.length);
          numStoredFieldsInDoc = doc.numStoredFields;
          finishDocument();
          ++docCount;
        }
      }
    }
    finish(mergeState.mergeFieldInfos, docCount);
    return docCount;
  }

A MatchingReaders object is created first; its constructor marks which source segments have field infos matching those of the merged segment. The code then walks all segments and, wherever the matchingReaders flag is set, fetches that segment's CompressingStoredFieldsReader directly from MergeState's storedFieldsReaders member.
The segment's maximum document count maxDoc and its deletion information liveDocs are then taken from MergeState.
The first if branch handles the case where the source segment's structure does not match the merged segment's (or the reader versions differ), so documents have to be rewritten field by field; that code is omitted here.
The second branch handles a matching segment with no deletions (liveDocs is null) whose compression settings also match. checkIntegrity first verifies the segment's .fdt file, then rawDocs, the .fdt data stream, and index, its index information, are obtained and the stream is positioned at the start. The indexWriter member writes the new segment's .fdx file and fieldsStream writes the new .fdt file. Chunks are then copied in bulk from rawDocs: bufferedDocs is the number of documents in the current chunk and end is the file pointer at which the next chunk starts, i.e. where the current chunk ends; the .fdx index is updated, the new docBase and the chunk header code are written into the new .fdt file, and copyBytes copies the chunk's document data.
The third branch handles a matching segment that does have deletions (or whose chunk settings prevent the bulk copy). checkIntegrity again verifies the segment's .fdt file, then every document is visited: if its docID is marked deleted in liveDocs the loop continues; otherwise CompressingStoredFieldsReader's document function returns a SerializedDocument, its bytes are copied into bufferedDocs, the number of stored fields is recorded in numStoredFieldsInDoc, and finishDocument completes the document. A small worked example of the chunk header used in the bulk-copy branch follows.
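
For reference, a small worked example of the chunk header decoded in the bulk-copy branch; the values are hypothetical, and the meaning of the low bit follows the stored-fields chunk format assumed here:

  int code = 13;                     // second vInt of a chunk header
  int bufferedDocs = code >>> 1;     // 6 documents stored in this chunk
  boolean sliced = (code & 1) != 0;  // low bit set: the chunk's data spans several blocks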

After all segments are processed, finish writes the trailing information to the .fdt and .fdx files; at this point the new segment's .fdt and .fdx files are complete. Finally, the number of documents written into the new segment is returned.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFields->CompressingStoredFieldsWriter::merge->MatchingReaders::MatchingReaders

  MatchingReaders(MergeState mergeState) {
    int numReaders = mergeState.maxDocs.length;
    int matchedCount = 0;
    matchingReaders = new boolean[numReaders];

    nextReader:
    for (int i = 0; i < numReaders; i++) {
      for (FieldInfo fi : mergeState.fieldInfos[i]) {
        FieldInfo other = mergeState.mergeFieldInfos.fieldInfo(fi.number);
        if (other == null || !other.name.equals(fi.name)) {
          continue nextReader;
        }
      }
      matchingReaders[i] = true;
      matchedCount++;
    }
    this.count = matchedCount;
  }

MatchingReaders' constructor iterates over all segments and, for each segment, over all of its fields, checking by FieldInfo number and name whether the field matches the corresponding field of the merged segment; if every field of a segment matches, the corresponding slot of matchingReaders is set to true.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeFields->CompressingStoredFieldsWriter::merge->finishDocument

  public void finishDocument() throws IOException {
    if (numBufferedDocs == this.numStoredFields.length) {
      final int newLength = ArrayUtil.oversize(numBufferedDocs + 1, 4);
      this.numStoredFields = Arrays.copyOf(this.numStoredFields, newLength);
      endOffsets = Arrays.copyOf(endOffsets, newLength);
    }
    this.numStoredFields[numBufferedDocs] = numStoredFieldsInDoc;
    numStoredFieldsInDoc = 0;
    endOffsets[numBufferedDocs] = bufferedDocs.length;
    ++numBufferedDocs;
    if (triggerFlush()) {
      flush();
    }
  }

finishDocument stores the current document's stored-field count in numStoredFields and the current buffer offset in endOffsets; when triggerFlush fires, flush writes the buffered data into the new .fdx and .fdt files.

2.2.3 mergeTerms

mergeTerms produces the merged segment's .doc, .pos, .pay, .tim and .tip files.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeTerms

  private void mergeTerms(SegmentWriteState segmentWriteState) throws IOException {
    try (FieldsConsumer consumer = codec.postingsFormat().fieldsConsumer(segmentWriteState)) {
      consumer.merge(mergeState);
    }
  }

consumer is ultimately a PerFieldPostingsFormat.FieldsWriter, whose merge function is called to carry out the merge.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeTerms->FieldsWriter::merge

  public void merge(MergeState mergeState) throws IOException {
    final List<Fields> fields = new ArrayList<>();
    final List<ReaderSlice> slices = new ArrayList<>();
    int docBase = 0;

    for(int readerIndex=0;readerIndex<mergeState.fieldsProducers.length;readerIndex++) {
      final FieldsProducer f = mergeState.fieldsProducers[readerIndex];

      final int maxDoc = mergeState.maxDocs[readerIndex];
      f.checkIntegrity();
      slices.add(new ReaderSlice(docBase, maxDoc, readerIndex));
      fields.add(f);
      docBase += maxDoc;
    }

    Fields mergedFields = new MappedMultiFields(mergeState, new MultiFields(fields.toArray(Fields.EMPTY_ARRAY), slices.toArray(ReaderSlice.EMPTY_ARRAY)));
    write(mergedFields);
  }

merge iterates over each segment's FieldsProducer, calling checkIntegrity to verify the .tim, .pos, .doc and .pay files. A ReaderSlice is created for each segment recording its document base, maxDoc and reader index; for example, if segment 0 has maxDoc 100 and segment 1 has maxDoc 50, the slices are (0, 100, 0) and (100, 50, 1). Finally a MappedMultiFields is created over all of them and write is called to continue.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeTerms->FieldsWriter::merge->write

    public void write(Fields fields) throws IOException {

      Map<PostingsFormat,FieldsGroup> formatToGroups = new HashMap<>();
      Map<String,Integer> suffixes = new HashMap<>();

      for(String field : fields) {
        FieldInfo fieldInfo = writeState.fieldInfos.fieldInfo(field);
        final PostingsFormat format = getPostingsFormatForField(field);
        String formatName = format.getName();

        FieldsGroup group = formatToGroups.get(format);
        if (group == null) {
          Integer suffix = suffixes.get(formatName);
          if (suffix == null) {
            suffix = 0;
          } else {
            suffix = suffix + 1;
          }
          suffixes.put(formatName, suffix);
          String segmentSuffix = getFullSegmentSuffix(field, writeState.segmentSuffix, getSuffix(formatName, Integer.toString(suffix)));
          group = new FieldsGroup();
          group.state = new SegmentWriteState(writeState, segmentSuffix);
          group.suffix = suffix;
          formatToGroups.put(format, group);
        }

        group.fields.add(field);
        String previousValue = fieldInfo.putAttribute(PER_FIELD_FORMAT_KEY, formatName);
        previousValue = fieldInfo.putAttribute(PER_FIELD_SUFFIX_KEY, Integer.toString(group.suffix));
      }

      for(Map.Entry<PostingsFormat,FieldsGroup> ent : formatToGroups.entrySet()) {
        PostingsFormat format = ent.getKey();
        final FieldsGroup group = ent.getValue();

        Fields maskedFields = new FilterFields(fields) {
            @Override
            public Iterator<String> iterator() {
              return group.fields.iterator();
            }
        };

        FieldsConsumer consumer = format.fieldsConsumer(group.state);
        consumer.write(maskedFields);
      }
    }

The write function first iterates over all fields, looking up the new segment's FieldInfo by field name; getPostingsFormatForField returns Lucene50PostingsFormat by default. The FieldsGroup created here groups the fields that share the same postings format and is added to the formatToGroups map.
write then iterates over formatToGroups; for each format, fieldsConsumer uses the SegmentWriteState and the format to create the corresponding .doc, .pos, .pay, .tim and .tip files and initialize their output streams, and finally the returned BlockTreeTermsWriter's write function writes the existing data into these new files. A sketch of how the per-field format is customized follows.
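
For illustration, a hedged sketch of how getPostingsFormatForField is customized in practice; it assumes the Lucene 6.x Lucene60Codec, and the field name and the "Memory" format (from the lucene-codecs module) are purely illustrative:

  IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
  iwc.setCodec(new Lucene60Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("suggest_field".equals(field)) {
        return PostingsFormat.forName("Memory");      // this field gets its own postings format
      }
      return super.getPostingsFormatForField(field);  // everything else stays on Lucene50
    }
  });
  // During a merge, FieldsWriter.write groups fields by the format returned here; each group gets
  // its own suffix, so the postings files come out as _N_Lucene50_0.* (and _N_Memory_0.* here).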

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeTerms->FieldsWriter::merge->write->BlockTreeTermsWriter::write

  public void write(Fields fields) throws IOException {

    String lastField = null;
    for(String field : fields) {
      lastField = field;
      Terms terms = fields.terms(field);
      FieldInfo fieldInfo = fieldInfos.fieldInfo(field);

      TermsEnum termsEnum = terms.iterator();
      TermsWriter termsWriter = new TermsWriter(fieldInfos.fieldInfo(field));
      while (true) {
        BytesRef term = termsEnum.next();
        if (term == null) break;
        termsWriter.write(term, termsEnum, null);
      }
      termsWriter.finish();
    }
  }

The write function iterates over all fields; for each field, the terms function returns a MultiTerms (wrapped as a MappedMultiTerms) that contains every source segment's FieldReader for that field.
Next, the field's FieldInfo is obtained. The MappedMultiTerms' iterator then ultimately returns a MappedMultiTermsEnum, which internally wraps each segment's terms enumeration.
A TermsWriter is then created and the loop pulls each term of the merged term dictionary in order, calling TermsWriter's write and finish to write the information into the .doc, .pos, .pay, .tip and .tim files one by one. This code is similar to what was analyzed in 《lucene源码分析—9》, the difference being that the terms read from the segments' FieldReaders have to be merged in sorted order, so it is not followed further here.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeTerms->FieldsWriter::merge->write->BlockTreeTermsWriter::write->FilterFields::terms

  public Terms terms(String field) throws IOException {
    return in.terms(field);
  }

  public Terms terms(String field) throws IOException {
    MultiTerms terms = (MultiTerms) in.terms(field);
    if (terms == null) {
      return null;
    } else {
      return new MappedMultiTerms(field, mergeState, terms);
    }
  }

  public Terms terms(String field) throws IOException {
    Terms result = terms.get(field);
    if (result != null)
      return result;

    final List<Terms> subs2 = new ArrayList<>();
    final List<ReaderSlice> slices2 = new ArrayList<>();

    for(int i=0;i<subs.length;i++) {
      final Terms terms = subs[i].terms(field);
      if (terms != null) {
        subs2.add(terms);
        slices2.add(subSlices[i]);
      }
    }

    result = new MultiTerms(subs2.toArray(Terms.EMPTY_ARRAY), slices2.toArray(ReaderSlice.EMPTY_ARRAY));
    terms.put(field, result);

    return result;
  }

The innermost terms function walks all segments in subs, obtains each segment's Terms (its FieldReader) for the field and adds it to subs2, adds the segment's slice information to slices2, wraps them in a MultiTerms that is cached in terms, and returns it.
The returned MultiTerms is then wrapped together with the merge information mergeState into a MappedMultiTerms and returned.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeTerms->FieldsWriter::merge->write->BlockTreeTermsWriter::write->MappedMultiTerms::iterator

  public TermsEnum iterator() throws IOException {
    TermsEnum iterator = in.iterator();
    return new MappedMultiTermsEnum(field, mergeState, (MultiTermsEnum) iterator);
  }

  public TermsEnum iterator() throws IOException {

    final List<MultiTermsEnum.TermsEnumIndex> termsEnums = new ArrayList<>();
    for(int i=0;i<subs.length;i++) {
      final TermsEnum termsEnum = subs[i].iterator();
      if (termsEnum != null) {
        termsEnums.add(new MultiTermsEnum.TermsEnumIndex(termsEnum, i));
      }
    }

    return new MultiTermsEnum(subSlices).reset(termsEnums.toArray(MultiTermsEnum.TermsEnumIndex.EMPTY_ARRAY));
  }

MappedMultiTerms' iterator in turn calls MultiTerms' iterator, which walks all segments, calls each segment FieldReader's iterator to create a SegmentTermsEnum, wraps it in a TermsEnumIndex and adds it to the termsEnums list, and finally creates a MultiTermsEnum and packs all of this into it via reset.

2.2.4 mergeNorms

mergeNorms is much simpler than the preceding merge steps; it ultimately writes data into the .nvd and .nvm files.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeNorms

  private void mergeNorms(SegmentWriteState segmentWriteState) throws IOException {
    try (NormsConsumer consumer = codec.normsFormat().normsConsumer(segmentWriteState)) {
      consumer.merge(mergeState);
    }
  }

consumer is ultimately a Lucene53NormsConsumer, whose merge function is called to continue.

MergeThread::run->doMerge->IndexWriter::merge->mergeMiddle->SegmentMerger::merge->mergeNorms->Lucene53NormsConsumer::merge

  public void merge(MergeState mergeState) throws IOException {
    for(NormsProducer normsProducer : mergeState.normsProducers) {
        normsProducer.checkIntegrity();
    }
    for (FieldInfo mergeFieldInfo : mergeState.mergeFieldInfos) {
      if (mergeFieldInfo.hasNorms()) {
        List<NumericDocValues> toMerge = new ArrayList<>();
        for (int i=0;i<mergeState.normsProducers.length;i++) {
          NormsProducer normsProducer = mergeState.normsProducers[i];
          NumericDocValues norms = null;
          if (normsProducer != null) {
            FieldInfo fieldInfo = mergeState.fieldInfos[i].fieldInfo(mergeFieldInfo.name);
            if (fieldInfo != null && fieldInfo.hasNorms()) {
              norms = normsProducer.getNorms(fieldInfo);
            }
          }
          toMerge.add(norms);
        }
        mergeNormsField(mergeFieldInfo, mergeState, toMerge);
      }
    }
  }

Each segment's Lucene53NormsProducer is integrity-checked first; then the outer loop walks all merged fields and the inner loop walks all segments, adding each segment's norms for the field to the toMerge list, and finally mergeNormsField writes them into the new segment's .nvd and .nvm files. The actual writing is not followed further here.

2.3 mergeFinish

mergeFinish does the final cleanup.

MergeThread::run->doMerge->IndexWriter::merge->mergeFinish

  final synchronized void mergeFinish(MergePolicy.OneMerge merge) {

    if (merge.registerDone) {
      final List<SegmentCommitInfo> sourceSegments = merge.segments;
      for (SegmentCommitInfo info : sourceSegments) {
        mergingSegments.remove(info);
      }
      merge.registerDone = false;
    }
    runningMerges.remove(merge);
  }

mergeFinish removes each source segment's SegmentCommitInfo from the mergingSegments set and then removes the OneMerge constructed earlier from runningMerges, marking the merge as complete.
