HBase MemStore Flush Execution Flow

Preface

In the previous post (https://blog.csdn.net/qq_35542970/article/details/109390109) we analyzed the conditions that trigger a memstore flush, and saw that HBase places each flush request into the flushQueue defined in MemStoreFlusher. So how are the requests in that queue processed?

1. Processing the flush queue

1.1 Processing flow of the flush request queue

Taking a flush triggered by a put (or similar write) as an example, the flow is as follows:
[Figure: flush request processing flow; image from https://blog.csdn.net/youmengjiuzhuiba/article/details/45531151]

1.2 FlushHandler

The flush worker threads of MemStoreFlusher are defined as FlushHandler instances; the initialization code is:

    int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
    this.flushHandlers = new FlushHandler[handlerCount];

handlerCount is the number of flush threads per RegionServer. The default of 2 is on the small side; in practice it is worth raising (as a tuning knob it pays off well).
When the HRegionServer starts, it starts these worker threads as well. The start code:
MemStoreFlusher#start

  synchronized void start(UncaughtExceptionHandler eh) {
    ThreadFactory flusherThreadFactory = Threads.newDaemonThreadFactory(
        server.getServerName().toShortString() + "-MemStoreFlusher", eh);
    for (int i = 0; i < flushHandlers.length; i++) {
      flushHandlers[i] = new FlushHandler("MemStoreFlusher." + i);
      flusherThreadFactory.newThread(flushHandlers[i]);
      flushHandlers[i].start();
    }
  }

The logic of FlushHandler:

  private class FlushHandler extends HasThread {

    private FlushHandler(String name) {
      super(name);
    }

    @Override
    public void run() {
      // 
      while (!server.isStopped()) {
        FlushQueueEntry fqe = null;
        try {
          wakeupPending.set(false); // allow someone to wake us up again
          fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
          if (fqe == null || fqe == WAKEUPFLUSH_INSTANCE) {
            // no pending flush request, or just a wake-up token
            FlushType type = isAboveLowWaterMark();
            if (type != FlushType.NORMAL) {
              ...
              if (!flushOneForGlobalPressure()) {
                Thread.sleep(1000);
                wakeUpIfBlocking();
              }
              // Enqueue another one of these tokens so we'll wake up again
              wakeupFlushThread(); 
            }
            continue;
          }
          FlushRegionEntry fre = (FlushRegionEntry) fqe;
          if (!flushRegion(fre)) {
            break;
          }
        } catch (InterruptedException ex) {
          continue;
        } catch (ConcurrentModificationException ex) {
          continue;
        } catch (Exception ex) {
          LOG.error("Cache flusher failed for entry " + fqe, ex);
          if (!server.checkFileSystem()) {
            break;
          }
        }
      }
      synchronized (regionsInQueue) {
        regionsInQueue.clear();
        flushQueue.clear();
      }

      // Signal anyone waiting, so they see the close flag
      wakeUpIfBlocking();
      LOG.info(getName() + " exiting");
    }
  }

The run method is a loop: as long as the RegionServer has not stopped, the FlushHandler keeps polling the queue for a concrete request fqe. If there is no pending request, or the entry is just a wake-up token, it checks the global MemStore size of the RegionServer to decide whether a flush is needed anyway.
This is where the 4th flush trigger described in the previous post fires.

  // decide whether a flush is needed based on the global MemStore size
  public FlushType isAboveLowWaterMark() {
    // for onheap memstore we check if the global memstore size and the
    // global heap overhead is greater than the global memstore lower mark limit
    if (memType == MemoryType.HEAP) {
      if (getGlobalMemStoreHeapSize() >= globalMemStoreLimitLowMark) {
        return FlushType.ABOVE_ONHEAP_LOWER_MARK;
      }
    } else {
      if (getGlobalMemStoreOffHeapSize() >= globalMemStoreLimitLowMark) {
        // Indicates that the offheap memstore's size is greater than the global memstore
        // lower limit
        return FlushType.ABOVE_OFFHEAP_LOWER_MARK;
      } else if (getGlobalMemStoreHeapSize() >= globalOnHeapMemstoreLimitLowMark) {
        // Indicates that the offheap memstore's heap overhead is greater than the global memstore
        // onheap lower limit
        return FlushType.ABOVE_ONHEAP_LOWER_MARK;
      }
    }
    return FlushType.NORMAL;
  }

  this.globalMemStoreLimitLowMark =
      (long) (this.globalMemStoreLimit * this.globalMemStoreLimitLowMarkPercent);

Two thresholds appear here: globalMemStoreLimit and globalMemStoreLimitLowMark. With the default configuration, the former is 40% of the RegionServer heap and the latter is 95% of the former. Why these values? In short: once the combined MemStore size reaches 40% of the RegionServer heap, update operations on that RegionServer are blocked while the MemStore content is forcibly flushed, which is very bad for performance. Flushing is therefore started early, at 95% of that limit; the difference is that this early flushing does not block reads and writes.
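As a concrete illustration of the two thresholds, the sketch below computes them for a hypothetical 32 GB RegionServer heap using the default 40% / 95% values; the heap size is an assumption chosen just for this example, not something read from a cluster:

  // Minimal sketch (not HBase source): how the two global MemStore thresholds relate.
  public class GlobalMemStoreLimits {
    public static void main(String[] args) {
      long heapBytes = 32L * 1024 * 1024 * 1024;  // assumed 32 GB RegionServer heap
      double globalLimitPercent = 0.4;            // hbase.regionserver.global.memstore.size (default)
      double lowerMarkPercent = 0.95;             // hbase.regionserver.global.memstore.size.lower.limit (default)

      long globalMemStoreLimit = (long) (heapBytes * globalLimitPercent);                // ~12.8 GB: updates are blocked here
      long globalMemStoreLimitLowMark = (long) (globalMemStoreLimit * lowerMarkPercent); // ~12.16 GB: non-blocking flushing starts here

      System.out.println("upper limit (blocking flush): " + globalMemStoreLimit + " bytes");
      System.out.println("lower mark (flush starts):    " + globalMemStoreLimitLowMark + " bytes");
    }
  }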

Back in the run method above: when a forced flush is needed, flushOneForGlobalPressure is called. To make the flush effective while keeping the blocking time short, flushOneForGlobalPressure is careful about which region it picks. In short, the chosen region must satisfy two conditions:
(1) it does not have too many StoreFiles, i.e. it is a region that can be flushed quickly, keeping the blocking time short;
(2) among the regions satisfying (1), it has the largest MemStore, i.e. each forced flush frees as much memory as possible.
The relevant code:

  /**
   * The memstore across all regions has exceeded the low water mark. Pick
   * one region to flush and flush it synchronously (this is called from the
   * flush thread)
   * @return true if successful
   */
  private boolean flushOneForGlobalPressure() {
    SortedMap<Long, HRegion> regionsBySize = null;
    // sort the regions by on-heap or off-heap memstore size, depending on the flush type
    switch(flushType) {
      case ABOVE_OFFHEAP_HIGHER_MARK:
      case ABOVE_OFFHEAP_LOWER_MARK:
        regionsBySize = server.getCopyOfOnlineRegionsSortedByOffHeapSize();
        break;
      case ABOVE_ONHEAP_HIGHER_MARK:
      case ABOVE_ONHEAP_LOWER_MARK:
      default:
        regionsBySize = server.getCopyOfOnlineRegionsSortedByOnHeapSize();
    }
    ...

    boolean flushedOne = false;
    while (!flushedOne) {
	  // Find the biggest region that doesn't have too many storefiles (might be null!)
      HRegion bestFlushableRegion =
          getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
      // Find the biggest region, total, even if it might have too many flushes.
      HRegion bestAnyRegion = getBiggestMemStoreRegion(regionsBySize, excludedRegions, false);
      // Find the biggest region that is a secondary region
      HRegion bestRegionReplica = getBiggestMemStoreOfRegionReplica(regionsBySize, excludedRegions);
      if (bestAnyRegion == null) {
        // If bestAnyRegion is null, assign replica. It may be null too. Next step is check for null
        bestAnyRegion = bestRegionReplica;
      }
      if (bestAnyRegion == null) {
        LOG.error("Above memory mark but there are no flushable regions!");
        return false;
      }
      ...
      HRegion regionToFlush;
	  ...
      // delegate to flushRegion
      flushedOne = flushRegion(regionToFlush, true, false, FlushLifeCycleTracker.DUMMY);
      ...
    }
    return true;
  }

  // pick a suitable region to flush
  private HRegion getBiggestMemStoreRegion(
      SortedMap<Long, HRegion> regionsBySize,
      Set<HRegion> excludedRegions,
      boolean checkStoreFileCount) {
    synchronized (regionsInQueue) {
      for (HRegion region : regionsBySize.values()) {
        if (excludedRegions.contains(region)) {
          continue;
        }

        if (region.writestate.flushing || !region.writestate.writesEnabled) {
          continue;
        }

        if (checkStoreFileCount && isTooManyStoreFiles(region)) {
          continue;
        }
        return region;
      }
    }
    return null;
  }

2. The flush implementation

flushRegion first checks the number of store files in the region. If there are too many, it first issues a compaction request for the region and then puts the request back into flushQueue to wait for the next round; when re-queued, the entry comes back out of the queue after a shortened delay (blockingWaitTime / 100 ms).
MemStoreFlusher#flushRegion

  private boolean flushRegion(final FlushRegionEntry fqe) {
    HRegion region = fqe.region;
    if (!region.getRegionInfo().isMetaRegion() && isTooManyStoreFiles(region)) {
      if (fqe.isMaximumWait(this.blockingWaitTime)) {
        ...
      } else {
        // If this is first time we've been put off, then emit a log message.
        if (fqe.getRequeueCount() <= 0) {
          // Note: We don't impose blockingStoreFiles constraint on meta regions
          LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
              region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
              this.blockingWaitTime);
          if (!this.server.compactSplitThread.requestSplit(region)) {
            try {
              this.server.compactSplitThread.requestSystemCompaction(region,
                Thread.currentThread().getName());
            } catch (IOException e) {
              ...
            }
          }
        }

        // Put back on the queue.  Have it come back out of the queue
        // after a delay of this.blockingWaitTime / 100 ms.
        this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
        // Tell a lie, it's not flushed but it's ok
        return true;
      }
    }
    // store file count is within the limit (hbase.hstore.blockingStoreFiles, default 16)
    return flushRegion(region, false, fqe.isForceFlushAllStores(), fqe.getTracker());
  }

A flush request whose store file count is within the limit proceeds to the region-level flush implementation:

  private boolean flushRegion(HRegion region, boolean emergencyFlush, boolean forceFlushAllStores,
      FlushLifeCycleTracker tracker) {
    synchronized (this.regionsInQueue) {
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      flushQueue.remove(fqe); // remove the flush request from the queue
    }
	
    // ReentrantReadWriteLock
    lock.readLock().lock(); // take the read (shared) lock
    try {
      notifyFlushRequest(region, emergencyFlush);
      FlushResult flushResult = region.flushcache(forceFlushAllStores, false, tracker);
      boolean shouldCompact = flushResult.isCompactionNeeded();
      // We just want to check the size
      boolean shouldSplit = region.checkSplit() != null;
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region); // handle a possible split after the flush
      } else if (shouldCompact) {
        server.compactSplitThread.requestSystemCompaction(region, Thread.currentThread().getName()); // handle a possible compaction after the flush
      }
    } catch (DroppedSnapshotException ex) {
	  ...
      server.abort("Replay of WAL required. Forcing server shutdown", ex);
      return false;
    } catch (IOException ex) {
      ...
      if (!server.checkFileSystem()) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
      wakeUpIfBlocking(); // wake up all waiting threads
      tracker.afterExecution();
    }
    return true;
  }

1. During the flush the region is protected by a readLock, so anything that tries to take the writeLock is blocked, including move region, compact and so on. In addition, a flush may leave behind enough store files to trigger a compaction, and a large store file produced by the flush may in turn trigger a split.
2. The call region.flushcache(forceFlushAllStores, ...) shows that a flush is region-level: once triggered, every MemStore of the region takes part in the flush. flushcache takes a readLock once more; since ReentrantReadWriteLock is reentrant this is harmless (see the short demo below). The method also checks the region's state: if the region is closing or closed, neither compaction nor flush requests are executed, because operations like flush are usually slow and would prolong the time it takes to close and take the region offline.
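The reentrancy claim is easy to verify outside HBase. The following standalone sketch (plain JDK code, nothing HBase-specific) shows the same thread acquiring the read lock of a ReentrantReadWriteLock twice without blocking itself:

  import java.util.concurrent.locks.ReentrantReadWriteLock;

  public class ReentrantReadLockDemo {
    public static void main(String[] args) {
      ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
      lock.readLock().lock();     // outer acquisition, as on the flush path
      try {
        lock.readLock().lock();   // nested acquisition, as inside flushcache -- does not block
        try {
          // prints 2: the current thread holds the read lock twice
          System.out.println("read hold count = " + lock.getReadHoldCount());
        } finally {
          lock.readLock().unlock();
        }
      } finally {
        lock.readLock().unlock();
      }
    }
  }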
Once all checks pass, the real flush work begins. Drilling down through the call chain layer by layer, the final implementation is internalFlushcache, shown below:
HRegion#internalFlushcache

  /**
   * Flush the memstore. Flushing the memstore is a little tricky. We have a lot of updates in the
   * memstore, all of which have also been written to the wal. We need to write those updates in the
   * memstore out to disk, while being able to process reads/writes as much as possible during the
   * flush operation.
   */
  protected FlushResultImpl internalFlushcache(WAL wal, long myseqid,
      Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
      FlushLifeCycleTracker tracker) throws IOException {
    // internalPrepareFlushCache takes the MemStore snapshot
    PrepareFlushResult result =
        internalPrepareFlushCache(wal, myseqid, storesToFlush, status, writeFlushWalMarker, tracker);
    // If the prepare phase succeeded, result.result is null, so internalFlushCacheAndCommit runs phases two and three.
    if (result.result == null) {
      return internalFlushCacheAndCommit(wal, status, result, storesToFlush);
    } else {
      return result.result; // early exit due to failure from prepare stage
    }
  }

internalPrepareFlushCache does the preparation: it obtains an MVCC transaction point, sets up the caches and intermediate data structures the flush needs, and takes a snapshot of the current MemStore.
internalFlushCacheAndCommit then performs the flush itself: it first writes the data to temporary tmp files, commits the update, and finally moves the files into the correct directory in HDFS.
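Before walking through the real code phase by phase, the following toy model (plain Java, not HBase code; all class and path names are made up for illustration) captures the same pattern: swap the active in-memory structure out as a snapshot, write the snapshot to a tmp file, then move the finished file into its final directory:

  import java.nio.file.*;
  import java.util.*;
  import java.util.concurrent.ConcurrentSkipListMap;

  // Toy model of the three flush phases (illustration only, not HBase code).
  public class ToyMemStoreFlush {
    private volatile NavigableMap<String, String> active = new ConcurrentSkipListMap<>();

    public void put(String key, String value) { active.put(key, value); }

    public void flush(Path storeDir) throws Exception {
      // Phase 1: swap the active map out as an immutable snapshot (in HBase this happens under updatesLock).
      NavigableMap<String, String> snapshot = active;
      active = new ConcurrentSkipListMap<>();

      // Phase 2: write the snapshot to a tmp file; new writes keep landing in the fresh active map meanwhile.
      Path tmpDir = storeDir.resolve(".tmp");
      Files.createDirectories(tmpDir);
      Path tmpFile = tmpDir.resolve("flush-" + System.nanoTime());
      List<String> lines = new ArrayList<>();
      snapshot.forEach((k, v) -> lines.add(k + "=" + v));
      Files.write(tmpFile, lines);

      // Phase 3: "commit" by moving the finished file from tmp into the store directory.
      Files.move(tmpFile, storeDir.resolve(tmpFile.getFileName()), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws Exception {
      ToyMemStoreFlush store = new ToyMemStoreFlush();
      store.put("row1", "a");
      store.put("row2", "b");
      store.flush(Files.createTempDirectory("toy-store"));
    }
  }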

The three phases of a flush:

Phase 1: take the snapshot

HRegion#internalPrepareFlushCache

  protected PrepareFlushResult internalPrepareFlushCache(WAL wal, long myseqid,
      Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
      FlushLifeCycleTracker tracker) throws IOException {
    ...
    // block waiting for the lock for internal flush
    // acquire the write lock on updatesLock
    this.updatesLock.writeLock().lock();
    ...
    // storeFlushCtxs, committedFiles and storeFlushableSize are built here; the important ones are storeFlushCtxs and committedFiles.
    // Both are TreeMaps keyed by column family: storeFlushCtxs holds the executor that actually flushes the store (StoreFlusherImpl),
    // and committedFiles holds the HFiles that end up being written.
    // The StoreFlushContext implementation, StoreFlusherImpl, contains the core flush operations: prepare, flushCache, commit, abort, etc.
    // So one flush context is kept per store, and the flush is later driven through these StoreFlushContext instances.
    TreeMap<byte[], StoreFlushContext> storeFlushCtxs = new TreeMap<>(Bytes.BYTES_COMPARATOR); // per-store flush contexts, keyed by column family
    ...
    try {
      ...
      // Iterate over the stores to be flushed and create a StoreFlusherImpl for each one.
      // Taking the MemStore snapshot means calling each StoreFlusherImpl's prepare method,
      // and the flush and commit steps in internalFlushCacheAndCommit likewise go through each store's flushCache and commit.

      for (HStore s : storesToFlush) { // iterate over the stores of this region, initializing storeFlushCtxs & committedFiles
        // create a StoreFlusherImpl for each store
        storeFlushCtxs.put(s.getColumnFamilyDescriptor().getName(),
          s.createFlushContext(flushOpSeqId, tracker));
        // for writing stores to WAL
        // the flushed HFile path is not known yet
        committedFiles.put(s.getColumnFamilyDescriptor().getName(), null);
      }
      ...
      // Prepare flush (take a snapshot)
      // StoreFlushContext here is StoreFlusherImpl
      storeFlushCtxs.forEach((name, flush) -> {
        // for each store of the region, move the memstore's active cell set into the memstore snapshot and clear the active set,
        // then expose that snapshot through the HStore
        MemStoreSize snapshotSize = flush.prepare(); // prepare() calls the store's memstore.snapshot() to take the snapshot
        ...
      });
    } catch (IOException ex) {
      doAbortFlushToWAL(wal, flushOpSeqId, committedFiles);
      throw ex;
    } finally {
      // release the lock once the snapshot is taken; client reads and writes are no longer blocked
      this.updatesLock.writeLock().unlock();
    }
    ...
  }

A few key points here:
First, this method is protected by updatesLock.writeLock(). updatesLock, like the lock mentioned earlier, is a ReentrantReadWriteLock, so why take another lock? The earlier lock guards region-wide behaviour such as split, move and merge, whereas updatesLock guards data update requests. Holding updatesLock while the snapshot is taken guarantees data consistency; it is released as soon as the snapshot exists, so user requests can proceed in parallel with flushing the snapshot to disk, which keeps throughput up.
Second, how are the MemStore snapshot, flush and commit actually implemented? internalPrepareFlushCache contains the following code:

      for (HStore s : storesToFlush) { // iterate over the stores of this region, initializing storeFlushCtxs & committedFiles
        // create a StoreFlusherImpl for each store
        storeFlushCtxs.put(s.getColumnFamilyDescriptor().getName(),
          s.createFlushContext(flushOpSeqId, tracker));
        // for writing stores to WAL
        // the flushed HFile path is not known yet
        committedFiles.put(s.getColumnFamilyDescriptor().getName(), null);
      }

The StoreFlusherImpl objects held in storeFlushCtxs carry the core flush operations: prepare, flushCache, commit, abort and so on. StoreFlusherImpl is an inner class of HStore:

  public StoreFlushContext createFlushContext(long cacheFlushId, FlushLifeCycleTracker tracker) {
    return new StoreFlusherImpl(cacheFlushId, tracker);
  }


  private final class StoreFlusherImpl implements StoreFlushContext {
    /**
     * This is not thread safe. The caller should have a lock on the region or the store.
     * If necessary, the lock can be added with the patch provided in HBASE-10087
     */
    @Override
    public MemStoreSize prepare() {
      // passing the current sequence number of the wal - to allow bookkeeping in the memstore
      // As noted above, when the region calls StoreFlusherImpl.prepare it is holding the updatesLock write lock,
      // so any expensive work here directly delays in-flight reads and writes.
      // The snapshot logic merely hands the memstore's active skip list over to the snapshot; when the
      // MemStoreSnapshot is returned, snapshot.size() is evaluated.
      this.snapshot = memstore.snapshot();
      // MemStoreSnapshot.getCellsCount() is the snapshot.size() value captured when the snapshot was taken; that size() call is O(n)
      ...
    }

    @Override
    public void flushCache(MonitoredTask status) throws IOException {
	  ...
      tempFiles =
          HStore.this.flushCache(cacheFlushSeqNum, snapshot, status, throughputController, tracker);
    }

    @Override
    public boolean commit(MonitoredTask status) throws IOException {
      ...
      List<HStoreFile> storeFiles = new ArrayList<>(this.tempFiles.size());
      for (Path storeFilePath : tempFiles) {
        try {
          HStoreFile sf = HStore.this.commitFile(storeFilePath, cacheFlushSeqNum, status);
          outputFileSize += sf.getReader().length();
          storeFiles.add(sf);
        } catch (IOException ex) {
          ...
      }
	  ...
    }
	
  }

Phases 2 & 3: flush the data to disk and move it into place

HRegion#internalFlushCacheAndCommit

  protected FlushResultImpl internalFlushCacheAndCommit(WAL wal, MonitoredTask status,
      PrepareFlushResult prepareResult, Collection<HStore> storesToFlush) throws IOException {
    ...
    try {
      // A.  Flush memstore to all the HStores.
      // Keep running vector of all store files that includes both old and the
      // just-made new flush store file. The new flushed file is still in the
      // tmp directory.
      // Iterate over each store of the region and call HStore.StoreFlusherImpl.flushCache,
      // flushing the store's snapshot into an HFile; the file is written under tmp first and moved to its final path by commit

      for (StoreFlushContext flush : storeFlushCtxs.values()) {
        flush.flushCache(status);
      }

      // Switch snapshot (in memstore) -> new hfile (thus causing
      // all the store scanners to reset/reseek).
      Iterator<HStore> it = storesToFlush.iterator();
      // stores.values() and storeFlushCtxs have same order
      for (StoreFlushContext flush : storeFlushCtxs.values()) {
        // move the file from the tmp path into the column family directory
        boolean needsCompaction = flush.commit(status);
        if (needsCompaction) {
          compactionRequested = true;
        }
        ...
      }
      storeFlushCtxs.clear();
    ...
  }

Parameter tuning

1. The number of column families per table: a flush covers the whole region, not a single store (HBase 2.0+ offers flush policies that avoid flushing every column family).
2. hbase.hstore.flusher.count: the number of threads flushing memstores to disk; raising it speeds up flushing and reduces blocking, with a noticeable effect (see the sketch after this list).
3. hbase.hregion.memstore.flush.size: the per-region memstore flush threshold, 128 MB by default; once it is exceeded the whole region is flushed. Raising it lowers the flush frequency.
4. hbase.regionserver.global.memstore.size: defaults to 40% of the heap.
5. hbase.regionserver.global.memstore.size.lower.limit: the low-water mark at which flushing starts before updates get blocked; as a fraction of the heap it works out to 40% × 95%.
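The sketch below shows one way these knobs could be set programmatically for a test; the concrete values are illustrative assumptions, not recommendations, and in production they would normally live in hbase-site.xml instead:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  // Illustrative tuning sketch; the values are assumptions for demonstration only.
  public class FlushTuningExample {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.hstore.flusher.count", 4);                                // more flush threads (default 2)
      conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);       // 256 MB per-region flush threshold
      conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);              // 40% of the heap
      conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.95f); // 95% of the 40%
      System.out.println("flusher count = " + conf.getInt("hbase.hstore.flusher.count", 2));
    }
  }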

References:
https://blog.csdn.net/bryce123phy/article/details/54291728
https://blog.csdn.net/youmengjiuzhuiba/article/details/45531151
