HBase MemStore Flush Execution Flow

Preface

In the previous post (https://blog.csdn.net/qq_35542970/article/details/109390109) we analyzed the conditions that trigger a memstore flush, and saw that HBase places each flush request into the flushQueue defined in MemStoreFlusher. So how are the requests in that queue processed?

1. Processing the flush queue

1.1 Processing flow of the flush request queue

Taking a flush triggered by a put (or similar write) as an example, the flow is as follows:
[Figure: flush request processing flow; image from https://blog.csdn.net/youmengjiuzhuiba/article/details/45531151]

1.2 FlushHandler

The flush worker threads of MemStoreFlusher are defined as FlushHandler instances; the initialization code is:

    int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
    this.flushHandlers = new FlushHandler[handlerCount];

handlerCount is the number of flush threads per RegionServer. The default of 2 is on the small side; in practice it is worth raising (as a tuning knob it pays off well).
When the HRegionServer starts, it starts these worker threads as well. The start code:
MemStoreFlusher#start

  synchronized void start(UncaughtExceptionHandler eh) {
    ThreadFactory flusherThreadFactory = Threads.newDaemonThreadFactory(
        server.getServerName().toShortString() + "-MemStoreFlusher", eh);
    for (int i = 0; i < flushHandlers.length; i++) {
      flushHandlers[i] = new FlushHandler("MemStoreFlusher." + i);
      flusherThreadFactory.newThread(flushHandlers[i]);
      flushHandlers[i].start();
    }
  }

The logic of FlushHandler:

  private class FlushHandler extends HasThread {

    private FlushHandler(String name) {
      super(name);
    }

    @Override
    public void run() {
      // 
      while (!server.isStopped()) {
        FlushQueueEntry fqe = null;
        try {
          wakeupPending.set(false); // allow someone to wake us up again
          fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
          if (fqe == null || fqe == WAKEUPFLUSH_INSTANCE) {
            // no pending flush request, or just a wake-up token
            FlushType type = isAboveLowWaterMark();
            if (type != FlushType.NORMAL) {
              ...
              if (!flushOneForGlobalPressure()) {
                Thread.sleep(1000);
                wakeUpIfBlocking();
              }
              // Enqueue another one of these tokens so we'll wake up again
              wakeupFlushThread(); 
            }
            continue;
          }
          FlushRegionEntry fre = (FlushRegionEntry) fqe;
          if (!flushRegion(fre)) {
            break;
          }
        } catch (InterruptedException ex) {
          continue;
        } catch (ConcurrentModificationException ex) {
          continue;
        } catch (Exception ex) {
          LOG.error("Cache flusher failed for entry " + fqe, ex);
          if (!server.checkFileSystem()) {
            break;
          }
        }
      }
      synchronized (regionsInQueue) {
        regionsInQueue.clear();
        flushQueue.clear();
      }

      // Signal anyone waiting, so they see the close flag
      wakeUpIfBlocking();
      LOG.info(getName() + " exiting");
    }
  }

The run method is a loop: as long as the RegionServer has not stopped, the FlushHandler keeps polling the queue for a concrete request fqe. If there is no pending request, or the entry is just a wake-up token, it checks the global MemStore size of the RegionServer to decide whether a flush is needed anyway.
This is where the 4th flush trigger described in the previous post fires.

  // decide whether a flush is needed based on the global MemStore size
  public FlushType isAboveLowWaterMark() {
    // for onheap memstore we check if the global memstore size and the
    // global heap overhead is greater than the global memstore lower mark limit
    if (memType == MemoryType.HEAP) {
      if (getGlobalMemStoreHeapSize() >= globalMemStoreLimitLowMark) {
        return FlushType.ABOVE_ONHEAP_LOWER_MARK;
      }
    } else {
      if (getGlobalMemStoreOffHeapSize() >= globalMemStoreLimitLowMark) {
        // Indicates that the offheap memstore's size is greater than the global memstore
        // lower limit
        return FlushType.ABOVE_OFFHEAP_LOWER_MARK;
      } else if (getGlobalMemStoreHeapSize() >= globalOnHeapMemstoreLimitLowMark) {
        // Indicates that the offheap memstore's heap overhead is greater than the global memstore
        // onheap lower limit
        return FlushType.ABOVE_ONHEAP_LOWER_MARK;
      }
    }
    return FlushType.NORMAL;
  }

  this.globalMemStoreLimitLowMark =
      (long) (this.globalMemStoreLimit * this.globalMemStoreLimitLowMarkPercent);

Two thresholds appear here: globalMemStoreLimit and globalMemStoreLimitLowMark. With the default configuration, the former is 40% of the RegionServer heap and the latter is 95% of the former. Why these values? In short: once the combined MemStore size reaches 40% of the RegionServer heap, update operations on that RegionServer are blocked while the MemStore content is forcibly flushed, which is very bad for performance. Flushing is therefore started early, at 95% of that limit; the difference is that this early flushing does not block reads and writes.
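As a concrete illustration of the two thresholds, the sketch below computes them for a hypothetical 32 GB RegionServer heap using the default 40% / 95% values; the heap size is an assumption chosen just for this example, not something read from a cluster:

  // Minimal sketch (not HBase source): how the two global MemStore thresholds relate.
  public class GlobalMemStoreLimits {
    public static void main(String[] args) {
      long heapBytes = 32L * 1024 * 1024 * 1024;  // assumed 32 GB RegionServer heap
      double globalLimitPercent = 0.4;            // hbase.regionserver.global.memstore.size (default)
      double lowerMarkPercent = 0.95;             // hbase.regionserver.global.memstore.size.lower.limit (default)

      long globalMemStoreLimit = (long) (heapBytes * globalLimitPercent);                // ~12.8 GB: updates are blocked here
      long globalMemStoreLimitLowMark = (long) (globalMemStoreLimit * lowerMarkPercent); // ~12.16 GB: non-blocking flushing starts here

      System.out.println("upper limit (blocking flush): " + globalMemStoreLimit + " bytes");
      System.out.println("lower mark (flush starts):    " + globalMemStoreLimitLowMark + " bytes");
    }
  }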

Back in the run method above: when a forced flush is needed, flushOneForGlobalPressure is called. To make the flush effective while keeping the blocking time short, flushOneForGlobalPressure is careful about which region it picks. In short, the chosen region must satisfy two conditions:
(1) it does not have too many StoreFiles, i.e. it is a region that can be flushed quickly, keeping the blocking time short;
(2) among the regions satisfying (1), it has the largest MemStore, i.e. each forced flush frees as much memory as possible.
The relevant code:

  /**
   * The memstore across all regions has exceeded the low water mark. Pick
   * one region to flush and flush it synchronously (this is called from the
   * flush thread)
   * @return true if successful
   */
  private boolean flushOneForGlobalPressure() {
    SortedMap<Long, HRegion> regionsBySize = null;
    // sort the regions by on-heap or off-heap memstore size, depending on the flush type
    switch(flushType) {
      case ABOVE_OFFHEAP_HIGHER_MARK:
      case ABOVE_OFFHEAP_LOWER_MARK:
        regionsBySize = server.getCopyOfOnlineRegionsSortedByOffHeapSize();
        break;
      case ABOVE_ONHEAP_HIGHER_MARK:
      case ABOVE_ONHEAP_LOWER_MARK:
      default:
        regionsBySize = server.getCopyOfOnlineRegionsSortedByOnHeapSize();
    }
    ...

    boolean flushedOne = false;
    while (!flushedOne) {
	  // Find the biggest region that doesn't have too many storefiles (might be null!)
      HRegion bestFlushableRegion =
          getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
      // Find the biggest region, total, even if it might have too many flushes.
      HRegion bestAnyRegion = getBiggestMemStoreRegion(regionsBySize, excludedRegions, false);
      // Find the biggest region that is a secondary region
      HRegion bestRegionReplica = getBiggestMemStoreOfRegionReplica(regionsBySize, excludedRegions);
      if (bestAnyRegion == null) {
        // If bestAnyRegion is null, assign replica. It may be null too. Next step is check for null
        bestAnyRegion = bestRegionReplica;
      }
      if (bestAnyRegion == null) {
        LOG.error("Above memory mark but there are no flushable regions!");
        return false;
      }
      ...
      HRegion regionToFlush;
	  ...
      // delegate to flushRegion
      flushedOne = flushRegion(regionToFlush, true, false, FlushLifeCycleTracker.DUMMY);
      ...
    }
    return true;
  }

  // pick a suitable region to flush
  private HRegion getBiggestMemStoreRegion(
      SortedMap<Long, HRegion> regionsBySize,
      Set<HRegion> excludedRegions,
      boolean checkStoreFileCount) {
    synchronized (regionsInQueue) {
      for (HRegion region : regionsBySize.values()) {
        if (excludedRegions.contains(region)) {
          continue;
        }

        if (region.writestate.flushing || !region.writestate.writesEnabled) {
          continue;
        }

        if (checkStoreFileCount && isTooManyStoreFiles(region)) {
          continue;
        }
        return region;
      }
    }
    return null;
  }

2. The flush implementation

flushRegion first checks the number of store files in the region. If there are too many, it first issues a compaction request for the region and then puts the request back into flushQueue to wait for the next round; when re-queued, the entry comes back out of the queue after a shortened delay (blockingWaitTime / 100 ms).
MemStoreFlusher#flushRegion

  private boolean flushRegion(final FlushRegionEntry fqe) {
    HRegion region = fqe.region;
    if (!region.getRegionInfo().isMetaRegion() && isTooManyStoreFiles(region)) {
      if (fqe.isMaximumWait(this.blockingWaitTime)) {
        ...
      } else {
        // If this is first time we've been put off, then emit a log message.
        if (fqe.getRequeueCount() <= 0) {
          // Note: We don't impose blockingStoreFiles constraint on meta regions
          LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
              region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
              this.blockingWaitTime);
          if (!this.server.compactSplitThread.requestSplit(region)) {
            try {
              this.server.compactSplitThread.requestSystemCompaction(region,
                Thread.currentThread().getName());
            } catch (IOException e) {
              ...
            }
          }
        }

        // Put back on the queue.  Have it come back out of the queue
        // after a delay of this.blockingWaitTime / 100 ms.
        this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
        // Tell a lie, it's not flushed but it's ok
        return true;
      }
    }
    // store file count is within the limit (hbase.hstore.blockingStoreFiles, default 16)
    return flushRegion(region, false, fqe.isForceFlushAllStores(), fqe.getTracker());
  }

A flush request whose store file count is within the limit proceeds to the region-level flush implementation:

  private boolean flushRegion(HRegion region, boolean emergencyFlush, boolean forceFlushAllStores,
      FlushLifeCycleTracker tracker) {
    synchronized (this.regionsInQueue) {
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      flushQueue.remove(fqe); // remove the flush request from the queue
    }
	
    // ReentrantReadWriteLock
    lock.readLock().lock(); // take the read (shared) lock
    try {
      notifyFlushRequest(region, emergencyFlush);
      FlushResult flushResult = region.flushcache(forceFlushAllStores, false, tracker);
      boolean shouldCompact = flushResult.isCompactionNeeded();
      // We just want to check the size
      boolean shouldSplit = region.checkSplit() != null;
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region); // handle a possible split after the flush
      } else if (shouldCompact) {
        server.compactSplitThread.requestSystemCompaction(region, Thread.currentThread().getName()); // handle a possible compaction after the flush
      }
    } catch (DroppedSnapshotException ex) {
	  ...
      server.abort("Replay of WAL required. Forcing server shutdown", ex);
      return false;
    } catch (IOException ex) {
      ...
      if (!server.checkFileSystem()) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
      wakeUpIfBlocking(); // wake up all waiting threads
      tracker.afterExecution();
    }
    return true;
  }

1. During the flush the region is protected by a readLock, so anything that tries to take the writeLock is blocked, including move region, compact and so on. In addition, a flush may leave behind enough store files to trigger a compaction, and a large store file produced by the flush may in turn trigger a split.
2. The call region.flushcache(forceFlushAllStores, ...) shows that a flush is region-level: once triggered, every MemStore of the region takes part in the flush. flushcache takes a readLock once more; since ReentrantReadWriteLock is reentrant this is harmless (see the short demo below). The method also checks the region's state: if the region is closing or closed, neither compaction nor flush requests are executed, because operations like flush are usually slow and would prolong the time it takes to close and take the region offline.
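The reentrancy claim is easy to verify outside HBase. The following standalone sketch (plain JDK code, nothing HBase-specific) shows the same thread acquiring the read lock of a ReentrantReadWriteLock twice without blocking itself:

  import java.util.concurrent.locks.ReentrantReadWriteLock;

  public class ReentrantReadLockDemo {
    public static void main(String[] args) {
      ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
      lock.readLock().lock();     // outer acquisition, as on the flush path
      try {
        lock.readLock().lock();   // nested acquisition, as inside flushcache -- does not block
        try {
          // prints 2: the current thread holds the read lock twice
          System.out.println("read hold count = " + lock.getReadHoldCount());
        } finally {
          lock.readLock().unlock();
        }
      } finally {
        lock.readLock().unlock();
      }
    }
  }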
Once all checks pass, the real flush work begins. Drilling down through the call chain layer by layer, the final implementation is internalFlushcache, shown below:
HRegion#internalFlushcache

  /**
   * Flush the memstore. Flushing the memstore is a little tricky. We have a lot of updates in the
   * memstore, all of which have also been written to the wal. We need to write those updates in the
   * memstore out to disk, while being able to process reads/writes as much as possible during the
   * flush operation.
   */
  protected FlushResultImpl internalFlushcache(WAL wal, long myseqid,
      Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
      FlushLifeCycleTracker tracker) throws IOException {
    // internalPrepareFlushCache takes the MemStore snapshot
    PrepareFlushResult result =
        internalPrepareFlushCache(wal, myseqid, storesToFlush, status, writeFlushWalMarker, tracker);
    // If the prepare phase succeeded, result.result is null, so internalFlushCacheAndCommit runs phases two and three.
    if (result.result == null) {
      return internalFlushCacheAndCommit(wal, status, result, storesToFlush);
    } else {
      return result.result; // early exit due to failure from prepare stage
    }
  }

internalPrepareFlushCache does the preparation: it obtains an MVCC transaction point, sets up the caches and intermediate data structures the flush needs, and takes a snapshot of the current MemStore.
internalFlushCacheAndCommit then performs the flush itself: it first writes the data to temporary tmp files, commits the update, and finally moves the files into the correct directory in HDFS.
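Before walking through the real code phase by phase, the following toy model (plain Java, not HBase code; all class and path names are made up for illustration) captures the same pattern: swap the active in-memory structure out as a snapshot, write the snapshot to a tmp file, then move the finished file into its final directory:

  import java.nio.file.*;
  import java.util.*;
  import java.util.concurrent.ConcurrentSkipListMap;

  // Toy model of the three flush phases (illustration only, not HBase code).
  public class ToyMemStoreFlush {
    private volatile NavigableMap<String, String> active = new ConcurrentSkipListMap<>();

    public void put(String key, String value) { active.put(key, value); }

    public void flush(Path storeDir) throws Exception {
      // Phase 1: swap the active map out as an immutable snapshot (in HBase this happens under updatesLock).
      NavigableMap<String, String> snapshot = active;
      active = new ConcurrentSkipListMap<>();

      // Phase 2: write the snapshot to a tmp file; new writes keep landing in the fresh active map meanwhile.
      Path tmpDir = storeDir.resolve(".tmp");
      Files.createDirectories(tmpDir);
      Path tmpFile = tmpDir.resolve("flush-" + System.nanoTime());
      List<String> lines = new ArrayList<>();
      snapshot.forEach((k, v) -> lines.add(k + "=" + v));
      Files.write(tmpFile, lines);

      // Phase 3: "commit" by moving the finished file from tmp into the store directory.
      Files.move(tmpFile, storeDir.resolve(tmpFile.getFileName()), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws Exception {
      ToyMemStoreFlush store = new ToyMemStoreFlush();
      store.put("row1", "a");
      store.put("row2", "b");
      store.flush(Files.createTempDirectory("toy-store"));
    }
  }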

The three phases of a flush:

Phase 1: take the snapshot

HRegion#internalPrepareFlushCache

  protected PrepareFlushResult internalPrepareFlushCache(WAL wal, long myseqid,
      Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
      FlushLifeCycleTracker tracker) throws IOException {
    ...
    // block waiting for the lock for internal flush
    // acquire the write lock on updatesLock
    this.updatesLock.writeLock().lock();
    ...
    // storeFlushCtxs, committedFiles and storeFlushableSize are built here; the important ones are storeFlushCtxs and committedFiles.
    // Both are TreeMaps keyed by column family: storeFlushCtxs holds the executor that actually flushes the store (StoreFlusherImpl),
    // and committedFiles holds the HFiles that end up being written.
    // The StoreFlushContext implementation, StoreFlusherImpl, contains the core flush operations: prepare, flushCache, commit, abort, etc.
    // So one flush context is kept per store, and the flush is later driven through these StoreFlushContext instances.
    TreeMap<byte[], StoreFlushContext> storeFlushCtxs = new TreeMap<>(Bytes.BYTES_COMPARATOR); // per-store flush contexts, keyed by column family
    ...
    try {
      ...
      // Iterate over the stores to be flushed and create a StoreFlusherImpl for each one.
      // Taking the MemStore snapshot means calling each StoreFlusherImpl's prepare method,
      // and the flush and commit steps in internalFlushCacheAndCommit likewise go through each store's flushCache and commit.

      for (HStore s : storesToFlush) { // iterate over the stores of this region, initializing storeFlushCtxs & committedFiles
        // create a StoreFlusherImpl for each store
        storeFlushCtxs.put(s.getColumnFamilyDescriptor().getName(),
          s.createFlushContext(flushOpSeqId, tracker));
        // for writing stores to WAL
        // the flushed HFile path is not known yet
        committedFiles.put(s.getColumnFamilyDescriptor().getName(), null);
      }
      ...
      // Prepare flush (take a snapshot)
      // StoreFlushContext here is StoreFlusherImpl
      storeFlushCtxs.forEach((name, flush) -> {
        // for each store of the region, move the memstore's active cell set into the memstore snapshot and clear the active set,
        // then expose that snapshot through the HStore
        MemStoreSize snapshotSize = flush.prepare(); // prepare() calls the store's memstore.snapshot() to take the snapshot
        ...
      });
    } catch (IOException ex) {
      doAbortFlushToWAL(wal, flushOpSeqId, committedFiles);
      throw ex;
    } finally {
      // release the lock once the snapshot is taken; client reads and writes are no longer blocked
      this.updatesLock.writeLock().unlock();
    }
    ...
  }

A few key points here:
First, this method is protected by updatesLock.writeLock(). updatesLock, like the lock mentioned earlier, is a ReentrantReadWriteLock, so why take another lock? The earlier lock guards region-wide behaviour such as split, move and merge, whereas updatesLock guards data update requests. Holding updatesLock while the snapshot is taken guarantees data consistency; it is released as soon as the snapshot exists, so user requests can proceed in parallel with flushing the snapshot to disk, which keeps throughput up.
Second, how are the MemStore snapshot, flush and commit actually implemented? internalPrepareFlushCache contains the following code:

      for (HStore s : storesToFlush) { // iterate over the stores of this region, initializing storeFlushCtxs & committedFiles
        // create a StoreFlusherImpl for each store
        storeFlushCtxs.put(s.getColumnFamilyDescriptor().getName(),
          s.createFlushContext(flushOpSeqId, tracker));
        // for writing stores to WAL
        // the flushed HFile path is not known yet
        committedFiles.put(s.getColumnFamilyDescriptor().getName(), null);
      }

The StoreFlusherImpl objects held in storeFlushCtxs carry the core flush operations: prepare, flushCache, commit, abort and so on. StoreFlusherImpl is an inner class of HStore:

  public StoreFlushContext createFlushContext(long cacheFlushId, FlushLifeCycleTracker tracker) {
    return new StoreFlusherImpl(cacheFlushId, tracker);
  }


  private final class StoreFlusherImpl implements StoreFlushContext {
    /**
     * This is not thread safe. The caller should have a lock on the region or the store.
     * If necessary, the lock can be added with the patch provided in HBASE-10087
     */
    @Override
    public MemStoreSize prepare() {
      // passing the current sequence number of the wal - to allow bookkeeping in the memstore
      // As noted above, when the region calls StoreFlusherImpl.prepare it is holding the updatesLock write lock,
      // so any expensive work here directly delays in-flight reads and writes.
      // The snapshot logic merely hands the memstore's active skip list over to the snapshot; when the
      // MemStoreSnapshot is returned, snapshot.size() is evaluated.
      this.snapshot = memstore.snapshot();
      // MemStoreSnapshot.getCellsCount() is the snapshot.size() value captured when the snapshot was taken; that size() call is O(n)
      ...
    }

    @Override
    public void flushCache(MonitoredTask status) throws IOException {
	  ...
      tempFiles =
          HStore.this.flushCache(cacheFlushSeqNum, snapshot, status, throughputController, tracker);
    }

    @Override
    public boolean commit(MonitoredTask status) throws IOException {
      ...
      List<HStoreFile> storeFiles = new ArrayList<>(this.tempFiles.size());
      for (Path storeFilePath : tempFiles) {
        try {
          HStoreFile sf = HStore.this.commitFile(storeFilePath, cacheFlushSeqNum, status);
          outputFileSize += sf.getReader().length();
          storeFiles.add(sf);
        } catch (IOException ex) {
          ...
      }
	  ...
    }
	
  }

Phases 2 & 3: flush the data to disk and move it into place

HRegion#internalFlushCacheAndCommit

  protected FlushResultImpl internalFlushCacheAndCommit(WAL wal, MonitoredTask status,
      PrepareFlushResult prepareResult, Collection<HStore> storesToFlush) throws IOException {
    ...
    try {
      // A.  Flush memstore to all the HStores.
      // Keep running vector of all store files that includes both old and the
      // just-made new flush store file. The new flushed file is still in the
      // tmp directory.
      // Iterate over each store of the region and call HStore.StoreFlusherImpl.flushCache,
      // flushing the store's snapshot into an HFile; the file is written under tmp first and moved to its final path by commit

      for (StoreFlushContext flush : storeFlushCtxs.values()) {
        flush.flushCache(status);
      }

      // Switch snapshot (in memstore) -> new hfile (thus causing
      // all the store scanners to reset/reseek).
      Iterator<HStore> it = storesToFlush.iterator();
      // stores.values() and storeFlushCtxs have same order
      for (StoreFlushContext flush : storeFlushCtxs.values()) {
        // move the file from the tmp path into the column family directory
        boolean needsCompaction = flush.commit(status);
        if (needsCompaction) {
          compactionRequested = true;
        }
        ...
      }
      storeFlushCtxs.clear();
    ...
  }

Parameter tuning

1. The number of column families per table: a flush covers the whole region, not a single store (HBase 2.0+ offers flush policies that avoid flushing every column family).
2. hbase.hstore.flusher.count: the number of threads flushing memstores to disk; raising it speeds up flushing and reduces blocking, with a noticeable effect (see the sketch after this list).
3. hbase.hregion.memstore.flush.size: the per-region memstore flush threshold, 128 MB by default; once it is exceeded the whole region is flushed. Raising it lowers the flush frequency.
4. hbase.regionserver.global.memstore.size: defaults to 40% of the heap.
5. hbase.regionserver.global.memstore.size.lower.limit: the low-water mark at which flushing starts before updates get blocked; as a fraction of the heap it works out to 40% × 95%.
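The sketch below shows one way these knobs could be set programmatically for a test; the concrete values are illustrative assumptions, not recommendations, and in production they would normally live in hbase-site.xml instead:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  // Illustrative tuning sketch; the values are assumptions for demonstration only.
  public class FlushTuningExample {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.hstore.flusher.count", 4);                                // more flush threads (default 2)
      conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);       // 256 MB per-region flush threshold
      conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);              // 40% of the heap
      conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.95f); // 95% of the 40%
      System.out.println("flusher count = " + conf.getInt("hbase.hstore.flusher.count", 2));
    }
  }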

References:
https://blog.csdn.net/bryce123phy/article/details/54291728
https://blog.csdn.net/youmengjiuzhuiba/article/details/45531151
