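Preface
In the previous post (https://blog.csdn.net/qq_35542970/article/details/109390109) we analyzed the conditions that trigger a memstore flush, and saw that HBase places flush requests on the flushQueue defined in MemStoreFlusher. So how are the requests in that queue processed?
1. Processing the flush queue
1.1 Processing flow of the flush request queue:
Taking a flush triggered by an operation such as put as an example, the flow is as follows:
(Image from https://blog.csdn.net/youmengjiuzhuiba/article/details/45531151)
1.2 FlushHandler
The flush worker threads in MemStoreFlusher are defined as FlushHandler instances. They are initialized as follows: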
int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
this.flushHandlers = new FlushHandler[handlerCount];
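handlerCount is the number of flush threads in a RegionServer. The default is 2, which is on the low side; in practice it is worth raising this value (in my own parameter tuning it made a clear difference).
When HRegionServer starts, it starts these worker threads as well. The start code is: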
MemStoreFlusher#start
synchronized void start(UncaughtExceptionHandler eh) {
ThreadFactory flusherThreadFactory = Threads.newDaemonThreadFactory(
server.getServerName().toShortString() + "-MemStoreFlusher", eh);
for (int i = 0; i < flushHandlers.length; i++) {
flushHandlers[i] = new FlushHandler("MemStoreFlusher." + i);
flusherThreadFactory.newThread(flushHandlers[i]);
flushHandlers[i].start();
}
}
The logic of FlushHandler:
private class FlushHandler extends HasThread {
private FlushHandler(String name) {
super(name);
}
@Override
public void run() {
//
while (!server.isStopped()) {
FlushQueueEntry fqe = null;
try {
wakeupPending.set(false); // allow someone to wake us up again
fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
if (fqe == null || fqe == WAKEUPFLUSH_INSTANCE) {
// no flush request arrived, or the entry is the empty wake-up token
FlushType type = isAboveLowWaterMark();
if (type != FlushType.NORMAL) {
...
if (!flushOneForGlobalPressure()) {
Thread.sleep(1000);
wakeUpIfBlocking();
}
// Enqueue another one of these tokens so we'll wake up again
wakeupFlushThread();
}
continue;
}
FlushRegionEntry fre = (FlushRegionEntry) fqe;
if (!flushRegion(fre)) {
break;
}
} catch (InterruptedException ex) {
continue;
} catch (ConcurrentModificationException ex) {
continue;
} catch (Exception ex) {
LOG.error("Cache flusher failed for entry " + fqe, ex);
if (!server.checkFileSystem()) {
break;
}
}
}
synchronized (regionsInQueue) {
regionsInQueue.clear();
flushQueue.clear();
}
// Signal anyone waiting, so they see the close flag
wakeUpIfBlocking();
LOG.info(getName() + " exiting");
}
}
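As the run method shows, the handler loops for as long as the RegionServer has not stopped, repeatedly polling the request queue for an entry fqe. If no flush request arrives within the poll timeout, or the entry is the empty wake-up token (WAKEUPFLUSH_INSTANCE), it checks the global MemStore size on the RegionServer to decide whether a flush is needed.
This is what triggers the fourth flush mechanism described in the previous post.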
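The polling pattern described above can be sketched with plain JDK classes. This is only an illustration, not HBase code: FlushLoopSketch, Entry, WAKEUP and the two counters are hypothetical names, a LinkedBlockingQueue stands in for flushQueue, and "flushing" is reduced to incrementing a counter.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the handler loop; not the HBase implementation.
class FlushLoopSketch {
    static final class Entry {
        final String region;
        Entry(String region) { this.region = region; }
    }

    // Sentinel playing the role of WAKEUPFLUSH_INSTANCE (compared by identity).
    static final Entry WAKEUP = new Entry("<wakeup>");

    final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
    final AtomicInteger flushed = new AtomicInteger();
    final AtomicInteger pressureChecks = new AtomicInteger();
    volatile boolean stopped;

    void runLoop(long wakeFrequencyMs) {
        while (!stopped) {
            try {
                Entry e = queue.poll(wakeFrequencyMs, TimeUnit.MILLISECONDS);
                if (e == null || e == WAKEUP) {
                    // Nothing to flush: this is where the global-pressure check would go.
                    pressureChecks.incrementAndGet();
                    continue;
                }
                flushed.incrementAndGet(); // stands in for flushRegion(entry)
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Starts handlerCount daemon workers, feeds them three requests plus one
    // wake-up token, and returns {flushed, pressureChecks}.
    static int[] demo(int handlerCount) {
        FlushLoopSketch s = new FlushLoopSketch();
        Thread[] handlers = new Thread[handlerCount];
        for (int i = 0; i < handlerCount; i++) {
            handlers[i] = new Thread(() -> s.runLoop(20), "MemStoreFlusher." + i);
            handlers[i].setDaemon(true);
            handlers[i].start();
        }
        s.queue.add(new Entry("r1"));
        s.queue.add(WAKEUP);
        s.queue.add(new Entry("r2"));
        s.queue.add(new Entry("r3"));
        long deadline = System.currentTimeMillis() + 5000;
        try {
            while ((s.flushed.get() < 3 || s.pressureChecks.get() < 1)
                    && System.currentTimeMillis() < deadline) {
                Thread.sleep(5);
            }
            s.stopped = true;
            for (Thread t : handlers) {
                t.join(1000);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return new int[] { s.flushed.get(), s.pressureChecks.get() };
    }
}
```

Calling FlushLoopSketch.demo(2) runs two handlers against the same queue. Each request is taken by exactly one handler, while an empty poll or the wake-up token sends a handler down the pressure-check branch.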
// Decide from the global MemStore size whether a flush is needed
public FlushType isAboveLowWaterMark() {
// for onheap memstore we check if the global memstore size and the
// global heap overhead is greater than the global memstore lower mark limit
if (memType == MemoryType.HEAP) {
if (getGlobalMemStoreHeapSize() >= globalMemStoreLimitLowMark) {
return FlushType.ABOVE_ONHEAP_LOWER_MARK;
}
} else {
if (getGlobalMemStoreOffHeapSize() >= globalMemStoreLimitLowMark) {
// Indicates that the offheap memstore's size is greater than the global memstore
// lower limit
return FlushType.ABOVE_OFFHEAP_LOWER_MARK;
} else if (getGlobalMemStoreHeapSize() >= globalOnHeapMemstoreLimitLowMark) {
// Indicates that the offheap memstore's heap overhead is greater than the global memstore
// onheap lower limit
return FlushType.ABOVE_ONHEAP_LOWER_MARK;
}
}
return FlushType.NORMAL;
}
this.globalMemStoreLimitLowMark =
(long) (this.globalMemStoreLimit * this.globalMemStoreLimitLowMarkPercent);
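Two thresholds are involved here: globalMemStoreLimit and globalMemStoreLimitLowMark. By default the former is 40% of the RegionServer's heap and the latter is 95% of the former. Why this arrangement? In short, once the total MemStore size reaches 40% of the RegionServer's heap, updates on that RegionServer are blocked and MemStore contents are forcibly flushed to disk, which hurts performance badly. Flushing therefore starts early, at 95% of that limit; the difference is that a flush at this stage does not block reads and writes.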
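To make the arithmetic concrete, here is a small sketch that derives both thresholds from a heap size and classifies a given global MemStore size against them. MemStoreLimitsSketch and the Pressure enum are hypothetical names introduced for illustration; only the two percentages come from the text above.

```java
// Hypothetical helper; the names are illustrative and not part of HBase.
class MemStoreLimitsSketch {
    enum Pressure { NORMAL, ABOVE_LOWER_MARK, ABOVE_HIGHER_MARK }

    // Upper limit: a fraction of the heap (0.4 by default, per the text).
    static long globalLimit(long maxHeapBytes, double globalPercent) {
        return (long) (maxHeapBytes * globalPercent);
    }

    // Low-water mark: a fraction of the upper limit (0.95 by default, per the text).
    static long lowMark(long globalLimit, double lowerLimitPercent) {
        return (long) (globalLimit * lowerLimitPercent);
    }

    // At or above the limit updates would block; between the two marks
    // flushing starts early without blocking; below the low mark nothing happens.
    static Pressure classify(long globalMemStoreBytes, long limit, long lowMark) {
        if (globalMemStoreBytes >= limit) {
            return Pressure.ABOVE_HIGHER_MARK;
        }
        if (globalMemStoreBytes >= lowMark) {
            return Pressure.ABOVE_LOWER_MARK;
        }
        return Pressure.NORMAL;
    }
}
```

For a 10 GiB heap this gives an upper limit of about 4 GiB and a low-water mark of about 3.8 GiB, so early flushing has roughly 200 MiB of headroom before writes would be blocked.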
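Back in the run method: when a flush is needed because of global memory pressure, flushOneForGlobalPressure performs it. To make the flush efficient and keep blocking time short, flushOneForGlobalPressure puts a good deal of care into choosing which region to flush. Broadly, the chosen region should satisfy two conditions:
(1) The region must not have too many StoreFiles, i.e. prefer a region that flushes quickly, to reduce blocking time;
(2) Among the regions that satisfy condition (1), it has the largest MemStore, so that this forced flush frees as much memory as possible.
The corresponding code is as follows: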
/**
* The memstore across all regions has exceeded the low water mark. Pick
* one region to flush and flush it synchronously (this is called from the
* flush thread)
* @return true if successful
*/
private boolean flushOneForGlobalPressure() {
SortedMap<Long, HRegion> regionsBySize = null;
// Sort regions by on-heap or off-heap memstore size
switch(flushType) {
case ABOVE_OFFHEAP_HIGHER_MARK:
case ABOVE_OFFHEAP_LOWER_MARK:
regionsBySize = server.getCopyOfOnlineRegionsSortedByOffHeapSize();
break;
case ABOVE_ONHEAP_HIGHER_MARK:
case ABOVE_ONHEAP_LOWER_MARK:
default:
regionsBySize = server.getCopyOfOnlineRegionsSortedByOnHeapSize();
}
...
boolean flushedOne = false;
while (!flushedOne) {
// Find the biggest region that doesn't have too many storefiles (might be null!)
HRegion bestFlushableRegion =
getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
// Find the biggest region, total, even if it might have too many flushes.
HRegion bestAnyRegion = getBiggestMemStoreRegion(regionsBySize, excludedRegions, false);
// Find the biggest region that is a secondary region
HRegion bestRegionReplica = getBiggestMemStoreOfRegionReplica(regionsBySize, excludedRegions);
if (bestAnyRegion == null) {
// If bestAnyRegion is null, assign replica. It may be null too. Next step is check for null
bestAnyRegion = bestRegionReplica;
}
if (bestAnyRegion == null) {
LOG.error("Above memory mark but there are no flushable regions!");
return false;
}
...
HRegion regionToFlush;
...
// call flushRegion
flushedOne = flushRegion(regionToFlush, true, false, FlushLifeCycleTracker.DUMMY);
...
}
return true;
}
// pick a suitable region
private HRegion getBiggestMemStoreRegion(
SortedMap<Long, HRegion> regionsBySize,
Set<HRegion> excludedRegions,
boolean checkStoreFileCount) {
synchronized (regionsInQueue) {
for (HRegion region : regionsBySize.values()) {
if (excludedRegions.contains(region)) {
continue;
}
if (region.writestate.flushing || !region.writestate.writesEnabled) {
continue;
}
if (checkStoreFileCount && isTooManyStoreFiles(region)) {
continue;
}
return region;
}
}
return null;
}
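The two selection conditions can be tried out in isolation with the simplified sketch below. It is not the HBase implementation: Region here is a hypothetical stand-in with just a store file count and a flushing flag, the threshold of 16 is an assumption, and the parts elided with "..." in the excerpt above may apply further rules that this sketch does not model. It prefers the biggest region without too many store files and falls back to the biggest region overall.

```java
import java.util.Set;
import java.util.SortedMap;

// Hypothetical, simplified model of the region choice; not HBase code.
class RegionPickSketch {
    // Assumed blocking threshold for the number of store files.
    static final int BLOCKING_STORE_FILES = 16;

    static final class Region {
        final String name;
        final int storeFileCount;
        final boolean flushing;
        Region(String name, int storeFileCount, boolean flushing) {
            this.name = name;
            this.storeFileCount = storeFileCount;
            this.flushing = flushing;
        }
    }

    // bySizeDesc must iterate from the largest memstore to the smallest.
    static Region biggest(SortedMap<Long, Region> bySizeDesc, Set<Region> excluded,
                          boolean checkStoreFileCount) {
        for (Region r : bySizeDesc.values()) {
            if (excluded.contains(r) || r.flushing) {
                continue;
            }
            if (checkStoreFileCount && r.storeFileCount > BLOCKING_STORE_FILES) {
                continue;
            }
            return r;
        }
        return null;
    }

    // Condition (1) first, then fall back to any region if none qualifies.
    static Region pick(SortedMap<Long, Region> bySizeDesc, Set<Region> excluded) {
        Region flushable = biggest(bySizeDesc, excluded, true);
        return flushable != null ? flushable : biggest(bySizeDesc, excluded, false);
    }
}
```

The map is expected to be ordered from largest to smallest, for example a TreeMap built with Comparator.reverseOrder(), so that the first region that passes the filters is also the biggest one.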
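2. Flush implementation
flushRegion first checks the number of store files in the region. If there are too many, it requests a split of the region or, if no split is requested, a compaction, and then puts the entry back on flushQueue to wait for a later attempt. When the entry is re-queued, its wait is shortened accordingly (to blockingWaitTime / 100 ms).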
MemStoreFlusher#flushRegion
private boolean flushRegion(final FlushRegionEntry fqe) {
HRegion region = fqe.region;
if (!region.getRegionInfo().isMetaRegion() && isTooManyStoreFiles(region)) {
if (fqe.isMaximumWait(this.blockingWaitTime)) {
...
} else {
// If this is first time we've been put off, then emit a log message.
if (fqe.getRequeueCount() <= 0) {
// Note: We don't impose blockingStoreFiles constraint on meta regions
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
this.blockingWaitTime);
if (!this.server.compactSplitThread.requestSplit(region)) {
try {
this.server.compactSplitThread.requestSystemCompaction(region,
Thread.currentThread().getName());
} catch (IOException e) {
...
}
}
}
// Put back on the queue. Have it come back out of the queue
// after a delay of this.blockingWaitTime / 100 ms.
this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
// Tell a lie, it's not flushed but it's ok
return true;
}
}
// the store file count is within the blocking limit (16 by default)
return flushRegion(region, false, fqe.isForceFlushAllStores(), fqe.getTracker());
}
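The "put back on the queue with a delay" step can be illustrated with java.util.concurrent.DelayQueue. This is a minimal sketch under the assumption, suggested by the requeue call and its comment above, that queue entries carry their own expiry time; RequeueSketch and its Entry class are hypothetical and much simpler than the real queue entry.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of delayed re-queueing; not the HBase classes.
class RequeueSketch {
    static final class Entry implements Delayed {
        final String region;
        private long whenToExpireMs = System.currentTimeMillis();
        private int requeueCount;

        Entry(String region) { this.region = region; }

        // Push the expiry into the future and remember that we were put off.
        Entry requeue(long delayMs) {
            whenToExpireMs = System.currentTimeMillis() + delayMs;
            requeueCount++;
            return this;
        }

        int getRequeueCount() { return requeueCount; }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(whenToExpireMs - System.currentTimeMillis(),
                    TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                    other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    // Returns true if a re-queued entry is invisible right after the add
    // and becomes available once its delay has elapsed.
    static boolean demo(long blockingWaitTimeMs) {
        DelayQueue<Entry> queue = new DelayQueue<>();
        Entry entry = new Entry("r1");
        queue.add(entry.requeue(blockingWaitTimeMs / 100));
        boolean hiddenAtFirst = queue.poll() == null;
        try {
            Entry later = queue.poll(5, TimeUnit.SECONDS);
            return hiddenAtFirst && later == entry && entry.getRequeueCount() == 1;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a hypothetical blockingWaitTime of 30000 ms, RequeueSketch.demo(30000) re-queues the entry for about 300 ms. A handler polling the queue in the meantime simply does not see it, which is how the flush is postponed without blocking the handler thread.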
A flush request whose region has an acceptable number of store files proceeds to the region-level flush:
private boolean flushRegion(HRegion region, boolean emergencyFlush, boolean forceFlushAllStores,
FlushLifeCycleTracker tracker) {
synchronized (this.regionsInQueue) {
FlushRegionEntry fqe = this.regionsInQueue.remove(region);
flushQueue.remove(fqe); // remove the flush request from the request queue
}
// ReentrantReadWriteLock
lock.readLock().lock(); // take the shared (read) lock
try {
notifyFlushRequest(region, emergencyFlush);
FlushResult flushResult = region.flushcache(forceFlushAllStores, false, tracker);
boolean shouldCompact = flushResult.isCompactionNeeded();
// We just want to check the size
boolean shouldSplit = region.checkSplit() != null;
if (shouldSplit) {
this.server.compactSplitThread.requestSplit(region); // handle a possible split after the flush
} else if (shouldCompact) {
server.compactSplitThread.requestSystemCompaction(region, Thread.currentThread().getName()); // handle a possible compaction after the flush
}
} catch (DroppedSnapshotException ex) {
...
server.abort("Replay of WAL required. Forcing server shutdown", ex);
return false;
} catch (IOException ex) {
...
if (!server.checkFileSystem()) {
return false;
}
} finally {
lock.readLock().unlock();
wakeUpIfBlocking(); // wake up all waiting threads
tracker.afterExecution();
}
return true;
}
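1. During the flush the region is protected by the readLock, so any request that needs the writeLock, such as moving the region or compaction, is blocked. In addition, a flush may leave the region with many store files, which triggers a compaction; likewise, a large store file produced by the flush can trigger a split.
2. The call region.flushcache(forceFlushAllStores, ...) shows that flush is a region-level operation: once a flush is triggered, the MemStores of the region take part in it (in 2.0 and later the flush policy may select only some column families). flushcache takes a readLock on the region once more; since ReentrantReadWriteLock is reentrant, this does no harm. The method also checks the region's state: if the region is closing or closed, the flush or compaction request is not executed, because operations like flush are usually time-consuming and would lengthen the time the region takes to go offline.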
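The two lock properties relied on in the points above (a read lock can be taken again by the thread that already holds it, and a writer is kept out while any read lock is held) can be checked with a few lines of plain JDK code. ReadLockSketch is a hypothetical demo class, unrelated to the HBase sources.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo of ReentrantReadWriteLock behaviour; not HBase code.
class ReadLockSketch {
    // Returns {read hold count after a nested acquire,
    //          1 if a writer on another thread got the lock, else 0}.
    static int[] demo() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        boolean[] writerGotLock = new boolean[1];
        int holds = -1;
        lock.readLock().lock();              // outer acquisition
        try {
            lock.readLock().lock();          // nested acquisition by the same thread
            try {
                holds = lock.getReadHoldCount();
                Thread writer = new Thread(() -> {
                    // tryLock returns immediately instead of waiting for the readers.
                    if (lock.writeLock().tryLock()) {
                        writerGotLock[0] = true;
                        lock.writeLock().unlock();
                    }
                });
                writer.start();
                writer.join();
            } finally {
                lock.readLock().unlock();
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        } finally {
            lock.readLock().unlock();
        }
        return new int[] { holds, writerGotLock[0] ? 1 : 0 };
    }
}
```

The nested acquisition succeeds and raises the hold count to 2, while the writer thread cannot obtain the write lock until both read holds are released.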
Once all checks pass, the real flush begins. Following the call chain down, the final implementation is in internalFlushcache.
The code is as follows:
HRegion#internalFlushcache
/**
* Flush the memstore. Flushing the memstore is a little tricky. We have a lot of updates in the
* memstore, all of which have also been written to the wal. We need to write those updates in the
* memstore out to disk, while being able to process reads/writes as much as possible during the
* flush operation.
*/
protected FlushResultImpl internalFlushcache(WAL wal, long myseqid,
Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
FlushLifeCycleTracker tracker) throws IOException {
// internalPrepareFlushCache takes the snapshot
PrepareFlushResult result =
internalPrepareFlushCache(wal, myseqid, storesToFlush, status, writeFlushWalMarker, tracker);
// When prepare succeeds, result.result is null, so internalFlushCacheAndCommit runs phases 2 and 3.
if (result.result == null) {
return internalFlushCacheAndCommit(wal, status, result, storesToFlush);
} else {
return result.result; // early exit due to failure from prepare stage
}
}
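internalPrepareFlushCache does the preparation for the flush: it obtains an MVCC transaction ID, sets up the caches and intermediate data structures the flush needs, and takes a snapshot of the current MemStore.
internalFlushCacheAndCommit performs the flush itself: it first writes the data to temporary files under the tmp directory, then commits the update, and finally moves the files into the correct directory on HDFS.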
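The overall shape of snapshot, write to tmp, then move into place can be sketched on a local file system. This is an analogy only, with heavy simplifications: ThreePhaseFlushSketch is a hypothetical class, a ConcurrentSkipListMap plays the MemStore, a text file plays the HFile, a local temporary directory plays HDFS, and there is no WAL, MVCC or error recovery.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical local-file analogy of the three flush phases; not HBase code.
class ThreePhaseFlushSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private volatile ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private final Path tmpDir;
    private final Path storeDir;

    ThreePhaseFlushSketch(Path root) throws IOException {
        tmpDir = Files.createDirectories(root.resolve(".tmp"));
        storeDir = Files.createDirectories(root.resolve("cf"));
    }

    // Writers share the read lock, so they only wait during the snapshot swap.
    void put(String key, String value) {
        updatesLock.readLock().lock();
        try {
            active.put(key, value);
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    int activeSize() { return active.size(); }

    Path flush(long seqId) throws IOException {
        // Phase 1: swap in an empty map under the write lock (the snapshot).
        ConcurrentSkipListMap<String, String> snapshot;
        updatesLock.writeLock().lock();
        try {
            snapshot = active;
            active = new ConcurrentSkipListMap<>();
        } finally {
            updatesLock.writeLock().unlock();
        }
        // Phase 2: write the snapshot to a temporary file; puts are not blocked here.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            lines.add(e.getKey() + "=" + e.getValue());
        }
        Path tmp = Files.write(tmpDir.resolve("flush-" + seqId), lines, StandardCharsets.UTF_8);
        // Phase 3: commit by moving the file into the store directory.
        return Files.move(tmp, storeDir.resolve("flush-" + seqId), StandardCopyOption.ATOMIC_MOVE);
    }

    // Flushes two rows and returns the committed lines plus the active map size.
    static List<String> demo() {
        try {
            ThreePhaseFlushSketch store =
                    new ThreePhaseFlushSketch(Files.createTempDirectory("flush-sketch"));
            store.put("row2", "b");
            store.put("row1", "a");
            Path committed = store.flush(1L);
            List<String> out = new ArrayList<>(Files.readAllLines(committed, StandardCharsets.UTF_8));
            out.add("active=" + store.activeSize());
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the design is that the write lock is held only for the reference swap in phase 1. The slow work in phases 2 and 3 runs against the snapshot while new puts go into the fresh map.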
The three phases of a flush:
Phase 1: taking the snapshot
HRegion#internalPrepareFlushCache
protected PrepareFlushResult internalPrepareFlushCache(WAL wal, long myseqid,
Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
FlushLifeCycleTracker tracker) throws IOException {
...
// block waiting for the lock for internal flush
// acquire the write lock on updates
this.updatesLock.writeLock().lock();
...
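// storeFlushCtxs, committedFiles, storeFlushableSize: the important ones are storeFlushCtxs and committedFiles. Both are TreeMaps keyed by column family (CF),
// holding, per store, the object that performs the flush (StoreFlusherImpl) and the HFiles that are finally written.
// StoreFlusherImpl, the implementation of StoreFlushContext, contains the core flush operations: prepare, flushCache, commit, abort, etc.
// So what is kept here is one flush context per store; the flush is later driven through these StoreFlushContext objects.
TreeMap<byte[], StoreFlushContext> storeFlushCtxs = new TreeMap<>(Bytes.BYTES_COMPARATOR); // maps each store (by column family) to its flush context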
...
try {
...
// Iterate over the stores to flush in this region and create a StoreFlusherImpl for each store.
// The MemStore snapshot is taken by calling prepare on each StoreFlusherImpl,
// and the flush and commit steps in internalFlushCacheAndCommit likewise call flushCache and commit on each store's context.
for (HStore s : storesToFlush) { // iterate over the region's stores to flush, initializing storeFlushCtxs and committedFiles
// create a StoreFlusherImpl for each store
storeFlushCtxs.put(s.getColumnFamilyDescriptor().getName(),
s.createFlushContext(flushOpSeqId, tracker));
// for writing stores to WAL
// the HFile path for the flush has not been generated yet
committedFiles.put(s.getColumnFamilyDescriptor().getName(), null);
}
...
// Prepare flush (take a snapshot)
// the StoreFlushContext here is a StoreFlusherImpl
storeFlushCtxs.forEach((name, flush) -> {
// For each store in the region, copy the memstore's kvset into the memstore's snapshot and clear the kvset,
// then copy the memstore's snapshot into the HStore's snapshot
MemStoreSize snapshotSize = flush.prepare(); // prepare calls the store's memstore.snapshot() to take the snapshot
...
});
} catch (IOException ex) {
doAbortFlushToWAL(wal, flushOpSeqId, committedFiles);
throw ex;
} finally {
// the lock is released once the snapshot is taken; from here on client reads and writes are no longer blocked
this.updatesLock.writeLock().unlock();
}
...
}
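A few key points:
First, this method is protected by updatesLock.writeLock(). Like the lock mentioned above, updatesLock is a ReentrantReadWriteLock, so why lock again? The earlier lock guards region-level operations such as split, move and merge, whereas updatesLock guards data update requests. Holding updatesLock while the snapshot is taken keeps the data consistent; releasing it as soon as the snapshot exists lets client requests proceed while the snapshot is flushed to disk, which improves concurrent throughput.
Second, how are the MemStore snapshot, flush and commit operations actually implemented? internalPrepareFlushCache contains the following code:
for (HStore s : storesToFlush) { // iterate over the region's stores to flush, initializing storeFlushCtxs and committedFiles
// create a StoreFlusherImpl for each store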
storeFlushCtxs.put(s.getColumnFamilyDescriptor().getName(),
s.createFlushContext(flushOpSeqId, tracker));
// for writing stores to WAL
// the HFile path for the flush has not been generated yet
committedFiles.put(s.getColumnFamilyDescriptor().getName(), null);
}
The StoreFlusherImpl objects in storeFlushCtxs carry out the core flush operations (prepare, flushCache, commit, abort, etc.). StoreFlusherImpl is an inner class of HStore:
public StoreFlushContext createFlushContext(long cacheFlushId, FlushLifeCycleTracker tracker) {
return new StoreFlusherImpl(cacheFlushId, tracker);
}
private final class StoreFlusherImpl implements StoreFlushContext {
/**
* This is not thread safe. The caller should have a lock on the region or the store.
* If necessary, the lock can be added with the patch provided in HBASE-10087
*/
@Override
public MemStoreSize prepare() {
// passing the current sequence number of the wal - to allow bookkeeping in the memstore
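// The region calls StoreFlusherImpl.prepare while holding updatesLock.writeLock(), as noted above,
// so any slow operation here delays in-flight client reads and writes.
// snapshot() itself only hands the memstore's skip list over to the snapshot's skip list; when the MemStoreSnapshot is returned,
// the snapshot's size() method is called
this.snapshot = memstore.snapshot();
// MemStoreSnapshot.getCellsCount returns the snapshot.size() value passed in when memstore.snapshot() builds the MemStoreSnapshot; computing it is O(n)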
...
}
@Override
public void flushCache(MonitoredTask status) throws IOException {
...
tempFiles =
HStore.this.flushCache(cacheFlushSeqNum, snapshot, status, throughputController, tracker);
}
@Override
public boolean commit(MonitoredTask status) throws IOException {
...
List<HStoreFile> storeFiles = new ArrayList<>(this.tempFiles.size());
for (Path storeFilePath : tempFiles) {
try {
HStoreFile sf = HStore.this.commitFile(storeFilePath, cacheFlushSeqNum, status);
outputFileSize += sf.getReader().length();
storeFiles.add(sf);
} catch (IOException ex) {
...
}
...
}
}
Phases 2 & 3: writing the data to disk and moving the files
HRegion#internalFlushCacheAndCommit
protected FlushResultImpl internalFlushCacheAndCommit(WAL wal, MonitoredTask status,
PrepareFlushResult prepareResult, Collection<HStore> storesToFlush) throws IOException {
...
try {
// A. Flush memstore to all the HStores.
// Keep running vector of all store files that includes both old and the
// just-made new flush store file. The new flushed file is still in the
// tmp directory.
// For each store in the region, call flushCache on its StoreFlusherImpl
// to write the store's snapshot into an HFile. The file is first written under tmp; commit later moves it to the correct path.
for (StoreFlushContext flush : storeFlushCtxs.values()) {
flush.flushCache(status);
}
// Switch snapshot (in memstore) -> new hfile (thus causing
// all the store scanners to reset/reseek).
Iterator<HStore> it = storesToFlush.iterator();
// stores.values() and storeFlushCtxs have same order
for (StoreFlushContext flush : storeFlushCtxs.values()) {
// move from the temporary path into the column family's directory
boolean needsCompaction = flush.commit(status);
if (needsCompaction) {
compactionRequested = true;
}
...
}
storeFlushCtxs.clear();
...
}
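Related parameter tuning
1. Number of column families per table: a flush acts on the whole region rather than on a single store (in 2.0 and later a different flush policy can be selected to avoid flushing all column families).
2. hbase.hstore.flusher.count: the number of threads that flush memstores to disk. Raising it speeds up flushing and reduces blocking; the effect is noticeable.
3. hbase.hregion.memstore.flush.size: the memstore flush threshold for a single region, 128 MB by default. Once it is exceeded the whole region is flushed; raising it lowers the flush frequency.
4. hbase.regionserver.global.memstore.size: 40% of the heap by default.
5. hbase.regionserver.global.memstore.size.lower.limit: the maximum total size of all memstores in a RegionServer before flushes are forced (40% * 95%).
References: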
https://blog.csdn.net/bryce123phy/article/details/54291728
https://blog.csdn.net/youmengjiuzhuiba/article/details/45531151