Analysis of the memstore flush process


隔壁老杨hongs


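A memstore flush can be triggered from the following places:

a. When HRegionServer handles a multi update call, it checks whether the total memstore size exceeds the configured global upper or lower limit. If it does, a WakeupFlushThread flush request is issued; if the global upper limit is exceeded, the update also has to wait until the flush completes.

b. When HRegionServer updates data and HRegion.batchMutate writes to the stores, a FlushRegionEntry flush request is issued if the size of region.memstore exceeds the configured region memstore size.

c. A client explicitly calls HRegionServer.flushRegion.

d. The periodic flush thread HRegionServer.PeriodicMemstoreFlusher, whose interval is set by hbase.regionserver.optionalcacheflushinterval (default 3600000 ms).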
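
The size checks behind triggers (a) and (b) can be sketched as follows. This is a minimal illustration, not HBase code: the class name, the enum and the method names are hypothetical, and the exact comparison operators are an assumption.

```java
/** Hypothetical sketch of the size checks behind triggers (a) and (b); not HBase code. */
final class FlushTriggerSketch {
    enum GlobalAction { NONE, WAKE_FLUSHER, WAKE_FLUSHER_AND_BLOCK_UPDATES }

    /** Trigger (a): compare the total memstore size with the global lower and upper limits. */
    static GlobalAction checkGlobal(long totalMemstoreBytes, long lowerLimitBytes, long upperLimitBytes) {
        if (totalMemstoreBytes >= upperLimitBytes) {
            // Above the upper limit: request a flush and make the update wait for it.
            return GlobalAction.WAKE_FLUSHER_AND_BLOCK_UPDATES;
        }
        if (totalMemstoreBytes >= lowerLimitBytes) {
            // Between the two limits: request a flush but let the update continue.
            return GlobalAction.WAKE_FLUSHER;
        }
        return GlobalAction.NONE;
    }

    /** Trigger (b): a region asks for its own flush once its memstore passes the flush size. */
    static boolean regionNeedsFlush(long regionMemstoreBytes, long flushSizeBytes) {
        return regionMemstoreBytes > flushSizeBytes;
    }

    public static void main(String[] args) {
        // Arbitrary units; the limits are illustrative values only.
        System.out.println(checkGlobal(360, 350, 400));
        System.out.println(regionNeedsFlush(129L << 20, 128L << 20));
    }
}
```

Triggers (c) and (d) do not depend on size: one is an explicit request, the other fires on a timer.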

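How the flush is executed

The flush itself is carried out by MemStoreFlusher. When a flush request is issued, the request is added to the flushQueue queue and also recorded in regionsInQueue.

When a MemStoreFlusher instance is created it starts the MemStoreFlusher.FlushHandler threads. The number of threads is set by hbase.hstore.flusher.count and defaults to 1.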

private class FlushHandler extends HasThread {

@Override

public void run() {

while (!server.isStopped()) {

FlushQueueEntry fqe = null;

try {

wakeupPending.set(false); // allow someone to wake us up again

// Take one flush request from the queue. The queue is a blocking queue: if flushQueue is empty,

// wait up to hbase.server.thread.wakefrequency ms (default 10*1000).

fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);

if (fqe == null || fqe instanceof WakeupFlushThread) {

// There is no flush request, or the request is a global flush request:

// check whether the total of all memstores exceeds hbase.regionserver.global.memstore.lowerLimit (default 0.35).

if (isAboveLowWaterMark()) {

LOG.debug("Flush thread woke up because memory above low water=" +

StringUtils.humanReadableInt(globalMemStoreLimitLowMark));

// Above the configured lower limit: flush the region with the largest memstore.

// For the flow of this method, see the analysis of MemStoreFlusher.flushOneForGlobalPressure.

if (!flushOneForGlobalPressure()) {

// .................... (some code is not shown here)

Thread.sleep(1000);

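// No region could be flushed: wake up the waiting update threads.

// When the update methods of HRegionServer find that the total memstore size exceeds the configured upper limit, they make the update thread wait for a flush.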

wakeUpIfBlocking();

}

// Enqueue another one of these tokens so we'll wake up again

// Issue another global flush wakeup by enqueuing a new WakeupFlushThread request.

wakeupFlushThread();

}

continue;

}

// A normal flush request:

// the memstore of a single region exceeded hbase.hregion.memstore.flush.size (default 1024*1024*128L).

// For the flow of this method, see MemStoreFlusher.flushRegion.

FlushRegionEntry fre = (FlushRegionEntry) fqe;

if (!flushRegion(fre)) {

break;

}

} catch (InterruptedException ex) {

continue;

} catch (ConcurrentModificationException ex) {

continue;

} catch (Exception ex) {

LOG.error("Cache flusher failed for entry " + fqe, ex);

if (!server.checkFileSystem()) {

break;

}

}

}

// The MemStoreFlusher thread is ending, usually because the region server is stopping.

synchronized (regionsInQueue) {

regionsInQueue.clear();

flushQueue.clear();

}

// Signal anyone waiting, so they see the close flag

wakeUpIfBlocking();

LOG.info(getName() + " exiting");

}

}
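
The dispatch at the top of the loop can be reduced to a small sketch: poll with a timeout, treat both a timeout and a wakeup token as a reason to check global pressure, and treat anything else as a per-region request. The names below (FlushLoopSketch, WakeupToken, RegionEntry, runOnce) are hypothetical, and a LinkedBlockingQueue stands in for the real queue; the step returns a label instead of flushing anything.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of one iteration of the flush handler loop; not HBase code. */
final class FlushLoopSketch {
    interface Entry {}

    /** Stand-in for the global wakeup request. */
    static final class WakeupToken implements Entry {}

    /** Stand-in for a per-region flush request. */
    static final class RegionEntry implements Entry {
        final String regionName;
        RegionEntry(String regionName) { this.regionName = regionName; }
    }

    final BlockingQueue<Entry> flushQueue = new LinkedBlockingQueue<>();

    /** Polls once and reports which branch of the loop would run. */
    String runOnce(long wakeFrequencyMs, boolean aboveLowWaterMark) {
        Entry entry;
        try {
            entry = flushQueue.poll(wakeFrequencyMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return "interrupted";
        }
        if (entry == null || entry instanceof WakeupToken) {
            // Timeout or wakeup token: only flush when global memory is above the lower limit.
            return aboveLowWaterMark ? "flush-biggest-region" : "idle";
        }
        return "flush-region:" + ((RegionEntry) entry).regionName;
    }

    public static void main(String[] args) {
        FlushLoopSketch loop = new FlushLoopSketch();
        loop.flushQueue.add(new RegionEntry("region-a"));
        System.out.println(loop.runOnce(10, false));
        System.out.println(loop.runOnce(10, true));
    }
}
```

The sketch leaves out the part where the handler enqueues a fresh wakeup token after a global flush so that it checks memory pressure again on the next iteration.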

Flow analysis of MemStoreFlusher.flushOneForGlobalPressure

This method picks the region with the largest memstore among all regions and flushes it.

private boolean flushOneForGlobalPressure() {

SortedMap<Long, HRegion> regionsBySize =

server.getCopyOfOnlineRegionsSortedBySize();

Set<HRegion> excludedRegions = new HashSet<HRegion>();

boolean flushedOne = false;

while (!flushedOne) {

// Find the biggest region that doesn't have too many storefiles

// (might be null!)

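// Pick the region with the largest memstore that satisfies the following conditions:

// a. writestate.flushing == false and writestate.writesEnabled == true (the region is not read-only).

// b. Every store in the region has fewer store files than hbase.hstore.blockingStoreFiles (default 7).

// Regions are examined in descending order of memstore size, and the first one that meets the conditions is returned.

// If no region meets them, null is returned.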

HRegion bestFlushableRegion = getBiggestMemstoreRegion(

regionsBySize, excludedRegions, true);

// Find the biggest region, total, even if it might have too many flushes.

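// Pick the region with the largest memstore that satisfies the following conditions:

// a. writestate.flushing == false and writestate.writesEnabled == true (the region is not read-only).

// b. Regions are examined in descending order of memstore size, and the first one that meets condition a is returned.

// If no region meets it, null is returned. This call does not check whether a store exceeds the configured store file count.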

HRegion bestAnyRegion = getBiggestMemstoreRegion(

regionsBySize, excludedRegions, false);

// If the second lookup above found no region, nothing can be flushed: return without flushing.

if (bestAnyRegion == null) {

LOG.error("Above memory mark but there are no flushable regions!");

return false;

}

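// Decide which region most needs flushing.

// If the memstore of the largest region overall is more than twice the memstore of the largest region that is under the store file limit,

// flush the largest region overall first.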

HRegion regionToFlush;

if (bestFlushableRegion != null &&

bestAnyRegion.memstoreSize.get() > 2 * bestFlushableRegion.memstoreSize.get()) {

// .................... (some code is not shown here)

if (LOG.isDebugEnabled()) {

// .................... (some code is not shown here)

}

regionToFlush = bestAnyRegion;

} else {

// If no candidate region is under the store file limit

// (every region has a store whose file count exceeds the configured maximum),

// flush the region with the largest memstore.

if (bestFlushableRegion == null) {

regionToFlush = bestAnyRegion;

} else {

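// If some region is still under the configured maximum store file count, flush that region first.

// The aim is to avoid flushing a few write-hot regions again and again (which causes too many compactions) while write-cold regions are never flushed.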

regionToFlush = bestFlushableRegion;

}

}

Preconditions.checkState(regionToFlush.memstoreSize.get() > 0);

LOG.info("Flush of region " + regionToFlush + " due to global heap pressure");

// Run the flush with the global-flush flag set to true; see the analysis of MemStoreFlusher.flushRegion (global).

// If the flush fails, the region is added to excludedRegions,

// so that this round skips it and looks for the region with the next largest memstore.

flushedOne = flushRegion(regionToFlush, true);

if (!flushedOne) {

LOG.info("Excluding unflushable region " + regionToFlush +

" - trying to find a different region to flush.");

excludedRegions.add(regionToFlush);

}

}

return true;

}
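
The choice between bestAnyRegion and bestFlushableRegion can be isolated into a few lines. The sketch below is a simplified model under stated assumptions: Region is a hypothetical value class, the flag tooManyStoreFiles stands in for the hbase.hstore.blockingStoreFiles check, and the writestate checks and the excludedRegions set are left out.

```java
import java.util.List;

/** Hypothetical sketch of the region choice under global memory pressure; not HBase code. */
final class GlobalPressurePickSketch {
    static final class Region {
        final String name;
        final long memstoreSize;
        final boolean tooManyStoreFiles;
        Region(String name, long memstoreSize, boolean tooManyStoreFiles) {
            this.name = name;
            this.memstoreSize = memstoreSize;
            this.tooManyStoreFiles = tooManyStoreFiles;
        }
    }

    /** Largest memstore, optionally skipping regions that are over the store file limit. */
    static Region biggest(List<Region> regions, boolean checkStoreFileCount) {
        Region best = null;
        for (Region r : regions) {
            if (checkStoreFileCount && r.tooManyStoreFiles) {
                continue;
            }
            if (best == null || r.memstoreSize > best.memstoreSize) {
                best = r;
            }
        }
        return best;
    }

    /** Returns the region to flush, or null when there is nothing to flush. */
    static Region pick(List<Region> regions) {
        Region bestFlushable = biggest(regions, true);
        Region bestAny = biggest(regions, false);
        if (bestAny == null) {
            return null;
        }
        // Prefer the overall largest region only when it is more than twice as large.
        if (bestFlushable != null && bestAny.memstoreSize > 2 * bestFlushable.memstoreSize) {
            return bestAny;
        }
        return bestFlushable == null ? bestAny : bestFlushable;
    }
}
```

With a hot region of size 100 that has too many store files and a cold region of size 60, the cold region is chosen; once the hot region grows past 120, it is chosen instead.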

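Flow analysis of MemStoreFlusher.flushRegion (global)

When the second argument of this method is true, the flush is a global flush; otherwise it means the region's memstore has reached the configured size.

A return value of true means the flush succeeded; false means it failed.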

private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {

synchronized (this.regionsInQueue) {

// Remove this region from regionsInQueue and get the region's flush request.

FlushRegionEntry fqe = this.regionsInQueue.remove(region);

// For a global flush request, also remove the request from flushQueue.

if (fqe != null && emergencyFlush) {

// Need to remove from region from delay queue. When NOT an

// emergencyFlush, then item was removed via a flushQueue.poll.

flushQueue.remove(fqe);

}

}

lock.readLock().lock();

try {

// Run HRegion.flushcache. A return value of true means a compaction is needed; false means no compaction request is required.

boolean shouldCompact = region.flushcache();

// We just want to check the size

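// Check whether a split is needed. No split is done in the following cases:

// a. The region belongs to the meta table.

// b. distributedLogReplay is configured for the region and the region has not been replayed since it was opened (isRecovering == true).

// c. splitRequest is false (true means a client has called regionServer.splitregion).

// d. c is false and no store in the region is larger than

// hbase.hregion.max.filesize (default 10 * 1024 * 1024 * 1024L, 10 GB),

// or larger than hbase.hregion.memstore.flush.size (default 1024*1024*128L, 128 MB) *

// (number of regions of this table on the current RS * number of regions of this table on the current RS).

// e. c is false, or a store contains a store file of type reference, that is, a store file that refers to another store file.

// f. The checks in c, d and e pass and the client has requested a split,

// but the client specified a split row that does not exist in the current region.

// g. When none of the cases above applies, a split is required.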

boolean shouldSplit = region.checkSplit() != null;

if (shouldSplit) {

// A split is needed: issue a split request.

this.server.compactSplitThread.requestSplit(region);

} else if (shouldCompact) {

// A compaction is needed: issue a system compaction request.

server.compactSplitThread.requestSystemCompaction(

region, Thread.currentThread().getName());

}

} catch (DroppedSnapshotException ex) {

// .................... (some code is not shown here)

server.abort("Replay of HLog required. Forcing server shutdown", ex);

return false;

} catch (IOException ex) {

// .................... (some code is not shown here)

if (!server.checkFileSystem()) {

return false;

}

} finally {

lock.readLock().unlock();

// Wake up all threads that are waiting to update the region so the updates can proceed (a global flush makes updates wait).

wakeUpIfBlocking();

}

return true;

}
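
The size limit used in split check (d) can be worked through with numbers. The sketch below assumes that the smaller of the two values described there is the effective threshold, and that the plain maximum file size is used when the table has no region on the server; both points are assumptions, and the class and method names are hypothetical.

```java
/** Hypothetical sketch of the store size threshold described in split check (d); not HBase code. */
final class SplitSizeSketch {
    static final long MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024; // hbase.hregion.max.filesize default
    static final long FLUSH_SIZE = 128L * 1024 * 1024;          // hbase.hregion.memstore.flush.size default

    /** Threshold = min(MAX_FILE_SIZE, FLUSH_SIZE * n * n), n = regions of the table on this server. */
    static long sizeToCheck(int regionsOfTableOnThisServer) {
        if (regionsOfTableOnThisServer <= 0) {
            return MAX_FILE_SIZE; // assumption: fall back to the plain maximum
        }
        long squared = (long) regionsOfTableOnThisServer * regionsOfTableOnThisServer;
        // Compare before multiplying so that FLUSH_SIZE * squared cannot overflow.
        if (squared > MAX_FILE_SIZE / FLUSH_SIZE) {
            return MAX_FILE_SIZE;
        }
        return FLUSH_SIZE * squared;
    }

    /** A store is a split candidate only when it is larger than the threshold. */
    static boolean largeEnoughToSplit(long storeSizeBytes, int regionsOfTableOnThisServer) {
        return storeSizeBytes > sizeToCheck(regionsOfTableOnThisServer);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " region(s): " + (sizeToCheck(n) >> 20) + " MB");
        }
    }
}
```

Under these assumptions the threshold grows as 128 MB, 512 MB, 1152 MB and so on, and is capped at 10 GB from nine regions onward, so a table with few regions on a server splits early and a table with many regions splits only at the configured maximum.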

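Flow analysis of HRegion.flushcache

This method runs the flush, calling the coprocessor's preFlush method before the flush and its postFlush method afterwards.

Before the flush, writestate.flushing is set to true to mark that the region is flushing; it is set back to false when the flush finishes.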

public boolean flushcache() throws IOException {

// fail-fast instead of waiting on the lock

// Check whether the region is closing. Returning false means no compaction is needed.

if (this.closing.get()) {

LOG.debug("Skipping flush on " + this + " because closing");

return false;

}

MonitoredTask status = TaskMonitor.get().createStatus("Flushing " + this);

status.setStatus("Acquiring readlock on region");

// block waiting for the lock for flushing cache

lock.readLock().lock();

try {

// If the region has already been closed, skip the flush. Returning false means no compaction is needed.

if (this.closed.get()) {

LOG.debug("Skipping flush on " + this + " because closed");

status.abort("Skipped: closed");

return false;

}

// Run the coprocessor pre-flush hook.

if (coprocessorHost != null) {

status.setStatus("Running coprocessor pre-flush hooks");

coprocessorHost.preFlush();

}

if (numMutationsWithoutWAL.get() > 0) {

numMutationsWithoutWAL.set(0);

dataInMemoryWithoutWAL.set(0);

}

synchronized (writestate) {

// Mark the region as flushing.

if (!writestate.flushing && writestate.writesEnabled) {

this.writestate.flushing = true;

} else {

// .................... (some code is not shown here)

// If the region is already flushing, or the region is read-only, skip the flush. Returning false means no compaction is needed.

return false;

}

}

try {

// Run the flush on the memstore of every store in the region.

// The boolean result tells whether a compaction is needed.

boolean result = internalFlushcache(status);

// Run the coprocessor post-flush hook.

if (coprocessorHost != null) {

status.setStatus("Running post-flush coprocessor hooks");

coprocessorHost.postFlush();

}

status.markComplete("Flush successful");

return result;

} finally {

synchronized (writestate) {

// Set flushing back to false to mark the end of the flush.

writestate.flushing = false;

// Clear the region's flush-requested flag.

this.writestate.flushRequested = false;

// Wake up all waiting update threads.

writestate.notifyAll();

}

}

} finally {

lock.readLock().unlock();

status.cleanup();

}

}
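
The writestate handling in flushcache follows a common guard pattern: set a flag under a monitor, do the work outside the monitor, then clear the flag and notify waiters in a finally block. A minimal sketch of that pattern, with hypothetical names, is shown below. Unlike flushcache, which returns whether a compaction is needed, tryFlush here returns whether the flush body ran at all.

```java
/** Hypothetical sketch of the flushing-flag guard; not HBase code. */
final class FlushGuardSketch {
    private final Object writestate = new Object();
    private boolean flushing = false;
    private boolean writesEnabled = true;

    void setWritesEnabled(boolean enabled) {
        synchronized (writestate) {
            writesEnabled = enabled;
        }
    }

    boolean isFlushing() {
        synchronized (writestate) {
            return flushing;
        }
    }

    /** Runs flushBody unless a flush is already running or writes are disabled. */
    boolean tryFlush(Runnable flushBody) {
        synchronized (writestate) {
            if (flushing || !writesEnabled) {
                return false;
            }
            flushing = true;
        }
        try {
            flushBody.run(); // the slow work happens outside the monitor
            return true;
        } finally {
            synchronized (writestate) {
                flushing = false;
                writestate.notifyAll(); // let waiting threads re-check the state
            }
        }
    }
}
```

Because the flag is cleared in finally, a failing flush body cannot leave the region permanently marked as flushing.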

flushcache calls this method, which in turn calls one of its overloads.

protected boolean internalFlushcache(MonitoredTask status)

throws IOException {

return internalFlushcache(this.log, -1, status);

}

This overload performs the flush. It is reached through flushcache and returns whether a compaction is needed.

protected boolean internalFlushcache(

final HLog wal, final long myseqid, MonitoredTask status)

throws IOException {

if (this.rsServices != null && this.rsServices.isAborted()) {

// Don't flush when server aborting, it's unsafe

throw new IOException("Aborting flush because server is abortted...");

}

// Record the current system time as the flush start time; it is used to compute how long the flush took.

final long startTime = EnvironmentEdgeManager.currentTimeMillis();

// Clear flush flag.

// If nothing to flush, return and avoid logging start/stop flush.

// If the memstore is empty, return false without flushing.

if (this.memstoreSize.get() <= 0) {

return false;

}

if (LOG.isDebugEnabled()) {

LOG.debug("Started memstore flush for " + this +

", current region memstore size " +

StringUtils.humanReadableInt(this.memstoreSize.get()) +

((wal != null)? "": "; wal is null, using passed sequenceid=" + myseqid));

}

// Stop updates while we snapshot the memstore of all stores. We only have

// to do this for a moment. Its quick. The subsequent sequence id that

// goes into the HLog after we've flushed all these snapshots also goes

// into the info file that sits beside the flushed files.

// We also set the memstore size to zero here before we allow updates

// again so its value will represent the size of the updates received

// during the flush

MultiVersionConsistencyControl.WriteEntry w = null;

// We have to take a write lock during snapshot, or else a write could

// end up in both snapshot and memstore (makes it difficult to do atomic

// rows then)

status.setStatus("Obtaining lock to block concurrent updates");

// block waiting for the lock for internal flush

this.updatesLock.writeLock().lock();

long flushsize = this.memstoreSize.get();

status.setStatus("Preparing to flush by snapshotting stores");

List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());

long flushSeqId = -1L;

try {

// Record the mvcc for all transactions in progress.

// Create a MultiVersionConsistencyControl.WriteEntry whose write number is the mvcc's ++memstoreWrite,

// and add the WriteEntry to the mvcc's writeQueue.

w = mvcc.beginMemstoreInsert();

// Take the WriteEntry instances out of writeQueue and read their writeNumber values,

// assign the largest (last) writeNumber to memstoreRead,

// and wake up the readWaiters (mvcc.waitForRead(w) waits for this wakeup).

mvcc.advanceMemstore(w);

if (wal != null) {

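// Remove this region's unflushed seqid (the largest seqid after the edits were appended) from the wal's oldestUnflushedSeqNums,

// and add it to oldestFlushingSeqNums.

// Obtain the seqid for this flush, which is the wal's (FSHLog) logSeqNum plus one.

// logSeqNum starts from the region seqid obtained when openRegion is called; it is the largest seqid of all regions on the current RS.

// Each time an entry is appended to the hlog, logSeqNum is incremented by one and the new value is used as the seqid of that hlog entry.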

Long startSeqId = wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());

if (startSeqId == null) {

status.setStatus("Flush will not be started for [" + this.getRegionInfo().getEncodedName()

+ "] - WAL is going away");

return false;

}

flushSeqId = startSeqId.longValue();

} else {

flushSeqId = myseqid;

}

for (Store s : stores.values()) {

// For each store in the region, create an HStore.StoreFlusherImpl instance.

storeFlushCtxs.add(s.createFlushContext(flushSeqId));

}

// prepare flush (take a snapshot)

for (StoreFlushContext flush : storeFlushCtxs) {

// For each store in the region, move the memstore's kvset into the memstore's snapshot and clear kvset,

// then copy the memstore's snapshot into the HStore's snapshot.

flush.prepare();

}

} finally {

this.updatesLock.writeLock().unlock();

}

String s = "Finished memstore snapshotting " + this +

", syncing WAL and waiting on mvcc, flushsize=" + flushsize;

status.setStatus(s);

if (LOG.isTraceEnabled()) LOG.trace(s);

// sync unflushed WAL changes when deferred log sync is enabled

// see HBASE-8208 for details

if (wal != null && !shouldSyncLog()) {

// Write the wal entries to HDFS.

wal.sync();

}

// wait for all in-progress transactions to commit to HLog before

// we can start the flush. This prevents

// uncommitted transactions from being written into HFiles.

// We have to block before we start the flush, otherwise keys that

// were removed via a rollbackMemstore could be written to Hfiles.

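// Wait until the mvcc's writeQueue has been processed and the largest memstoreRead value is available.

// The thread waits here until mvcc.advanceMemstore(w) has finished and wakes it up.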

mvcc.waitForRead(w);

s = "Flushing stores of " + this;

status.setStatus(s);

if (LOG.isTraceEnabled()) LOG.trace(s);

// Any failure from here on out will be catastrophic requiring server

// restart so hlog content can be replayed and put back into the memstore.

// Otherwise, the snapshot content while backed up in the hlog, it will not

// be part of the current running servers state.

boolean compactionRequested = false;

try {

// A. Flush memstore to all the HStores.

// Keep running vector of all store files that includes both old and the

// just-made new flush store file. The new flushed file is still in the

// tmp directory.

for (StoreFlushContext flush : storeFlushCtxs) {

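// For each store in the region, call HStore.flushCache to flush the store's snapshot into an hfile,

// using the latest seqid obtained from the wal.

// hbase.hstore.flush.retries.number sets how many times a failed flush is retried (default 10).

// hbase.server.pause sets the interval between retries (default 1000 ms).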

// The flusher used for each store

// is set by hbase.hstore.defaultengine.storeflusher.class (default DefaultStoreFlusher).

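// The StoreEngine of each HStore is set by hbase.hstore.engine.class (default DefaultStoreEngine).

// Create a StoreFile.Writer whose path is a UUID-named file under the region's .tmp directory.

// Call the store flusher's flushSnapshot method and get the path of the flushed hfile under .tmp.

// Validate the file (it is valid if StoreFile.createReader succeeds).

// Write the kvs of the memstore into this file.

// Write the largest seqid at flush time into the hfile's metadata (fileinfo).

// Add the temporary hfile to the tempFiles list of the HStore.StoreFlusherImpl instance.

// The file then waits for HStore.StoreFlusherImpl.commit to be called.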

flush.flushCache(status);

}

// Switch snapshot (in memstore) -> new hfile (thus causing

// all the store scanners to reset/reseek).

for (StoreFlushContext flush : storeFlushCtxs) {

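// HStore.StoreFlusherImpl.commit moves the freshly flushed hfile from .tmp into the column family directory,

// creates a StoreFile and Reader for the hfile and adds the StoreFile to the HStore's storefiles list,

// and clears HStore.memstore.snapshot.

// The compaction policy set by hbase.hstore.defaultengine.compactionpolicy.class

// (default ExploringCompactionPolicy) then checks whether a compaction is needed.

// hbase.hstore.compaction.min sets the minimum number of files for a compaction (default 3).

// Older versions use hbase.hstore.compactionThreshold; the value cannot be less than 2.

// If the number of store files in the store minus the number of files being compacted is greater than or equal to that value,

// a compaction is needed.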

boolean needsCompaction = flush.commit(status);

if (needsCompaction) {

compactionRequested = true;

}

}

storeFlushCtxs.clear();

// Set down the memstore size by amount of flush.

this.addAndGetGlobalMemstoreSize(-flushsize);

} catch (Throwable t) {

// An exception here means that the snapshot was not persisted.

// The hlog needs to be replayed so its content is restored to memstore.

// Currently, only a server restart will do this.

// We used to only catch IOEs but its possible that we'd get other

// exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch

// all and sundry.

if (wal != null) {

wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());

}

DroppedSnapshotException dse = new DroppedSnapshotException("region: " +

Bytes.toStringBinary(getRegionName()));

dse.initCause(t);

status.abort("Flush failed: " + StringUtils.stringifyException(t));

throw dse;

}

// If we get to here, the HStores have been written.

if (wal != null) {

// Remove this region's seqid of the previous flush from FSHLog.oldestFlushingSeqNums.

wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());

}

// Record latest flush time

// Update the region's last flush time.

this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();

// Update the last flushed sequence id for region

if (this.rsServices != null) {

// Set completeSequenceId to the wal seqid of the flush that has just completed.

completeSequenceId = flushSeqId;

}

// C. Finally notify anyone waiting on memstore to clear:

// e.g. checkResources().

synchronized (this) {

notifyAll(); // FindBugs NN_NAKED_NOTIFY

}

long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;

long memstoresize = this.memstoreSize.get();

String msg = "Finished memstore flush of ~" +

StringUtils.humanReadableInt(flushsize) + "/" + flushsize +

", currentsize=" +

StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +

" for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +

", compaction requested=" + compactionRequested +

((wal == null)? "; wal=null": "");

LOG.info(msg);

status.setStatus(msg);

this.recentFlushes.add(new Pair<Long,Long>(time/1000, flushsize));

// Return whether a compaction is needed.

return compactionRequested;

}
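
The three phases of internalFlushcache (prepare, flushCache, commit) can be modelled on a single in-memory store. This is a simplified sketch under stated assumptions, not the HBase implementation: the class and its members are hypothetical, sorted maps stand in for the memstore and for hfiles, updates are assumed to take the read side of updatesLock, and the WAL and mvcc steps are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical single-store model of snapshot, flush and commit; not HBase code. */
final class SnapshotFlushSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private final int minFilesToCompact;
    private ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
    private ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();
    private final List<SortedMap<String, String>> storeFiles = new ArrayList<>();

    SnapshotFlushSketch(int minFilesToCompact) {
        this.minFilesToCompact = minFilesToCompact;
    }

    /** Updates share the read lock (assumption), so they only block during the snapshot swap. */
    void put(String key, String value) {
        updatesLock.readLock().lock();
        try {
            kvset.put(key, value);
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    /** Flushes the memstore and returns true when the store has enough files to compact. */
    synchronized boolean flush() {
        // Phase 1 (prepare): swap kvset into snapshot while updates are blocked.
        updatesLock.writeLock().lock();
        try {
            if (kvset.isEmpty()) {
                return false; // nothing to flush
            }
            snapshot = kvset;
            kvset = new ConcurrentSkipListMap<>();
        } finally {
            updatesLock.writeLock().unlock();
        }
        // Phase 2 (flushCache): write the snapshot to a "temporary file"; updates continue meanwhile.
        SortedMap<String, String> tmpFile = new TreeMap<>(snapshot);
        // Phase 3 (commit): publish the file and drop the snapshot.
        storeFiles.add(tmpFile);
        snapshot = new ConcurrentSkipListMap<>();
        return storeFiles.size() >= minFilesToCompact;
    }

    int memstoreEntries() {
        updatesLock.readLock().lock();
        try {
            return kvset.size();
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    synchronized int storeFileCount() {
        return storeFiles.size();
    }
}
```

The write lock is held only for the swap in phase 1, which is why the listing can say that stopping updates is quick; the expensive file write in phase 2 runs while new updates land in the fresh kvset.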

Flush when a region's MemStore reaches the configured size

This kind of flush request is issued when the memstore size of a region reaches the configured upper limit.

It is run through MemStoreFlusher.FlushHandler.run --> flushRegion(final FlushRegionEntry fqe).

private boolean flushRegion(final FlushRegionEntry fqe) {

HRegion region = fqe.region;

// If the region is not a meta region and some store in the region has reached the store file limit

// set by hbase.hstore.blockingStoreFiles (default 7):

if (!region.getRegionInfo().isMetaRegion() &&

isTooManyStoreFiles(region)) {

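// Check whether the flush request has waited longer than the allowed time; if so, log a message and go ahead with the flush.

// The allowed time is set by hbase.hstore.blockingWaitTime (default 90000 ms).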

if (fqe.isMaximumWait(this.blockingWaitTime)) {

LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +

"ms on a compaction to clean up 'too many store files'; waited " +

"long enough... proceeding with flush of " +

region.getRegionNameAsString());

} else {

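// The flush request has not yet waited the maximum acceptable time.

// If it has not been requeued before, a warning is logged.

// flushQueue orders its entries by FlushRegionEntry expiry time; by default this is first in, first out,

// unless FlushRegionEntry.requeue has been called to set an explicit expiry time.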

// If this is first time we've been put off, then emit a log message.

if (fqe.getRequeueCount() <= 0) {

// Note: We don't impose blockingStoreFiles constraint on meta regions

LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +

"store files; delaying flush up to " + this.blockingWaitTime + "ms");

// Check whether a split request is needed. If so, issue it; otherwise issue a compaction request.

if (!this.server.compactSplitThread.requestSplit(region)) {

try {

// Issue a compaction request because the store has too many files.

// Whether compaction runs can be controlled per table at creation time with COMPACTION_ENABLED (TRUE/FALSE).

this.server.compactSplitThread.requestSystemCompaction(

region, Thread.currentThread().getName());

} catch (IOException e) {

LOG.error(

"Cache flush failed for region " + Bytes.toStringBinary(region.getRegionName()),

RemoteExceptionHandler.checkIOException(e));

}

}

}

// Put back on the queue. Have it come back out of the queue

// after a delay of this.blockingWaitTime / 100 ms.

// Requeue the current flush request in flushQueue so that it runs again after 900 ms by default.

this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));

// Tell a lie, it's not flushed but it's ok

return true;

}

}

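// Run the flush with the global-flush argument set to false, meaning the memstore size reached the configured limit.

// The flow is not repeated here; see the analysis of MemStoreFlusher.flushRegion (global).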

return flushRegion(region, false);

}
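
The requeue behaviour relies on a queue that releases entries only after their expiry time. A minimal sketch with java.util.concurrent.DelayQueue is shown below. DelayedFlushEntry is a hypothetical stand-in for FlushRegionEntry that keeps only the timing fields used in the listing above; that the real flushQueue is a DelayQueue is an assumption based on the ordering described there.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a delayed flush request; not HBase code. */
final class DelayedFlushEntry implements Delayed {
    final String regionName;
    private final long createTime = System.currentTimeMillis();
    private long whenToExpire = createTime; // ready immediately unless requeued
    private int requeueCount = 0;

    DelayedFlushEntry(String regionName) {
        this.regionName = regionName;
    }

    /** True once the request has existed for longer than maximumWaitMs. */
    boolean isMaximumWait(long maximumWaitMs) {
        return System.currentTimeMillis() - createTime > maximumWaitMs;
    }

    int getRequeueCount() {
        return requeueCount;
    }

    /** Pushes the expiry into the future; call this only while the entry is outside the queue. */
    DelayedFlushEntry requeue(long delayMs) {
        whenToExpire = System.currentTimeMillis() + delayMs;
        requeueCount++;
        return this;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(whenToExpire - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        DelayQueue<DelayedFlushEntry> flushQueue = new DelayQueue<>();
        long blockingWaitTimeMs = 90000L; // default of hbase.hstore.blockingWaitTime
        // A request delayed because of too many store files comes back after blockingWaitTime / 100.
        flushQueue.add(new DelayedFlushEntry("region-a").requeue(blockingWaitTimeMs / 100));
        flushQueue.add(new DelayedFlushEntry("region-b"));
        DelayedFlushEntry head = flushQueue.poll();
        System.out.println(head == null ? "nothing ready" : head.regionName);
    }
}
```

Entries that were never requeued expire at their creation time, so among themselves they come out roughly in arrival order, while a requeued entry stays invisible to poll until its delay has passed.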