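Analysis of the memstore flush flow
A memstore flush can be triggered from the following places:
a. When HRegionServer handles a multi update, it checks whether total memstore usage exceeds the configured global upper or lower limit.
If so, it issues a WakeupFlushThread flush request; if usage exceeds the global upper limit, the update must also wait for the flush to complete.
b. When HRegionServer updates data and calls HRegion.batchMutate to update the stores,
if the region's memstore size exceeds the configured per-region memstore size, it issues a FlushRegionEntry flush request.
c. The client explicitly calls HRegionServer.flushRegion.
d. The periodic flush thread HRegionServer.PeriodicMemstoreFlusher, whose interval is set by hbase.regionserver.optionalcacheflushinterval
(default 3600000 ms).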
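The size-based triggers (a) and (b) can be pictured with a minimal sketch. This is not HBase code: the class FlushTriggerSketch, the Action enum and the decide method are hypothetical names, and evaluating the global check before the per-region check is an assumption made only to keep the example small.

```java
/** Hypothetical sketch of the size-based flush triggers (a) and (b); not HBase code. */
final class FlushTriggerSketch {
  enum Action { NONE, REGION_FLUSH, GLOBAL_WAKEUP, GLOBAL_WAKEUP_AND_WAIT }

  /**
   * Decides what a write path would do, given sizes in bytes.
   * lowMark and highMark stand for the global lower and upper limits;
   * regionFlushSize stands for the per-region memstore limit.
   */
  static Action decide(long globalSize, long lowMark, long highMark,
                       long regionSize, long regionFlushSize) {
    if (globalSize >= highMark) {
      return Action.GLOBAL_WAKEUP_AND_WAIT; // trigger (a): over the upper limit, the writer waits
    }
    if (globalSize >= lowMark) {
      return Action.GLOBAL_WAKEUP;          // trigger (a): over the lower limit only
    }
    if (regionSize > regionFlushSize) {
      return Action.REGION_FLUSH;           // trigger (b): this region alone is too big
    }
    return Action.NONE;
  }
}
```

Triggers (c) and (d) do not depend on sizes, so they are left out of the sketch.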
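The flush execution process
The flush itself is carried out by MemStoreFlusher. When a flush request is issued,
the request is added to the flushQueue queue and also recorded in regionsInQueue.
When the MemStoreFlusher instance is created, it starts MemStoreFlusher.FlushHandler thread instances;
the number of these threads is set by hbase.hstore.flusher.count, which defaults to 1.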
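The pairing of flushQueue and regionsInQueue can be sketched as below. The class FlushQueueSketch and its members are hypothetical. Using a java.util.concurrent.DelayQueue and rejecting a second request for a region that is already pending are assumptions for illustration, not a copy of the HBase implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a flush queue plus a map that tracks which regions are pending. */
final class FlushQueueSketch {
  static final class Entry implements Delayed {
    final String region;
    Entry(String region) { this.region = region; }
    @Override public long getDelay(TimeUnit unit) { return 0; } // always ready in this sketch
    @Override public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
  }

  private final DelayQueue<Entry> flushQueue = new DelayQueue<>();
  private final Map<String, Entry> regionsInQueue = new HashMap<>();

  /** Returns true if a new request was queued, false if the region was already pending. */
  boolean requestFlush(String region) {
    synchronized (regionsInQueue) {
      if (regionsInQueue.containsKey(region)) {
        return false; // assumed behaviour: one pending request per region
      }
      Entry entry = new Entry(region);
      regionsInQueue.put(region, entry);
      flushQueue.add(entry);
      return true;
    }
  }

  int queued() { return flushQueue.size(); }
}
```

Keeping the map next to the queue makes it cheap to find and remove a region's pending request later, which is what the global flush path relies on.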
private class FlushHandler extends HasThread {
@Override
public void run() {
while (!server.isStopped()) {
FlushQueueEntry fqe = null;
try {
wakeupPending.set(false); // allow someone to wake us up again
// Take one flush request from the queue. The queue is a blocking queue: if flushQueue is empty,
// wait up to hbase.server.thread.wakefrequency ms (default 10*1000).
fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
if (fqe == null || fqe instanceof WakeupFlushThread) {
// Either there is no flush request, or the request is a global-flush wakeup token.
// Check whether the total of all memstores exceeds hbase.regionserver.global.memstore.lowerLimit (default 0.35).
if (isAboveLowWaterMark()) {
LOG.debug("Flush thread woke up because memory above low water=" +
StringUtils.humanReadableInt(globalMemStoreLimitLowMark));
// Above the configured low water mark: flush the region with the largest memstore.
// For this method's flow, see the MemStoreFlusher.flushOneForGlobalPressure analysis.
if (!flushOneForGlobalPressure()) {
// ... (part of the code is not shown here)
Thread.sleep(1000);
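// No region could be flushed; wake up any update threads that are waiting.
// HRegionServer's update methods make the updating thread wait for a flush when the memstore total exceeds the configured upper limit.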
wakeUpIfBlocking();
}
// Enqueue another one of these tokens so we'll wake up again
// Issue another global-flush wakeup by creating a new WakeupFlushThread request.
wakeupFlushThread();
}
continue;
}
// A normal flush request:
// a single region's memstore size exceeded hbase.hregion.memstore.flush.size (default 1024*1024*128L).
// For this method's flow, see MemStoreFlusher.flushRegion.
FlushRegionEntry fre = (FlushRegionEntry) fqe;
if (!flushRegion(fre)) {
break;
}
} catch (InterruptedException ex) {
continue;
} catch (ConcurrentModificationException ex) {
continue;
} catch (Exception ex) {
LOG.error("Cache flusher failed for entry " + fqe, ex);
if (!server.checkFileSystem()) {
break;
}
}
}
// The MemStoreFlusher thread is ending, usually because the regionserver is stopping.
synchronized (regionsInQueue) {
regionsInQueue.clear();
flushQueue.clear();
}
// Signal anyone waiting, so they see the close flag
wakeUpIfBlocking();
LOG.info(getName() + " exiting");
}
}
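One iteration of the handler loop above can be reduced to a small decision function. The class FlushHandlerStepSketch, the WAKEUP token and the returned strings are hypothetical stand-ins, and the low-water check is passed in as a boolean. It is a sketch of the control flow only, not the HBase code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of one FlushHandler loop iteration. */
final class FlushHandlerStepSketch {
  static final String WAKEUP = "<wakeup>"; // stand-in for a WakeupFlushThread token

  /** Polls once and reports what the handler would do next. */
  static String step(BlockingQueue<String> queue, boolean aboveLowWaterMark, long timeoutMs) {
    String entry;
    try {
      entry = queue.poll(timeoutMs, TimeUnit.MILLISECONDS); // blocks up to timeoutMs
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
      return "interrupted";
    }
    if (entry == null || WAKEUP.equals(entry)) {
      if (!aboveLowWaterMark) {
        return "idle";                 // nothing queued and no global pressure
      }
      queue.add(WAKEUP);               // re-arm so the next iteration checks again
      return "flush-biggest-region";   // global pressure path
    }
    return "flush-region:" + entry;    // normal per-region request
  }
}
```

With an empty queue and no pressure the function returns "idle" after the timeout; under pressure it leaves a fresh wakeup token in the queue so the check repeats.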
MemStoreFlusher.flushOneForGlobalPressure flow analysis
This method picks the region with the largest memstore among all online regions and flushes it.
private boolean flushOneForGlobalPressure() {
SortedMap<Long, HRegion> regionsBySize =
server.getCopyOfOnlineRegionsSortedBySize();
Set<HRegion> excludedRegions = new HashSet<HRegion>();
boolean flushedOne = false;
while (!flushedOne) {
// Find the biggest region that doesn't have too many storefiles
// (might be null!)
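// Pick the region with the largest memstore that satisfies these conditions:
// a. the region's writestate.flushing == false and writestate.writesEnabled == true (not read-only);
// b. every store in the region has fewer store files than hbase.hstore.blockingStoreFiles (default 7).
// Regions are examined in descending order of memstore size, and the first one meeting both conditions is returned.
// If none qualifies, null is returned.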
HRegion bestFlushableRegion = getBiggestMemstoreRegion(
regionsBySize, excludedRegions, true);
// Find the biggest region, total, even if it might have too many flushes.
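// Pick the region with the largest memstore that satisfies this condition:
// a. the region's writestate.flushing == false and writestate.writesEnabled == true (not read-only).
// Regions are again examined in descending order of memstore size, and the first one meeting the condition is returned.
// If none qualifies, null is returned. This pass does not check whether any store has too many store files.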
HRegion bestAnyRegion = getBiggestMemstoreRegion(
regionsBySize, excludedRegions, false);
// If the second lookup above found nothing, no region can be flushed; return without flushing.
if (bestAnyRegion == null) {
LOG.error("Above memory mark but there are no flushable regions!");
return false;
}
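// Choose the region that most needs flushing.
// If the largest memstore overall is more than twice the size of the largest memstore
// among regions that are under the store file limit, flush that largest region first.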
HRegion regionToFlush;
if (bestFlushableRegion != null &&
bestAnyRegion.memstoreSize.get() > 2 * bestFlushableRegion.memstoreSize.get()) {
// ... (part of the code is not shown here)
if (LOG.isDebugEnabled()) {
// ... (part of the code is not shown here)
}
regionToFlush = bestAnyRegion;
} else {
// If no region is under the store file limit
// (every region has a store whose file count exceeds the configured maximum),
// flush the region with the largest memstore.
if (bestFlushableRegion == null) {
regionToFlush = bestAnyRegion;
} else {
// If some region's stores are still under the store file limit, flush that region first.
// This avoids a few write-hot regions being flushed and compacted too often while colder regions are never flushed.
regionToFlush = bestFlushableRegion;
}
}
Preconditions.checkState(regionToFlush.memstoreSize.get() > 0);
LOG.info("Flush of region " + regionToFlush + " due to global heap pressure");
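// Run the flush with the global-flush flag set to true; see the MemStoreFlusher.flushRegion (global flush) analysis.
// If the flush fails, the region is added to excludedRegions so that
// this round skips it and looks for the next-largest memstore region to flush.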
flushedOne = flushRegion(regionToFlush, true);
if (!flushedOne) {
LOG.info("Excluding unflushable region " + regionToFlush +
" - trying to find a different region to flush.");
excludedRegions.add(regionToFlush);
}
}
return true;
}
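The choice between bestAnyRegion and bestFlushableRegion is easier to check in isolation. The sketch below keeps only that decision. GlobalPressureChoiceSketch and its Region class are hypothetical, and the flushing and writesEnabled checks are left out on the assumption that every region passed in is writable and idle.

```java
import java.util.List;

/** Hypothetical sketch of the region choice made under global memstore pressure. */
final class GlobalPressureChoiceSketch {
  static final class Region {
    final String name;
    final long memstoreSize;
    final boolean tooManyStoreFiles;
    Region(String name, long memstoreSize, boolean tooManyStoreFiles) {
      this.name = name;
      this.memstoreSize = memstoreSize;
      this.tooManyStoreFiles = tooManyStoreFiles;
    }
  }

  /** Largest memstore, optionally skipping regions with too many store files. */
  static Region biggest(List<Region> regions, boolean checkStoreFileCount) {
    Region best = null;
    for (Region r : regions) {
      if (checkStoreFileCount && r.tooManyStoreFiles) {
        continue;
      }
      if (best == null || r.memstoreSize > best.memstoreSize) {
        best = r;
      }
    }
    return best;
  }

  /** Returns the region to flush, or null when there is nothing to flush. */
  static Region choose(List<Region> regions) {
    Region bestFlushable = biggest(regions, true);
    Region bestAny = biggest(regions, false);
    if (bestAny == null) {
      return null;
    }
    if (bestFlushable != null && bestAny.memstoreSize > 2 * bestFlushable.memstoreSize) {
      return bestAny; // the overall biggest is more than twice as large: take it anyway
    }
    return bestFlushable == null ? bestAny : bestFlushable;
  }
}
```

For example, with a 100-byte region that has too many store files and a 60-byte region that does not, the 60-byte region is chosen; raise the first region to 200 bytes and it wins because 200 > 2 * 60.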
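MemStoreFlusher.flushRegion flow analysis (global flush)
The second argument is true for a global flush; false means the region's memstore reached its configured size.
Returns true if the flush succeeded, false if it failed.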
private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
synchronized (this.regionsInQueue) {
// Remove this region from regionsInQueue and obtain its flush request.
FlushRegionEntry fqe = this.regionsInQueue.remove(region);
// For a global flush, also remove the flush request from flushQueue.
if (fqe != null && emergencyFlush) {
// Need to remove from region from delay queue. When NOT an
// emergencyFlush, then item was removed via a flushQueue.poll.
flushQueue.remove(fqe);
}
}
lock.readLock().lock();
try {
// Run HRegion.flushcache; true means a compaction is needed, false means no compaction request should be issued.
boolean shouldCompact = region.flushcache();
// We just want to check the size
// Check whether a split is needed. No split is done when:
// a. the region belongs to the meta table;
// b. distributedLogReplay is configured and the region has not finished replay since it was opened (isRecovering == true);
// c. splitRequest is false (true means a client called regionServer.splitregion);
// d. c is false and no store in the region is larger than
// hbase.hregion.max.filesize (default 10 * 1024 * 1024 * 1024L, 10 GB)
// or larger than hbase.hregion.memstore.flush.size (default 1024*1024*128L, 128 MB) *
// (number of regions of this table on the current regionserver, squared);
// e. c is false, or some store has a store file of type reference, i.e. one that points to another store file;
// f. the checks in c, d and e passed and a client requested the split with an explicit split row,
// but that row does not exist in the current region.
// g. When none of the conditions above applies, a split is needed.
boolean shouldSplit = region.checkSplit() != null;
if (shouldSplit) {
// A split is needed: issue a split request.
this.server.compactSplitThread.requestSplit(region);
} else if (shouldCompact) {
// A compaction is needed: issue a system compaction request.
server.compactSplitThread.requestSystemCompaction(
region, Thread.currentThread().getName());
}
} catch (DroppedSnapshotException ex) {
// ... (part of the code is not shown here)
server.abort("Replay of HLog required. Forcing server shutdown", ex);
return false;
} catch (IOException ex) {
// ... (part of the code is not shown here)
if (!server.checkFileSystem()) {
return false;
}
} finally {
lock.readLock().unlock();
// Wake up all threads waiting to update the region so the updates can proceed (updates wait during a global flush).
wakeUpIfBlocking();
}
return true;
}
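The size check in condition d can be worked through with numbers. The sketch below assumes the two thresholds are combined by taking the smaller one, which is how the condition is read here rather than something the walkthrough states outright. SplitSizeSketch and its method names are hypothetical.

```java
/** Hypothetical sketch of the size threshold used when deciding whether to split. */
final class SplitSizeSketch {
  /**
   * Assumed rule: the smaller of maxFileSize and
   * flushSize * (regions of this table on this server) squared.
   */
  static long sizeToCheck(long maxFileSize, long flushSize, int regionsOfTableOnServer) {
    if (regionsOfTableOnServer <= 0) {
      return maxFileSize; // no regions counted: fall back to the fixed limit
    }
    long scaled = flushSize * regionsOfTableOnServer * (long) regionsOfTableOnServer;
    return Math.min(maxFileSize, scaled);
  }

  /** A store larger than the threshold asks for a split, unless reference files exist. */
  static boolean shouldSplit(long[] storeSizes, boolean hasReferenceFiles, long threshold) {
    if (hasReferenceFiles) {
      return false; // condition e: stores that still hold reference files are not split
    }
    for (long size : storeSizes) {
      if (size > threshold) {
        return true;
      }
    }
    return false;
  }
}
```

With the defaults quoted above (128 MB flush size, 10 GB max file size), the threshold under this assumption is 128 MB for one region, 512 MB for two, 8 GB for eight, and is capped at 10 GB from nine regions on.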
HRegion.flushcache flow analysis
Runs the flush, calling the coprocessor's preFlush before it and postFlush after it.
Before flushing, writestate.flushing is set to true to mark the region as flushing; it is reset to false when the flush completes.
public boolean flushcache() throws IOException {
// fail-fast instead of waiting on the lock
// Check whether the region is closing. Returning false means no compaction.
if (this.closing.get()) {
LOG.debug("Skipping flush on " + this + " because closing");
return false;
}
MonitoredTask status = TaskMonitor.get().createStatus("Flushing " + this);
status.setStatus("Acquiring readlock on region");
// block waiting for the lock for flushing cache
lock.readLock().lock();
try {
// If the region has already been closed, skip the flush. Returning false means no compaction.
if (this.closed.get()) {
LOG.debug("Skipping flush on " + this + " because closed");
status.abort("Skipped: closed");
return false;
}
// Run the coprocessor pre-flush hook.
if (coprocessorHost != null) {
status.setStatus("Running coprocessor pre-flush hooks");
coprocessorHost.preFlush();
}
if (numMutationsWithoutWAL.get() > 0) {
numMutationsWithoutWAL.set(0);
dataInMemoryWithoutWAL.set(0);
}
synchronized (writestate) {
// Mark the region as flushing.
if (!writestate.flushing && writestate.writesEnabled) {
this.writestate.flushing = true;
} else {
// ... (part of the code is not shown here)
// The region is already flushing or is read-only: skip the flush. Returning false means no compaction.
return false;
}
}
try {
// Do the flush: flush the memstore of every store in the region.
// The boolean result says whether a compaction is needed.
boolean result = internalFlushcache(status);
// Run the coprocessor post-flush hook.
if (coprocessorHost != null) {
status.setStatus("Running post-flush coprocessor hooks");
coprocessorHost.postFlush();
}
status.markComplete("Flush successful");
return result;
} finally {
synchronized (writestate) {
// Reset flushing to false to mark the flush as finished.
writestate.flushing = false;
// Clear the region's flush-requested flag.
this.writestate.flushRequested = false;
// Wake up all threads waiting on writestate.
writestate.notifyAll();
}
}
} finally {
lock.readLock().unlock();
status.cleanup();
}
}
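The writestate.flushing guard in flushcache follows a common pattern: claim a flag under a monitor, do the work outside it, and always release the flag in finally. A minimal sketch of that pattern follows. FlushGuardSketch is a hypothetical class, and unlike flushcache its return value says whether the body ran, not whether a compaction is needed.

```java
/** Hypothetical sketch of the flushing-flag guard; not the HBase implementation. */
final class FlushGuardSketch {
  private final Object writestate = new Object();
  private boolean flushing = false;
  private boolean writesEnabled = true;
  private int flushCount = 0;

  /** Runs body unless a flush is already running or writes are disabled. */
  boolean flush(Runnable body) {
    synchronized (writestate) {
      if (flushing || !writesEnabled) {
        return false;          // someone else is flushing, or the region is read-only
      }
      flushing = true;         // claim the flush
    }
    try {
      body.run();              // the real work happens outside the monitor
      flushCount++;
      return true;
    } finally {
      synchronized (writestate) {
        flushing = false;      // always release, even if body threw
        writestate.notifyAll();
      }
    }
  }

  void setWritesEnabled(boolean enabled) {
    synchronized (writestate) {
      writesEnabled = enabled;
    }
  }

  int getFlushCount() {
    return flushCount;
  }
}
```

A flush attempted while another one is in progress is rejected immediately instead of queueing behind it, which matches the "return false" branch in the walkthrough above.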
flushcache calls the following method, which in turn calls one of its own overloads.
protected boolean internalFlushcache(MonitoredTask status)
throws IOException {
return internalFlushcache(this.log, -1, status);
}
Performs the flush; reached through flushcache. Returns whether a compaction is needed.
protected boolean internalFlushcache(
final HLog wal, final long myseqid, MonitoredTask status)
throws IOException {
if (this.rsServices != null && this.rsServices.isAborted()) {
// Don't flush when server aborting, it's unsafe
throw new IOException("Aborting flush because server is abortted...");
}
// Record the current system time as the flush start time, used to compute the flush duration.
final long startTime = EnvironmentEdgeManager.currentTimeMillis();
// Clear flush flag.
// If nothing to flush, return and avoid logging start/stop flush.
// If the memstore is empty, skip the flush and return false.
if (this.memstoreSize.get() <= 0) {
return false;
}
if (LOG.isDebugEnabled()) {
LOG.debug("Started memstore flush for " + this +
", current region memstore size " +
StringUtils.humanReadableInt(this.memstoreSize.get()) +
((wal != null)? "": "; wal is null, using passed sequenceid=" + myseqid));
}
// Stop updates while we snapshot the memstore of all stores. We only have
// to do this for a moment. Its quick. The subsequent sequence id that
// goes into the HLog after we've flushed all these snapshots also goes
// into the info file that sits beside the flushed files.
// We also set the memstore size to zero here before we allow updates
// again so its value will represent the size of the updates received
// during the flush
MultiVersionConsistencyControl.WriteEntry w = null;
// We have to take a write lock during snapshot, or else a write could
// end up in both snapshot and memstore (makes it difficult to do atomic
// rows then)
status.setStatus("Obtaining lock to block concurrent updates");
// block waiting for the lock for internal flush
this.updatesLock.writeLock().lock();
long flushsize = this.memstoreSize.get();
status.setStatus("Preparing to flush by snapshotting stores");
List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
long flushSeqId = -1L;
try {
// Record the mvcc for all transactions in progress.
// Create a MultiVersionConsistencyControl.WriteEntry whose write number is mvcc's ++memstoreWrite,
// and add the WriteEntry to mvcc's writeQueue.
w = mvcc.beginMemstoreInsert();
// Remove WriteEntry instances from writeQueue and read their write numbers,
// assign the largest (last) write number to memstoreRead,
// and wake up readWaiters (mvcc.waitForRead(w) waits for this).
mvcc.advanceMemstore(w);
if (wal != null) {
// Remove this region's oldest unflushed seqid from the wal's oldestUnflushedSeqNums
// and add it to oldestFlushingSeqNums.
// The flush seqid is obtained by incrementing logSeqNum in the wal (FSHLog).
// logSeqNum is initialized from the region seqid returned by openRegion, which is the largest seqid of all regions on the current regionserver.
// Each HLog append also increments logSeqNum and uses the new value as the seqid of that log entry.
Long startSeqId = wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
if (startSeqId == null) {
status.setStatus("Flush will not be started for [" + this.getRegionInfo().getEncodedName()
+ "] - WAL is going away");
return false;
}
flushSeqId = startSeqId.longValue();
} else {
flushSeqId = myseqid;
}
for (Store s : stores.values()) {
// For each store in the region, create an HStore.StoreFlusherImpl instance.
storeFlushCtxs.add(s.createFlushContext(flushSeqId));
}
// prepare flush (take a snapshot)
for (StoreFlushContext flush : storeFlushCtxs) {
// For each store in the region, move the memstore's kvset into the memstore's snapshot and clear kvset,
// then copy the memstore's snapshot into the HStore's snapshot.
flush.prepare();
}
} finally {
this.updatesLock.writeLock().unlock();
}
String s = "Finished memstore snapshotting " + this +
", syncing WAL and waiting on mvcc, flushsize=" + flushsize;
status.setStatus(s);
if (LOG.isTraceEnabled()) LOG.trace(s);
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
// Sync the wal entries to HDFS.
wal.sync();
}
// wait for all in-progress transactions to commit to HLog before
// we can start the flush. This prevents
// uncommitted transactions from being written into HFiles.
// We have to block before we start the flush, otherwise keys that
// were removed via a rollbackMemstore could be written to Hfiles.
// Wait for mvcc's writeQueue to be processed so that memstoreRead reaches its latest value;
// the thread waits here until mvcc.advanceMemstore(w) completes and wakes it up.
mvcc.waitForRead(w);
s = "Flushing stores of " + this;
status.setStatus(s);
if (LOG.isTraceEnabled()) LOG.trace(s);
// Any failure from here on out will be catastrophic requiring server
// restart so hlog content can be replayed and put back into the memstore.
// Otherwise, the snapshot content while backed up in the hlog, it will not
// be part of the current running servers state.
boolean compactionRequested = false;
try {
// A. Flush memstore to all the HStores.
// Keep running vector of all store files that includes both old and the
// just-made new flush store file. The new flushed file is still in the
// tmp directory.
for (StoreFlushContext flush : storeFlushCtxs) {
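// For each store in the region, call HStore.flushCache to flush the store's snapshot to an HFile,
// using the latest seqid obtained from the wal.
// hbase.hstore.flush.retries.number sets the number of retries when a flush fails (default 10).
// hbase.server.pause sets the interval between retries (default 1000 ms).
// The flusher used for each store is chosen by
// hbase.hstore.defaultengine.storeflusher.class (default DefaultStoreFlusher).
// The StoreEngine of each HStore is chosen by hbase.hstore.engine.class (default DefaultStoreEngine).
// A StoreFile.Writer is created whose path is a UUID-named file under the region's .tmp directory.
// The store flusher's flushSnapshot method is called and returns the path of the HFile under .tmp.
// The file is validated (it is valid if StoreFile.createReader succeeds).
// The KVs in the memstore snapshot are written to this file.
// The largest seqid at flush time is written into the HFile's metadata (fileinfo).
// The temporary HFile is added to the tempFiles list of the HStore.StoreFlusherImpl instance,
// where it waits for HStore.StoreFlusherImpl.commit to be called.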
flush.flushCache(status);
}
// Switch snapshot (in memstore) -> new hfile (thus causing
// all the store scanners to reset/reseek).
for (StoreFlushContext flush : storeFlushCtxs) {
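// HStore.StoreFlusherImpl.commit moves the newly flushed HFile from .tmp into the column family directory,
// creates a StoreFile and Reader for the HFile, and adds the StoreFile to the HStore's storefiles list.
// It then clears HStore.memstore.snapshot.
// The compaction policy configured by hbase.hstore.defaultengine.compactionpolicy.class
// (default ExploringCompactionPolicy) checks whether a compaction is needed.
// hbase.hstore.compaction.min sets the minimum number of files for a compaction (default 3);
// older versions use hbase.hstore.compactionThreshold. The value cannot be less than 2.
// If the number of store files in the store minus the number currently being compacted is greater than or equal to that value,
// a compaction is needed.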
boolean needsCompaction = flush.commit(status);
if (needsCompaction) {
compactionRequested = true;
}
}
storeFlushCtxs.clear();
// Set down the memstore size by amount of flush.
this.addAndGetGlobalMemstoreSize(-flushsize);
} catch (Throwable t) {
// An exception here means that the snapshot was not persisted.
// The hlog needs to be replayed so its content is restored to memstore.
// Currently, only a server restart will do this.
// We used to only catch IOEs but its possible that we'd get other
// exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
// all and sundry.
if (wal != null) {
wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
Bytes.toStringBinary(getRegionName()));
dse.initCause(t);
status.abort("Flush failed: " + StringUtils.stringifyException(t));
throw dse;
}
// If we get to here, the HStores have been written.
if (wal != null) {
// Remove this region's flushing seqid from FSHLog.oldestFlushingSeqNums.
wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
// Record latest flush time
// Update the region's last flush time.
this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
// Update the last flushed sequence id for region
if (this.rsServices != null) {
// Set completeSequenceId to the wal seqid of the most recent flush.
completeSequenceId = flushSeqId;
}
// C. Finally notify anyone waiting on memstore to clear:
// e.g. checkResources().
synchronized (this) {
notifyAll(); // FindBugs NN_NAKED_NOTIFY
}
long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
long memstoresize = this.memstoreSize.get();
String msg = "Finished memstore flush of ~" +
StringUtils.humanReadableInt(flushsize) + "/" + flushsize +
", currentsize=" +
StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
" for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
", compaction requested=" + compactionRequested +
((wal == null)? "; wal=null": "");
LOG.info(msg);
status.setStatus(msg);
this.recentFlushes.add(new Pair<Long,Long>(time/1000, flushsize));
// Return whether a compaction is needed.
return compactionRequested;
}
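The core of internalFlushcache is a two-phase pattern: take a snapshot while writers are briefly blocked, then persist the snapshot while writers continue. The sketch below shows only that pattern. SnapshotFlushSketch, its lists of strings and its method names are hypothetical, the "file" is an in-memory list, and the WAL, MVCC and sequence-id handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of snapshot-then-flush; not the HBase implementation. */
final class SnapshotFlushSketch {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
  private List<String> active = new ArrayList<>();   // plays the role of kvset
  private List<String> snapshot = new ArrayList<>(); // plays the role of the memstore snapshot
  private final List<List<String>> flushedFiles = new ArrayList<>(); // stands in for HFiles

  /** Writers share the read lock, so they only block while a snapshot is taken. */
  void put(String kv) {
    updatesLock.readLock().lock();
    try {
      synchronized (this) {
        active.add(kv);
      }
    } finally {
      updatesLock.readLock().unlock();
    }
  }

  /** Phase 1 (prepare): swap the active set into the snapshot under the write lock. */
  void prepare() {
    updatesLock.writeLock().lock();
    try {
      synchronized (this) {
        snapshot = active;
        active = new ArrayList<>();
      }
    } finally {
      updatesLock.writeLock().unlock();
    }
  }

  /** Phase 2 (flush and commit): persist the snapshot, then clear it. */
  synchronized void flushAndCommit() {
    if (!snapshot.isEmpty()) {
      flushedFiles.add(snapshot);
    }
    snapshot = new ArrayList<>();
  }

  synchronized int activeSize() { return active.size(); }
  synchronized int fileCount() { return flushedFiles.size(); }
  synchronized List<String> lastFile() { return flushedFiles.get(flushedFiles.size() - 1); }
}
```

A write that arrives between prepare and flushAndCommit lands in the new active set, so it is not part of the file being written and is picked up by the next flush.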
Flush when a region's MemStore reaches its configured size
This kind of flush request is issued when the region's memstore size reaches the configured upper limit.
It is started through MemStoreFlusher.FlushHandler.run --> flushRegion(final FlushRegionEntry fqe).
private boolean flushRegion(final FlushRegionEntry fqe) {
HRegion region = fqe.region;
// If the region is not a meta region and one of its stores has reached the store file limit
// set by hbase.hstore.blockingStoreFiles (default 7):
if (!region.getRegionInfo().isMetaRegion() &&
isTooManyStoreFiles(region)) {
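// Check whether the flush request has waited longer than the allowed time; if so, log a message and go ahead with the flush.
// The wait time is set by hbase.hstore.blockingWaitTime (default 90000 ms).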
if (fqe.isMaximumWait(this.blockingWaitTime)) {
LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
"ms on a compaction to clean up 'too many store files'; waited " +
"long enough... proceeding with flush of " +
region.getRegionNameAsString());
} else {
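// The flush request has not yet waited the maximum acceptable time,
// and it has not been requeued before (put back in the queue).
// flushQueue orders FlushRegionEntry items by expiry time, which by default is first in, first out
// unless FlushRegionEntry.requeue has been called to set an explicit expiry time.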
// If this is first time we've been put off, then emit a log message.
if (fqe.getRequeueCount() <= 0) {
// Note: We don't impose blockingStoreFiles constraint on meta regions
LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +
"store files; delaying flush up to " + this.blockingWaitTime + "ms");
// Check whether a split request should be issued; if so issue it, otherwise issue a compaction request.
if (!this.server.compactSplitThread.requestSplit(region)) {
try {
// Issue a compaction request because the store has too many files.
// Whether compaction runs can be controlled with COMPACTION_ENABLED (TRUE/FALSE) when creating the table.
this.server.compactSplitThread.requestSystemCompaction(
region, Thread.currentThread().getName());
} catch (IOException e) {
LOG.error(
"Cache flush failed for region " + Bytes.toStringBinary(region.getRegionName()),
RemoteExceptionHandler.checkIOException(e));
}
}
}
// Put back on the queue. Have it come back out of the queue
// after a delay of this.blockingWaitTime / 100 ms.
// Requeue this flush request in flushQueue so that it runs again after 900 ms by default.
this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
// Tell a lie, it's not flushed but it's ok
return true;
}
}
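// Run the flush with the global-flush argument set to false, meaning the memstore size reached its configured limit.
// The flow is not repeated here; see the MemStoreFlusher.flushRegion (global flush) analysis.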
return flushRegion(region, false);
}
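The delay-and-requeue behaviour can be sketched with a java.util.concurrent.Delayed entry. RequeueEntrySketch is a hypothetical class that mirrors the method names used above (isMaximumWait, requeue, getRequeueCount), but its fields and bodies are assumptions for illustration, not the FlushRegionEntry source.

```java
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a flush entry that can be pushed back in a DelayQueue. */
final class RequeueEntrySketch implements Delayed {
  final String region;
  private final long createTime = System.currentTimeMillis();
  private long whenToExpire = createTime; // due immediately until requeued
  private int requeueCount = 0;

  RequeueEntrySketch(String region) {
    this.region = region;
  }

  /** True once the entry has been waiting longer than maxWaitMs since creation. */
  boolean isMaximumWait(long maxWaitMs) {
    return System.currentTimeMillis() - createTime > maxWaitMs;
  }

  /** Pushes the expiry time delayMs into the future and counts the requeue. */
  RequeueEntrySketch requeue(long delayMs) {
    whenToExpire = System.currentTimeMillis() + delayMs;
    requeueCount++;
    return this;
  }

  int getRequeueCount() {
    return requeueCount;
  }

  @Override
  public long getDelay(TimeUnit unit) {
    return unit.convert(whenToExpire - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
  }

  @Override
  public int compareTo(Delayed other) {
    return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
  }
}
```

With the default hbase.hstore.blockingWaitTime of 90000 ms, the delay passed to requeue is 90000 / 100 = 900 ms, so a blocked flush is retried roughly every 900 ms until the store file count drops or the 90-second limit is reached.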