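The code HDFS uses to reconstruct erasure-coded (EC) blocks is neatly designed, and this post summarizes the ideas behind it.
We start by walking through how a DataNode (DN) reconstructs a block.
When the DN receives a command from the NameNode (NN) and that command is a BlockECReconstructionCommand, the DN starts the reconstruction work.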
// BPOfferService.java
case DatanodeProtocol.DNA_ERASURE_CODING_RECONSTRUCTION:
  LOG.info("DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY");
  Collection<BlockECReconstructionInfo> ecTasks =
      ((BlockECReconstructionCommand) cmd).getECTasks(); //1
  dn.getErasureCodingWorker().processErasureCodingTasks(ecTasks); //2
  break;
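Let's look at the structure of the objects to be reconstructed, Collection<BlockECReconstructionInfo> ecTasks. Each entry carries the block group, the source and target DNs, and related information: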
public static class BlockECReconstructionInfo {
  private final ExtendedBlock block;
  private final DatanodeInfo[] sources;
  private DatanodeInfo[] targets;
  private String[] targetStorageIDs;
  private StorageType[] targetStorageTypes;
  private final byte[] liveBlockIndices;
  private final ErasureCodingPolicy ecPolicy;
  ...
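To make liveBlockIndices more concrete, here is a minimal sketch of how the missing internal blocks of a block group can be derived from the live ones. It assumes an RS-6-3 style layout in which indices 0..5 are data blocks and 6..8 are parity blocks; the class and method names (LiveIndexSketch, missingIndices) are made up for illustration and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveIndexSketch {
  /** Returns the internal-block indices that do not appear in liveBlockIndices. */
  static List<Integer> missingIndices(byte[] liveBlockIndices,
      int dataUnits, int parityUnits) {
    int total = dataUnits + parityUnits;
    BitSet live = new BitSet(total);
    for (byte idx : liveBlockIndices) {
      live.set(idx);
    }
    List<Integer> missing = new ArrayList<>();
    for (int i = 0; i < total; i++) {
      if (!live.get(i)) {
        missing.add(i);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Assumed layout: 6 data blocks (0..5) and 3 parity blocks (6..8).
    byte[] live = {0, 1, 3, 4, 5, 6, 7, 8};
    System.out.println(missingIndices(live, 6, 3)); // prints [2]
  }
}
```

In this example the block at index 2 is the one that needs to be rebuilt and written to a target DN.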
dn.getErasureCodingWorker() returns the DataNode's field private ErasureCodingWorker ecWorker. So what role does ecWorker play? Here is its definition:
/**
 * ErasureCodingWorker handles the erasure coding reconstruction work commands.
 * These commands would be issued from Namenode as part of Datanode's heart beat
 * response. BPOfferService delegates the work to this class for handling EC
 * commands.
 */
public final class ErasureCodingWorker {
  private static final Logger LOG = DataNode.LOG;
  private final DataNode datanode;
  private final Configuration conf;
  private final float xmitWeight;
  private ThreadPoolExecutor stripedReconstructionPool;
  private ThreadPoolExecutor stripedReadPool;
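ErasureCodingWorker handles EC reconstruction commands, which the NN sends to the DN as part of the heartbeat response.
The figure below shows where ECWorker sits in the overall architecture: it serves the DN's block reconstruction and recovery work.
Thread pools
Let's look at the two thread pools in ecWorker: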
private ThreadPoolExecutor stripedReconstructionPool;
private ThreadPoolExecutor stripedReadPool;
Here is how the first pool is initialized:
// ErasureCodingWorker#initializeStripedReadThreadPool
private void initializeStripedReadThreadPool() {
  LOG.debug("Using striped reads");
  // Essentially, this is a cachedThreadPool.
  stripedReadPool = new ThreadPoolExecutor(0, Integer.MAX_VALUE,
      60, TimeUnit.SECONDS,
      new SynchronousQueue<>(),
      new Daemon.DaemonFactory() {
        private final AtomicInteger threadIndex = new AtomicInteger(0);
        @Override
        public Thread newThread(Runnable r) {
          Thread t = super.newThread(r);
          t.setName("stripedRead-" + threadIndex.getAndIncrement());
          return t;
        }
      },
      new ThreadPoolExecutor.CallerRunsPolicy() {
        @Override
        public void rejectedExecution(Runnable runnable,
            ThreadPoolExecutor e) {
          LOG.info("Execution for striped reading rejected, "
              + "Executing in current thread");
          // will run in the current thread
          super.rejectedExecution(runnable, e);
        }
      });
  stripedReadPool.allowCoreThreadTimeOut(true);
}
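The queue is a SynchronousQueue, which has no capacity at all (it is not an unbounded queue): every submitted task must be handed directly to a thread. Because maximumPoolSize is Integer.MAX_VALUE, a new thread is created whenever no idle thread is waiting, so what is effectively unbounded here is the number of threads, not the queue. A thread factory gives the threads names of the form stripedRead-N. The rejection handler is a custom subclass of CallerRunsPolicy: it logs a message and then runs the rejected task in the submitting thread. By default core threads never time out; they do so only when allowCoreThreadTimeOut is set to true. Without that flag, a thread that has been idle for keepAliveTime (60s above) exits until the thread count drops to corePoolSize; with allowCoreThreadTimeOut(true), idle threads keep exiting until the count reaches 0. Since corePoolSize is already 0 here, this pool shrinks to zero idle threads either way.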
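The hand-off behavior described above can be observed with a small standalone sketch. It uses only java.util.concurrent and the same constructor arguments (0, Integer.MAX_VALUE, 60s, SynchronousQueue), but leaves out Hadoop's Daemon.DaemonFactory and the logging handler; CachedStylePoolSketch and observe are hypothetical names for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedStylePoolSketch {
  /** Runs `tasks` blocking tasks and returns {poolSize, queueSize} while all are busy. */
  static int[] observe(int tasks) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(0, Integer.MAX_VALUE,
        60, TimeUnit.SECONDS, new SynchronousQueue<>());
    CountDownLatch started = new CountDownLatch(tasks);
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          started.countDown();
          try {
            release.await(); // keep the worker busy until we have measured
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      started.await(); // every task is now running on its own thread
      int[] result = {pool.getPoolSize(), pool.getQueue().size()};
      release.countDown();
      pool.shutdown();
      pool.awaitTermination(10, TimeUnit.SECONDS);
      return result;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] r = observe(4);
    System.out.println("poolSize=" + r[0] + " queueSize=" + r[1]);
    // prints poolSize=4 queueSize=0
  }
}
```

Four concurrent tasks produce four threads and nothing is ever queued, which is why the source comment calls this "essentially a cachedThreadPool".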
The second pool is initialized as follows:
// ErasureCodingWorker#initializeStripedBlkReconstructionThreadPool
private void initializeStripedBlkReconstructionThreadPool(int numThreads) {
  LOG.debug("Using striped block reconstruction; pool threads={}", numThreads);
  stripedReconstructionPool = DFSUtilClient.getThreadPoolExecutor(2,
      numThreads, 60, new LinkedBlockingQueue<>(),
      "StripedBlockReconstruction-", false);
  stripedReconstructionPool.allowCoreThreadTimeOut(true);
}
// The call above goes through a getThreadPoolExecutor helper kept in a util class, shown below:
// DFSUtilClient#getThreadPoolExecutor
public static ThreadPoolExecutor getThreadPoolExecutor(
    int corePoolSize,
    int maxPoolSize,
    long keepAliveTimeSecs,
    BlockingQueue<Runnable> queue,
    String threadNamePrefix,
    boolean runRejectedExec) {
  Preconditions.checkArgument(corePoolSize > 0);
  ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(corePoolSize,
      maxPoolSize, keepAliveTimeSecs, TimeUnit.SECONDS,
      queue, new Daemon.DaemonFactory() {
        private final AtomicInteger threadIndex = new AtomicInteger(0);
        @Override
        public Thread newThread(Runnable r) {
          Thread t = super.newThread(r);
          t.setName(threadNamePrefix + threadIndex.getAndIncrement());
          return t;
        }
      });
  if (runRejectedExec) {
    threadPoolExecutor.setRejectedExecutionHandler(new ThreadPoolExecutor
        .CallerRunsPolicy() {
      @Override
      public void rejectedExecution(Runnable runnable,
          ThreadPoolExecutor e) {
        LOG.info(threadNamePrefix + " task is rejected by " +
            "ThreadPoolExecutor. Executing it in current thread.");
        // will run in the current thread
        super.rejectedExecution(runnable, e);
      }
    });
  }
  return threadPoolExecutor;
}
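The helper wraps the same kind of construction, so the overall setup is similar. Because runRejectedExec is passed as false, no rejection handler is installed and the ThreadPoolExecutor default applies. Note that a ThreadPoolExecutor only creates threads beyond corePoolSize when its queue refuses an offer. With the unbounded LinkedBlockingQueue and the arguments shown above, that never happens: the pool runs at most corePoolSize (2) threads, extra tasks wait in the queue, and the default policy can only fire once the pool has been shut down. The default handler in ThreadPoolExecutor is: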
/**
 * The default rejected execution handler
 */
private static final RejectedExecutionHandler defaultHandler = new AbortPolicy();
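How core size, maximum size, and queue type interact is easy to get wrong, so here is a standalone sketch using plain java.util.concurrent. It builds a pool with the same shape as the reconstruction pool above (core 2, a larger maximum, an unbounded LinkedBlockingQueue) and measures it under load. CorePoolGrowthSketch and observe are hypothetical names, and the maximum of 8 is an arbitrary stand-in for numThreads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CorePoolGrowthSketch {
  /** Submits `tasks` blocking tasks (tasks >= core) and returns {poolSize, queued}. */
  static int[] observe(int core, int max, int tasks) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(core, max,
        60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    pool.allowCoreThreadTimeOut(true);
    CountDownLatch started = new CountDownLatch(core);
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          started.countDown();
          try {
            release.await(); // hold the worker so the rest must queue
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      started.await(); // the first `core` tasks are running
      int[] result = {pool.getPoolSize(), pool.getQueue().size()};
      release.countDown();
      pool.shutdown();
      pool.awaitTermination(10, TimeUnit.SECONDS);
      return result;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] r = observe(2, 8, 6);
    System.out.println("poolSize=" + r[0] + " queued=" + r[1]);
    // prints poolSize=2 queued=4
  }
}
```

Even with a maximum of 8, only the 2 core threads are created and the other 4 tasks sit in the queue. Nothing is rejected, so the AbortPolicy is never consulted while the pool is running.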
With the pools initialized, here is how tasks are submitted to the reconstruction pool:
public void processErasureCodingTasks(
    Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
    int xmitsSubmitted = 0;
    try {
      StripedReconstructionInfo stripedReconInfo =
          new StripedReconstructionInfo(
          reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
          reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
          reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
          reconInfo.getTargetStorageIDs());
      // It may throw IllegalArgumentException from task#stripedReader
      // constructor.
      final StripedBlockReconstructor task =
          new StripedBlockReconstructor(this, stripedReconInfo);
      if (task.hasValidTargets()) {
        // See HDFS-12044. We increase xmitsInProgress even the task is only
        // enqueued, so that
        //   1) NN will not send more tasks than what DN can execute and
        //   2) DN will not throw away reconstruction tasks, and instead keeps
        //      an unbounded number of tasks in the executor's task queue.
        xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
        getDatanode().incrementXmitsInProcess(xmitsSubmitted);
        stripedReconstructionPool.submit(task);
      } else {
        LOG.warn("No missing internal block. Skip reconstruction for task:{}",
            reconInfo);
      }
    } catch (Throwable e) {
      getDatanode().decrementXmitsInProgress(xmitsSubmitted);
      LOG.warn("Failed to reconstruct striped block {}",
          reconInfo.getExtendedBlock().getLocalBlock(), e);
    }
  }
}
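The xmits arithmetic in the method above is worth trying with a few numbers. The sketch below reproduces only the expression Math.max((int)(xmits * xmitWeight), 1) and charges a counter at enqueue time. The AtomicInteger is a hypothetical stand-in for the DataNode's xmitsInProgress, and the weights used are arbitrary example values, not Hadoop defaults.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class XmitAccountingSketch {
  // Hypothetical stand-in for the DataNode's xmitsInProgress counter.
  static final AtomicInteger xmitsInProgress = new AtomicInteger();

  /** Same arithmetic as in processErasureCodingTasks: scale, truncate, floor at 1. */
  static int weightedXmits(int taskXmits, float xmitWeight) {
    return Math.max((int) (taskXmits * xmitWeight), 1);
  }

  /** Charges the counter when a task is enqueued and returns the amount charged. */
  static int chargeOnSubmit(int taskXmits, float xmitWeight) {
    int charged = weightedXmits(taskXmits, xmitWeight);
    xmitsInProgress.addAndGet(charged);
    return charged;
  }

  public static void main(String[] args) {
    System.out.println(weightedXmits(6, 0.5f)); // prints 3
    System.out.println(weightedXmits(1, 0.5f)); // prints 1 (0 is raised to 1)
    chargeOnSubmit(6, 0.5f);
    chargeOnSubmit(1, 0.5f);
    System.out.println(xmitsInProgress.get()); // prints 4
  }
}
```

The floor of 1 means every enqueued task counts against the DN's load, however small the weight, which matches the intent stated in the HDFS-12044 comment: the NN sees queued work and does not oversend.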
Once a task is submitted to the pool, the work is handed to run() of StripedBlockReconstructor, which implements Runnable.
// StripedBlockReconstructor#run
public void run() {
  try {
    initDecoderIfNecessary();
    getStripedReader().init();
    stripedWriter.init();
    reconstruct();
    stripedWriter.endTargetBlocks();
    // Currently we don't check the acks for packets, this is similar as
    // block replication.
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    getDatanode().decrementXmitsInProgress(getXmits());
    final DataNodeMetrics metrics = getDatanode().getMetrics();
    metrics.incrECReconstructionTasks();
    metrics.incrECReconstructionBytesRead(getBytesRead());
    metrics.incrECReconstructionRemoteBytesRead(getRemoteBytesRead());
    metrics.incrECReconstructionBytesWritten(getBytesWritten());
    getStripedReader().close();
    stripedWriter.close();
    cleanup();
  }
}
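The shape of run() above is a try/catch/finally in which the failure metric is bumped only on error, while the xmits release and the task counter always execute. A minimal sketch of that pattern, with plain AtomicInteger counters standing in for the DataNode metrics (all names here are hypothetical), might look like this:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReconstructionLifecycleSketch {
  // Hypothetical stand-ins for DataNode state and metrics.
  static final AtomicInteger xmitsInProgress = new AtomicInteger();
  static final AtomicInteger totalTasks = new AtomicInteger();
  static final AtomicInteger failedTasks = new AtomicInteger();

  /** Builds a task that optionally fails but always releases its xmits. */
  static Runnable task(int xmits, boolean fail) {
    return () -> {
      try {
        if (fail) {
          throw new IllegalStateException("simulated read failure");
        }
      } catch (Throwable e) {
        failedTasks.incrementAndGet(); // only counted on failure
      } finally {
        xmitsInProgress.addAndGet(-xmits); // always released
        totalTasks.incrementAndGet();      // always counted
      }
    };
  }

  /** Runs one successful and one failing task; returns {xmits, total, failed}. */
  static int[] runBoth() {
    xmitsInProgress.set(3 + 2); // charged at submit time
    totalTasks.set(0);
    failedTasks.set(0);
    task(3, false).run();
    task(2, true).run();
    return new int[] {xmitsInProgress.get(), totalTasks.get(), failedTasks.get()};
  }

  public static void main(String[] args) {
    int[] r = runBoth();
    System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints 0 2 1
  }
}
```

Because the release sits in finally, a failed reconstruction cannot leave the counter permanently inflated, which would otherwise make the DN look busier than it is.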