The source code discussed here is from hadoop-3.3.0.
1 Overview
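The DataNode class encapsulates the entire data node logic. It manages all blocks on the node's storage through DataStorage and FsDatasetImpl, and it serves block reads, block writes, and block replication to clients and other DataNodes over a streaming interface. DataNode also implements InterDatanodeProtocol and ClientDatanodeProtocol, so it can accept RPC requests from other DataNodes and from clients. Through its BlockPoolManager, DataNode periodically sends heartbeats, block reports, incremental block reports, and cache reports to the NameNode, and executes the commands the NameNode sends back. DataNode also holds a BlockScanner (DataBlockScanner in older releases), which periodically verifies the blocks on storage, and a DirectoryScanner, which checks that the blocks on disk agree with the blocks tracked in memory.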
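A cluster may contain thousands of DataNodes. Each of them communicates with the NameNode at a fixed interval; this is the heartbeat.
When a DataNode starts, it scans its local disks and reports the blocks it holds to the NameNode. The NameNode keeps the received block information, together with which DataNode holds each block, in memory.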
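After starting, a DataNode registers with the NameNode. Once registration succeeds, it periodically sends a full block report. The interval is set by dfs.blockreport.intervalMsec, which defaults to 6 hours (21600000 ms) in hadoop-3.3.0, not the 1 hour used by early Hadoop releases.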
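From then on, the DataNode stays in touch with the NameNode by sending a heartbeat every 3 seconds (dfs.heartbeat.interval). The heartbeat response carries commands from the NameNode.
Typical commands are: replicate a block to another DataNode, delete a block, and so on.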
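A rough sketch of how the commands in a heartbeat response could be turned into work on the DataNode side. The class, enum, and method names below are hypothetical and exist only for illustration; in Hadoop the command types are the DNA_* constants of DatanodeProtocol (for example DNA_TRANSFER and DNA_INVALIDATE), and the real handling is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of heartbeat command handling; not Hadoop code. */
final class HeartbeatSketch {
    /** Illustrative command kinds, loosely modelled on DNA_TRANSFER and DNA_INVALIDATE. */
    enum Action { TRANSFER, INVALIDATE }

    /** One command carried by a heartbeat response. */
    static final class Command {
        final Action action;
        final long blockId;

        Command(Action action, long blockId) {
            this.action = action;
            this.blockId = blockId;
        }
    }

    /** Walks the commands of one heartbeat response and returns a log of the work done. */
    static List<String> handle(List<Command> commands) {
        List<String> log = new ArrayList<>();
        for (Command c : commands) {
            switch (c.action) {
                case TRANSFER:
                    log.add("replicate block " + c.blockId);
                    break;
                case INVALIDATE:
                    log.add("delete block " + c.blockId);
                    break;
                default:
                    throw new IllegalArgumentException("unknown action " + c.action);
            }
        }
        return log;
    }
}
```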
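If the NameNode receives no heartbeat from a DataNode for about 10 minutes (10 minutes 30 seconds with the default settings), it considers the node dead and re-replicates the blocks it held to other DataNodes.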
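The dead-node timeout is not a single setting. The NameNode derives it as 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval. The sketch below reproduces only that arithmetic, assuming the defaults of 5 minutes and 3 seconds; the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper, not part of Hadoop: reproduces the dead-node timeout arithmetic. */
final class DeadNodeTimeout {
    // Assumed defaults from hdfs-default.xml; a cluster may override both.
    static final long RECHECK_INTERVAL_MS = 5 * 60 * 1000L; // dfs.namenode.heartbeat.recheck-interval
    static final long HEARTBEAT_INTERVAL_S = 3L;            // dfs.heartbeat.interval

    /** 2 * recheck interval + 10 * heartbeat interval, in milliseconds. */
    static long expireIntervalMs(long recheckMs, long heartbeatSeconds) {
        return 2 * recheckMs + 10 * 1000L * heartbeatSeconds;
    }

    public static void main(String[] args) {
        Duration d = Duration.ofMillis(
            expireIntervalMs(RECHECK_INTERVAL_MS, HEARTBEAT_INTERVAL_S));
        // prints "10 min 30 s"
        System.out.println(d.toMinutes() + " min " + (d.getSeconds() % 60) + " s");
    }
}
```

Lowering the heartbeat interval alone barely changes the result, because the recheck interval dominates the sum.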
2 DataNode startup
DataNode startup begins in the main() method.
public static void main(String args[]) {
  if (DFSUtil.parseHelpArgument(args, DataNode.USAGE, System.out, true)) {
    System.exit(0);
  }
  secureMain(args, null);
}

public static void secureMain(String args[], SecureResources resources) {
  int errorCode = 0;
  try {
    StringUtils.startupShutdownMessage(DataNode.class, args, LOG);
    // Create the DataNode
    DataNode datanode = createDataNode(args, null, resources);
    if (datanode != null) {
      // Block until the DataNode shuts down
      datanode.join();
    } else {
      errorCode = 1;
    }
  } catch (Throwable e) {
    LOG.error("Exception in secureMain", e);
    terminate(1, e);
  } finally {
    // We need to terminate the process here because either shutdown was called
    // or some disk related conditions like volumes tolerated or volumes required
    // condition was not met. Also, In secure mode, control will go to Jsvc
    // and Datanode process hangs if it does not exit.
    LOG.warn("Exiting Datanode");
    terminate(errorCode);
  }
}
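The control flow of secureMain() can be reduced to a small model: the process always terminates, and the exit code depends on whether a DataNode was created and whether anything was thrown. This is a simplified sketch, not the Hadoop implementation; the Daemon interface and the run() method are hypothetical, and the sketch returns the exit code where the real code calls terminate().

```java
import java.util.function.Supplier;

/** Hypothetical model of the secureMain() exit-code logic; not Hadoop code. */
final class SecureMainSketch {
    /** Stand-in for a running DataNode. */
    interface Daemon {
        void join() throws InterruptedException;
    }

    /** Returns the exit code the process would terminate with. */
    static int run(Supplier<Daemon> factory) {
        int errorCode = 0;
        try {
            Daemon daemon = factory.get();
            if (daemon != null) {
                daemon.join();  // blocks until the daemon shuts down
            } else {
                errorCode = 1;  // nothing was created, e.g. the arguments were invalid
            }
        } catch (Throwable t) {
            return 1;           // the real code calls terminate(1, e) here
        }
        return errorCode;       // the real code calls terminate(errorCode) in the finally block
    }
}
```

A factory that returns null yields exit code 1, a daemon whose join() returns normally yields 0, and any exception during creation or join yields 1.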
createDataNode() first calls the static method instantiateDataNode() to create the DataNode instance, then calls runDatanodeDaemon() to start the services running on the DataNode.
/**
 * Instantiate & Start a single datanode daemon and wait for it to
 * finish.
 * If this thread is specifically interrupted, it will stop waiting.
 */
@VisibleForTesting
@InterfaceAudience.Private
public static DataNode createDataNode(String args[], Configuration conf,
    SecureResources resources) throws IOException {
  // Instantiate the DataNode
  DataNode dn = instantiateDataNode(args, conf, resources);
  if (dn != null) {
    // Start the services on the DataNode
    dn.runDatanodeDaemon();
  }
  return dn;
}
/**
 * Instantiate a single datanode object, along with its secure resources.
 * This must be run by invoking{@link DataNode#runDatanodeDaemon()}
 * subsequently.
 */
public static DataNode instantiateDataNode(String args [], Configuration conf,
    SecureResources resources) throws IOException {
  if (conf == null)
    conf = new HdfsConfiguration();

  if (args != null) {
    // parse generic hadoop options
    GenericOptionsParser hParser = new GenericOptionsParser(conf, args);
    args = hParser.getRemainingArgs();
  }

  if (!parseArguments(args, conf)) {
    printUsage(System.err);
    return null;
  }
  // Resolve the configured storage locations
  Collection<StorageLocation> dataLocations = getStorageLocations(conf);
  // Set up user and group information
  UserGroupInformation.setConfiguration(conf);
  // Log in through the security layer, e.g. Kerberos
  SecurityUtil.login(conf, DFS_DATANODE_KEYTAB_FILE_KEY,
      DFS_DATANODE_KERBEROS_PRINCIPAL_KEY, getHostName(conf));
  return makeInstance(dataLocations, conf, resources);
}
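The static method instantiateDataNode() first parses the DataNode startup arguments and reads all storage directories defined in the DataNode configuration, then calls the static method makeInstance(). makeInstance() makes sure that at least one of the configured storage directories is usable, then calls the DataNode constructor to create the instance.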
/**
 * Make an instance of DataNode after ensuring that at least one of the
 * given data directories (and their parent directories, if necessary)
 * can be created.
 * @param dataDirs List of directories, where the new DataNode instance should
 * keep its files.
 * @param conf Configuration instance to use.
 * @param resources Secure resources needed to run under Kerberos
 * @return DataNode instance for given list of data dirs and conf, or null if
 * no directory from this directory list can be created.
 * @throws IOException
 */
static DataNode makeInstance(Collection<StorageLocation> dataDirs,
    Configuration conf, SecureResources resources) throws IOException {
  List<StorageLocation> locations;
  StorageLocationChecker storageLocationChecker =
      new StorageLocationChecker(conf, new Timer());
  try {
    locations = storageLocationChecker.check(conf, dataDirs);
  } catch (InterruptedException ie) {
    throw new IOException("Failed to instantiate DataNode", ie);
  }
  DefaultMetricsSystem.initialize("DataNode");

  assert locations.size() > 0 : "number of data directories should be > 0";
  return new DataNode(conf, locations, storageLocationChecker, resources);
}
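To make the idea of "at least one usable storage directory" concrete, here is a minimal sketch using only java.nio.file. It is not how StorageLocationChecker works internally (the real checker runs its checks with timeouts and honours the failed-volumes tolerance); the class name, the method name, and the choice of an unchecked exception are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a storage directory health check; not Hadoop code. */
final class StorageDirCheckSketch {
    /**
     * Returns the directories that exist (or could be created) and are
     * readable, writable and executable. Fails if none qualifies.
     */
    static List<Path> usable(List<Path> configured) {
        List<Path> healthy = new ArrayList<>();
        for (Path dir : configured) {
            try {
                Files.createDirectories(dir); // also creates missing parent directories
                if (Files.isReadable(dir) && Files.isWritable(dir)
                        && Files.isExecutable(dir)) {
                    healthy.add(dir);
                }
            } catch (IOException e) {
                // This directory is unusable; keep checking the others.
            }
        }
        if (healthy.isEmpty()) {
            throw new IllegalStateException(
                "No usable data directory among " + configured);
        }
        return healthy;
    }
}
```

The point of returning the healthy subset instead of failing on the first bad directory is that a node with several disks can still start when one of them is broken.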
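Finally, the DataNode constructor is called to build the instance:
After initializing a number of parameters from the configuration, the constructor calls startDataNode() to finish initialization. startDataNode() creates the DataStorage, DataXceiverServer, and ShortCircuitRegistry objects, starts the HTTP info server (DatanodeHttpServer), initializes the DataNode's IPC server, and then creates the BlockPoolManager and loads the list of NameNodes defined for each block pool.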
/**
 * Create the DataNode given a configuration, an array of dataDirs,
 * and a namenode proxy.
 */
DataNode(final Configuration conf,
         final List<StorageLocation> dataDirs,
         final StorageLocationChecker storageLocationChecker,
         final SecureResources resources) throws IOException {
  super(conf);
  this.tracer = createTracer(conf);
  this.tracerConfigurationManager =
      new TracerConfigurationManager(DATANODE_HTRACE_PREFIX, conf);
  this.fileIoProvider = new FileIoProvider(conf, this);
  this.blockScanner = new BlockScanner(this);
  this.lastDiskErrorCheck = 0;
  this.maxNumberOfBlocksToLog = conf.getLong(DFS_MAX_NUM_BLOCKS_TO_LOG_KEY,
      DFS_MAX_NUM_BLOCKS_TO_LOG_DEFAULT);

  this.usersWithLocalPathAccess = Arrays.asList(
      conf.getTrimmedStrings(DFSConfigKeys.DFS_BLOCK_LOCAL_PATH_ACCESS_USER_KEY));
  this.connectToDnViaHostname = conf.getBoolean(
      DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME,
      DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME_DEFAULT);
  this.supergroup = conf.get(DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_KEY,
      DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_DEFAULT);
  this.isPermissionEnabled = conf.getBoolean(
      DFSConfigKeys.DFS_PERMISSIONS_ENABLED_KEY,
      DFSConfigKeys.DFS_PERMISSIONS_ENABLED_DEFAULT);
  this.pipelineSupportECN = conf.getBoolean(
      DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED,
      DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED_DEFAULT);

  confVersion = "core-" +
      conf.get("hadoop.common.configuration.version", "UNSPECIFIED") +
      ",hdfs-" +
      conf.get("hadoop.hdfs.configuration.version", "UNSPECIFIED");

  this.volumeChecker = new DatasetVolumeChecker(conf, new Timer());
  this.xferService =
      HadoopExecutors.newCachedThreadPool(new Daemon.DaemonFactory());

  // Short-circuit reads (file descriptor passing) are off by default.
  // Determine whether we should try to pass file descriptors to clients.
  if (conf.getBoolean(HdfsClientConfigKeys.Read.ShortCircuit.KEY,
      HdfsClientConfigKeys.Read.ShortCircuit.DEFAULT)) {
    String reason = DomainSocket.getLoadingFailureReason();
    if (reason != null) {
      LOG.warn("File descriptor passing is disabled because {}", reason);
      this.fileDescriptorPassingDisabledReason = reason;
    } else {
      LOG.info("File descriptor passing is enabled.");
      this.fileDescriptorPassingDisabledReason = null;
    }
  } else {
    this.fileDescriptorPassingDisabledReason =
        "File descriptor passing was not configured.";
    LOG.debug(this.fileDescriptorPassingDisabledReason);
  }

  this.socketFactory = NetUtils.getDefaultSocketFactory(conf);

  try {
    hostName = getHostName(conf);
    LOG.info("Configured hostname is {}", hostName);
    // Start the DataNode
    startDataNode(dataDirs, resources);
  } catch (IOException ie) {
    shutdown();
    throw ie;
  }
  final int dncCacheMaxSize =
      conf.getInt(DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_KEY,
          DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_DEFAULT) ;
  // Per-host network error counters
  datanodeNetworkCounts =
      CacheBuilder.newBuilder()
          .maximumSize(dncCacheMaxSize)
          .build(new CacheLoader<String, Map<String, Long>>() {
            @Override
            public Map<String, Long> load(String key) throws Exception {
              final Map<String, Long> ret = new HashMap<String, Long>();
              ret.put("networkErrors", 0L);
              return ret;
            }
          });

  // OOB (out of band) messages carry extra information. The DataNode uses
  // them to tell clients that are reading from or writing to it that it is
  // about to restart, so that those clients wait for a timeout and ignore
  // errors caused by the restart instead of immediately running a more
  // expensive error recovery.
  // This call loads the timeout configured for each OOB type.
  initOOBTimeout();
  this.storageLocationChecker = storageLocationChecker;
}
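The datanodeNetworkCounts cache in the constructor lazily creates a counter map per remote host, seeded with networkErrors = 0. The sketch below shows the same lazy-initialisation idea with only the JDK, so it runs without Guava. It is an approximation under stated assumptions: the class and method names are hypothetical, and unlike the Guava cache it has no maximumSize eviction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical JDK-only analogue of the per-host network error counters; not Hadoop code. */
final class NetworkErrorCounts {
    private final Map<String, Map<String, Long>> counts = new ConcurrentHashMap<>();

    /** Returns the counter map for a host, creating it on first access (like CacheLoader.load). */
    Map<String, Long> forHost(String host) {
        return counts.computeIfAbsent(host, h -> {
            Map<String, Long> fresh = new ConcurrentHashMap<>();
            fresh.put("networkErrors", 0L);
            return fresh;
        });
    }

    /** Adds one network error for the given host. */
    void increment(String host) {
        forHost(host).merge("networkErrors", 1L, Long::sum);
    }
}
```

A host that has never failed still reports a count of 0 on first lookup, which is the behaviour the CacheLoader in the constructor provides.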
The most important call in the constructor is startDataNode(), which starts a DataNode with the given configuration:
/**
 * This method starts the data node with the specified conf.
 *
 * If conf's CONFIG_PROPERTY_SIMULATED property is set
 * then a simulated storage based data node is created.
 *
 * @param dataDirectories - only for a non-simulated storage data node
 * @throws IOException
 */
void startDataNode(List<StorageLocation> dataDirectories,
                   SecureResources resources
                   ) throws IOException {

  // settings global for all BPs in the Data Node
  this.secureResources = resources;
  synchronized (this) {
    this.dataDirs = dataDirectories;
  }
  // Initialize the DataNode configuration
  this.dnConf = new DNConf(this);
  // Check the security configuration
  checkSecureConfig(dnConf, getConf(), resources);

  if (dnConf.maxLockedMemory > 0) {
    if (!NativeIO.POSIX.getCacheManipulator().verifyCanMlock()) {
      throw new RuntimeException(String.format(
          "Cannot start datanode because the configured max locked memory" +
          " size (%s) is greater than zero and native code is not available.",
          DFS_DATANODE_MAX_LOCKED_MEMORY_KEY));
    }
    if (Path.WINDOWS) {
      NativeIO.Windows.extendWorkingSetSize(dnConf.maxLockedMemory);
    } else {
      long ulimit = NativeIO.POSIX.getCacheManipulator().getMemlockLimit();
      if (dnConf.maxLockedMemory > ulimit) {
        throw new RuntimeException(String.format(
            "Cannot start datanode because the configured max locked memory" +
            " size (%s) of %d bytes is more than the datanode's available" +
            " RLIMIT_MEMLOCK ulimit of %d bytes.",
            DFS_DATANODE_MAX_LOCKED_MEMORY_KEY,
            dnConf.maxLockedMemory,
            ulimit));
      }
    }
  }
  LOG.info("Starting DataNode with maxLockedMemory = {}",
      dnConf.maxLockedMemory);

  int volFailuresTolerated = dnConf.getVolFailuresTolerated();
  int volsConfigured = dnConf.getVolsConfigured();
  if (volFailuresTolerated < MAX_VOLUME_FAILURE_TOLERATED_LIMIT
      || volFailuresTolerated >= volsConfigured) {
    throw new HadoopIllegalArgumentException("Invalid value configured for "
        + "dfs.datanode.failed.volumes.tolerated - " + volFailuresTolerated
        + ". Value configured is either less than -1 or >= "
        + "to the number of configured volumes (" + volsConfigured + ").");
  }
  // Create the storage component
  storage = new DataStorage();

  // global DN settings
  registerMXBean();
  // Initialize the DataXceiverServer.
  // An estimated block size is kept so the DataNode can check whether a disk
  // partition has enough space. Newer clients pass the expected block size
  // to the DataNode; for older clients the server-side default is used.
  initDataXceiver();
  // Start the HTTP server (DatanodeHttpServer)
  startInfoServer();
  // Start the JVM pause monitor
  pauseMonitor = new JvmPauseMonitor();
  pauseMonitor.init(getConf());
  pauseMonitor.start();

  // BlockPoolTokenSecretManager is required to create ipc server.
  // It manages the block token secrets of each block pool.
  this.blockPoolTokenSecretManager = new BlockPoolTokenSecretManager();

  // Login is done by now. Set the DN user name.
  dnUserName = UserGroupInformation.getCurrentUser().getUserName();
  LOG.info("dnUserName = {}", dnUserName);
  LOG.info("supergroup = {}", supergroup);
  // Initialize the IPC server used by clients and other DataNodes
  initIpcServer();

  metrics = DataNodeMetrics.create(getConf(), getDisplayName());
  peerMetrics = dnConf.peerStatsEnabled ?
      DataNodePeerMetrics.create(getDisplayName(), getConf()) : null;
  metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);
  // Handles erasure coding commands
  ecWorker = new ErasureCodingWorker(getConf(), this);
  // Handles block recovery
  blockRecoveryWorker = new BlockRecoveryWorker(this);
  // Manager of the block pools
  blockPoolManager = new BlockPoolManager(this);
  /*
   * refreshNamenodes: the name undersells it, because it does much more
   * than refresh the NameNodes:
   * 1. For each configured nameservice, decide whether it updates the set
   *    of NNs of an existing NS or is a brand new nameservice.
   * 2. Mark for removal every nameservice we currently have that is no
   *    longer configured.
   * 3. Start the new nameservices.
   * The three steps above run inside a synchronized block.
   * 4. Shut down the obsolete nameservices, outside the synchronized
   *    block, because this can trigger deletions and take a long time.
   * 5. Update the remaining nameservices.
   */
  blockPoolManager.refreshNamenodes(getConf());

  // Create the ReadaheadPool from the DataNode context so we can
  // exit without having to explicitly shutdown its thread pool.
  readaheadPool = ReadaheadPool.getInstance();
  saslClient = new SaslDataTransferClient(dnConf.getConf(),
      dnConf.saslPropsResolver, dnConf.trustedChannelResolver);
  saslServer = new SaslDataTransferServer(dnConf, blockPoolTokenSecretManager);
  startMetricsLogger();

  if (dnConf.diskStatsEnabled) {
    diskMetrics = new DataNodeDiskMetrics(this,
        dnConf.outliersReportIntervalMs);
  }
}
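The classification that refreshNamenodes() performs on nameservices boils down to three set operations: what is configured but not running must be added, what is running but no longer configured must be removed, and what is in both must be refreshed. The sketch below shows only that set arithmetic; the class and field names are hypothetical and the real BlockPoolManager works on per-nameservice NameNode address maps, not bare strings.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the add / remove / refresh split; not Hadoop code. */
final class NameserviceDiff {
    final Set<String> toAdd;     // configured, not yet running
    final Set<String> toRemove;  // running, no longer configured
    final Set<String> toRefresh; // running and still configured

    NameserviceDiff(Set<String> running, Set<String> configured) {
        toAdd = new TreeSet<>(configured);
        toAdd.removeAll(running);

        toRemove = new TreeSet<>(running);
        toRemove.removeAll(configured);

        toRefresh = new TreeSet<>(running);
        toRefresh.retainAll(configured);
    }
}
```

With running = {ns1, ns2} and configured = {ns2, ns3}, the result is toAdd = [ns3], toRemove = [ns1], and toRefresh = [ns2].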
Finally, control returns to createDataNode(), which calls:
dn.runDatanodeDaemon();
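runDatanodeDaemon() starts all threads managed by blockPoolManager, then the DataXceiverServer thread (and the local DataXceiverServer if one exists), then the DataNode's IPC server, and finally any configured plugins.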
/** Start a single datanode daemon and wait for it to finish.
 *  If this thread is specifically interrupted, it will stop waiting.
 */
public void runDatanodeDaemon() throws IOException {
  blockPoolManager.startAll();

  // start dataXceiveServer
  dataXceiverServer.start();
  if (localDataXceiverServer != null) {
    localDataXceiverServer.start();
  }
  ipcServer.setTracer(tracer);
  ipcServer.start();
  startPlugins(getConf());
}
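The start order in runDatanodeDaemon() can be summarised as a small model. This is only a sketch for reading the listing above: the class, the method, and the service labels are hypothetical, and the boolean parameter stands in for the null check on localDataXceiverServer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the runDatanodeDaemon() start order; not Hadoop code. */
final class StartupOrderSketch {
    /** Returns the labels of the components in the order they would be started. */
    static List<String> startOrder(boolean hasLocalXceiver) {
        List<String> started = new ArrayList<>();
        started.add("blockPoolManager");           // block pool service threads first
        started.add("dataXceiverServer");          // then the TCP data transfer server
        if (hasLocalXceiver) {
            started.add("localDataXceiverServer"); // optional second server, started only if present
        }
        started.add("ipcServer");                  // RPC for clients and other DataNodes
        started.add("plugins");                    // configured plugins come last
        return started;
    }
}
```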