第七章:小朱笔记hadoop之源码分析-hdfs分析
第五节:Datanode 分析
5.1 Datanode 启动过程分析
5.2 Datanode 心跳分析
5.3 Datanode 注册分析
5.4 DataBlockScanner 文件校验
5.5 DataNode 数据块接受/发送
5.1 Datanode 启动过程分析
(1)shell脚本启动DataNode
start-dfs.sh
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
(2)main()函数启动分析
主线程阻塞,让DataNode的任务循环执行,调用createDataNode方法创建datanode,等datanode线程结束 。
//主线程阻塞,让DataNode的任务循环执行
public static void secureMain(String [] args, SecureResources resources) {
try {
LOG.info("start up datanode...");
if(null!=args){
for(String arg:args) {
LOG.info("arg:"+arg);
}
}
StringUtils.startupShutdownMessage(DataNode.class, args, LOG);
DataNode datanode = createDataNode(args, null, resources);
if (datanode != null)
datanode.join();
} catch (Throwable e) {
LOG.error(StringUtils.stringifyException(e));
System.exit(-1);
} finally {
// We need to add System.exit here because either shutdown was called or
// some disk related conditions like volumes tolerated or volumes required
// condition was not met. Also, In secure mode, control will go to Jsvc and
// the process hangs without System.exit.
LOG.info("Exiting Datanode");
System.exit(0);
}
}
(3)创建Datanode实例,并启动Datanode线程
调用instantiateDataNode方法初始化datanode
调用runDatanodeDaemon方法运行datanode线程
/** Instantiate & Start a single datanode daemon and wait for it to finish.
* If this thread is specifically interrupted, it will stop waiting.
* LimitedPrivate for creating secure datanodes
*/
public static DataNode createDataNode(String args[],
Configuration conf, SecureResources resources) throws IOException {
//初始化DataNode
DataNode dn = instantiateDataNode(args, conf, resources);
runDatanodeDaemon(dn);
//进行DataNode注册,创建线程,设置守护线程,启动线程
return dn;
}
(4)实例化DataNode结点
解析启动参数:
如果设置了机架配置${dfs.network.script},退出程序
通过配置${dfs.data.dir}得到datanode的存储目录
调用makeInstance方法创建实例
makeInstance检查数据存储目录的合法性 并初始化DataNode对象
DataNode构造函数中调用startDataNode根据具体配置文件的信息进行具体的初始化过程
调用startDataNode方法启动datanode
如果启动出错,调用shutdown方法关闭datanode
/**
* Start a Datanode with specified server sockets for secure environments
* where they are run with privileged ports and injected from a higher
* level of capability
*/
DataNode(final Configuration conf,
final AbstractList<File> dataDirs, SecureResources resources) throws IOException {
super(conf);
SecurityUtil.login(conf, DFSConfigKeys.DFS_DATANODE_KEYTAB_FILE_KEY,
DFSConfigKeys.DFS_DATANODE_USER_NAME_KEY);
datanodeObject = this;
supportAppends = conf.getBoolean("dfs.support.append", false);
this.userWithLocalPathAccess = conf
.get(DFSConfigKeys.DFS_BLOCK_LOCAL_PATH_ACCESS_USER_KEY);
try {
startDataNode(conf, dataDirs, resources);
} catch (IOException ie) {
shutdown();
throw ie;
}
}
(5)创建Datanode实例
(a)获得本地主机名和namenode的地址
(b)连接namenode,本地datanode的名称为:“machineName:port”
(c)从namenode得到version和id信息
(d)初始化存储目录结构,如果有目录没有格式化,对其进行格式化
(e)打开datanode监听端口ss,默认端口是50010
(f)初始化DataXceiverServer后台线程,使用ss接收请求
(g)初始化DataBlockScanner,块的校验只支持FSDataset
(h)初始化并启动datanode信息服务器infoServer,默认访问地址是http://0.0.0.0:5007,如果允许https,默认https端口是50475 infoServer添加DataBlockScanner的Servlet,访问地址是http://0.0.0.0:50075/blockScannerReport .
(i)初始化并启动ipc服务器,用于RPC调用,默认端口是50020
/**
* This method starts the data node with the specified conf.
*
* @param conf - the configuration
* if conf's CONFIG_PROPERTY_SIMULATED property is set
* then a simulated storage based data node is created.
*
* @param dataDirs - only for a non-simulated storage data node
* @throws IOException
* @throws MalformedObjectNameException
* @throws MBeanRegistrationException
* @throws InstanceAlreadyExistsException
*/
void startDataNode(Configuration conf,
AbstractList<File> dataDirs, SecureResources resources
) throws IOException {
if(UserGroupInformation.isSecurityEnabled() && resources == null)
throw new RuntimeException("Cannot start secure cluster without " +
"privileged resources.");
this.secureResources = resources;
// use configured nameserver & interface to get local hostname
//设置machineName
if (conf.get("slave.host.name") != null) {
machineName = conf.get("slave.host.name");
}
if (machineName == null) {
machineName = DNS.getDefaultHost(conf.get("dfs.datanode.dns.interface","default"),conf.get("dfs.datanode.dns.nameserver","default"));
}
//获取nameNode的地址信息
InetSocketAddress nameNodeAddr = NameNode.getServiceAddress(conf, true);
//setSocketout时间
this.socketTimeout = conf.getInt("dfs.socket.timeout",HdfsConstants.READ_TIMEOUT);
this.socketWriteTimeout = conf.getInt("dfs.datanode.socket.write.timeout",HdfsConstants.WRITE_TIMEOUT);
/* Based on results on different platforms, we might need set the default
* to false on some of them. */
this.transferToAllowed = conf.getBoolean("dfs.datanode.transferTo.allowed", true);
//写包的大小,默认64K
this.writePacketSize = conf.getInt("dfs.write.packet.size", 64*1024);
//创建本地socketaddress地址
InetSocketAddress socAddr = DataNode.getStreamingAddr(conf);
int tmpPort = socAddr.getPort();
//DataStorage保存了存储相关的信息
storage = new DataStorage();// Data storage information file. 数据存储信息文件
//构造一个注册器
// construct registration
this.dnRegistration = new DatanodeRegistration(machineName + ":" + tmpPort);
//通过动态代理生成namenode实例
// connect to name node
this.namenode = (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,DatanodeProtocol.versionID,nameNodeAddr, conf);
// get version and id info from the name-node
// 从名称节点获取版本和id信息 主要包含buildVersin和distributeUpgradeVersion,用于版本检验
NamespaceInfo nsInfo = handshake();
StartupOption startOpt = getStartupOption(conf);
assert startOpt != null : "Startup option must be set.";
boolean simulatedFSDataset = conf.getBoolean("dfs.datanode.simulateddatastorage", false);
//判断一下是否是伪分布式,否则走正常判断,此处分析正常逻辑
if (simulatedFSDataset) {
setNewStorageID(dnRegistration);
dnRegistration.storageInfo.layoutVersion = FSConstants.LAYOUT_VERSION;
dnRegistration.storageInfo.namespaceID = nsInfo.namespaceID;
// it would have been better to pass storage as a parameter to
// constructor below - need to augment ReflectionUtils used below.
conf.set("StorageId", dnRegistration.getStorageID());
try {
//Equivalent of following (can't do because Simulated is in test dir)
// this.data = new SimulatedFSDataset(conf);
this.data = (FSDatasetInterface) ReflectionUtils.newInstance(
Class.forName("org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset"), conf);
} catch (ClassNotFoundException e) {
throw new IOException(StringUtils.stringifyException(e));
}
} else { // real storage
// read storage info, lock data dirs and transition fs state if necessary
storage.recoverTransitionRead(nsInfo, dataDirs, startOpt);
// adjust
//将storage进行信息注册
this.dnRegistration.setStorageInfo(storage);
// initialize data node internal structure
//根据storage和conf信息,生成FSDataset,用于数据块操作
this.data = new FSDataset(storage, conf);
}
// register datanode MXBean
this.registerMXBean(conf); // register the MXBean for DataNode
// Allow configuration to delay block reports to find bugs
artificialBlockReceivedDelay = conf.getInt(
"dfs.datanode.artificialBlockReceivedDelay", 0);
// find free port or use privileged port provide
//初始化Socket服务器端,区分NIO和IO
ServerSocket ss;
if(secureResources == null) {
ss = (socketWriteTimeout > 0) ?
ServerSocketChannel.open().socket() : new ServerSocket();
Server.bind(ss, socAddr, 0);
} else {
ss = resources.getStreamingSocket();
}
//设置接收的buffer缓存大小,默认64K
ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);
// adjust machine name with the actual port
tmpPort = ss.getLocalPort();
selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(),
tmpPort);
this.dnRegistration.setName(machineName + ":" + tmpPort);
LOG.info("Opened info server at " + tmpPort);
//服务器 用于接收/发送 一个数据块 。 这是创建监听来自客户或其他 DataNodes 的 请求 。 这种小型服务器不使用 thHadoop IPC机制
//初始化处理类dataXceiverServer
this.threadGroup = new ThreadGroup("dataXceiverServer");
this.dataXceiverServer = new Daemon(threadGroup, new DataXceiverServer(ss, conf, this));
this.threadGroup.setDaemon(true); // auto destroy when empty
//分别设置块状态信息间隔时间和心跳间隔时间
this.blockReportInterval = conf.getLong("dfs.blockreport.intervalMsec", BLOCKREPORT_INTERVAL);
this.initialBlockReportDelay = conf.getLong("dfs.blockreport.initialDelay",BLOCKREPORT_INITIAL_DELAY)* 1000L;
if (this.initialBlockReportDelay >= blockReportInterval) {
this.initialBlockReportDelay = 0;
LOG.info("dfs.blockreport.initialDelay is greater than " +
"dfs.blockreport.intervalMsec." + " Setting initial delay to 0 msec:");
}
this.heartBeatInterval = conf.getLong("dfs.heartbeat.interval", HEARTBEAT_INTERVAL) * 1000L;
DataNode.nameNodeAddr = nameNodeAddr;
//initialize periodic block scanner
String reason = null;
if (conf.getInt("dfs.datanode.scan.period.hours", 0) < 0) {
reason = "verification is turned off by configuration";
} else if ( !(data instanceof FSDataset) ) {
reason = "verifcation is supported only with FSDataset";
}
if ( reason == null ) {
//初始化一个定期检查scanner blockScanner用于定时对文件块进行扫描
blockScanner = new DataBlockScanner(this, (FSDataset)data, conf);
} else {
LOG.info("Periodic Block Verification is disabled because " +
reason + ".");
}
//create a servlet to serve full-file content
//创建HttpServer,内部用jetty实现,用于页面监控
InetSocketAddress infoSocAddr = DataNode.getInfoAddr(conf);
String infoHost = infoSocAddr.getHostName();
int tmpInfoPort = infoSocAddr.getPort();
this.infoServer = (secureResources == null)
? new HttpServer("datanode", infoHost, tmpInfoPort, tmpInfoPort == 0,
conf, SecurityUtil.getAdminAcls(conf, DFSConfigKeys.DFS_ADMIN))
: new HttpServer("datanode", infoHost, tmpInfoPort, tmpInfoPort == 0,
conf, SecurityUtil.getAdminAcls(conf, DFSConfigKeys.DFS_ADMIN),
secureResources.getListener());
if (conf.getBoolean("dfs.https.enable", false)) {
boolean needClientAuth = conf.getBoolean("dfs.https.need.client.auth", false);
InetSocketAddress secInfoSocAddr = NetUtils.createSocketAddr(conf.get(
"dfs.datanode.https.address", infoHost + ":" + 0));
Configuration sslConf = new Configuration(false);
sslConf.addResource(conf.get("dfs.https.server.keystore.resource",
"ssl-server.xml"));
this.infoServer.addSslListener(secInfoSocAddr, sslConf, needClientAuth);
}
this.infoServer.addInternalServlet(null, "/streamFile/*", StreamFile.class);
this.infoServer.addInternalServlet(null, "/getFileChecksum/*",
FileChecksumServlets.GetServlet.class);
this.infoServer.setAttribute("datanode", this);
this.infoServer.setAttribute("datanode.blockScanner", blockScanner);
this.infoServer.setAttribute(JspHelper.CURRENT_CONF, conf);
this.infoServer.addServlet(null, "/blockScannerReport",
DataBlockScanner.Servlet.class);
if (WebHdfsFileSystem.isEnabled(conf, LOG)) {
infoServer.addJerseyResourcePackage(DatanodeWebHdfsMethods.class
.getPackage().getName() + ";" + Param.class.getPackage().getName(),
WebHdfsFileSystem.PATH_PREFIX + "/*");
}
this.infoServer.start();
// adjust info port
this.dnRegistration.setInfoPort(this.infoServer.getPort());
myMetrics = DataNodeInstrumentation.create(conf,
dnRegistration.getStorageID());
// set service-level authorization security policy
if (conf.getBoolean(
ServiceAuthorizationManager.SERVICE_AUTHORIZATION_CONFIG, false)) {
ServiceAuthorizationManager.refresh(conf, new HDFSPolicyProvider());
}
// BlockTokenSecretManager is created here, but it shouldn't be
// used until it is initialized in register().
this.blockTokenSecretManager = new BlockTokenSecretManager(false,0, 0);
//init ipc server
//开启本地ipc服务,监听来自client和其它
InetSocketAddress ipcAddr = NetUtils.createSocketAddr(conf.get("dfs.datanode.ipc.address"));
ipcServer = RPC.getServer(this, ipcAddr.getHostName(), ipcAddr.getPort(),
conf.getInt("dfs.datanode.handler.count", 3), false, conf,
blockTokenSecretManager);
dnRegistration.setIpcPort(ipcServer.getListenerAddress().getPort());
LOG.info("dnRegistration = " + dnRegistration);
}
(6)运行DataNode结点
进行DataNode注册,创建线程,设置守护线程,启动线程。
/** Start a single datanode daemon and wait for it to finish.
* If this thread is specifically interrupted, it will stop waiting.
*/
public static void runDatanodeDaemon(DataNode dn) throws IOException {
if (dn != null) {
//register datanode
dn.register();
dn.dataNodeThread = new Thread(dn, dnThreadName);
dn.dataNodeThread.setDaemon(true); // needed for JUnit testing
dn.dataNodeThread.start();
}
}
启动DataXceiverServer,然后进入datanode的正常运行。检查是否需要升级,调用offerService方法提供服务
public void run() {
LOG.info(dnRegistration + "In DataNode.run, data = " + data);
///启动数据块的流读写服务器,
// start dataXceiveServer
dataXceiverServer.start();
//内部hadoop ipc服务器
ipcServer.start();
while (shouldRun) {
try {
//检测是否需要升级hadoop文件系统
startDistributedUpgradeIfNeeded();
//DataNode提供服务,定时发送心跳给NameNode,响应NameNode返回的命令并执行
offerService();
} catch (Exception ex) {
LOG.error("Exception: " + StringUtils.stringifyException(ex));
if (shouldRun) {
try {
Thread.sleep(5000);
} catch (InterruptedException ie) {
}
}
}
}
LOG.info(dnRegistration + ":Finishing DataNode in: "+data);
shutdown();
}
5.2 Datanode 心跳分析
(1)offerService分析
(a)检查心跳间隔是否超时,如是向namenode发送心跳报告,内容是dfs的容量、剩余的空间和DataXceiverServer的数量等,调用processCommand方法处理namenode返回的命令
(b)通知namenode已经接收的块
(c)检查块报告间隔是否超时,如是向namenode发送块报告,调用processCommand方法处理namenode返回的命令
(d)如果没到下个发送心跳的时候,休眠
/**
* Main loop for the DataNode. Runs until shutdown,
* forever calling remote NameNode functions.
*
* 1.检查心跳间隔是否超时,如是向namenode发送心跳报告,内容是dfs的容量、剩余的空间和DataXceiverServer的数量等,调用processCommand方法处理namenode返回的命令
* 2.通知namenode已经接收的块
* 3.检查块报告间隔是否超时,如是向namenode发送块报告,调用processCommand方法处理namenode返回的命令
* 4.如果没到下个发送心跳的时候,休眠
*
*
* DNA_UNKNOWN = 0:未知操作
* DNA_TRANSFER = 1:传输块到另一个datanode,创建DataTransfer来传输每个块,请求的类型是OP_WRITE_BLOCK,使用BlockSender来发送块和元数据文件,不对块进行校验
* DNA_INVALIDATE = 2:不合法的块,将所有块删除
* DNA_SHUTDOWN = 3:停止datanode,停止infoServer、DataXceiverServer、DataBlockScanner和处理线程,将存储目录解锁,DataBlockScanner结束可能需要等待1小时
* DNA_REGISTER = 4:重新注册
* DNA_FINALIZE = 5:完成升级,调用DataStorage的finalizeUpgrade方法完成升级
* DNA_RECOVERBLOCK = 6:请求块恢复,创建线程来恢复块,每个线程服务一个块,对于每个块,调用recoverBlock来恢复块信息
*
* 会利用保存在receivedBlockList和delHints两个列表中的信息。
* receivedBlockList表明在这个DataNode成功创建的新的数据块
* delHints,是可以删除该数据块的节点
*/
public void offerService() throws Exception {
LOG.info("using BLOCKREPORT_INTERVAL of " + blockReportInterval + "msec" +
" Initial delay: " + initialBlockReportDelay + "msec");
//
// Now loop for a long time....
//
while (shouldRun) {
try {
long startTime = now();
//
// Every so often, send heartbeat or block-report
//
if (startTime - lastHeartbeat > heartBeatInterval) {
//定期发送心跳
//
// All heartbeat messages include following info:
// -- Datanode name
// -- data transfer port
// -- Total capacity
// -- Bytes remaining
//
lastHeartbeat = startTime;
DatanodeCommand[] cmds = namenode.sendHeartbeat(dnRegistration,
data.getCapacity(),
data.getDfsUsed(),
data.getRemaining(),
xmitsInProgress.get(),
getXceiverCount());
myMetrics.addHeartBeat(now() - startTime);
//LOG.info("Just sent heartbeat, with name " + localName);
//响应namenode返回的命令做处理
if (!processCommand(cmds))
continue;
}
// check if there are newly received blocks
Block [] blockArray=null;
String [] delHintArray=null;
synchronized(receivedBlockList) {
synchronized(delHints) {
int numBlocks = receivedBlockList.size();
if (numBlocks > 0) {
if(numBlocks!=delHints.size()) {
LOG.warn("Panic: receiveBlockList and delHints are not of the same length" );
}
//
// Send newly-received blockids to namenode
//
blockArray = receivedBlockList.toArray(new Block[numBlocks]);// receivedBlockList表明在这个DataNode成功创建的新的数据块,而delHints,是可以删除该数据块的节点
delHintArray = delHints.toArray(new String[numBlocks]);//在datanode.notifyNamenodeReceivedBlock函数中发生变化
}
}
}
if (blockArray != null) {
if(delHintArray == null || delHintArray.length != blockArray.length ) {
LOG.warn("Panic: block array & delHintArray are not the same" );
}
namenode.blockReceived(dnRegistration, blockArray, delHintArray);;//Block状态变化报告通过NameNode.blockReceived来报告。
synchronized (receivedBlockList) {
synchronized (delHints) {
for(int i=0; i<blockArray.length; i++) {
receivedBlockList.remove(blockArray[i]);
delHints.remove(delHintArray[i]);
}
}
}
}
// Send latest blockinfo report if timer has expired.
if (startTime - lastBlockReport > blockReportInterval) {// 向namenode报告系统中Block状态的变化
if (data.isAsyncBlockReportReady()) {
// Create block report
long brCreateStartTime = now();
Block[] bReport = data.retrieveAsyncBlockReport();
// Send block report
long brSendStartTime = now();
//向Namenode报告其上的块状态报告
DatanodeCommand cmd = namenode.blockReport(dnRegistration,
BlockListAsLongs.convertToArrayLongs(bReport));
// Log the block report processing stats from Datanode perspective
long brSendCost = now() - brSendStartTime;
long brCreateCost = brSendStartTime - brCreateStartTime;
myMetrics.addBlockReport(brSendCost);
LOG.info("BlockReport of " + bReport.length
+ " blocks took " + brCreateCost + " msec to generate and "
+ brSendCost + " msecs for RPC and NN processing");
//
// If we have sent the first block report, then wait a random
// time before we start the periodic block reports.
//
if (resetBlockReportTime) {
lastBlockReport = startTime -
R.nextInt((int)(blockReportInterval));
resetBlockReportTime = false;
} else {
/* say the last block report was at 8:20:14. The current report
* should have started around 9:20:14 (default 1 hour interval).
* If current time is :
* 1) normal like 9:20:18, next report should be at 10:20:14
* 2) unexpected like 11:35:43, next report should be at
* 12:20:14
*/
lastBlockReport += (now() - lastBlockReport) /
blockReportInterval * blockReportInterval;
}
processCommand(cmd);
} else {
data.requestAsyncBlockReport();
if (lastBlockReport > 0) { // this isn't the first report
long waitingFor =
startTime - lastBlockReport - blockReportInterval;
String msg = "Block report is due, and been waiting for it for " +
(waitingFor/1000) + " seconds...";
if (waitingFor > LATE_BLOCK_REPORT_WARN_THRESHOLD) {
LOG.warn(msg);
} else if (waitingFor > LATE_BLOCK_REPORT_INFO_THRESHOLD) {
LOG.info(msg);
} else if (LOG.isDebugEnabled()) {
LOG.debug(msg);
}
}
}
}
// start block scanner;//启动blockScanner线程 进行block扫描
if (blockScanner != null && blockScannerThread == null &&
upgradeManager.isUpgradeCompleted()) {
LOG.info("Starting Periodic block scanner.");
blockScannerThread = new Daemon(blockScanner);
blockScannerThread.start();
}
//
// There is no work to do; sleep until hearbeat timer elapses,
// or work arrives, and then iterate again.
//
long waitTime = heartBeatInterval - (System.currentTimeMillis() - lastHeartbeat);
synchronized(receivedBlockList) {
if (waitTime > 0 && receivedBlockList.size() == 0) {
try {
receivedBlockList.wait(waitTime);
} catch (InterruptedException ie) {
}
delayBeforeBlockReceived();
}
} // synchronized
} catch(RemoteException re) {
String reClass = re.getClassName();
if (UnregisteredDatanodeException.class.getName().equals(reClass) ||
DisallowedDatanodeException.class.getName().equals(reClass) ||
IncorrectVersionException.class.getName().equals(reClass)) {
LOG.warn("DataNode is shutting down: " +
StringUtils.stringifyException(re));
shutdown();
return;
}
LOG.warn(StringUtils.stringifyException(re));
} catch (IOException e) {
LOG.warn(StringUtils.stringifyException(e));
}
} // while (shouldRun)
} // offerService
(2)processCommand分析
DNA_UNKNOWN = 0:未知操作
DNA_TRANSFER = 1:传输块到另一个datanode,创建DataTransfer来传输每个块,请求的类型是OP_WRITE_BLOCK,使用BlockSender来发送块和元数据文件,不对块进行校验
DNA_INVALIDATE = 2:不合法的块,将所有块删除
DNA_SHUTDOWN = 3:停止datanode,停止infoServer、DataXceiverServer、DataBlockScanner和处理线程,将存储目录解锁,DataBlockScanner结束可能需要等待1小时
DNA_REGISTER = 4:重新注册
DNA_FINALIZE = 5:完成升级,调用DataStorage的finalizeUpgrade方法完成升级
DNA_RECOVERBLOCK = 6:请求块恢复,创建线程来恢复块,每个线程服务一个块,对于每个块,调用recoverBlock来恢复块信息
/**
* DNA_UNKNOWN = 0:未知操作
* DNA_TRANSFER = 1:传输块到另一个datanode,创建DataTransfer来传输每个块,请求的类型是OP_WRITE_BLOCK,使用BlockSender来发送块和元数据文件,不对块进行校验
* DNA_INVALIDATE = 2:不合法的块,将所有块删除
* DNA_SHUTDOWN = 3:停止datanode,停止infoServer、DataXceiverServer、DataBlockScanner和处理线程,将存储目录解锁,DataBlockScanner结束可能需要等待1小时
* DNA_REGISTER = 4:重新注册
* DNA_FINALIZE = 5:完成升级,调用DataStorage的finalizeUpgrade方法完成升级
* DNA_RECOVERBLOCK = 6:请求块恢复,创建线程来恢复块,每个线程服务一个块,对于每个块,调用recoverBlock来恢复块信息
*
* @param cmd
* @return true if further processing may be required or false otherwise.
* @throws IOException
*/
private boolean processCommand(DatanodeCommand cmd) throws IOException {
if (cmd == null)
return true;
final BlockCommand bcmd = cmd instanceof BlockCommand? (BlockCommand)cmd: null;
switch(cmd.getAction()) {
//传输块到另一个datanode,创建DataTransfer来传输每个块,请求的类型是OP_WRITE_BLOCK,使用BlockSender来发送块和元数据文件,不对块进行校验
case DatanodeProtocol.DNA_TRANSFER:
// Send a copy of a block to another datanode
transferBlocks(bcmd.getBlocks(), bcmd.getTargets());
myMetrics.incrBlocksReplicated(bcmd.getBlocks().length);
break;
//不合法的块,将所有块删除
case DatanodeProtocol.DNA_INVALIDATE:
//
// Some local block(s) are obsolete and can be
// safely garbage-collected.
//
Block toDelete[] = bcmd.getBlocks();
try {
if (blockScanner != null) {
blockScanner.deleteBlocks(toDelete);
}
data.invalidate(toDelete);
} catch(IOException e) {
checkDiskError();
throw e;
}
myMetrics.incrBlocksRemoved(toDelete.length);
break;
//停止datanode,停止infoServer、DataXceiverServer、DataBlockScanner和处理线程,将存储目录解锁,DataBlockScanner结束可能需要等待1小时
case DatanodeProtocol.DNA_SHUTDOWN:
// shut down the data node
this.shutdown();
return false;
//重新注册
case DatanodeProtocol.DNA_REGISTER:
// namenode requested a registration - at start or if NN lost contact
LOG.info("DatanodeCommand action: DNA_REGISTER");
if (shouldRun) {
register();
}
break;
//完成升级,调用DataStorage的finalizeUpgrade方法完成升级
case DatanodeProtocol.DNA_FINALIZE:
storage.finalizeUpgrade();
break;
case UpgradeCommand.UC_ACTION_START_UPGRADE:
// start distributed upgrade here
processDistributedUpgradeCommand((UpgradeCommand)cmd);
break;
//请求块恢复,创建线程来恢复块,每个线程服务一个块,对于每个块,调用recoverBlock来恢复块信息
case DatanodeProtocol.DNA_RECOVERBLOCK:
recoverBlocks(bcmd.getBlocks(), bcmd.getTargets());
break;
case DatanodeProtocol.DNA_ACCESSKEYUPDATE:
LOG.info("DatanodeCommand action: DNA_ACCESSKEYUPDATE");
if (isBlockTokenEnabled) {
blockTokenSecretManager.setKeys(((KeyUpdateCommand) cmd).getExportedKeys());
}
break;
//块均衡
case DatanodeProtocol.DNA_BALANCERBANDWIDTHUPDATE:
LOG.info("DatanodeCommand action: DNA_BALANCERBANDWIDTHUPDATE");
int vsn = ((BalancerBandwidthCommand) cmd).getBalancerBandwidthVersion();
if (vsn >= 1) {
long bandwidth =
((BalancerBandwidthCommand) cmd).getBalancerBandwidthValue();
if (bandwidth > 0) {
DataXceiverServer dxcs =
(DataXceiverServer) this.dataXceiverServer.getRunnable();
dxcs.balanceThrottler.setBandwidth(bandwidth);
}
}
break;
default:
LOG.warn("Unknown DatanodeCommand action: " + cmd.getAction());
}
return true;
}
5.3 Datanode 注册分析
DataNode节点向NameNode节点注册,一是告诉NameNode节点自己提供服务的网络地址端口,二是获取NameNode节点对自己的管理与控制。Datanode 注册只会被两个地方调用, Datanode启动和心跳收到注册命令。DatanodeRegistration这个类主要用于,当Datanode向Namenode发送注册信息时,它要向Namenode提供一些自己的注册信息。同时这个类中还可以设置它的父类就是DatanodeID的Name、infoPort、storageInfo这些内容。name 表示DataNode节点提供数据传输/交换的服务地址端口,storageID表示DataNode节点的存储器在整个HDFS集群中的全局编号,infoPort表示查询DataNode节点当前状态信息的端口号,ipcPort表示DataNode节点提供 ClientDatanodeProtocol服务端口号。
(1) datanode启动是会加载DataStorage,如果不存在则会format
(2)初次启动DataStorage的storageID是空的,所以会生成一个storageID
dnReg.storageID = "DS-" + rand + "-"+ ip + "-" + dnReg.getPort() + "-" + System.currentTimeMillis();
(3)注册成功后该storageID会写入磁盘,永久使用。
当NameNode节点收到某一个DataNode节点的注册请求之后,它会首先检查DataNode节点所运行的HDFS系统是否和自己是同一个版本号,然后才交给FSNamesystem来处理,源代码:
/**
* Datanode向Namenode注册,返回Namenode需要验证的有关Datanode的信息DatanodeRegistration的实例
*
* 检查该DataNode是否能接入到NameNode;
* 准备应答,更新请求的DatanodeID;
* 从datanodeMap(保存了StorageID >> DatanodeDescriptor的映射,用于保证DataNode使用的Storage的一致性)得到对应的DatanodeDescriptor,为nodeS;
* 从Host2NodesMap(主机名到DatanodeDescriptor数组的映射)中获取DatanodeDescriptor,为nodeN;
*
* 如果nodeN!=null同时nodeS!=nodeN(后面的条件表明表明DataNode上使用的Storage发生变化),那么我们需要先在系统中删除nodeN(removeDatanode,下面再讨论),并在Host2NodesMap中删除nodeN;
*
* 如果nodeS存在,表明前面已经注册过,则:
* 更新网络拓扑(保存在NetworkTopology),首先在NetworkTopology中删除nodeS,然后跟新nodeS的相关信息,调用resolveNetworkLocation,获得nodeS的位置,并从新加到NetworkTopology里;
* 更新心跳信息(register也是心跳);
*
* 如果nodeS不存在,表明这是一个新注册的DataNode,执行
* 如果注册信息的storageID为空,表明这是一个全新的DataNode,分配storageID;
* 创建DatanodeDescriptor,调用resolveNetworkLocation,获得位置信息;
* 调用unprotectedAddDatanode(后面分析)添加节点;
* 添加节点到NetworkTopology中;
* 添加到心跳数组中。
*
*/
public DatanodeRegistration register(DatanodeRegistration nodeReg
) throws IOException {
verifyVersion(nodeReg.getVersion());
namesystem.registerDatanode(nodeReg);
return nodeReg;
}
在 FSNamesystem中,会首先判断该DataNode是否被允许被连接到NameNode节点.在配置文件中,有这样两个选项:dfs.hosts 和dfs.hosts.exclude,它们的值对应的都是一个文件路径,dfs.hosts指向的文件是一个允许连接到NameNode节点的主机列表,dfs.hosts.exclude指向的文件是一个不允许连接到NameNode节点的主机列表,所以在NameNode节点创建 FSNamesystem的时候,会根据配置文件中的选项值从文件中加载这些主机列表,这样就完成了检查。
如果这时发现DataNode的 storageID为空,会立马给它分配一个全局ID,然后,FSNamesystem会根据DataNode的ip地址把它映射到合适的机架 (rack)中.FSNamesystem还会对该DataNode的描述信息和它对应的storageID做一个映射并保存起来,之后,就把该 DataNode节点加入到heartbeats集合中,让HeartbeartMinitor后台线程来实时监测该DataNode节点当前是否还 alive,NameNode节点会把注册结果信息返回给DataNode节点。
关于注册时发生的特殊情况与处理:
1. 一台DataNode节点突然宕机了,然后立马恢复重启,NameNode节点也没有及时检测到,那么当这台重启的DataNode节点注册时,就会出现 nodeN == nodeS!= null,在这种情况下NameNode节点不会清除与该节点相关的信息,而只是会更新该节点的状态信息,同时重新解析它的ip地址到HDFS集群的网络 拓扑图中(这里之所以会重新解析ip地址,是因为在一个DataNode节点重启的过程中唯一可变的就是它所在的网络拓扑结构发生了变化);
2.一台DataNode节点在突然宕机或者认为stop之后,又马上重启,但是此时该DataNode管理的存储器(逻辑磁盘)并不是之前管理的那一 个存储器了,那么该节点注册时就会出现nodeN != null 、nodeS != nodeN ,对于这种情况就相当于是该DataNode节点第一次注册,但是又必须要清除该DataNode节点上一次的注册信息,特别是Block与 DataNode之间的映射信息;
3.当一个DataNode节点由于更换了ip地址,所以它必须要重启并重新向NameNode节点注册,此时该节点注册时就会出现nodeS != null、nodeS != nodeN(存储器没有变),此时并不需要清除该存储器原来所在的DataNode节点的相关注册信息,主要是Block和DataNode节点之间的映 射信息,而只需要更新原来与存储器绑定的DataNode节点的服务器地址信息,当然此时还必须要把新的DataNode节点的ip地址解析到HDFS集 群的网络拓扑图中。
5.4 DataBlockScanner 文件校验
由于每一个磁盘或者是网络上的I/O操作可能会对正在读写的数据处理不慎而出现错误,所以HDFS提供了下面两种数据检验方式,以此来保证数据的完整性,而且这两种检验方式在DataNode节点上是同时工作的:
(1)校验和
检测损坏数据的常用方法是在第一次进行系统时计算数据的校验和,在通道传输过程中,如果新生成的校验和不完全匹配原始的校验和,那么数据就会被认为是被损坏的。
(2)数据块检测程序(DataBlockScanner)
在DataNode节点上开启一个后台线程,来定期验证存储在它上所有块,这个是防止物理介质出现损减情况而造成的数据损坏。
关于校验和,HDFS以透明的方式检验所有写入它的数据,并在默认设置下,会在读取数据时验证校验和。正对数据的每一个校验块,都会创建一个单独的校验和,默认校验块大小是512字节,对应的校验和是4字节。DataNode节点负载在存储数据(当然包括数据的校验和)之前验证它们收到的数据, 如果此 DataNode节点检测到错误,客户端会收到一个CheckSumException。客户端读取DataNode节点上的数据时,会验证校验和,即将 其与DataNode上存储的校验和进行比较。每一个DataNode节点都会维护着一个连续的校验和和验证日志,里面有着每一个Block的最后验证时 间。客户端成功验证Block之后,便会告诉DataNode节点,Datanode节点随之更新日志。
DataBlockScanner是在一个单独的线程里面进行执行,作用是周期性的对block进行校验,当DFSClient读取时,也会通知DataBlockScanner校验结果。
DataBlockScanner最大扫描速度是8 MB/s,通过BlockTransferThrottler来限制流量,最小扫描速度是1 MB/s,默认扫描周期是21天,扫描周期可通过dfs.datanode.scan.period.hours来设置。
该类中的属性TreeSet<BlockScanInfo> blockInfoSet用来保存block的扫描时间,离现在时间间隔最长的排在首位,扫描的过程如下:
检查blockInfoSet中的第一个block的最后扫描时间距离现在是否超过扫描周期,如果不超过,休眠一定时间然后开始下次检查,如果扫描周期,那么对该block进行校验,校验使用BlockSender来读取,读取的数据输出到 NullOutputStream,我们知道BlockSender在读取数据时,可以检查checksum,以此来判断是否校验成功。如果校验失败,进 行第二次校验,如果两次都失败,说明该block有错误,通知namenode。
与扫描日志相关的参数有:
最大扫描速度是8 MB/s,通过BlockTransferThrottler来限制流量
最小扫描速度是1 MB/s
默认扫描周期是3周,扫描周期可通过配置${dfs.datanode.scan.period.hours}来设置
与扫描日志相关的参数有:
日志文件名前缀是dncp_block_verification.log
共有两个日志:当前日志,文件后缀是.curr;前一个日志,文件后缀是.prev
minRollingPeriod:日志最小滚动周期是6小时
minWarnPeriod:日志最小警告周期是6小时,在一个警告周期内只有发出一个警告
minLineLimit:日志最小行数限制是1000
采用滚动日志方式,只有当前行数curNumLines超过最大行数maxNumLines,并且距离上次滚动日志的时间超过minRollingPeriod时,才将dncp_block_verification.log.curr重命名为dncp_block_verification.log.prev,将新的日志写到dncp_block_verification.log.curr中。
class DataBlockScanner implements Runnable {
public static final Log LOG = LogFactory.getLog(DataBlockScanner.class);
private static final int MAX_SCAN_RATE = 8 * 1024 * 1024; // 8MB per sec
private static final int MIN_SCAN_RATE = 1 * 1024 * 1024; // 1MB per sec
static final long DEFAULT_SCAN_PERIOD_HOURS = 21 * 24L; // three weeks
private static final long ONE_DAY = 24 * 3600 * 1000L;
static final DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
static final String verificationLogFile = "dncp_block_verification.log";
static final int verficationLogLimit = 5; // * numBlocks.
//一个扫描周期,可以由Datanode的配置文件来设置,配置项是:dfs.datanode.scan.period.hours,单位是小时,默认的值是21*24*60*60*1000 ms
private long scanPeriod = DEFAULT_SCAN_PERIOD_HOURS * 3600 * 1000;
DataNode datanode;
//数据块管理器
FSDataset dataset;
// sorted set
//数据块扫描信息集合,按照上一次扫描时间和数据块id升序排序,以便快速获取验证到期的数据块;
TreeSet<BlockScanInfo> blockInfoSet;
//数据块和数据块扫描信息的映射,以便能够根据数据块快速获取对应的扫描信息;
HashMap<Block, BlockScanInfo> blockMap;
long totalScans = 0;
long totalVerifications = 0; // includes remote verification by clients.
long totalScanErrors = 0;
long totalTransientErrors = 0;
long currentPeriodStart = System.currentTimeMillis();
//一个扫描周期中还剩下需要扫描的数据量;
long bytesLeft = 0; // Bytes to scan in this period
//一个扫描周期中需要扫描的总数据量;
long totalBytesToScan = 0;
//数据块的扫描验证日志记录器;
private LogFileHandler verificationLog;
Random random = new Random();
//扫描时I/O速度控制器,需要根据totalBytesToScan和bytesLeft信息来衡量;
BlockTransferThrottler throttler = null;
private static enum ScanType {
REMOTE_READ, // Verified when a block read by a client etc
VERIFICATION_SCAN, // scanned as part of periodic verfication
NONE,
}
......
}
DataBlockScanner被DataNode节点用来检测它所管理的所有Block数据块的一致性,因此,对已DataNode节点上的每一个Block,它都会每隔scanPeriod ms利用Block对应的校验和文件来检测该Block一次,看看这个Block的数据是否已经损坏。由于scanPeriod 的值一般比较大,因为对DataNode节点上的每一个Block扫描一遍要消耗不少系统资源,这就可能带来另外一个问题就是在一个扫描周 期内可能会出现DataNode节点重启的情况,所以为了提供系统性能,避免DataNode节点在启动之后对还没有过期的Block又扫描一遍,DataBlockScanner在其内部使用了日志记录器来持久化保存每一个Block上一次扫描的时间,这样的话, DataNode节点在启动之后通过日志文件来恢复之前所有Block的有效时间。另外,DataNode为了节约系统资源,它对Block的验证不仅仅只依赖于DataBlockScanner后台线程(VERIFICATION_SCAN方式),还会在向某一个客户端传送Block的时候来更行该Block的扫描时间(REMOTE_READ方式),这是因为DataNode向客户端传送一个Block的时候要必须校验该数据块。那么这个时候日志记录器并不会马上把该数据块的扫描信息写到日志,毕竟频繁的磁盘I/O会导致性能下降,至于何时对该Block的最新扫描时间写日志有一个判断条件: 1.如果是VERIFICATION_SCAN方式的Block验证,必须记日志; 2.如果是REMOTE_READ方式,那么该Block上一次的记录日志到现在的时间间隔超过24小时或者超过scanPeriod/3 ms 的话,记日志。 下面来结合源码详细讨论这个过程:
public void run() {
try {
//初始化
//1.为每个Block创建BlockScanInfo
//2.创建 日志扫描器 LogFileHandler
//3.创建 扫描速度控制器 BlockTransferThrottler
init();
// Read last verification times
//为每一个Block分配上一次验证的时间
if (!assignInitialVerificationTimes()) {
return;
}
//调整扫描速度
adjustThrottler();
while (datanode.shouldRun && !Thread.interrupted()) {
long now = System.currentTimeMillis();
synchronized (this) {
if (now >= (currentPeriodStart + scanPeriod)) {
//新周期
startNewPeriod();
}
}
if ((now - getEarliestScanTime()) >= scanPeriod) {
//校验
verifyFirstBlock();
} else {
try {
Thread.sleep(1000);
} catch (InterruptedException ignored) {
}
}
}
} catch (RuntimeException e) {
LOG.warn("RuntimeException during DataBlockScanner.run() : " + StringUtils.stringifyException(e));
throw e;
} finally {
shutdown();
LOG.info("Exiting DataBlockScanner thread.");
}
}
(1)初始化init
初始化两个关键数据结构
blockInfoSet = new TreeSet<BlockScanInfo>();
blockMap = new HashMap<Block, BlockScanInfo>();
为每个Block创建BlockScanInfo,创建日志扫描器 LogFileHandler ,创建扫描速度控制器 BlockTransferThrottler
(2)为每一个Block分配上一次验证的时间
BlockScanInfo info;
while ((info = blockInfoSet.first()).lastScanTime < 0) {
delBlockInfo(info);
info.lastScanTime = lastScanTime;
lastScanTime += verifyInterval;
addBlockInfo(info);
}
(3)调整扫描速度
在一次Blocks扫描验证周期中,DataBlockScanner需要进行大量的磁盘I/O,为了不影响DataNode节点上其它线程的工作资源,同时也为了自身工作的有效性,所以DataBlockScanner采用了扫描验证速度控制器,根据当前的工作量来控制当前数据块的验证速度。
//本次扫描验证还剩余的时间
long timeLeft = currentPeriodStart + scanPeriod - System.currentTimeMillis();
//根据本次验证扫描剩余的工作量和时间来计算速度
long bw = Math.max(bytesLeft * 1000 / timeLeft, MIN_SCAN_RATE);
throttler.setBandwidth(Math.min(bw, MAX_SCAN_RATE));
(4)DataNode节点在向客户端或者其它DataNode节点传输数据时,客户端或者其它DataNode节点会根据接收的数据校验和来验证接收到的数据,当验证出错时,它们会通知传送节点。DataBlockScanner通过自己扮演传输者又扮演接受者来实现数据块的验证的;同时为了防止本地磁盘的I/O的错误,DataBlockScanner采用了两次传输-接收来确保验证的Block的数据是出错了(损坏了)。当发现有出错的Block是,就需要向NameNode节点报告,由NameNode来决定如何处理这个数据块,而不是由DataNode节点擅自作主清除该Block数据信息。
blockSender = new BlockSender(block, 0, -1, false, false, true, datanode);
DataOutputStream out = new DataOutputStream(new IOUtils.NullOutputStream());
blockSender.sendBlock(out, null, throttler);
(5)错误处理
DatanodeInfo[] dnArr = { new DatanodeInfo(datanode.dnRegistration) };
LocatedBlock[] blocks = { new LocatedBlock(block, dnArr) };
//向NameNode节点发送出错的Block
datanode.namenode.reportBadBlocks(blocks);
5.5 DataNode 数据块接受/发送