In the previous post I walked through in detail how the client works during an HDFS file write. Its job boils down to two things: asking the NameNode for blocks, and streaming each block's packets to the datanodes. So how does a datanode actually receive such a block? To answer that, we have to start from datanode registration.
After a datanode starts up, it registers with the NameNode, reporting some information about itself, including its storage information and its service addresses. The service addresses cover the data-transfer address, the RPC address, and the info (HTTP) address. The data-transfer address is the one used to accept block operations from clients. That work is managed by DataXceiverServer, which can loosely be seen as a thread pool: the actual service for any particular client is handed off to a worker thread, a DataXceiver. So let's take a concrete look at DataXceiver.
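To make that division of labor concrete, here is a minimal, self-contained sketch of the same pattern, not Hadoop code: an acceptor loop hands each incoming connection to a fresh worker thread, the way DataXceiverServer hands sockets to DataXceiver workers. The class and method names (MiniXceiverServer, handle) are made up for illustration, and the worker merely echoes the opcode byte instead of doing real block I/O; the real server also caps the number of concurrent workers via maxXceiverCount, which this sketch omits.

```java
import java.io.*;
import java.net.*;

public class MiniXceiverServer {
    // Acceptor loop in the style of DataXceiverServer: accept a socket,
    // hand it to a fresh worker thread (the DataXceiver role), repeat.
    static ServerSocket start() throws IOException {
        ServerSocket server = new ServerSocket(0, 16, InetAddress.getLoopbackAddress());
        Thread acceptor = new Thread(() -> {
            while (true) {
                try {
                    Socket s = server.accept();
                    Thread worker = new Thread(() -> handle(s)); // one worker per connection
                    worker.setDaemon(true);
                    worker.start();
                } catch (IOException e) {
                    return; // server socket closed: stop accepting
                }
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
        return server;
    }

    // Worker: read the one-byte opcode and (here) just echo it back;
    // the real DataXceiver would dispatch to readBlock()/writeBlock()/...
    static void handle(Socket s) {
        try (Socket sock = s) {
            DataInputStream in = new DataInputStream(sock.getInputStream());
            DataOutputStream out = new DataOutputStream(sock.getOutputStream());
            byte op = in.readByte();
            out.writeByte(op);
            out.flush();
        } catch (IOException ignored) {
        }
    }
}
```

The key property this preserves is that the acceptor never blocks on a slow client: each connection gets its own thread, so one stalled transfer cannot stop new connections from being served.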
class DataXceiver implements Runnable, FSConstants {
  public void run() {
    DataInputStream in = null;
    try {
      in = new DataInputStream(
          new BufferedInputStream(NetUtils.getInputStream(s), SMALL_BUFFER_SIZE));
      short version = in.readShort();
      if (version != DataTransferProtocol.DATA_TRANSFER_VERSION) {
        throw new IOException("Version Mismatch");
      }
      boolean local = s.getInetAddress().equals(s.getLocalAddress());
      byte op = in.readByte();
      // Make sure the xceiver count is not exceeded
      int curXceiverCount = datanode.getXceiverCount();
      if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
        throw new IOException("xceiverCount " + curXceiverCount
            + " exceeds the limit of concurrent xcievers "
            + dataXceiverServer.maxXceiverCount);
      }
      long startTime = DataNode.now();
      switch (op) {
      case DataTransferProtocol.OP_READ_BLOCK:
        readBlock(in);
        datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);
        if (local)
          datanode.myMetrics.readsFromLocalClient.inc();
        else
          datanode.myMetrics.readsFromRemoteClient.inc();
        break;
      case DataTransferProtocol.OP_WRITE_BLOCK:
        writeBlock(in);
        datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);
        if (local)
          datanode.myMetrics.writesFromLocalClient.inc();
        else
          datanode.myMetrics.writesFromRemoteClient.inc();
        break;
      case DataTransferProtocol.OP_READ_METADATA:
        readMetadata(in);
        datanode.myMetrics.readMetadataOp.inc(DataNode.now() - startTime);
        break;
      case DataTransferProtocol.OP_REPLACE_BLOCK: // for balancing purposes; sent to a destination
        replaceBlock(in);
        datanode.myMetrics.replaceBlockOp.inc(DataNode.now() - startTime);
        break;
      case DataTransferProtocol.OP_COPY_BLOCK: // for balancing purposes; sent to a proxy source
        copyBlock(in);
        datanode.myMetrics.copyBlockOp.inc(DataNode.now() - startTime);
        break;
      case DataTransferProtocol.OP_BLOCK_CHECKSUM: // get the checksum of a block
        getBlockChecksum(in);
        datanode.myMetrics.blockChecksumOp.inc(DataNode.now() - startTime);
        break;
      default:
        throw new IOException("Unknown opcode " + op + " in data stream");
      }
    } catch (Throwable t) {
      LOG.error(datanode.dnRegistration + ":DataXceiver", t);
    } finally {
      LOG.debug(datanode.dnRegistration + ":Number of active connections is: "
          + datanode.getXceiverCount());
      IOUtils.closeStream(in);
      IOUtils.closeSocket(s);
      dataXceiverServer.childSockets.remove(s);
    }
  }
}
In its run() method, DataXceiver dispatches on the operation the client requested (read a block, write a block, copy a block, and so on) and calls the corresponding handler. Since this post is about the file write path, we will naturally focus on the writeBlock() method.
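As an aside, the stream preamble that run() parses before dispatching is tiny: a two-byte protocol version followed by a one-byte opcode. Here is a hedged sketch of that framing using in-memory streams; the constants VERSION and OP_WRITE are stand-ins for DataTransferProtocol.DATA_TRANSFER_VERSION and OP_WRITE_BLOCK, whose concrete values vary by Hadoop release.

```java
import java.io.*;

public class PreambleDemo {
    static final short VERSION = 14;  // stand-in for DATA_TRANSFER_VERSION
    static final byte OP_WRITE = 80;  // stand-in for OP_WRITE_BLOCK

    // Encode the preamble exactly as the sending side would: writeShort + writeByte.
    static byte[] encodePreamble() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(VERSION);
        out.writeByte(OP_WRITE);
        return buf.toByteArray();
    }

    // Decode and validate it the way DataXceiver.run() does:
    // reject a version mismatch, then return the opcode for dispatch.
    static byte decodeOp(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        short version = in.readShort();
        if (version != VERSION) {
            throw new IOException("Version Mismatch");
        }
        return in.readByte();
    }
}
```

Three bytes of preamble are all a DataXceiver needs before it commits a worker thread to a specific operation.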
private void writeBlock(DataInputStream in) throws IOException {
  DatanodeInfo srcDataNode = null;
  LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize()
      + " tcp no delay " + s.getTcpNoDelay());
  // First read the header fields
  Block block = new Block(in.readLong(),
      dataXceiverServer.estimateBlockSize, in.readLong()); // block id and generation stamp
  int pipelineSize = in.readInt();           // number of datanodes in the pipeline
  boolean isRecovery = in.readBoolean();     // is this part of block recovery?
  String client = Text.readString(in);       // client name (empty if sent by a datanode)
  boolean hasSrcDataNode = in.readBoolean(); // does the request come from another datanode?
  if (hasSrcDataNode) {
    srcDataNode = new DatanodeInfo();
    srcDataNode.readFields(in); // source datanode info
  }
  int numTargets = in.readInt(); // number of remaining downstream targets
  if (numTargets < 0) {
    throw new IOException("Mislabelled incoming datastream.");
  }
  DatanodeInfo targets[] = new DatanodeInfo[numTargets];
  for (int i = 0; i < targets.length; i++) {
    DatanodeInfo tmp = new DatanodeInfo();
    tmp.readFields(in); // info of the datanodes holding the remaining replicas
    targets[i] = tmp;
  }
  DataOutputStream mirrorOut = null;  // network stream for writing the block to the next datanode
  DataInputStream mirrorIn = null;    // reply from next target
  DataOutputStream replyOut = null;   // stream to prev target
  Socket mirrorSock = null;           // socket to next target
  BlockReceiver blockReceiver = null; // responsible for data handling
  String mirrorNode = null;           // the name:port of next target
  String firstBadLink = "";           // first datanode that failed in connection setup
  // open the reply stream back to the previous node in the pipeline
  replyOut = new DataOutputStream(
      NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
  try {
    blockReceiver = new BlockReceiver(block, in,
        s.getRemoteSocketAddress().toString(),
        s.getLocalSocketAddress().toString(),
        isRecovery, client, srcDataNode, datanode);
    if (targets.length > 0) {
      InetSocketAddress mirrorTarget = null;
      // Connect to backup machine
      mirrorNode = targets[0].getName();
      mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
      mirrorSock = datanode.newSocket();
      try {
        int timeoutValue = numTargets * datanode.socketTimeout;
        int writeTimeout = datanode.socketWriteTimeout
            + (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
        NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
        mirrorSock.setSoTimeout(timeoutValue);
        mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
        mirrorOut = new DataOutputStream(new BufferedOutputStream(
            NetUtils.getOutputStream(mirrorSock, writeTimeout), SMALL_BUFFER_SIZE));
        mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
        // Forward the block header to the next datanode
        mirrorOut.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
        mirrorOut.write(DataTransferProtocol.OP_WRITE_BLOCK);
        mirrorOut.writeLong(block.getBlockId());
        mirrorOut.writeLong(block.getGenerationStamp());
        mirrorOut.writeInt(pipelineSize);
        mirrorOut.writeBoolean(isRecovery);
        Text.writeString(mirrorOut, client);
        mirrorOut.writeBoolean(hasSrcDataNode);
        if (hasSrcDataNode) { // pass src node information
          srcDataNode.write(mirrorOut);
        }
        mirrorOut.writeInt(targets.length - 1);
        for (int i = 1; i < targets.length; i++) {
          targets[i].write(mirrorOut);
        }
        // Forward the checksum header to the next datanode
        blockReceiver.writeChecksumHeader(mirrorOut);
        mirrorOut.flush();
        // Read the connect ack from the next datanode
        if (client.length() != 0) {
          firstBadLink = Text.readString(mirrorIn);
          if (LOG.isDebugEnabled() || firstBadLink.length() > 0) {
            LOG.info("Datanode " + targets.length +
                     " got response for connect ack " +
                     " from downstream datanode with firstbadlink as " +
                     firstBadLink);
          }
        }
      } catch (IOException e) {
        if (client.length() != 0) {
          Text.writeString(replyOut, mirrorNode);
          replyOut.flush();
        }
        IOUtils.closeStream(mirrorOut);
        mirrorOut = null;
        IOUtils.closeStream(mirrorIn);
        mirrorIn = null;
        IOUtils.closeSocket(mirrorSock);
        mirrorSock = null;
        if (client.length() > 0) {
          throw e;
        } else {
          // the request came from another datanode: continue without the mirror
        }
      }
    }
    if (client.length() != 0) {
      if (LOG.isDebugEnabled() || firstBadLink.length() > 0) {
        // debug logging elided in this excerpt
      }
      Text.writeString(replyOut, firstBadLink); // ack the previous datanode or the client
      replyOut.flush();
    }
    // Receive the block from the previous datanode or the client,
    // relaying it to the next datanode as it arrives
    String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
    blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr, null, targets.length);
    if (client.length() == 0) {
      // add the freshly received block to this datanode's receivedBlockList
      datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT);
      if (datanode.blockScanner != null) {
        datanode.blockScanner.addBlock(block);
      }
    }
  } catch (IOException ioe) {
    throw ioe;
  } finally {
    // close all opened streams
    IOUtils.closeStream(mirrorOut);
    IOUtils.closeStream(mirrorIn);
    IOUtils.closeStream(replyOut);
    IOUtils.closeSocket(mirrorSock);
    IOUtils.closeStream(blockReceiver);
  }
}
Inside writeBlock(), the datanode relies on BlockReceiver to receive the block from the upstream client or datanode and to relay it to the next datanode. I won't go into the implementation details of BlockReceiver here; interested readers can study its source themselves. Once a DataXceiver has successfully received a block and forwarded it to the next datanode, it puts the freshly received block on the DataNode's receivedBlockList queue, and the DataNode then reports that block to the NameNode.
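The header fields that writeBlock() reads, and then re-emits toward the mirror, can be sketched as a simple round-trip over in-memory streams. This is only a structural illustration: string fields use writeUTF here, whereas Hadoop's Text.writeString uses a different, vint-prefixed UTF-8 encoding, and the DatanodeInfo entries that would follow the target count are omitted. The class and method names are invented for the sketch.

```java
import java.io.*;

public class WriteBlockHeaderDemo {
    // Encode the OP_WRITE_BLOCK header fields in the same order writeBlock() reads them.
    static byte[] encode(long blockId, long genStamp, int pipelineSize,
                         boolean isRecovery, String client, int numTargets)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(blockId);
        out.writeLong(genStamp);
        out.writeInt(pipelineSize);
        out.writeBoolean(isRecovery);
        out.writeUTF(client);      // stand-in for Text.writeString
        out.writeBoolean(false);   // hasSrcDataNode
        out.writeInt(numTargets);  // DatanodeInfo entries would follow here
        return buf.toByteArray();
    }

    // Decode it back, mirroring the reads at the top of writeBlock(),
    // and summarize the fields as a single slash-separated string.
    static String describe(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        long blockId = in.readLong();
        long genStamp = in.readLong();
        int pipelineSize = in.readInt();
        boolean isRecovery = in.readBoolean();
        String client = in.readUTF();
        boolean hasSrc = in.readBoolean();
        int numTargets = in.readInt();
        return blockId + "/" + genStamp + "/" + pipelineSize + "/" + isRecovery
            + "/" + client + "/" + hasSrc + "/" + numTargets;
    }
}
```

Note how the real writeBlock() re-emits this same header to the mirror with the target count decremented (targets.length - 1): that is precisely how the pipeline shrinks by one hop at each datanode until the last node receives a header with zero targets.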
There is one more thing I want to add, concerning the datanode's data-transfer service address. I once ran into an interesting problem with it: I deployed HDFS on a single PC, with one NameNode and one DataNode, where my PC's IP address was *.*.*.* (a campus-network address). I configured the NameNode's service address as localhost:8020 and the DataNode's data-transfer address as *.*.*.*:50010, and as a result the client could never connect to the DataNode's data port when transferring a block. Can you guess why? It is because when the datanode registers, the NameNode executes the following code:
String dnAddress = Server.getRemoteAddress();
if (dnAddress == null) {
  dnAddress = nodeReg.getHost();
}
// check if the datanode is allowed to connect to the namenode
if (!verifyNodeRegistration(nodeReg, dnAddress)) {
  throw new DisallowedDatanodeException(nodeReg);
}
String hostName = nodeReg.getHost();
// update the datanode's name with ip:port
DatanodeID dnReg = new DatanodeID(dnAddress + ":" + nodeReg.getPort(),
    nodeReg.getStorageID(), nodeReg.getInfoPort(), nodeReg.getIpcPort());
nodeReg.updateRegInfo(dnReg);
In the code above, the NameNode immediately overwrites the registration info nodeReg with the remote address of the connecting datanode, i.e. it rewrites the datanode's data-transfer address, and that is exactly where the problem lies. Because I had configured the NameNode's address as localhost:8020, the datanode's connection to the NameNode never left the loopback interface, so the remote address the NameNode saw, the value returned by Server.getRemoteAddress(), was 127.0.0.1. Clients therefore received 127.0.0.1:50010 as the datanode's data-transfer address. Meanwhile the datanode had done something dramatic of its own: it bound its data-transfer service specifically to *.*.*.*:50010 rather than to all interfaces. This combination meant the client could never reach the datanode's data port. Once I reconfigured the NameNode's address as *.*.*.*:8020, everything worked without a hitch.
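The effect is easy to reproduce with plain sockets, no Hadoop involved: when a client connects via localhost, the peer address the server observes is a loopback address, which is exactly the value Server.getRemoteAddress() fed into the registration code above. A minimal stand-alone sketch (class and method names invented for illustration):

```java
import java.io.IOException;
import java.net.*;

public class LoopbackDemo {
    // Returns the remote address a server observes for a client
    // that connected via the name "localhost".
    static InetAddress observedRemoteAddress() throws IOException {
        try (ServerSocket server = new ServerSocket(0)) { // ephemeral port, all interfaces
            try (Socket client = new Socket("localhost", server.getLocalPort());
                 Socket accepted = server.accept()) {
                // Analogue of Server.getRemoteAddress() on the NameNode side:
                // the peer address of the accepted connection.
                return accepted.getInetAddress();
            }
        }
    }
}
```

So even though my datanode had registered with its external IP, the NameNode recorded a loopback address, because that is where the registration connection appeared to come from.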