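An earlier post covered in detail how a DataNode sends a Block to a client or to another DataNode, that is, the block sender BlockSender. Correspondingly, the DataNode or client on the other end must be able to receive the block correctly, which is the job of the block receiver, BlockReceiver, discussed here. HDFS clients and DataNodes receive Blocks in different ways; this post focuses on the block receiver on the DataNode.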
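As we all know, HDFS uses pipelined replication to transfer and store every replica of a block, so when a DataNode receives the data of a Block it has to consider whether that data must be passed on to the next DataNode. As mentioned in the discussion of the DataNode's Block sender, a DataNode sends Block data packet by packet, so the receiving DataNode also receives the Block data packet by packet and then forwards each received packet, again as a packet, to the next DataNode. In addition, once it has received the acknowledgement frames for a packet from all the DataNodes after it in the pipeline, it sends them, together with its own acknowledgement frame for that packet, back to its sender, as shown in the figure below: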
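What has long puzzled me, though, is that the DataNode's Block sender ignores every packet acknowledgement frame sent back by the receiver. If anyone knows the reason behind this, please let me know or discuss it with me.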
First, let's look at the basic information about the classes related to BlockReceiver.
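Inside the DataNode's block receiver there is a background thread, PacketResponder, which receives every packet acknowledgement frame coming from the downstream receiver and sends an acknowledgement frame for every packet to the upstream sender. BlockReceiver therefore has two key aspects: how it receives and forwards data packets, and how it receives and sends the acknowledgement frame for each packet.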
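To make the split between the receiving thread and the responder thread concrete, here is a minimal sketch. It is a toy model, not the Hadoop implementation: the class `ResponderSketch`, the `END` sentinel, and the use of `LinkedBlockingQueue` are all assumptions made for illustration (the real PacketResponder keeps a `LinkedList` of pending packets). It only shows that packets handed over by the receiver are acknowledged in the order they were enqueued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model of the receiver thread / responder thread split (all names hypothetical). */
final class ResponderSketch {
    private static final long END = -1L; // sentinel meaning "no more packets"

    /** The caller plays the receiver; a background thread acknowledges seqnos in FIFO order. */
    static List<Long> run(long[] seqnos) {
        final BlockingQueue<Long> ackQueue = new LinkedBlockingQueue<Long>();
        final List<Long> acked = new ArrayList<Long>();

        Thread responder = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        long next = ackQueue.take(); // blocks until a packet is pending
                        if (next == END) {
                            return;
                        }
                        acked.add(next); // stands in for "send the ack upstream"
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        responder.start();

        try {
            for (long seqno : seqnos) {
                ackQueue.put(seqno); // the packet has been stored and forwarded
            }
            ackQueue.put(END);
            responder.join(); // after join, the contents of acked are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the responder", e);
        }
        return acked;
    }

    public static void main(String[] args) {
        System.out.println(run(new long[] {0, 1, 2}));
    }
}
```

The point of the separate thread is that the receiver never has to wait for an acknowledgement before reading the next packet.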
1.BlockReceiver
private Block block; // basic information about the Block being received
protected boolean finalized;
private DataInputStream in = null; // input stream the Block data is read from
private DataChecksum checksum; // checksum verifier for the Block data
private OutputStream out = null; // output stream for the Block data (written to the DataNode's local disk)
private DataOutputStream checksumOut = null; // output stream for the Block checksums (written to the DataNode's local disk)
private int bytesPerChecksum; // size of one checksummed chunk of data
private int checksumSize; // size of the checksum for one chunk
private ByteBuffer buf; // buffer holding packet data
private int bufRead; // amount of data read into the buffer so far
private int maxPacketReadLen;
protected long offsetInBlock; // starting offset within the Block of the packet being received
protected final String inAddr; // IP address of the sender (DataNode/Client)
protected final String myAddr; // IP address of this receiver (DataNode)
private String mirrorAddr; // IP address of the next receiver (DataNode)
private DataOutputStream mirrorOut; // output stream for sending packets to the next receiver
private Daemon responder = null; // the packet responder
private BlockTransferThrottler throttler;
private FSDataset.BlockWriteStreams streams;
private boolean isRecovery = false;
private String clientName;
DatanodeInfo srcDataNode = null;
private Checksum partialCrc = null;
private DataNode datanode = null;
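BlockReceiver first tries to receive one complete packet at a time and stores the whole packet in its buffer. It then forwards the packet to the next receiver and writes it to the local disk, and finally hands the packet to the PacketResponder for acknowledgement. Before the packet is written to disk there is a data verification step, but it is conditional: if the original sender is a Client, the data is verified only on the last DataNode in the pipeline; if the original sender is a DataNode, the data is verified on every DataNode in the pipeline. (Verification here means using the received checksums to check whether the received data has been corrupted.) In the ideal case, the main steps by which BlockReceiver receives a Block are as follows: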
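The verification rule can be written down as a tiny predicate. This is only a sketch of the condition `mirrorOut == null || clientName.length() == 0` that appears in `receivePacket()`; the class `ChecksumVerifyRule`, the method `shouldVerify`, and the sample client name are hypothetical and exist only for illustration.

```java
/** Sketch of the "who verifies checksums" rule (names are hypothetical). */
final class ChecksumVerifyRule {
    /**
     * @param hasMirror  true if this DataNode forwards packets to another DataNode
     * @param clientName name of the writing client, or "" when the sender is a DataNode
     * @return true if this node should verify the received data against its checksums
     */
    static boolean shouldVerify(boolean hasMirror, String clientName) {
        // The last node in the pipeline has no mirror; a DataNode-initiated
        // transfer carries an empty client name.
        return !hasMirror || clientName.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(shouldVerify(true, "someClient"));  // client write, middle node
        System.out.println(shouldVerify(false, "someClient")); // client write, last node
        System.out.println(shouldVerify(true, ""));            // DataNode-to-DataNode transfer
    }
}
```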
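1). Initialization
Initialization here means allocating storage space on the local disk for the Block data that is about to arrive, creating a checksum verifier for the Block from the received header information, and reading the corresponding checksum configuration.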
BlockReceiver(Block block, DataInputStream in, String inAddr,
String myAddr, boolean isRecovery, String clientName,
DatanodeInfo srcDataNode, DataNode datanode) throws IOException {
try{
this.block = block;
this.in = in;
this.inAddr = inAddr;
this.myAddr = myAddr;
this.isRecovery = isRecovery;
this.clientName = clientName;
this.offsetInBlock = 0;
this.srcDataNode = srcDataNode;
this.datanode = datanode;
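this.checksum = DataChecksum.newDataChecksum(in); // create the checksum verifier from the header
this.bytesPerChecksum = checksum.getBytesPerChecksum(); // size of one checksummed chunk
this.checksumSize = checksum.getChecksumSize(); // size of the checksum for one chunk
// create temporary storage for the incoming Block (the Block's data file and checksum file)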
streams = datanode.data.writeToBlock(block, isRecovery);
this.finalized = datanode.data.isValidBlock(block);
if (streams != null) {
this.out = streams.dataOut;
this.checksumOut = new DataOutputStream(new BufferedOutputStream( streams.checksumOut, SMALL_BUFFER_SIZE));
if (datanode.blockScanner != null && isRecovery) {
datanode.blockScanner.deleteBlock(block);
}
}
} catch (BlockAlreadyExistsException bae) {
throw bae;
} catch(IOException ioe) {
IOUtils.closeStream(this);
cleanupBlock();
IOException cause = FSDataset.getCauseIfDiskError(ioe);
if (cause != null) { // possible disk error
ioe = cause;
datanode.checkDiskError(ioe); // may throw an exception here
}
throw ioe;
}
}
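To illustrate what `bytesPerChecksum` and `checksumSize` mean, the following sketch computes one checksum per chunk of data. It assumes CRC32 stored as a 4-byte big-endian integer and uses the JDK's `java.util.zip.CRC32` rather than Hadoop's `DataChecksum`; the class `ChunkChecksumSketch` and the chunk size of 512 in the example are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch: one CRC32 per chunk of bytesPerChecksum bytes (not Hadoop's DataChecksum). */
final class ChunkChecksumSketch {
    static final int CHECKSUM_SIZE = 4; // assumed: a CRC32 value stored as a 4-byte int

    /** Returns the concatenated per-chunk checksums; the last chunk may be shorter. */
    static byte[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        ByteBuffer out = ByteBuffer.allocate(chunks * CHECKSUM_SIZE); // big-endian by default
        CRC32 crc = new CRC32();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            out.putInt((int) crc.getValue());
        }
        return out.array();
    }

    public static void main(String[] args) {
        // 1025 bytes with 512-byte chunks gives 3 chunks, hence 3 checksums.
        System.out.println(checksums(new byte[1025], 512).length);
        byte[] sample = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(ByteBuffer.wrap(checksums(sample, 512)).getInt()));
    }
}
```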
2). Receiving the Block data
void receiveBlock(DataOutputStream mirrOut, DataInputStream mirrIn, DataOutputStream replyOut, String mirrAddr,BlockTransferThrottler throttlerArg, int numTargets) throws IOException {
mirrorOut = mirrOut; // output stream to the next DataNode
mirrorAddr = mirrAddr;
throttler = throttlerArg;
try {
// write data chunk header
if (!finalized) {
BlockMetadataHeader.writeHeader(checksumOut, checksum);
}
if (clientName.length() > 0) {
// create the packet responder
responder = new Daemon(datanode.threadGroup,new PacketResponder(this, block, mirrIn,replyOut, numTargets));
responder.start(); // start thread to process responses
}
// keep receiving/forwarding packets until the end-of-block marker arrives
while (receivePacket() > 0) {}
// tell the next DataNode that the block has ended
if (mirrorOut != null) {
try {
mirrorOut.writeInt(0); // mark the end of the block
mirrorOut.flush();
} catch (IOException e) {
handleMirrorOutError(e);
}
}
// wait for the PacketResponder to finish acknowledging all packets
if (responder != null) {
((PacketResponder)responder.getRunnable()).close();
}
if (clientName.length() == 0) {
// close the block/crc files
close();
// Finalize the block. Does this fsync()?
block.setNumBytes(offsetInBlock);
datanode.data.finalizeBlock(block);
datanode.myMetrics.blocksWritten.inc();
}
} catch (IOException ioe) {
LOG.info("Exception in receiveBlock for block " + block + " " + ioe);
IOUtils.closeStream(this);
if (responder != null) {
responder.interrupt();
}
cleanupBlock();
throw ioe;
} finally {
if (responder != null) {
try {
responder.join();
} catch (InterruptedException e) {
throw new IOException("Interrupted receiveBlock");
}
responder = null;
}
}
}
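The loop `while (receivePacket() > 0) {}` and the final `mirrorOut.writeInt(0)` together form a simple framing convention: a zero length marks the end of the block. The sketch below reproduces that idea with a deliberately simplified frame of `[int length][payload]`; it is not the real packet format, and the class `EndMarkerSketch` and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Sketch of a length-prefixed stream terminated by a zero length (simplified framing). */
final class EndMarkerSketch {
    /** Writes every payload as [int length][bytes], then a zero length as the end marker. */
    static byte[] writeStream(byte[][] payloads) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            DataOutputStream out = new DataOutputStream(bytes);
            for (byte[] payload : payloads) {
                out.writeInt(payload.length); // must be > 0, since 0 is reserved for the marker
                out.write(payload);
            }
            out.writeInt(0); // mark the end of the block
            out.flush();
        } catch (IOException e) {
            throw new IllegalStateException(e); // cannot happen with in-memory streams
        }
        return bytes.toByteArray();
    }

    /** Reads frames until the zero-length marker; returns the total payload bytes seen. */
    static int readStream(byte[] stream) {
        int total = 0;
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));
            int len;
            while ((len = in.readInt()) > 0) {
                byte[] payload = new byte[len];
                in.readFully(payload);
                total += len;
            }
        } catch (IOException e) {
            throw new IllegalStateException("stream ended before the end marker", e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] stream = writeStream(new byte[][] {new byte[3], new byte[5]});
        System.out.println(stream.length + " bytes on the wire, " + readStream(stream) + " payload bytes");
    }
}
```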
3). Receiving/forwarding one packet
private int readToBuf(int toRead) throws IOException {
if (toRead < 0) {
toRead = (maxPacketReadLen > 0 ? maxPacketReadLen : buf.capacity()) - buf.limit();
}
// read data into buf
int nRead = in.read(buf.array(), buf.limit(), toRead);
if (nRead < 0) {
throw new EOFException("while trying to read " + toRead + " bytes");
}
bufRead = buf.limit() + nRead;
buf.limit(bufRead);
// return the number of bytes read by this call
return nRead;
}
/**
 * Reads one complete packet.
*/
private int readNextPacket() throws IOException {
// compute the buffer size and allocate the buffer
if (buf == null) {
int chunkSize = bytesPerChecksum + checksumSize;
int chunksPerPacket = (datanode.writePacketSize - DataNode.PKT_HEADER_LEN - SIZE_OF_INTEGER + chunkSize - 1)/chunkSize;
buf = ByteBuffer.allocate(DataNode.PKT_HEADER_LEN + SIZE_OF_INTEGER + Math.max(chunksPerPacket, 1) * chunkSize);
buf.limit(0);
}
// See if there is data left in the buffer :
if (bufRead > buf.limit()) {
buf.limit(bufRead);
}
// keep reading packet data into the buffer until at least the length field is available
while (buf.remaining() < SIZE_OF_INTEGER) {
if (buf.position() > 0) {
shiftBufData();
}
readToBuf(-1);
}
/* We mostly have the full packet or at least enough for an int
*/
buf.mark();
int payloadLen = buf.getInt();
buf.reset();
if (payloadLen == 0) {
//end of stream!
buf.limit(buf.position() + SIZE_OF_INTEGER);
return 0;
}
// check corrupt values for pktLen, 100MB upper limit should be ok?
if (payloadLen < 0 || payloadLen > (100*1024*1024)) {
throw new IOException("Incorrect value for packet payload : " +
payloadLen);
}
int pktSize = payloadLen + DataNode.PKT_HEADER_LEN;
if (buf.remaining() < pktSize) {
//we need to read more data
int toRead = pktSize - buf.remaining();
// first make sure buf has enough space.
int spaceLeft = buf.capacity() - buf.limit();
if (toRead > spaceLeft && buf.position() > 0) {
shiftBufData();
spaceLeft = buf.capacity() - buf.limit();
}
if (toRead > spaceLeft) {
byte oldBuf[] = buf.array();
int toCopy = buf.limit();
buf = ByteBuffer.allocate(toCopy + toRead);
System.arraycopy(oldBuf, 0, buf.array(), 0, toCopy);
buf.limit(toCopy);
}
//now read:
while (toRead > 0) {
toRead -= readToBuf(toRead);
}
}
if (buf.remaining() > pktSize) {
buf.limit(buf.position() + pktSize);
}
if (pktSize > maxPacketReadLen) {
maxPacketReadLen = pktSize;
}
return payloadLen;
}
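The buffer sizing at the top of `readNextPacket()` is easier to follow with numbers. The sketch below repeats the same arithmetic. The header length of 21 bytes follows from the fields that `receivePacket()` reads (an int, two longs, and one byte); the sample values of 65536 for `writePacketSize`, 512 for `bytesPerChecksum`, and 4 for `checksumSize` are assumptions chosen for illustration, and the class `PacketBufferSizeSketch` is hypothetical.

```java
/** Sketch of the packet buffer sizing arithmetic (sample values are assumptions). */
final class PacketBufferSizeSketch {
    static final int SIZE_OF_INTEGER = 4;
    // packet length (int) + offset in block (long) + seqno (long) + last-packet flag (byte)
    static final int PKT_HEADER_LEN = 4 + 8 + 8 + 1;

    /** Number of (data + checksum) chunks in one packet, rounded up, at least 1. */
    static int chunksPerPacket(int writePacketSize, int bytesPerChecksum, int checksumSize) {
        int chunkSize = bytesPerChecksum + checksumSize;
        int chunks = (writePacketSize - PKT_HEADER_LEN - SIZE_OF_INTEGER + chunkSize - 1) / chunkSize;
        return Math.max(chunks, 1);
    }

    /** Capacity of the buffer: header, data-length field, and all chunks. */
    static int bufferCapacity(int writePacketSize, int bytesPerChecksum, int checksumSize) {
        int chunkSize = bytesPerChecksum + checksumSize;
        return PKT_HEADER_LEN + SIZE_OF_INTEGER
                + chunksPerPacket(writePacketSize, bytesPerChecksum, checksumSize) * chunkSize;
    }

    public static void main(String[] args) {
        System.out.println(chunksPerPacket(65536, 512, 4)); // chunks per packet
        System.out.println(bufferCapacity(65536, 512, 4));  // initial buffer capacity in bytes
    }
}
```

With these sample values a packet holds 127 chunks and the buffer starts at 65557 bytes. If a larger packet arrives, `readNextPacket()` reallocates the buffer, so this is only the initial capacity.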
/**
 * Receives one packet, stores it on the local disk, and forwards it to the next receiver.
 */
private int receivePacket() throws IOException {
// receive the data of one packet
int payloadLen = readNextPacket();
if (payloadLen <= 0) {
return payloadLen;
}
buf.mark();
// read the packet header
buf.getInt(); // packet length
offsetInBlock = buf.getLong(); // get offset of packet in block
long seqno = buf.getLong(); // get seqno
boolean lastPacketInBlock = (buf.get() != 0);
int endOfHeader = buf.position();
buf.reset();
setBlockPosition(offsetInBlock);
//First write the packet to the mirror:
if (mirrorOut != null) {
try {
// forward the whole packet (header, checksums, and data) to the next receiver
mirrorOut.write(buf.array(), buf.position(), buf.remaining());
mirrorOut.flush();
} catch (IOException e) {
handleMirrorOutError(e);
}
}
buf.position(endOfHeader);
int len = buf.getInt(); // length of the data in the packet
if (len < 0) {
throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " at offset " + offsetInBlock + ": " + len);
}
if (len == 0) {
LOG.debug("Receiving empty packet for block " + block);
} else {
offsetInBlock += len;
// length of the checksum data
int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)* checksumSize;
if ( buf.remaining() != (checksumLen + len)) {
throw new IOException("Data remaining in packet does not match " + "sum of checksumLen and dataLen");
}
int checksumOff = buf.position(); // where the checksums start in the packet
int dataOff = checksumOff + checksumLen; // where the actual data starts in the packet
byte pktBuf[] = buf.array();
buf.position(buf.limit()); // move to the end of the data.
// verify the data
if (mirrorOut == null || clientName.length() == 0) {
verifyChunks(pktBuf, dataOff, len, pktBuf, checksumOff);
}
try {
if (!finalized) {
// write the packet's data to the buffered block file stream
out.write(pktBuf, dataOff, len);
// If this is a partial chunk, then verify that this is the only
// chunk in the packet. Calculate new crc for this chunk.
if (partialCrc != null) {
if (len > bytesPerChecksum) {
throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " " + "A packet can have only one partial chunk."+ " len = " + len + " bytesPerChecksum " + bytesPerChecksum);
}
partialCrc.update(pktBuf, dataOff, len);
byte[] buf = FSOutputSummer.convertToByteStream(partialCrc, checksumSize);
checksumOut.write(buf);
LOG.debug("Writing out partial crc for data len " + len);
partialCrc = null;
} else {
// write the packet's checksums to the buffered checksum file stream
checksumOut.write(pktBuf, checksumOff, checksumLen);
}
datanode.myMetrics.bytesWritten.inc(len);
}
} catch (IOException iex) {
datanode.checkDiskError(iex);
throw iex;
}
}
// flush the packet's data and checksums to disk
flush();
// hand the packet over to the PacketResponder for acknowledgement
if (responder != null) {
((PacketResponder)responder.getRunnable()).enqueue(seqno, lastPacketInBlock);
}
if (throttler != null) { // throttle I/O
throttler.throttle(payloadLen);
}
return payloadLen;
}
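The offsets computed in `receivePacket()` imply the following packet layout: a 21-byte header, a 4-byte data length, then all checksums, then all data. The sketch below builds such a packet in a `ByteBuffer` and parses it again with the same arithmetic. The class `PacketLayoutSketch`, its `Layout` holder, and the `build` helper are hypothetical, and the checksum bytes are left as zeros because only the layout is being shown.

```java
import java.nio.ByteBuffer;

/** Sketch of the packet layout implied by receivePacket() (names are hypothetical). */
final class PacketLayoutSketch {
    static final int PKT_HEADER_LEN = 4 + 8 + 8 + 1; // length, offset, seqno, last-packet flag

    /** Parsed header fields plus the offsets of checksums and data inside the buffer. */
    static final class Layout {
        final long offsetInBlock;
        final long seqno;
        final boolean lastPacketInBlock;
        final int dataLen;
        final int checksumOff;
        final int dataOff;

        Layout(long offsetInBlock, long seqno, boolean last, int dataLen, int checksumOff, int dataOff) {
            this.offsetInBlock = offsetInBlock;
            this.seqno = seqno;
            this.lastPacketInBlock = last;
            this.dataLen = dataLen;
            this.checksumOff = checksumOff;
            this.dataOff = dataOff;
        }
    }

    /** Builds [pktLen][offset][seqno][last][dataLen][checksums][data], ready for reading. */
    static ByteBuffer build(long offsetInBlock, long seqno, boolean last, byte[] checksums, byte[] data) {
        int pktLen = 4 + checksums.length + data.length;
        ByteBuffer buf = ByteBuffer.allocate(PKT_HEADER_LEN + pktLen);
        buf.putInt(pktLen).putLong(offsetInBlock).putLong(seqno).put((byte) (last ? 1 : 0));
        buf.putInt(data.length).put(checksums).put(data);
        buf.flip();
        return buf;
    }

    /** Reads the header and checks that the remaining bytes equal checksums plus data. */
    static Layout parse(ByteBuffer buf, int bytesPerChecksum, int checksumSize) {
        buf.getInt(); // packet length, already used to frame the packet
        long offsetInBlock = buf.getLong();
        long seqno = buf.getLong();
        boolean last = buf.get() != 0;
        int len = buf.getInt();
        int checksumLen = ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
        if (len < 0 || buf.remaining() != checksumLen + len) {
            throw new IllegalArgumentException("packet size does not match checksumLen + dataLen");
        }
        int checksumOff = buf.position();
        return new Layout(offsetInBlock, seqno, last, len, checksumOff, checksumOff + checksumLen);
    }

    public static void main(String[] args) {
        Layout l = parse(build(1024L, 7L, true, new byte[8], new byte[1000]), 512, 4);
        System.out.println("checksums at " + l.checksumOff + ", data at " + l.dataOff);
    }
}
```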
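4). Acknowledging packets
private LinkedList<Packet> ackQueue = new LinkedList<Packet>(); // queue of packets awaiting acknowledgement
private volatile boolean running = true;
private Block block;
DataInputStream mirrorIn; // network input stream for receiving packet acknowledgement frames
DataOutputStream replyOut; // network output stream for sending packet acknowledgement frames
private int numTargets; // number of packet acknowledgement frames received at a time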
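PacketResponder receives and sends acknowledgement frames in the same order in which the packets were received. For example, when it is handling the packet numbered seqno, it first receives the acknowledgement that its downstream receiver sent for that packet, and then sends that acknowledgement, together with its own acknowledgement for the packet, to its upstream sender. During pipelined replication, if a DataNode hits an error, for instance because a received packet is corrupted, that DataNode's BlockReceiver ends its thread and sends no acknowledgement frame to its sender, so the sender never receives the acknowledgement it is waiting for. In that case the sender, which is itself a receiver in the pipeline, assumes that every DataNode after it failed while receiving the packets of this Block, and reports this to its own sender.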
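The relaying rule described above can be modelled in a few lines. This is a simplified sketch, not the wire format used by Hadoop: statuses are plain booleans, the node's own status is placed first by assumption, and a `null` list stands for a downstream node that never answered. The class `AckRelaySketch` and the method `buildReply` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of how one node builds the ack it sends upstream (names hypothetical). */
final class AckRelaySketch {
    /**
     * @param ownOk          whether this node stored the packet successfully
     * @param downstreamAcks statuses received from downstream, or null if nothing arrived
     * @param numTargets     number of DataNodes after this one in the pipeline
     * @return this node's status followed by one status per downstream node
     */
    static List<Boolean> buildReply(boolean ownOk, List<Boolean> downstreamAcks, int numTargets) {
        List<Boolean> reply = new ArrayList<Boolean>();
        reply.add(ownOk);
        for (int i = 0; i < numTargets; i++) {
            // A missing downstream ack is reported as a failure of every later node.
            boolean ok = downstreamAcks != null
                    && i < downstreamAcks.size()
                    && downstreamAcks.get(i);
            reply.add(ok);
        }
        return reply;
    }

    public static void main(String[] args) {
        System.out.println(buildReply(true, Arrays.asList(true, true), 2)); // healthy pipeline
        System.out.println(buildReply(true, null, 2));                      // downstream silent
    }
}
```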