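In an HDFS cluster, data moves mainly between DataNodes and between clients and DataNodes. For example: a client writes data to or reads data from HDFS; when the NameNode detects that a block has too few replicas, it tells a DataNode to copy that block to other DataNodes; and when a DataNode stores too many blocks, some of them are moved to other DataNodes. This article therefore focuses on how a DataNode sends a block to a client or to another DataNode, that is, on the implementation of the block sender, BlockSender.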
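On a DataNode, BlockSender is used in four main places:
1. When a user reads a file from HDFS, the client is directed to the DataNode that holds the data and requests the corresponding block from it; the DataNode then uses BlockSender to send the data to that client;
2. When the NameNode finds that a block is under-replicated, it asks a DataNode that stores the block to copy it to other DataNodes. The copy still uses the pipeline approach; the only difference is that the data source is now a DataNode;
3. HDFS provides a tool, the Balancer, for balancing storage load across DataNodes. When it finds that a DataNode stores too many blocks, it has that DataNode move some of its blocks to DataNodes newly added to the cluster or to lightly loaded DataNodes;
4. Each DataNode runs a background thread that scans and verifies all the blocks it stores; this thread periodically uses BlockSender to check whether a block's data is corrupted.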
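In the first three scenarios, the DataNode does not verify checksums while sending and leaves verification to the receiver. Even if the data verifies correctly on the sending side, errors can still occur during network transmission, so it makes more sense to let the receiver do the check. The fourth scenario is different: there is no receiver at all, so the checksums must be verified before the data is sent.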
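The main fields of BlockSender:
private Block block; // the block to send
private InputStream blockIn; // input stream over the block's data
private long blockInPosition = -1; // file position used with transferTo(); negative when transferTo() is not used
private DataInputStream checksumIn; // input stream over the block's checksums
private DataChecksum checksum; // the checksum object
private long offset; // start position in the block of the data to read
private long endOffset; // end position in the block of the data to read
private long blockLength; // length of the block
private int bytesPerChecksum; // size of one checksum chunk
private int checksumSize; // size of the checksum for one checksum chunk
private boolean corruptChecksumOk; // whether errors while reading checksum data may be ignored
private boolean chunkOffsetOK; // whether to send the start offset of the data before sending the data itself
private long seqno; // sequence number of the packet
private boolean transferToAllowed = true; // whether transferTo() may be used
private boolean blockReadFully; // set when the whole block is read
private boolean verifyChecksum; // whether to verify checksums before sending the data
private BlockTransferThrottler throttler; // limits the sending rate
private final String clientTraceFmt; // format of client trace log message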
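When a DataNode creates a BlockSender for a block, it performs some initialization: it opens the input streams for the block's data and checksum data, creates the checksum object, and uses the checksum chunk size to determine the start and end positions of the data to read. The requested range does not necessarily cover a whole number of checksum chunks; its start and end may both fall in the middle of a chunk. So that the client can verify the data against the checksums, the sender must then transmit some extra data to complete those chunks.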
BlockSender(Block block, long startOffset, long length, boolean corruptChecksumOk, boolean chunkOffsetOK, boolean verifyChecksum, DataNode datanode, String clientTraceFmt)
throws IOException {
try {
this.block = block;
this.chunkOffsetOK = chunkOffsetOK;
this.corruptChecksumOk = corruptChecksumOk;
this.verifyChecksum = verifyChecksum;
this.blockLength = datanode.data.getLength(block);
this.transferToAllowed = datanode.transferToAllowed;
this.clientTraceFmt = clientTraceFmt;
// Read the checksum (metadata) file if checksum errors must not be ignored, or if the file exists
if ( !corruptChecksumOk || datanode.data.metaFileExists(block) ) {
// Open the input stream over the block's checksum data
checksumIn = new DataInputStream(new BufferedInputStream(datanode.data.getMetaDataInputStream(block),BUFFER_SIZE));
// Read the metadata header to obtain the checksum object that produced the block's checksum file
BlockMetadataHeader header = BlockMetadataHeader.readHeader(checksumIn);
short version = header.getVersion();
if (version != FSDataset.METADATA_VERSION) {
LOG.warn("Wrong version (" + version + ") for metadata file for " + block + " ignoring ...");
}
checksum = header.getChecksum();
} else {
LOG.warn("Could not find metadata file for " + block);
// Fall back to a default (null) checksum
checksum = DataChecksum.newDataChecksum(DataChecksum.CHECKSUM_NULL,16 * 1024);
}
// Adjust the checksum object if its chunk size is unreasonably large
bytesPerChecksum = checksum.getBytesPerChecksum();
if (bytesPerChecksum > 10*1024*1024 && bytesPerChecksum > blockLength){
checksum = DataChecksum.newDataChecksum(checksum.getChecksumType(), Math.max((int)blockLength, 10*1024*1024));
bytesPerChecksum = checksum.getBytesPerChecksum();
}
checksumSize = checksum.getChecksumSize();
if (length < 0) {
length = blockLength;
}
endOffset = blockLength;
if (startOffset < 0 || startOffset > endOffset || (length + startOffset) > endOffset) {
String msg = " Offset " + startOffset + " and length " + length + " don't match block " + block + " ( blockLen " + endOffset + " )";
LOG.warn(datanode.dnRegistration + ":sendBlock() : " + msg);
throw new IOException(msg);
}
// Align the start and end positions to checksum chunk boundaries
offset = (startOffset - (startOffset % bytesPerChecksum));
if (length >= 0) {
// Make sure endOffset points to end of a checksumed chunk.
long tmpLen = startOffset + length;
if (tmpLen % bytesPerChecksum != 0) {
tmpLen += (bytesPerChecksum - tmpLen % bytesPerChecksum);
}
if (tmpLen < endOffset) {
endOffset = tmpLen;
}
}
// Skip to the checksum of the first chunk to be read
if (offset > 0) {
long checksumSkip = (offset / bytesPerChecksum) * checksumSize;
// note blockInStream is seeked when created below
if (checksumSkip > 0) {
// Should we use seek() for checksum file as well?
IOUtils.skipFully(checksumIn, checksumSkip);
}
}
seqno = 0;
// Position the data stream at the start of the data to read
blockIn = datanode.data.getBlockInputStream(block, offset); // seek to offset
} catch (IOException ioe) {
IOUtils.closeStream(this);
IOUtils.closeStream(blockIn);
throw ioe;
}
}
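The chunk alignment done in the constructor can be tried in isolation. The following is a minimal sketch, not Hadoop code: the class `ChunkAlignment` and its method names are hypothetical, and only the arithmetic mirrors the constructor above (round the start down to a chunk boundary, round the end up but never past the block length, then compute how many checksum bytes to skip).

```java
final class ChunkAlignment {
    /** Rounds startOffset down to the beginning of its checksum chunk. */
    static long alignStart(long startOffset, int bytesPerChecksum) {
        return startOffset - (startOffset % bytesPerChecksum);
    }

    /** Rounds startOffset + length up to a chunk boundary, capped at blockLength. */
    static long alignEnd(long startOffset, long length,
                         int bytesPerChecksum, long blockLength) {
        long end = startOffset + length;
        long rem = end % bytesPerChecksum;
        if (rem != 0) {
            end += bytesPerChecksum - rem;
        }
        return Math.min(end, blockLength);
    }

    /** Number of checksum bytes that precede the checksum of the first chunk read. */
    static long checksumSkip(long alignedOffset, int bytesPerChecksum, int checksumSize) {
        return (alignedOffset / bytesPerChecksum) * checksumSize;
    }

    public static void main(String[] args) {
        int bytesPerChecksum = 512; // example values, chosen for illustration
        int checksumSize = 4;
        long blockLength = 10_000;
        long offset = alignStart(1_000, bytesPerChecksum);
        long end = alignEnd(1_000, 2_000, bytesPerChecksum, blockLength);
        long skip = checksumSkip(offset, bytesPerChecksum, checksumSize);
        System.out.println(offset + " " + end + " " + skip); // prints "512 3072 4"
    }
}
```

A request for bytes 1000 to 3000 is therefore served as bytes 512 to 3072, so that every checksum sent covers a complete chunk.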
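BlockSender sends data to the receiver in packets. A packet may contain several checksum chunks. BlockSender neither requires the receiver to acknowledge packets nor reads any acknowledgements itself. A packet has the following format:
packetLen: length of the packet;
offset: start position in the block of the data carried by the packet;
seqno: sequence number of the packet;
endFlag: whether this is the last packet (0/1);
len: length of the data in the packet;
checksums: one checksum per checksum chunk;
data chunks: the checksum chunks themselves;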
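To make the layout concrete, here is a small sketch that writes the header fields listed above into a `ByteBuffer`. The class `PacketLayoutSketch` and its method names are hypothetical; the field order and the `packetLen` formula (data length plus checksum bytes plus 4) are taken from the `sendChunks` listing in this article, and the buffer is sized for header, checksums, and data as an assumption for illustration.

```java
import java.nio.ByteBuffer;

final class PacketLayoutSketch {
    // packetLen (int) + offset (long) + seqno (long) + endFlag (byte) + len (int)
    static final int HEADER_LEN = 4 + 8 + 8 + 1 + 4;

    /** Number of checksum chunks needed to cover len bytes of data. */
    static int numChunks(int len, int bytesPerChecksum) {
        return (len + bytesPerChecksum - 1) / bytesPerChecksum;
    }

    /** Writes the packet header; checksums and data would follow at the returned position. */
    static ByteBuffer writeHeader(long offset, long seqno, int len, long endOffset,
                                  int bytesPerChecksum, int checksumSize) {
        int chunks = numChunks(len, bytesPerChecksum);
        int packetLen = len + chunks * checksumSize + 4;
        ByteBuffer pkt = ByteBuffer.allocate(HEADER_LEN + chunks * checksumSize + len);
        pkt.putInt(packetLen);
        pkt.putLong(offset);
        pkt.putLong(seqno);
        pkt.put((byte) (offset + len >= endOffset ? 1 : 0)); // 1 marks the last packet
        pkt.putInt(len);
        return pkt;
    }

    public static void main(String[] args) {
        ByteBuffer pkt = writeHeader(0L, 0L, 1000, 1000L, 512, 4);
        System.out.println("header bytes written: " + pkt.position());
        System.out.println("packetLen field: " + pkt.getInt(0));
    }
}
```

Inside the packet, all checksums are stored contiguously and are followed by all of the data, rather than being interleaved chunk by chunk.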
/* Send one packet */
private int sendChunks(ByteBuffer pkt, int maxChunks, OutputStream out) throws IOException {
// Compute the length of the data carried by this packet
int len = Math.min((int) (endOffset - offset), bytesPerChecksum*maxChunks);
if (len == 0) {
return 0;
}
// Compute how many checksum chunks this packet contains
int numChunks = (len + bytesPerChecksum - 1)/bytesPerChecksum;
int packetLen = len + numChunks*checksumSize + 4;
pkt.clear();
// Write the packet header into the buffer
pkt.putInt(packetLen);
pkt.putLong(offset);
pkt.putLong(seqno);
pkt.put((byte)((offset + len >= endOffset) ? 1 : 0));
pkt.putInt(len);
int checksumOff = pkt.position();
int checksumLen = numChunks * checksumSize;
byte[] buf = pkt.array();
// Read the checksums for the data into the buffer
if (checksumSize > 0 && checksumIn != null) {
try {
checksumIn.readFully(buf, checksumOff, checksumLen);
} catch (IOException e) {
LOG.warn(" Could not read or failed to veirfy checksum for data" + " at offset " + offset + " for block " + block + " got : " + StringUtils.stringifyException(e));
IOUtils.closeStream(checksumIn);
checksumIn = null;
if (corruptChecksumOk) {
if (checksumOff < checksumLen) {
// Just fill the array with zeros.
Arrays.fill(buf, checksumOff, checksumLen, (byte) 0);
}
} else {
throw e;
}
}
}
int dataOff = checksumOff + checksumLen;
if (blockInPosition < 0) {
// Read the data into the buffer
IOUtils.readFully(blockIn, buf, dataOff, len);
// Verify the data to be sent against its checksums
if (verifyChecksum) {
int dOff = dataOff;
int cOff = checksumOff;
int dLeft = len;
for (int i=0; i<numChunks; i++) {
checksum.reset();
int dLen = Math.min(dLeft, bytesPerChecksum);
checksum.update(buf, dOff, dLen);
if (!checksum.compare(buf, cOff)) {
throw new ChecksumException("Checksum failed at " + (offset + len - dLeft), len);
}
dLeft -= dLen;
dOff += dLen;
cOff += checksumSize;
}
}
//writing is done below (mainly to handle IOException)
}
try {
if (blockInPosition >= 0) {
//use transferTo(). Checks on out and blockIn are already done.
SocketOutputStream sockOut = (SocketOutputStream)out;
// Send the buffered header and checksums; the data itself follows via transferTo()
sockOut.write(buf, 0, dataOff);
// no need to flush. since we know out is not a buffered stream.
sockOut.transferToFully(((FileInputStream)blockIn).getChannel(), blockInPosition, len);
blockInPosition += len;
} else {
// Send the whole buffered packet (header, checksums, and data)
out.write(buf, 0, dataOff + len);
}
} catch (IOException e) {
/* exception while writing to the client (well, with transferTo(),
* it could also be while reading from the local file).
*/
throw ioeToSocketException(e);
}
if (throttler != null) { // throttle the sending rate
throttler.throttle(packetLen);
}
return len;
}
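The per-chunk verification loop in `sendChunks` can be illustrated without Hadoop. The sketch below is an approximation under stated assumptions: it uses `java.util.zip.CRC32` with a 4-byte checksum as a stand-in for Hadoop's `DataChecksum`, the class `ChunkVerifySketch` and its methods are hypothetical, and the buffer holds only checksums followed by data (no packet header).

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class ChunkVerifySketch {
    static final int CHECKSUM_SIZE = 4; // assumption: CRC32 stored as a 4-byte int

    static int numChunks(int len, int bytesPerChecksum) {
        return (len + bytesPerChecksum - 1) / bytesPerChecksum;
    }

    /** Builds a buffer laid out as [all checksums][all data]. */
    static byte[] pack(byte[] data, int bytesPerChecksum) {
        int chunks = numChunks(data.length, bytesPerChecksum);
        ByteBuffer buf = ByteBuffer.allocate(chunks * CHECKSUM_SIZE + data.length);
        CRC32 crc = new CRC32();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int n = Math.min(bytesPerChecksum, data.length - off);
            crc.reset();
            crc.update(data, off, n);
            buf.putInt((int) crc.getValue());
        }
        buf.put(data);
        return buf.array();
    }

    /** Returns the index of the first chunk whose checksum does not match, or -1. */
    static int firstCorruptChunk(byte[] buf, int dataLen, int bytesPerChecksum) {
        int chunks = numChunks(dataLen, bytesPerChecksum);
        int cOff = 0;                       // current checksum position
        int dOff = chunks * CHECKSUM_SIZE;  // current data position
        int dLeft = dataLen;                // data bytes not yet verified
        ByteBuffer view = ByteBuffer.wrap(buf);
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int dLen = Math.min(dLeft, bytesPerChecksum);
            crc.reset();
            crc.update(buf, dOff, dLen);
            if ((int) crc.getValue() != view.getInt(cOff)) {
                return i;
            }
            dLeft -= dLen;
            dOff += dLen;
            cOff += CHECKSUM_SIZE;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] buf = pack(data, 512);
        System.out.println(firstCorruptChunk(buf, data.length, 512)); // intact buffer
        buf[3 * CHECKSUM_SIZE + 600] ^= 1; // flip one bit inside the second chunk
        System.out.println(firstCorruptChunk(buf, data.length, 512)); // corrupted buffer
    }
}
```

As in `sendChunks`, the last chunk may be shorter than `bytesPerChecksum`, which is why the loop tracks the remaining data length instead of assuming full chunks.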
// Send the block's data to the receiver
long sendBlock(DataOutputStream out, OutputStream baseStream, BlockTransferThrottler throttler) throws IOException {
if( out == null ) {
throw new IOException( "out stream is null" );
}
this.throttler = throttler;
long initialOffset = offset;
long totalRead = 0;
OutputStream streamForSendChunks = out;
try {
try {
checksum.writeHeader(out); // send the checksum information
if ( chunkOffsetOK ) {
out.writeLong( offset );
}
out.flush();
} catch (IOException e) { //socket error
throw ioeToSocketException(e);
}
int maxChunksPerPacket;
int pktSize = DataNode.PKT_HEADER_LEN + SIZE_OF_INTEGER;
if (transferToAllowed && !verifyChecksum && baseStream instanceof SocketOutputStream && blockIn instanceof FileInputStream) {
FileChannel fileChannel = ((FileInputStream)blockIn).getChannel();
// blockInPosition also indicates sendChunks() uses transferTo.
blockInPosition = fileChannel.position();
streamForSendChunks = baseStream;
// Compute the maximum number of checksum chunks per packet
maxChunksPerPacket = (Math.max(BUFFER_SIZE, MIN_BUFFER_WITH_TRANSFERTO) + bytesPerChecksum - 1)/bytesPerChecksum;
// Compute the packet buffer size (header and checksums only; the data goes through transferTo())
pktSize += checksumSize * maxChunksPerPacket;
} else {
// Compute the maximum number of checksum chunks per packet
maxChunksPerPacket = Math.max(1,(BUFFER_SIZE + bytesPerChecksum - 1)/bytesPerChecksum);
// Compute the packet buffer size (header, checksums, and data)
pktSize += (bytesPerChecksum + checksumSize) * maxChunksPerPacket;
}
ByteBuffer pktBuf = ByteBuffer.allocate(pktSize);
// Send the data one packet at a time
while (endOffset > offset) {
long len = sendChunks(pktBuf, maxChunksPerPacket, streamForSendChunks);
offset += len;
totalRead += len + ((len + bytesPerChecksum - 1)/bytesPerChecksum* checksumSize);
seqno++;
}
try {
out.writeInt(0); // mark the end of the data
out.flush();
} catch (IOException e) { //socket error
throw ioeToSocketException(e);
}
} finally {
if (clientTraceFmt != null) {
ClientTraceLog.info(String.format(clientTraceFmt, totalRead));
}
close();
}
blockReadFully = (initialOffset == 0 && offset >= blockLength);
return totalRead;
}
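The packet sizing in `sendBlock` differs between the two sending paths, and a small calculation shows by how much. This is a sketch only: the class `PacketSizingSketch` is hypothetical, and the values used for `BUFFER_SIZE` (4096) and `MIN_BUFFER_WITH_TRANSFERTO` (64 KB) are assumptions, since the article does not state them. The header length of 25 bytes is the sum of the header fields in the packet format described earlier.

```java
final class PacketSizingSketch {
    /** Maximum number of checksum chunks per packet, following the two branches of sendBlock. */
    static int maxChunksPerPacket(boolean useTransferTo, int bufferSize,
                                  int minBufferWithTransferTo, int bytesPerChecksum) {
        if (useTransferTo) {
            int size = Math.max(bufferSize, minBufferWithTransferTo);
            return (size + bytesPerChecksum - 1) / bytesPerChecksum;
        }
        return Math.max(1, (bufferSize + bytesPerChecksum - 1) / bytesPerChecksum);
    }

    /** Size of the packet buffer: with transferTo() it holds no data, only header and checksums. */
    static int pktSize(boolean useTransferTo, int headerLen, int bytesPerChecksum,
                       int checksumSize, int maxChunks) {
        if (useTransferTo) {
            return headerLen + checksumSize * maxChunks;
        }
        return headerLen + (bytesPerChecksum + checksumSize) * maxChunks;
    }

    public static void main(String[] args) {
        int bufferSize = 4096;            // assumed value of BUFFER_SIZE
        int minTransferTo = 64 * 1024;    // assumed value of MIN_BUFFER_WITH_TRANSFERTO
        int headerLen = 25;               // 4 + 8 + 8 + 1 + 4 header bytes
        int bytesPerChecksum = 512;
        int checksumSize = 4;
        for (boolean transferTo : new boolean[] {false, true}) {
            int chunks = maxChunksPerPacket(transferTo, bufferSize, minTransferTo, bytesPerChecksum);
            int size = pktSize(transferTo, headerLen, bytesPerChecksum, checksumSize, chunks);
            System.out.println("transferTo=" + transferTo + " chunks=" + chunks + " pktSize=" + size);
        }
    }
}
```

Under these assumed values, the transferTo() path carries many more chunks per packet while needing a much smaller buffer, because the data bytes never pass through the buffer.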