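Hadoop Study Notes 5
This note looks at how HDFS, running in EC (erasure coding) mode, handles a target node failing while a file is being downloaded.
Process Overview
The figure above shows how a user downloads a file from HDFS. In web download mode, the user first logs in to the web server, which fetches the file's storage information from the namenode. The client then obtains that information from the web server and starts downloading from one of the servers that store the file (in practice, the first one). The selected server (called the currentnode) holds only part of the file's data; the rest is supplied by other nodes (in practice, the currentnode reads the data at the specified positions on the other datanodes). During this process, a datanode that should supply data to the currentnode may fail to connect. When that happens, the currentnode reads from a different datanode instead. In EC mode, what it reads at that point is generally a parity block, so the data has to be decoded before it can be used.
Source Code Analysis
We start the analysis from DFSClient.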
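To make the "read a parity block, then decode" step concrete, here is a minimal sketch that uses a single XOR parity chunk. It is only an illustration: the class and method names are hypothetical, and the EC policies discussed in these notes (for example Reed-Solomon or LRC) use more general codecs than plain XOR.

```java
/** Toy illustration only; not a Hadoop class. */
final class XorRecoverySketch {
    /** Computes one parity chunk as the byte-wise XOR of all data chunks. */
    static byte[] encodeParity(byte[][] dataChunks) {
        byte[] parity = new byte[dataChunks[0].length];
        for (byte[] chunk : dataChunks) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= chunk[i];
            }
        }
        return parity;
    }

    /** Rebuilds the chunk at missingIndex from the surviving data chunks and the parity chunk. */
    static byte[] recover(byte[][] dataChunks, int missingIndex, byte[] parity) {
        byte[] rebuilt = parity.clone();
        for (int d = 0; d < dataChunks.length; d++) {
            if (d == missingIndex) {
                continue; // the datanode holding this chunk could not be reached
            }
            for (int i = 0; i < rebuilt.length; i++) {
                rebuilt[i] ^= dataChunks[d][i];
            }
        }
        return rebuilt;
    }
}
```

With three data chunks and one parity chunk, losing any single data chunk still lets the reader rebuild it from the other two plus the parity, which is the same idea the striped reader relies on when a datanode is unreachable.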
/**
 * Create an input stream that obtains a nodelist from the
 * namenode, and then reads from all the right places. Creates
 * inner subclass of InputStream that does the right out-of-band
 * work.
 */
public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
    throws IOException {
  checkOpen();
  // Get block info from namenode
  try (TraceScope ignored = newPathTraceScope("newDFSInputStream", src)) {
    LocatedBlocks locatedBlocks = getLocatedBlocks(src, 0);
    return openInternal(locatedBlocks, src, verifyChecksum);
  }
}
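First comes the open() method, which returns a DFSInputStream. As its comment says, it builds an input stream from the block location information obtained from the namenode. This method calls openInternal():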
private DFSInputStream openInternal(LocatedBlocks locatedBlocks, String src,
    boolean verifyChecksum) throws IOException {
  if (locatedBlocks != null) {
    ErasureCodingPolicy ecPolicy = locatedBlocks.getErasureCodingPolicy();
    if (ecPolicy != null) {
      return new DFSStripedInputStream(this, src, verifyChecksum, ecPolicy,
          locatedBlocks);
    }
    return new DFSInputStream(this, src, verifyChecksum, locatedBlocks);
  } else {
    throw new IOException("Cannot open filename " + src);
  }
}
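This code is straightforward: it calls a different InputStream constructor depending on whether the file uses an EC policy. Next, we look at DFSStripedInputStream.
Reading starts in the readWithStrategy() method: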
protected synchronized int readWithStrategy(ReaderStrategy strategy)
    throws IOException {
  dfsClient.checkOpen();
  if (closed.get()) {
    throw new IOException("Stream closed");
  }
  int len = strategy.getTargetLength();
  CorruptedBlocks corruptedBlocks = new CorruptedBlocks();
  if (pos < getFileLength()) {
    try {
      if (pos > blockEnd) {
        blockSeekTo(pos);
      }
      int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
      synchronized (infoLock) {
        if (locatedBlocks.isLastBlockComplete()) {
          realLen = (int) Math.min(realLen,
              locatedBlocks.getFileLength() - pos);
        }
      }
      /** Number of bytes already read into buffer */
      int result = 0;
      while (result < realLen) {
        if (!curStripeRange.include(getOffsetInBlockGroup())) {
          readOneStripe(corruptedBlocks);
        }
        int ret = copyToTargetBuf(strategy, realLen - result);
        result += ret;
        pos += ret;
      }
      return result;
    } finally {
      // Check if need to report block replicas corruption either read
      // was successful or ChecksumException occurred.
      reportCheckSumFailure(corruptedBlocks, getCurrentBlockLocationsLength(),
          true);
    }
  }
  return -1;
}
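Overall, this method reads data according to a read strategy. The strategy itself is simple: it carries the target length and the destination buffer. Inside the while loop, the method calls readOneStripe(). The call is made when the current offset in the block group is not covered by curStripeRange while result < realLen, that is, the buffered stripe has been used up (or nothing has been buffered yet) and more bytes are still needed. This is the normal way the stripe buffer gets refilled; the check by itself does not mean that a block is corrupt or unavailable. Failed reads are handled further down, inside readStripe(). Let's look at readOneStripe() next: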
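The refill-then-copy loop can be sketched in isolation as follows. This is a simplified stand-in, not the Hadoop class: the in-memory byte array, the class name, and the refills counter are all assumptions made for illustration, and the real method also deals with block boundaries, locking, and checksum reporting.

```java
import java.nio.ByteBuffer;

/** Simplified stand-in for the stripe-buffer loop; not the Hadoop class. */
final class StripeBufferLoopSketch {
    private final byte[] file;          // pretend block group contents
    private final int stripeLen;        // bytes buffered per refill
    private final ByteBuffer stripeBuf;
    private long rangeStart = 0;        // first offset covered by the buffer
    private long rangeLen = 0;          // 0 means nothing is buffered yet
    private long pos = 0;
    int refills = 0;                    // how many times a stripe was loaded

    StripeBufferLoopSketch(byte[] file, int stripeLen) {
        this.file = file;
        this.stripeLen = stripeLen;
        this.stripeBuf = ByteBuffer.allocate(stripeLen);
    }

    private boolean rangeIncludes(long offset) {
        return offset >= rangeStart && offset < rangeStart + rangeLen;
    }

    /** Loads the stripe that covers pos and positions the buffer at pos. */
    private void readOneStripe() {
        long stripeStart = (pos / stripeLen) * stripeLen;
        int limit = (int) Math.min(stripeLen, file.length - stripeStart);
        stripeBuf.clear();
        stripeBuf.put(file, (int) stripeStart, limit);
        stripeBuf.position((int) (pos - stripeStart));
        stripeBuf.limit(limit);
        rangeStart = pos;
        rangeLen = limit - (pos - stripeStart);
        refills++;
    }

    /** Copies up to len bytes into target; returns bytes read, or -1 at end of file. */
    int read(byte[] target, int len) {
        if (pos >= file.length) {
            return -1;
        }
        int realLen = (int) Math.min(len, file.length - pos);
        int result = 0;
        while (result < realLen) {
            if (!rangeIncludes(pos)) {
                readOneStripe(); // buffer exhausted or empty: load the next stripe
            }
            int n = Math.min(stripeBuf.remaining(), realLen - result);
            stripeBuf.get(target, result, n);
            result += n;
            pos += n;
        }
        return result;
    }
}
```

Reading a 10-byte file with a 4-byte stripe in a single call triggers three refills (4 + 4 + 2 bytes), none of which involves a failed read.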
/**
 * Read a new stripe covering the current position, and store the data in the
 * {@link #curStripeBuf}.
 */
private void readOneStripe(CorruptedBlocks corruptedBlocks)
    throws IOException {
  resetCurStripeBuffer(true);
  // compute stripe range based on pos
  final long offsetInBlockGroup = getOffsetInBlockGroup();
  final long stripeLen = cellSize * dataBlkNum;
  final int stripeIndex = (int) (offsetInBlockGroup / stripeLen);
  final int stripeBufOffset = (int) (offsetInBlockGroup % stripeLen);
  final int stripeLimit = (int) Math.min(currentLocatedBlock.getBlockSize()
      - (stripeIndex * stripeLen), stripeLen);
  StripeRange stripeRange =
      new StripeRange(offsetInBlockGroup, stripeLimit - stripeBufOffset);
  LocatedStripedBlock blockGroup = (LocatedStripedBlock) currentLocatedBlock;
  AlignedStripe[] stripes = StripedBlockUtil.divideOneStripe(ecPolicy,
      cellSize, blockGroup, offsetInBlockGroup,
      offsetInBlockGroup + stripeRange.getLength() - 1, curStripeBuf);
  final LocatedBlock[] blks = StripedBlockUtil.parseStripedBlockGroup(
      blockGroup, cellSize, dataBlkNum, parityBlkNum, localParityBlkNum);
  // read the whole stripe
  for (AlignedStripe stripe : stripes) {
    // Parse group to get chosen DN location
    StripeReader sreader = new StatefulStripeReader(stripe, ecPolicy, blks,
        blockReaders, corruptedBlocks, decoder, this);
    sreader.readStripe();
  }
  curStripeBuf.position(stripeBufOffset);
  curStripeBuf.limit(stripeLimit);
  curStripeRange = stripeRange;
}
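As the comment says, this method reads a new stripe that covers the current position and stores the data in curStripeBuf. The divideOneStripe() and parseStripedBlockGroup() methods will be analyzed in a later note. Everything before the for loop is preparation for reading the stripe, so below we focus on the core part, readStripe():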
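The stripe range arithmetic at the top of readOneStripe() is easier to follow with numbers plugged in. The helper below is a hypothetical restatement of those four lines (the class and method names are not part of Hadoop), with blockGroupSize standing in for currentLocatedBlock.getBlockSize().

```java
/** Hypothetical helper that mirrors the range arithmetic in readOneStripe(). */
final class StripeMathSketch {
    /**
     * Returns {stripeIndex, stripeBufOffset, stripeLimit, rangeLength}
     * for a given offset inside a block group.
     */
    static long[] locate(long offsetInBlockGroup, long blockGroupSize,
                         int cellSize, int dataBlkNum) {
        long stripeLen = (long) cellSize * dataBlkNum;          // data bytes per stripe
        long stripeIndex = offsetInBlockGroup / stripeLen;      // which stripe holds the offset
        long stripeBufOffset = offsetInBlockGroup % stripeLen;  // offset inside that stripe
        // The last stripe of a block group may be shorter than a full stripe.
        long stripeLimit = Math.min(blockGroupSize - stripeIndex * stripeLen, stripeLen);
        return new long[] {stripeIndex, stripeBufOffset, stripeLimit,
            stripeLimit - stripeBufOffset};
    }
}
```

For example, with a 1 MiB cell size and 6 data blocks, a stripe holds 6,291,456 bytes. For an offset of 7,000,000 in a block group of 10,000,000 bytes, locate() returns stripe index 1, buffer offset 708,544, stripe limit 3,708,544, and a range length of 3,000,000, which is exactly the number of bytes left in the block group.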
/**
 * read the whole stripe. do decoding if necessary
 */
void readStripe() throws IOException {
  alignedStripe.missingLocalChunksNum = new int[ecPolicy.getNumLocalParityUnits()];
  // Compare codec names with equals(); != on Strings compares references.
  if (!"lrc".equals(ecPolicy.getSchema().getCodecName())) {
    for (int i = 0; i < dataBlkNum; i++) {
      if (alignedStripe.chunks[i] != null &&
          alignedStripe.chunks[i].state != StripingChunk.ALLZERO) {
        if (!readChunk(targetBlocks[i], i)) {
          alignedStripe.missingChunksNum++;
        }
      }
    }
    if (alignedStripe.missingChunksNum > 0) {
      checkMissingBlocks();
      readDataForDecoding();
      // read parity chunks
      readParityChunks(alignedStripe.missingChunksNum);
    }
  } else {
    for (int i = 0; i < dataBlkNum; i++) {
      if (alignedStripe.chunks[i] != null &&
          alignedStripe.chunks[i].state != StripingChunk.ALLZERO) {
        if (!readChunk(targetBlocks[i], i)) {
          if (i > dataBlkNum / localParityBlkNum) {
            alignedStripe.missingChunksNum++;
            alignedStripe.missingLocalChunksNum[1]++;
          } else {
            alignedStripe.missingChunksNum++;
            alignedStripe.missingLocalChunksNum[0]++;
          }
        }
      }
    }
    if (alignedStripe.missingLocalChunksNum[0] > 0 || alignedStripe.missingLocalChunksNum[1] > 0) {
      checkMissingBlocks();
      readDataForDecoding();
      // read parity chunks
      readLocalParityChunks(alignedStripe.missingLocalChunksNum);
    }
  }
  // There are missing block locations at this stage. Thus we need to read
  // the full stripe and one more parity block.
  // TODO: for a full stripe we can start reading (dataBlkNum + 1) chunks
  // Input buffers for potential decode operation, which remains null until
  // first read failure
  while (!futures.isEmpty()) {
    try {
      StripingChunkReadResult r = StripedBlockUtil
          .getNextCompletedStripedRead(service, futures, 0);
      dfsStripedInputStream.updateReadStats(r.getReadStats());
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Read task returned: " + r + ", for stripe "
            + alignedStripe);
      }
      StripingChunk returnedChunk = alignedStripe.chunks[r.index];
      Preconditions.checkNotNull(returnedChunk);
      Preconditions.checkState(returnedChunk.state == StripingChunk.PENDING);
      if (r.state == StripingChunkReadResult.SUCCESSFUL) {
        returnedChunk.state = StripingChunk.FETCHED;
        alignedStripe.fetchedChunksNum++;
        updateState4SuccessRead(r);
        if (alignedStripe.fetchedChunksNum == dataBlkNum) {
          clearFutures();
          break;
        }
      } else {
        returnedChunk.state = StripingChunk.MISSING;
        // close the corresponding reader
        dfsStripedInputStream.closeReader(readerInfos[r.index]);
        final int missing = alignedStripe.missingChunksNum;
        alignedStripe.missingChunksNum++;
        checkMissingBlocks();
        readDataForDecoding();
        readParityChunks(alignedStripe.missingChunksNum - missing);
      }
    } catch (InterruptedException ie) {
      String err = "Read request interrupted";
      DFSClient.LOG.error(err);
      clearFutures();
      // Don't decode if read interrupted
      throw new InterruptedIOException(err);
    }
  }
  if (alignedStripe.missingChunksNum > 0) {
    decode();
  }
}
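This method has been modified to support LRC codes. As described in the earlier notes, a chunk is the smallest unit used when reading a stripe and can be thought of as a cell plus parity. First, readStripe() tries to read one full stripe of data, that is, dataBlkNum chunks, and counts the chunks that could not be read in alignedStripe.missingChunksNum (in the LRC branch it also counts them per local group in missingLocalChunksNum). It then reads the parity chunks through readParityChunks() or readLocalParityChunks(). The futures collection tracks the chunk reads that have been submitted, and the while loop drains them as they complete; a read that fails at this stage also increments missingChunksNum and triggers one more parity read. Finally, if missingChunksNum > 0, meaning at least one chunk was lost (in which case parity chunks must have been read), the method calls decode() to reconstruct the data.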
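The bookkeeping in the non-LRC branch can be summarized with a small sketch. The names below are hypothetical, and the rule that a stripe becomes unreadable once more chunks are missing than there are parity blocks is an assumption about what checkMissingBlocks() enforces, since that method is not shown in these notes. The sketch throws an unchecked exception purely to stay short.

```java
/** Hypothetical summary of the missing-chunk bookkeeping; not Hadoop code. */
final class ReadStripeFlowSketch {
    /** What the reader has to do for one aligned stripe. */
    static final class Plan {
        final int missingChunks;   // data chunks that could not be read
        final int parityReads;     // parity chunks fetched as substitutes
        final boolean needsDecode; // true when any data chunk is missing

        Plan(int missingChunks, int parityReads, boolean needsDecode) {
            this.missingChunks = missingChunks;
            this.parityReads = parityReads;
            this.needsDecode = needsDecode;
        }
    }

    /**
     * @param dataChunkReadable one flag per data chunk, false if its datanode failed
     * @param parityBlkNum number of parity blocks in the EC policy
     */
    static Plan plan(boolean[] dataChunkReadable, int parityBlkNum) {
        int missing = 0;
        for (boolean readable : dataChunkReadable) {
            if (!readable) {
                missing++;
            }
        }
        // Assumption: more missing chunks than parity blocks cannot be recovered.
        if (missing > parityBlkNum) {
            throw new IllegalStateException(missing + " chunks missing, but only "
                + parityBlkNum + " parity blocks are available");
        }
        // One parity chunk is read for every missing data chunk.
        return new Plan(missing, missing, missing > 0);
    }
}
```

With 6 data chunks and 3 parity blocks, one unreachable datanode yields a plan with one parity read followed by a decode, while a stripe whose data chunks are all readable needs neither.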