Hadoop 3.1.3 Study Notes 2
This post examines the StripedBlockReconstruction process, i.e. how a damaged block is recovered under striped storage. The discussion involves related classes such as StripedReader and StripedWriter, which will be introduced later.
/**
* StripedBlockReconstructor reconstruct one or more missed striped block in
* the striped block group, the minimum number of live striped blocks should
* be no less than data block number.
*/
The comment gives an overview of what StripedBlockReconstructor does: under an erasure-coding policy, at least k live blocks (the number of data blocks) are required to fully recover the data. StripedBlockReconstructor extends StripedReconstructor and implements its abstract method reconstruct, the single most important step of the whole process: the "reconstruction" itself. The parent class StripedReconstructor describes this process in detail:
/**
* StripedReconstructor reconstruct one or more missed striped block in the
* striped block group, the minimum number of live striped blocks should be
* no less than data block number.
*
* | <- Striped Block Group -> |
*  blk_0      blk_1       blk_2(*)   blk_3   ...   <- A striped block group
*    |          |           |          |
*    v          v           v          v
* +------+   +------+   +------+   +------+
* |cell_0|   |cell_1|   |cell_2|   |cell_3|  ...
* +------+   +------+   +------+   +------+
* |cell_4|   |cell_5|   |cell_6|   |cell_7|  ...
* +------+   +------+   +------+   +------+
* |cell_8|   |cell_9|   |cell10|   |cell11|  ...
* +------+   +------+   +------+   +------+
*  ...         ...       ...         ...
*
*
* We use following steps to reconstruct striped block group, in each round, we
* reconstruct <code>bufferSize</code> data until finish, the
* <code>bufferSize</code> is configurable and may be less or larger than
* cell size:
* step1: read <code>bufferSize</code> data from minimum number of sources
* required by reconstruction.
* step2: decode data for targets.
* step3: transfer data to targets.
*
* In step1, try to read <code>bufferSize</code> data from minimum number
* of sources , if there is corrupt or stale sources, read from new source
* will be scheduled. The best sources are remembered for next round and
* may be updated in each round.
*
* In step2, typically if source blocks we read are all data blocks, we
* need to call encode, and if there is one parity block, we need to call
* decode. Notice we only read once and reconstruct all missed striped block
* if they are more than one.
*
* In step3, send the reconstructed data to targets by constructing packet
* and send them directly. Same as continuous block replication, we
* don't check the packet ack. Since the datanode doing the reconstruction work
* are one of the source datanodes, so the reconstructed data are sent
* remotely.
*
* There are some points we can do further improvements in next phase:
* 1. we can read the block file directly on the local datanode,
* currently we use remote block reader. (Notice short-circuit is not
* a good choice, see inline comments).
* 2. We need to check the packet ack for EC reconstruction? Since EC
* reconstruction is more expensive than continuous block replication,
* it needs to read from several other datanodes, should we make sure the
* reconstructed result received by targets?
*/
The comment shows that reconstruction proceeds in three steps: first, read bufferSize bytes of data from the "minimum number of sources required by reconstruction"; second, encode or decode that data; third, send the resulting data to the target datanodes.
The comment also highlights the key issues in each step. In step 1, if a source turns out to be corrupt or stale, a read from a new source is scheduled; a list of "best sources" is remembered and may be updated in each round. In step 2, "decode" is used loosely for the whole repair computation: if only parity blocks are missing, the data blocks are read and the encode method is called; if a data block is missing, parity blocks are among the sources read and the decode method is called. Note that the sources are read only once even if more than one striped block is missing. Step 3 introduces the notion of a "packet", in the usual networking sense, as the unit in which reconstructed data is sent to the targets; the comment stresses that the datanode performing the reconstruction is itself one of the source datanodes, so the reconstructed data is always sent remotely. The authors even list two possible improvements for a future phase.
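The encode-versus-decode distinction in step 2 can be made concrete with a toy single-parity code, where XOR plays the role of both the encoder and the decoder. This is only an illustrative sketch (the class and method names here are invented), not the RawErasureDecoder machinery HDFS actually uses:

```java
import java.util.Arrays;

public class XorReconstructDemo {
    // XOR all the given cells together. With a single-parity code this
    // both "encodes" (data cells -> parity cell) and "decodes"
    // (surviving data cells + parity cell -> the one missing cell).
    static byte[] xorCells(byte[]... cells) {
        byte[] out = new byte[cells[0].length];
        for (byte[] cell : cells) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= cell[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3};
        byte[] d1 = {4, 5, 6};
        // Lost the parity cell: re-encode it from the data cells.
        byte[] parity = xorCells(d0, d1);
        // Lost data cell d1: decode it from the surviving data + parity.
        byte[] rebuiltD1 = xorCells(d0, parity);
        System.out.println(Arrays.equals(rebuiltD1, d1)); // true
    }
}
```

With one parity cell, re-encoding a lost parity cell and decoding a lost data cell are the same XOR; with Reed-Solomon and several parity cells, the decoder additionally needs to know which indices are missing.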
Now let's return to StripedBlockReconstructor, starting with its run method:
public void run() {
  try {
    initDecoderIfNecessary();
    getStripedReader().init();
    stripedWriter.init();
    reconstruct();
    stripedWriter.endTargetBlocks();
    // Currently we don't check the acks for packets, this is similar as
    // block replication.
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    getDatanode().decrementXmitsInProgress(getXmits());
    final DataNodeMetrics metrics = getDatanode().getMetrics();
    metrics.incrECReconstructionTasks();
    metrics.incrECReconstructionBytesRead(getBytesRead());
    metrics.incrECReconstructionRemoteBytesRead(getRemoteBytesRead());
    metrics.incrECReconstructionBytesWritten(getBytesWritten());
    getStripedReader().close();
    stripedWriter.close();
    cleanup();
  }
}
The run method corresponds to the reconstruction process described in the comment above: initialize the decoder, initialize the striped reader and the striped writer, perform the reconstruction, and finally send the data to the target datanodes. Let's look first at the key part, the reconstruct method:
void reconstruct() throws IOException {
  while (getPositionInBlock() < getMaxTargetLength()) {
    DataNodeFaultInjector.get().stripedBlockReconstruction();
    long remaining = getMaxTargetLength() - getPositionInBlock();
    final int toReconstructLen =
        (int) Math.min(getStripedReader().getBufferSize(), remaining);
    long start = Time.monotonicNow();
    // step1: read from minimum source DNs required for reconstruction.
    // The returned success list is the source DNs we do real read from
    getStripedReader().readMinimumSources(toReconstructLen);
    long readEnd = Time.monotonicNow();
    // step2: decode to reconstruct targets
    reconstructTargets(toReconstructLen);
    long decodeEnd = Time.monotonicNow();
    // step3: transfer data
    if (stripedWriter.transferData2Targets() == 0) {
      String error = "Transfer failed for all targets.";
      throw new IOException(error);
    }
    long writeEnd = Time.monotonicNow();
    // Only the succeed reconstructions are recorded.
    final DataNodeMetrics metrics = getDatanode().getMetrics();
    metrics.incrECReconstructionReadTime(readEnd - start);
    metrics.incrECReconstructionDecodingTime(decodeEnd - readEnd);
    metrics.incrECReconstructionWriteTime(writeEnd - decodeEnd);
    updatePositionInBlock(toReconstructLen);
    clearBuffers();
  }
}
Let's analyze this method in detail. Step 1 reads from the minimum required sources; start and readEnd only record timing, so we can ignore them. readMinimumSources takes a single parameter, toReconstructLen, the length of data to reconstruct, in bytes:
/**
* Read from minimum source DNs required for reconstruction in the iteration.
* First try the success list which we think they are the best DNs
* If source DN is corrupt or slow, try to read some other source DN,
* and will update the success list.
*
* Remember the updated success list and return it for following
* operations and next iteration read.
*
* @param reconstructLength the length to reconstruct.
* @return updated success list of source DNs we do real read
* @throws IOException
*/
void readMinimumSources(int reconstructLength) throws IOException {
  CorruptedBlocks corruptedBlocks = new CorruptedBlocks();
  try {
    successList = doReadMinimumSources(reconstructLength, corruptedBlocks);
  } finally {
    // report corrupted blocks to NN
    datanode.reportCorruptedBlocks(corruptedBlocks);
  }
}
readMinimumSources in turn calls doReadMinimumSources to do the actual source reads. Setting aside block corruption for now, let's look at doReadMinimumSources:
int[] doReadMinimumSources(int reconstructLength,
                           CorruptedBlocks corruptedBlocks)
    throws IOException {
  Preconditions.checkArgument(reconstructLength >= 0 &&
      reconstructLength <= bufferSize);
  int nSuccess = 0;
  int[] newSuccess = new int[minRequiredSources];
  BitSet usedFlag = new BitSet(sources.length);
  /*
   * Read from minimum source DNs required, the success list contains
   * source DNs which we think best.
   */
  for (int i = 0; i < minRequiredSources; i++) {
    StripedBlockReader reader = readers.get(successList[i]);
    int toRead = getReadLength(liveIndices[successList[i]],
        reconstructLength);
    if (toRead > 0) {
      Callable<BlockReadStats> readCallable =
          reader.readFromBlock(toRead, corruptedBlocks);
      Future<BlockReadStats> f = readService.submit(readCallable);
      futures.put(f, successList[i]);
    } else {
      // If the read length is 0, we don't need to do real read
      reader.getReadBuffer().position(0);
      newSuccess[nSuccess++] = successList[i];
    }
    usedFlag.set(successList[i]);
  }
  while (!futures.isEmpty()) {
    try {
      StripingChunkReadResult result =
          StripedBlockUtil.getNextCompletedStripedRead(
              readService, futures, stripedReadTimeoutInMills);
      int resultIndex = -1;
      if (result.state == StripingChunkReadResult.SUCCESSFUL) {
        resultIndex = result.index;
      } else if (result.state == StripingChunkReadResult.FAILED) {
        // If read failed for some source DN, we should not use it anymore
        // and schedule read from another source DN.
        StripedBlockReader failedReader = readers.get(result.index);
        failedReader.closeBlockReader();
        resultIndex = scheduleNewRead(usedFlag,
            reconstructLength, corruptedBlocks);
      } else if (result.state == StripingChunkReadResult.TIMEOUT) {
        // If timeout, we also schedule a new read.
        resultIndex = scheduleNewRead(usedFlag,
            reconstructLength, corruptedBlocks);
      }
      if (resultIndex >= 0) {
        newSuccess[nSuccess++] = resultIndex;
        if (nSuccess >= minRequiredSources) {
          // cancel remaining reads if we read successfully from minimum
          // number of source DNs required by reconstruction.
          cancelReads(futures.keySet());
          futures.clear();
          break;
        }
      }
    } catch (InterruptedException e) {
      LOG.info("Read data interrupted.", e);
      cancelReads(futures.keySet());
      futures.clear();
      break;
    }
  }
  if (nSuccess < minRequiredSources) {
    String error = "Can't read data from minimum number of sources "
        + "required by reconstruction, block id: " +
        reconstructor.getBlockGroup().getBlockId();
    throw new IOException(error);
  }
  return newSuccess;
}
The logic here is fairly involved and took me a while to work through. First, per the process described in the Javadoc, data is read from the datanodes considered the "best sources" (the for loop over successList). toRead is the number of bytes to read from that source: if toRead > 0, a real read is needed, so it is submitted and tracked in futures for later collection; if toRead <= 0, no read is needed and the source is added to newSuccess directly. This raises a question: what exactly is the "length to read", and how is it computed? It is determined by getReadLength, so let's look at that method:
private int getReadLength(int index, int reconstructLength) {
  // the reading length should not exceed the length for reconstruction
  long blockLen = reconstructor.getBlockLen(index);
  long remaining = blockLen - reconstructor.getPositionInBlock();
  return (int) Math.min(remaining, reconstructLength);
}
This method is very simple. blockLen is the length of the block in bytes, and remaining is how many bytes are left to read. getPositionInBlock returns the current read position within the block; position starts at 0 and is advanced by reconstructLength after each round:
position += reconstructLength;
reconstructLength is at most bufferSize, the size of the read buffer; in other words, at most one buffer's worth of bytes is read per round. The method returns the smaller of remaining and reconstructLength, which is easy to interpret: if more than a buffer's worth of data remains, read a full buffer; otherwise, read whatever is left.
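The interplay of blockLen, position, and bufferSize can be checked with concrete numbers. The sketch below mirrors the min(remaining, reconstructLength) logic in standalone form (the class and method names are hypothetical, not the HDFS ones):

```java
public class ReadLengthDemo {
    // Mirrors the getReadLength logic: never read past the end of the
    // block, and never read more than one buffer per round.
    static int readLength(long blockLen, long positionInBlock, int bufferSize) {
        long remaining = blockLen - positionInBlock;
        return (int) Math.min(remaining, bufferSize);
    }

    public static void main(String[] args) {
        // A 10 MB block read with a 4 MB buffer takes rounds of 4, 4, 2 MB.
        int mb = 1024 * 1024;
        long blockLen = 10L * mb;
        long pos = 0;
        while (pos < blockLen) {
            int toRead = readLength(blockLen, pos, 4 * mb);
            System.out.println(toRead / mb); // prints 4, then 4, then 2
            pos += toRead;                   // position += reconstructLength
        }
    }
}
```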
Returning to doReadMinimumSources: the second half of the method collects results from the futures as the reads complete; if a read fails or times out, a new read is scheduled from another source datanode, until minRequiredSources reads have succeeded. With that, step 1 of reconstruct is complete.
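The submit-then-collect pattern of that second half can be sketched in miniature with an ExecutorCompletionService: submit reads to the first minRequired sources, take whichever read completes first, and on failure fall back to an as-yet-unused source. This is a simplified stand-in (no timeouts, no read buffers, invented names), not the actual StripedReader code:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MinimumReadsDemo {
    // Submit "reads" to the first minRequired sources; whenever one
    // fails, schedule a read from the next unused source (like
    // scheduleNewRead), until minRequired reads have succeeded or the
    // sources run out. Returns the number of successful reads.
    static int readMinimum(int minRequired, List<Integer> sources,
                           Set<Integer> badSources) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(sources.size());
        CompletionService<Integer> readService =
            new ExecutorCompletionService<>(pool);
        Iterator<Integer> spare =
            sources.subList(minRequired, sources.size()).iterator();
        for (int i = 0; i < minRequired; i++) {
            final int idx = sources.get(i);
            readService.submit(() -> {
                if (badSources.contains(idx)) throw new IOException("corrupt");
                return idx; // a real reader would return read stats here
            });
        }
        int nSuccess = 0, pending = minRequired;
        while (nSuccess < minRequired && pending > 0) {
            Future<Integer> f = readService.take(); // next completed read
            pending--;
            try {
                f.get();
                nSuccess++;
            } catch (ExecutionException failed) {
                if (spare.hasNext()) {              // schedule a new read
                    final int idx = spare.next();
                    readService.submit(() -> idx);
                    pending++;
                }
            }
        }
        pool.shutdown();
        return nSuccess;
    }

    public static void main(String[] args) throws InterruptedException {
        // Source 1 is corrupt: the read falls back to source 2.
        System.out.println(readMinimum(2, List.of(0, 1, 2), Set.of(1)));
    }
}
```

The real code layers two things on top of this skeleton: a timeout state alongside success/failure, and the rule that a source which ever failed is closed and never reused (the usedFlag bitset).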
Step 2 of reconstruct is reconstructTargets, which ultimately calls doDecode; the concrete decode method invoked depends on the erasure-coding codec in use.
Step 3 writes to the output stream, sending the data in the form of packets:
/**
* buf is pointed into like follows:
* (C is checksum data, D is payload data)
*
* [_________CCCCCCCCC________________DDDDDDDDDDDDDDDD___]
*           ^        ^               ^               ^
*           |        checksumPos     dataStart       dataPos
*           checksumStart
*
* Right before sending, we move the checksum data to immediately precede
* the actual data, and then insert the header into the buffer immediately
* preceding the checksum data, so we make sure to keep enough space in
* front of the checksum data to support the largest conceivable header.
*/
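The buffer compaction this comment describes (sliding the checksum bytes down so they sit immediately before the payload, leaving headroom in front for the header) can be sketched with plain array arithmetic; the offsets and sizes below are made up for illustration, and this is not the actual DFSPacket code:

```java
import java.util.Arrays;

public class PacketLayoutDemo {
    // Hypothetical layout: [ header gap | checksum | gap | data ].
    // Slide the checksum so it immediately precedes the data, then
    // return the now-contiguous checksum + data region.
    static String compactForSend() {
        byte[] buf = new byte[32];
        int checksumStart = 8, checksumLen = 4;
        int dataStart = 20, dataLen = 8;
        Arrays.fill(buf, checksumStart, checksumStart + checksumLen, (byte) 'C');
        Arrays.fill(buf, dataStart, dataStart + dataLen, (byte) 'D');
        int newChecksumStart = dataStart - checksumLen; // 16
        System.arraycopy(buf, checksumStart, buf, newChecksumStart, checksumLen);
        // On the wire the packet is [header][checksum][data]; the header
        // would be written into the space in front of newChecksumStart.
        return new String(buf, newChecksumStart, checksumLen + dataLen);
    }

    public static void main(String[] args) {
        System.out.println(compactForSend()); // CCCCDDDDDDDD
    }
}
```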
Finally, reconstruct updates the position and clears the buffers for the next round, repeating until the position reaches maxTargetLength, i.e. the size of the block being recovered.