在阅读这篇文章前,建议先阅读这篇文章。
我们回到DFSInputStream中的read函数,这些函数的逻辑都差不多,都是会创建BlockReaderFactory类对象,并执行该对象的build函数,相应的代码我在之前的文章中都有讲述,可以翻看我之前的文章,我们这里讲解函数
public synchronized ByteBuffer read(ByteBufferPool bufferPool,int maxLength, EnumSet<ReadOption> opts)
中的另外一部分,就是零拷贝。相应的代码如下:
@Override
/*这里bufferPool用来存储读出来的数据,
maxLength表示读取数据的大小
opts为读数据方式(比如SKIP_CHECKSUMS,这个表示跳过文件校验)
*/
public synchronized ByteBuffer read(ByteBufferPool bufferPool,
int maxLength, EnumSet<ReadOption> opts)
throws IOException, UnsupportedOperationException {
if (maxLength == 0) {
return EMPTY_BUFFER;
} else if (maxLength < 0) {
throw new IllegalArgumentException("can't read a negative " +
"number of bytes.");
}
//如果blockReader为null或者blockEnd为-1(也就是当前块无效或者块对象还没初始化)
if ((blockReader == null) || (blockEnd == -1)) {
//如果当前没有可读的数据,那么就返回null
if (pos >= getFileLength()) {
return null;
}
/*
* If we don't have a blockReader, or the one we have has no more bytes
* left to read, we call seekToBlockSource to get a new blockReader and
* recalculate blockEnd. Note that we assume we're not at EOF here
* (we check this above).
*/
/*根据pos获取对应的数据块,这里用!多此一举,因为seekToBlockSource函数要么抛出异常,要么返回true,所以!seekToBlockSource(pos)永远都为false
如果该函数返回false或者blockReader为null那么就抛出异常
*/
if ((!seekToBlockSource(pos)) || (blockReader == null)) {
throw new IOException("failed to allocate new BlockReader " +
"at position " + pos);
}
}
ByteBuffer buffer = null;
//判断是否支持零拷贝方式
if (dfsClient.getConf().shortCircuitMmapEnabled) {
buffer = tryReadZeroCopy(maxLength, opts);
}
if (buffer != null) {
return buffer;
}
//如果零拷贝不成功,那么会退化为一个普通的读取
buffer = ByteBufferUtil.fallbackRead(this, bufferPool, maxLength);
if (buffer != null) {
//将数据放入到extendedReadBuffers中
extendedReadBuffers.put(buffer, bufferPool);
}
return buffer;
}
我们来分析tryReadZeroCopy函数,代码如下:
private synchronized ByteBuffer tryReadZeroCopy(int maxLength,
EnumSet<ReadOption> opts) throws IOException {
// Copy 'pos' and 'blockEnd' to local variables to make it easier for the
// JVM to optimize this function.
final long curPos = pos;
final long curEnd = blockEnd;
final long blockStartInFile = currentLocatedBlock.getStartOffset();
final long blockPos = curPos - blockStartInFile;
// Shorten this read if the end of the block is nearby.
long length63;
if ((curPos + maxLength) <= (curEnd + 1)) {
length63 = maxLength;
} else {
length63 = 1 + curEnd - curPos;
if (length63 <= 0) {
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Unable to perform a zero-copy read from offset " +
curPos + " of " + src + "; " + length63 + " bytes left in block. " +
"blockPos=" + blockPos + "; curPos=" + curPos +
"; curEnd=" + curEnd);
}
return null;
}
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Reducing read length from " + maxLength +
" to " + length63 + " to avoid going more than one byte " +
"past the end of the block. blockPos=" + blockPos +
"; curPos=" + curPos + "; curEnd=" + curEnd);
}
}
// Make sure that don't go beyond 31-bit offsets in the MappedByteBuffer.
int length;
if (blockPos + length63 <= Integer.MAX_VALUE) {
length = (int)length63;
} else {
long length31 = Integer.MAX_VALUE - blockPos;
if (length31 <= 0) {
// Java ByteBuffers can't be longer than 2 GB, because they use
// 4-byte signed integers to represent capacity, etc.
// So we can't mmap the parts of the block higher than the 2 GB offset.
// FIXME: we could work around this with multiple memory maps.
// See HDFS-5101.
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Unable to perform a zero-copy read from offset " +
curPos + " of " + src + "; 31-bit MappedByteBuffer limit " +
"exceeded. blockPos=" + blockPos + ", curEnd=" + curEnd);
}
return null;
}
length = (int)length31;
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Reducing read length from " + maxLength +
" to " + length + " to avoid 31-bit limit. " +
"blockPos=" + blockPos + "; curPos=" + curPos +
"; curEnd=" + curEnd);
}
}
final ClientMmap clientMmap = blockReader.getClientMmap(opts);
if (clientMmap == null) {
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("unable to perform a zero-copy read from offset " +
curPos + " of " + src + "; BlockReader#getClientMmap returned " +
"null.");
}
return null;
}
boolean success = false;
ByteBuffer buffer;
try {
seek(curPos + length);
//直接从映射内存中读取相应的数据
buffer = clientMmap.getMappedByteBuffer().asReadOnlyBuffer();
buffer.position((int)blockPos);
buffer.limit((int)(blockPos + length));
extendedReadBuffers.put(buffer, clientMmap);
readStatistics.addZeroCopyBytes(length);
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("readZeroCopy read " + length +
" bytes from offset " + curPos + " via the zero-copy read " +
"path. blockEnd = " + blockEnd);
}
success = true;
} finally {
if (!success) {
IOUtils.closeQuietly(clientMmap);
}
}
return buffer;
}
getClientMmap函数代码如下:
/**
* Get or create a memory map for this replica.
*
* There are two kinds of ClientMmap objects we could fetch here: one that
* will always read pre-checksummed data, and one that may read data that
* hasn't been checksummed.
*
* If we fetch the former, "safe" kind of ClientMmap, we have to increment
* the anchor count on the shared memory slot. This will tell the DataNode
* not to munlock the block until this ClientMmap is closed.
* If we fetch the latter, we don't bother with anchoring.
*
* @param opts The options to use, such as SKIP_CHECKSUMS.
*
* @return null on failure; the ClientMmap otherwise.
*/
@Override
public ClientMmap getClientMmap(EnumSet<ReadOption> opts) {
//如果需要验证校验
boolean anchor = verifyChecksum &&
(opts.contains(ReadOption.SKIP_CHECKSUMS) == false);
if (anchor) {
//如果不能免校验,说明存在问题
if (!createNoChecksumContext()) {
if (LOG.isTraceEnabled()) {
LOG.trace("can't get an mmap for " + block + " of " + filename +
" since SKIP_CHECKSUMS was not given, " +
"we aren't skipping checksums, and the block is not mlocked.");
}
return null;
}
}
ClientMmap clientMmap = null;
try {
//得到一个映射内存类对象
clientMmap = replica.getOrCreateClientMmap(anchor);
} finally {
if ((clientMmap == null) && anchor) {
//如果映射内存类对象为null同时需要验证文件校验和,那么就需要释放掉之前的锚
releaseNoChecksumContext();
}
}
return clientMmap;
}
createNoChecksumContext函数代码如下:
//返回false表示
private boolean createNoChecksumContext() {
if (verifyChecksum) {
//如果存在存储类型,且存储属于不可持久化的,不可持久化的一律不进行文件校验
if (storageType != null && storageType.isTransient()) {
// Checksums are not stored for replicas on transient storage. We do not
// anchor, because we do not intend for client activity to block eviction
// from transient storage on the DataNode side.
return true;
} else {
//如果数据的存储类型为持久类型那么就给datanode上的该块数据添加一个免文件校验的锚
return replica.addNoChecksumAnchor();
}
} else {
return true;
}
}
getOrCreateClientMmap函数最终会调用ClientMmap类中的getOrCreateClientMmap函数,该函数代码如下:
ClientMmap getOrCreateClientMmap(ShortCircuitReplica replica,
boolean anchored) {
Condition newCond;
lock.lock();
try {
while (replica.mmapData != null) {
//如果已经有值,那么直接创建ClientMmap类对象
if (replica.mmapData instanceof MappedByteBuffer) {
//添加对ShortCircuitReplica类对象的引用
ref(replica);
MappedByteBuffer mmap = (MappedByteBuffer)replica.mmapData;
return new ClientMmap(replica, mmap, anchored);
} else if (replica.mmapData instanceof Long) {
long lastAttemptTimeMs = (Long)replica.mmapData;
long delta = Time.monotonicNow() - lastAttemptTimeMs;
if (delta < mmapRetryTimeoutMs) {
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": can't create client mmap for " +
replica + " because we failed to " +
"create one just " + delta + "ms ago.");
}
return null;
}
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": retrying client mmap for " + replica +
", " + delta + " ms after the previous failure.");
}
} else if (replica.mmapData instanceof Condition) {
Condition cond = (Condition)replica.mmapData;
cond.awaitUninterruptibly();
} else {
Preconditions.checkState(false, "invalid mmapData type " +
replica.mmapData.getClass().getName());
}
}
newCond = lock.newCondition();
replica.mmapData = newCond;
} finally {
lock.unlock();
}
MappedByteBuffer map = replica.loadMmapInternal();
lock.lock();
try {
if (map == null) {
replica.mmapData = Long.valueOf(Time.monotonicNow());
newCond.signalAll();
return null;
} else {
outstandingMmapCount++;
replica.mmapData = map;
//添加对ShortCircuitReplica类对象的引用
ref(replica);
newCond.signalAll();
return new ClientMmap(replica, map, anchored);
}
} finally {
lock.unlock();
}
}
总结一下零拷贝的过程:
1、获取要读取数据所在的块,并获取到该块文件对应的映射内存
2、直接从该映射内存中获取相应的数据