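2. BlockReader
DFSInputStream.read() reads block data through a BlockReader object, which is driven from the doRead() call.
BlockReader is an interface that abstracts reading a block from a specified Datanode. It has three main implementations: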
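(1) BlockReaderLocal:
The BlockReader that performs local short-circuit reads (see the section on short-circuit reads). When the client and the Datanode are on the same physical machine, the client can read the data directly from the local disk, bypassing the Datanode process and improving read performance.
(2) BlockReaderLocalLegacy:
The older version of BlockReaderLocal. When the client and the Datanode are on the same machine (determined by IP address), the client reads the data directly from disk. The legacy implementation requires the client to have permission on the Datanode's data directories, which may introduce security problems (see HDFS-2246).
(3) RemoteBlockReader2:
Reads blocks from the Datanode over TCP (it can also run over a domain socket, as described later in this section).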
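Methods of the BlockReader interface:
<1> read(), readFully(), readAll(): read data into a byte[] array.
<2> skip(): skip a number of bytes in the block.
<3> available(): the number of bytes that can be read from the current input stream without performing another network I/O.
<4> isLocal(): whether this is a local read, that is, whether the client and the block are on the same machine.
<5> isShortCircuit(): whether this is a short-circuit read. Note that a short-circuit read is always a local read.
<6> getClientMmap(): obtain a memory-mapped region for the current read (see zero-copy reads).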
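To make the contract of these methods concrete, here is a minimal sketch. SimpleBlockReader and InMemoryBlockReader are hypothetical names invented for illustration; they mirror only part of the method list above and are not the Hadoop interface.

```java
import java.nio.ByteBuffer;

/** Hypothetical, trimmed-down mirror of the methods listed above; not the Hadoop interface. */
interface SimpleBlockReader {
    int read(byte[] buf, int off, int len);  // returns -1 once the block is exhausted
    long skip(long n);                       // returns the number of bytes actually skipped
    int available();                         // bytes readable without further I/O
    boolean isLocal();
    boolean isShortCircuit();
}

/** In-memory stand-in for a block, used only to exercise the contract. */
final class InMemoryBlockReader implements SimpleBlockReader {
    private final ByteBuffer data;

    InMemoryBlockReader(byte[] block) {
        this.data = ByteBuffer.wrap(block);
    }

    @Override
    public int read(byte[] buf, int off, int len) {
        if (!data.hasRemaining()) {
            return -1;
        }
        int n = Math.min(len, data.remaining());
        data.get(buf, off, n);
        return n;
    }

    @Override
    public long skip(long n) {
        if (n <= 0) {
            return 0;
        }
        int skipped = (int) Math.min(n, data.remaining());
        data.position(data.position() + skipped);
        return skipped;
    }

    @Override
    public int available() {
        return data.remaining();
    }

    @Override
    public boolean isLocal() {
        return true;  // the "block" lives in this process
    }

    @Override
    public boolean isShortCircuit() {
        return true;  // a short-circuit read is always a local read
    }
}
```

A caller would typically loop on read() until it returns -1, exactly as with any InputStream-style API.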
First, let's look at how BlockReaderFactory constructs a BlockReader object.
DFSInputStream.read() calls readWithStrategy(), which in turn calls blockSeekTo(); blockSeekTo() creates the BlockReader with the following code:
blockReader = new BlockReaderFactory(dfsClient.getConf()).
    setInetSocketAddress(targetAddr).
    setRemotePeerFactory(dfsClient).
    setDatanodeInfo(chosenNode).
    setStorageType(storageType).
    setFileName(src).
    setBlock(blk).
    setBlockToken(accessToken).
    setStartOffset(offsetIntoBlock).
    setVerifyChecksum(verifyChecksum).
    setClientName(dfsClient.clientName).
    setLength(blk.getNumBytes() - offsetIntoBlock).
    setCachingStrategy(cachingStrategy).
    setAllowShortCircuitLocalReads(!shortCircuitForbidden()).
    setClientCacheContext(dfsClient.getClientContext()).
    setUserGroupInformation(dfsClient.ugi).
    setConfiguration(dfsClient.getConfiguration()).
    build();
The code of build() is as follows:
public BlockReader build() throws IOException {
  BlockReader reader = null;
  Preconditions.checkNotNull(configuration);
  if (conf.shortCircuitLocalReads && allowShortCircuitLocalReads) {
    if (clientContext.getUseLegacyBlockReaderLocal()) {
      reader = getLegacyBlockReaderLocal();
      if (reader != null) {
        if (LOG.isTraceEnabled()) {
          LOG.trace(this + ": returning new legacy block reader local.");
        }
        return reader;
      }
    } else {
      reader = getBlockReaderLocal();
      if (reader != null) {
        if (LOG.isTraceEnabled()) {
          LOG.trace(this + ": returning new block reader local.");
        }
        return reader;
      }
    }
  }
  if (conf.domainSocketDataTraffic) {
    reader = getRemoteBlockReaderFromDomain();
    if (reader != null) {
      if (LOG.isTraceEnabled()) {
        LOG.trace(this + ": returning new remote block reader using " +
            "UNIX domain socket on " + pathInfo.getPath());
      }
      return reader;
    }
  }
  Preconditions.checkState(!DFSInputStream.tcpReadsDisabledForTesting,
      "TCP reads were disabled for testing, but we failed to " +
      "do a non-TCP read.");
  return getRemoteBlockReaderFromTcp();
}
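build() first tries to create a local short-circuit reader (the legacy or the new implementation, depending on the client context); short-circuit reads avoid the overhead of socket communication. If no short-circuit reader can be created and dfs.client.domain.socket.data.traffic is enabled (it defaults to false), build() tries a domain-socket reader, which transfers data locally over a UNIX domain socket. If neither attempt succeeds, it creates a remote reader that reads the data over TCP.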
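The fallback order can be sketched as a chain of attempts in which the first success wins. This is an illustrative reduction under stated assumptions, not the factory itself: ReaderSelectionSketch, its parameters, and the use of plain strings in place of reader objects are all hypothetical.

```java
import java.util.Optional;
import java.util.function.Supplier;

final class ReaderSelectionSketch {
    /** Returns the first reader that could be created, trying the cheapest path first. */
    static String build(boolean shortCircuitEnabled,
                        boolean shortCircuitAllowed,
                        boolean domainSocketDataTraffic,
                        Supplier<Optional<String>> shortCircuit,
                        Supplier<Optional<String>> domainSocket,
                        Supplier<String> tcp) {
        if (shortCircuitEnabled && shortCircuitAllowed) {
            Optional<String> reader = shortCircuit.get();
            if (reader.isPresent()) {
                return reader.get();
            }
        }
        if (domainSocketDataTraffic) {
            Optional<String> reader = domainSocket.get();
            if (reader.isPresent()) {
                return reader.get();
            }
        }
        // Last resort: a plain TCP reader, which is assumed to always be constructible here.
        return tcp.get();
    }

    public static void main(String[] args) {
        // Short-circuit fails and domain-socket traffic is switched off, so TCP is chosen.
        String chosen = build(true, true, false,
            Optional::empty, () -> Optional.of("domain"), () -> "tcp");
        System.out.println(chosen);  // prints "tcp"
    }
}
```

Passing the attempts as suppliers keeps each one lazy: a more expensive path is only tried after the cheaper one has actually failed.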
<1> getLegacyBlockReaderLocal()
(1) The method first checks whether the client and the Datanode are on the same machine; if they are not, it returns null.
(2) It then calls BlockReaderLocalLegacy.newBlockReader(conf, userGroupInformation, configuration, fileName, block, token, datanode, startOffset, length, storageType) to create a BlockReaderLocalLegacy object. The code of newBlockReader() is as follows:
/**
* The only way this object can be instantiated.
*/
static BlockReaderLocalLegacy newBlockReader(DFSClient.Conf conf,
UserGroupInformation userGroupInformation,
Configuration configuration, String file, ExtendedBlock blk,
Token<BlockTokenIdentifier> token, DatanodeInfo node,
long startOffset, long length, StorageType storageType)
throws IOException {
LocalDatanodeInfo localDatanodeInfo = getLocalDatanodeInfo(node
.getIpcPort());
// check the cache first
// If the block's file path and checksum file path are already cached locally, take them from the cache
BlockLocalPathInfo pathinfo = localDatanodeInfo.getBlockLocalPathInfo(blk);
if (pathinfo == null) {
if (userGroupInformation == null) {
userGroupInformation = UserGroupInformation.getCurrentUser();
}
// Fetch the block file path and checksum (meta) file path from the Datanode over RPC via a proxy;
// on success the result is also stored in the local cache
pathinfo = getBlockPathInfo(userGroupInformation, blk, node,
configuration, conf.socketTimeout, token,
conf.connectToDnViaHostname, storageType);
}
// check to see if the file exists. It may so happen that the
// HDFS file has been deleted and this block-lookup is occurring
// on behalf of a new HDFS file. This time, the block file could
// be residing in a different portion of the fs.data.dir directory.
// In this case, we remove this entry from the cache. The next
// call to this method will re-populate the cache.
FileInputStream dataIn = null;
FileInputStream checksumIn = null;
BlockReaderLocalLegacy localBlockReader = null;
boolean skipChecksumCheck = conf.skipShortCircuitChecksums ||
storageType.isTransient();
try {
// get a local file system
// With the local path known, open the local block file directly
File blkfile = new File(pathinfo.getBlockPath());
dataIn = new FileInputStream(blkfile);
if (LOG.isDebugEnabled()) {
LOG.debug("New BlockReaderLocalLegacy for file " + blkfile + " of size "
+ blkfile.length() + " startOffset " + startOffset + " length "
+ length + " short circuit checksum " + !skipChecksumCheck);
}
if (!skipChecksumCheck) {
// get the metadata file
// Locate the local checksum (meta) file
File metafile = new File(pathinfo.getMetaPath());
checksumIn = new FileInputStream(metafile);
final DataChecksum checksum = BlockMetadataHeader.readDataChecksum(
new DataInputStream(checksumIn), blk);
long firstChunkOffset = startOffset
- (startOffset % checksum.getBytesPerChecksum());
localBlockReader = new BlockReaderLocalLegacy(conf, file, blk, token,
startOffset, length, pathinfo, checksum, true, dataIn,
firstChunkOffset, checksumIn);
} else {
localBlockReader = new BlockReaderLocalLegacy(conf, file, blk, token,
startOffset, length, pathinfo, dataIn);
}
} catch (IOException e) {
// remove from cache
localDatanodeInfo.removeBlockLocalPathInfo(blk);
DFSClient.LOG.warn("BlockReaderLocalLegacy: Removing " + blk
+ " from cache because local file " + pathinfo.getBlockPath()
+ " could not be opened.");
throw e;
} finally {
if (localBlockReader == null) {
if (dataIn != null) {
dataIn.close();
}
if (checksumIn != null) {
checksumIn.close();
}
}
}
return localBlockReader;
}
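The caching pattern in newBlockReader() (look in the cache, fall back to an RPC lookup, and evict the entry if the file can no longer be opened) can be sketched in isolation. PathCacheSketch, Opener, and the lookup function are hypothetical stand-ins for LocalDatanodeInfo, the FileInputStream constructor, and getBlockPathInfo().

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class PathCacheSketch {
    /** Stand-in for opening the block file; may fail if the file has moved or been deleted. */
    interface Opener {
        void open(String path) throws IOException;
    }

    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    String openBlock(long blockId, Function<Long, String> rpcLookup, Opener opener)
            throws IOException {
        String path = cache.get(blockId);
        if (path == null) {
            // Cache miss: ask the (hypothetical) remote side and remember the answer.
            path = rpcLookup.apply(blockId);
            cache.put(blockId, path);
        }
        try {
            opener.open(path);
            return path;
        } catch (IOException e) {
            // The cached path is stale; drop it so the next call repopulates the cache.
            cache.remove(blockId);
            throw e;
        }
    }

    boolean isCached(long blockId) {
        return cache.containsKey(blockId);
    }
}
```

Evicting on failure rather than validating on every hit keeps the common path cheap: the cost of a stale entry is one failed open followed by a fresh lookup.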
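<2> getBlockReaderLocal()
This method tries to create a local short-circuit reader. It first obtains the ShortCircuitCache from clientContext; ShortCircuitCache is the class that caches ShortCircuitReplicaInfo objects on the DFSClient side. It then calls fetchOrCreate() to get the ShortCircuitReplicaInfo for the block being read from the ShortCircuitCache.
ShortCircuitCache is covered together with the short-circuit read operation in a later section. The ShortCircuitReplica class it manages holds the file descriptors used for the short-circuit read, the Slot object that records the state of this replica in the memory shared between the client and the Datanode, and mmapData, the memory mapping of the block file.
Once it has the ShortCircuitReplica for the block, getBlockReaderLocal() uses the file descriptors stored in it to build input streams for the block file and the checksum file, and then constructs a BlockReaderLocal.
<3> getRemoteBlockReaderFromDomain() and getRemoteBlockReaderFromTcp()
These two methods use a domain socket and a TCP socket, respectively, as the underlying I/O stream and construct a RemoteBlockReader2 object to read the block.
The RemoteBlockReader2 class
RemoteBlockReader2 implements the logic of reading one block from a Datanode over a socket connection (either a domain socket or a TCP socket). Let's look at its read() method: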
@Override
public int read(ByteBuffer buf) throws IOException {
  if (curDataSlice == null ||
      (curDataSlice.remaining() == 0 && bytesNeededToFinish > 0)) {
    // Read the next packet and keep its data section in curDataSlice
    readNextPacket();
  }
  if (curDataSlice.remaining() == 0) {
    // we're at EOF now
    return -1;
  }
  // Copy as much of curDataSlice as fits into buf
  int nRead = Math.min(curDataSlice.remaining(), buf.remaining());
  ByteBuffer writeSlice = curDataSlice.duplicate();
  writeSlice.limit(writeSlice.position() + nRead);
  buf.put(writeSlice);
  curDataSlice.position(writeSlice.position());
  return nRead;
}
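The duplicate/limit/put sequence at the end of read() is a general java.nio idiom for copying at most "what fits" from one buffer into another without disturbing the source's limit. A minimal standalone sketch (SliceCopySketch is a hypothetical name) looks like this:

```java
import java.nio.ByteBuffer;

final class SliceCopySketch {
    /** Copies as many bytes as fit from src into dst; returns -1 if src is empty. */
    static int copy(ByteBuffer src, ByteBuffer dst) {
        if (!src.hasRemaining()) {
            return -1;
        }
        int n = Math.min(src.remaining(), dst.remaining());
        // duplicate() shares the bytes but has its own position and limit,
        // so narrowing the limit does not affect src.
        ByteBuffer slice = src.duplicate();
        slice.limit(slice.position() + n);
        dst.put(slice);
        // put() advanced the duplicate; carry that progress back to src.
        src.position(slice.position());
        return n;
    }
}
```

Without the narrowed duplicate, dst.put(src) would throw BufferOverflowException whenever src holds more bytes than dst has room for.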
The code of readNextPacket() is as follows:
private void readNextPacket() throws IOException {
//Read packet headers.
// Use packetReceiver to read the next packet from the I/O stream
packetReceiver.receiveNextPacket(in);
// Keep the packet header in curHeader and the packet's data section in curDataSlice
PacketHeader curHeader = packetReceiver.getHeader();
curDataSlice = packetReceiver.getDataSlice();
assert curDataSlice.capacity() == curHeader.getDataLen();
// Log the received header when trace logging is enabled
if (LOG.isTraceEnabled()) {
LOG.trace("DFSClient readNextPacket got header " + curHeader);
}
// Sanity check the lengths
if (!curHeader.sanityCheck(lastSeqNo)) {
throw new IOException("BlockReader: error in packet header " +
curHeader);
}
// Verify the packet data against its checksums
if (curHeader.getDataLen() > 0) {
int chunks = 1 + (curHeader.getDataLen() - 1) / bytesPerChecksum;
int checksumsLen = chunks * checksumSize;
assert packetReceiver.getChecksumSlice().capacity() == checksumsLen :
"checksum slice capacity=" + packetReceiver.getChecksumSlice().capacity() +
" checksumsLen=" + checksumsLen;
lastSeqNo = curHeader.getSeqno();
if (verifyChecksum && curDataSlice.remaining() > 0) {
// N.B.: the checksum error offset reported here is actually
// relative to the start of the block, not the start of the file.
// This is slightly misleading, but preserves the behavior from
// the older BlockReader.
checksum.verifyChunkedSums(curDataSlice,
packetReceiver.getChecksumSlice(),
filename, curHeader.getOffsetInBlock());
}
bytesNeededToFinish -= curHeader.getDataLen();
}
// First packet will include some data prior to the first byte
// the user requested. Skip it.
if (curHeader.getOffsetInBlock() < startOffset) {
int newPos = (int) (startOffset - curHeader.getOffsetInBlock());
curDataSlice.position(newPos);
}
// If we've now satisfied the whole client read, read one last packet
// header, which should be empty
// Once the whole client read is satisfied, read the trailing empty packet: the last packet of a block read is an empty marker packet
if (bytesNeededToFinish <= 0) {
readTrailingEmptyPacket();
if (verifyChecksum) {
sendReadResult(Status.CHECKSUM_OK);
} else {
sendReadResult(Status.SUCCESS);
}
}
}
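The arithmetic in readNextPacket() is easy to check by hand. The sketch below isolates the two calculations: how many checksum bytes accompany a packet, and how many leading bytes of the first packet are skipped. PacketMathSketch and its method names are hypothetical; the formulas are the ones in the listing above.

```java
final class PacketMathSketch {
    /** Number of checksum chunks covering dataLen bytes (dataLen must be positive). */
    static int chunkCount(int dataLen, int bytesPerChecksum) {
        return 1 + (dataLen - 1) / bytesPerChecksum;
    }

    /** Total size in bytes of the checksums that accompany the packet data. */
    static int checksumsLen(int dataLen, int bytesPerChecksum, int checksumSize) {
        return chunkCount(dataLen, bytesPerChecksum) * checksumSize;
    }

    /** Bytes at the start of a packet that precede the offset the caller asked for. */
    static int leadingBytesToSkip(long startOffset, long offsetInBlock) {
        return offsetInBlock < startOffset ? (int) (startOffset - offsetInBlock) : 0;
    }
}
```

For example, with 512-byte chunks and 4-byte checksums, a 1025-byte packet spans three chunks and carries 12 checksum bytes. A read that starts at offset 1300 receives a first packet starting at the chunk boundary 1024, so its first 276 bytes are skipped.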
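The BlockReaderLocal class
This class implements local short-circuit reads: when the client and the Datanode are on the same machine, the client can bypass the Datanode process and read the data directly from the local disk.
When the client requests data, the Datanode opens the block file and its metadata file and passes the two file descriptors to the client over a domain socket. The client builds input streams from the descriptors and then reads the block file on disk directly through them. Because the data is no longer forwarded by the Datanode process, read performance is better (see HDFS-347). The descriptors are read-only, so the client cannot modify the files it receives, and because the client itself has no access to the directory that holds the block files, it cannot reach any other file in the data directory either, which improves data security.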
![BlockReaderLocal flow diagram](https://i-blog.csdnimg.cn/blog_migrate/126cee829fc37fe2bd37e8d3241e2a6b.jpeg)
The read() method of BlockReaderLocal is shown below:
@Override
public synchronized int read(ByteBuffer buf) throws IOException {
// Can checksum verification be skipped?
boolean canSkipChecksum = createNoChecksumContext();
try {
String traceString = null;
if (LOG.isTraceEnabled()) {
traceString = new StringBuilder().
append("read(").
append("buf.remaining=").append(buf.remaining()).
append(", block=").append(block).
append(", filename=").append(filename).
append(", canSkipChecksum=").append(canSkipChecksum).
append(")").toString();
LOG.info(traceString + ": starting");
}
int nRead;
try {
// Checksums can be skipped and no readahead was requested
if (canSkipChecksum && zeroReadaheadRequested) {
nRead = readWithoutBounceBuffer(buf);
} else {
// Checksums are required, or readahead is enabled
nRead = readWithBounceBuffer(buf, canSkipChecksum);
}
} catch (IOException e) {
if (LOG.isTraceEnabled()) {
LOG.info(traceString + ": I/O error", e);
}
throw e;
}
if (LOG.isTraceEnabled()) {
LOG.info(traceString + ": returning " + nRead);
}
return nRead;
} finally {
if (canSkipChecksum) releaseNoChecksumContext();
}
}
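The read() method can be divided into three parts:
Part 1:
Call createNoChecksumContext() to determine whether a no-checksum context can be created.
Part 2:
If checksums can be skipped and no readahead was requested, call readWithoutBounceBuffer() to read the data.
Part 3:
Otherwise (checksums must be verified, or readahead is enabled), call readWithBounceBuffer() to read the data.
The three parts are described in turn below.
createNoChecksumContext()
If the verifyChecksum field is false, meaning the current configuration does not require checksum verification anyway, the method returns true immediately: the no-checksum context has been created. If verification is required, it tries to add a no-checksum anchor on the replica's Slot in the memory shared by the Datanode and the client (anchors are explained later). Note that an anchor can be added only when the Datanode has already cached the replica, because when a Datanode caches a block replica it verifies the block's checksums and then pins the block in memory with mmap and mlock. In other words, a block cached on the Datanode has already been verified and is known to be correct, so it does not need to be verified again.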
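The decision just described can be sketched with a small counter-based model. This is a simplified illustration, not the shared-memory implementation: AnchorSketch, its Slot class, and the anchorable flag are hypothetical and only model the rule that an anchor can be taken when the Datanode has cached (and therefore already verified) the replica.

```java
final class AnchorSketch {
    /** Hypothetical model of a per-replica slot shared by client and Datanode. */
    static final class Slot {
        private boolean anchorable;  // set once the Datanode has cached (mlocked) the replica
        private int anchorCount;     // number of readers relying on the cached copy

        synchronized void setAnchorable(boolean value) {
            anchorable = value;
        }

        synchronized boolean addAnchor() {
            if (!anchorable) {
                return false;
            }
            anchorCount++;
            return true;
        }

        synchronized void removeAnchor() {
            if (anchorCount > 0) {
                anchorCount--;
            }
        }

        synchronized int anchorCount() {
            return anchorCount;
        }
    }

    /** True if the read may skip checksum verification. */
    static boolean createNoChecksumContext(boolean verifyChecksum, Slot slot) {
        if (!verifyChecksum) {
            return true;  // verification was never requested
        }
        return slot != null && slot.addAnchor();
    }

    /** Must be called once for every successful createNoChecksumContext(). */
    static void releaseNoChecksumContext(boolean verifyChecksum, Slot slot) {
        if (verifyChecksum && slot != null) {
            slot.removeAnchor();
        }
    }
}
```

The pairing of create and release is why read() wraps its body in try/finally: an anchor that is never released would keep the replica pinned.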
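readWithoutBounceBuffer()
This method is simple: it does not use the extra data and checksum buffers to read ahead, but reads from the data stream straight into the caller's buffer. The code is as follows: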
private synchronized int readWithoutBounceBuffer(ByteBuffer buf)
    throws IOException {
  freeDataBufIfExists();
  freeChecksumBufIfExists();
  int total = 0;
  // Read from the input channel straight into buf
  while (buf.hasRemaining()) {
    int nRead = dataIn.read(buf, dataPos);
    if (nRead <= 0) break;
    dataPos += nRead;
    total += nRead;
  }
  return (total == 0 && (dataPos == dataIn.size())) ? -1 : total;
}
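The loop above relies on the positional form of FileChannel.read(), which reads at an explicit offset and leaves the channel's own position alone. The following standalone sketch shows the same loop against a temporary file; PositionalReadSketch and the file contents are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PositionalReadSketch {
    /** Fills buf from the channel starting at pos; returns -1 only at end of file. */
    static int readAt(FileChannel channel, ByteBuffer buf, long pos) throws IOException {
        int total = 0;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos);  // positional read: channel position is unchanged
            if (n <= 0) {
                break;
            }
            pos += n;
            total += n;
        }
        return (total == 0 && pos >= channel.size()) ? -1 : total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("blk", ".dat");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {10, 20, 30, 40, 50});
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(3);
            System.out.println(readAt(channel, buf, 1));  // prints 3 (bytes 20, 30, 40)
        }
    }
}
```

Because the offset is passed on every call, several readers can share one open channel without interfering with each other's positions.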
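readWithBounceBuffer()
This method uses two buffers allocated on the BlockReaderLocal object:
dataBuf: the data buffer
checksumBuf: the checksum buffer
The size of dataBuf is maxReadaheadLength, which is always a whole multiple of the checksum chunk (the length of data covered by one checksum value). This makes verification convenient, because data can be read in units of chunks.
Both dataBuf and checksumBuf are direct byte buffers, that is, off-heap memory allocated with java.nio.ByteBuffer.allocateDirect(). This is a technique worth remembering: a relatively large buffer can be placed off-heap through java.nio, saving heap space.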
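As a small illustration of both points, the sketch below rounds a requested readahead size up to a whole number of chunks and allocates the buffer off-heap. The rounding rule and the names BounceBufferSketch, roundUpToChunk, and allocate are assumptions made for this example; how BlockReaderLocal actually derives maxReadaheadLength is not shown in this section.

```java
import java.nio.ByteBuffer;

final class BounceBufferSketch {
    /** Rounds requested up to a positive multiple of bytesPerChecksum (assumed rule). */
    static int roundUpToChunk(int requested, int bytesPerChecksum) {
        long chunks = ((long) requested + bytesPerChecksum - 1) / bytesPerChecksum;
        return Math.toIntExact(Math.max(1L, chunks) * bytesPerChecksum);
    }

    /** Allocates a chunk-aligned buffer outside the Java heap. */
    static ByteBuffer allocate(int requestedReadahead, int bytesPerChecksum) {
        return ByteBuffer.allocateDirect(roundUpToChunk(requestedReadahead, bytesPerChecksum));
    }
}
```

A direct buffer is not reclaimed as promptly as heap memory, so code that allocates many of them usually pools and reuses them instead of allocating one per read.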
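BlockReaderLocal provides several methods that operate on these buffers:
<1> fillBuffer(ByteBuffer buf, boolean canSkipChecksum): reads data from the input stream into the given buf and, unless checksums can be skipped, reads the matching checksums into checksumBuf and verifies them.
<2> fillDataBuf(): calls fillBuffer() to read data into dataBuf and checksums into checksumBuf. Note that the data placed in dataBuf is always a whole multiple of the chunk size.
<3> drainDataBuf(ByteBuffer buf): moves the data held in dataBuf into buf and returns the number of bytes transferred.
readWithBounceBuffer() first drains whatever is left in dataBuf into buf, which guarantees that the read position is on a chunk boundary. Then, if buf is a direct buffer whose remaining space is at least the size of dataBuf and the stream position is on a chunk boundary, it calls fillBuffer() to read straight into buf, bypassing dataBuf. Otherwise it calls fillDataBuf() to read into dataBuf and then drainDataBuf() to move the data from dataBuf into buf.
The code of readWithBounceBuffer() is as follows: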
/**
* Read using the bounce buffer.
*
* A 'direct' read actually has three phases. The first drains any
* remaining bytes from the slow read buffer. After this the read is
* guaranteed to be on a checksum chunk boundary. If there are still bytes
* to read, the fast direct path is used for as many remaining bytes as
* possible, up to a multiple of the checksum chunk size. Finally, any
* 'odd' bytes remaining at the end of the read cause another slow read to
* be issued, which involves an extra copy.
*
* Every 'slow' read tries to fill the slow read buffer in one go for
* efficiency's sake. As described above, all non-checksum-chunk-aligned
* reads will be served from the slower read path.
*
* @param buf The buffer to read into.
* @param canSkipChecksum True if we can skip checksums.
*/
private synchronized int readWithBounceBuffer(ByteBuffer buf,
boolean canSkipChecksum) throws IOException {
int total = 0;
// Call drainDataBuf() to move the data left in dataBuf into buf
int bb = drainDataBuf(buf); // drain bounce buffer if possible
if (bb >= 0) {
total += bb;
if (buf.remaining() == 0) return total;
}
boolean eof = true, done = false;
do {
// If buf is direct and large enough, and the stream position is on a chunk boundary, read from the I/O stream straight into buf
if (buf.isDirect() && (buf.remaining() >= maxReadaheadLength)
&& ((dataPos % bytesPerChecksum) == 0)) {
// Fast lane: try to read directly into user-supplied buffer, bypassing
// bounce buffer.
int oldLimit = buf.limit();
int nRead;
try {
buf.limit(buf.position() + maxReadaheadLength);
nRead = fillBuffer(buf, canSkipChecksum);
} finally {
buf.limit(oldLimit);
}
if (nRead < maxReadaheadLength) {
done = true;
}
if (nRead > 0) {
eof = false;
}
total += nRead;
} else {
// Slow lane: refill bounce buffer.
// Otherwise read the data into the dataBuf bounce buffer
if (fillDataBuf(canSkipChecksum)) {
done = true;
}
// then move the data from dataBuf into buf
bb = drainDataBuf(buf); // drain bounce buffer if possible
if (bb >= 0) {
eof = false;
total += bb;
}
}
} while ((!done) && (buf.remaining() > 0));
return (eof && total == 0) ? -1 : total;
}
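The three phases described in the method's Javadoc can be approximated with plain arithmetic. The sketch below is a deliberate simplification: it ignores the readahead size and the requirement that the caller's buffer be direct, and only shows how a read splits into an unaligned head, a chunk-aligned middle, and a leftover tail. ReadPhaseSketch and split are hypothetical names.

```java
final class ReadPhaseSketch {
    /**
     * Splits a read of len bytes starting at pos into {head, fast, tail}:
     * head = bytes up to the next chunk boundary (slow path),
     * fast = whole chunks that could be read directly (fast path),
     * tail = remaining odd bytes (slow path again).
     */
    static long[] split(long pos, long len, int chunk) {
        long toBoundary = (chunk - pos % chunk) % chunk;
        long head = Math.min(len, toBoundary);
        long fast = ((len - head) / chunk) * chunk;
        long tail = len - head - fast;
        return new long[] {head, fast, tail};
    }
}
```

For a 2000-byte read at position 100 with 512-byte chunks, this gives 412 head bytes, 1536 fast-path bytes, and 52 tail bytes. Only the middle part can be checksummed chunk by chunk without an extra copy.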
3. HasEnhancedByteBufferAccess.read()
DFSInputStream implements the read() method of the HasEnhancedByteBufferAccess interface, which reads blocks in zero-copy mode. The code is as follows:
@Override
public synchronized ByteBuffer read(ByteBufferPool bufferPool,
int maxLength, EnumSet<ReadOption> opts)
throws IOException, UnsupportedOperationException {
if (maxLength == 0) {
return EMPTY_BUFFER;
} else if (maxLength < 0) {
throw new IllegalArgumentException("can't read a negative " +
"number of bytes.");
}
if ((blockReader == null) || (blockEnd == -1)) {
if (pos >= getFileLength()) {
return null;
}
/*
* If we don't have a blockReader, or the one we have has no more bytes
* left to read, we call seekToBlockSource to get a new blockReader and
* recalculate blockEnd. Note that we assume we're not at EOF here
* (we check this above).
*/
if ((!seekToBlockSource(pos)) || (blockReader == null)) {
throw new IOException("failed to allocate new BlockReader " +
"at position " + pos);
}
}
ByteBuffer buffer = null;
// Try zero-copy mode first
if (dfsClient.getConf().shortCircuitMmapEnabled) {
buffer = tryReadZeroCopy(maxLength, opts);
}
if (buffer != null) {
return buffer;
}
// If the zero-copy read did not succeed, fall back to an ordinary read
buffer = ByteBufferUtil.fallbackRead(this, bufferPool, maxLength);
if (buffer != null) {
extendedReadBuffers.put(buffer, bufferPool);
}
return buffer;
}
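<1> tryReadZeroCopy()
In traditional file I/O, reads and writes go through the operating system's read() and write() system calls. The calling process (the JVM process, in Java) switches from user mode to kernel mode, the kernel reads the requested file data into its I/O buffer, and the data is finally copied from the kernel I/O buffer into the process's private address space. That completes one I/O operation.
tryReadZeroCopy() reads through a memory-mapped file instead. The main difference from standard I/O is that the data is not copied from the kernel buffer into a private buffer of the process: a region of the process's address space is mapped directly onto the file, so the file can be accessed as if it were memory. This reduces the number of copies and speeds up file access.
Java offers three memory-mapping modes: read-only (READ_ONLY), read/write (READ_WRITE), and private (PRIVATE). In read-only mode, an attempt to write throws ReadOnlyBufferException. In read/write mode, changes made through the mapping are propagated to the file on disk, although not necessarily at once (MappedByteBuffer.force() flushes them), and they may or may not become visible to other programs that have mapped the same file. Private mode follows the operating system's copy-on-write principle: as long as nothing is written, processes share the same physical memory for the file (their virtual addresses point to the same physical pages); once a process writes, the affected data is copied into a private area of that process and the change is not reflected in the underlying file. tryReadZeroCopy() uses read-only mode.
For reading data files, memory-mapped reads can improve performance considerably, so this pattern is worth remembering.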
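Read-only mapping is easy to try with the JDK alone. The sketch below maps a small temporary file with FileChannel.map() in READ_ONLY mode, reads a byte through the mapping, and shows that a write attempt is rejected. MmapReadOnlySketch and the file contents are made up for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.ReadOnlyBufferException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapReadOnlySketch {
    /** Reads the first byte of the file through a read-only memory mapping. */
    static byte firstByteViaMmap(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return map.get(0);
        }
    }

    /** Returns true if writing through a read-only mapping is refused. */
    static boolean writeRejected(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            try {
                map.put(0, (byte) 1);
                return false;
            } catch (ReadOnlyBufferException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap", ".dat");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {42, 43, 44});
        System.out.println(firstByteViaMmap(tmp));  // prints 42
        System.out.println(writeRejected(tmp));     // prints true
    }
}
```

Note that a mapping stays valid after its channel is closed; it is released only when the buffer itself is garbage-collected.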
The code of tryReadZeroCopy() is as follows:
private synchronized ByteBuffer tryReadZeroCopy(int maxLength,
EnumSet<ReadOption> opts) throws IOException {
// Copy 'pos' and 'blockEnd' to local variables to make it easier for the
// JVM to optimize this function.
final long curPos = pos;
final long curEnd = blockEnd;
final long blockStartInFile = currentLocatedBlock.getStartOffset();
final long blockPos = curPos - blockStartInFile;
// Shorten this read if the end of the block is nearby.
// First make sure the read stays within the current block
long length63;
if ((curPos + maxLength) <= (curEnd + 1)) {
length63 = maxLength;
} else {
length63 = 1 + curEnd - curPos;
if (length63 <= 0) {
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Unable to perform a zero-copy read from offset " +
curPos + " of " + src + "; " + length63 + " bytes left in block. " +
"blockPos=" + blockPos + "; curPos=" + curPos +
"; curEnd=" + curEnd);
}
return null;
}
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Reducing read length from " + maxLength +
" to " + length63 + " to avoid going more than one byte " +
"past the end of the block. blockPos=" + blockPos +
"; curPos=" + curPos + "; curEnd=" + curEnd);
}
}
// Make sure that don't go beyond 31-bit offsets in the MappedByteBuffer.
// Make sure the mapped range does not go past the 2 GB offset
int length;
if (blockPos + length63 <= Integer.MAX_VALUE) {
length = (int)length63;
} else {
long length31 = Integer.MAX_VALUE - blockPos;
if (length31 <= 0) {
// Java ByteBuffers can't be longer than 2 GB, because they use
// 4-byte signed integers to represent capacity, etc.
// So we can't mmap the parts of the block higher than the 2 GB offset.
// FIXME: we could work around this with multiple memory maps.
// See HDFS-5101.
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Unable to perform a zero-copy read from offset " +
curPos + " of " + src + "; 31-bit MappedByteBuffer limit " +
"exceeded. blockPos=" + blockPos + ", curEnd=" + curEnd);
}
return null;
}
length = (int)length31;
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("Reducing read length from " + maxLength +
" to " + length + " to avoid 31-bit limit. " +
"blockPos=" + blockPos + "; curPos=" + curPos +
"; curEnd=" + curEnd);
}
}
// Call blockReader.getClientMmap() to map the file into memory; the returned ClientMmap object wraps a MappedByteBuffer
final ClientMmap clientMmap = blockReader.getClientMmap(opts);
if (clientMmap == null) {
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("unable to perform a zero-copy read from offset " +
curPos + " of " + src + "; BlockReader#getClientMmap returned " +
"null.");
}
return null;
}
boolean success = false;
ByteBuffer buffer;
try {
seek(curPos + length);
// Return the memory-mapped buffer, which exposes the data of the block file
buffer = clientMmap.getMappedByteBuffer().asReadOnlyBuffer();
buffer.position((int)blockPos);
buffer.limit((int)(blockPos + length));
extendedReadBuffers.put(buffer, clientMmap);
readStatistics.addZeroCopyBytes(length);
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("readZeroCopy read " + length +
" bytes from offset " + curPos + " via the zero-copy read " +
"path. blockEnd = " + blockEnd);
}
success = true;
} finally {
if (!success) {
IOUtils.closeQuietly(clientMmap);
}
}
return buffer;
}
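The two length adjustments at the top of tryReadZeroCopy() (stay inside the block, stay below the 31-bit offset limit) can be isolated into a pure function, which makes the edge cases easy to test. ZeroCopyLengthSketch and clamp are hypothetical names, and -1 stands in for the places where the original returns null.

```java
final class ZeroCopyLengthSketch {
    /**
     * Returns the number of bytes a zero-copy read may cover, or -1 if no
     * zero-copy read is possible. maxLength is assumed to be positive.
     */
    static int clamp(long curPos, long curEnd, long blockStartInFile, int maxLength) {
        long blockPos = curPos - blockStartInFile;
        // Do not read past the last byte of the block (curEnd is inclusive).
        long length63 = Math.min((long) maxLength, 1 + curEnd - curPos);
        if (length63 <= 0) {
            return -1;
        }
        // A MappedByteBuffer is indexed by int, so offsets beyond 2^31 - 1 are unreachable.
        if (blockPos + length63 <= Integer.MAX_VALUE) {
            return (int) length63;
        }
        long length31 = Integer.MAX_VALUE - blockPos;
        return length31 <= 0 ? -1 : (int) length31;
    }
}
```

For a block that occupies file offsets 0 to 999, a 100-byte request at position 990 is shortened to 10 bytes, and a request at position 1000 yields -1.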
tryReadZeroCopy() obtains the MappedByteBuffer, the in-memory mapping of the block file, through the call chain BlockReader.getClientMmap() -> ShortCircuitReplica.getOrCreateClientMmap() -> ShortCircuitCache.getOrCreateClientMmap() -> ShortCircuitReplica.loadMmapInternal(). The BlockReaderLocal implementation of getClientMmap() is as follows:
/**
* Get or create a memory map for this replica.
*
* There are two kinds of ClientMmap objects we could fetch here: one that
* will always read pre-checksummed data, and one that may read data that
* hasn't been checksummed.
*
* If we fetch the former, "safe" kind of ClientMmap, we have to increment
* the anchor count on the shared memory slot. This will tell the DataNode
* not to munlock the block until this ClientMmap is closed.
* If we fetch the latter, we don't bother with anchoring.
*
* @param opts The options to use, such as SKIP_CHECKSUMS.
*
* @return null on failure; the ClientMmap otherwise.
*/
@Override
public ClientMmap getClientMmap(EnumSet<ReadOption> opts) {
boolean anchor = verifyChecksum &&
(opts.contains(ReadOption.SKIP_CHECKSUMS) == false);
if (anchor) {
if (!createNoChecksumContext()) {
if (LOG.isTraceEnabled()) {
LOG.trace("can't get an mmap for " + block + " of " + filename +
" since SKIP_CHECKSUMS was not given, " +
"we aren't skipping checksums, and the block is not mlocked.");
}
return null;
}
}
ClientMmap clientMmap = null;
try {
clientMmap = replica.getOrCreateClientMmap(anchor);
} finally {
if ((clientMmap == null) && anchor) {
releaseNoChecksumContext();
}
}
return clientMmap;
}
The code of ShortCircuitReplica.getOrCreateClientMmap() is as follows:
public ClientMmap getOrCreateClientMmap(boolean anchor) {
  return cache.getOrCreateClientMmap(this, anchor);
}
The code of ShortCircuitCache.getOrCreateClientMmap() is as follows:
ClientMmap getOrCreateClientMmap(ShortCircuitReplica replica,
boolean anchored) {
Condition newCond;
lock.lock();
try {
while (replica.mmapData != null) {
if (replica.mmapData instanceof MappedByteBuffer) {
ref(replica);
MappedByteBuffer mmap = (MappedByteBuffer)replica.mmapData;
return new ClientMmap(replica, mmap, anchored);
} else if (replica.mmapData instanceof Long) {
long lastAttemptTimeMs = (Long)replica.mmapData;
long delta = Time.monotonicNow() - lastAttemptTimeMs;
if (delta < mmapRetryTimeoutMs) {
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": can't create client mmap for " +
replica + " because we failed to " +
"create one just " + delta + "ms ago.");
}
return null;
}
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": retrying client mmap for " + replica +
", " + delta + " ms after the previous failure.");
}
} else if (replica.mmapData instanceof Condition) {
Condition cond = (Condition)replica.mmapData;
cond.awaitUninterruptibly();
} else {
Preconditions.checkState(false, "invalid mmapData type " +
replica.mmapData.getClass().getName());
}
}
newCond = lock.newCondition();
replica.mmapData = newCond;
} finally {
lock.unlock();
}
MappedByteBuffer map = replica.loadMmapInternal();
lock.lock();
try {
if (map == null) {
replica.mmapData = Long.valueOf(Time.monotonicNow());
newCond.signalAll();
return null;
} else {
outstandingMmapCount++;
replica.mmapData = map;
ref(replica);
newCond.signalAll();
return new ClientMmap(replica, map, anchored);
}
} finally {
lock.unlock();
}
}
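The method above stores one of three things in a single field: a Condition while a load is in progress, a Long timestamp after a failed load, or the loaded value. The sketch below reproduces that idea in a generic, self-contained form. SingleFlightSketch and its injected clock are hypothetical; the value type must not itself be a Long or a Condition; and the sketch leaves the wait loop explicitly once the retry timeout has elapsed.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SingleFlightSketch<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final long retryTimeoutMs;
    private final LongSupplier clockMs;
    // null = never tried, Condition = load in progress,
    // Long = time of the last failed load, anything else = the loaded value.
    private Object state;

    SingleFlightSketch(long retryTimeoutMs, LongSupplier clockMs) {
        this.retryTimeoutMs = retryTimeoutMs;
        this.clockMs = clockMs;
    }

    @SuppressWarnings("unchecked")
    T getOrLoad(Supplier<T> loader) {
        Condition inProgress;
        lock.lock();
        try {
            while (state != null) {
                if (state instanceof Condition) {
                    // Another thread is loading; wait for it, then re-check the state.
                    ((Condition) state).awaitUninterruptibly();
                } else if (state instanceof Long) {
                    long failedAt = (Long) state;
                    if (clockMs.getAsLong() - failedAt < retryTimeoutMs) {
                        return null;  // failed too recently, do not retry yet
                    }
                    break;            // timeout elapsed, try loading again
                } else {
                    return (T) state; // already loaded
                }
            }
            inProgress = lock.newCondition();
            state = inProgress;
        } finally {
            lock.unlock();
        }
        T loaded = null;
        try {
            loaded = loader.get();    // the slow work runs outside the lock
        } catch (RuntimeException e) {
            // Treated like a failed load.
        }
        lock.lock();
        try {
            state = (loaded != null) ? loaded : Long.valueOf(clockMs.getAsLong());
            inProgress.signalAll();
            return loaded;
        } finally {
            lock.unlock();
        }
    }
}
```

Running the loader outside the lock is the key design choice: threads asking for other replicas are not blocked while one expensive mapping is being created.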
The code of ShortCircuitReplica.loadMmapInternal() is as follows:
MappedByteBuffer loadMmapInternal() {
  try {
    FileChannel channel = dataStream.getChannel();
    // Call FileChannel.map() to create the memory mapping of the file
    MappedByteBuffer mmap = channel.map(MapMode.READ_ONLY, 0,
        Math.min(Integer.MAX_VALUE, channel.size()));
    if (LOG.isTraceEnabled()) {
      LOG.trace(this + ": created mmap of size " + channel.size());
    }
    return mmap;
  } catch (IOException e) {
    LOG.warn(this + ": mmap error", e);
    return null;
  } catch (RuntimeException e) {
    LOG.warn(this + ": mmap error", e);
    return null;
  }
}
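tryReadZeroCopy() first obtains clientMmap, the memory-mapping object of the block file, through getClientMmap() above. clientMmap holds the MappedByteBuffer, the mapped buffer of the block file; tryReadZeroCopy() takes this buffer from clientMmap and returns a read-only view of it restricted to the requested range. User code can then read the data from that buffer. One point deserves attention: a Java ByteBuffer cannot address more than 2 GB, because its capacity and position are 4-byte signed integers. The code therefore checks the range being read: a read that would cross the 2 GB offset is shortened, and a read that starts at or beyond it returns null, so the caller falls back to an ordinary read.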
<2> ByteBufferUtil.fallbackRead()
/**
* Perform a fallback read.
*/
public static ByteBuffer fallbackRead(
InputStream stream, ByteBufferPool bufferPool, int maxLength)
throws IOException {
if (bufferPool == null) {
throw new UnsupportedOperationException("zero-copy reads " +
"were not available, and you did not provide a fallback " +
"ByteBufferPool.");
}
// Check whether the stream supports reading into a ByteBuffer
boolean useDirect = streamHasByteBufferRead(stream);
// Ask the ByteBufferPool for a ByteBuffer
ByteBuffer buffer = bufferPool.getBuffer(useDirect, maxLength);
if (buffer == null) {
// The ByteBufferPool could not supply a ByteBuffer
throw new UnsupportedOperationException("zero-copy reads " +
"were not available, and the ByteBufferPool did not provide " +
"us with " + (useDirect ? "a direct" : "an indirect") +
"buffer.");
}
Preconditions.checkState(buffer.capacity() > 0);
Preconditions.checkState(buffer.isDirect() == useDirect);
maxLength = Math.min(maxLength, buffer.capacity());
boolean success = false;
try {
if (useDirect) {
buffer.clear();
buffer.limit(maxLength);
ByteBufferReadable readable = (ByteBufferReadable)stream;
int totalRead = 0;
while (true) {
if (totalRead >= maxLength) {
success = true;
break;
}
// Call the stream's ByteBuffer read method directly
int nRead = readable.read(buffer);
if (nRead < 0) {
if (totalRead > 0) {
success = true;
}
break;
}
totalRead += nRead;
}
buffer.flip();
} else {
buffer.clear();
// Use InputStream.read(byte[], int, int) on the buffer's backing array
int nRead = stream.read(buffer.array(),
buffer.arrayOffset(), maxLength);
if (nRead >= 0) {
buffer.limit(nRead);
success = true;
}
}
} finally {
if (!success) {
// If we got an error while reading, or if we are at EOF, we
// don't need the buffer any more. We can give it back to the
// bufferPool.
bufferPool.putBuffer(buffer);
buffer = null;
}
}
return buffer;
}
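When the zero-copy read in read() does not succeed, ByteBufferUtil.fallbackRead() turns it into an ordinary read. The method is simple: it checks whether the InputStream passed in (here the DFSInputStream) supports ByteBuffer reads, that is, implements the ByteBufferReadable interface. If it does, the data is read directly into a direct ByteBuffer; otherwise it is read into the backing byte array (ByteBuffer.array()) of a heap buffer.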
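The heap-buffer branch is the simpler of the two and can be reproduced with the JDK alone. In the sketch below the buffer is allocated directly instead of being taken from a ByteBufferPool, which is a simplification; FallbackReadSketch is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

final class FallbackReadSketch {
    /**
     * Reads up to maxLength bytes from the stream into a heap ByteBuffer.
     * Returns null at end of stream. maxLength is assumed to be positive.
     */
    static ByteBuffer fallbackRead(InputStream stream, int maxLength) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(maxLength);
        // A heap buffer exposes its backing array, so a plain InputStream can fill it.
        int nRead = stream.read(buffer.array(), buffer.arrayOffset(), maxLength);
        if (nRead < 0) {
            return null;
        }
        // Position stays at 0; the limit marks how many bytes are valid.
        buffer.limit(nRead);
        return buffer;
    }
}
```

The caller sees the same shape of result as in the zero-copy case, a ByteBuffer positioned at the start of the data, which is what lets read() treat the two paths interchangeably.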
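4. Closing the input stream
After user code has read all the data, it calls DFSInputStream.close() to close the input stream. The implementation is simple: it returns at once if the stream is already closed, checks that the DFSClient is still running, logs a warning listing any ByteBuffers handed out by read() that have not yet been released, and finally calls BlockReader.close() to close the BlockReader underneath the stream. The code of close() is as follows: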
/**
* Close it down!
*/
@Override
public synchronized void close() throws IOException {
if (closed) {
return;
}
dfsClient.checkOpen();
// Warn about ByteBuffers handed out by read() that have not been released
if (!extendedReadBuffers.isEmpty()) {
final StringBuilder builder = new StringBuilder();
extendedReadBuffers.visitAll(new IdentityHashStore.Visitor<ByteBuffer, Object>() {
private String prefix = "";
@Override
public void accept(ByteBuffer k, Object v) {
builder.append(prefix).append(k);
prefix = ", ";
}
});
DFSClient.LOG.warn("closing file " + src + ", but there are still " +
"unreleased ByteBuffers allocated by read(). " +
"Please release " + builder.toString() + ".");
}
// Close the BlockReader object
if (blockReader != null) {
blockReader.close();
blockReader = null;
}
super.close();
closed = true;
}
Next, we move on to the content of section 5.3.