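Most readers will already be familiar with how a file is read from HDFS. The overall process is usually summarized by a diagram of the client read path:
[Figure: client read flow]
The basic read flow is as follows: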
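- The client opens the HDFS file by calling open() on the FileSystem object. Under the hood, DistributedFileSystem.open() delegates to DFSClient.open(), which builds a DFSInputStream; the caller gets back an HdfsDataInputStream for reading data blocks. HdfsDataInputStream is a decorator around DFSInputStream, and the DFSInputStream object is what actually reads the blocks.
- Through the RPC interface ClientProtocol.getBlockLocations(), the client asks the NameNode for the locations of the file's first blocks. Each block has one location per replica, and the locations are sorted by cluster topology with the DataNodes nearest to the client first. DFSInputStream therefore picks the best DataNode, opens a data connection to it, and reads the block.
- The client calls DFSInputStream.read() to read the block from the best DataNode. The block is streamed from the DataNode to the client in units of packets. When a block has been read completely, DFSInputStream looks up the location of the next block (calling ClientProtocol.getBlockLocations() again if that location is not already cached), connects to the best DataNode for that block, and continues reading.
- Once the client has finished reading, it calls close() on the HdfsDataInputStream to close the input stream.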
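The "nearest first" ordering mentioned in the second step can be illustrated with a small sketch. It is a toy model, not Hadoop's NetworkTopology class: the class name `ReplicaOrdering`, the "/rack/host" location strings, and the distance values (0 for the same host, 2 for the same rack, 4 for a different rack) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of topology-based replica ordering (hypothetical, not a Hadoop class). */
final class ReplicaOrdering {

    /**
     * Distance between two locations written as "/rack/host":
     * 0 = same host, 2 = same rack, 4 = different racks (assumed values).
     */
    static int distance(String a, String b) {
        if (a.equals(b)) {
            return 0;
        }
        String rackA = a.substring(0, a.lastIndexOf('/'));
        String rackB = b.substring(0, b.lastIndexOf('/'));
        return rackA.equals(rackB) ? 2 : 4;
    }

    /** Returns a copy of the replica list sorted by distance to the client, nearest first. */
    static List<String> sortByDistance(String client, List<String> replicas) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(r -> distance(client, r)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("/rack2/dn5", "/rack1/dn2", "/rack1/dn1");
        // The client runs on /rack1/dn1, so the local replica should come first.
        System.out.println(sortByDistance("/rack1/dn1", replicas));
    }
}
```

Because the list arrives already sorted, the client side only has to walk it from the front, which is why the selection logic later in the read path can stay so simple.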
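The rest of this article walks through the source code step by step to see how the HDFS client interacts with the NameNode and the DataNodes when reading a file.
1. The client first calls FSDataInputStream inputStream = DistributedFileSystem.open(); to open the file and obtain an input stream. As the code shows, the call ends up constructing a DFSInputStream that is used to read the HDFS file.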
// DistributedFileSystem
@Override
public FSDataInputStream open(Path f, final int bufferSize)
    throws IOException {
  statistics.incrementReadOps(1);
  Path absF = fixRelativePart(f);
  return new FileSystemLinkResolver<FSDataInputStream>() {
    @Override
    public FSDataInputStream doCall(final Path p)
        throws IOException, UnresolvedLinkException {
      final DFSInputStream dfsis =
          dfs.open(getPathName(p), bufferSize, verifyChecksum);
      return dfs.createWrappedInputStream(dfsis);
    }
  }.resolve(this, absF);
}
// DFSClient
public HdfsDataInputStream createWrappedInputStream(DFSInputStream dfsis)
    throws IOException {
  // ......... mostly checks for whether the stream is encrypted
  return new HdfsDataInputStream(dfsis);
}
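Internally, DistributedFileSystem.open() delegates to dfs.open(), where dfs is the DFSClient instance. Its job is to open the file and construct the DFSInputStream for it. The DFSInputStream constructor:
- initializes the basic fields of DFSInputStream: the reference to dfsClient, verifyChecksum (whether to verify checksums while reading; this mainly matters for zero-copy reads), buffersize (the read buffer size, 4 KB), and src (the path of the file to read);
- calls openInfo(), which fetches the locations of the file's blocks from the NameNode and stores them in the DFSInputStream.locatedBlocks field.
Now for openInfo() in detail. It calls fetchLocatedBlocksAndGetLastBlockLength() to obtain the block locations of the file. The main steps are:
- first, dfsClient.getLocatedBlocks() is called, which uses the RPC interface ClientProtocol.getBlockLocations() to fetch the file's block locations from the NameNode;
- next, the newly fetched block list is compared with the one already stored in locatedBlocks, and locatedBlocks is updated to the new value;
- finally, if the last block is still being written, readBlockLength() is called, which uses the RPC interface ClientDatanodeProtocol to ask a DataNode for the length of the last block, and the length recorded in locatedBlocks for that block is updated.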
private long fetchLocatedBlocksAndGetLastBlockLength() throws IOException {
  // Fetch the file's block locations from the NameNode via the RPC ClientProtocol.getBlockLocations()
  final LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0);
  if (DFSClient.LOG.isDebugEnabled()) {
    DFSClient.LOG.debug("newInfo = " + newInfo);
  }
  if (newInfo == null) {
    throw new IOException("Cannot open filename " + src);
  }
  // Compare with the cached block list, then update the locatedBlocks field
  if (locatedBlocks != null) {
    Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
    Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
    while (oldIter.hasNext() && newIter.hasNext()) {
      if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
        throw new IOException("Blocklist for " + src + " has changed!");
      }
    }
  }
  locatedBlocks = newInfo;
  long lastBlockBeingWrittenLength = 0;
  if (!locatedBlocks.isLastBlockComplete()) {
    final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
    if (last != null) {
      if (last.getLocations().length == 0) {
        if (last.getBlockSize() == 0) {
          // if the length is zero, then no data has been written to
          // datanode. So no need to wait for the locations.
          return 0;
        }
        return -1;
      }
      // Ask a DataNode for the last block's length via ClientDatanodeProtocol and update it
      final long len = readBlockLength(last);
      last.getBlock().setNumBytes(len);
      lastBlockBeingWrittenLength = len;
    }
  }
  fileEncryptionInfo = locatedBlocks.getFileEncryptionInfo();
  currentNode = null;
  return lastBlockBeingWrittenLength;
}
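The method above returns -1 when the last block is under construction and no DataNode has reported a location for it yet. A caller has to treat that value as "try again later". The sketch below shows one way such a retry loop could look. It is a simplified model and not the actual openInfo() code: the class `LastBlockLengthRetry`, the parameters `maxRetries` and `intervalMs`, and the use of an unchecked exception are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;

/** Toy model of retrying the last-block-length lookup (hypothetical, not a Hadoop class). */
final class LastBlockLengthRetry {

    /**
     * Calls fetch until it returns something other than -1 or the retries run out.
     * maxRetries and intervalMs are illustrative parameters, not Hadoop configuration keys.
     */
    static long openInfo(LongSupplier fetch, int maxRetries, long intervalMs) {
        long length = fetch.getAsLong();
        int retriesLeft = maxRetries;
        while (length == -1 && retriesLeft > 0) {
            try {
                Thread.sleep(intervalMs); // give the DataNodes time to report the block
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            length = fetch.getAsLong();
            retriesLeft--;
        }
        if (length == -1) {
            throw new IllegalStateException("Could not obtain the last block locations.");
        }
        return length;
    }

    public static void main(String[] args) {
        // Simulated lookup: the locations are unknown twice, then the length is 1024.
        long[] answers = {-1, -1, 1024};
        int[] calls = {0};
        System.out.println(openInfo(() -> answers[calls[0]++], 3, 10));
    }
}
```

A return value of 0 needs no retry: as the comment in the source says, a zero-length last block means nothing has been written to a DataNode yet.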
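2. inputStream.read(): once the DFSInputStream for the file has been constructed, inputStream.read() can be called to read the data blocks. The basic steps are:
- currentNode = blockSeekTo(targetPos); finds the best DataNode that holds the next block. blockSeekTo() first calls getBlockAt() to get the block under the current cursor, then calls chooseDataNode() to pick the best DataNode, and finally constructs the blockReader object used to stream that block;
- the blockReader object reads a block from a given DataNode. For a remote read, constructing it creates a Sender object that sends the Op.READ_BLOCK opcode to the DataNode. Several read modes exist (this article covers the remote mode):
  - BlockReaderLocal: short-circuit local read (the client and the DataNode are on the same machine, so the data can be read straight from the local disk)
  - RemoteBlockReader2: reads the block from the DataNode over a socket connection
- readBuffer() reads the block's data from the stream by delegating to blockReader.read(buf). On a read error it follows the retry policy: it either calls seekToBlockSource() to retry the current node or calls seekToNewSource() (which calls blockSeekTo() again) to pick a new DataNode.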
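The first step, finding the block under the cursor, boils down to a search over block start offsets followed by a subtraction. The sketch below is a minimal model of that idea using a binary search. The class `BlockLookup` and its array-based representation are assumptions for illustration; the real getBlockAt() works on LocatedBlock objects and can also refetch locations from the NameNode.

```java
import java.util.Arrays;

/** Toy model of finding the block that contains a file offset (hypothetical, not a Hadoop class). */
final class BlockLookup {

    /**
     * startOffsets must be sorted ascending and begin with 0.
     * Returns the index of the block whose range contains pos (pos must be >= 0).
     */
    static int findBlock(long[] startOffsets, long pos) {
        int idx = Arrays.binarySearch(startOffsets, pos);
        // On a miss binarySearch returns -(insertionPoint) - 1; the block containing pos
        // is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;            // 128 MB, a common HDFS block size
        long[] starts = {0, blockSize, 2 * blockSize};  // a three-block file
        long pos = blockSize + 4096;                    // somewhere inside the second block
        int idx = findBlock(starts, pos);
        long offsetIntoBlock = pos - starts[idx];       // same arithmetic as in blockSeekTo()
        System.out.println("block " + idx + ", offset " + offsetIntoBlock);
    }
}
```

The offset inside the block is what the client later passes to the DataNode as the start offset of the read.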
/**
 * Open a DataInputStream to a DataNode so that it can be read from.
 * We get block ID and the IDs of the destinations at startup, from the namenode.
 */
private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
  //
  // Connect to best DataNode for desired Block, with potential offset
  //
  DatanodeInfo chosenNode = null;
  while (true) {
    // Get the block that contains the current cursor position
    LocatedBlock targetBlock = getBlockAt(target, true);
    assert (target==pos) : "Wrong postion " + pos + " expect " + target;
    long offsetIntoBlock = target - targetBlock.getStartOffset();
    // Pick the best DataNode
    DNAddrPair retval = chooseDataNode(targetBlock, null);
    chosenNode = retval.info;
    InetSocketAddress targetAddr = retval.addr;
    StorageType storageType = retval.storageType;
    try {
      ExtendedBlock blk = targetBlock.getBlock();
      Token<BlockTokenIdentifier> accessToken = targetBlock.getBlockToken();
      // Build the blockReader object used to stream this block
      blockReader = new BlockReaderFactory(dfsClient.getConf()).
          setInetSocketAddress(targetAddr).
          setRemotePeerFactory(dfsClient).
          setDatanodeInfo(chosenNode).
          ......
          build();
      if(connectFailedOnce) {
        DFSClient.LOG.info("Successfully connected to " + targetAddr +
                           " for " + blk);
      }
      return chosenNode;
    } catch (IOException ex) {
      if (......) {
        // ...... (other error-handling branches omitted)
      } else {
        connectFailedOnce = true;
        DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block"
            + ", add to deadNodes and continue. " + ex, ex);
        // Put chosen node into dead list, continue
        addToDeadNodes(chosenNode); // add chosenNode to the dead-node blacklist
      }
    }
  }
}
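The best DataNode is chosen as follows: the locations in locatedBlocks were already sorted by distance to the client when they were fetched, so it is enough to take the first DataNode that is not in deadNodes.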
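That selection rule fits in a few lines. The sketch below is a simplified model of it, not the real chooseDataNode(): the class `ReplicaChooser`, the use of plain strings for DataNodes, and the Optional return type are assumptions for illustration, and what happens when every replica is dead is left to the caller.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of picking the first live replica (hypothetical, not a Hadoop class). */
final class ReplicaChooser {

    /**
     * sortedReplicas is assumed to be ordered nearest-first already.
     * Returns the first replica that is not in deadNodes, or empty if none is left.
     */
    static Optional<String> chooseDataNode(List<String> sortedReplicas, Set<String> deadNodes) {
        for (String node : sortedReplicas) {
            if (!deadNodes.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> deadNodes = new HashSet<>();
        deadNodes.add("dn1"); // e.g. a previous connection attempt to dn1 failed
        System.out.println(chooseDataNode(replicas, deadNodes));
    }
}
```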
The Op.READ_BLOCK opcode is sent while the reader is being built in reader = new BlockReaderFactory().build(); the call chain is:
- getRemoteBlockReaderFromTcp()
  - blockReader = getRemoteBlockReader(peer)
    - RemoteBlockReader2.newBlockReader()
      - new Sender(out).readBlock(block, blockToken, clientName, startOffset, len, verifyChecksum, cachingStrategy); this is where Sender finally sends the READ_BLOCK opcode
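To make "sending an opcode" more concrete, the sketch below writes a request prefix to an in-memory stream: a two-byte protocol version followed by a one-byte opcode. It is only an illustration of the framing idea. The class `ReadBlockRequestSketch` is hypothetical, the values 28 and 81 are assumptions that should be checked against the DataTransferProtocol and Op definitions of your Hadoop version, and the protobuf-encoded request body that the real Sender writes afterwards is left out entirely.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Toy model of the start of a READ_BLOCK request (hypothetical, not a Hadoop class). */
final class ReadBlockRequestSketch {

    // Assumed values; verify them against your Hadoop version.
    static final short DATA_TRANSFER_VERSION = 28;
    static final byte OP_READ_BLOCK = 81;

    /** Encodes the version and the opcode; the request body is omitted in this sketch. */
    static byte[] encodeHeader() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(DATA_TRANSFER_VERSION); // 2 bytes, big-endian
            out.writeByte(OP_READ_BLOCK);          // 1 byte
        } catch (IOException e) {
            throw new IllegalStateException(e);    // not expected for an in-memory stream
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeHeader()) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

The DataNode side reads the same fields in the same order to decide which operation the client is asking for.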
DFSInputStream#read() reads the block data through readBuffer():
private synchronized int readBuffer(ReaderStrategy reader, int off, int len,
    Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
    throws IOException {
  IOException ioe;
  boolean retryCurrentNode = true;
  while (true) {
    // retry as many times as seekToNewSource allows.
    try {
      // Read the data through the reader
      return reader.doRead(blockReader, off, len, readStatistics);
    } catch ( ChecksumException ce ) {
      DFSClient.LOG.warn("Found Checksum error for "
          + getCurrentBlock() + " from " + currentNode
          + " at " + ce.getPos());
      ioe = ce;
      retryCurrentNode = false;
      // we want to remember which block replicas we have tried
      // Record the corrupt replica in corruptedBlockMap so that it can be reported to the NameNode
      addIntoCorruptedBlockMap(getCurrentBlock(), currentNode,
          corruptedBlockMap);
    } catch ( IOException e ) {
      // .........
    }
    boolean sourceFound = false;
    if (retryCurrentNode) {
      // Retry the current node
      sourceFound = seekToBlockSource(pos);
    } else {
      // Pick a new DataNode to read from
      addToDeadNodes(currentNode);
      sourceFound = seekToNewSource(pos);
    }
    if (!sourceFound) {
      throw ioe;
    }
    retryCurrentNode = false;
  }
}
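The control flow of readBuffer() is easier to see in isolation. The sketch below models it with a queue of replicas: the first failure retries the same replica, and every later failure drops the current replica and moves on to the next one. It is a toy model and not the real method: the class `ReadRetrySketch` and the IntSupplier-based replicas are assumptions, and the special case where a checksum error skips the retry of the current node is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntSupplier;

/** Toy model of the readBuffer() retry loop (hypothetical, not a Hadoop class). */
final class ReadRetrySketch {

    /**
     * Reads from the replica at the head of the queue. The first failure retries the
     * same replica; each later failure removes the current replica and tries the next.
     */
    static int readWithRetry(Deque<IntSupplier> replicas) {
        boolean retryCurrentNode = true;
        RuntimeException last = null;
        while (!replicas.isEmpty()) {
            try {
                return replicas.peekFirst().getAsInt();
            } catch (RuntimeException e) {
                last = e;
            }
            if (retryCurrentNode) {
                retryCurrentNode = false;  // one more attempt on the same replica
            } else {
                replicas.removeFirst();    // give up on it, like addToDeadNodes()
            }
        }
        throw last != null ? last : new IllegalStateException("no replicas");
    }

    public static void main(String[] args) {
        int[] failedAttempts = {0};
        Deque<IntSupplier> replicas = new ArrayDeque<>();
        replicas.add(() -> { failedAttempts[0]++; throw new IllegalStateException("dn1 unreachable"); });
        replicas.add(() -> 42);
        System.out.println(readWithRetry(replicas) + " after " + failedAttempts[0] + " failed attempts");
    }
}
```

As in the original, the retry flag is never set back to true inside the loop, so only the very first error gets a second attempt on the same node.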
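3. blockReader.read(): in remote mode a RemoteBlockReader2 is constructed, which reads the block from the DataNode over a socket connection. Its main read() method calls readNextPacket() to fetch the next packet from the data stream.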
@Override
public synchronized int read(byte[] buf, int off, int len)
    throws IOException {
  // Read the next packet if the current one is exhausted
  if (curDataSlice == null || curDataSlice.remaining() == 0 && bytesNeededToFinish > 0) {
    readNextPacket();
  }
  if (curDataSlice.remaining() == 0) {
    // we're at EOF now
    return -1;
  }
  int nRead = Math.min(curDataSlice.remaining(), len);
  curDataSlice.get(buf, off, nRead);
  return nRead;
}
private void readNextPacket() throws IOException {
  //Read packet headers.
  // Read the packet header and the packet data
  packetReceiver.receiveNextPacket(in);
  PacketHeader curHeader = packetReceiver.getHeader();
  curDataSlice = packetReceiver.getDataSlice();
  assert curDataSlice.capacity() == curHeader.getDataLen();
  // Sanity check the lengths
  // Check that the packet header is consistent
  if (!curHeader.sanityCheck(lastSeqNo)) {
    throw new IOException("BlockReader: error in packet header " +
        curHeader);
  }
  // Verify the packet checksums
  if (curHeader.getDataLen() > 0) {
    int chunks = 1 + (curHeader.getDataLen() - 1) / bytesPerChecksum;
    int checksumsLen = chunks * checksumSize;
    assert packetReceiver.getChecksumSlice().capacity() == checksumsLen :
        "checksum slice capacity=" + packetReceiver.getChecksumSlice().capacity() +
        " checksumsLen=" + checksumsLen;
    lastSeqNo = curHeader.getSeqno();
    if (verifyChecksum && curDataSlice.remaining() > 0) {
      checksum.verifyChunkedSums(curDataSlice,
          packetReceiver.getChecksumSlice(),
          filename, curHeader.getOffsetInBlock());
    }
    bytesNeededToFinish -= curHeader.getDataLen();
  }
  // First packet will include some data prior to the first byte
  // the user requested. Skip it.
  if (curHeader.getOffsetInBlock() < startOffset) {
    int newPos = (int) (startOffset - curHeader.getOffsetInBlock());
    curDataSlice.position(newPos);
  }
  // If we've now satisfied the whole client read, read one last packet
  // header, which should be empty
  if (bytesNeededToFinish <= 0) {
    readTrailingEmptyPacket();
    if (verifyChecksum) {
      sendReadResult(Status.CHECKSUM_OK);
    } else {
      sendReadResult(Status.SUCCESS);
    }
  }
}
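The chunk arithmetic in readNextPacket() can be checked with a small worked example. The sketch below reuses the formula chunks = 1 + (dataLen - 1) / bytesPerChecksum and then verifies per-chunk checksums. It is a simplified model: the class `ChunkChecksumSketch` is hypothetical, the sizes 512 and 4 are assumed typical values, and java.util.zip.CRC32 stands in for whatever checksum type the cluster is configured with.

```java
import java.util.zip.CRC32;

/** Toy model of per-chunk checksum verification (hypothetical, not a Hadoop class). */
final class ChunkChecksumSketch {

    /** Number of checksum chunks needed for dataLen bytes (dataLen must be > 0). */
    static int chunkCount(int dataLen, int bytesPerChecksum) {
        return 1 + (dataLen - 1) / bytesPerChecksum;
    }

    /** Computes one CRC32 value per chunk; the last chunk may be shorter than the others. */
    static long[] chunkedSums(byte[] data, int bytesPerChecksum) {
        long[] sums = new long[chunkCount(data.length, bytesPerChecksum)];
        for (int i = 0; i < sums.length; i++) {
            int start = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - start);
            CRC32 crc = new CRC32();
            crc.update(data, start, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum differs, or -1 if all match. */
    static int verifyChunkedSums(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = chunkedSums(data, bytesPerChecksum);
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int bytesPerChecksum = 512;  // assumed chunk size
        int checksumSize = 4;        // a CRC32 value occupies 4 bytes
        int chunks = chunkCount(64 * 1024, bytesPerChecksum);
        System.out.println(chunks + " chunks, " + chunks * checksumSize + " checksum bytes");

        byte[] data = new byte[1300];                 // three chunks: 512 + 512 + 276 bytes
        long[] sums = chunkedSums(data, bytesPerChecksum);
        data[600] ^= 1;                               // flip one bit inside the second chunk
        System.out.println("first bad chunk: " + verifyChunkedSums(data, sums, bytesPerChecksum));
    }
}
```

For a 64 KB packet with 512-byte chunks this gives 128 chunks and 512 bytes of checksums, which is the relationship the assert in readNextPacket() checks against the checksum slice.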
4. When reading is finished, the client simply calls DFSInputStream.close() to close the stream; internally this ends up closing the blockReader.
@Override
public synchronized void close() throws IOException {
  if (closed) {
    return;
  }
  dfsClient.checkOpen();
  if (!extendedReadBuffers.isEmpty()) {
    final StringBuilder builder = new StringBuilder();
    extendedReadBuffers.visitAll(new IdentityHashStore.Visitor<ByteBuffer, Object>() {
      private String prefix = "";
      @Override
      public void accept(ByteBuffer k, Object v) {
        builder.append(prefix).append(k);
        prefix = ", ";
      }
    });
    DFSClient.LOG.warn("closing file " + src + ", but there are still " +
        "unreleased ByteBuffers allocated by read(). " +
        "Please release " + builder.toString() + ".");
  }
  if (blockReader != null) {
    blockReader.close();
    blockReader = null;
  }
  super.close();
  closed = true;
}
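The closed flag at the top of close() makes the method idempotent: a second call returns immediately instead of releasing the reader twice. The sketch below isolates that pattern. The class `IdempotentCloseSketch` and its release counter are hypothetical and only stand in for the blockReader cleanup shown above.

```java
import java.io.Closeable;

/** Toy model of an idempotent close() (hypothetical, not a Hadoop class). */
final class IdempotentCloseSketch implements Closeable {

    private boolean closed;
    private int releases; // how many times the underlying resource was released

    @Override
    public synchronized void close() {
        if (closed) {
            return;       // already closed, nothing to do
        }
        releases++;       // stands in for blockReader.close()
        closed = true;
    }

    synchronized int releases() {
        return releases;
    }

    public static void main(String[] args) {
        IdempotentCloseSketch stream = new IdempotentCloseSketch();
        try (IdempotentCloseSketch s = stream) {
            // read from the stream here
        }
        stream.close();   // a second close is harmless
        System.out.println("released " + stream.releases() + " time(s)");
    }
}
```

This is also why wrapping the stream in try-with-resources is safe even when the code inside the block closes it explicitly.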