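The following is the result of my own study of the source code. This article is dedicated to me and my friends; corrections are welcome. Thanks to Doug and others!
Note: the Hadoop version is 0.20.2. Some readers said that reading code makes them dizzy, so this article is written as plain-text description, and I have even color-coded the text for you ^ o ^
In the previous article we discussed how DFSClient is initialized. This time we look at how data is read and written. This is slightly more complicated than initialization, because the client must not only communicate with the name node but also access the data nodes. For reading data, the name node provides two remote methods:
a, getBlockLocations: determines where the data is located
b, reportBadBlocks: reports bad blocks discovered by the client to the name node
Let us now go through the read path in detail.
DFSClient's open() method opens a file: it checks that the file system is still open, then constructs and returns a DFSInputStream object, through which the user can read the data of the HDFS file.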
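The division of labor described above (locations come from the name node, bytes come from the data nodes, and corruption found by the client goes back to the name node) can be sketched with plain JDK classes. This is only an illustration under assumptions: `Replica`, `NameNodeCalls`, `InMemoryNameNode` and `readFile` are hypothetical names, the data nodes are collapsed into in-memory objects, and a CRC32 check stands in for HDFS checksum verification. It is not the Hadoop implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

public class ReadFlowSketch {
    /** Hypothetical replica: what one data node stores for a block, plus the checksum recorded at write time. */
    static class Replica {
        final String node;
        final String text;
        final long storedCrc;

        Replica(String node, String text, long storedCrc) {
            this.node = node;
            this.text = text;
            this.storedCrc = storedCrc;
        }
    }

    /** Hypothetical stand-in for the two name node calls used while reading. */
    interface NameNodeCalls {
        List<List<Replica>> getBlockLocations(String src); // one replica list per block, in file order
        void reportBadBlocks(Replica bad);                  // corruption found by the client
    }

    /** In-memory name node that only remembers which replicas were reported. */
    static class InMemoryNameNode implements NameNodeCalls {
        final List<List<Replica>> blocks;
        final List<String> reported = new ArrayList<>();

        InMemoryNameNode(List<List<Replica>> blocks) { this.blocks = blocks; }
        public List<List<Replica>> getBlockLocations(String src) { return blocks; }
        public void reportBadBlocks(Replica bad) { reported.add(bad.node); }
    }

    static long crc(String text) {
        CRC32 c = new CRC32();
        c.update(text.getBytes(StandardCharsets.UTF_8));
        return c.getValue();
    }

    /** Reads each block from the first replica whose checksum matches; mismatches are reported. */
    static String readFile(String src, NameNodeCalls nn) throws IOException {
        StringBuilder out = new StringBuilder();
        for (List<Replica> block : nn.getBlockLocations(src)) {
            boolean read = false;
            for (Replica r : block) {
                if (crc(r.text) == r.storedCrc) {
                    out.append(r.text);
                    read = true;
                    break;
                }
                nn.reportBadBlocks(r);
            }
            if (!read) {
                throw new IOException("No intact replica for a block of " + src);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        long good = crc("hello ");
        InMemoryNameNode nn = new InMemoryNameNode(Arrays.asList(
            Arrays.asList(new Replica("dn1", "hellX ", good), new Replica("dn2", "hello ", good)),
            Arrays.asList(new Replica("dn3", "world", crc("world")))));
        System.out.println(readFile("/demo", nn) + " / reported: " + nn.reported);
    }
}
```

In the demo the first replica of the first block is deliberately corrupt, so the sketch reports it and falls back to the second replica before moving on to the next block.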
=========================================================================================================
--------------------------------------------------------------------------------------------------------------------------------------------
Member variables of the DFSClient class:
public static final Log LOG = LogFactory.getLog(DFSClient.class);
public static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;
private static final int TCP_WINDOW_SIZE = 128 * 1024; // 128 KB
/** namenode adds retry-on-failure on top of rpcNamenode */
public final ClientProtocol namenode;
private final ClientProtocol rpcNamenode;
/** Unix user and group information */
final UnixUserGroupInformation ugi;
/** whether this DFSClient is currently running */
volatile boolean clientRunning = true;
Random r = new Random();
/** name of the client */
final String clientName;
/** thread that checks and renews leases */
final LeaseChecker leasechecker = new LeaseChecker();
private Configuration conf;
/** default block size */
private long defaultBlockSize;
/** default number of replicas per block */
private short defaultReplication;
/** factory for creating socket connections */
private SocketFactory socketFactory;
/** timeout for socket connections */
private int socketTimeout;
/** timeout for writing data to a DataNode over a socket */
private int datanodeWriteTimeout;
/** a packet can be at most 64 KB */
final int writePacketSize; // 64 KB
/** collects file system statistics */
private final FileSystem.Statistics stats;
private int maxBlockAcquireFailures;
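The relationship between namenode and rpcNamenode (the same protocol, with retry-on-failure layered on top) can be illustrated with a JDK dynamic proxy. This is a minimal sketch under assumptions, not Hadoop's own retry code: `Protocol`, `FlakyNameNode` and `withRetry` are hypothetical names, the protocol has a single method, and the policy is simply "retry on IOException up to a fixed number of attempts".

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

public class RetrySketch {
    /** Hypothetical one-method protocol, standing in for ClientProtocol. */
    interface Protocol {
        String getBlockLocations(String src) throws IOException;
    }

    /** Hypothetical raw client that fails a fixed number of times before succeeding. */
    static class FlakyNameNode implements Protocol {
        int calls = 0;
        final int failures;

        FlakyNameNode(int failures) { this.failures = failures; }

        public String getBlockLocations(String src) throws IOException {
            calls++;
            if (calls <= failures) {
                throw new IOException("connection reset (simulated)");
            }
            return "blocks-of-" + src;
        }
    }

    /** Wraps a raw protocol object so that an IOException is retried, up to maxAttempts calls in total. */
    static Protocol withRetry(final Protocol raw, final int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        InvocationHandler handler = (proxy, method, args) -> {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return method.invoke(raw, args);
                } catch (InvocationTargetException e) {
                    if (!(e.getCause() instanceof IOException)) {
                        throw e.getCause(); // not retried: pass other failures through unchanged
                    }
                    last = (IOException) e.getCause();
                }
            }
            throw last; // every attempt failed with an IOException
        };
        return (Protocol) Proxy.newProxyInstance(
            Protocol.class.getClassLoader(), new Class<?>[] {Protocol.class}, handler);
    }

    public static void main(String[] args) throws IOException {
        FlakyNameNode rpcNamenode = new FlakyNameNode(2);
        Protocol namenode = withRetry(rpcNamenode, 3);
        System.out.println(namenode.getBlockLocations("/demo") + " after " + rpcNamenode.calls + " calls");
    }
}
```

Callers hold only the wrapped object, so the retry behavior stays invisible to the code that issues the remote call.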
---------------------------------------------------------------------------------------------------------------------------------------------
The abstract open method of FileSystem:
---------------------------------------------------------------------------------------------------------------------------------------------
/**
* Opens an FSDataInputStream for the file at the given path
*/
public abstract FSDataInputStream open(Path f, int bufferSize)
throws IOException;
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The open method of DistributedFileSystem:
/**
* Opens the given file and returns an input stream
*/
public FSDataInputStream open(Path f, int bufferSize) throws IOException {
return new DFSClient.DFSDataInputStream(
dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));
}
--------------------------------------------------------------------------------------------------------------------------------------------
The open method of DFSClient, which returns a DFSInputStream:
DFSInputStream open(String src, int buffersize, boolean verifyChecksum,
FileSystem.Statistics stats
) throws IOException {
checkOpen();
// Get block info from namenode
return new DFSInputStream(src, buffersize, verifyChecksum);
}
-----------------------------------------------------------------------------------------------------------------------
Putting the pieces above together, we can see that:
the file system performs the actual reading through a DFSInputStream that it wraps internally
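The wrapping relationship can be shown in miniature with JDK streams. This is a sketch under assumptions rather than the Hadoop classes themselves: `Client` plays the role of DFSClient (including an open check on a running flag), `FileSystemFacade` plays the role of DistributedFileSystem, and an in-memory ByteArrayInputStream stands in for DFSInputStream. All of these names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OpenSketch {
    /** Hypothetical client: plays the role of DFSClient. */
    static class Client {
        volatile boolean clientRunning = true;

        /** Refuses to hand out streams once the client has been shut down. */
        void checkOpen() throws IOException {
            if (!clientRunning) {
                throw new IOException("Filesystem closed");
            }
        }

        /** Returns the low-level stream that does the real reading (here an in-memory stand-in). */
        InputStream open(String src) throws IOException {
            checkOpen();
            return new ByteArrayInputStream(("contents of " + src).getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Hypothetical file system facade: plays the role of DistributedFileSystem. */
    static class FileSystemFacade {
        final Client dfs = new Client();

        DataInputStream open(String path) throws IOException {
            // The facade only adds a wrapper; every read is delegated to the inner stream.
            return new DataInputStream(dfs.open(path));
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystemFacade fs = new FileSystemFacade();
        try (DataInputStream in = fs.open("/demo")) {
            System.out.println((char) in.readByte());
        }
    }
}
```

The caller only ever sees the outer stream type, while the inner stream decides where the bytes actually come from.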
Next, let us look at how a DFSInputStream object is created.
a, initialize the following fields:
1, the flag that says whether the data read should be checksum-verified
this.verifyChecksum = verifyChecksum;
2, the buffer size
this.buffersize = buffersize;
3, the path of the source file to open
this.src = src;
4, the prefetch size (10 block sizes by default)
prefetchSize = conf.getLong("dfs.read.prefetch.size", prefetchSize);
b, fetch the metadata of the file to be opened from the NameNode
1, call callGetBlockLocations() to obtain the metadata of the requested file; the result is held in newInfo and later assigned to locatedBlocks
// a crucial step
LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);
2, if the block locations of the file cannot be obtained, the file is treated as nonexistent and the corresponding exception is thrown
if (newInfo == null) {
throw new IOException("Cannot open filename " + src);
}
3, use the newly located blocks to update the current locatedBlocks, after checking that the blocks already known have not changed
if (locatedBlocks != null) {
Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
while (oldIter.hasNext() && newIter.hasNext()) {
if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
throw new IOException("Blocklist for " + src + " has changed!");
}
}
}
this.locatedBlocks = newInfo;
4, reset currentNode to null
this.currentNode = null;
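Steps b.1 to b.4 above can be condensed into a small stand-alone sketch. It rests on several assumptions: blocks are represented by plain `Long` IDs instead of LocatedBlock objects, `BlockLookup` and `MutableLookup` are hypothetical stand-ins for the name node call, and the prefetch size is fixed at 10 block sizes as stated in step a.4. The control flow mirrors the snippets shown above, but this is not the Hadoop class itself.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OpenInfoSketch {
    /** Hypothetical stand-in for the name node lookup; returns block IDs, or null if the file is unknown. */
    interface BlockLookup {
        List<Long> getBlockLocations(String src, long start, long length);
    }

    /** Hypothetical lookup whose answer can be changed between calls. */
    static class MutableLookup implements BlockLookup {
        List<Long> blocks;

        MutableLookup(List<Long> blocks) { this.blocks = blocks; }
        public List<Long> getBlockLocations(String src, long start, long length) { return blocks; }
    }

    static class Stream {
        final String src;
        final BlockLookup namenode;
        final long prefetchSize;
        List<Long> locatedBlocks;   // null until the first successful openInfo()
        String currentNode = "dn1"; // pretend a data node had been selected earlier

        Stream(String src, BlockLookup namenode, long blockSize) {
            this.src = src;
            this.namenode = namenode;
            this.prefetchSize = 10 * blockSize; // step a.4: 10 block sizes by default
        }

        void openInfo() throws IOException {
            // b.1: ask the name node for the blocks covering the first prefetchSize bytes
            List<Long> newInfo = namenode.getBlockLocations(src, 0, prefetchSize);
            // b.2: no answer means the file cannot be opened
            if (newInfo == null) {
                throw new IOException("Cannot open filename " + src);
            }
            // b.3: blocks that were already known must be unchanged
            if (locatedBlocks != null) {
                Iterator<Long> oldIter = locatedBlocks.iterator();
                Iterator<Long> newIter = newInfo.iterator();
                while (oldIter.hasNext() && newIter.hasNext()) {
                    if (!oldIter.next().equals(newIter.next())) {
                        throw new IOException("Blocklist for " + src + " has changed!");
                    }
                }
            }
            this.locatedBlocks = newInfo;
            // b.4: forget the data node used so far
            this.currentNode = null;
        }
    }

    public static void main(String[] args) throws IOException {
        Stream in = new Stream("/demo", new MutableLookup(Arrays.asList(1L, 2L)), 64L * 1024 * 1024);
        in.openInfo();
        System.out.println(in.locatedBlocks + ", currentNode=" + in.currentNode);
    }
}
```

Note that the comparison only walks the common prefix of the two lists, so a block list that merely grew is accepted, while a list whose existing entries differ is rejected.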
=========================================================================================================
Once the DFSInputStream has been constructed, data can be read through its read() method.
/**
* Reads a single byte
*/
@Override
public synchronized int read() throws IOException {
int ret = read( oneByteBuf, 0, 1 );