Hadoop Source Code Analysis

Latest recommended article published on 2023-05-26 13:54:38

xiaoyutongxue6

Without further ado, let's go straight to the main method of the NameNode class in the org.apache.hadoop.hdfs.server.namenode package.

public static void main(String argv[]) throws Exception {
  if (DFSUtil.parseHelpArgument(argv, NameNode.USAGE, System.out, true)) {
    System.exit(0);
  }
  try {
    StringUtils.startupShutdownMessage(NameNode.class, argv, LOG);
    NameNode namenode = createNameNode(argv, null);
    if (namenode != null) {
      namenode.join();
    }
  } catch (Throwable e) {
    LOG.fatal("Exception in namenode join", e);
    terminate(1, e);
  }
}

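DFSUtil's parseHelpArgument method checks whether the command-line arguments ask for help and, if so, prints the usage text, after which main exits. Now look at the try/catch block:

As its name suggests, StringUtils's startupShutdownMessage method logs the startup and shutdown messages. The real work of NameNode is done by createNameNode. Stepping into createNameNode, we find that it is mostly a switch statement; let's focus on the format case: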

case FORMAT: {
  boolean aborted = format(conf, startOpt.getForceFormat(),
      startOpt.getInteractiveFormat());
  terminate(aborted ? 1 : 0);
  return null; // avoid javac warning
}
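
To make the dispatch idea concrete, here is a minimal sketch of a switch over a startup option. It is not the real createNameNode: the Option enum, parse, and dispatch are hypothetical names of mine, and the sketch only returns a description of what would happen instead of formatting or starting anything.

```java
import java.util.Locale;

class StartupDispatchSketch {
    // Hypothetical, reduced set of startup options; the real enum has many more.
    enum Option { FORMAT, REGULAR }

    // Maps a command-line flag to an option; anything else means a regular start.
    static Option parse(String[] args) {
        for (String arg : args) {
            if ("-format".equals(arg.toLowerCase(Locale.ROOT))) {
                return Option.FORMAT;
            }
        }
        return Option.REGULAR;
    }

    // Returns a description of the action rather than performing it.
    static String dispatch(String[] args) {
        switch (parse(args)) {
            case FORMAT:
                return "format and exit";
            default:
                return "start server";
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new String[] {"-format"}));
        System.out.println(dispatch(new String[] {}));
    }
}
```

The point to notice is the same as in the FORMAT case above: a one-shot action finishes inside the switch and never hands a running server back to main.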

Next, step into the format method. Each step is explained in the comments next to the code.

private static boolean format(Configuration conf, boolean force,
    boolean isInteractive) throws IOException {
  String nsId = DFSUtil.getNamenodeNameServiceId(conf); // get the nameservice ID, configured for Hadoop HA
  String namenodeId = HAUtil.getNameNodeId(conf, nsId); // get the NameNode ID
  initializeGenericKeys(conf, nsId, namenodeId);
  checkAllowFormat(conf); // check whether formatting is allowed (dfs.namenode.support.allow.format), a guard against wiping a production HDFS by accident
  if (UserGroupInformation.isSecurityEnabled()) { // UserGroupInformation tells us this part is about HDFS security:
    // if security (Kerberos) is enabled, log in with the NameNode keytab
    InetSocketAddress socAddr = getAddress(conf);
    SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
        DFS_NAMENODE_USER_NAME_KEY, socAddr.getHostName());
  }
  /*
   Get the paths configured in dfs.namenode.name.dir in hdfs-site.xml, e.g. /home/hadoop/dfs/name,
   which store the file system namespace image
  */
  Collection<URI> nameDirsToFormat = FSNamesystem.getNamespaceDirs(conf);
  /*
   Get the paths configured in dfs.namenode.shared.edits.dir in hdfs-site.xml. With a Hadoop HA setup
   the value can be qjournal://node1:8485;node2:8485;node3:8485/clusterid, where clusterid is the value configured in dfs.nameservices
  */
  List<URI> sharedDirs = FSNamesystem.getSharedEditsDirs(conf);
  List<URI> dirsToPrompt = new ArrayList<URI>();
  dirsToPrompt.addAll(nameDirsToFormat);
  dirsToPrompt.addAll(sharedDirs);
  List<URI> editDirsToFormat =
      FSNamesystem.getNamespaceEditsDirs(conf);
  // if clusterID is not provided - see if you can find the current one
  String clusterId = StartupOption.FORMAT.getClusterId();
  if(clusterId == null || clusterId.equals("")) {
    //Generate a new cluster id
    clusterId = NNStorage.newClusterID();
  }
  System.out.println("Formatting using clusterid: " + clusterId);
  // how the file system objects are created will be analyzed in detail in a later post
  FSImage fsImage = new FSImage(conf, nameDirsToFormat, editDirsToFormat);
  try {
    FSNamesystem fsn = new FSNamesystem(conf, fsImage);
    fsImage.getEditLog().initJournalsForWrite();
    if (!fsImage.confirmFormat(force, isInteractive)) {
      return true; // aborted
    }
    fsImage.format(fsn, clusterId);
  } catch (IOException ioe) {
    LOG.warn("Encountered exception during format: ", ioe);
    fsImage.close();
    throw ioe;
  }
  return false;
}
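
The cluster ID fallback in the middle of format is easy to try in isolation. The sketch below is my own stand-in, not NNStorage itself: the class and method names are hypothetical, and the "CID-" plus random UUID format is an assumption about what NNStorage.newClusterID produces.

```java
import java.util.UUID;

class ClusterIdSketch {
    // Returns the supplied ID, or a freshly generated one when none was given.
    static String resolve(String provided) {
        if (provided == null || provided.isEmpty()) {
            // Assumed format: "CID-" followed by a random UUID.
            return "CID-" + UUID.randomUUID();
        }
        return provided;
    }

    public static void main(String[] args) {
        System.out.println("Formatting using clusterid: " + resolve(null));
        System.out.println("Formatting using clusterid: " + resolve("CID-mine"));
    }
}
```

A caller that wants several NameNodes to join one cluster would pass the same ID each time; only the first format of a brand-new cluster needs the generated one.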

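Back in NameNode's main method: namenode.join ultimately blocks on two RPC server threads, RPC.Server serviceRpcServer and RPC.Server clientRpcServer, which were already started while the NameNode was being created.

serviceRpcServer listens for requests from DataNodes, and clientRpcServer listens for requests from clients.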
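
Here is a minimal sketch of the "start two servers, then block in join" shape, using plain Java threads. The Server class is a hypothetical stand-in of mine, not RPC.Server, and it simply waits for a stop signal instead of handling calls.

```java
import java.util.concurrent.CountDownLatch;

class TwoServerJoinSketch {
    // Hypothetical stand-in for an RPC server: a thread that runs until stopped.
    static class Server {
        private final CountDownLatch stopSignal = new CountDownLatch(1);
        private final Thread thread;

        Server(String name) {
            thread = new Thread(() -> {
                try {
                    stopSignal.await(); // a real server would accept and handle calls here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, name);
        }

        void start() { thread.start(); }
        void stop() { stopSignal.countDown(); }
        void join() throws InterruptedException { thread.join(); }
        boolean isAlive() { return thread.isAlive(); }
    }

    // Starts both servers, stops them, and waits for both; true if both have exited.
    static boolean startStopAndJoin() {
        Server clientRpcServer = new Server("clientRpcServer");
        Server serviceRpcServer = new Server("serviceRpcServer");
        clientRpcServer.start();
        serviceRpcServer.start();
        clientRpcServer.stop();
        serviceRpcServer.stop();
        try {
            clientRpcServer.join();  // blocks until the client-facing thread ends
            serviceRpcServer.join(); // then blocks until the service-facing thread ends
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !clientRpcServer.isAlive() && !serviceRpcServer.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both servers finished: " + startStopAndJoin());
    }
}
```

In the sketch the servers are stopped right away so that the program ends; in a long-running daemon the join calls are what keep main from returning.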

The previous post covered formatting the NameNode. The format method contains:

FSImage fsImage = new FSImage(conf, nameDirsToFormat, editDirsToFormat);
try {
  FSNamesystem fsn = new FSNamesystem(conf, fsImage);

Today's post covers FSImage and FSNamesystem, in (1) and (2) respectively.

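(1) First, FSImage. FSImage handles checkpointing and the logging of namespace edits.

On disk, the fsimage lives under the /home/hadoop/dfs/name path mentioned in the previous post. That directory contains current, image, and in_use.lock; the current directory holds the edits log, fsimage (the namespace image), fstime (the image timestamp), and VERSION (version information). This is the layout as it looked in Hadoop 1; Hadoop 2 names the image and edits files by transaction ID and records seen_txid instead of fstime.

Common FSImage operations are loadFSImage (load the file system image) and saveFSImage (save the file system image).

loadFSImage eventually calls the load(File curFile) method in the FSImageFormat class: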

public void load(File curFile) throws IOException {
  checkNotLoaded(); // make sure the code below only runs on the first load
  assert curFile != null : "curFile is null"; // assertion
  StartupProgress prog = NameNode.getStartupProgress(); // get the startup progress tracker
  Step step = new Step(StepType.INODES);
  prog.beginStep(Phase.LOADING_FSIMAGE, step);
  long startTime = now(); // start time
  //
  // Load in bits
  //
  MessageDigest digester = MD5Hash.getDigester();
  DigestInputStream fin = new DigestInputStream(
      new FileInputStream(curFile), digester); // open the input stream; the MD5 digest is updated as bytes are read
  DataInputStream in = new DataInputStream(fin); // wrap the input stream
  try {
    // read image version: first appeared in version -1
    int imgVersion = in.readInt(); // read the image version
    if (getLayoutVersion() != imgVersion) { // check that the versions match; throw if they do not
      throw new InconsistentFSStateException(curFile,
          "imgVersion " + imgVersion +
          " expected to be " + getLayoutVersion());
    }
    boolean supportSnapshot = NameNodeLayoutVersion.supports( // check whether snapshots are supported
        LayoutVersion.Feature.SNAPSHOT, imgVersion);
    if (NameNodeLayoutVersion.supports(
        LayoutVersion.Feature.ADD_LAYOUT_FLAGS, imgVersion)) {
      LayoutFlags.read(in);
    }
    // read namespaceID: first appeared in version -2
    in.readInt(); // read the namespace ID
    long numFiles = in.readLong(); // number of files
    ......
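
The header-reading pattern above (a DataInputStream over a DigestInputStream, with a version check first) can be tried on an in-memory byte array. This is only a sketch under my own assumptions: EXPECTED_VERSION is a made-up value, the header holds just three fields, and the layout flags and everything after the file count are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ImageHeaderSketch {
    // Hypothetical layout version; the real value depends on the Hadoop release.
    static final int EXPECTED_VERSION = -60;

    // Builds a tiny fake image header: version, namespace ID, file count.
    static byte[] writeHeader(int version, int namespaceId, long numFiles) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(version);
            out.writeInt(namespaceId);
            out.writeLong(numFiles);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the header back through a digesting stream and returns the file count.
    static long readNumFiles(byte[] image) {
        MessageDigest digester;
        try {
            digester = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (DataInputStream in = new DataInputStream(
                new DigestInputStream(new ByteArrayInputStream(image), digester))) {
            int imgVersion = in.readInt();
            if (imgVersion != EXPECTED_VERSION) {
                throw new IOException("imgVersion " + imgVersion
                        + " expected to be " + EXPECTED_VERSION);
            }
            in.readInt(); // namespace ID, read and discarded as in the listing
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readNumFiles(writeHeader(EXPECTED_VERSION, 7, 42L)));
    }
}
```

Because every byte passes through the DigestInputStream, the MD5 is accumulated as a side effect of reading, so a checksum of the whole file is available at the end without a second pass.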

saveFSImage eventually calls the save(File file, FSImageCompression compression) method in FSImageFormatProtobuf:

void save(File file, FSImageCompression compression) throws IOException {
  FileOutputStream fout = new FileOutputStream(file); // create the output stream
  fileChannel = fout.getChannel(); // get the file channel of the output stream (a java.nio FileChannel, not a network socket channel)
  try {
    saveInternal(fout, compression, file.getAbsolutePath().toString()); // inside this method, underlyingOutputStream.write(FSImageUtil.MAGIC_HEADER) starts persisting the image
  } finally {
    fout.close();
  }
}
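
To see why save keeps the FileChannel around, here is a small sketch that writes a magic header and then asks the channel how far into the file it is. The class is mine, the "HDFSIMG1" value is my assumption about what FSImageUtil.MAGIC_HEADER contains, and the payload is arbitrary bytes instead of protobuf sections.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class MagicHeaderSketch {
    // Assumed magic bytes; treat the exact value as an illustration.
    static final byte[] MAGIC = "HDFSIMG1".getBytes(StandardCharsets.US_ASCII);

    // Writes the magic header and the payload; returns the offset where the payload starts.
    static long save(Path file, byte[] payload) {
        try (FileOutputStream fout = new FileOutputStream(file.toFile())) {
            FileChannel channel = fout.getChannel();
            fout.write(MAGIC);
            long payloadOffset = channel.position(); // the channel tracks bytes written so far
            fout.write(payload);
            return payloadOffset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that saves into a temporary file and cleans it up.
    static long saveToTempFile(byte[] payload) {
        try {
            Path file = Files.createTempFile("fsimage-sketch", ".bin");
            try {
                return save(file, payload);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("payload starts at offset " + saveToTempFile(new byte[] {1, 2, 3}));
    }
}
```

Knowing the current offset at any moment is what makes it possible to record where each part of a file begins and how long it is, which is presumably why the real class holds on to the channel.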

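(2) In a Hadoop cluster, the master node (the NameNode) stores three types of metadata: the file and block namespace, the mapping from files to blocks, and the locations of each block's replicas. All of this metadata is kept in memory; the first two types are also persisted by recording every change in a log file on disk (the edit log).

Storing and managing the file system is the job of the FSNamesystem class. Let's look at its class comment: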

/***************************************************
 * FSNamesystem does the actual bookkeeping work for the
 * DataNode.
 *
 * It tracks several important tables.
 *
 * 1) valid fsname --> blocklist (kept on disk, logged) // file name to block list; persisted on disk and recorded in the edit log
 * 2) Set of all valid blocks (inverted #1) // the set of valid blocks, the inverse of #1
 * 3) block --> machinelist (kept in memory, rebuilt dynamically from reports) // block to DataNode list; memory only, rebuilt from DataNode block reports
 * 4) machine --> blocklist (inverted #2) // the blocks held by each DataNode, the inverse of the table above
 * 5) LRU cache of updated-heartbeat machines // LRU cache of DataNodes, ordered by their heartbeats
 ***************************************************/

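FSNamesystem has an FSDirectory member, which holds the mapping from file names to block lists. The class provides operations for updating the namespace, such as adding files, adding blocks, and creating directories.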
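
The tables from the class comment can be modeled with a few maps. The sketch below is a toy of my own (class, field, and method names are hypothetical, and blocks are plain long IDs); it only shows how a block report rebuilds the block to machine tables while the file to block table stays put.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NamespaceTablesSketch {
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>(); // table 1: persisted in real HDFS
    private final Map<Long, Set<String>> blockToNodes = new HashMap<>();  // table 3: memory only
    private final Map<String, Set<Long>> nodeToBlocks = new HashMap<>();  // table 4: inverse of table 3

    void addFile(String path, List<Long> blockIds) {
        fileToBlocks.put(path, new ArrayList<>(blockIds));
    }

    // A block report replaces everything previously known about one node.
    void blockReport(String node, Collection<Long> blockIds) {
        for (Long old : nodeToBlocks.getOrDefault(node, Collections.<Long>emptySet())) {
            Set<String> holders = blockToNodes.get(old);
            if (holders != null) {
                holders.remove(node);
            }
        }
        nodeToBlocks.put(node, new TreeSet<>(blockIds));
        for (Long id : blockIds) {
            blockToNodes.computeIfAbsent(id, k -> new TreeSet<>()).add(node);
        }
    }

    // For each block of the file, in order, the set of nodes holding a replica.
    List<Set<String>> locations(String path) {
        List<Set<String>> result = new ArrayList<>();
        for (Long id : fileToBlocks.getOrDefault(path, Collections.<Long>emptyList())) {
            result.add(new TreeSet<>(blockToNodes.getOrDefault(id, Collections.<String>emptySet())));
        }
        return result;
    }

    public static void main(String[] args) {
        NamespaceTablesSketch tables = new NamespaceTablesSketch();
        tables.addFile("/a", Arrays.asList(1L, 2L));
        tables.blockReport("dn1", Arrays.asList(1L, 2L));
        tables.blockReport("dn2", Arrays.asList(2L));
        System.out.println(tables.locations("/a"));
    }
}
```

If the process restarted, only fileToBlocks would need to come back from disk; the other two maps would fill in again as the nodes report.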

Below are the block-related methods:

@Override // FSNamesystemMBean
@Metric
public long getPendingReplicationBlocks() { // number of blocks whose replication is pending
  return blockManager.getPendingReplicationBlocksCount();
}
@Override // FSNamesystemMBean
@Metric
public long getUnderReplicatedBlocks() { // number of under-replicated blocks, i.e. blocks that still need more replicas
  return blockManager.getUnderReplicatedBlocksCount();
}
/** Returns number of blocks with corrupt replicas */
@Metric({"CorruptBlocks", "Number of blocks with corrupt replicas"})
public long getCorruptReplicaBlocks() { // number of blocks with corrupt replicas
  return blockManager.getCorruptReplicaBlocksCount();
}
@Override // FSNamesystemMBean
@Metric
public long getScheduledReplicationBlocks() { // number of block replications currently scheduled
  return blockManager.getScheduledReplicationBlocksCount();
}
@Override
@Metric
public long getPendingDeletionBlocks() { // number of blocks pending deletion
  return blockManager.getPendingDeletionBlocksCount();
}
@Metric
public long getExcessBlocks() { // number of excess blocks, i.e. replicas beyond the replication factor (not a quota)
  return blockManager.getExcessBlocksCount();
}
// HA-only metric
@Metric
public long getPostponedMisreplicatedBlocks() { // number of mis-replicated blocks whose handling has been postponed, HA only
  return blockManager.getPostponedMisreplicatedBlocksCount();
}
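
To make "under-replicated" and "excess" concrete, here is a toy count over a map from block ID to live replica count. It is a simplification of my own, not how BlockManager keeps its counters: there is a single target replication factor for every block, and the excess figure counts blocks rather than individual extra replicas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReplicationCountsSketch {
    // Counts blocks whose live replica count is below the target.
    static long underReplicated(Map<Long, Integer> liveReplicas, int target) {
        return liveReplicas.values().stream().filter(n -> n < target).count();
    }

    // Counts blocks whose live replica count is above the target.
    static long excess(Map<Long, Integer> liveReplicas, int target) {
        return liveReplicas.values().stream().filter(n -> n > target).count();
    }

    public static void main(String[] args) {
        Map<Long, Integer> live = new LinkedHashMap<>();
        live.put(1L, 1); // needs two more replicas
        live.put(2L, 3); // exactly at the target
        live.put(3L, 4); // one replica too many
        System.out.println("under-replicated: " + underReplicated(live, 3));
        System.out.println("excess: " + excess(live, 3));
    }
}
```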

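This blog targets Hadoop 2, which is somewhat more complex than version 1 (it adopts many currently popular design patterns and adds a new serialization framework, HA configuration, federation, the YARN framework, and a Maven-based module layout). Most source code analyses online target version 1. Since these posts are my own reading of the source, mistakes and inaccuracies are inevitable; corrections are welcome.

The previous two posts were mainly about the NameNode; now let's turn to the DataNode. Open IDEA and go to the DataNode class.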

First, the class comment (the lines marked "Note" are my own additions):

/**********************************************************
 * DataNode is a class (and program) that stores a set of
 * blocks for a DFS deployment. A single deployment can
 * have one or many DataNodes. Each DataNode communicates
 * regularly with a single NameNode. It also communicates
 * with client code and other DataNodes from time to time.
 *
 * DataNodes store a series of named blocks. The DataNode
 * allows client code to read these blocks, or to write new
 * block data. The DataNode may also, in response to instructions
 * from its NameNode, delete blocks or copy blocks to/from other
 * DataNodes.
 * Note: these instructions are delivered in the responses to heartbeats.
 *
 * The DataNode maintains just one critical table:
 * block-> stream of bytes (of BLOCK_SIZE or less)
 *
 * This info is stored on a local disk. The DataNode
 * reports the table's contents to the NameNode upon startup
 * and every so often afterwards.
 * Note: this is the third type of metadata mentioned in the previous post,
 * the one that is rebuilt dynamically from DataNode reports.
 *
 * DataNodes spend their lives in an endless loop of asking
 * the NameNode for something to do. A NameNode cannot connect
 * to a DataNode directly; a NameNode simply returns values from
 * functions invoked by a DataNode.
 *
 * DataNodes maintain an open server socket so that client code
 * or other DataNodes can read/write data. The host/port for
 * this server is reported to the NameNode, which then sends that
 * information to clients or other DataNodes that might be interested.
 *
 **********************************************************/
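
The "NameNode only answers, never calls" idea from the comment can be sketched with a single interface. Everything here is hypothetical (the interface, FakeNameNode, and the string commands are mine), and the loop runs a fixed number of heartbeats so that the example terminates, whereas a DataNode loops for as long as it lives.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeartbeatSketch {
    // Hypothetical protocol: the only way commands travel is as a return value.
    interface NameNodeProtocol {
        List<String> heartbeat(String nodeId);
    }

    // Queues commands per node and hands them out when that node next calls in.
    static class FakeNameNode implements NameNodeProtocol {
        private final Map<String, Deque<String>> pending = new HashMap<>();

        void schedule(String nodeId, String command) {
            pending.computeIfAbsent(nodeId, k -> new ArrayDeque<>()).add(command);
        }

        @Override
        public List<String> heartbeat(String nodeId) {
            Deque<String> queue = pending.remove(nodeId);
            if (queue == null) {
                return Collections.emptyList();
            }
            return new ArrayList<>(queue);
        }
    }

    // Sends a bounded number of heartbeats and returns the commands received, in order.
    static List<String> runDataNode(NameNodeProtocol nameNode, String nodeId, int beats) {
        List<String> executed = new ArrayList<>();
        for (int i = 0; i < beats; i++) {
            executed.addAll(nameNode.heartbeat(nodeId));
        }
        return executed;
    }

    public static void main(String[] args) {
        FakeNameNode nameNode = new FakeNameNode();
        nameNode.schedule("dn1", "DELETE blk_1");
        nameNode.schedule("dn1", "COPY blk_2 to dn2");
        System.out.println(runDataNode(nameNode, "dn1", 3));
    }
}
```

Note that FakeNameNode never holds a reference to a DataNode; a command for dn1 simply waits in the queue until dn1's next heartbeat arrives.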

Next, following the same approach as for the NameNode, go straight to the main method:

public static void main(String args[]) {
  if (DFSUtil.parseHelpArgument(args, DataNode.USAGE, System.out, true)) { // seen in NameNode already: handles the help argument; Hadoop 1 did not have this check
    System.exit(0);
  }
  secureMain(args, null);
}
public static void secureMain(String args[], SecureResources resources) {
  int errorCode = 0;
  try {
    StringUtils.startupShutdownMessage(DataNode.class, args, LOG); // log the startup and shutdown messages
    DataNode datanode = createDataNode(args, null, resources);
    // Notice that this follows the same coding pattern as NameNode.
    // Creating the DataNode calls instantiateDataNode, which initializes the configuration and sets up security.
    // Hadoop 1 had the line
    //   String[] dataDirs = conf.getStrings(DATA_DIR_KEY);
    // whereas Hadoop 2 has
    //   Collection<StorageLocation> dataLocations = getStorageLocations(conf);
    // Hadoop 2 wraps the value in a type, which looks more disciplined. This is Java, after all:
    // code hardly feels respectable without an extra interface or wrapper. Either way, it just fetches the data storage directories.
    if (datanode != null) {
      datanode.join();
      // Remember the two RPC servers in NameNode? The join method is analyzed in detail later.
    } else {
      errorCode = 1;
    }
  } catch (Throwable e) {
    LOG.fatal("Exception in secureMain", e);
    terminate(1, e);
  } finally {
    // We need to terminate the process here because either shutdown was called
    // or some disk related conditions like volumes tolerated or volumes required
    // condition was not met. Also, In secure mode, control will go to Jsvc
    // and Datanode process hangs if it does not exit.
    LOG.warn("Exiting Datanode");
    terminate(errorCode);
  }
}
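
The exit-code logic of secureMain boils down to three outcomes, which the following sketch isolates. It is my own reduction, not the real method: the factory stands in for createDataNode, a Runnable stands in for the DataNode and its join, and the code is returned instead of being passed to terminate.

```java
import java.util.function.Supplier;

class ExitCodeSketch {
    // Returns the exit code that the three paths in the listing would lead to.
    static int exitCodeFor(Supplier<Runnable> factory) {
        int errorCode = 0;
        try {
            Runnable node = factory.get();
            if (node != null) {
                node.run(); // stands in for datanode.join()
            } else {
                errorCode = 1; // creation returned nothing
            }
        } catch (Throwable t) {
            return 1; // stands in for terminate(1, e)
        }
        return errorCode; // stands in for terminate(errorCode) in the finally block
    }

    public static void main(String[] args) {
        System.out.println(exitCodeFor(() -> () -> { }));
        System.out.println(exitCodeFor(() -> null));
        System.out.println(exitCodeFor(() -> { throw new IllegalStateException("boom"); }));
    }
}
```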