Note: this article is a source-code walkthrough based on Hadoop 2.7.
1. The NameNode class comment
Below is the class comment, with my brief annotations interleaved:
/**********************************************************
* NameNode serves as both directory namespace manager and
* "inode table" for the Hadoop DFS. There is a single NameNode
* running in any DFS deployment. (Well, except when there
* is a second backup/failover NameNode, or when using federated NameNodes.)
*
* Annotation 1:
* The NameNode manages the directory namespace (the directory tree) and the "inode table".
* There is only one NameNode in a cluster (except for a Standby NameNode in an HA setup, or when using federated NameNodes).
*
* The NameNode controls two critical tables:
* 1) filename->blocksequence (namespace)
* 2) block->machinelist ("inodes")
*
* Annotation 2:
* The NameNode controls two critical tables:
* 1) filename -> block mapping (the namespace)
* 2) block -> DataNode host mapping (the "inodes")
*
* The first table is stored on disk and is very precious.
* The second table is rebuilt every time the NameNode comes up.
*
* Annotation 3:
* The first table is stored on disk and is very precious. --> The file-to-block relationship basically never changes; e.g. file1 -> block1/block2/block3 stays fixed.
* The second table is rebuilt every time the NameNode starts. --> It actually lives in memory,
* e.g. block1 -> datanode1
*      block2 -> datanode2
*      block3 -> datanode3
* After a rebalance, block1 -> datanode1 may become block1 -> datanode2: the location where a block is stored can change, which is why the block-to-host mapping is kept in memory rather than on disk.
*
* 'NameNode' refers to both this class as well as the 'NameNode server'.
* The 'FSNamesystem' class actually performs most of the filesystem
* management. The majority of the 'NameNode' class itself is concerned
* with exposing the IPC interface and the HTTP server to the outside world,
* plus some configuration management.
**********************************************************/
The NameNode service is built from three important classes:
(1) The NameNode class -> manages configuration parameters, e.g. hdfs-site.xml, core-site.xml
(2) The NameNode server:
2.1 IPC interface:
NameNodeRpcServer: exposes the familiar RPC ports such as 8020 and 9000
2.2 HTTP server:
NameNodeHttpServer: exposes a port such as 50070 (9870 in CDH); its Web UI lets you see how HDFS is running
(3) FSNamesystem:
manages the filesystem, i.e. the HDFS metadata
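To make the two tables from the class comment concrete, here is a minimal toy sketch (my own illustration, not Hadoop code): table 1 would be persisted in the fsimage, while table 2 only ever exists in memory and is rebuilt from DataNode block reports.
import java.util.*;

// Toy model of the NameNode's two tables (illustration only).
public class NameNodeTablesSketch {
    // Table 1: filename -> ordered block list (the namespace); persisted on disk in the fsimage.
    private final Map<String, List<String>> namespace = new HashMap<>();
    // Table 2: block -> DataNode list ("inodes"); never persisted, rebuilt from block reports.
    private final Map<String, List<String>> blockLocations = new HashMap<>();

    void addFile(String path, List<String> blocks) {
        namespace.put(path, blocks);
    }

    // Simulates a DataNode block report: the only way table 2 gets (re)built.
    void reportBlock(String datanode, String block) {
        blockLocations.computeIfAbsent(block, k -> new ArrayList<>()).add(datanode);
    }

    public static void main(String[] args) {
        NameNodeTablesSketch nn = new NameNodeTablesSketch();
        nn.addFile("/file1", Arrays.asList("block1", "block2", "block3"));
        nn.reportBlock("datanode1", "block1"); // after a rebalance this might become datanode2
        nn.reportBlock("datanode2", "block2");
        nn.reportBlock("datanode3", "block3");
        System.out.println(nn.namespace);      // {/file1=[block1, block2, block3]}
        System.out.println(nn.blockLocations); // {block1=[datanode1], ...}
    }
}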
2. Starting from NameNode's main method
public static void main(String argv[]) throws Exception {
//parse the arguments
if (DFSUtil.parseHelpArgument(argv, NameNode.USAGE, System.out, true)) {
System.exit(0);
}
try {
StringUtils.startupShutdownMessage(NameNode.class, argv, LOG);
// importance: 2.1 the core code that creates the NameNode
NameNode namenode = createNameNode(argv, null);
if (namenode != null) {
namenode.join();
}
} catch (Throwable e) {
LOG.error("Failed to start namenode.", e);
terminate(1, e);
}
}
2.1 Tracing the createNameNode() code: it ends up constructing a new NameNode object
//core code that creates the NameNode
public static NameNode createNameNode(String argv[], Configuration conf)
throws IOException {
LOG.info("createNameNode " + Arrays.asList(argv));
if (conf == null)
conf = new HdfsConfiguration();
/**
 * argv: the arguments passed in, e.g.:
 * namenode start command: hadoop-daemon.sh start namenode
 * namenode metadata format command: hdfs namenode -format
 */
//parse the arguments into a StartupOption object
StartupOption startOpt = parseArguments(argv);
if (startOpt == null) {
printUsage(System.err);
return null;
}
setStartupOption(conf, startOpt);
switch (startOpt) {
//hdfs namenode -format
case FORMAT: {
boolean aborted = format(conf, startOpt.getForceFormat(),
startOpt.getInteractiveFormat());
terminate(aborted ? 1 : 0);
return null; // avoid javac warning
}
case GENCLUSTERID: {
System.err.println("Generating new cluster id:");
System.out.println(NNStorage.newClusterID());
terminate(0);
return null;
}
case FINALIZE: {
System.err.println("Use of the argument '" + StartupOption.FINALIZE +
"' is no longer supported. To finalize an upgrade, start the NN " +
" and then run `hdfs dfsadmin -finalizeUpgrade'");
terminate(1);
return null; // avoid javac warning
}
case ROLLBACK: {
boolean aborted = doRollback(conf, true);
terminate(aborted ? 1 : 0);
return null; // avoid warning
}
case BOOTSTRAPSTANDBY: {
String toolArgs[] = Arrays.copyOfRange(argv, 1, argv.length);
int rc = BootstrapStandby.run(toolArgs, conf);
terminate(rc);
return null; // avoid warning
}
case INITIALIZESHAREDEDITS: {
boolean aborted = initializeSharedEdits(conf,
startOpt.getForceFormat(),
startOpt.getInteractiveFormat());
terminate(aborted ? 1 : 0);
return null; // avoid warning
}
case BACKUP:
case CHECKPOINT: {
NamenodeRole role = startOpt.toNodeRole();
DefaultMetricsSystem.initialize(role.toString().replace(" ", ""));
return new BackupNode(conf, role);
}
case RECOVER: {
NameNode.doRecovery(startOpt, conf);
return null;
}
case METADATAVERSION: {
printMetadataVersion(conf);
terminate(0);
return null; // avoid javac warning
}
case UPGRADEONLY: {
DefaultMetricsSystem.initialize("NameNode");
new NameNode(conf);
terminate(0);
return null;
}
default: {
// start the namenode
DefaultMetricsSystem.initialize("NameNode");
// 2.2 constructs a new NameNode object
return new NameNode(conf);
}
}
}
2.2 Tracing the new NameNode constructor
protected NameNode(Configuration conf, NamenodeRole role)
throws IOException {
// ...
//2.3 perform namenode initialization
initialize(conf);
// ...
}
2.3 Tracing the initialize implementation
- The NameNode runs initialize immediately after construction.
- It creates and starts an HttpServer via the builder pattern, after which the HDFS Web UI is reachable at ip:50070 (or ip:9870). Hadoop wraps its own HttpServer and registers many internal servlets that provide rich functionality; browsing all directories and handing a merged FSImage file to the Active NameNode are both done through servlet calls.
- It loads the metadata from disk (loads the FSImage, merges in the edit log, and creates a new edit log).
- Again via the builder pattern, it creates a NameNodeRpcServer object which, by implementing the protocol interfaces, provides the NameNode's RPC functionality, e.g. a client's mkdir, delete, and so on.
- It starts some common services, including a resource check that verifies there is enough disk space to store the metadata.
- The common services also run the necessary safe-mode checks, i.e. whether safe mode can be exited.
protected void initialize(Configuration conf) throws IOException {
// ...
if (NamenodeRole.NAMENODE == role) {
// TODO importance: start the HttpServer; traced in 2.3.1 startHttpServer below
startHttpServer(conf);
}
this.spanReceiverHost = SpanReceiverHost.getInstance(conf);
// TODO importance: load the metadata; traced in 2.3.2 loadNamesystem below
loadNamesystem(conf);
// TODO importance: create an RPC server; traced in 2.3.3 createRpcServer below
rpcServer = createRpcServer(conf);
if (clientNamenodeAddress == null) {
// This is expected for MiniDFSCluster. Set it now using
// the RPC server's bind address.
clientNamenodeAddress =
NetUtils.getHostPortString(rpcServer.getRpcAddress());
LOG.info("Clients are to use " + clientNamenodeAddress + " to access"
+ " this namenode/service.");
}
if (NamenodeRole.NAMENODE == role) {
httpServer.setNameNodeAddress(getNameNodeAddress());
httpServer.setFSImage(getFSImage());
}
pauseMonitor = new JvmPauseMonitor(conf);
pauseMonitor.start();
metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);
/**
 * TODO importance:
 * Start some common services:
 * (1) resource check: verify there is enough disk to store the metadata
 * (2) necessary safe-mode checks: decide whether safe mode can be exited
 */
startCommonServices(conf);
}
2.3.1 Tracing the startHttpServer implementation
private void startHttpServer(final Configuration conf) throws IOException {
// 2.3.1.1 getHttpServerBindAddress sets the hostname and port
httpServer = new NameNodeHttpServer(conf, this, getHttpServerBindAddress(conf));
// 2.3.1.2 start the HttpServer; the Web UI is then reachable at ip:50070
httpServer.start();
httpServer.setStartupProgress(startupProgress);
}
//2.3.1.1 Drilling down through getHttpServerBindAddress, we eventually reach getHttpAddress
public static InetSocketAddress getHttpAddress(Configuration conf) {
return NetUtils.createSocketAddr(
// public static final String DFS_NAMENODE_HTTP_ADDRESS_KEY = "dfs.namenode.http-address";
// public static final String DFS_NAMENODE_HTTP_ADDRESS_DEFAULT = "0.0.0.0:" + DFS_NAMENODE_HTTP_PORT_DEFAULT;
// public static final int DFS_NAMENODE_HTTP_PORT_DEFAULT = 50070;
// explanation: 0.0.0.0 means bind on all local interfaces; the default port is 50070
conf.getTrimmed(DFS_NAMENODE_HTTP_ADDRESS_KEY, DFS_NAMENODE_HTTP_ADDRESS_DEFAULT));
}
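To see the key-plus-default fallback in action, here is a small hedged demo (assumes the Hadoop client libraries on the classpath; the hostname is a placeholder, and in practice you would set dfs.namenode.http-address in hdfs-site.xml rather than in code):
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.net.NetUtils;

public class HttpAddressDemo {
    public static void main(String[] args) {
        Configuration conf = new HdfsConfiguration();
        // Same key getHttpAddress reads; comment this line out and the default 0.0.0.0:50070 applies.
        conf.set("dfs.namenode.http-address", "namenode.example.com:50070"); // placeholder host
        InetSocketAddress addr = NetUtils.createSocketAddr(
            conf.getTrimmed("dfs.namenode.http-address", "0.0.0.0:50070"));
        System.out.println(addr.getHostName() + ":" + addr.getPort());
    }
}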
// 2.3.1.2 Into the httpServer.start() method
void start() throws IOException {
// ...
// HttpServer2 is Hadoop's own HTTP server wrapper; Hadoop RPC is likewise Hadoop's own implementation
HttpServer2.Builder builder = DFSUtil.httpServerTemplateForNNAndJN(conf,
httpAddr, httpsAddr, "hdfs",
DFSConfigKeys.DFS_NAMENODE_KERBEROS_INTERNAL_SPNEGO_PRINCIPAL_KEY,
DFSConfigKeys.DFS_NAMENODE_KEYTAB_FILE_KEY);
// design pattern: the builder pattern constructs an HttpServer2 object
httpServer = builder.build();
//...
// 2.3.1.3 register the servlets that enrich the Web UI
setupServlets(httpServer, conf);
// actually starts the http server, reachable at ip:50070
httpServer.start();
int connIdx = 0;
if (policy.isHttpEnabled()) {
httpAddress = httpServer.getConnectorAddress(connIdx++);
conf.set(DFSConfigKeys.DFS_NAMENODE_HTTP_ADDRESS_KEY,
NetUtils.getHostPortString(httpAddress));
}
if (policy.isHttpsEnabled()) {
httpsAddress = httpServer.getConnectorAddress(connIdx);
conf.set(DFSConfigKeys.DFS_NAMENODE_HTTPS_ADDRESS_KEY,
NetUtils.getHostPortString(httpsAddress));
}
}
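The HttpServer2.Builder above (and the RPC.Builder we will meet in 2.3.3) is the classic builder pattern. A minimal sketch of its shape, with hypothetical names purely for illustration:
// Illustration only: the builder pattern behind HttpServer2.Builder and RPC.Builder (hypothetical names).
public class ServerSketch {
    private final String bindAddress;
    private final int port;
    private final int numHandlers;

    private ServerSketch(Builder b) {
        this.bindAddress = b.bindAddress;
        this.port = b.port;
        this.numHandlers = b.numHandlers;
    }

    static class Builder {
        private String bindAddress = "0.0.0.0";
        private int port = 8020;
        private int numHandlers = 10;

        Builder setBindAddress(String addr) { this.bindAddress = addr; return this; }
        Builder setPort(int p)              { this.port = p;           return this; }
        Builder setNumHandlers(int n)       { this.numHandlers = n;    return this; }
        ServerSketch build()                { return new ServerSketch(this); }
    }

    public static void main(String[] args) {
        // Chained setters accumulate configuration; build() produces the finished server object.
        ServerSketch server = new ServerSketch.Builder()
            .setBindAddress("0.0.0.0").setPort(9000).setNumHandlers(32).build();
        System.out.println(server.bindAddress + ":" + server.port + " handlers=" + server.numHandlers);
    }
}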
// 2.3.1.3 Into the setupServlets method
private static void setupServlets(HttpServer2 httpServer, Configuration conf) {
httpServer.addInternalServlet("startupProgress",
StartupProgressServlet.PATH_SPEC, StartupProgressServlet.class);
httpServer.addInternalServlet("getDelegationToken",
GetDelegationTokenServlet.PATH_SPEC,
GetDelegationTokenServlet.class, true);
httpServer.addInternalServlet("renewDelegationToken",
RenewDelegationTokenServlet.PATH_SPEC,
RenewDelegationTokenServlet.class, true);
httpServer.addInternalServlet("cancelDelegationToken",
CancelDelegationTokenServlet.PATH_SPEC,
CancelDelegationTokenServlet.class, true);
httpServer.addInternalServlet("fsck", "/fsck", FsckServlet.class,
true);
/* Register an internal servlet that accepts metadata-upload requests:
 * the Secondary NameNode and the Standby NameNode both send an HTTP request, handled by this servlet,
 * that replaces the Active NameNode's FSImage file with the freshly merged FSImage file.
 */
httpServer.addInternalServlet("imagetransfer", ImageServlet.PATH_SPEC,
ImageServlet.class, true);
/**
 * The Web UI can browse all directories by path; this, too, goes through a servlet:
 * http://ip:50070/listPaths/?path=/
 */
httpServer.addInternalServlet("listPaths", "/listPaths/*",
ListPathsServlet.class, false);
httpServer.addInternalServlet("data", "/data/*",
FileDataServlet.class, false);
httpServer.addInternalServlet("checksum", "/fileChecksum/*",
FileChecksumServlets.RedirectServlet.class, false);
httpServer.addInternalServlet("contentSummary", "/contentSummary/*",
ContentSummaryServlet.class, false);
}
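As a quick illustration of how these servlets are consumed, the sketch below fetches the listPaths servlet with plain JDK HTTP (assumes an unsecured cluster, a Hadoop 2.x port, and a placeholder hostname; ListPathsServlet responds with an XML directory listing):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ListPathsDemo {
    public static void main(String[] args) throws Exception {
        // Same URL pattern as the comment above: http://ip:50070/listPaths/?path=/
        URL url = new URL("http://namenode.example.com:50070/listPaths/?path=/");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // XML listing of the root directory
            }
        }
    }
}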
2.3.2 Tracing the loadNamesystem implementation
// 2.3.2.1 load the metadata
protected void loadNamesystem(Configuration conf) throws IOException {
//2.3.2.2 load the metadata from disk
this.namesystem = FSNamesystem.loadFromDisk(conf);
}
//2.3.2.2 Load the metadata from disk (load the FSImage, merge in the edit log, and create a new edit log)
static FSNamesystem loadFromDisk(Configuration conf) throws IOException {
checkConfiguration(conf);
FSImage fsImage = new FSImage(conf,
FSNamesystem.getNamespaceDirs(conf),
FSNamesystem.getNamespaceEditsDirs(conf));
// one of the core classes: FSNamesystem manages the HDFS metadata
FSNamesystem namesystem = new FSNamesystem(conf, fsImage, false);
StartupOption startOpt = NameNode.getStartupOption(conf);
if (startOpt == StartupOption.RECOVER) {
namesystem.setSafeMode(SafeModeAction.SAFEMODE_ENTER);
}
long loadStart = monotonicNow();
try {
// 2.3.2.3 load the FSImage
namesystem.loadFSImage(startOpt);
} catch (IOException ioe) {
LOG.warn("Encountered exception loading fsimage", ioe);
fsImage.close();
throw ioe;
}
long timeTakenToLoadFSImage = monotonicNow() - loadStart;
LOG.info("Finished loading FSImage in " + timeTakenToLoadFSImage + " msecs");
NameNodeMetrics nnMetrics = NameNode.getNameNodeMetrics();
if (nnMetrics != null) {
nnMetrics.setFsImageLoadTime((int) timeTakenToLoadFSImage);
}
return namesystem;
}
// 2.3.2.3 Into namesystem.loadFSImage(startOpt)
private void loadFSImage(StartupOption startOpt) throws IOException {
final FSImage fsImage = getFSImage();
// format before starting up if requested
boolean success = false;
writeLock();
try {
// We shouldn't be calling saveNamespace if we've come up in standby state.
MetaRecoveryContext recovery = startOpt.createRecoveryContext();
// (1) at namenode startup, merge the fsImage with the edit log, producing a new fsImage
final boolean staleImage = fsImage.recoverTransitionRead(startOpt, this, recovery);
// ... non-core code omitted (this is where needToSave is derived from staleImage)
if (needToSave) {
// (2) write the newly merged fsImage file out to disk
fsImage.saveNamespace(this);
} else {
updateStorageVersionForRollingUpgrade(fsImage.getLayoutVersion(),
startOpt);
// No need to save, so mark the phase done.
StartupProgress prog = NameNode.getStartupProgress();
prog.beginPhase(Phase.SAVING_CHECKPOINT);
prog.endPhase(Phase.SAVING_CHECKPOINT);
}
// This will start a new log segment and write to the seen_txid file, so
// we shouldn't do it when coming up in standby state
if (!haEnabled || (haEnabled && startOpt == StartupOption.UPGRADE)
|| (haEnabled && startOpt == StartupOption.UPGRADEONLY)) {
// (3) open a new edit log and keep writing journal entries to it
fsImage.openEditLogForWrite();
}
success = true;
} finally {
if (!success) {
fsImage.close();
}
writeUnlock();
}
imageLoadComplete();
}
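The essence of steps (1)-(3) is "replay the edit log over the fsimage, write out the merged result, open a fresh log". The toy simulation below captures that idea with a plain map standing in for the namespace (my own sketch, not Hadoop code):
import java.util.*;

// Toy simulation of the startup checkpoint: fsimage + edit log -> new fsimage.
public class CheckpointSketch {
    public static void main(String[] args) {
        // "fsimage": the namespace as of the last checkpoint.
        Map<String, String> namespace = new HashMap<>();
        namespace.put("/a", "block1");

        // "edit log": operations recorded since that checkpoint.
        List<String[]> editLog = Arrays.asList(
            new String[]{"ADD", "/b/file", "block2"},
            new String[]{"DELETE", "/a", ""});

        // (1) replay every edit, in order, onto the old image.
        for (String[] op : editLog) {
            if ("ADD".equals(op[0]))    namespace.put(op[1], op[2]);
            if ("DELETE".equals(op[0])) namespace.remove(op[1]);
        }
        // (2) the merged result would now be written out as the new fsimage, and
        // (3) a fresh, empty edit-log segment opened for subsequent operations.
        System.out.println("new fsimage: " + namespace); // {/b/file=block2}
    }
}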
2.3.3 The createRpcServer method returns a NameNodeRpcServer instance
// 2.3.3.1 creates a NameNodeRpcServer object
protected NameNodeRpcServer createRpcServer(Configuration conf)
throws IOException {
// 2.3.3.2 the NameNode RPC service is built by the NameNodeRpcServer class
return new NameNodeRpcServer(conf, this);
}
//2.3.3.2 Into the NameNodeRpcServer constructor, which again uses the builder pattern to construct its RPC server
this.serviceRpcServer = new RPC.Builder(conf)
.setProtocol(
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolPB.class)
.setInstance(clientNNPbService)
.setBindAddress(bindHost)
.setPort(serviceRpcAddr.getPort()).setNumHandlers(serviceHandlerCount)
.setVerbose(false)
.setSecretManager(namesystem.getDelegationTokenSecretManager())
.build();
Explanation: the NameNodeRpcServer is a member field of NameNode
//in the NameNode class:
private NameNodeRpcServer rpcServer;
//NameNodeRpcServer implements the NamenodeProtocols interface
class NameNodeRpcServer implements NamenodeProtocols
//the NamenodeProtocols interface aggregates the protocol interfaces, each of which defines various methods
/** The full set of RPC methods implemented by the Namenode. */
@InterfaceAudience.Private
public interface NamenodeProtocols
extends ClientProtocol, // the client protocol: defines many methods, such as mkdir, delete, and the like
DatanodeProtocol,
NamenodeProtocol,
RefreshAuthorizationPolicyProtocol,
RefreshUserMappingsProtocol,
RefreshCallQueueProtocol,
GenericRefreshProtocol,
GetUserMappingsProtocol,
HAServiceProtocol,
TraceAdminProtocol {
}
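From the client side, these protocols are exercised indirectly through the FileSystem API: a mkdirs call travels DFSClient -> ClientProtocol (RPC to port 8020/9000) -> NameNodeRpcServer.mkdirs. A hedged sketch, assuming the Hadoop client on the classpath and a reachable cluster at a placeholder address:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirViaRpcDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; the real value comes from fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);
        boolean created = fs.mkdirs(new Path("/tmp/rpc-demo"));
        System.out.println("mkdirs -> " + created); // served by ClientProtocol.mkdirs on the NameNode
        fs.close();
    }
}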
2.3.4 The startCommonServices method
/**
* Start services common to both active and standby states
*/
void startCommonServices(Configuration conf, HAContext haContext) throws IOException {
this.registerMBean(); // register the MBean for the FSNamesystemState
writeLock();
this.haContext = haContext;
try {
//TODO (1) trace the NameNodeResourceChecker constructor
// create a resource checker
nnResourceChecker = new NameNodeResourceChecker(conf);
//TODO (2) into the checkAvailableResources() method
//check that there is enough disk to store the metadata: there must be enough space for the FSImage + edit log files
checkAvailableResources();
assert safeMode != null && !isPopulatingReplQueues();
StartupProgress prog = NameNode.getStartupProgress();
prog.beginPhase(Phase.SAFEMODE);
prog.setTotal(Phase.SAFEMODE, STEP_AWAITING_REPORTED_BLOCKS,
getCompleteBlocksTotal());
// TODO (3) into HDFS safe mode
setBlockTotal();
blockManager.activate(conf);
} finally {
writeUnlock();
}
registerMXBean();
DefaultMetricsSystem.instance().register(this);
if (inodeAttributeProvider != null) {
inodeAttributeProvider.start();
dir.setINodeAttributeProvider(inodeAttributeProvider);
}
snapshotManager.registerMXBean();
}
(1) Disk-space check on the metadata paths: each must have no less than the default 100 MB free. If it drops below 100 MB the NameNode still starts normally, but it enters safe mode.
(2) Necessary safe-mode checks, i.e. whether safe mode can be exited. Focus on the three scenarios that trigger safe mode:
- at cluster startup, the number of blocks reported by the DataNodes is below the block threshold ratio (default 0.999)
- the number of live DataNodes in the cluster is below the DataNode threshold (default 0)
- resources are insufficient: the space for metadata (FSImage + edit log) falls below 100 MB
// (1) Tracing the NameNodeResourceChecker constructor
public NameNodeResourceChecker(Configuration conf) throws IOException {
this.conf = conf;
volumes = new HashMap<String, CheckedVolume>();
//public static final String DFS_NAMENODE_DU_RESERVED_KEY = "dfs.namenode.resource.du.reserved";
// the default is 100 MB: public static final long DFS_NAMENODE_DU_RESERVED_DEFAULT = 1024 * 1024 * 100; // 100 MB
// in other words, the directories holding the edits and FSImage files must have no less than 100 MB free
duReserved = conf.getLong(DFSConfigKeys.DFS_NAMENODE_DU_RESERVED_KEY,
DFSConfigKeys.DFS_NAMENODE_DU_RESERVED_DEFAULT);
// ...
//TODO add the disk paths to monitor; the paths are read from hdfs-site.xml and core-site.xml
for (URI editsDirToCheck : localEditDirs) {
addDirToCheck(editsDirToCheck,
FSNamesystem.getRequiredNamespaceEditsDirs(conf).contains(
editsDirToCheck));
}
// All extra checked volumes are marked "required"
for (URI extraDirToCheck : extraCheckedVolumes) {
addDirToCheck(extraDirToCheck, true);
}
minimumRedundantVolumes = conf.getInt(
DFSConfigKeys.DFS_NAMENODE_CHECKED_VOLUMES_MINIMUM_KEY,
DFSConfigKeys.DFS_NAMENODE_CHECKED_VOLUMES_MINIMUM_DEFAULT);
}
// (1.1) add a directory to be checked
private void addDirToCheck(URI directoryToCheck, boolean required)
throws IOException {
// each path becomes a java.io.File
File dir = new File(directoryToCheck.getPath());
if (!dir.exists()) {
throw new IOException("Missing directory "+dir.getAbsolutePath());
}
// each directory is a CheckedVolume
CheckedVolume newVolume = new CheckedVolume(dir, required);
CheckedVolume volume = volumes.get(newVolume.getVolume());
if (volume == null || !volume.isRequired()) {
//volumes is a map: private Map<String, CheckedVolume> volumes;
//volumes holds multiple directories
volumes.put(newVolume.getVolume(), newVolume);
}
}
// (2) Into the checkAvailableResources() method
// Ultimately it hands over volumes, a map of the directories to check, read from hdfs-site.xml and core-site.xml.
// Each directory is checked for sufficient resources, i.e. whether its free space has dropped below the default 100 MB.
public boolean hasAvailableDiskSpace() {
return NameNodeResourcePolicy.areResourcesAvailable(volumes.values(),
minimumRedundantVolumes);
}
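Stripped of the DF/CheckedVolume machinery, the check amounts to "is the usable space on each metadata volume above 100 MB". A minimal standalone sketch (the /tmp path is a placeholder; the real checker reads its dirs from dfs.namenode.name.dir and the edits dirs):
import java.io.File;

// Illustration only: the essence of hasAvailableDiskSpace() for a single volume.
public class DiskCheckSketch {
    static final long DU_RESERVED = 100L * 1024 * 1024; // 100 MB, the default shown above

    public static void main(String[] args) {
        File dir = new File("/tmp"); // placeholder for a metadata directory
        long free = dir.getUsableSpace(); // bytes available on that volume
        System.out.println(dir + " ok=" + (free > DU_RESERVED));
    }
}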
// (3) Into HDFS safe mode
//3.1 setBlockTotal
public void setBlockTotal() {
// safeMode is volatile, and may be set to null at any time
SafeModeInfo safeMode = this.safeMode;
if (safeMode == null)
return;
//getCompleteBlocksTotal returns the number of all healthy (COMPLETE) blocks
safeMode.setBlockTotal((int)getCompleteBlocksTotal());
}
//3.2 getCompleteBlocksTotal
/**
 * TODO the two states a block can be in within HDFS:
 * (1) COMPLETE: a normal, usable block
 * (2) UNDER_CONSTRUCTION: a block still being written
 */
//returns the number of usable blocks, i.e. those already fully written, in the COMPLETE state
private long getCompleteBlocksTotal() {
long numUCBlocks = 0;
readLock();
//get the number of blocks under construction
numUCBlocks = leaseManager.getNumUnderConstructionBlocks();
try {
// all blocks - under-construction blocks = COMPLETE blocks
return getBlocksTotal() - numUCBlocks;
} finally {
readUnlock();
}
}
// 3.3 Back up one level to setBlockTotal
private synchronized void setBlockTotal(int total) { // suppose the block total is 1000
this.blockTotal = total;
// TODO compute the threshold (threshold = 0.999)
// 1000 * 0.999 = 999, i.e. 99.9% of blocks must be available before safe mode can be exited
this.blockThreshold = (int) (blockTotal * threshold);
this.blockReplQueueThreshold =
(int) (blockTotal * replQueueThreshold);
if (haEnabled) {
// After we initialize the block count, any further namespace
// modifications done while in safe mode need to keep track
// of the number of total blocks in the system.
this.shouldIncrementallyTrackBlocks = true;
}
if(blockSafe < 0)
this.blockSafe = 0;
/*
 * TODO safe-mode check; safe mode is entered in any of the following cases:
 * (1) threshold != 0 && blockSafe < blockThreshold
 *     Suppose the cluster had 1000 COMPLETE blocks before its last shutdown; the default threshold is 0.999,
 *     so blockThreshold = 1000 * 0.999 = 999.
 *     At startup, if the blockSafe count reported by the DataNodes is below 999, the NameNode enters safe mode.
 *     blockSafe starts at 0; DataNodes send block reports (block info -> namenode), and blockSafe++ for every reported block.
 * (2) the number of live DataNodes is below the DataNode threshold
 *     The default threshold is 0 and can be raised manually; e.g. with 1000 DataNodes and the threshold set to 900,
 *     safe mode is entered when fewer than 900 DataNodes are alive.
 * (3) resources are insufficient, i.e. !nameNodeHasResourcesAvailable
 *     If the space available for NameNode metadata (FSImage + edit log) falls below 100 MB, enter safe mode.
 */
checkMode();
}
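To tie the three conditions together, here is a condensed, illustrative restatement of the checkMode() logic as a standalone method (a sketch only, not the actual FSNamesystem code; names and defaults mirror the comments above):
// Illustration only: the three safe-mode triggers, condensed into a single check.
public class SafeModeCheckSketch {
    static boolean needEnterSafeMode(long blockSafe, long blockTotal, double threshold,
                                     int liveDatanodes, int datanodeThreshold,
                                     boolean hasResourcesAvailable) {
        long blockThreshold = (long) (blockTotal * threshold); // e.g. 1000 * 0.999 = 999
        return (threshold != 0 && blockSafe < blockThreshold)                // (1) too few blocks reported
            || (datanodeThreshold != 0 && liveDatanodes < datanodeThreshold) // (2) too few live DataNodes
            || !hasResourcesAvailable;                                       // (3) < 100 MB left for metadata
    }

    public static void main(String[] args) {
        // 998 of 1000 blocks reported: 998 < 999, so the NameNode stays in safe mode.
        System.out.println(needEnterSafeMode(998, 1000, 0.999, 3, 0, true)); // prints true
    }
}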