The previous post covered NodeManager initialization; this one walks through the code that starts its services:
@Override
protected void serviceStart() throws Exception {
  try {
    doSecureLogin();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed NodeManager login", e);
  }
  super.serviceStart();
}
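It looks trivial, but what actually happens is that every service added to the service list during initialization is taken out and started in turn, each going through its own serviceStart.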
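The start-in-order behaviour can be sketched without Hadoop at all. The class below is a simplified stand-in, not Hadoop's real composite service: `MiniCompositeService`, `MiniService`, and `named` are hypothetical names made up for this illustration. It only shows that children are started in the order they were added, which is why the order of the addService calls matters later in this post.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a composite service; not Hadoop's actual class. */
class MiniCompositeService {
    /** Hypothetical minimal service contract used only in this sketch. */
    interface MiniService {
        String name();
        void start();
    }

    private final List<MiniService> services = new ArrayList<>();
    private final List<String> startLog = new ArrayList<>();

    /** Registers a child; registration order is also the start order. */
    void addService(MiniService s) {
        services.add(s);
    }

    /** Starts every registered child in the order it was added. */
    void start() {
        for (MiniService s : services) {
            s.start();
            startLog.add(s.name());
        }
    }

    /** Names of the children in the order they were started. */
    List<String> startLog() {
        return startLog;
    }

    /** Builds a child whose start() does nothing. */
    static MiniService named(final String name) {
        return new MiniService() {
            @Override public String name() { return name; }
            @Override public void start() { /* no-op in this sketch */ }
        };
    }

    public static void main(String[] args) {
        MiniCompositeService nm = new MiniCompositeService();
        nm.addService(named("DeletionService"));
        nm.addService(named("ContainerManager"));
        nm.addService(named("NodeStatusUpdater"));
        nm.start();
        // prints [DeletionService, ContainerManager, NodeStatusUpdater]
        System.out.println(nm.startLog());
    }
}
```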
The first one is the DeletionService:
DeletionService del = createDeletionService(exec);
addService(del);
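This is the scheduled cleanup service. It does not actually define a serviceStart, because a thread pool for scheduled work is already created during its initialization: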
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ThreadFactory tf = new ThreadFactoryBuilder().setNameFormat("DeletionService #%d").build();
  if (conf != null) {
    sched = new ScheduledThreadPoolExecutor(conf.getInt(YarnConfiguration.NM_DELETE_THREAD_COUNT,
        YarnConfiguration.DEFAULT_NM_DELETE_THREAD_COUNT), tf);
    debugDelay = conf.getInt(YarnConfiguration.DEBUG_NM_DELETE_DELAY_SEC, 0);
  } else {
    sched = new ScheduledThreadPoolExecutor(YarnConfiguration.DEFAULT_NM_DELETE_THREAD_COUNT, tf);
  }
  sched.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);
  sched.setKeepAliveTime(60L, SECONDS);
  if (stateStore.canRecover()) {
    recover(stateStore.loadDeletionServiceState());
  }
  super.serviceInit(conf);
}
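To make the role of that thread pool concrete, here is a minimal sketch using only the JDK. It assumes a deletion is simply a task scheduled after a delay, which is how the pool and debugDelay above appear to be used; `DelayedDeleteSketch`, `delete`, `stop`, and `demo` are hypothetical names, and a hand-written ThreadFactory replaces Guava's ThreadFactoryBuilder so the example has no dependencies.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of delayed deletion on a scheduled pool; not Hadoop code. */
class DelayedDeleteSketch {
    private final ScheduledThreadPoolExecutor sched;
    private final long delaySeconds;

    DelayedDeleteSketch(int threads, long delaySeconds) {
        final AtomicInteger counter = new AtomicInteger();
        // Name the worker threads, similar in spirit to "DeletionService #%d".
        ThreadFactory tf = r -> {
            Thread t = new Thread(r, "DeletionSketch #" + counter.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
        this.sched = new ScheduledThreadPoolExecutor(threads, tf);
        // Same two settings as the original: pending delayed tasks are dropped on shutdown.
        this.sched.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);
        this.sched.setKeepAliveTime(60L, TimeUnit.SECONDS);
        this.delaySeconds = delaySeconds;
    }

    /** Schedules the file for deletion after the configured delay. */
    ScheduledFuture<Boolean> delete(final Path file) {
        Callable<Boolean> task = () -> Files.deleteIfExists(file);
        return sched.schedule(task, delaySeconds, TimeUnit.SECONDS);
    }

    void stop() {
        sched.shutdown();
    }

    /** Creates a temp file, deletes it through the pool, and reports success. */
    static boolean demo() {
        try {
            Path tmp = Files.createTempFile("deletion-sketch", ".tmp");
            DelayedDeleteSketch svc = new DelayedDeleteSketch(1, 0);
            boolean deleted = svc.delete(tmp).get(5, TimeUnit.SECONDS);
            svc.stop();
            return deleted && Files.notExists(tmp);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the pool exists as soon as initialization finishes, tasks can be scheduled on it without any separate start step, which matches the observation that no serviceStart is needed.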
Next, look at this part:
nodeHealthChecker = new NodeHealthCheckerService();
addService(nodeHealthChecker);
dirsHandler = nodeHealthChecker.getDiskHandler();
Its initialization is already complete, but it does not define a serviceStart method of its own. Where it actually gets used is here:
nodeStatusUpdater = createNodeStatusUpdater(context, dispatcher, nodeHealthChecker);
Look at the serviceStart method of nodeStatusUpdater:
// NodeManager is the last service to start, so NodeId is available.
this.nodeId = this.context.getNodeId();
this.httpPort = this.context.getHttpPort();
this.nodeManagerVersionId = YarnVersionInfo.getVersion();
try {
  // Registration has to be in start so that ContainerManager can get the
  // perNM tokens needed to authenticate ContainerTokens.
  this.resourceTracker = getRMClient();
  registerWithRM();
  super.serviceStart();
  startStatusUpdater();
} catch (Exception e) {
  String errorMessage = "Unexpected error starting NodeStatusUpdater";
  LOG.error(errorMessage, e);
  throw new YarnRuntimeException(e);
}
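Reading this code, nodeId and httpPort appear to be read from the context before anything has set them. It looks like a bug, but is it? No. Look at this part: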
addService(nodeStatusUpdater);
((NMContext) context).setNodeStatusUpdater(nodeStatusUpdater);
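nodeStatusUpdater is the last service added to the service list, so it is also the last one to be initialized and started, after the services ahead of it have had their turn. Let's skip it for now and look at what comes next: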
NodeResourceMonitor nodeResourceMonitor = createNodeResourceMonitor();
addService(nodeResourceMonitor);
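This code is still puzzling: it seems to do nothing, with no initialization logic and no custom serviceStart, so only the default methods apply. I won't go into it. Next comes the containerManager: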
containerManager = createContainerManager(context, exec, del, nodeStatusUpdater, this.aclsManager, dirsHandler);
addService(containerManager);
Look at the serviceStart method of containerManager:
final InetSocketAddress initialAddress = conf.getSocketAddr(YarnConfiguration.NM_BIND_HOST,
    YarnConfiguration.NM_ADDRESS, YarnConfiguration.DEFAULT_NM_ADDRESS, YarnConfiguration.DEFAULT_NM_PORT);
boolean usingEphemeralPort = (initialAddress.getPort() == 0);
if (context.getNMStateStore().canRecover() && usingEphemeralPort) {
  throw new IllegalArgumentException("Cannot support recovery with an "
      + "ephemeral server port. Check the setting of " + YarnConfiguration.NM_ADDRESS);
}
// If recovering then delay opening the RPC service until the recovery
// of resources and containers have completed, otherwise requests from
// clients during recovery can interfere with the recovery process.
final boolean delayedRpcServerStart = context.getNMStateStore().canRecover();
Configuration serverConf = new Configuration(conf);
// always enforce it to be token-based.
serverConf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
    SaslRpcServer.AuthMethod.TOKEN.toString());
YarnRPC rpc = YarnRPC.create(conf);
server = rpc.getServer(ContainerManagementProtocol.class, this, initialAddress, serverConf,
    this.context.getNMTokenSecretManager(), conf.getInt(YarnConfiguration.NM_CONTAINER_MGR_THREAD_COUNT,
        YarnConfiguration.DEFAULT_NM_CONTAINER_MGR_THREAD_COUNT));
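Clearly this code builds an RPC server. What looks odd is that the default IPC port is 0. A port of 0 does not make startup fail by itself: it asks the operating system to pick a free (ephemeral) port. The one case that does fail is the check above, which throws IllegalArgumentException when recovery is enabled together with an ephemeral port. If a fixed port is needed, or recovery is enabled, NM_ADDRESS has to be configured explicitly: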
/** address of node manager IPC. */
public static final String NM_ADDRESS = NM_PREFIX + "address";
public static final int DEFAULT_NM_PORT = 0;
public static final String DEFAULT_NM_ADDRESS = "0.0.0.0:" + DEFAULT_NM_PORT;
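What a port of 0 means can be checked with plain JDK sockets. The sketch below is independent of Hadoop; `EphemeralPortSketch` and `bindAndReport` are hypothetical names. It binds a ServerSocket to the requested port on the loopback address and returns the port that was actually assigned.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical sketch showing how binding to port 0 behaves; not Hadoop code. */
class EphemeralPortSketch {
    /** Binds to the requested port on loopback and returns the port actually in use. */
    static int bindAndReport(int requestedPort) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), requestedPort));
            // For requestedPort == 0 this is the free port chosen by the OS.
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new IllegalStateException("bind failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("requested 0, got " + bindAndReport(0));
    }
}
```

The assigned port can differ on every run, which is a plausible reason the code above refuses to combine recovery with an ephemeral port.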
Next, let's see where nodeId actually comes from:
// setup node ID
InetSocketAddress connectAddress;
if (delayedRpcServerStart) {
  connectAddress = NetUtils.getConnectAddress(initialAddress);
} else {
  server.start();
  connectAddress = NetUtils.getConnectAddress(server);
}
NodeId nodeId = buildNodeId(connectAddress, hostOverride);
((NodeManager.NMContext) context).setNodeId(nodeId);
this.context.getNMTokenSecretManager().setNodeId(nodeId);
this.context.getContainerTokenSecretManager().setNodeId(nodeId);
Given connectAddress, a nodeId is generated:
private NodeId buildNodeId(InetSocketAddress connectAddress, String hostOverride) {
  if (hostOverride != null) {
    connectAddress = NetUtils.getConnectAddress(new InetSocketAddress(hostOverride, connectAddress.getPort()));
  }
  return NodeId.newInstance(connectAddress.getAddress().getCanonicalHostName(), connectAddress.getPort());
}

@Private
@Unstable
public static NodeId newInstance(String host, int port) {
  NodeId nodeId = Records.newRecord(NodeId.class);
  nodeId.setHost(host);
  nodeId.setPort(port);
  nodeId.build();
  return nodeId;
}
That is how the NodeId is produced.
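Stripped of the Hadoop record machinery, the idea is just "host plus port, with an optional host override". The sketch below is a simplified model under that assumption; `NodeIdSketch` and `build` are hypothetical names. Unlike the original, it uses getHostString() instead of getCanonicalHostName(), so it performs no DNS lookup.

```java
import java.net.InetSocketAddress;

/** Hypothetical, simplified model of a node identifier; not Hadoop's NodeId. */
class NodeIdSketch {
    final String host;
    final int port;

    NodeIdSketch(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Uses hostOverride when given, otherwise the host of the connect address. */
    static NodeIdSketch build(InetSocketAddress connectAddress, String hostOverride) {
        String host = (hostOverride != null) ? hostOverride : connectAddress.getHostString();
        return new NodeIdSketch(host, connectAddress.getPort());
    }

    @Override
    public String toString() {
        return host + ":" + port;
    }

    public static void main(String[] args) {
        InetSocketAddress addr = InetSocketAddress.createUnresolved("nm1.example.com", 45454);
        System.out.println(build(addr, null));          // prints nm1.example.com:45454
        System.out.println(build(addr, "override-host")); // prints override-host:45454
    }
}
```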
LOG.info("ContainerManager started at " + connectAddress);
LOG.info("ContainerManager bound to " + initialAddress);
Finally there is log output, so the address the NodeId was built from can also be read from the log. Next comes the web server:
WebServer webServer = createWebServer(context, containerManager.getContainersMonitor(), this.aclsManager,
dirsHandler);
addService(webServer);
Look at how the NM monitoring webapp starts:
@Override
protected void serviceStart() throws Exception {
  String bindAddress = WebAppUtils.getWebAppBindURL(getConfig(),
      YarnConfiguration.NM_BIND_HOST,
      WebAppUtils.getNMWebAppURLWithoutScheme(getConfig()));
  LOG.info("Instantiating NMWebApp at " + bindAddress);
  try {
    this.webApp =
        WebApps
            .$for("node", Context.class, this.nmContext, "ws")
            .at(bindAddress)
            .with(getConfig())
            .withHttpSpnegoPrincipalKey(
                YarnConfiguration.NM_WEBAPP_SPNEGO_USER_NAME_KEY)
            .withHttpSpnegoKeytabKey(
                YarnConfiguration.NM_WEBAPP_SPNEGO_KEYTAB_FILE_KEY)
            .start(this.nmWebApp);
    this.port = this.webApp.httpServer().getConnectorAddress(0).getPort();
  } catch (Exception e) {
    String msg = "NMWebapps failed to start.";
    LOG.error(msg, e);
    throw new YarnRuntimeException(msg, e);
  }
  super.serviceStart();
}
Nothing much to add here; the main thing to note is where the address and port are loaded from:
/** NM Webapp address. **/
public static final String NM_WEBAPP_ADDRESS = NM_PREFIX + "webapp.address";
public static final int DEFAULT_NM_WEBAPP_PORT = 8042;
public static final String DEFAULT_NM_WEBAPP_ADDRESS = "0.0.0.0:" + DEFAULT_NM_WEBAPP_PORT;
These settings all live in YarnConfiguration.
With all of that done, let's look again at:
addService(nodeStatusUpdater);
// NodeManager is the last service to start, so NodeId is available.
this.nodeId = this.context.getNodeId();
this.httpPort = this.context.getHttpPort();
this.nodeManagerVersionId = YarnVersionInfo.getVersion();
try {
  // Registration has to be in start so that ContainerManager can get the
  // perNM tokens needed to authenticate ContainerTokens.
  this.resourceTracker = getRMClient();
  registerWithRM();
  super.serviceStart();
  startStatusUpdater();
} catch (Exception e) {
  String errorMessage = "Unexpected error starting NodeStatusUpdater";
  LOG.error(errorMessage, e);
  throw new YarnRuntimeException(e);
}
Now it is clear: by this point nodeId and httpPort have in fact already been set. The part to focus on is startStatusUpdater, and within it the code in the try block:
NodeStatus nodeStatus = getNodeStatus(lastHeartBeatID);
This method reads the basic status of the NM node:
private NodeStatus getNodeStatus(int responseId) throws IOException {
  NodeHealthStatus nodeHealthStatus = this.context.getNodeHealthStatus();
  nodeHealthStatus.setHealthReport(healthChecker.getHealthReport());
  nodeHealthStatus.setIsNodeHealthy(healthChecker.isHealthy());
  nodeHealthStatus.setLastHealthReportTime(healthChecker.getLastHealthReportTime());
  if (LOG.isDebugEnabled()) {
    LOG.debug("Node's health-status : " + nodeHealthStatus.getIsNodeHealthy() + ", "
        + nodeHealthStatus.getHealthReport());
  }
  List<ContainerStatus> containersStatuses = getContainerStatuses();
  NodeStatus nodeStatus = NodeStatus.newInstance(nodeId, responseId, containersStatuses,
      createKeepAliveApplicationList(), nodeHealthStatus);
  return nodeStatus;
}
The actual work is done through healthChecker, which is the NodeHealthCheckerService we saw earlier:
nodeHealthChecker = new NodeHealthCheckerService();
addService(nodeHealthChecker);
nodeStatusUpdater = createNodeStatusUpdater(context, dispatcher, nodeHealthChecker);
Its methods are all much alike, so look at just one of them:
/**
 * @return the reporting string of health of the node
 */
String getHealthReport() {
  String scriptReport = (nodeHealthScriptRunner == null) ? "" : nodeHealthScriptRunner.getHealthReport();
  if (scriptReport.equals("")) {
    return dirsHandler.getDisksHealthReport(false);
  } else {
    return scriptReport.concat(SEPARATOR + dirsHandler.getDisksHealthReport(false));
  }
}
The work is really delegated to dirsHandler; I won't go into the details.
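The combining rule in getHealthReport is small enough to isolate. The sketch below restates it as a pure function; `HealthReportSketch` and `combine` are hypothetical names, and the separator is passed in as a parameter because the value of SEPARATOR is not shown in the code above.

```java
/** Hypothetical restatement of the report-combining rule; not Hadoop code. */
class HealthReportSketch {
    /**
     * Returns the disk report alone when there is no script report,
     * otherwise the script report, the separator, and the disk report.
     */
    static String combine(String scriptReport, String diskReport, String separator) {
        if (scriptReport == null || scriptReport.isEmpty()) {
            return diskReport;
        }
        return scriptReport + separator + diskReport;
    }

    public static void main(String[] args) {
        System.out.println(combine("", "disks ok", ";"));           // prints disks ok
        System.out.println(combine("script bad", "disks ok", ";")); // prints script bad;disks ok
    }
}
```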
Once all of the services have started, our NodeManager is ready for use.