The previous post covered service initialization; this post takes a close look at service startup.
RMStateStore rmStore = rmContext.getStateStore();
// The state store needs to start irrespective of recoveryEnabled as apps
// need events to move to further states.
rmStore.start();
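Let's look at the code above, which starts the RMStateStore.
While analyzing RM initialization, we noticed that many services internally use the RM's AsyncDispatcher, but some do not. RMStateStore, for example, uses an AsyncDispatcher of its own.
By default this RMStateStore is a NullRMStateStore. Let's look back at its initialization code: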
public synchronized void serviceInit(Configuration conf) throws Exception {
  // create async handler
  dispatcher = new AsyncDispatcher();
  dispatcher.init(conf);
  dispatcher.register(RMStateStoreEventType.class, new ForwardingEventHandler());
  initInternal(conf);
}
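Clearly, it keeps an AsyncDispatcher of its own. The intent is that once the global AsyncDispatcher (the RM's own dispatcher) hands an event to RMStateStore, the store can continue dispatching that event internally.
Now, back to its service start code: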
protected synchronized void serviceStart() throws Exception {
  dispatcher.start();
  startInternal();
}
What this actually does is start the dispatcher (start, not initialization). A quick look at the dispatcher's serviceStart:
@Override
protected void serviceStart() throws Exception {
  // start all the components
  super.serviceStart();
  eventHandlingThread = new Thread(createThread());
  eventHandlingThread.setName("AsyncDispatcher event handler");
  eventHandlingThread.start();
}
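Focus on one spot: createThread. It returns a Runnable (an anonymous implementation of the interface), and when the service starts, that Runnable runs on a dedicated thread: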
Runnable createThread() {
  return new Runnable() {
    @Override
    public void run() {
      while (!stopped && !Thread.currentThread().isInterrupted()) {
        Event event;
        try {
          event = eventQueue.take();
        } catch (InterruptedException ie) {
          if (!stopped) {
            LOG.warn("AsyncDispatcher thread interrupted", ie);
          }
          return;
        }
        if (event != null) {
          dispatch(event);
        }
      }
    }
  };
}
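The logic is simple and clear: the thread keeps taking events off the internal event queue and dispatching them until the dispatcher is stopped or the thread is interrupted.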
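To make the queue-plus-thread idea concrete, here is a minimal sketch of the same pattern. It is not Hadoop's AsyncDispatcher: the names MiniEvent and MiniDispatcher, the string event types, and the post and stop methods are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical event: a type tag plus a payload (not a Hadoop class).
final class MiniEvent {
    final String type;
    final String payload;

    MiniEvent(String type, String payload) {
        this.type = type;
        this.payload = payload;
    }
}

// Hypothetical, simplified stand-in for an asynchronous dispatcher.
class MiniDispatcher {
    private final BlockingQueue<MiniEvent> eventQueue = new LinkedBlockingQueue<>();
    private final Map<String, Consumer<MiniEvent>> handlers = new ConcurrentHashMap<>();
    private volatile boolean stopped = false;
    private Thread eventHandlingThread;

    // Associate one handler with one event type.
    void register(String eventType, Consumer<MiniEvent> handler) {
        handlers.put(eventType, handler);
    }

    // Producers only enqueue; they never run handler code themselves.
    void post(MiniEvent event) {
        eventQueue.add(event);
    }

    // Start the single consumer thread that drains the queue.
    void start() {
        eventHandlingThread = new Thread(() -> {
            while (!stopped && !Thread.currentThread().isInterrupted()) {
                MiniEvent event;
                try {
                    event = eventQueue.take(); // blocks until an event arrives
                } catch (InterruptedException ie) {
                    return; // interrupted while waiting: leave the loop
                }
                Consumer<MiniEvent> handler = handlers.get(event.type);
                if (handler != null) {
                    handler.accept(event);
                }
            }
        }, "mini-dispatcher");
        eventHandlingThread.start();
    }

    // Stop the consumer thread; events still in the queue are dropped.
    void stop() throws InterruptedException {
        stopped = true;
        eventHandlingThread.interrupt();
        eventHandlingThread.join();
    }
}
```

Because a single thread drains the queue, handlers run one at a time and in posting order, which is presumably why a component such as the state store can afford a private dispatcher: its own events are processed sequentially without blocking whoever posted them.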
Continuing with the RM's serviceStart code, next comes startWepApp:
protected void startWepApp() {
  Builder<ApplicationMasterService> builder = WebApps
      .$for("cluster", ApplicationMasterService.class, masterService, "ws").with(conf)
      .withHttpSpnegoPrincipalKey(YarnConfiguration.RM_WEBAPP_SPNEGO_USER_NAME_KEY)
      .withHttpSpnegoKeytabKey(YarnConfiguration.RM_WEBAPP_SPNEGO_KEYTAB_FILE_KEY)
      .at(WebAppUtils.getRMWebAppURLWithoutScheme(conf));
  String proxyHostAndPort = WebAppUtils.getProxyHostAndPort(conf);
  if (WebAppUtils.getResolvedRMWebAppURLWithoutScheme(conf).equals(proxyHostAndPort)) {
    AppReportFetcher fetcher = new AppReportFetcher(conf, getClientRMService());
    builder.withServlet(ProxyUriUtils.PROXY_SERVLET_NAME, ProxyUriUtils.PROXY_PATH_SPEC,
        WebAppProxyServlet.class);
    builder.withAttribute(WebAppProxy.FETCHER_ATTRIBUTE, fetcher);
    String[] proxyParts = proxyHostAndPort.split(":");
    builder.withAttribute(WebAppProxy.PROXY_HOST_ATTRIBUTE, proxyParts[0]);
  }
  webApp = builder.start(new RMWebApp(this));
}
I did not analyze this part closely; it mainly starts the web app that we see in the browser.
Next:
super.serviceStart();
This is the key step: it starts the vast majority of the RM's services. The implementation lives in the parent class:
protected void serviceStart() throws Exception {
  List<Service> services = getServices();
  if (LOG.isDebugEnabled()) {
    LOG.debug(getName() + ": starting services, size=" + services.size());
  }
  for (Service service : services) {
    // start the service. If this fails that service
    // will be stopped and an exception raised
    service.start();
  }
  super.serviceStart();
}
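As expected, it pulls everything back out of serviceList and starts each service in turn. So there is no way around it: let's go through what serviceList actually contains, one entry at a time.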
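The parent-starts-children pattern can be sketched in a few lines. This is a hypothetical miniature, not Hadoop's CompositeService: MiniService and MiniCompositeService are invented names, and the addIfService here only imitates the behavior suggested by the calls quoted in this post (register the object if it is a service, otherwise ignore it).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal service contract (an assumption for this sketch).
interface MiniService {
    void start();
}

// Hypothetical composite: a service whose start() starts its children.
class MiniCompositeService implements MiniService {
    private final List<MiniService> serviceList = new ArrayList<>();

    // Children are remembered in registration order.
    void addService(MiniService service) {
        serviceList.add(service);
    }

    // Register the object only if it is a service; report whether it was added.
    boolean addIfService(Object candidate) {
        if (candidate instanceof MiniService) {
            addService((MiniService) candidate);
            return true;
        }
        return false;
    }

    // Starting the parent starts every child, in registration order.
    @Override
    public void start() {
        for (MiniService service : serviceList) {
            service.start();
        }
    }
}
```

One consequence of this design is that registration order is startup order, so the order of the add calls during initialization decides which service comes up first.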
this.rmDispatcher = createDispatcher();
addIfService(this.rmDispatcher);
This is the central dispatcher. Its start logic is the same as the AsyncDispatcher code shown earlier, so I won't repeat it.
this.containerAllocationExpirer = new ContainerAllocationExpirer(this.rmDispatcher);
addService(this.containerAllocationExpirer);
Let's look at this class's serviceStart method, which is actually inherited from its parent class, AbstractLivelinessMonitor:
@Override
protected void serviceStart() throws Exception {
  assert !stopped : "starting when already stopped";
  checkerThread = new Thread(new PingChecker());
  checkerThread.setName("Ping Checker");
  checkerThread.start();
  super.serviceStart();
}
It starts a PingChecker, which also implements the Runnable interface. Here is its run method:
public void run() {
  while (!stopped && !Thread.currentThread().isInterrupted()) {
    synchronized (AbstractLivelinessMonitor.this) {
      Iterator<Map.Entry<O, Long>> iterator = running.entrySet().iterator();
      // avoid calculating current time everytime in loop
      long currentTime = clock.getTime();
      while (iterator.hasNext()) {
        Map.Entry<O, Long> entry = iterator.next();
        if (currentTime > entry.getValue() + expireInterval) {
          iterator.remove();
          expire(entry.getKey());
          LOG.info("Expired:" + entry.getKey().toString() + " Timed out after "
              + expireInterval / 1000 + " secs");
        }
      }
    }
    try {
      Thread.sleep(monitorInterval);
    } catch (InterruptedException e) {
      LOG.info(getName() + " thread interrupted");
      break;
    }
  }
}
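Clearly, its main purpose is periodic checking: every monitorInterval it scans the tracked entries and removes any whose last recorded time is more than expireInterval in the past. The key part is the expire method, which handles an entry once it has timed out.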
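The expiry rule can be tried in isolation. The sketch below is a hypothetical reduction of the idea, not the Hadoop class: MiniLivelinessMonitor, its method names, and the use of a LongSupplier as the clock are assumptions. One scan of the checker loop is exposed as checkOnce so that it can be driven with a fake clock instead of a sleeping thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical liveliness monitor reduced to its bookkeeping.
class MiniLivelinessMonitor {
    private final Map<String, Long> running = new HashMap<>();
    private final List<String> expired = new ArrayList<>();
    private final LongSupplier clock;       // injected time source, in milliseconds
    private final long expireIntervalMs;

    MiniLivelinessMonitor(LongSupplier clock, long expireIntervalMs) {
        this.clock = clock;
        this.expireIntervalMs = expireIntervalMs;
    }

    // Start tracking a key, stamped with the current time.
    synchronized void register(String key) {
        running.put(key, clock.getAsLong());
    }

    // A ping refreshes the last-seen time of a tracked key.
    synchronized void receivedPing(String key) {
        if (running.containsKey(key)) {
            running.put(key, clock.getAsLong());
        }
    }

    // One pass of the checker loop: expire everything that is overdue.
    synchronized void checkOnce() {
        long now = clock.getAsLong(); // read the clock once per pass
        Iterator<Map.Entry<String, Long>> it = running.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now > entry.getValue() + expireIntervalMs) {
                String key = entry.getKey();
                it.remove();
                expire(key);
            }
        }
    }

    // What to do with a timed-out key; this sketch only records it.
    protected void expire(String key) {
        expired.add(key);
    }

    synchronized List<String> expiredKeys() {
        return new ArrayList<>(expired);
    }
}
```

With a 1000 ms interval, a key registered at time 0 and never pinged is expired by a check at time 1200, while a key pinged at time 600 survives that check and expires only once the clock passes 1600.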
The next two monitors serve the same purpose. Note the clock that supplies the time for these checks:
public class SystemClock implements Clock {
  public long getTime() {
    return System.currentTimeMillis();
  }
}
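This is worth a mention, because it relates to why the clocks across a Hadoop cluster are expected to be kept consistent. If they are not, odd errors can show up, and in particular these periodic checker threads, which depend on the system time, may not behave as cleanly as we imagine.
Next, let's look at the startup logic of NodesListManager: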
this.nodesListManager = new NodesListManager(this.rmContext);
this.rmDispatcher.register(NodesListManagerEventType.class, this.nodesListManager);
addService(nodesListManager);
This class's serviceStart method is empty, mainly because it is more of a utility class and does not implement any heavyweight service logic.
this.schedulerDispatcher = createSchedulerEventDispatcher();
addIfService(this.schedulerDispatcher);
Next, let's look at this scheduler event dispatcher:
public SchedulerEventDispatcher(ResourceScheduler scheduler) {
  super(SchedulerEventDispatcher.class.getName());
  this.scheduler = scheduler;
  this.eventProcessor = new Thread(new EventProcessor());
  this.eventProcessor.setName("ResourceManager Event Processor");
}

@Override
protected void serviceStart() throws Exception {
  this.eventProcessor.start();
  super.serviceStart();
}
Let's see what eventProcessor does:
SchedulerEvent event;
while (!stopped && !Thread.currentThread().isInterrupted()) {
  try {
    event = eventQueue.take();
  } catch (InterruptedException e) {
    LOG.error("Returning, interrupted : " + e);
    return; // TODO: Kill RM.
  }
  try {
    scheduler.handle(event);
  } catch (Throwable t) {
    // An error occurred, but we are shutting down anyway.
    // If it was an InterruptedException, the very act of
    // shutdown could have caused it and is probably harmless.
    if (stopped) {
      LOG.warn("Exception during shutdown: ", t);
      break;
    }
    LOG.fatal("Error in handling event type " + event.getType() + " to the scheduler", t);
    if (shouldExitOnError && !ShutdownHookManager.get().isShutdownInProgress()) {
      LOG.info("Exiting, bbye..");
      System.exit(-1);
    }
  }
}
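Without a doubt, it keeps taking events off its queue and handing them to the scheduler. Unlike the AsyncDispatcher loop, a failure inside the scheduler is treated as fatal unless the service is already shutting down.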
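The part of this loop worth isolating is its failure policy. The following is a minimal sketch under stated assumptions: MiniEventProcessor and its members are invented for illustration, events are plain strings, and a Runnable callback named onFatal stands in for the System.exit(-1) call so that the behavior can be observed without killing the JVM.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical event loop with an explicit policy for handler failures.
class MiniEventProcessor implements Runnable {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final Consumer<String> scheduler; // stands in for scheduler.handle(event)
    private final Runnable onFatal;           // stands in for System.exit(-1)
    private final boolean exitOnError;
    private volatile boolean stopped = false;

    MiniEventProcessor(Consumer<String> scheduler, Runnable onFatal, boolean exitOnError) {
        this.scheduler = scheduler;
        this.onFatal = onFatal;
        this.exitOnError = exitOnError;
    }

    void post(String event) {
        eventQueue.add(event);
    }

    void stop() {
        stopped = true;
    }

    @Override
    public void run() {
        while (!stopped && !Thread.currentThread().isInterrupted()) {
            String event;
            try {
                event = eventQueue.take();
            } catch (InterruptedException e) {
                return; // interrupted while waiting for work
            }
            try {
                scheduler.accept(event);
            } catch (Throwable t) {
                if (stopped) {
                    break; // failures during shutdown are tolerated
                }
                if (exitOnError) {
                    onFatal.run(); // escalate instead of limping on
                    return;
                }
                // otherwise fall through and keep processing later events
            }
        }
    }
}
```

The design question this raises is whether a component should keep running after its core handler has thrown. For a scheduler, whose internal state may be inconsistent after a failure, stopping early is arguably the safer default.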
this.resourceTracker = createResourceTrackerService();
addService(resourceTracker);
Next is this piece. Let's look at what resourceTracker does:
public interface ResourceTracker {
  public RegisterNodeManagerResponse registerNodeManager(
      RegisterNodeManagerRequest request) throws YarnException,
      IOException;
  public NodeHeartbeatResponse nodeHeartbeat(NodeHeartbeatRequest request)
      throws YarnException, IOException;
}

protected ResourceTrackerService createResourceTrackerService() {
  return new ResourceTrackerService(this.rmContext, this.nodesListManager, this.nmLivelinessMonitor,
      this.containerTokenSecretManager, this.nmTokenSecretManager);
}
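Above is the interface it implements, followed by the factory method that creates the service. As the interface shows, it handles NodeManager registration and the periodic node heartbeats, which is how the RM keeps track of the nodes and their resources.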
@Override
protected void serviceStart() throws Exception {
  super.serviceStart();
  // ResourceTrackerServer authenticates NodeManager via Kerberos if
  // security is enabled, so no secretManager.
  Configuration conf = getConfig();
  YarnRPC rpc = YarnRPC.create(conf);
  this.server = rpc.getServer(ResourceTracker.class, this, resourceTrackerAddress, conf, null,
      conf.getInt(YarnConfiguration.RM_RESOURCE_TRACKER_CLIENT_THREAD_COUNT,
          YarnConfiguration.DEFAULT_RM_RESOURCE_TRACKER_CLIENT_THREAD_COUNT));
  // Enable service authorization?
  if (conf.getBoolean(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, false)) {
    refreshServiceAcls(conf, new RMPolicyProvider());
  }
  this.server.start();
  conf.updateConnectAddr(YarnConfiguration.RM_RESOURCE_TRACKER_ADDRESS, server.getListenerAddress());
}
Looking at its start code above, note that communication between the RM and the NMs is implemented with YarnRPC.
masterService = createApplicationMasterService();
addService(masterService);
This looks like the service that manages ApplicationMasters. Here is its service start code:
@Override
protected void serviceStart() throws Exception {
  Configuration conf = getConfig();
  YarnRPC rpc = YarnRPC.create(conf);
  InetSocketAddress masterServiceAddress = conf.getSocketAddr(YarnConfiguration.RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS, YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
  Configuration serverConf = conf;
  // If the auth is not-simple, enforce it to be token-based.
  serverConf = new Configuration(conf);
  serverConf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
      SaslRpcServer.AuthMethod.TOKEN.toString());
  this.server = rpc.getServer(ApplicationMasterProtocol.class, this, masterServiceAddress, serverConf,
      this.rmContext.getAMRMTokenSecretManager(),
      serverConf.getInt(YarnConfiguration.RM_SCHEDULER_CLIENT_THREAD_COUNT,
          YarnConfiguration.DEFAULT_RM_SCHEDULER_CLIENT_THREAD_COUNT));
  // Enable service authorization?
  if (conf.getBoolean(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, false)) {
    refreshServiceAcls(conf, new RMPolicyProvider());
  }
  this.server.start();
  this.bindAddress = conf.updateConnectAddr(YarnConfiguration.RM_SCHEDULER_ADDRESS, server.getListenerAddress());
  super.serviceStart();
}
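Here it is clear that on the RM side, ApplicationMasterProtocol is implemented by ApplicationMasterService. Knowing this helps when we later analyze the logic of dynamically submitting an ApplicationMaster.
That is as far as the RM's service startup logic goes.
This post mainly walks through the overall flow. Some finer points are not covered in detail, but they should be easy to follow with the code at hand.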