1. Introduction
A user submits an application to the YARN ResourceManager. After receiving the request, the ResourceManager first asks the resource scheduler for the resources needed to start the ApplicationMaster; once they are allocated, ApplicationMasterLauncher communicates with the corresponding NodeManager to launch the application's ApplicationMaster.
2. Fields
There are four main fields, three of which matter most:
masterEvents: a blocking queue of pending tasks
launcherPool: the worker thread pool
launcherHandlingThread: a single dispatch thread that watches the masterEvents queue, takes tasks from it, and hands them to launcherPool for execution.
// the ApplicationMasterLauncher worker thread pool
private ThreadPoolExecutor launcherPool;
// [main] dispatch thread, single-threaded; watches the masterEvents queue
// and hands tasks to launcherPool for execution
private LauncherThread launcherHandlingThread;
// blocking queue of pending tasks
private final BlockingQueue<Runnable> masterEvents = new LinkedBlockingQueue<Runnable>();
// the RM's context: RMContextImpl
protected final RMContext context;
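Together these fields form a producer/consumer pipeline: events are produced into masterEvents, the single dispatch thread consumes them, and the pool runs them. A minimal JDK-only sketch of the same pattern (all class and method names here are illustrative, not YARN code):

```java
import java.util.concurrent.*;

public class DispatchSketch {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    // single dispatcher thread, mirroring launcherHandlingThread
    private final Thread dispatcher = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                pool.execute(events.take()); // block until a task arrives
            } catch (InterruptedException e) {
                return; // exit on interrupt, like LauncherThread
            }
        }
    }, "dispatcher");

    public void start() { dispatcher.start(); }

    public void submit(Runnable r) { events.add(r); }

    public void stop() throws InterruptedException {
        dispatcher.interrupt();
        dispatcher.join();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        DispatchSketch d = new DispatchSketch();
        CountDownLatch done = new CountDownLatch(3);
        d.start();
        for (int i = 0; i < 3; i++) d.submit(done::countDown);
        System.out.println("all ran: " + done.await(5, TimeUnit.SECONDS));
        d.stop();
    }
}
```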
3. Constructor
Called when the ResourceManager initializes its services during serviceInit.
It mainly builds the main dispatch thread, launcherHandlingThread.
public ApplicationMasterLauncher(RMContext context) {
super(ApplicationMasterLauncher.class.getName());
this.context = context;
// build the dispatch thread
this.launcherHandlingThread = new LauncherThread();
}
4. serviceInit
Mainly initializes the worker thread pool launcherPool.
@Override
protected void serviceInit(Configuration conf) throws Exception {
// default thread count: 50
int threadCount = conf.getInt(
YarnConfiguration.RM_AMLAUNCHER_THREAD_COUNT,
YarnConfiguration.DEFAULT_RM_AMLAUNCHER_THREAD_COUNT);
ThreadFactory tf = new ThreadFactoryBuilder()
.setNameFormat("ApplicationMasterLauncher #%d")
.build();
// build the thread pool
launcherPool = new ThreadPoolExecutor(threadCount, threadCount, 1,
TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
// apply the thread factory built above
launcherPool.setThreadFactory(tf);
Configuration newConf = new YarnConfiguration(conf);
// max connect retries: 10
newConf.setInt(CommonConfigurationKeysPublic.
IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_KEY,
conf.getInt(YarnConfiguration.RM_NODEMANAGER_CONNECT_RETRIES,
YarnConfiguration.DEFAULT_RM_NODEMANAGER_CONNECT_RETRIES));
setConfig(newConf);
super.serviceInit(newConf);
}
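The pool is created with core == max threads, so it is effectively fixed-size, and its threads are named via Guava's ThreadFactoryBuilder. A JDK-only sketch of an equivalent setup, without the Guava dependency (illustrative, not the YARN source):

```java
import java.util.concurrent.*;

public class PoolSketch {
    public static ThreadPoolExecutor buildPool(int threads) {
        // core == max, so the pool is effectively fixed-size; the 1-hour
        // keepAlive only matters if core threads were allowed to time out
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 1, TimeUnit.HOURS,
            new LinkedBlockingQueue<Runnable>());
        // name the threads so they are easy to spot in thread dumps
        pool.setThreadFactory(new ThreadFactory() {
            private int n = 0;
            @Override public synchronized Thread newThread(Runnable r) {
                return new Thread(r, "ApplicationMasterLauncher #" + n++);
            }
        });
        return pool;
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = buildPool(2);
        Future<String> name = pool.submit(() -> Thread.currentThread().getName());
        System.out.println(name.get()); // prints: ApplicationMasterLauncher #0
        pool.shutdown();
    }
}
```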
5. serviceStart
Nothing much here; it simply starts the main dispatch thread launcherHandlingThread.
@Override
protected void serviceStart() throws Exception {
launcherHandlingThread.start();
super.serviceStart();
}
6. LauncherThread
LauncherThread is the main dispatch thread. It takes tasks from the masterEvents queue and hands them to the worker thread pool launcherPool.
@Override
public void run() {
while (!this.isInterrupted()) {
Runnable toLaunch;
try {
// take a task from the queue (blocks while the queue is empty)
toLaunch = masterEvents.take();
// the pool executes the events taken from the masterEvents queue
launcherPool.execute(toLaunch);
} catch (InterruptedException e) {
LOG.warn(this.getClass().getName() + " interrupted. Returning.");
return;
}
}
}
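The loop's exit path relies on InterruptedException: interrupting a thread that is blocked in BlockingQueue.take() wakes it immediately. A small JDK sketch of that behavior (the names are illustrative):

```java
import java.util.concurrent.*;

public class InterruptSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            try {
                queue.take(); // blocks forever: the queue stays empty
            } catch (InterruptedException e) {
                // same exit path as LauncherThread.run()
                System.out.println("interrupted, exiting");
            }
        });
        worker.start();
        Thread.sleep(100);   // let the worker block in take()
        worker.interrupt();  // wakes take() with InterruptedException
        worker.join(5000);
        System.out.println("worker alive: " + worker.isAlive());
    }
}
```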
7. handle
Handles AMLauncherEvent events, dispatching by event type.
There are two event types: LAUNCH and CLEANUP.
@Override
public synchronized void handle(AMLauncherEvent appEvent) {
AMLauncherEventType event = appEvent.getType();
RMAppAttempt application = appEvent.getAppAttempt();
switch (event) {
case LAUNCH:
// handle a launch event
launch(application);
break;
case CLEANUP:
// handle a cleanup event
cleanup(application);
break;
default:
break;
}
}
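Note that handle() does no real work: it only enqueues a task, so the synchronized method returns almost immediately even when the downstream NodeManager RPCs are slow. A small sketch of that decoupling (illustrative names, not YARN code):

```java
import java.util.concurrent.*;

public class EnqueueSketch {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();

    // like handle(): synchronized, but it only enqueues and never blocks on the work
    public synchronized void handle(Runnable slowWork) {
        events.add(slowWork); // LinkedBlockingQueue is unbounded, so add() never blocks
    }

    public int pending() { return events.size(); }

    public static void main(String[] args) {
        EnqueueSketch s = new EnqueueSketch();
        for (int i = 0; i < 1000; i++) {
            // each task would take a second to run, but enqueueing it is instant
            s.handle(() -> {
                try { Thread.sleep(1000); } catch (InterruptedException e) { }
            });
        }
        // all 1000 tasks are queued; none has been executed
        System.out.println("pending=" + s.pending());
    }
}
```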
8. launch
Builds a launch-type task and adds it to masterEvents; the dispatch thread will later pick it up and hand it to the pool.
private void launch(RMAppAttempt application) {
// build a launch-type task
Runnable launcher = createRunnableLauncher(application, AMLauncherEventType.LAUNCH);
// add it to the task queue
masterEvents.add(launcher);
}
// build the AMLauncher
protected Runnable createRunnableLauncher(RMAppAttempt application,
AMLauncherEventType event) {
Runnable launcher = new AMLauncher(context, application, event, getConfig());
return launcher;
}
Note that the Runnable built here is an AMLauncher, which the worker thread pool then executes.
9. AMLauncher
Handles two event types: LAUNCH and CLEANUP.
9.1. The ContainerManagementProtocol
ContainerManagementProtocol is the protocol between the AM and the NM. Through this RPC the AM asks the NM to start or stop Containers, query each Container's status, and so on.
Method | Description
---|---
startContainers | start containers
stopContainers | stop containers
getContainerStatuses | get container statuses
increaseContainersResource [deprecated] | increase a container's resources
updateContainer | update a container
signalToContainer | send a signal to a container
localize | localize resources needed by a container; currently this API only works for running containers
reInitializeContainer | re-initialize a container with a new launch context
restartContainer | restart a container
rollbackLastReInitialization | attempt to roll back the last re-initialization
commitLastReInitialization | attempt to commit the last re-initialization; once committed, it can no longer be rolled back
ContainerManagementProtocol mainly provides the following three RPC functions:
❑ startContainer: the ApplicationMaster uses this RPC to ask the NodeManager to start a Container. It takes a StartContainerRequest argument, which packages the local resources, environment variables, launch commands, tokens, and other information the Container needs. If the Container starts successfully, the function returns a StartContainerResponse object.
❑ stopContainer: the ApplicationMaster uses this RPC to ask the NodeManager to stop (kill) a Container. It takes a StopContainerRequest argument, which specifies the ContainerId to kill. If the Container is killed successfully, the function returns a StopContainerResponse object.
❑ getContainerStatus: the ApplicationMaster uses this RPC to query a Container's running state. It takes a GetContainerStatusRequest argument wrapping the target Container's ID, and returns a GetContainerStatusResponse object wrapping the Container's current state.
9.2. Constructor
Called by ApplicationMasterLauncher#createRunnableLauncher to create an AMLauncher object, which is then added to the masterEvents queue;
the dispatch thread launcherHandlingThread later takes it and hands it to the worker pool launcherPool.
public AMLauncher(RMContext rmContext, RMAppAttempt application,
AMLauncherEventType eventType, Configuration conf) {
this.application = application;
this.conf = conf;
this.eventType = eventType;
this.rmContext = rmContext;
this.handler = rmContext.getDispatcher().getEventHandler();
this.masterContainer = application.getMasterContainer();
this.timelineServiceV2Enabled = YarnConfiguration.
timelineServiceV2Enabled(conf);
}
9.3. run
The core method, executed on the thread pool ApplicationMasterLauncher#launcherPool.
It dispatches on the event type eventType.
@SuppressWarnings("unchecked")
public void run() {
switch (eventType) {
// launch the AM
case LAUNCH:
try {
LOG.info("Launching master" + application.getAppAttemptId());
// try to start the container
launch();
// fire the LAUNCHED event for the RMAppAttempt
handler.handle(new RMAppAttemptEvent(application.getAppAttemptId(),
RMAppAttemptEventType.LAUNCHED, System.currentTimeMillis()));
} catch(Exception ie) {
onAMLaunchFailed(masterContainer.getId(), ie);
}
break;
// clean up the AM container
case CLEANUP:
try {
LOG.info("Cleaning master " + application.getAppAttemptId());
// perform the cleanup
cleanup();
} catch(IOException ie) {
LOG.info("Error cleaning master ", ie);
} catch (YarnException e) {
StringBuilder sb = new StringBuilder("Container ");
sb.append(masterContainer.getId().toString());
sb.append(" is not handled by this NodeManager");
if (!e.getMessage().contains(sb.toString())) {
// Ignoring if container is already killed by Node Manager.
LOG.info("Error cleaning master ", e);
}
}
break;
default:
LOG.warn("Received unknown event-type " + eventType + ". Ignoring.");
break;
}
}
9.4. launch
Connects to the NodeManager and asks it to start the Container via ContainerManagementProtocol#startContainers.
private void launch() throws IOException, YarnException {
// establish the connection to the NodeManager
connect();
// get the ContainerId, e.g. container_1611506953824_0001_01_000001
ContainerId masterContainerID = masterContainer.getId();
// get the ApplicationSubmissionContext; a sample dump:
// application_id { id: 1 cluster_timestamp: 1611506953824 }
// application_name: "org.apache.spark.examples.SparkPi"
// queue: "default"
// priority { priority: 0 }
// am_container_spec { localResources { key: "__spark_conf__" value { resource { scheme: "hdfs" host: "localhost" port: 8020 file: "/user/henghe/.sparkStaging/application_1611506953824_0001/__spark_conf__.zip" } size: 250024 timestamp: 1611512474054 type: ARCHIVE visibility: PRIVATE } }
// localResources { key: "__app__.jar" value { resource { scheme: "hdfs" host: "localhost" port: 8020 file: "/user/henghe/.sparkStaging/application_1611506953824_0001/spark-examples_2.11-2.4.5.jar" } size: 1475072 timestamp: 1611512473631 type: FILE visibility: PRIVATE } } tokens: "HDTS\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" environment { key: "SPARK_YARN_STAGING_DIR" value: "hdfs://localhost:8020/user/henghe/.sparkStaging/application_1611506953824_0001" }
// environment { key: "SPARK_USER" value: "henghe" }
// environment { key: "CLASSPATH" value: "{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__" }
// environment { key: "PYTHONHASHSEED" value: "0" } command: "{{JAVA_HOME}}/bin/java" command: "-server" command: "-Xmx1024m" command: "-Djava.io.tmpdir={{PWD}}/tmp" command: "-Dspark.yarn.app.container.log.dir=<LOG_DIR>" command: "org.apache.spark.deploy.yarn.ApplicationMaster" command: "--class" command: "\'org.apache.spark.examples.SparkPi\'" command: "--jar" command: "file:/opt/tools/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar" command: "--arg" command: "\'10\'" command: "--properties-file" command: "{{PWD}}/__spark_conf__/__spark_conf__.properties" command: "1>" command: "<LOG_DIR>/stdout" command: "2>" command: "<LOG_DIR>/stderr" application_ACLs { accessType: APPACCESS_VIEW_APP acl: "sysadmin,henghe " } application_ACLs { accessType: APPACCESS_MODIFY_APP acl: "sysadmin,henghe " } } resource { memory: 2048 virtual_cores: 1 resource_value_map { key: "memory-mb" value: 2048 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 1 units: "" type: COUNTABLE } } applicationType: "SPARK"
ApplicationSubmissionContext applicationContext = application.getSubmissionContext();
// Setting up container Container: [ContainerId: container_1611513130854_0001_01_000001, AllocationRequestId: -1, Version: 0,
// NodeId: boyi-pro.lan:56960,
// NodeHttpAddress: boyi-pro.lan:8042,
// Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: boyi-pro.lan:56960 }, ExecutionType: GUARANTEED, ] for AM appattempt_1611513130854_0001_000001
LOG.info("Setting up container " + masterContainer + " for AM " + application.getAppAttemptId());
ContainerLaunchContext launchContext = createAMContainerLaunchContext(applicationContext, masterContainerID);
// container_launch_context { localResources { key: "__spark_conf__" value { resource { scheme: "hdfs" host: "localhost" port: 8020 file: "/user/henghe/.sparkStaging/application_1611513130854_0001/__spark_conf__.zip" } size: 250024 timestamp: 1611513158992 type: ARCHIVE visibility: PRIVATE } } localResources { key: "__app__.jar" value { resource { scheme: "hdfs" host: "localhost" port: 8020 file: "/user/henghe/.sparkStaging/application_1611513130854_0001/spark-examples_2.11-2.4.5.jar" } size: 1475072 timestamp: 1611513158274 type: FILE visibility: PRIVATE } } tokens: "HDTS\000\001\000\032\n\r\n\t\b\001\020\346\336\253\255\363.\020\001\020\312\207\343\314\372\377\377\377\377\001\024\256X\203p\026\324\327\356^;\212z\314\333\276\361\263\227\246\236\020YARN_AM_RM_TOKEN\000\000"
// environment { key: "SPARK_YARN_STAGING_DIR" value: "hdfs://localhost:8020/user/henghe/.sparkStaging/application_1611513130854_0001" }
// environment { key: "APPLICATION_WEB_PROXY_BASE" value: "/proxy/application_1611513130854_0001" }
// environment { key: "SPARK_USER" value: "henghe" }
// environment { key: "CLASSPATH" value: "{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__" }
// environment { key: "PYTHONHASHSEED" value: "0" }
// environment { key: "APP_SUBMIT_TIME_ENV" value: "1611513161745" } command: "{{JAVA_HOME}}/bin/java" command: "-server" command: "-Xmx1024m" command: "-Djava.io.tmpdir={{PWD}}/tmp" command: "-Dspark.yarn.app.container.log.dir=<LOG_DIR>" command: "org.apache.spark.deploy.yarn.ApplicationMaster" command: "--class" command: "\'org.apache.spark.examples.SparkPi\'" command: "--jar" command: "file:/opt/tools/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar" command: "--arg" command: "\'10\'" command: "--properties-file" command: "{{PWD}}/__spark_conf__/__spark_conf__.properties" command: "1>" command: "<LOG_DIR>/stdout" command: "2>" command: "<LOG_DIR>/stderr" application_ACLs { accessType: APPACCESS_MODIFY_APP acl: "sysadmin,henghe " } application_ACLs { accessType: APPACCESS_VIEW_APP acl: "sysadmin,henghe " } } container_token { identifier: "\n\021\022\r\n\t\b\001\020\346\336\253\255\363.\020\001\030\001\022\022boyi-pro.lan:56960\032\006henghe\"+\b\200\020\020\001\032\024\n\tmemory-mb\020\200\020\032\002Mi \000\032\016\n\006vcores\020\001\032\000 \000(\361\251\322\255\363.0\265\254\237\350\0068\346\336\253\255\363.B\002\b\000H\245\332\255\255\363.Z\000`\001h\001p\000x\377\377\377\377\377\377\377\377\377\001" password: "$g\341\316Mw\377\fC\270\v\026<\242a\325\027\354\331\354" kind: "ContainerToken" service: "boyi-pro.lan:56960" }
StartContainerRequest scRequest =
StartContainerRequest.newInstance(launchContext,
masterContainer.getContainerToken());
List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
list.add(scRequest);
StartContainersRequest allRequests = StartContainersRequest.newInstance(list);
// send the request and read the response
StartContainersResponse response = containerMgrProxy.startContainers(allRequests);
if (response.getFailedRequests() != null
&& response.getFailedRequests().containsKey(masterContainerID)) {
Throwable t =
response.getFailedRequests().get(masterContainerID).deSerialize();
parseAndThrowException(t);
} else {
// succeeded_requests { app_attempt_id { application_id { id: 1 cluster_timestamp: 1611514283537 } attemptId: 1 } id: 1 }
LOG.info("Done launching container " + masterContainer + " for AM " + application.getAppAttemptId());
}
}
9.5. cleanup
Connects to the NodeManager and stops the Container via ContainerManagementProtocol#stopContainers.
private void cleanup() throws IOException, YarnException {
// establish the connection to the NodeManager
connect();
// get the container's id
ContainerId containerId = masterContainer.getId();
List<ContainerId> containerIds = new ArrayList<ContainerId>();
containerIds.add(containerId);
// build the stop request
StopContainersRequest stopRequest = StopContainersRequest.newInstance(containerIds);
// send the request
StopContainersResponse response = containerMgrProxy.stopContainers(stopRequest);
// handle the response
if (response.getFailedRequests() != null
&& response.getFailedRequests().containsKey(containerId)) {
Throwable t = response.getFailedRequests().get(containerId).deSerialize();
parseAndThrowException(t);
}
}
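Both launch() and cleanup() end with the same batch-RPC pattern: send a list of requests, then inspect a failed-requests map keyed by id and rethrow the per-entry exception only if our own id is in it. A generic sketch of that pattern (BatchResponse and the ids are made-up stand-ins, not the YARN API):

```java
import java.util.*;

public class FailedRequestsSketch {
    // hypothetical stand-in for StartContainersResponse / StopContainersResponse
    static class BatchResponse {
        final Map<String, Exception> failedRequests = new HashMap<>();
    }

    // mirrors the check in launch()/cleanup(): rethrow only if OUR id failed
    static void checkFailed(BatchResponse response, String id) throws Exception {
        if (response.failedRequests.containsKey(id)) {
            throw response.failedRequests.get(id);
        }
    }

    public static void main(String[] args) {
        BatchResponse resp = new BatchResponse();
        resp.failedRequests.put("container_0002", new IllegalStateException("NM rejected"));
        try {
            checkFailed(resp, "container_0001"); // not in the failed map: succeeds
            System.out.println("container_0001 ok");
            checkFailed(resp, "container_0002"); // in the failed map: rethrows
        } catch (Exception e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```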