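Overview
Communication between the ApplicationMaster and the ResourceManager consists of three main steps:
1. The ApplicationMaster registers with the ResourceManager over RPC. When the ApplicationMaster starts, it first registers with the ResourceManager. The registration message is wrapped in the Protocol Buffers message RegisterApplicationMasterRequest, which mainly contains the following fields:
- host: the host of the node on which the ApplicationMaster was started
- rpc_port: the RPC port that the ApplicationMaster exposes for this run
- tracking_url: the web URL that the ApplicationMaster exposes for tracking; users can check the application's execution status through this URL
After the ApplicationMaster registers successfully, the ResourceManager returns an object of type RegisterApplicationMasterResponse, which mainly contains the following fields:
- maximumCapability: the maximum amount of resources that can be requested for a single Container
- application_ACLs: the application's access control lists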
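2. After registering successfully, the ApplicationMaster requests resources from the ResourceManager over RPC (resources are granted in units of Containers). The RPC request mainly contains the following fields:
- ask: the list of resources requested by the ApplicationMaster. Each requested resource is represented by a ResourceRequest, whose main fields are:
  - priority: the priority of the request, a positive integer; the smaller the value, the higher the priority of the request and of the resources allocated for it
  - resource_name: the node on which the resource is expected; * means that resources on any machine are acceptable
  - capability: the amount of resources needed; CPU and memory are the two supported resource types
  - num_containers: the number of Containers that must satisfy the requirements above
- release: the list of resources released by the ApplicationMaster
After accepting the request, the ResourceManager returns an object of type AllocateResponse, which mainly contains the following fields:
- a_m_command: the command that the ApplicationMaster must execute. It has two main values: AM_RESYNC means restart and AM_SHUTDOWN means shut down. When the ResourceManager restarts, or when the application's information is in an inconsistent state, the ApplicationMaster may be asked to restart; when the application is on the blacklist, the ApplicationMaster is asked to shut down.
- allocated_containers: the list of Containers allocated to the application (a Container corresponds to a task in MapReduce and to an executor in Spark)
3. When the application has finished, the ApplicationMaster tells the ResourceManager over RPC that execution is complete, and then exits.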
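The three steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `ToyResourceManager`, `ToyResourceRequest`, the container labels, and the example URL are hypothetical and are not Hadoop classes, and the toy grants containers immediately inside `allocate()`, whereas a real scheduler hands them out over later calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for ResourceRequest; fields follow the list above.
final class ToyResourceRequest {
    final int priority;        // smaller value = higher priority
    final String resourceName; // "*" means any node
    final int memoryMb;        // requested memory per container
    final int numContainers;   // how many such containers are wanted

    ToyResourceRequest(int priority, String resourceName, int memoryMb, int numContainers) {
        this.priority = priority;
        this.resourceName = resourceName;
        this.memoryMb = memoryMb;
        this.numContainers = numContainers;
    }
}

// Hypothetical in-memory ResourceManager: just enough to show the call order.
final class ToyResourceManager {
    private final int maxMemoryMb;
    private boolean registered;
    private boolean finished;

    ToyResourceManager(int maxMemoryMb) {
        this.maxMemoryMb = maxMemoryMb;
    }

    // Step 1: register; the reply carries the maximum capability (memory only here).
    int register(String host, int rpcPort, String trackingUrl) {
        if (registered) {
            throw new IllegalStateException("already registered");
        }
        registered = true;
        return maxMemoryMb;
    }

    // Step 2: serve the asks in priority order and return one label per container.
    List<String> allocate(List<ToyResourceRequest> ask) {
        if (!registered || finished) {
            throw new IllegalStateException("not registered");
        }
        List<ToyResourceRequest> sorted = new ArrayList<>(ask);
        sorted.sort(Comparator.comparingInt(r -> r.priority));
        List<String> allocated = new ArrayList<>();
        for (ToyResourceRequest r : sorted) {
            if (r.memoryMb > maxMemoryMb) {
                continue; // this toy simply skips asks above the maximum capability
            }
            for (int i = 0; i < r.numContainers; i++) {
                allocated.add("p" + r.priority + "@" + r.resourceName);
            }
        }
        return allocated;
    }

    // Step 3: tell the ResourceManager that the application is done.
    void finish() {
        if (!registered) {
            throw new IllegalStateException("not registered");
        }
        finished = true;
    }

    boolean isFinished() {
        return finished;
    }
}

final class ToyLifecycleDemo {
    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager(4096);
        int max = rm.register("node1", 40000, "http://node1:40001/"); // made-up values
        List<String> containers = rm.allocate(Arrays.asList(
            new ToyResourceRequest(2, "*", 1024, 1),
            new ToyResourceRequest(1, "node1", 2048, 2)));
        rm.finish();
        System.out.println(max + " " + containers + " " + rm.isFinished());
    }
}
```

The ask with priority 1 is served before the ask with priority 2 even though it was submitted second, which is the "smaller value, higher priority" rule in action.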
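Source code analysis of the yarn-api layer
1. ApplicationMasterProtocol
This interface describes the three steps of communication between the ApplicationMaster and the ResourceManager:
- registerApplicationMaster(): the ApplicationMaster registers with the ResourceManager over RPC.
- allocate(): after registering successfully, the ApplicationMaster requests resources from the ResourceManager over RPC.
- finishApplicationMaster(): the ApplicationMaster tells the ResourceManager over RPC that the application has finished, and then exits.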
/**
 * <p>The protocol between a live instance of <code>ApplicationMaster</code>
 * and the <code>ResourceManager</code>.</p>
 *
 * <p>This is used by the <code>ApplicationMaster</code> to register/unregister
 * and to request and obtain resources in the cluster from the
 * <code>ResourceManager</code>.</p>
 */
@Public
@Stable
public interface ApplicationMasterProtocol {

  public RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request)
      throws YarnException, IOException;

  public FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request)
      throws YarnException, IOException;

  public AllocateResponse allocate(AllocateRequest request)
      throws YarnException, IOException;
}
Protobuf-based client implementation in the yarn-common layer
1. ApplicationMasterProtocolPBClientImpl
public class ApplicationMasterProtocolPBClientImpl implements ApplicationMasterProtocol, Closeable {

  private ApplicationMasterProtocolPB proxy;

  public ApplicationMasterProtocolPBClientImpl(long clientVersion, InetSocketAddress addr,
      Configuration conf) throws IOException {
    RPC.setProtocolEngine(conf, ApplicationMasterProtocolPB.class, ProtobufRpcEngine.class);
    proxy =
        (ApplicationMasterProtocolPB) RPC.getProxy(ApplicationMasterProtocolPB.class, clientVersion,
            addr, conf);
  }

  @Override
  public void close() {
    if (this.proxy != null) {
      RPC.stopProxy(this.proxy);
    }
  }

  @Override
  public AllocateResponse allocate(AllocateRequest request)
      throws YarnException, IOException {
    AllocateRequestProto requestProto =
        ((AllocateRequestPBImpl) request).getProto();
    try {
      return new AllocateResponsePBImpl(proxy.allocate(null, requestProto));
    } catch (ServiceException e) {
      RPCUtil.unwrapAndThrowException(e);
      return null;
    }
  }

  @Override
  public FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) throws YarnException,
      IOException {
    FinishApplicationMasterRequestProto requestProto =
        ((FinishApplicationMasterRequestPBImpl) request).getProto();
    try {
      return new FinishApplicationMasterResponsePBImpl(
          proxy.finishApplicationMaster(null, requestProto));
    } catch (ServiceException e) {
      RPCUtil.unwrapAndThrowException(e);
      return null;
    }
  }

  @Override
  public RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException,
      IOException {
    RegisterApplicationMasterRequestProto requestProto =
        ((RegisterApplicationMasterRequestPBImpl) request).getProto();
    try {
      return new RegisterApplicationMasterResponsePBImpl(
          proxy.registerApplicationMaster(null, requestProto));
    } catch (ServiceException e) {
      RPCUtil.unwrapAndThrowException(e);
      return null;
    }
  }
}
Protobuf-based server implementation in the yarn-common layer
1. ApplicationMasterProtocolPBServiceImpl
It is purely a proxy class for ApplicationMasterService.
public class ApplicationMasterProtocolPBServiceImpl implements ApplicationMasterProtocolPB {

  // real is initialized to the ApplicationMasterService instance
  private ApplicationMasterProtocol real;

  public ApplicationMasterProtocolPBServiceImpl(ApplicationMasterProtocol impl) {
    this.real = impl;
  }

  @Override
  public AllocateResponseProto allocate(RpcController arg0,
      AllocateRequestProto proto) throws ServiceException {
    AllocateRequestPBImpl request = new AllocateRequestPBImpl(proto);
    try {
      // Calls ApplicationMasterService#allocate()
      AllocateResponse response = real.allocate(request);
      return ((AllocateResponsePBImpl)response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }

  @Override
  public FinishApplicationMasterResponseProto finishApplicationMaster(
      RpcController arg0, FinishApplicationMasterRequestProto proto)
      throws ServiceException {
    FinishApplicationMasterRequestPBImpl request = new FinishApplicationMasterRequestPBImpl(proto);
    try {
      // Calls ApplicationMasterService#finishApplicationMaster()
      FinishApplicationMasterResponse response = real.finishApplicationMaster(request);
      return ((FinishApplicationMasterResponsePBImpl)response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }

  @Override
  public RegisterApplicationMasterResponseProto registerApplicationMaster(
      RpcController arg0, RegisterApplicationMasterRequestProto proto)
      throws ServiceException {
    RegisterApplicationMasterRequestPBImpl request = new RegisterApplicationMasterRequestPBImpl(proto);
    try {
      // Calls ApplicationMasterService#registerApplicationMaster()
      RegisterApplicationMasterResponse response = real.registerApplicationMaster(request);
      return ((RegisterApplicationMasterResponsePBImpl)response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }
}
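The client and server classes above follow the same wrapping pattern: the client turns a record into a wire message and unwraps any service exception, while the server unwraps the wire message, delegates to the real implementation, and wraps exceptions. A minimal self-contained sketch of that round trip is given here. All names (`ToyProtocol`, `ToyProtocolPB`, `PingProto`, `ToyServiceException`, and so on) are hypothetical stand-ins, not Hadoop or protobuf classes, and the proxy is called in-process instead of over a network.

```java
import java.io.IOException;

// Hypothetical wire message, standing in for a generated protobuf class.
final class PingProto {
    final String payload;

    PingProto(String payload) {
        this.payload = payload;
    }
}

// Hypothetical stand-in for the RPC layer's ServiceException.
class ToyServiceException extends Exception {
    ToyServiceException(Throwable cause) {
        super(cause);
    }
}

// Business-level protocol (the role ApplicationMasterProtocol plays above).
interface ToyProtocol {
    String ping(String request) throws IOException;
}

// Wire-level protocol (the role ApplicationMasterProtocolPB plays above).
interface ToyProtocolPB {
    PingProto ping(PingProto request) throws ToyServiceException;
}

// Server side: unwrap the message, delegate to the real service, wrap failures.
final class ToyServiceImpl implements ToyProtocolPB {
    private final ToyProtocol real;

    ToyServiceImpl(ToyProtocol real) {
        this.real = real;
    }

    @Override
    public PingProto ping(PingProto proto) throws ToyServiceException {
        try {
            return new PingProto(real.ping(proto.payload));
        } catch (IOException e) {
            throw new ToyServiceException(e);
        }
    }
}

// Client side: wrap the request, call the proxy, rethrow the original exception.
final class ToyClientImpl implements ToyProtocol {
    private final ToyProtocolPB proxy;

    ToyClientImpl(ToyProtocolPB proxy) {
        this.proxy = proxy;
    }

    @Override
    public String ping(String request) throws IOException {
        try {
            return proxy.ping(new PingProto(request)).payload;
        } catch (ToyServiceException e) {
            if (e.getCause() instanceof IOException) {
                throw (IOException) e.getCause();
            }
            throw new IOException(e);
        }
    }
}

final class ToyRoundTripDemo {
    // Returns the reply, or "error: <message>" when the call fails.
    static String call(ToyProtocol protocol, String request) {
        try {
            return protocol.ping(request);
        } catch (IOException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        ToyProtocol real = request -> "pong:" + request;
        ToyProtocol client = new ToyClientImpl(new ToyServiceImpl(real));
        System.out.println(call(client, "hello"));
    }
}
```

Because both ends implement an interface, the caller only ever sees `ToyProtocol` and its original exception type; the wire types stay hidden inside the two adapter classes.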
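How ApplicationMasterService handles AM registration
When ApplicationMasterService receives a registerApplicationMaster request, it sends an RMAppAttemptEventType.REGISTERED event to RMAppAttemptImpl. On receiving that event, RMAppAttemptImpl first saves the ApplicationMaster's basic information (its host, the RPC port it opened, and so on) and then sends an RMAppEventType.ATTEMPT_REGISTERED event to RMAppImpl. At that point the RMAppAttemptImpl state changes from LAUNCHED to RUNNING, and the RMAppImpl state changes from ACCEPTED to RUNNING.
It follows that RUNNING in the RMApp state machine means that the application's ApplicationMaster is running successfully on some node.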
@Override
public RegisterApplicationMasterResponse registerApplicationMaster(
    RegisterApplicationMasterRequest request) throws YarnException,
    IOException {

  AMRMTokenIdentifier amrmTokenIdentifier =
      YarnServerSecurityUtils.authorizeRequest();
  ApplicationAttemptId applicationAttemptId =
      amrmTokenIdentifier.getApplicationAttemptId();

  ApplicationId appID = applicationAttemptId.getApplicationId();
  AllocateResponseLock lock = responseMap.get(applicationAttemptId);
  if (lock == null) {
    RMAuditLogger.logFailure(this.rmContext.getRMApps().get(appID).getUser(),
        AuditConstants.REGISTER_AM, "Application doesn't exist in cache "
            + applicationAttemptId, "ApplicationMasterService",
        "Error in registering application master", appID,
        applicationAttemptId);
    throwApplicationDoesNotExistInCacheException(applicationAttemptId);
  }

  // Allow only one thread in AM to do registerApp at a time.
  synchronized (lock) {
    AllocateResponse lastResponse = lock.getAllocateResponse();
    if (hasApplicationMasterRegistered(applicationAttemptId)) {
      // allow UAM re-register if work preservation is enabled
      ApplicationSubmissionContext appContext =
          rmContext.getRMApps().get(appID).getApplicationSubmissionContext();
      if (!(appContext.getUnmanagedAM()
          && appContext.getKeepContainersAcrossApplicationAttempts())) {
        String message =
            AMRMClientUtils.APP_ALREADY_REGISTERED_MESSAGE + appID;
        LOG.warn(message);
        RMAuditLogger.logFailure(
            this.rmContext.getRMApps().get(appID).getUser(),
            AuditConstants.REGISTER_AM, "", "ApplicationMasterService",
            message, appID, applicationAttemptId);
        throw new InvalidApplicationMasterRequestException(message);
      }
    }

    this.amLivelinessMonitor.receivedPing(applicationAttemptId);

    // Setting the response id to 0 to identify if the
    // application master is register for the respective attemptid
    lastResponse.setResponseId(0);
    lock.setAllocateResponse(lastResponse);

    RegisterApplicationMasterResponse response =
        recordFactory.newRecordInstance(
            RegisterApplicationMasterResponse.class);
    // AM registration is handled by AMSProcessingChain (chain-of-responsibility pattern)
    this.amsProcessingChain.registerApplicationMaster(
        amrmTokenIdentifier.getApplicationAttemptId(), request, response);
    return response;
  }
}
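The guard logic in the method above (look up a per-attempt lock, reject unknown attempts, reject a second registration, then mark the attempt by setting the response id to 0) can be isolated in a small sketch. `ToyRegistry` and its methods are hypothetical, attempt ids are plain strings, and the initial value -1 for "not yet registered" is an assumption made for this illustration; only the "response id 0 marks a registered attempt" convention comes from the comment in the code above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry keyed by attempt id; not a Hadoop class.
final class ToyRegistry {

    // Holder that doubles as the per-attempt monitor.
    private static final class ResponseLock {
        int responseId = -1; // assumption: -1 means "not registered yet"
    }

    private final ConcurrentMap<String, ResponseLock> responseMap =
        new ConcurrentHashMap<>();

    // Called when an attempt becomes known, before its AM registers.
    void addAttempt(String attemptId) {
        responseMap.putIfAbsent(attemptId, new ResponseLock());
    }

    // Registers the AM of the given attempt, at most once.
    void register(String attemptId) {
        ResponseLock lock = responseMap.get(attemptId);
        if (lock == null) {
            throw new IllegalStateException(
                "Application doesn't exist in cache " + attemptId);
        }
        // Only one thread at a time may register a given attempt.
        synchronized (lock) {
            if (lock.responseId >= 0) {
                throw new IllegalStateException(
                    "Application Master is already registered: " + attemptId);
            }
            lock.responseId = 0; // 0 marks the attempt as registered
        }
    }

    boolean isRegistered(String attemptId) {
        ResponseLock lock = responseMap.get(attemptId);
        if (lock == null) {
            return false;
        }
        synchronized (lock) {
            return lock.responseId >= 0;
        }
    }
}

final class ToyRegistryDemo {
    public static void main(String[] args) {
        ToyRegistry registry = new ToyRegistry();
        registry.addAttempt("attempt_1");
        registry.register("attempt_1");
        System.out.println(registry.isRegistered("attempt_1"));
    }
}
```

Locking on the per-attempt holder rather than on the whole map means that registrations for different attempts do not block each other.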
Initialization of AMSProcessingChain
private final AMSProcessingChain amsProcessingChain;

public ApplicationMasterService(String name, RMContext rmContext,
    YarnScheduler scheduler) {
  ......
  this.amsProcessingChain = new AMSProcessingChain(new DefaultAMSProcessor());
}
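The chain-of-responsibility idea behind the constructor above can be sketched as follows. This is not the AMSProcessingChain implementation: `ToyProcessor`, `ToyProcessingChain`, `LoggingToyProcessor`, and the rule that new processors are placed in front of the default one are assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical processor interface; each processor may pass the call on.
interface ToyProcessor {
    void init(ToyProcessor next);

    void register(String attemptId, List<String> trace);
}

// Terminal processor: does the default work and does not forward.
final class DefaultToyProcessor implements ToyProcessor {
    @Override
    public void init(ToyProcessor next) {
        // Last link of the chain: nothing to remember.
    }

    @Override
    public void register(String attemptId, List<String> trace) {
        trace.add("default:" + attemptId);
    }
}

// Example of an extra processor that runs first and then forwards.
final class LoggingToyProcessor implements ToyProcessor {
    private ToyProcessor next;

    @Override
    public void init(ToyProcessor next) {
        this.next = next;
    }

    @Override
    public void register(String attemptId, List<String> trace) {
        trace.add("logging:" + attemptId);
        next.register(attemptId, trace);
    }
}

final class ToyProcessingChain {
    private ToyProcessor head;

    // The root processor starts as the only link and stays last.
    ToyProcessingChain(ToyProcessor root) {
        this.head = root;
    }

    // Assumption of this sketch: a new processor goes in front of the current head.
    void addProcessor(ToyProcessor processor) {
        processor.init(head);
        head = processor;
    }

    // Runs the chain and returns the order in which processors were visited.
    List<String> register(String attemptId) {
        List<String> trace = new ArrayList<>();
        head.register(attemptId, trace);
        return trace;
    }
}

final class ToyChainDemo {
    public static void main(String[] args) {
        ToyProcessingChain chain = new ToyProcessingChain(new DefaultToyProcessor());
        chain.addProcessor(new LoggingToyProcessor());
        System.out.println(chain.register("attempt_1"));
    }
}
```

With only the default processor installed, as in the constructor above, the chain behaves exactly like a direct call to that processor; the indirection only matters once further processors are added.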
How DefaultAMSProcessor handles AM registration
public void registerApplicationMaster(
    ApplicationAttemptId applicationAttemptId,
    RegisterApplicationMasterRequest request,
    RegisterApplicationMasterResponse response)
    throws IOException, YarnException {

  RMApp app = getRmContext().getRMApps().get(
      applicationAttemptId.getApplicationId());
  LOG.info("AM registration " + applicationAttemptId);
  // Send an RMAppAttemptEventType.REGISTERED event to RMAppAttemptImpl
  getRmContext().getDispatcher().getEventHandler()
      .handle(
          new RMAppAttemptRegistrationEvent(applicationAttemptId, request
              .getHost(), request.getRpcPort(), request.getTrackingUrl()));
  RMAuditLogger.logSuccess(app.getUser(),
      RMAuditLogger.AuditConstants.REGISTER_AM,
      "ApplicationMasterService", app.getApplicationId(),
      applicationAttemptId);
  response.setMaximumResourceCapability(getScheduler()
      .getMaximumResourceCapability(app.getQueue()));
  response.setApplicationACLs(app.getRMAppAttempt(applicationAttemptId)
      .getSubmissionContext().getAMContainerSpec().getApplicationACLs());
  response.setQueue(app.getQueue());
  if (UserGroupInformation.isSecurityEnabled()) {
    LOG.info("Setting client token master key");
    response.setClientToAMTokenMasterKey(java.nio.ByteBuffer.wrap(
        getRmContext().getClientToAMTokenSecretManager()
            .getMasterKey(applicationAttemptId).getEncoded()));
  }

  // For work-preserving AM restart, retrieve previous attempts' containers
  // and corresponding NM tokens.
  if (app.getApplicationSubmissionContext()
      .getKeepContainersAcrossApplicationAttempts()) {
    List<Container> transferredContainers = getScheduler()
        .getTransferredContainers(applicationAttemptId);
    if (!transferredContainers.isEmpty()) {
      response.setContainersFromPreviousAttempts(transferredContainers);
      // Clear the node set remembered by the secret manager. Necessary
      // for UAM restart because we use the same attemptId.
      rmContext.getNMTokenSecretManager()
          .clearNodeSetForAttempt(applicationAttemptId);

      List<NMToken> nmTokens = new ArrayList<NMToken>();
      for (Container container : transferredContainers) {
        try {
          NMToken token = getRmContext().getNMTokenSecretManager()
              .createAndGetNMToken(app.getUser(), applicationAttemptId,
                  container);
          if (null != token) {
            nmTokens.add(token);
          }
        } catch (IllegalArgumentException e) {
          // if it's a DNS issue, throw UnknownHostException directly and
          // that will be automatically retried by RMProxy in RPC layer.
          if (e.getCause() instanceof UnknownHostException) {
            throw (UnknownHostException) e.getCause();
          }
        }
      }
      response.setNMTokensFromPreviousAttempts(nmTokens);
      LOG.info("Application " + app.getApplicationId() + " retrieved "
          + transferredContainers.size() + " containers from previous"
          + " attempts and " + nmTokens.size() + " NM tokens.");
    }
  }

  response.setSchedulerResourceTypes(getScheduler()
      .getSchedulingResourceTypes());
  response.setResourceTypes(ResourceUtils.getResourcesTypeInfo());
  if (getRmContext().getYarnConfiguration().getBoolean(
      YarnConfiguration.RM_RESOURCE_PROFILES_ENABLED,
      YarnConfiguration.DEFAULT_RM_RESOURCE_PROFILES_ENABLED)) {
    response.setResourceProfiles(
        resourceProfilesManager.getResourceProfiles());
  }
}
RMAppAttemptRegistrationEvent is defined as follows:
public RMAppAttemptRegistrationEvent(ApplicationAttemptId appAttemptId,
    String host, int rpcPort, String trackingUrl) {
  // An RMAppAttemptEventType of type REGISTERED
  super(appAttemptId, RMAppAttemptEventType.REGISTERED);
  this.host = host;
  this.rpcport = rpcPort;
  this.trackingurl = trackingUrl;
}
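The effect of this event on the two state machines described earlier (the attempt moves from LAUNCHED to RUNNING, then the application moves from ACCEPTED to RUNNING) can be illustrated with a minimal sketch. `ToyAttempt`, `ToyApp`, and the two enums are hypothetical simplifications: they handle the event with direct method calls, whereas the ResourceManager delivers events through its dispatcher, and the host, port, and URL values are made up.

```java
// Hypothetical, reduced state sets; the real state machines have many more states.
enum ToyAttemptState { LAUNCHED, RUNNING }

enum ToyAppState { ACCEPTED, RUNNING }

// Stands in for the application-level state machine.
final class ToyApp {
    private ToyAppState state = ToyAppState.ACCEPTED;

    // Reaction to the "attempt registered" notification.
    void onAttemptRegistered() {
        if (state == ToyAppState.ACCEPTED) {
            state = ToyAppState.RUNNING;
        }
    }

    ToyAppState getState() {
        return state;
    }
}

// Stands in for the attempt-level state machine.
final class ToyAttempt {
    private final ToyApp app;
    private ToyAttemptState state = ToyAttemptState.LAUNCHED;
    private String host;
    private int rpcPort;
    private String trackingUrl;

    ToyAttempt(ToyApp app) {
        this.app = app;
    }

    // Reaction to the "registered" event: save AM info, change state, notify the app.
    void onRegistered(String host, int rpcPort, String trackingUrl) {
        if (state != ToyAttemptState.LAUNCHED) {
            throw new IllegalStateException("unexpected state " + state);
        }
        this.host = host;
        this.rpcPort = rpcPort;
        this.trackingUrl = trackingUrl;
        state = ToyAttemptState.RUNNING;
        app.onAttemptRegistered();
    }

    ToyAttemptState getState() {
        return state;
    }

    String getHost() {
        return host;
    }

    int getRpcPort() {
        return rpcPort;
    }

    String getTrackingUrl() {
        return trackingUrl;
    }
}

final class ToyStateDemo {
    public static void main(String[] args) {
        ToyApp app = new ToyApp();
        ToyAttempt attempt = new ToyAttempt(app);
        attempt.onRegistered("node1", 40000, "http://node1:40001/"); // made-up values
        System.out.println(attempt.getState() + " " + app.getState());
    }
}
```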
The following Spark on YARN run log leads to the same conclusion: RUNNING in the RMApp state machine means that the application's ApplicationMaster is running successfully on some node.