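The previous few posts analyzed the source code of SLS, the YARN scheduler simulator, focusing on the communication protocol between the AM and the RM.
This post looks at how AM-RM communication is implemented in the YARN project itself.
Note: in YARN, only the RM and the NM are fully implemented. For the AM and the client, YARN only provides APIs, and users have to implement them themselves.
The AM's main job is to request resources from the RM according to the application's needs and to use those resources to carry out its business logic, so the AM has to communicate with both the RM and the NM.
Communication protocols:
AM-RM: ApplicationMasterProtocol
AM-NM: ContainerManagementProtocol
Both protocols are defined in hadoop-yarn-api.
AM-RM protocol:
The AM talks to the RM through ApplicationMasterProtocol, which provides the following methods:
1. Register the AM with the RM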
public RegisterApplicationMasterResponse registerApplicationMaster(
RegisterApplicationMasterRequest request)
throws YarnException, IOException;
2. Tell the RM that the application has finished
public FinishApplicationMasterResponse finishApplicationMaster(
FinishApplicationMasterRequest request)
throws YarnException, IOException;
3. Request resources from the RM
public AllocateResponse allocate(AllocateRequest request)
throws YarnException, IOException;
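After a client submits an application to the RM, the RM allocates resources to launch the AM. Once started, the AM calls registerApplicationMaster of ApplicationMasterProtocol to register itself with the RM. After registering, it calls allocate to request resources for running its tasks. Once it has the resources, it uses the AM-NM protocol, ContainerManagementProtocol, to launch containers and run the tasks. When the work is done, it calls finishApplicationMaster of ApplicationMasterProtocol to report that the application has finished and to unregister the AM.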
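The call order described above can be sketched as a small state machine. This is a minimal, self-contained model and not the Hadoop API: the class AmLifecycleSketch and its integer "containers" are made up for illustration, and only the ordering of registerApplicationMaster, allocate, and finishApplicationMaster is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the AM-side call order; not a Hadoop class.
class AmLifecycleSketch {
    enum State { NEW, REGISTERED, FINISHED }

    private State state = State.NEW;
    private final List<String> log = new ArrayList<>();

    // Step 1: the AM must register before anything else.
    void registerApplicationMaster() {
        if (state != State.NEW) {
            throw new IllegalStateException("already registered");
        }
        state = State.REGISTERED;
        log.add("register");
    }

    // Step 2: allocate may be called repeatedly, but only while registered.
    int allocate(int containersWanted) {
        if (state != State.REGISTERED) {
            throw new IllegalStateException("allocate requires a registered AM");
        }
        log.add("allocate:" + containersWanted);
        return containersWanted; // this toy RM grants everything that is asked
    }

    // Step 3: report completion and unregister.
    void finishApplicationMaster() {
        if (state != State.REGISTERED) {
            throw new IllegalStateException("finish requires a registered AM");
        }
        state = State.FINISHED;
        log.add("finish");
    }

    State state() { return state; }

    List<String> log() { return log; }
}
```

Calling allocate before registerApplicationMaster fails in this model, which mirrors the requirement that the AM registers first.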
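Now let's look in detail at how the AM requests resources from the RM:
1. The AM registers with the RM.
The AM then uses the allocate method to request or release resources from the RM. The request is wrapped in an AllocateRequest.
As an example, Hadoop ships an AM implementation for MapReduce:
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.java
In the serviceInit function, the job (that is, the work to run) is set up first.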
@Override
protected void serviceInit(final Configuration conf) throws Exception {
// create the job classloader if enabled
createJobClassLoader(conf);
initJobCredentialsAndUGI(conf);
......
if(!stagingExists) {
isLastAMRetry = true;
LOG.info("Attempt num: " + appAttemptID.getAttemptId() +
" is last retry: " + isLastAMRetry +
" because the staging dir doesn't exist.");
errorHappenedShutDown = true;
forcedState = JobStateInternal.ERROR;
shutDownMessage = "Staging dir does not exist " + stagingDir;
LOG.error(shutDownMessage);
} else if (commitStarted) {
//A commit was started so this is the last time, we just need to know
// what result we will use to notify, and how we will unregister
errorHappenedShutDown = true;
isLastAMRetry = true;
LOG.info("Attempt num: " + appAttemptID.getAttemptId() +
" is last retry: " + isLastAMRetry +
" because a commit was started.");
copyHistory = true;
if (commitSuccess) {
shutDownMessage =
"Job commit succeeded in a prior MRAppMaster attempt " +
"before it crashed. Recovering.";
forcedState = JobStateInternal.SUCCEEDED;
} else if (commitFailure) {
shutDownMessage =
"Job commit failed in a prior MRAppMaster attempt " +
"before it crashed. Not retrying.";
forcedState = JobStateInternal.FAILED;
} else {
if (isCommitJobRepeatable()) {
// cleanup previous half done commits if committer supports
// repeatable job commit.
errorHappenedShutDown = false;
cleanupInterruptedCommit(conf, fs, startCommitFile);
} else {
//The commit is still pending, commit error
shutDownMessage =
"Job commit from a prior MRAppMaster attempt is " +
"potentially in progress. Preventing multiple commit executions";
forcedState = JobStateInternal.ERROR;
}
}
}
......
// service to allocate containers from RM (if non-uber) or to fake it (uber)
containerAllocator = createContainerAllocator(null, context);
addIfService(containerAllocator);
dispatcher.register(ContainerAllocator.EventType.class, containerAllocator);
Towards the end, it sets up container allocation from the RM by calling createContainerAllocator.
That function is implemented as follows:
protected ContainerAllocator createContainerAllocator(
final ClientService clientService, final AppContext context) {
return new ContainerAllocatorRouter(clientService, context);
}
It creates a new ContainerAllocatorRouter and returns it.
ContainerAllocatorRouter is implemented as follows:
private final class ContainerAllocatorRouter extends AbstractService
implements ContainerAllocator, RMHeartbeatHandler {
private final ClientService clientService;
private final AppContext context;
private ContainerAllocator containerAllocator;
ContainerAllocatorRouter(ClientService clientService,
AppContext context) {
super(ContainerAllocatorRouter.class.getName());
this.clientService = clientService;
this.context = context;
}
@Override
protected void serviceStart() throws Exception {
if (job.isUber()) {
MRApps.setupDistributedCacheLocal(getConfig());
this.containerAllocator = new LocalContainerAllocator(
this.clientService, this.context, nmHost, nmPort, nmHttpPort
, containerID);
} else {
this.containerAllocator = new RMContainerAllocator(
this.clientService, this.context, preemptionPolicy);
}
((Service)this.containerAllocator).init(getConfig());
((Service)this.containerAllocator).start();
super.serviceStart();
}
@Override
protected void serviceStop() throws Exception {
ServiceOperations.stop((Service) this.containerAllocator);
super.serviceStop();
}
@Override
public void handle(ContainerAllocatorEvent event) {
this.containerAllocator.handle(event);
}
public void setSignalled(boolean isSignalled) {
((RMCommunicator) containerAllocator).setSignalled(isSignalled);
}
public void setShouldUnregister(boolean shouldUnregister) {
((RMCommunicator) containerAllocator).setShouldUnregister(shouldUnregister);
}
@Override
public long getLastHeartbeatTime() {
return ((RMCommunicator) containerAllocator).getLastHeartbeatTime();
}
@Override
public void runOnNextHeartbeat(Runnable callback) {
((RMCommunicator) containerAllocator).runOnNextHeartbeat(callback);
}
}
After the ContainerAllocatorRouter is returned, serviceInit continues and adds it to MRAppMaster's list of services:
addIfService(containerAllocator);
The service is then started, which runs its serviceStart:
@Override
protected void serviceStart() throws Exception {
if (job.isUber()) {
MRApps.setupDistributedCacheLocal(getConfig());
this.containerAllocator = new LocalContainerAllocator(
this.clientService, this.context, nmHost, nmPort, nmHttpPort
, containerID);
} else {
this.containerAllocator = new RMContainerAllocator(
this.clientService, this.context, preemptionPolicy);
}
((Service)this.containerAllocator).init(getConfig());
((Service)this.containerAllocator).start();
super.serviceStart();
}
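As shown, the service checks whether the job is an uber job. What is uber? It is Hadoop's mode for small MR jobs in which everything runs inside a single JVM, so it can be thought of as a local mode.
If the job is not uber, it creates an RMContainerAllocator and then calls init and start on it.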
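The router's behavior, choosing a delegate at start time and forwarding every call to it, can be sketched without Hadoop. The interface and class names below (AllocatorSketch, LocalAllocatorSketch, RmAllocatorSketch, AllocatorRouterSketch) are hypothetical stand-ins, and events are plain strings; only the uber / non-uber branching comes from the code above.

```java
// Hypothetical stand-in for the ContainerAllocator interface.
interface AllocatorSketch {
    void handle(String event);
    String name();
}

// Stand-in for the uber (local) case: nothing is requested from the RM.
class LocalAllocatorSketch implements AllocatorSketch {
    String lastEvent;
    public void handle(String event) { lastEvent = "local:" + event; }
    public String name() { return "local"; }
}

// Stand-in for the non-uber case, where containers come from the RM.
class RmAllocatorSketch implements AllocatorSketch {
    String lastEvent;
    public void handle(String event) { lastEvent = "rm:" + event; }
    public String name() { return "rm"; }
}

// The router picks its delegate when started, then forwards calls to it.
class AllocatorRouterSketch implements AllocatorSketch {
    private final boolean uber;
    private AllocatorSketch delegate;

    AllocatorRouterSketch(boolean uber) { this.uber = uber; }

    void start() {
        delegate = uber ? new LocalAllocatorSketch() : new RmAllocatorSketch();
    }

    public void handle(String event) {
        if (delegate == null) {
            throw new IllegalStateException("router not started");
        }
        delegate.handle(event);
    }

    public String name() {
        return delegate == null ? "unstarted" : delegate.name();
    }
}
```

Deferring the choice to start() matters because whether the job is uber may not be known yet when the router object is constructed.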
Let's look at RMContainerAllocator:
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.java
Constructor: initializes the member fields from the arguments passed in
public RMContainerAllocator(ClientService clientService, AppContext context,
AMPreemptionPolicy preemptionPolicy) {
super(clientService, context);
this.preemptionPolicy = preemptionPolicy;
this.stopped = new AtomicBoolean(false);
this.clock = context.getClock();
this.assignedRequests = createAssignedRequests();
}
The init call in ((Service)this.containerAllocator).init(getConfig()) ends up in serviceInit:
@Override
protected void serviceInit(Configuration conf) throws Exception {
super.serviceInit(conf);
reduceSlowStart = conf.getFloat(
MRJobConfig.COMPLETED_MAPS_FOR_REDUCE_SLOWSTART,
DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART);
maxReduceRampupLimit = conf.getFloat(
MRJobConfig.MR_AM_JOB_REDUCE_RAMPUP_UP_LIMIT,
MRJobConfig.DEFAULT_MR_AM_JOB_REDUCE_RAMP_UP_LIMIT);
maxReducePreemptionLimit = conf.getFloat(
MRJobConfig.MR_AM_JOB_REDUCE_PREEMPTION_LIMIT,
MRJobConfig.DEFAULT_MR_AM_JOB_REDUCE_PREEMPTION_LIMIT);
reducerUnconditionalPreemptionDelayMs = 1000 * conf.getInt(
MRJobConfig.MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC,
MRJobConfig.DEFAULT_MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC);
reducerNoHeadroomPreemptionDelayMs = conf.getInt(
MRJobConfig.MR_JOB_REDUCER_PREEMPT_DELAY_SEC,
MRJobConfig.DEFAULT_MR_JOB_REDUCER_PREEMPT_DELAY_SEC) * 1000;//sec -> ms
maxRunningMaps = conf.getInt(MRJobConfig.JOB_RUNNING_MAP_LIMIT,
MRJobConfig.DEFAULT_JOB_RUNNING_MAP_LIMIT);
maxRunningReduces = conf.getInt(MRJobConfig.JOB_RUNNING_REDUCE_LIMIT,
MRJobConfig.DEFAULT_JOB_RUNNING_REDUCE_LIMIT);
RackResolver.init(conf);
retryInterval = getConfig().getLong(MRJobConfig.MR_AM_TO_RM_WAIT_INTERVAL_MS,
MRJobConfig.DEFAULT_MR_AM_TO_RM_WAIT_INTERVAL_MS);
mapNodeLabelExpression = conf.get(MRJobConfig.MAP_NODE_LABEL_EXP);
reduceNodeLabelExpression = conf.get(MRJobConfig.REDUCE_NODE_LABEL_EXP);
// Init startTime to current time. If all goes well, it will be reset after
// first attempt to contact RM.
retrystartTime = System.currentTimeMillis();
this.scheduledRequests.setNumOpportunisticMapsPercent(
conf.getInt(MRJobConfig.MR_NUM_OPPORTUNISTIC_MAPS_PERCENT,
MRJobConfig.DEFAULT_MR_NUM_OPPORTUNISTIC_MAPS_PERCENT));
LOG.info(this.scheduledRequests.getNumOpportunisticMapsPercent() +
"% of the mappers will be scheduled using OPPORTUNISTIC containers");
}
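This reads the allocator's settings from the configuration, such as the reduce slow-start threshold, the reduce ramp-up and preemption limits, the caps on concurrently running maps and reduces, and the retry interval for contacting the RM.
The start call in ((Service)this.containerAllocator).start() ends up in serviceStart: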
@Override
protected void serviceStart() throws Exception {
this.eventHandlingThread = new Thread() {
@SuppressWarnings("unchecked")
@Override
public void run() {
ContainerAllocatorEvent event;
while (!stopped.get() && !Thread.currentThread().isInterrupted()) {
try {
event = RMContainerAllocator.this.eventQueue.take();
} catch (InterruptedException e) {
if (!stopped.get()) {
LOG.error("Returning, interrupted : " + e);
}
return;
}
try {
handleEvent(event);
} catch (Throwable t) {
LOG.error("Error in handling event type " + event.getType()
+ " to the ContainreAllocator", t);
// Kill the AM
eventHandler.handle(new JobEvent(getJob().getID(),
JobEventType.INTERNAL_ERROR));
return;
}
}
}
};
this.eventHandlingThread.start();
super.serviceStart();
}
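The pattern in serviceStart, one thread draining a blocking queue until a stopped flag is set or the thread is interrupted, can be sketched in isolation. EventLoopSketch below is a hypothetical class with string events; awaitHandled is an example-only helper added so the demo is deterministic and has no counterpart in the code above.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class EventLoopSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final List<String> handled = new CopyOnWriteArrayList<>();
    private Thread eventHandlingThread;

    void start() {
        eventHandlingThread = new Thread(() -> {
            while (!stopped.get() && !Thread.currentThread().isInterrupted()) {
                String event;
                try {
                    event = eventQueue.take(); // blocks until an event arrives
                } catch (InterruptedException e) {
                    return; // interrupted while waiting: leave the loop
                }
                handled.add(event); // stands in for handleEvent(event)
            }
        });
        eventHandlingThread.start();
    }

    // Producers only enqueue; the work happens on the event thread.
    void handle(String event) { eventQueue.add(event); }

    // Set the flag, wake the thread if it is blocked in take(), wait for it.
    void stop() {
        stopped.set(true);
        eventHandlingThread.interrupt();
        try {
            eventHandlingThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Example-only helper: poll until n events were handled or time runs out.
    boolean awaitHandled(int n, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (handled.size() < n && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return handled.size() >= n;
    }

    List<String> handled() { return handled; }
}
```

Because a single thread consumes a FIFO queue, events are handled in the order they were enqueued.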
Once the service has started, it enters heartbeat mode.
The heartbeat method of the class is then executed repeatedly:
@Override
protected synchronized void heartbeat() throws Exception {
scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
List<Container> allocatedContainers = getResources();
if (allocatedContainers != null && allocatedContainers.size() > 0) {
scheduledRequests.assign(allocatedContainers);
}
int completedMaps = getJob().getCompletedMaps();
int completedTasks = completedMaps + getJob().getCompletedReduces();
if ((lastCompletedTasks != completedTasks) ||
(scheduledRequests.maps.size() > 0)) {
lastCompletedTasks = completedTasks;
recalculateReduceSchedule = true;
}
if (recalculateReduceSchedule) {
boolean reducerPreempted = preemptReducesIfNeeded();
if (!reducerPreempted) {
// Only schedule new reducers if no reducer preemption happens for
// this heartbeat
scheduleReduces(getJob().getTotalMaps(), completedMaps,
scheduledRequests.maps.size(), scheduledRequests.reduces.size(),
assignedRequests.maps.size(), assignedRequests.reduces.size(),
mapResourceRequest, reduceResourceRequest, pendingReduces.size(),
maxReduceRampupLimit, reduceSlowStart);
}
recalculateReduceSchedule = false;
}
scheduleStats.updateAndLogIfChanged("After Scheduling: ");
}
The most important statement in this method is the call near the top that obtains resources:
List<Container> allocatedContainers = getResources();
Once containers have been obtained, they are handed to scheduledRequests.assign to be assigned to the pending requests:
if (allocatedContainers != null && allocatedContainers.size() > 0) {
scheduledRequests.assign(allocatedContainers);
}
Next, let's go through the getResources function piece by piece:
@SuppressWarnings("unchecked")
private List<Container> getResources() throws Exception {
applyConcurrentTaskLimits();
// will be null the first time
Resource headRoom = Resources.clone(getAvailableResources());
AllocateResponse response;
/*
* If contact with RM is lost, the AM will wait MR_AM_TO_RM_WAIT_INTERVAL_MS
* milliseconds before aborting. During this interval, AM will still try
* to contact the RM.
*/
try {
response = makeRemoteRequest();
// Reset retry count if no exception occurred.
retrystartTime = System.currentTimeMillis();
The key line here is response = makeRemoteRequest(), which is how the AM contacts the RM to ask for resources.
The try is followed by catch clauses that handle exceptions:
} catch (ApplicationAttemptNotFoundException e ) {
// This can happen if the RM has been restarted. If it is in that state,
// this application must clean itself up.
eventHandler.handle(new JobEvent(this.getJob().getID(),
JobEventType.JOB_AM_REBOOT));
throw new RMContainerAllocationException(
"Resource Manager doesn't recognize AttemptId: "
+ this.getContext().getApplicationAttemptId(), e);
} catch (ApplicationMasterNotRegisteredException e) {
LOG.info("ApplicationMaster is out of sync with ResourceManager,"
+ " hence resync and send outstanding requests.");
// RM may have restarted, re-register with RM.
lastResponseID = 0;
register();
addOutstandingRequestOnResync();
return null;
} catch (InvalidLabelResourceRequestException e) {
// If Invalid label exception is received means the requested label doesnt
// have access so killing job in this case.
String diagMsg = "Requested node-label-expression is invalid: "
+ StringUtils.stringifyException(e);
LOG.info(diagMsg);
JobId jobId = this.getJob().getID();
eventHandler.handle(new JobDiagnosticsUpdateEvent(jobId, diagMsg));
eventHandler.handle(new JobEvent(jobId, JobEventType.JOB_KILL));
throw e;
} catch (Exception e) {
// This can happen when the connection to the RM has gone down. Keep
// re-trying until the retryInterval has expired.
if (System.currentTimeMillis() - retrystartTime >= retryInterval) {
LOG.error("Could not contact RM after " + retryInterval + " milliseconds.");
eventHandler.handle(new JobEvent(this.getJob().getID(),
JobEventType.JOB_AM_REBOOT));
throw new RMContainerAllocationException("Could not contact RM after " +
retryInterval + " milliseconds.");
}
// Throw this up to the caller, which may decide to ignore it and
// continue to attempt to contact the RM.
throw e;
}
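These clauses handle the exceptions thrown by response = makeRemoteRequest(): the RM no longer recognizes the application attempt, the AM is no longer registered with the RM (for example after an RM restart), the requested node label is invalid, or the RM cannot be reached.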
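The last catch clause implements a retry window: failures are tolerated until retryInterval milliseconds have passed since the last successful contact. A minimal sketch of that bookkeeping follows. RetryWindowSketch is a hypothetical class, and the clock is injected as a LongSupplier (an assumption made here so the behavior can be checked without waiting); the original code reads System.currentTimeMillis() directly.

```java
import java.util.function.LongSupplier;

// Hypothetical helper modelling retrystartTime / retryInterval.
class RetryWindowSketch {
    private final long retryIntervalMs;
    private final LongSupplier clock;
    private long retryStartTime;

    RetryWindowSketch(long retryIntervalMs, LongSupplier clock) {
        this.retryIntervalMs = retryIntervalMs;
        this.clock = clock;
        this.retryStartTime = clock.getAsLong(); // same idea as in serviceInit
    }

    // Call after every successful contact with the RM: the window restarts.
    void onSuccess() {
        retryStartTime = clock.getAsLong();
    }

    // Call after a failed contact. True means the window has expired and
    // the caller should stop retrying.
    boolean shouldGiveUp() {
        return clock.getAsLong() - retryStartTime >= retryIntervalMs;
    }
}
```

For example, with a 1000 ms window, a failure 500 ms after the last success is tolerated, while a failure 1000 ms after it is not.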
Next:
Resource newHeadRoom = getAvailableResources();
List<Container> newContainers = response.getAllocatedContainers();
This takes the containers that have already been allocated out of the response and assigns them to List<Container> newContainers. These are the resources the AM asked for.
Next:
// Setting NMTokens
if (response.getNMTokens() != null) {
for (NMToken nmToken : response.getNMTokens()) {
NMTokenCache.setNMToken(nmToken.getNodeId().toString(),
nmToken.getToken());
}
}
// Setting AMRMToken
if (response.getAMRMToken() != null) {
updateAMRMToken(response.getAMRMToken());
}
List<ContainerStatus> finishedContainers =
response.getCompletedContainersStatuses();
// propagate preemption requests
final PreemptionMessage preemptReq = response.getPreemptionMessage();
if (preemptReq != null) {
preemptionPolicy.preempt(
new PreemptionContext(assignedRequests), preemptReq);
}
if (newContainers.size() + finishedContainers.size() > 0
|| !headRoom.equals(newHeadRoom)) {
//something changed
recalculateReduceSchedule = true;
if (LOG.isDebugEnabled() && !headRoom.equals(newHeadRoom)) {
LOG.debug("headroom=" + newHeadRoom);
}
}
if (LOG.isDebugEnabled()) {
for (Container cont : newContainers) {
LOG.debug("Received new Container :" + cont);
}
}
//Called on each allocation. Will know about newly blacklisted/added hosts.
computeIgnoreBlacklisting();
handleUpdatedNodes(response);
handleJobPriorityChange(response);
// Handle receiving the timeline collector address and token for this app.
MRAppMaster.RunningAppContext appContext =
(MRAppMaster.RunningAppContext)this.getContext();
if (appContext.getTimelineV2Client() != null) {
appContext.getTimelineV2Client().
setTimelineCollectorInfo(response.getCollectorInfo());
}
for (ContainerStatus cont : finishedContainers) {
processFinishedContainer(cont);
}
return newContainers;
}
After some further processing, newContainers is returned.
Next, let's analyze the makeRemoteRequest method that getResources calls:
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.java
The makeRemoteRequest method:
protected AllocateResponse makeRemoteRequest() throws YarnException,
IOException {
applyRequestLimits();
ResourceBlacklistRequest blacklistRequest =
ResourceBlacklistRequest.newInstance(new ArrayList<String>(blacklistAdditions),
new ArrayList<String>(blacklistRemovals));
AllocateRequest allocateRequest =
AllocateRequest.newInstance(lastResponseID,
super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
new ArrayList<ContainerId>(release), blacklistRequest);
AllocateResponse allocateResponse = scheduler.allocate(allocateRequest);
lastResponseID = allocateResponse.getResponseId();
availableResources = allocateResponse.getAvailableResources();
lastClusterNmCount = clusterNmCount;
clusterNmCount = allocateResponse.getNumClusterNodes();
int numCompletedContainers =
allocateResponse.getCompletedContainersStatuses().size();
if (ask.size() > 0 || release.size() > 0) {
LOG.info("getResources() for " + applicationId + ":" + " ask="
+ ask.size() + " release= " + release.size() + " newContainers="
+ allocateResponse.getAllocatedContainers().size()
+ " finishedContainers=" + numCompletedContainers
+ " resourcelimit=" + availableResources + " knownNMs="
+ clusterNmCount);
}
ask.clear();
release.clear();
if (numCompletedContainers > 0) {
// re-send limited requests when a container completes to trigger asking
// for more containers
requestLimitsToUpdate.addAll(requestLimits.keySet());
}
if (blacklistAdditions.size() > 0 || blacklistRemovals.size() > 0) {
LOG.info("Update the blacklist for " + applicationId +
": blacklistAdditions=" + blacklistAdditions.size() +
" blacklistRemovals=" + blacklistRemovals.size());
}
blacklistAdditions.clear();
blacklistRemovals.clear();
return allocateResponse;
}
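An AllocateRequest object is created through newInstance, passing in the last response ID, the application progress returned by getApplicationProgress, copies of the ask and release lists, and the blacklist request: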
AllocateRequest allocateRequest =
AllocateRequest.newInstance(lastResponseID,
super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
new ArrayList<ContainerId>(release), blacklistRequest);
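That completes a standard allocateRequest object, ready to be sent to the RM.
Next, scheduler.allocate is called to send the request. The return value is an AllocateResponse, the standard format of the RM's reply.
Does scheduler.allocate(allocateRequest) look familiar?
It is the third method of ApplicationMasterProtocol mentioned at the top: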
public AllocateResponse allocate(AllocateRequest request)
AllocateResponse allocateResponse = scheduler.allocate(allocateRequest);
This completes the round trip: the AM requests resources and the RM replies.
The allocateResponse is then simply returned.
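The bookkeeping in makeRemoteRequest (send copies of the pending ask and release sets, remember the response ID, then clear the sets) can be sketched as follows. RequestorSketch and its nested Request class are hypothetical, resources and container IDs are plain strings, and the "RM" is simulated by assuming it answers with the request's response ID plus one.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the ask/release bookkeeping; not a Hadoop class.
class RequestorSketch {
    // Snapshot of what is sent in one allocate call.
    static final class Request {
        final int responseId;
        final List<String> ask;
        final List<String> release;

        Request(int responseId, List<String> ask, List<String> release) {
            this.responseId = responseId;
            this.ask = ask;
            this.release = release;
        }
    }

    private final Set<String> ask = new LinkedHashSet<>();
    private final Set<String> release = new LinkedHashSet<>();
    private int lastResponseId = 0;

    void addAsk(String resource) { ask.add(resource); }

    void addRelease(String containerId) { release.add(containerId); }

    Request makeRemoteRequest() {
        // Copy the pending sets so that clearing them below does not
        // change the request that was just built.
        Request request = new Request(lastResponseId,
            new ArrayList<>(ask), new ArrayList<>(release));
        // Assumption: a toy RM that replies with responseId + 1.
        lastResponseId = request.responseId + 1;
        ask.clear();
        release.clear();
        return request;
    }
}
```

Since the pending sets are cleared after each call, a second call with nothing new to ask carries empty lists and only serves as a heartbeat.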
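Finally, let's look at the AllocateRequest and AllocateResponse classes in detail.
AllocateRequest: the standard message the AM uses to request resources from the RM
org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest.java
AllocateRequest has the following fields:
1. responseID: the response ID, used to tell duplicate responses apart
2. appProgress: the progress of the application
3. askList (List<ResourceRequest> resourceAsk): the list of resources the AM requests from the RM, a List<ResourceRequest> object. Each ResourceRequest holds the detailed parameters of one resource request, including the number of containers, the container capacity, and the allocation policy.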
hadoop-yarn-api/src/main/proto/yarn_protos.proto
ResourceRequest
message ResourceRequestProto {
optional PriorityProto priority = 1;
optional string resource_name = 2;
optional ResourceProto capability = 3;
optional int32 num_containers = 4;
optional bool relax_locality = 5 [default = true];
optional string node_label_expression = 6;
optional ExecutionTypeRequestProto execution_type_request = 7;
optional int64 allocation_request_id = 8 [default = -1];
}
4. resourceBlacklistRequest: the resources to add to or remove from the blacklist
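To make the ResourceRequestProto fields more concrete, here is a plain Java sketch that mirrors a few of them. ResourceRequestSketch is a hypothetical class, not the Hadoop ResourceRequest record. The two defaults (relax_locality = true, allocation_request_id = -1) are taken from the proto definition above; splitting capability into memoryMb and vcores is an assumption based on the <mem, vcores> form of Resource.

```java
// Hypothetical mirror of some ResourceRequestProto fields.
class ResourceRequestSketch {
    final int priority;
    final String resourceName;  // a host, a rack, or "*" for any location
    final int memoryMb;         // capability, memory part (assumed unit: MB)
    final int vcores;           // capability, CPU part
    final int numContainers;
    boolean relaxLocality = true;     // proto default: [default = true]
    long allocationRequestId = -1L;   // proto default: [default = -1]

    ResourceRequestSketch(int priority, String resourceName,
                          int memoryMb, int vcores, int numContainers) {
        this.priority = priority;
        this.resourceName = resourceName;
        this.memoryMb = memoryMb;
        this.vcores = vcores;
        this.numContainers = numContainers;
    }

    @Override
    public String toString() {
        return numContainers + " x <" + memoryMb + ", " + vcores + "> on "
            + resourceName + " (priority " + priority + ")";
    }
}
```

A request for two 1024 MB, 1-vcore containers anywhere in the cluster would then be new ResourceRequestSketch(20, "*", 1024, 1, 2); the priority value 20 is arbitrary.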
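AllocateResponse: the message the RM sends back to the AM
org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse.java
It contains:
1. responseId: the ID of the response, used to avoid duplicate responses
2. numClusterNodes: the number of nodes in the cluster
3. completedContainersStatuses: the list of statuses of completed containers
4. allocatedContainers: the resources the RM has newly allocated to the AM. They are wrapped in the Container class, so the return type is List<Container>.
What a Container consists of:
org.apache.hadoop.yarn.api.records.Container.java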
public static Container newInstance(ContainerId containerId, NodeId nodeId,
String nodeHttpAddress, Resource resource, Priority priority,
Token containerToken, ExecutionType executionType) {
Container container = Records.newRecord(Container.class);
container.setId(containerId);
container.setNodeId(nodeId);
container.setNodeHttpAddress(nodeHttpAddress);
container.setResource(resource);
container.setPriority(priority);
container.setContainerToken(containerToken);
container.setExecutionType(executionType);
return container;
}
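As shown, a container consists of:
a. container ID
b. Node ID
c. nodeHttpAddress: the HTTP address of the node
d. resource: a Resource object, in the form <mem, vcores>
e. priority: the priority
f. containerToken: the container token
g. executionType: the execution type of the container
5. updatedNodes: the list of all nodes whose state has been updated. The update for each node is stored in a NodeReport, so the return type is List<NodeReport>.
6. amCommand: a control command sent by the RM to the AM, such as resync (reconnect) or shutdown.
7. preemptionMessage: resource preemption information, in two parts: resources that will be forcibly reclaimed, and resources whose release the AM can arrange on its own
8. nmTokens: the tokens used for communication between the AM and the NMs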
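Both AllocateRequest and AllocateResponse carry a response ID so that repeated messages can be recognized. The sketch below shows one way such a check can work on the RM side. It is a simplified model under stated assumptions and not the actual ResourceManager code: ResponseIdSketch is a hypothetical class, responses are plain strings, and it is assumed that each new response increments the ID by one and that a repeated request gets the previous response replayed.

```java
// Hypothetical model of duplicate detection with response IDs.
class ResponseIdSketch {
    private int lastResponseId = 0;
    private String lastResponse = "none";
    private int newResponses = 0;

    // requestResponseId is the ID of the last response the AM has seen.
    String allocate(int requestResponseId) {
        if (requestResponseId == lastResponseId) {
            // The AM has seen the latest response: treat as a new request.
            lastResponseId++;
            newResponses++;
            lastResponse = "response#" + lastResponseId;
            return lastResponse;
        }
        if (requestResponseId + 1 == lastResponseId) {
            // The AM repeated its previous request, perhaps because the
            // reply was lost: replay the old response, allocate nothing new.
            return lastResponse;
        }
        throw new IllegalStateException(
            "unexpected response ID " + requestResponseId);
    }

    int newResponses() { return newResponses; }
}
```

In this model, sending ID 0 twice yields the same response both times and counts as only one new response.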
over