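In a conversation with a friend, the question came up of how the container for the ApplicationMaster (AM) is placed. My impression was that it is assigned at random, but he said it is assigned according to the free resources on each node.
I had not paid attention to this logic when reading the code before, so now that the question has come up, let's go look at the code.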
Personal site: http://bigdatadecode.club/YARN源码分析之ApplicationMaster分配策略.html
The logs from an MR job run show when the AM container is allocated:
2017-04-09 03:26:17,113 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1491729774382_0001_000001 State change from SUBMITTED to SCHEDULED
2017-04-09 03:26:17,415 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1491729774382_0001_01_000001 Container Transitioned from NEW to ALLOCATED
The AM container is initialized when the appattempt's state changes from SUBMITTED to SCHEDULED.
The transition that handles the appattempt going from SUBMITTED to SCHEDULED is:
public static final class ScheduleTransition
    implements
    MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
  @Override
  public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
    ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
    if (!subCtx.getUnmanagedAM()) {
      // Need reset #containers before create new attempt, because this request
      // will be passed to scheduler, and scheduler will deduct the number after
      // AM container allocated
      // Set up the request for the AM container
      appAttempt.amReq.setNumContainers(1);
      appAttempt.amReq.setPriority(AM_CONTAINER_PRIORITY);
      // A ResourceName of ANY means any machine on any rack
      appAttempt.amReq.setResourceName(ResourceRequest.ANY);
      appAttempt.amReq.setRelaxLocality(true);
      // The scheduler is responsible for allocating the resources
      Allocation amContainerAllocation =
          appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
              Collections.singletonList(appAttempt.amReq),
              EMPTY_CONTAINER_RELEASE_LIST, null, null);
      ...
      return RMAppAttemptState.SCHEDULED;
    } else {
      ...
    }
  }
}
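First a container request is built for the AM container. In fact, appAttempt.amReq.setResourceName(ResourceRequest.ANY) already suggests that the AM container is placed effectively at random, because the request puts no constraint on ResourceName. Still, let's keep reading the code to verify this.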
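To make the meaning of ANY concrete, here is a minimal sketch. ToyRequest, acceptsNode and countEligible are hypothetical names invented for this illustration and are not part of Hadoop; the only detail taken from YARN is that ResourceRequest.ANY is the string "*". The relaxLocality flag is not modeled. Under those assumptions, a request named ANY is acceptable on every node, while a request naming a host is acceptable only there.

```java
import java.util.Arrays;
import java.util.List;

class AmRequestSketch {
    // Toy stand-in for a resource request; NOT the real Hadoop ResourceRequest.
    static final class ToyRequest {
        static final String ANY = "*"; // same string value as ResourceRequest.ANY

        final String resourceName;

        ToyRequest(String resourceName) {
            this.resourceName = resourceName;
        }

        // True if a container for this request may be placed on the given host.
        boolean acceptsNode(String host) {
            return ANY.equals(resourceName) || resourceName.equals(host);
        }
    }

    // Counts how many of the given hosts could serve the request.
    static long countEligible(ToyRequest req, List<String> hosts) {
        return hosts.stream().filter(req::acceptsNode).count();
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("node1", "node2", "node3");
        // An ANY request, like the AM request, is acceptable on all three nodes.
        System.out.println(countEligible(new ToyRequest(ToyRequest.ANY), hosts));
        // A request that names a host is acceptable on that host only.
        System.out.println(countEligible(new ToyRequest("node2"), hosts));
    }
}
```

Since every node qualifies for the AM request, which node actually gets the container has to be decided somewhere else, namely in the scheduler.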
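Once the request has been created, the scheduler allocates the resources. The CapacityScheduler, which is the default, is used here; the code is: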
// CapacityScheduler.java
public Allocation allocate(ApplicationAttemptId applicationAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {
  FiCaSchedulerApp application = getApplicationAttempt(applicationAttemptId);
  ...
  // Release containers
  releaseContainers(release, application);
  synchronized (application) {
    ...
    if (!ask.isEmpty()) {
      ...
      application.showRequests();
      // Put the requests into this application attempt's map
      // Update application requests
      application.updateResourceRequests(ask);
      application.showRequests();
    }
    application.updateBlacklist(blacklistAdditions, blacklistRemovals);
    return application.getAllocation(getResourceCalculator(),
        clusterResource, getMinimumResourceCapability());
  }
}
When CapacityScheduler handles the request, it calls application.updateResourceRequests(ask) to put the request into a map, where it waits to be picked up on an NM heartbeat.
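The store-now, assign-on-heartbeat pattern can be sketched as follows. This is a deliberately simplified model and not the CapacityScheduler implementation: HeartbeatSketch and all of its methods are hypothetical, and queues, priorities, locality and blacklists are ignored. It only illustrates why, under this pattern, the container goes to whichever node with enough room reports in first, rather than to the node with the most free resources.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeartbeatSketch {
    // Pending container sizes in MB: filled by allocate(), drained on heartbeats.
    private final Deque<Integer> pendingMb = new ArrayDeque<>();
    // Free memory per node in MB.
    private final Map<String, Integer> freeMb = new LinkedHashMap<>();
    // Hosts that received a container, in assignment order.
    private final List<String> placements = new ArrayList<>();

    void addNode(String host, int mb) {
        freeMb.put(host, mb);
    }

    // Mirrors the idea of allocate(): only record the ask, place nothing yet.
    void allocate(int mb) {
        pendingMb.addLast(mb);
    }

    // Called when a node reports in. Hands the node the oldest pending
    // request if it fits; returns true if a container was assigned.
    boolean nodeHeartbeat(String host) {
        Integer ask = pendingMb.peekFirst();
        Integer free = freeMb.get(host);
        if (ask == null || free == null || free < ask) {
            return false;
        }
        pendingMb.removeFirst();
        freeMb.put(host, free - ask);
        placements.add(host);
        return true;
    }

    List<String> placements() {
        return placements;
    }

    public static void main(String[] args) {
        HeartbeatSketch s = new HeartbeatSketch();
        s.addNode("busy-node", 2048);
        s.addNode("idle-node", 8192);
        s.allocate(1024);             // the AM request is only recorded here
        s.nodeHeartbeat("busy-node"); // reports first and has enough room
        s.nodeHeartbeat("idle-node"); // nothing is left to hand out
        System.out.println(s.placements()); // prints [busy-node]
    }
}
```

In this model the busier node wins simply because its heartbeat arrived first, which matches the "effectively random" impression from the caller's point of view.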
The application object in the allocate code is a FiCaSchedulerApp, which actually corresponds to an application attempt. The code of updateResourceRequests is:
public synchronized void updateResourceRequests(
    List<ResourceRequest> requests) {
  if (!isStopped) {
    // AppSchedulingInfo.updateResourceRequests
    appSchedulingInfo.updateResourceRequests(requests, false);
  }
}
AppSchedulingInfo records all of the application's resource consumption, including the containers of this application that are running or have already completed.
synchronized public void updateResourceRequests(
    List<ResourceRequest> requests, boolean recoverPreemptedRequest) {
  // Update resource requests
  for (ResourceRequest request : requests) {
    Priority priority = request.getPriority();
    String resourceName = request.getResourceName();
    boolean updatePendingResources = false;
    ResourceRequest lastRequest = null;
    // If the request's ResourceName is ResourceRequest.ANY
    // Is the AM container the only request with ANY??? That can't be right
    if (resourceName.equals(ResourceRequest.ANY)) {
      ...
      // Set to true only for ResourceRequest.ANY??
      updatePendingResources = true;
      // Premature optimization?
      // Assumes that we won't see