CapacityScheduler -- ApplicationMaster resource allocation (based on Hadoop 2.7.6)
Resource allocation is passive: scheduling happens when a node sends a heartbeat (NODE_UPDATE), based on the resources that node reports.
First, a note: the resources an ApplicationMaster needs to start (memory and virtual cores) are initialized on the client side when the application is submitted (in the YARNRunner class); memory defaults to 1536 MB and virtual cores to 1.
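These defaults come from the MapReduce configuration (`yarn.app.mapreduce.am.resource.mb`, default 1536, and `yarn.app.mapreduce.am.resource.cpu-vcores`, default 1), resolved with the usual lookup-with-default pattern. A minimal sketch of that pattern, with a plain `Map` standing in for Hadoop's `Configuration` (class and method names here are illustrative, not Hadoop's):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: how YARNRunner-style AM resource defaults resolve. The real code
// reads these keys from a Hadoop Configuration object; a Map stands in here.
public class AmResourceDefaults {
    public static final String AM_MEMORY_KEY = "yarn.app.mapreduce.am.resource.mb";
    public static final String AM_VCORES_KEY = "yarn.app.mapreduce.am.resource.cpu-vcores";
    public static final int DEFAULT_AM_MEMORY_MB = 1536;
    public static final int DEFAULT_AM_VCORES = 1;

    // Return the configured value, or the default when the key is unset.
    static int getInt(Map<String, String> conf, String key, int dflt) {
        String v = conf.get(key);
        return v == null ? dflt : Integer.parseInt(v);
    }

    public static int amMemoryMb(Map<String, String> conf) {
        return getInt(conf, AM_MEMORY_KEY, DEFAULT_AM_MEMORY_MB);
    }

    public static int amVcores(Map<String, String> conf) {
        return getInt(conf, AM_VCORES_KEY, DEFAULT_AM_VCORES);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(amMemoryMb(conf) + " " + amVcores(conf)); // 1536 1
        conf.put(AM_MEMORY_KEY, "2048"); // explicit setting overrides the default
        System.out.println(amMemoryMb(conf)); // 2048
    }
}
```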
Code listing:
case NODE_UPDATE:
{
NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
RMNode node = nodeUpdatedEvent.getRMNode();
/**
 * Update node information:
 * 1. Process newly launched containers:
 *    trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
 *    job is to remove the container from containerAllocationExpirer monitoring, since the
 *    container is now running.
 * 2. Process completed containers:
 *    update the resource accounting in the queue, user, (FiCaSchedulerApp) application
 *    and (FiCaSchedulerNode) node.
 */
nodeUpdate(node);
/**
 * Whether to schedule asynchronously; defaults to false and is normally not set in
 * capacity-scheduler.xml.
 * Config key: yarn.scheduler.capacity.schedule-asynchronously.enable
 */
if (!scheduleAsynchronously) {
/**
 * Allocate resources to this node
 */
allocateContainersToNode(getNode(node.getNodeID()));
}
}
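When asynchronous scheduling is enabled, allocation is driven by a background thread instead of piggybacking on each node's heartbeat. A simplified sketch of the branch above (the interface and names here are illustrative; the real CapacityScheduler uses an internal AsyncScheduleThread):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: heartbeat-driven vs. asynchronous allocation. In synchronous mode
// (the default) the NODE_UPDATE handler allocates on the heartbeating node;
// in async mode a background thread performs allocation independently.
public class AsyncScheduleSketch {
    public interface Scheduler {
        void allocateContainersToNode(String nodeId);
    }

    public static void onNodeUpdate(Scheduler s, String nodeId, boolean scheduleAsynchronously) {
        // Mirrors the if (!scheduleAsynchronously) branch in the NODE_UPDATE case.
        if (!scheduleAsynchronously) {
            s.allocateContainersToNode(nodeId);
        }
    }

    public static void main(String[] args) {
        List<String> scheduled = new ArrayList<>();
        Scheduler s = scheduled::add;
        onNodeUpdate(s, "node1:8041", false); // sync: allocates on this heartbeat
        onNodeUpdate(s, "node2:8041", true);  // async: left to the background thread
        System.out.println(scheduled); // [node1:8041]
    }
}
```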
The NODE_UPDATE handler does two things:
1. Process the node's status update.
2. Allocate resources.
/**
 * 1. Process newly launched containers:
 *    trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
 *    job is to remove the container from containerAllocationExpirer monitoring, since the
 *    container is now running (the container was registered with containerAllocationExpirer
 *    while handling the APP_ATTEMPT_ADDED event).
 * 2. Process completed containers:
 *    update the resource accounting in the queue, user, (FiCaSchedulerApp) application and
 *    (FiCaSchedulerNode) node.
 * @param nm
 */
private synchronized void nodeUpdate(RMNode nm) {
if (LOG.isDebugEnabled()) {
LOG.debug("nodeUpdate: " + nm + " clusterResources: " + clusterResource);
}
FiCaSchedulerNode node = getNode(nm.getNodeID());
List<UpdatedContainerInfo> containerInfoList = nm.pullContainerUpdates();
List<ContainerStatus> newlyLaunchedContainers = new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers = new ArrayList<ContainerStatus>();
for(UpdatedContainerInfo containerInfo : containerInfoList) {
newlyLaunchedContainers.addAll(containerInfo.getNewlyLaunchedContainers());
completedContainers.addAll(containerInfo.getCompletedContainers());
}
// Processing the newly launched containers
for (ContainerStatus launchedContainer : newlyLaunchedContainers) {
/**
 * Trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main job
 * is to remove the container from containerAllocationExpirer monitoring, since the container
 * is now running (the container was registered with containerAllocationExpirer while
 * handling the APP_ATTEMPT_ADDED event).
 */
containerLaunchedOnNode(launchedContainer.getContainerId(), node);
}
// Process completed containers
for (ContainerStatus completedContainer : completedContainers) {
ContainerId containerId = completedContainer.getContainerId();
LOG.debug("Container FINISHED: " + containerId);
/**
 * Update the resource accounting in the queue, user, (FiCaSchedulerApp) application and
 * (FiCaSchedulerNode) node.
 */
completedContainer(getRMContainer(containerId),
completedContainer, RMContainerEventType.FINISHED);
}
// Now node data structures are upto date and ready for scheduling.
if(LOG.isDebugEnabled()) {
LOG.debug("Node being looked for scheduling " + nm
+ " availableResource: " + node.getAvailableResource());
}
}
Updating the node's information:
1. Process newly launched containers: trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, which removes the container from containerAllocationExpirer monitoring since the container is now running.
2. Process completed containers: update the resource accounting in the queue, user, (FiCaSchedulerApp) application and (FiCaSchedulerNode) node.
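The bookkeeping for a completed container amounts to returning its resources to the node's available pool and subtracting them from the usage counters. A simplified stand-in for Hadoop's Resource/Resources arithmetic (only the queue and node counters are shown; user and application usage are updated the same way):

```java
// Sketch of what nodeUpdate -> completedContainer ultimately does to the
// resource counters. Class and field names are illustrative, not Hadoop's.
public class CompletedContainerBookkeeping {
    public static final class Resource {
        public int memoryMb, vcores;
        public Resource(int m, int v) { memoryMb = m; vcores = v; }
    }

    static void addTo(Resource lhs, Resource rhs) {
        lhs.memoryMb += rhs.memoryMb; lhs.vcores += rhs.vcores;
    }

    static void subtractFrom(Resource lhs, Resource rhs) {
        lhs.memoryMb -= rhs.memoryMb; lhs.vcores -= rhs.vcores;
    }

    // When a container finishes, the node gets its resources back and the
    // queue's (and user's, and application's) usage drops accordingly.
    public static void releaseContainer(Resource nodeAvailable, Resource queueUsed,
                                        Resource container) {
        addTo(nodeAvailable, container);
        subtractFrom(queueUsed, container);
    }

    public static void main(String[] args) {
        Resource nodeAvail = new Resource(4096, 4);
        Resource queueUsed = new Resource(3072, 3);
        releaseContainer(nodeAvail, queueUsed, new Resource(1024, 1));
        System.out.println(nodeAvail.memoryMb + " " + queueUsed.memoryMb); // 5120 2048
    }
}
```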
Before looking at the allocation code, consider two questions:
1. Allocation works queue by queue, so how is a queue chosen (in what order, by what criteria)?
2. Once a queue is chosen, how are its applications chosen for allocation (in what order are the applications submitted to that queue served)?
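The answers, as the comments in the code below spell out, are: child queues are visited in order of available capacity (most headroom first, i.e. least used capacity first), and applications inside a leaf queue are served FIFO. A simplified sketch of the queue-ordering rule (class and method names are illustrative; the real ordering lives in ParentQueue's queue comparator):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: ParentQueue visits child queues sorted by used capacity, least-used
// first, so the queue with the most headroom gets the first shot at the node.
public class SchedulingOrderSketch {
    public static final class Queue {
        public final String name;
        public final float usedCapacity; // fraction of configured capacity in use
        public Queue(String n, float u) { name = n; usedCapacity = u; }
    }

    public static List<String> visitOrder(List<Queue> children) {
        List<Queue> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingDouble(q -> q.usedCapacity));
        List<String> names = new ArrayList<>();
        for (Queue q : sorted) names.add(q.name);
        return names;
    }

    public static void main(String[] args) {
        List<Queue> children = Arrays.asList(
            new Queue("a", 0.9f), new Queue("b", 0.1f), new Queue("c", 0.5f));
        System.out.println(visitOrder(children)); // [b, c, a]
    }
}
```

Within the chosen leaf queue, applications are simply iterated in submission (FIFO) order, so the earliest-submitted application with an outstanding request is matched first.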
/**
 * To keep things simple and focus on the main flow, reservations are ignored for now.
 */
@VisibleForTesting
public synchronized void allocateContainersToNode(FiCaSchedulerNode node) {
if (rmContext.isWorkPreservingRecoveryEnabled()
&& !rmContext.isSchedulerReadyForAllocatingContainers()) {
return;
}
/**
 * The node has not registered (or has been removed)
 */
if (!nodes.containsKey(node.getNodeID())) {
LOG.info("Skipping scheduling as the node " + node.getNodeID() +
" has been removed");
return;
}
// Assign new containers...
// 1. Check for reserved applications
// 2. Schedule if there are no reservations
/**
 * Check whether the node has a reservation; if so, try to satisfy it first.
 *
 * To keep things simple, the reservedContainer case is not covered here.
 */
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
FiCaSchedulerApp reservedApplication =
getCurrentAttemptForContainer(reservedContainer.getContainerId());
// Try to fulfill the reservation
LOG.info("Trying to fulfill reservation for application " +
reservedApplication.getApplicationId() + " on node: " +
node.getNodeID());
LeafQueue queue = ((LeafQueue)reservedApplication.getQueue());
CSAssignment assignment =
queue.assignContainers(
clusterResource,
node,
new ResourceLimits(labelManager.getResourceByLabel(
RMNodeLabelsManager.NO_LABEL, clusterResource)));
RMContainer excessReservation = assignment.getExcessReservation();
if (excessReservation != null) {
Container container = excessReservation.getContainer();
queue.completedContainer(
clusterResource, assignment.getApplication(), node,
excessReservation,
SchedulerUtils.createAbnormalContainerStatus(
container.getId(),
SchedulerUtils.UNRESERVED_CONTAINER),
RMContainerEventType.RELEASED, null, true);
}
}
/**
 * minimumAllocation holds the minimum memory and minimum virtual cores; it is initialized
 * in CapacityScheduler.initScheduler:
 * minimum memory: yarn.scheduler.minimum-allocation-mb, default 1024 MB
 * minimum vcores: yarn.scheduler.minimum-allocation-vcores, default 1
 */
// Try to schedule more if there are no reservations to fulfill
if (node.getReservedContainer() == null) {
/**
 * Check whether the node's available resources are sufficient:
 * node.getAvailableResource() / minimumAllocation
 */
if (calculator.computeAvailableContainers(node.getAvailableResource(),
minimumAllocation) > 0) {
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to schedule on node: " + node.getNodeName() +
", available: " + node.getAvailableResource());
}
/**
 * Two questions to keep in mind here:
 * 1. Matching starts from root; which queue is tried first?
 *    Child queues are traversed in order of available capacity, most available first.
 * 2. In what order are demands matched within a queue?
 *    Applications within a queue are matched in FIFO order.
 *
 * Note: assignContainers starts matching from the root. assignContainers and
 * assignContainersToChildQueues call each other recursively, until a leaf queue's
 * assignContainers performs the actual allocation.
 */
root.assignContainers(
clusterResource,
node,
new ResourceLimits(labelManager.getResourceByLabel(
RMNodeLabelsManager.NO_LABEL, clusterResource)));
}
} else {
LOG.info("Skipping scheduling since node " + node.getNodeID() +
" is reserved by application " +
node.getReservedContainer().getContainerId().getApplicationAttemptId()
);
}
}
What allocateContainersToNode does:
Starting from the root queue, assignContainers is called all the way down to a leaf queue, where the actual allocation happens. During this descent, assignContainers and ParentQueue.assignContainersToChildQueues call each other recursively.
The main checks deciding whether allocation can proceed are:
1. The available resources reported by the node must be at least the configured minimumAllocation.
2. The queue's total usage after the allocation must not exceed the queue's resource limit.
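These two checks reduce to simple arithmetic. A sketch using the DefaultResourceCalculator's memory-only math (names are illustrative; DominantResourceCalculator would compare utilization ratios across memory and vcores instead):

```java
// Sketch of the two admission checks:
// 1. the node must fit at least one minimumAllocation,
// 2. the queue's usage after the allocation must stay under its limit.
public class AllocationChecks {
    // Mirrors DefaultResourceCalculator.computeAvailableContainers:
    // integer division of available memory by the minimum allocation.
    public static int computeAvailableContainers(int availableMb, int minimumAllocationMb) {
        return availableMb / minimumAllocationMb;
    }

    // Mirrors the canAssign-style check: the node can host at least one container.
    public static boolean canAssign(int nodeAvailableMb, int minimumAllocationMb) {
        return computeAvailableContainers(nodeAvailableMb, minimumAllocationMb) > 0;
    }

    // Mirrors the queue-limit check: usage after allocation must not exceed the limit.
    public static boolean withinQueueLimit(int queueUsedMb, int requestMb, int queueLimitMb) {
        return queueUsedMb + requestMb <= queueLimitMb;
    }

    public static void main(String[] args) {
        System.out.println(canAssign(512, 1024));   // false: below minimum-allocation-mb
        System.out.println(canAssign(2048, 1024));  // true: room for two minimum containers
        System.out.println(withinQueueLimit(7168, 2048, 8192)); // false: would exceed limit
    }
}
```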
Back to the main flow:
@Override
public synchronized CSAssignment ParentQueue.assignContainers(Resource clusterResource,
FiCaSchedulerNode node, ResourceLimits resourceLimits) {
CSAssignment assignment =
new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
Set<String> nodeLabels = node.getLabels();
/**
 * Check whether the node's labels match the queue's:
 * 1. A queue labeled * can access any node.
 * 2. A node with no label can be accessed by any queue.
 * 3. A queue with a specific label can only access nodes carrying that label.
 */
if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels, nodeLabels)) {
return assignment;
}
/**
 * Check whether the node's available resources meet the minimumAllocation requirement,
 * i.e. whether node.getAvailableResource() can hold at least one minimumAllocation:
 * 1. DefaultResourceCalculator computes this directly from memory and does not need
 *    clusterResource.
 * 2. DominantResourceCalculator works with resource utilization ratios and therefore
 *    needs clusterResource.
 */
while (canAssign(clusterResource, node)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to assign containers to child-queue of "
+ getQueueName());
}
/**
 * Check whether the current queue has hit its resource limit, i.e. whether this
 * queue can still be allocated to
 */
if (!super.canAssignToThisQue