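Overview
The core scheduling flow FairScheduler uses to allocate containers
The core scheduling flow is as follows:
- The scheduler locks the FairScheduler object to avoid conflicts on its core data structures.
- The scheduler picks one node in the cluster and starts from ROOT, the root of the queue tree. At each level it selects a child queue according to the fair policy. At the leaf queue it selects an App according to the fair policy and looks for a suitable block of resources for that App on the node.
Each level of the queue tree goes through the following steps:
- Queue pre-check: check whether the queue's resource usage already exceeds the queue's quota.
- Sort child queues/Apps: order the child queues/Apps according to the fair scheduling policy.
- Recursively schedule the child queues/Apps.
For example, if one scheduling pass follows the path ROOT -> ParentQueueA -> LeafQueueA1 -> App11, that pass allocates a Container on the node to App11.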
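The per-level steps above (pre-check, sort, recurse) can be sketched roughly as follows. This is a minimal illustration and not the Hadoop code: QueueTreeSketch, SchedNode, POLICY, and the integer resource units are hypothetical names and simplifications, and "least usage first" merely stands in for the real fair policy.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, simplified model of the queue tree; not the Hadoop classes. */
class QueueTreeSketch {

  /** A parent queue, a leaf queue, or an app, reduced to the fields the walk needs. */
  static class SchedNode {
    final String name;
    final boolean isApp;
    final int usage;   // resources currently used, in abstract units
    final int quota;   // upper limit checked by the pre-check
    final int demand;  // size of the pending request; used for apps only
    final List<SchedNode> children = new ArrayList<>();

    SchedNode(String name, boolean isApp, int usage, int quota, int demand) {
      this.name = name;
      this.isApp = isApp;
      this.usage = usage;
      this.quota = quota;
      this.demand = demand;
    }

    SchedNode add(SchedNode child) {
      children.add(child);
      return this;
    }
  }

  /** Stand-in for the fair policy (an assumption): least usage first, then name. */
  static final Comparator<SchedNode> POLICY =
      Comparator.comparingInt((SchedNode n) -> n.usage).thenComparing(n -> n.name);

  /** Returns the path of the assignment, or null if nothing fits on this node. */
  static String assign(SchedNode node, int nodeFree) {
    // Pre-check: skip anything that has reached or exceeded its quota.
    if (node.usage >= node.quota) {
      return null;
    }
    if (node.isApp) {
      boolean fits = node.demand > 0 && node.demand <= nodeFree;
      return fits ? node.name : null;
    }
    // Sort the children by the policy, then recurse in that order.
    List<SchedNode> sorted = new ArrayList<>(node.children);
    sorted.sort(POLICY);
    for (SchedNode child : sorted) {
      String path = assign(child, nodeFree);
      if (path != null) {
        return node.name + " -> " + path;
      }
    }
    return null;
  }

  /** Builds a small tree shaped like the example path in the text. */
  static SchedNode demoTree() {
    SchedNode a1 = new SchedNode("LeafQueueA1", false, 4, 30, 0)
        .add(new SchedNode("App11", true, 1, 10, 2))
        .add(new SchedNode("App12", true, 3, 10, 2));
    SchedNode a2 = new SchedNode("LeafQueueA2", false, 6, 20, 0);
    SchedNode a = new SchedNode("ParentQueueA", false, 10, 50, 0).add(a1).add(a2);
    SchedNode b = new SchedNode("ParentQueueB", false, 20, 50, 0);
    return new SchedNode("ROOT", false, 30, 100, 0).add(a).add(b);
  }

  public static void main(String[] args) {
    // The node has 4 free units; each app asks for 2.
    System.out.println(assign(demoTree(), 4));
  }
}
```

With the demo tree, the walk descends into the least-used child at each level and ends at App11. If the node has too little free capacity for any request, assign returns null and nothing is allocated in that pass.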
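FairScheduler architecture
The fair scheduler is built from several threads that cooperate asynchronously. To keep data consistent during scheduling, the main code paths take the FairScheduler object lock. The core scheduling flow itself runs on a single thread, so Containers are allocated serially, and this is the main reason the scheduler becomes a performance bottleneck.
- scheduler Lock: the FairScheduler object lock.
- AllocationFileLoaderService: hot-reloads the fair policy configuration file and updates the queue data structures.
- Continuous Scheduling Thread: the core scheduling thread when continuous scheduling is enabled; it repeatedly runs the core scheduling flow that allocates containers.
- Update Thread: updates the resource demand of the queues, runs the Container preemption flow, and so on.
- Scheduler Event Dispatcher Thread: handles scheduler events such as App added, App finished, node added, and node removed.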
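To make the effect of the shared object lock concrete, here is a small sketch in which an event thread and a scheduling thread both go through synchronized methods of one object. SchedulerLockSketch and all of its members are hypothetical names for illustration; the real FairScheduler is far more involved.

```java
/** Hypothetical illustration of one object lock shared by several threads. */
class SchedulerLockSketch {
  private int freeContainers = 0;
  private int allocated = 0;

  /** Event path: a node joins and contributes capacity. */
  synchronized void handleNodeAdded(int containers) {
    freeContainers += containers;
  }

  /** Scheduling path: allocate one container if any capacity is left. */
  synchronized boolean attemptScheduling() {
    if (freeContainers == 0) {
      return false;
    }
    freeContainers--;
    allocated++;
    return true;
  }

  synchronized int free() {
    return freeContainers;
  }

  synchronized int allocated() {
    return allocated;
  }

  /** Runs both threads against the same object; returns {free, allocated}. */
  static int[] runDemo(int total) {
    SchedulerLockSketch scheduler = new SchedulerLockSketch();
    Thread events = new Thread(() -> {
      for (int i = 0; i < total; i++) {
        scheduler.handleNodeAdded(1);
      }
    });
    Thread scheduling = new Thread(() -> {
      int done = 0;
      while (done < total) {
        if (scheduler.attemptScheduling()) {
          done++;
        } else {
          Thread.yield(); // nothing to allocate yet; let the event thread run
        }
      }
    });
    events.start();
    scheduling.start();
    try {
      events.join();
      scheduling.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new int[] {scheduler.free(), scheduler.allocated()};
  }

  public static void main(String[] args) {
    int[] result = runDemo(1000);
    System.out.println("free=" + result[0] + ", allocated=" + result[1]);
  }
}
```

Because every method takes the same lock, the counters stay consistent, but only one thread makes progress at a time. That is the trade-off described above: consistency is simple to reason about, and allocation is serial.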
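FairScheduler resource scheduling modes
FairScheduler supports two resource scheduling modes: heartbeat scheduling and continuous scheduling.
Heartbeat scheduling: a NodeManager reports its resource status to the ResourceManager (for example, currently available resources, resources in use, and resources that have been released). This RPC causes the ResourceManager to call nodeUpdate(), which performs one round of scheduling for that node: it takes a suitable application resource request from the queues it maintains and runs it on that NodeManager. "Suitable" means the request violates neither the queue's maximum resource limit nor the NodeManager's remaining capacity. The main drawback of this mode is latency: even when a NodeManager already has free resources, scheduling happens only after its next heartbeat is sent.
Continuous scheduling: a dedicated thread, ContinuousSchedulingThread, schedules resources continuously and asynchronously to the NodeManager heartbeats, so scheduling does not have to wait for a heartbeat to arrive.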
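The difference between the two modes is only in what triggers a scheduling attempt for a node. The sketch below models that with plain collections. The class SchedulingTriggerSketch, its fields, and the integer resource units are hypothetical; only the method names nodeUpdate and continuousSchedulingAttempt echo names mentioned in this post, and their bodies here are simplified stand-ins.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the two triggers; not the Hadoop implementation. */
class SchedulingTriggerSketch {
  private final Map<String, Integer> freeByNode = new LinkedHashMap<>();
  private final Deque<Integer> pending = new ArrayDeque<>();
  private final int queueMax;
  private int queueUsage = 0;
  final List<String> placements = new ArrayList<>();

  SchedulingTriggerSketch(int queueMax) {
    this.queueMax = queueMax;
  }

  synchronized void submit(int requestSize) {
    pending.add(requestSize);
  }

  synchronized void registerNode(String node, int free) {
    freeByNode.put(node, free);
  }

  /** Heartbeat mode: scheduling for a node runs only when that node reports in. */
  synchronized void nodeUpdate(String node, int reportedFree) {
    freeByNode.put(node, reportedFree);
    attemptScheduling(node);
  }

  /** Continuous mode: a separate thread would call this in a loop over all nodes. */
  synchronized void continuousSchedulingAttempt() {
    for (String node : new ArrayList<>(freeByNode.keySet())) {
      attemptScheduling(node);
    }
  }

  /** Places requests while they respect both the queue limit and the node capacity. */
  private void attemptScheduling(String node) {
    while (!pending.isEmpty()) {
      int request = pending.peek();
      int free = freeByNode.get(node);
      boolean withinQueueLimit = queueUsage + request <= queueMax;
      boolean fitsOnNode = request <= free;
      if (!withinQueueLimit || !fitsOnNode) {
        return;
      }
      pending.poll();
      queueUsage += request;
      freeByNode.put(node, free - request);
      placements.add(request + "@" + node);
    }
  }

  static List<String> demo() {
    SchedulingTriggerSketch s = new SchedulingTriggerSketch(10);
    s.registerNode("n1", 0);
    s.registerNode("n2", 3);
    s.submit(2);
    s.submit(2);
    s.submit(2);
    s.nodeUpdate("n1", 4);            // heartbeat from n1: two requests fit
    s.continuousSchedulingAttempt();  // sweep over all nodes: n2 takes the third
    return s.placements;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In the demo, n2 has free capacity the whole time, yet nothing is placed on it until something triggers an attempt for that node. In heartbeat mode that trigger is n2's own heartbeat; in continuous mode it is the next sweep.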
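The FairSharePolicy comparator
FairScheduler sorts queues/apps using the FairSharePolicy comparator.
Given two Schedulables (queues or apps), the ordering rules are:
1. If one needs resources and the other does not, the one that needs resources goes first.
2. If both need resources, compare the ratio of used memory to minShare; the smaller ratio goes first (that is, try to bring each one up to its minShare).
3. If that does not decide the order, compare usage divided by weight; the smaller value goes first, so a larger weight or a smaller usage is favored.
4. If they are still equal, the one that started earlier goes first, and then the one whose name (for example, the app id) sorts lower goes first.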
  /**
   * Compare Schedulables mainly via fair share usage to meet fairness.
   * Specifically, it goes through following four steps.
   *
   * 1. Compare demands. Schedulables without resource demand get lower priority
   * than ones who have demands.
   *
   * 2. Compare min share usage. Schedulables below their min share are compared
   * by how far below it they are as a ratio. For example, if job A has 8 out
   * of a min share of 10 tasks and job B has 50 out of a min share of 100,
   * then job B is scheduled next, because B is at 50% of its min share and A
   * is at 80% of its min share.
   *
   * 3. Compare fair share usage. Schedulables above their min share are
   * compared by fair share usage by checking (resource usage / weight).
   * If all weights are equal, slots are given to the job with the fewest tasks;
   * otherwise, jobs with more weight get proportionally more slots. If weight
   * equals to 0, we can't compare Schedulables by (resource usage/weight).
   * There are two situations: 1)All weights equal to 0, slots are given
   * to one with less resource usage. 2)Only one of weight equals to 0, slots
   * are given to the one with non-zero weight.
   *
   * 4. Break the tie by compare submit time and job name.
   */
  private static class FairShareComparator implements Comparator<Schedulable>,
      Serializable {
    private static final long serialVersionUID = 5564969375856699313L;

    @Override
    public int compare(Schedulable s1, Schedulable s2) {
      int res = compareDemand(s1, s2);

      // Share resource usages to avoid duplicate calculation
      Resource resourceUsage1 = null;
      Resource resourceUsage2 = null;

      if (res == 0) {
        resourceUsage1 = s1.getResourceUsage();
        resourceUsage2 = s2.getResourceUsage();
        res = compareMinShareUsage(s1, s2, resourceUsage1, resourceUsage2);
      }

      if (res == 0) {
        res = compareFairShareUsage(s1, s2, resourceUsage1, resourceUsage2);
      }

      // Break the tie by submit time
      if (res == 0) {
        res = (int) Math.signum(s1.getStartTime() - s2.getStartTime());
      }

      // Break the tie by job name
      if (res == 0) {
        res = s1.getName().compareTo(s2.getName());
      }

      return res;
    }
  }
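The four comparison steps can be tried out on a single-resource toy model. The sketch below is not the Hadoop comparator: FairShareOrderSketch and Item are hypothetical, resources are plain long values, and weights are assumed to be strictly positive, so the zero-weight cases described in the Javadoc are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical single-resource model of the four comparison steps. */
class FairShareOrderSketch {

  static class Item {
    final String name;
    final long demand;
    final long usage;
    final long minShare;
    final double weight;  // assumed to be > 0 in this sketch
    final long startTime;

    Item(String name, long demand, long usage, long minShare, double weight, long startTime) {
      this.name = name;
      this.demand = demand;
      this.usage = usage;
      this.minShare = minShare;
      this.weight = weight;
      this.startTime = startTime;
    }
  }

  static final Comparator<Item> FAIR_SHARE = (a, b) -> {
    // Step 1: an item without demand sorts after an item with demand.
    int res = Boolean.compare(a.demand == 0, b.demand == 0);
    if (res == 0) {
      // Step 2: items below their min share come first, ordered by usage / minShare.
      boolean aNeedy = a.usage < a.minShare;
      boolean bNeedy = b.usage < b.minShare;
      if (aNeedy && !bNeedy) {
        res = -1;
      } else if (!aNeedy && bNeedy) {
        res = 1;
      } else if (aNeedy && bNeedy) {
        res = Double.compare((double) a.usage / a.minShare, (double) b.usage / b.minShare);
      }
    }
    if (res == 0) {
      // Step 3: smaller usage / weight first.
      res = Double.compare(a.usage / a.weight, b.usage / b.weight);
    }
    if (res == 0) {
      // Step 4: earlier start time first, then name.
      res = Long.compare(a.startTime, b.startTime);
    }
    if (res == 0) {
      res = a.name.compareTo(b.name);
    }
    return res;
  };

  /** Sorts the items and returns their names in scheduling order. */
  static List<String> order(Item... items) {
    List<Item> sorted = new ArrayList<>(Arrays.asList(items));
    sorted.sort(FAIR_SHARE);
    List<String> names = new ArrayList<>();
    for (Item item : sorted) {
      names.add(item.name);
    }
    return names;
  }

  static List<String> demo() {
    return order(
        new Item("idle", 0, 0, 0, 1.0, 0),   // no demand
        new Item("A", 5, 8, 10, 1.0, 0),     // at 80% of its min share
        new Item("B", 5, 50, 100, 1.0, 0),   // at 50% of its min share
        new Item("C", 5, 20, 0, 1.0, 0),     // usage / weight = 20
        new Item("D", 5, 30, 0, 2.0, 0));    // usage / weight = 15
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The demo reuses the numbers from the Javadoc example: B sits at 50% of its min share and A at 80%, so B is ordered before A. C and D are both above their min share, so they are ordered by usage divided by weight, and the item without demand comes last.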
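Heartbeat scheduling source analysis
Omitted here; see the blog post "yarn3.2源码分析之NM与RM通信完成心跳调度" (YARN 3.2 source analysis: heartbeat scheduling through NM-RM communication).
Continuous scheduling source analysis
ContinuousSchedulingThread
This is the thread used for continuous scheduling. Continuous scheduling is disabled by default; the thread is started only when yarn.scheduler.fair.continuous-scheduling-enabled is set to true. Continuous scheduling is no longer recommended, because lock contention can make resource scheduling slow. Batch allocation, enabled with the yarn.scheduler.fair.assignmultiple property, can be used as an alternative.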
  /**
   * Thread which attempts scheduling resources continuously,
   * asynchronous to the node heartbeats.
   */
  @Deprecated
  private class ContinuousSchedulingThread extends Thread {

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          continuousSchedulingAttempt();
          Thread.sleep(getContinuousSchedulingSleepMs());
        } catch (InterruptedException e) {
          LOG.warn("Continuous scheduling thread interrupted. Exiting.", e);
          return;
        }
      }
    }
  }
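The loop above follows a common pattern: do one attempt, sleep, and exit when the thread is interrupted. A self-contained sketch of that pattern, with the scheduling attempt replaced by an arbitrary Runnable, might look like this. ContinuousLoopSketch, newSchedulingThread, and the 5 ms interval are hypothetical choices for the demo, not values taken from Hadoop.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-alone version of the attempt-sleep-repeat loop. */
class ContinuousLoopSketch {

  /** Builds a thread that runs the attempt, sleeps, and stops on interrupt. */
  static Thread newSchedulingThread(Runnable attempt, long sleepMs) {
    Thread thread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          attempt.run();
          Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
          // sleep() clears the interrupt flag when it throws, so exit explicitly.
          return;
        }
      }
    }, "continuous-scheduling-sketch");
    thread.setDaemon(true);
    return thread;
  }

  /** Starts the loop, waits for a few attempts, then stops it. Returns the attempt count. */
  static int runDemo() {
    AtomicInteger attempts = new AtomicInteger();
    Thread thread = newSchedulingThread(attempts::incrementAndGet, 5);
    thread.start();
    try {
      while (attempts.get() < 3) {
        Thread.sleep(1);
      }
      thread.interrupt();
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return thread.isAlive() ? -1 : attempts.get();
  }

  public static void main(String[] args) {
    System.out.println("attempts before stop: " + runDemo());
  }
}
```

The explicit return in the catch block matters: an interrupt that arrives during sleep surfaces as an InterruptedException and leaves the interrupt flag cleared, so the while condition alone would not end the loop.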
The continuousSchedulingAttempt() method