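Overview
FairScheduler enables preemptive scheduling when yarn.scheduler.fair.preemption is set to true. The default is false, i.e. preemption is disabled.
When FairScheduler computes how much a queue may preempt from other queues, two rules apply. If the queue's usage has stayed below its min share for longer than the min share preemption timeout, the amount to preempt is the difference between its current usage and its min share. If the queue's usage has stayed below its fair share for longer than fairSharePreemptionTimeout, the amount to preempt is whatever is needed to bring it up to its fair share. If both conditions hold, the larger of the two amounts is preempted.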
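The rule above can be sketched with plain numbers. The class and method names below are hypothetical, resources are reduced to a single long (for example MB of memory), and the two timeout checks are passed in as booleans. It only illustrates the "larger of the two shortfalls" rule and is not FairScheduler code.

```java
// Hypothetical illustration of the rule described above; not part of Hadoop.
final class PreemptionAmountSketch {

  /**
   * @param usage             resources the queue currently uses
   * @param minShare          the queue's configured min share
   * @param fairShare         the queue's computed fair share
   * @param minShareTimedOut  usage has been below minShare longer than the min share timeout
   * @param fairShareTimedOut usage has been below fairShare longer than fairSharePreemptionTimeout
   * @return the amount the queue may preempt from other queues
   */
  static long amountToPreempt(long usage, long minShare, long fairShare,
      boolean minShareTimedOut, boolean fairShareTimedOut) {
    // Shortfall against min share, counted only once its timeout has expired.
    long forMinShare = minShareTimedOut ? Math.max(0, minShare - usage) : 0;
    // Shortfall against fair share, counted only once its timeout has expired.
    long forFairShare = fairShareTimedOut ? Math.max(0, fairShare - usage) : 0;
    // If both apply, the larger amount wins.
    return Math.max(forMinShare, forFairShare);
  }
}
```

For a queue using 2048 with a min share of 4096 and a fair share of 8192, the sketch yields 2048 when only the min share timeout has expired and 6144 when both have.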
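Preemption is carried out by two cooperating threads:
- UpdateThread: computes the fairShare of every queue and app, then computes each app's fairshareStarvation and minshareStarvation, and records the apps that are starved.
- FSPreemptionThread: processes the starved apps. For a given app it identifies a set of containers that can be preempted and, when necessary, kills them.
FSPreemptionThread
FairScheduler starts FSPreemptionThread when its service starts.
The background FSPreemptionThread is created only if preemption is enabled.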
if (this.conf.getPreemptionEnabled()) {
  createPreemptionThread();
}
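Logic of FSPreemptionThread's run() method
- FSStarvedApps, held by the FairScheduler context FSContext, maintains a PriorityBlockingQueue<FSAppAttempt> of starved apps. FSPreemptionThread takes starved apps from this queue.
- identifyContainersToPreempt() selects, for the given app, a set of containers whose preemption would satisfy the app's demand, and returns them.
- preemptContainers() schedules a task that kills the containers marked for preemption if they have still not been released once the waitTimeBeforeKill delay (the warnTimeBeforeKill field in the code) has elapsed.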
public void run() {
  while (!Thread.interrupted()) {
    try {
      // FSStarvedApps (held by FSContext) maintains a
      // PriorityBlockingQueue<FSAppAttempt> of starved apps.
      // take() blocks until a starved app is available.
      FSAppAttempt starvedApp = context.getStarvedApps().take();
      // Hold the scheduler readlock so this is not concurrent with the
      // update thread.
      schedulerReadLock.lock();
      try {
        // identifyContainersToPreempt(): picks containers whose preemption
        // would satisfy the starved app's demand and returns them.
        // preemptContainers(): schedules the kill of those containers after
        // the configured delay.
        preemptContainers(identifyContainersToPreempt(starvedApp));
      } finally {
        schedulerReadLock.unlock();
      }
      starvedApp.preemptionTriggered(delayBeforeNextStarvationCheck);
    } catch (InterruptedException e) {
      LOG.info("Preemption thread interrupted! Exiting.");
      Thread.currentThread().interrupt();
    }
  }
}
preemptContainers() method
If a container has been marked for preemption and its own ApplicationMaster has still not released it once the kill delay has elapsed, the container is killed directly.
private void preemptContainers(List<RMContainer> containers) {
  // Schedule timer task to kill containers
  preemptionTimer.schedule(
      new PreemptContainersTask(containers), warnTimeBeforeKill);
}
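The delayed kill relies on the standard java.util.Timer pattern: schedule(task, delay) returns immediately and the task runs once on the timer thread after the delay. The sketch below is a minimal, self-contained illustration of that pattern. The class name, the use of container names as strings and the latch that makes the caller wait are all assumptions added for demonstration; in the listing above the scheduling call does not block.

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a one-shot delayed task; not Hadoop code.
final class DelayedKillSketch {

  /** Schedules a task that "kills" all containers after delayMs, then waits for it. */
  static List<String> killAfterDelay(List<String> containers, long delayMs) {
    List<String> killed = new CopyOnWriteArrayList<>();
    CountDownLatch done = new CountDownLatch(1);
    Timer timer = new Timer("preemption-timer-sketch", true); // daemon thread

    // schedule() returns at once; run() executes on the timer thread later.
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        killed.addAll(containers); // stands in for completedContainer(..., KILL)
        done.countDown();
      }
    }, delayMs);

    try {
      // Only for the demonstration: wait until the task has run.
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      timer.cancel();
    }
    return killed;
  }
}
```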
PreemptContainersTask
Performs the actual container kill.
private class PreemptContainersTask extends TimerTask {
  private final List<RMContainer> containers;

  PreemptContainersTask(List<RMContainer> containers) {
    this.containers = containers;
  }

  @Override
  public void run() {
    for (RMContainer container : containers) {
      ContainerStatus status = SchedulerUtils.createPreemptedContainerStatus(
          container.getContainerId(), SchedulerUtils.PREEMPTED_CONTAINER);
      LOG.info("Killing container " + container);
      scheduler.completedContainer(
          container, status, RMContainerEventType.KILL);
    }
  }
}
UpdateThread
AbstractYarnScheduler starts UpdateThread when its service starts. UpdateThread is an inner class of AbstractYarnScheduler.
/**
 * Thread which calls {@link #update()} every
 * <code>updateInterval</code> milliseconds.
 */
private class UpdateThread extends Thread {
  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        synchronized (updateThreadMonitor) {
          updateThreadMonitor.wait(updateInterval);
        }
        update();
      } catch (InterruptedException ie) {
        LOG.warn("Scheduler UpdateThread interrupted. Exiting.");
        return;
      } catch (Exception e) {
        LOG.error("Exception in scheduler UpdateThread", e);
      }
    }
  }
}
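The loop above waits on a monitor with a timeout rather than sleeping, which means it wakes up either when the interval expires or when another thread notifies the monitor. The following sketch reproduces that pattern in isolation. The class, the wakeUp() hook and the runUntilFirstUpdate() helper are hypothetical names introduced here, and the counter merely stands in for update().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a periodic worker that can be woken early; not Hadoop code.
final class PeriodicUpdaterSketch {
  private final Object monitor = new Object();
  private final AtomicInteger updates = new AtomicInteger();
  private final CountDownLatch firstUpdate = new CountDownLatch(1);
  private final long intervalMs;
  private final Thread thread;

  PeriodicUpdaterSketch(long intervalMs) {
    this.intervalMs = intervalMs;
    this.thread = new Thread(this::loop, "update-thread-sketch");
    this.thread.setDaemon(true);
  }

  private void loop() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        synchronized (monitor) {
          // Returns after intervalMs, or earlier if wakeUp() notifies.
          monitor.wait(intervalMs);
        }
        updates.incrementAndGet(); // stands in for update()
        firstUpdate.countDown();
      } catch (InterruptedException ie) {
        return; // interrupted: leave the loop, as in the listing
      }
    }
  }

  /** Hypothetical hook: lets another thread request an immediate update. */
  void wakeUp() {
    synchronized (monitor) {
      monitor.notifyAll();
    }
  }

  /** Starts the thread, waits for at least one update, stops it, returns the count. */
  static int runUntilFirstUpdate(long intervalMs) {
    PeriodicUpdaterSketch updater = new PeriodicUpdaterSketch(intervalMs);
    updater.thread.start();
    try {
      updater.firstUpdate.await(5, TimeUnit.SECONDS);
      updater.thread.interrupt();
      updater.thread.join(5000);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    return updater.updates.get();
  }
}
```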
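FairScheduler#update() method
FairScheduler uses this method to recompute its internal variables, including per-job weights, fair shares, used resources and per-job resource demand.
- Recursively compute the resource demand of each queue and its child queues
- Recursively compute the fairShare of each queue and its child queues/apps
- Update minshareStarvation and fairshareStarvation for every FSAppAttempt in the leaf queues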
/**
 * Recompute the internal variables used by the scheduler - per-job weights,
 * fair shares, deficits, minimum slot allocations, and amount of used and
 * required resources per job.
 */
@VisibleForTesting
@Override
public void update() {
  // Storing start time for fsOpDurations
  long start = getClock().getTime();
  FSQueue rootQueue = queueMgr.getRootQueue();
  // Update demands and fairshares
  writeLock.lock();
  try {
    // Recursively update demands for all queues
    rootQueue.updateDemand();
    // Starting at the root queue, recursively compute the fairShare of each
    // queue and its child queues/apps. The root queue's fairShare is the
    // total resource of the cluster.
    rootQueue.update(getClusterResource());
    // Update metrics
    updateRootQueueMetrics();
  } finally {
    writeLock.unlock();
  }
  readLock.lock();
  try {
    // Update starvation stats and identify starved applications
    if (shouldAttemptPreemption()) {
      for (FSLeafQueue queue : queueMgr.getLeafQueues()) {
        // Update minshareStarvation and fairshareStarvation for every
        // FSAppAttempt in this leaf queue
        queue.updateStarvedApps();
      }
    }
    // Log debug information
    if (STATE_DUMP_LOG.isDebugEnabled()) {
      if (--updatesToSkipForDebug < 0) {
        updatesToSkipForDebug = UPDATE_DEBUG_FREQUENCY;
        dumpSchedulerState();
      }
    }
  } finally {
    readLock.unlock();
  }
  fsOpDurations.addUpdateThreadRunDuration(getClock().getTime() - start);
}
Recursively computing and updating the fairShare of queues and their child queues/apps, starting at the root queue
FSQueue#update() method
/**
 * Set the queue's fairshare and update the demand/fairshare of child
 * queues/applications.
 *
 * To be called holding the scheduler writelock.
 *
 * @param fairShare
 */
public void update(Resource fairShare) {
  setFairShare(fairShare);
  updateInternal();
}
FSParentQueue#updateInternal() method
Computes the fairShare of all child queues according to the configured scheduling policy. FairScheduler supports the policies fair, fifo and drf. The default is fair.
void updateInternal() {
  readLock.lock();
  try {
    policy.computeShares(childQueues, getFairShare());
    for (FSQueue childQueue : childQueues) {
      childQueue.getMetrics().setFairShare(childQueue.getFairShare());
      childQueue.updateInternal();
    }
  } finally {
    readLock.unlock();
  }
}
FSLeafQueue#updateInternal() method
Computes the fairShare of all runnable apps in the leaf queue according to the configured scheduling policy.
@Override
void updateInternal() {
  readLock.lock();
  try {
    policy.computeShares(runnableApps, getFairShare());
  } finally {
    readLock.unlock();
  }
}
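The recursion in update() and updateInternal() can be pictured with a deliberately simplified tree. The sketch below uses hypothetical names, represents a share as a single double, and splits a parent's share among its children purely in proportion to weight. Min share, max share and the pluggable policy are left out, so it shows only the top-down propagation, not the real share computation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of top-down share propagation; not Hadoop code.
final class QueueTreeSketch {

  static final class Node {
    final String name;
    final double weight;
    final List<Node> children = new ArrayList<>();
    double fairShare;

    Node(String name, double weight) {
      this.name = name;
      this.weight = weight;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }

    /** Sets this node's share, then splits it among the children by weight. */
    void update(double share) {
      fairShare = share;
      double totalWeight = 0;
      for (Node child : children) {
        totalWeight += child.weight;
      }
      for (Node child : children) {
        child.update(share * child.weight / totalWeight);
      }
    }
  }

  /** Builds root -> {a (weight 1), b (weight 3) -> {app1, app2}} and distributes 100. */
  static double[] demo() {
    Node a = new Node("root.a", 1.0);
    Node b = new Node("root.b", 3.0);
    Node app1 = new Node("app1", 1.0);
    Node app2 = new Node("app2", 1.0);
    b.add(app1).add(app2);
    Node root = new Node("root", 1.0).add(a).add(b);
    root.update(100.0); // the root's share is the whole cluster
    return new double[] {a.fairShare, b.fairShare, app1.fairShare, app2.fairShare};
  }
}
```

In this example root.a receives 25, root.b receives 75, and the two equally weighted apps under root.b receive 37.5 each.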
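How FairSharePolicy computes fairShare
Every Schedulable has three share attributes, minShare, maxShare and fairShare, plus a weight. For a queue, minShare and maxShare are the minResources and maxResources configured in fair-scheduler.xml, and the weight is configured directly as well. For an FSAppAttempt, minShare always returns 0, maxShare always returns Long.MAX_VALUE, and weight returns 1.0 unless yarn.scheduler.fair.sizebasedweight=true is configured, which means that by default all apps carry the same weight.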
@Override
public void computeShares(Collection<? extends Schedulable> schedulables,
    Resource totalResources) {
  // FairSharePolicy considers memory only, so the resource type passed is MEMORY
  ComputeFairShares.computeShares(schedulables, totalResources, MEMORY);
}
computeShares() method
/**
 * Compute fair share of the given schedulables.Fair share is an allocation of
 * shares considering only active schedulables ie schedulables which have
 * running apps.
 *
 * @param schedulables
 * @param totalResources
 * @param type
 */
public static void computeShares(
    Collection<? extends Schedulable> schedulables, Resource totalResources,
    String type) {
  // computeShares() is a thin wrapper around computeSharesInternal() that
  // passes false for the last parameter, isSteadyShare
  computeSharesInternal(schedulables, totalResources, type, false);
}
computeSharesInternal() method
The minShare and maxShare of each Schedulable are assumed to be set beforehand. Given these values, the total available resources totalResources (totalSlots) and each Schedulable's weight, the method computes each Schedulable's weighted fairShare. A computed fairShare may neither fall below the Schedulable's minShare nor exceed its maxShare. An allocation is therefore considered fair if there is a ratio R such that:
- a Schedulable with minShare > weight*R gets fairShare = minShare
- a Schedulable with maxShare < weight*R gets fairShare = maxShare
- every other Schedulable gets fairShare = weight*R
- the fairShares of all Schedulables sum to totalResources
R is called the weight-to-slots ratio because it converts a Schedulable's weight into the amount of resources the Schedulable is assigned.
A suitable R is found by binary search. The lower bound of the search is R = 0, at which every Schedulable receives only its minShare. The upper bound is found by starting at R = 1 and doubling R until the sum of the fairShares is >= totalResource. resourceUsedWithWeightToResourceRatio() computes that sum for a given R.
private static void computeSharesInternal(
    Collection<? extends Schedulable> allSchedulables,
    Resource totalResources, String type, boolean isSteadyShare) {
  Collection<Schedulable> schedulables = new ArrayList<>();
  // Sum the fixed fair shares among allSchedulables.
  // A Schedulable has a fixed share if its maxShare is 0 or its weight is 0:
  //   maxShare == 0: the fixed fair share is 0
  //   weight == 0:   the fixed fair share is its minShare
  int takenResources = handleFixedFairShares(
      allSchedulables, schedulables, isSteadyShare, type);
  if (schedulables.isEmpty()) {
    return;
  }
  // Find an upper bound on R that we can use in our binary search. We start
  // at R = 1 and double it until we have either used all the resources or we
  // have met all Schedulables' max shares.
  // Sum the maxShare of the remaining Schedulables, capped at
  // Integer.MAX_VALUE. The sum limits totalResource below, which guarantees
  // that the doubling loop terminates.
  int totalMaxShare = 0;
  for (Schedulable sched : schedulables) {
    long maxShare = sched.getMaxShare().getResourceValue(type);
    totalMaxShare = (int) Math.min(maxShare + (long)totalMaxShare,
        Integer.MAX_VALUE);
    if (totalMaxShare == Integer.MAX_VALUE) {
      break;
    }
  }
  long totalResource = Math.max((totalResources.getResourceValue(type) -
      takenResources), 0);
  // Use the smaller of totalMaxShare and totalResource as the target
  totalResource = Math.min(totalMaxShare, totalResource);
  // Start with R = 1
  double rMax = 1.0;
  // Double R until the sum of the fair shares is >= totalResource.
  // resourceUsedWithWeightToResourceRatio() computes each Schedulable's
  // fair share for the given R and sums them up:
  //   fairShare = weight*R, if minShare <= weight*R <= maxShare
  //   fairShare = minShare, if weight*R < minShare
  //   fairShare = maxShare, if maxShare < weight*R
  while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
      < totalResource) {
    rMax *= 2.0;
  }
  // Perform the binary search for up to COMPUTE_FAIR_SHARES_ITERATIONS steps
  // Narrow R down until the sum of the fair shares equals totalResource
  double left = 0;
  double right = rMax;
  for (int i = 0; i < COMPUTE_FAIR_SHARES_ITERATIONS; i++) {
    double mid = (left + right) / 2.0;
    int plannedResourceUsed = resourceUsedWithWeightToResourceRatio(
        mid, schedulables, type);
    if (plannedResourceUsed == totalResource) {
      right = mid;
      break;
    } else if (plannedResourceUsed < totalResource) {
      left = mid;
    } else {
      right = mid;
    }
  }
  // Set the fair shares based on the value of R we've converged to
  for (Schedulable sched : schedulables) {
    Resource target;
    if (isSteadyShare) {
      target = ((FSQueue) sched).getSteadyFairShare();
    } else {
      // Fetch the Schedulable's fairShare object
      target = sched.getFairShare();
    }
    // With the final R, set each fair share to weight*R clamped to
    // [minShare, maxShare]
    target.setResourceValue(type, (long)computeShare(sched, right, type));
  }
}
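A stripped-down version of the search makes the behaviour easier to try out. The sketch below is not the Hadoop implementation: the class and field names are hypothetical, a resource is a single long, fixed shares are not handled, the iteration budget of 25 is an assumed value, and it assumes positive weights and small, finite max shares so that the sums cannot overflow.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified sketch of the weight-to-resource ratio search.
final class FairShareSearchSketch {
  static final int ITERATIONS = 25; // assumed iteration budget

  static final class Sched {
    final double weight;
    final long minShare;
    final long maxShare;
    long fairShare;

    Sched(double weight, long minShare, long maxShare) {
      this.weight = weight;
      this.minShare = minShare;
      this.maxShare = maxShare;
    }
  }

  /** weight * r, clamped to [minShare, maxShare]. */
  static long computeShare(Sched s, double r) {
    double share = s.weight * r;
    share = Math.max(share, s.minShare);
    share = Math.min(share, s.maxShare);
    return (long) share;
  }

  /** Sum of all shares for a given ratio r. */
  static long used(double r, List<Sched> scheds) {
    long sum = 0;
    for (Sched s : scheds) {
      sum += computeShare(s, r);
    }
    return sum;
  }

  static void computeShares(List<Sched> scheds, long totalResource) {
    long totalMaxShare = 0;
    for (Sched s : scheds) {
      totalMaxShare += s.maxShare;
    }
    // Never aim higher than what the max shares allow.
    long target = Math.min(totalResource, totalMaxShare);

    // Upper bound: start at 1 and double until the target is reached.
    double rMax = 1.0;
    while (used(rMax, scheds) < target) {
      rMax *= 2.0;
    }
    // Binary search between 0 and rMax.
    double left = 0;
    double right = rMax;
    for (int i = 0; i < ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      long planned = used(mid, scheds);
      if (planned == target) {
        right = mid;
        break;
      } else if (planned < target) {
        left = mid;
      } else {
        right = mid;
      }
    }
    for (Sched s : scheds) {
      s.fairShare = computeShare(s, right);
    }
  }

  /** Three schedulables sharing 100 units: plain, min share 60, max share 20. */
  static long[] demo() {
    List<Sched> scheds = Arrays.asList(
        new Sched(1.0, 0, 1000), new Sched(1.0, 60, 1000), new Sched(2.0, 0, 20));
    computeShares(scheds, 100);
    return new long[] {
        scheds.get(0).fairShare, scheds.get(1).fairShare, scheds.get(2).fairShare};
  }
}
```

In the demo the search settles on R = 20: the first schedulable gets weight*R = 20, the second is lifted to its min share of 60, and the third, despite its weight of 2, is capped at its max share of 20.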
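Computing and updating minshareStarvation and fairshareStarvation for every FSAppAttempt in a leaf queue
FSLeafQueue#updateStarvedApps() method
A queue can be starved in two ways: minshare starvation and fairshare starvation.
Min share is defined for queues only.
Fair share is defined for both queues and apps.
If a queue is minshare-starved, the apps most in need of resources have to be picked.
If a queue is fairshare-starved, at least one of its apps is starved.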
/**
 * Helper method to identify starved applications. This needs to be called
 * ONLY from {@link #updateInternal}, after the application shares
 * are updated.
 *
 * Caller does not need read/write lock on the leaf queue.
 */
void updateStarvedApps() {
  // Fetch apps with pending demand
  TreeSet<FSAppAttempt> appsWithDemand = fetchAppsWithDemand(false);
  // Process apps with fairshare starvation
  // Checks, per app, whether usage has been below fair share for longer
  // than fairSharePreemptionTimeout, records the fairshare-starved apps and
  // returns the sum of their fairshare shortfalls.
  Resource fairShareStarvation = updateStarvedAppsFairshare(appsWithDemand);
  // Compute extent of minshare starvation
  Resource minShareStarvation = minShareStarvation();
  // Compute minshare starvation that is not subsumed by fairshare starvation
  // minShareStarvation = minShareStarvation - fairShareStarvation
  Resources.subtractFromNonNegative(minShareStarvation, fairShareStarvation);
  // Assign this minshare to apps with pending demand over fairshare
  // Computes each app's minshareStarvation
  updateStarvedAppsMinshare(appsWithDemand, minShareStarvation);
}
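The subtraction step keeps only the part of the queue's minshare starvation that the apps' fairshare starvation does not already cover. The sketch below illustrates this with a two-component resource (memory, vcores) held in a long array. It assumes that subtractFromNonNegative behaves like a component-wise subtraction clamped at zero, as its name suggests; unlike the call in the listing, which updates its first argument, the hypothetical helper here returns a new array.

```java
// Hypothetical sketch of a component-wise, zero-clamped subtraction; not Hadoop code.
final class SubtractSketch {

  /** Returns lhs - rhs per component, with every component clamped at zero. */
  static long[] subtractNonNegative(long[] lhs, long[] rhs) {
    long[] out = new long[lhs.length];
    for (int i = 0; i < lhs.length; i++) {
      out[i] = Math.max(0, lhs[i] - rhs[i]);
    }
    return out;
  }

  /** Queue minshare starvation {4096 MB, 4 vcores} minus fairshare starvation {1024 MB, 6 vcores}. */
  static long[] demo() {
    long[] minShareStarvation = {4096, 4};
    long[] fairShareStarvation = {1024, 6};
    return subtractNonNegative(minShareStarvation, fairShareStarvation);
  }
}
```

Here 3072 MB of memory starvation remains to be distributed, while the vcore component is fully covered by fairshare starvation and drops to 0 instead of going negative.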
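FSLeafQueue#updateStarvedAppsMinshare() method
Distributes the queue's remaining minshareStarvation over the apps with pending demand. For each app, as long as some of that remainder is left, the method starts from the app's pending demand beyond its fairshareStarvation (pendingDemand - fairshareStarvation). If this amount exceeds the remainder, the remainder is subtracted from it and the remainder becomes none; otherwise the amount is subtracted from the remainder. The resulting amount is recorded as the app's minshareStarvation.
Apps that are minshare-starved are recorded in FSStarvedApps in the FairScheduler context FSContext. FSStarvedApps maintains a PriorityBlockingQueue<FSAppAttempt> internally, ordered by each app's total starvation (fairshareStarvation + minshareStarvation). This queue is eventually consumed by FSPreemptionThread.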
/**
 * Distribute minshare starvation to a set of apps
 * @param appsWithDemand set of apps
 * @param minShareStarvation minshare starvation to distribute
 */
private void updateStarvedAppsMinshare(
    final TreeSet<FSAppAttempt> appsWithDemand,
    final Resource minShareStarvation) {
  // pending = minShareStarvation
  Resource pending = Resources.clone(minShareStarvation);
  // Keep adding apps to the starved list until the unmet demand goes over
  // the remaining minshare
  for (FSAppAttempt app : appsWithDemand) {
    if (!Resources.isNone(pending)) {
      Resource appMinShare = app.getPendingDemand();
      // appMinShare = appMinShare - app.getFairshareStarvation()
      Resources.subtractFromNonNegative(
          appMinShare, app.getFairshareStarvation());
      if (Resources.greaterThan(policy.getResourceCalculator(),
          scheduler.getClusterResource(), appMinShare, pending)) {
        // appMinShare = appMinShare - pending
        Resources.subtractFromNonNegative(appMinShare, pending);
        pending = none();
      } else {
        // pending = pending - appMinShare
        Resources.subtractFromNonNegative(pending, appMinShare);
      }
      // Record appMinShare as the app's minshareStarvation
      app.setMinshareStarvation(appMinShare);
      // Record the minshare-starved app in FSStarvedApps (held by FSContext)
      context.getStarvedApps().addStarvedApp(app);
    } else {
      // Reset minshare starvation in case we had set it in a previous
      // iteration
      app.resetMinshareStarvation();
    }
  }
}
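To follow the bookkeeping of this loop, it helps to replay it with scalars. The sketch below mirrors the branches of the listing step by step. The class and method names are hypothetical, a resource is a single long, and the apps are given simply as parallel arrays in iteration order.

```java
// Hypothetical scalar replay of the distribution loop above; not Hadoop code.
final class MinshareDistributionSketch {

  /**
   * @param pendingDemand       pending demand per app, in iteration order
   * @param fairshareStarvation fairshare starvation per app
   * @param queueStarvation     queue minshare starvation left to distribute
   * @return minshare starvation recorded per app
   */
  static long[] distribute(long[] pendingDemand, long[] fairshareStarvation,
      long queueStarvation) {
    long pending = queueStarvation;
    long[] result = new long[pendingDemand.length];
    for (int i = 0; i < pendingDemand.length; i++) {
      if (pending > 0) {
        // Demand not already covered by the app's fairshare starvation.
        long appMinShare = Math.max(0, pendingDemand[i] - fairshareStarvation[i]);
        if (appMinShare > pending) {
          // Same branch as the listing: subtract the remainder, then clear it.
          appMinShare = appMinShare - pending;
          pending = 0;
        } else {
          pending -= appMinShare;
        }
        result[i] = appMinShare;
      } else {
        result[i] = 0; // corresponds to resetMinshareStarvation()
      }
    }
    return result;
  }

  /** Three apps, 40 units of queue minshare starvation to distribute. */
  static long[] demo() {
    return distribute(new long[] {30, 50, 10}, new long[] {10, 0, 0}, 40);
  }
}
```

In the demo the first app records 20 and leaves a remainder of 20. The second app's 50 exceeds that remainder, so it records 50 - 20 = 30 and the remainder drops to 0. The third app is reset to 0.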
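FSLeafQueue#minShareStarvation() method
Computes the queue's minshareStarvation. The method checks whether the queue's usage has been below its min share (or below its demand, if the demand is smaller) for longer than MinSharePreemptionTimeout, and if so returns the shortfall.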
/**
 * Helper method to compute the amount of minshare starvation.
 *
 * @return the extent of minshare starvation
 */
private Resource minShareStarvation() {
  // If demand < minshare, we should use demand to determine starvation
  // starvation = min(minShare, demand)
  Resource starvation =
      Resources.componentwiseMin(getMinShare(), getDemand());
  // starvation = starvation - getResourceUsage(), i.e. the shortfall
  // against min share or against demand
  Resources.subtractFromNonNegative(starvation, getResourceUsage());
  // The queue is minshare-starved if the shortfall is not none
  boolean starved = !Resources.isNone(starvation);
  long now = scheduler.getClock().getTime();
  if (!starved) {
    // Record that the queue is not starved
    setLastTimeAtMinShare(now);
  }
  if (now - lastTimeAtMinShare < getMinSharePreemptionTimeout()) {
    // the queue is not starved for the preemption timeout
    // Even if the shortfall is not none, reset it to none until
    // MinSharePreemptionTimeout has elapsed
    starvation = Resources.clone(Resources.none());
  }
  return starvation;
}
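The interplay between the shortfall and the timeout is easiest to see with an explicit clock. The sketch below is a scalar imitation of the method, with hypothetical names and the current time passed in as a parameter so that the behaviour over time can be replayed deterministically.

```java
// Hypothetical scalar imitation of the timeout logic above; not Hadoop code.
final class MinShareStarvationSketch {
  private final long minShare;
  private final long timeoutMs;
  private long lastTimeAtMinShare;

  MinShareStarvationSketch(long minShare, long timeoutMs, long now) {
    this.minShare = minShare;
    this.timeoutMs = timeoutMs;
    this.lastTimeAtMinShare = now;
  }

  /** Returns the shortfall, or 0 while the queue has not been starved long enough. */
  long starvation(long demand, long usage, long now) {
    // Shortfall against min(minShare, demand), clamped at zero.
    long starvation = Math.max(0, Math.min(minShare, demand) - usage);
    if (starvation == 0) {
      lastTimeAtMinShare = now; // not starved: restart the timeout window
    }
    if (now - lastTimeAtMinShare < timeoutMs) {
      starvation = 0; // starved, but not yet for the full timeout
    }
    return starvation;
  }

  /** Replays five observations for minShare = 100 and a 1000 ms timeout. */
  static long[] demo() {
    MinShareStarvationSketch q = new MinShareStarvationSketch(100, 1000, 0);
    return new long[] {
        q.starvation(200, 40, 500),   // starved for 500 ms only
        q.starvation(200, 40, 1500),  // starved past the timeout
        q.starvation(200, 120, 1600), // usage above min share, window restarts
        q.starvation(200, 40, 2000),  // starved again, for 400 ms only
        q.starvation(50, 40, 3000)    // demand below min share caps the shortfall
    };
  }
}
```

The five calls in demo() return 0, 60, 0, 0 and 10 respectively.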
FSLeafQueue#updateStarvedAppsFairshare() method
Computes, for each app in the given set, how far its usage falls short of its fair share, and records the fairshare-starved apps in FSStarvedApps in the FairScheduler context FSContext. Finally it returns the sum of these shortfalls.
FSStarvedApps maintains a PriorityBlockingQueue<FSAppAttempt> internally, ordered by each app's total starvation (fairshareStarvation + minshareStarvation). This queue is eventually consumed by FSPreemptionThread.
/**
 * Compute the extent of fairshare starvation for a set of apps.
 *
 * @param appsWithDemand apps to compute fairshare starvation for
 * @return aggregate fairshare starvation for all apps
 */
private Resource updateStarvedAppsFairshare(
    TreeSet<FSAppAttempt> appsWithDemand) {
  Resource fairShareStarvation = Resources.clone(none());
  // Fetch apps with unmet demand sorted by fairshare starvation
  for (FSAppAttempt app : appsWithDemand) {
    // Compute the app's shortfall against its fair share
    Resource appStarvation = app.fairShareStarvation();
    if (!Resources.isNone(appStarvation)) {
      // Record the starved app in FSStarvedApps (held by FSContext)
      context.getStarvedApps().addStarvedApp(app);
      // Accumulate the shortfall
      Resources.addTo(fairShareStarvation, appStarvation);
    } else {
      break;
    }
  }
  // Return the sum of the shortfalls
  return fairShareStarvation;
}
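A priority queue keyed on total starvation can be sketched with java.util.concurrent.PriorityBlockingQueue and a comparator. The classes below are hypothetical stand-ins, starvation is a single long, and the ordering direction (most starved app first) is an assumption of this sketch rather than something the text above specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of a starvation-ordered queue; not the Hadoop FSStarvedApps class.
final class StarvedAppsSketch {

  static final class App {
    final String name;
    final long fairshareStarvation;
    final long minshareStarvation;

    App(String name, long fairshareStarvation, long minshareStarvation) {
      this.name = name;
      this.fairshareStarvation = fairshareStarvation;
      this.minshareStarvation = minshareStarvation;
    }

    long totalStarvation() {
      return fairshareStarvation + minshareStarvation;
    }
  }

  // Assumption: the app with the largest total starvation is served first.
  private final PriorityBlockingQueue<App> queue = new PriorityBlockingQueue<>(
      11, (a, b) -> Long.compare(b.totalStarvation(), a.totalStarvation()));

  /** Producer side (update thread): enqueue an app unless it is already queued. */
  void addStarvedApp(App app) {
    if (!queue.contains(app)) {
      queue.add(app);
    }
  }

  /** Consumer side (preemption thread): blocks until an app is available. */
  App take() throws InterruptedException {
    return queue.take();
  }

  /** Adds three apps and drains them without blocking, returning their names. */
  static List<String> demo() {
    StarvedAppsSketch starved = new StarvedAppsSketch();
    starved.addStarvedApp(new App("app-A", 10, 0));
    starved.addStarvedApp(new App("app-B", 5, 20));
    starved.addStarvedApp(new App("app-C", 0, 15));
    List<String> order = new ArrayList<>();
    App next;
    while ((next = starved.queue.poll()) != null) {
      order.add(next.name);
    }
    return order;
  }
}
```

With totals of 10, 25 and 15, the demo drains the apps in the order app-B, app-C, app-A.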
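FSAppAttempt#fairShareStarvation() method
Checks whether the app's usage has been below its fair share for longer than fairSharePreemptionTimeout. If so,
the app's fairshareStarvation = fairDemand - getResourceUsage(), where fairDemand is the component-wise minimum of the app's demand and its fair share scaled by the queue's fairSharePreemptionThreshold.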
/**
 * Helper method that computes the extent of fairshare starvation.
 * @return freshly computed fairshare starvation
 */
Resource fairShareStarvation() {
  long now = scheduler.getClock().getTime();
  Resource threshold = Resources.multiply(
      getFairShare(), getQueue().getFairSharePreemptionThreshold());
  Resource fairDemand = Resources.componentwiseMin(threshold, demand);
  // Check if the queue is starved for fairshare
  // i.e. whether the app's usage is below its threshold-scaled fair share
  boolean starved = isUsageBelowShare(getResourceUsage(), fairDemand);
  if (!starved) {
    lastTimeAtFairShare = now;
  }
  if (!starved ||
      now - lastTimeAtFairShare <
          getQueue().getFairSharePreemptionTimeout()) {
    fairshareStarvation = Resources.none();
  } else {
    // The app has been starved for longer than preemption-timeout.
    // fairshareStarvation = fairDemand - getResourceUsage()
    fairshareStarvation =
        Resources.subtractFromNonNegative(fairDemand, getResourceUsage());
  }
  return fairshareStarvation;
}
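The effect of the preemption threshold can be replayed with scalars in the same way. In the sketch below all names are hypothetical, a resource is a single long, the "usage below share" check is a plain comparison, and the clock is passed in explicitly.

```java
// Hypothetical scalar imitation of the fairshare starvation check; not Hadoop code.
final class FairShareStarvationSketch {
  private final long fairShare;
  private final double threshold; // fraction of the fair share, e.g. 0.5
  private final long timeoutMs;
  private long lastTimeAtFairShare;

  FairShareStarvationSketch(long fairShare, double threshold, long timeoutMs, long now) {
    this.fairShare = fairShare;
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.lastTimeAtFairShare = now;
  }

  long starvation(long demand, long usage, long now) {
    // The app is never considered starved beyond its own demand.
    long fairDemand = Math.min((long) (fairShare * threshold), demand);
    boolean starved = usage < fairDemand;
    if (!starved) {
      lastTimeAtFairShare = now; // restart the timeout window
    }
    if (!starved || now - lastTimeAtFairShare < timeoutMs) {
      return 0;
    }
    // Starved for longer than the timeout: report the shortfall.
    return Math.max(0, fairDemand - usage);
  }

  /** Replays four observations for fairShare = 1000, threshold = 0.5, timeout = 1000 ms. */
  static long[] demo() {
    FairShareStarvationSketch app = new FairShareStarvationSketch(1000, 0.5, 1000, 0);
    return new long[] {
        app.starvation(800, 300, 500),  // below 500, but only for 500 ms
        app.starvation(800, 300, 1200), // below 500 past the timeout
        app.starvation(800, 600, 1300), // above 500, window restarts
        app.starvation(400, 100, 2500)  // demand of 400 caps the target
    };
  }
}
```

The four calls in demo() return 0, 200, 0 and 300 respectively.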