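Scheduling Modes
Spark offers two scheduling modes for jobs submitted to the same SparkContext: FIFO and Fair. The mode is set with the spark.scheduler.mode configuration property and defaults to FIFO. Spark abstracts the scheduling algorithm behind a SchedulingAlgorithm trait, which the FIFO and Fair algorithms each implement.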
FIFO Scheduling
FIFO is a very common scheduling algorithm, and Spark's implementation of it is very simple:
private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    res < 0
  }
}
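This is just a comparator: the schedulable with the smaller priority value is scheduled first (for a TaskSetManager, the priority is the ID of the job it belongs to), and when the priorities are equal the one with the smaller stageId goes first.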
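The ordering rule can be tried out without Spark. The following is a minimal sketch, not Spark code: Entry is a hypothetical stand-in for Schedulable that carries only the two fields the comparator reads, and comesBefore restates the rule without the signum arithmetic.

```scala
object FifoOrderSketch {
  // Hypothetical stand-in for Spark's Schedulable (assumption: only these fields matter here).
  final case class Entry(name: String, priority: Int, stageId: Int)

  // Same rule as the comparator above: lower priority value first, then lower stageId.
  def comesBefore(a: Entry, b: Entry): Boolean =
    if (a.priority != b.priority) a.priority < b.priority
    else a.stageId < b.stageId

  // Names of the entries in the order FIFO would schedule them.
  def order(entries: Seq[Entry]): Seq[String] =
    entries.sortWith(comesBefore).map(_.name)

  def main(args: Array[String]): Unit = {
    val entries = Seq(
      Entry("job1-stage3", priority = 1, stageId = 3),
      Entry("job0-stage1", priority = 0, stageId = 1),
      Entry("job1-stage2", priority = 1, stageId = 2))
    // Prints: job0-stage1, job1-stage2, job1-stage3
    println(order(entries).mkString(", "))
  }
}
```

The earlier job always wins, and within one job the earlier stage wins, which is what makes this ordering first in, first out.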
Fair Scheduling
Algorithm Design
The Fair scheduling algorithm is slightly more complex:
private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble

    var compare = 0
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}
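Given two pools, the comparator first checks whether each one is "needy", that is, whether its number of running tasks is below its minimum share (minShare). A needy pool always goes ahead of a pool that is not needy. The two remaining combinations are the interesting ones:
- If both pools are needy, they are compared by the ratio of running tasks to minimum share (minShareRatio); the smaller ratio has higher priority.
- If neither pool is needy, they are compared by the ratio of running tasks to weight (taskToWeightRatio); again the smaller ratio has higher priority, so a larger weight moves a pool forward.
- Finally, if the chosen ratios are equal, the pool names are compared to break the tie.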
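A small worked example helps here. The sketch below is not Spark code: PoolStat is a hypothetical stand-in for a pool, the pool names and numbers are made up, and weight is assumed to be greater than zero.

```scala
object FairOrderSketch {
  // Hypothetical stand-in for a pool: only the fields the fair comparator reads.
  final case class PoolStat(name: String, minShare: Int, weight: Int, runningTasks: Int)

  // Restates the fair comparison: needy pools first, then the relevant ratio, then the name.
  def comesBefore(a: PoolStat, b: PoolStat): Boolean = {
    val aNeedy = a.runningTasks < a.minShare
    val bNeedy = b.runningTasks < b.minShare
    if (aNeedy != bNeedy) {
      aNeedy // exactly one pool is below its minimum share, and it goes first
    } else {
      val (ratioA, ratioB) =
        if (aNeedy) {
          // both needy: running tasks relative to minimum share
          (a.runningTasks.toDouble / math.max(a.minShare, 1.0),
            b.runningTasks.toDouble / math.max(b.minShare, 1.0))
        } else {
          // neither needy: running tasks relative to weight (assumes weight > 0)
          (a.runningTasks.toDouble / a.weight, b.runningTasks.toDouble / b.weight)
        }
      if (ratioA != ratioB) ratioA < ratioB else a.name < b.name
    }
  }

  def order(pools: Seq[PoolStat]): Seq[String] =
    pools.sortWith(comesBefore).map(_.name)

  // Made-up pools: two below their minimum share, two at or above it.
  val sample: Seq[PoolStat] = Seq(
    PoolStat("batch", minShare = 0, weight = 2, runningTasks = 6),  // not needy, 6 / 2 = 3.0
    PoolStat("adhoc", minShare = 2, weight = 1, runningTasks = 1),  // needy, 1 / 2 = 0.5
    PoolStat("report", minShare = 1, weight = 1, runningTasks = 2), // not needy, 2 / 1 = 2.0
    PoolStat("etl", minShare = 4, weight = 1, runningTasks = 1))    // needy, 1 / 4 = 0.25

  def main(args: Array[String]): Unit =
    // Prints: etl, adhoc, report, batch
    println(order(sample).mkString(", "))
}
```

Both needy pools come first, ordered by how far they are below their minimum share. The remaining two are ordered by running tasks per unit of weight.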
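Scheduling Configuration
To use fair scheduling, you need to provide a pool allocation file, named fairscheduler.xml by default. The following demo configuration comes from Spark's test cases: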
<?xml version="1.0"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<allocations>
<pool name="1">
<minShare>2</minShare>
<weight>1</weight>
<schedulingMode>FIFO</schedulingMode>
</pool>
<pool name="2">
<minShare>3</minShare>
<weight>1</weight>
<schedulingMode>FIFO</schedulingMode>
</pool>
<pool name="3">
</pool>
</allocations>
The file sets some of the parameters that the algorithm uses.
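To see which values each pool ends up with, the sketch below reads a shortened copy of the allocation file using the JDK's XML parser. It is an illustration rather than Spark's own parsing code: PoolConfig and childText are hypothetical names, and the fallback values (FIFO, minShare 0, weight 1) are assumed to match the DEFAULT_* constants in the Spark version this article quotes. Unlike Spark, the sketch does not validate values, so a non-numeric minShare would throw.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

object AllocationFileSketch {
  // Hypothetical holder for the settings of one pool.
  final case class PoolConfig(name: String, schedulingMode: String, minShare: Int, weight: Int)

  // Shortened copy of the demo allocation file (license header omitted).
  val sample: String =
    """<allocations>
      |  <pool name="1"><minShare>2</minShare><weight>1</weight><schedulingMode>FIFO</schedulingMode></pool>
      |  <pool name="2"><minShare>3</minShare><weight>1</weight><schedulingMode>FIFO</schedulingMode></pool>
      |  <pool name="3"></pool>
      |</allocations>""".stripMargin

  // Trimmed text of the first element with the given tag under `pool`, if present and non-empty.
  private def childText(pool: Element, tag: String): Option[String] = {
    val nodes = pool.getElementsByTagName(tag)
    if (nodes.getLength == 0) None
    else Option(nodes.item(0).getTextContent).map(_.trim).filter(_.nonEmpty)
  }

  // Reads every <pool> element and fills in the assumed defaults for missing settings.
  def parse(xml: String): Seq[PoolConfig] = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val pools = doc.getElementsByTagName("pool")
    (0 until pools.getLength).map { i =>
      val pool = pools.item(i).asInstanceOf[Element]
      PoolConfig(
        name = pool.getAttribute("name"),
        schedulingMode = childText(pool, "schedulingMode").getOrElse("FIFO"),
        minShare = childText(pool, "minShare").map(_.toInt).getOrElse(0),
        weight = childText(pool, "weight").map(_.toInt).getOrElse(1))
    }
  }

  def main(args: Array[String]): Unit =
    parse(sample).foreach(println)
}
```

Pool 3 has no child elements, so under these assumptions it ends up as a FIFO pool with minShare 0 and weight 1.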
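By default Spark picks up a file with the name above from the classpath; alternatively, set spark.scheduler.allocation.file to point to your own allocation file. To choose the pool that jobs are submitted to, call setLocalProperty on the SparkContext to set the spark.scheduler.pool property. For example, to use pool 1:
sparkContext.setLocalProperty("spark.scheduler.pool", "1")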
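Pool Implementation and Construction
Spark has two main kinds of schedulable objects: Pool and TaskSetManager.
A Pool encapsulates a resource pool, while a TaskSetManager tracks and manages tasks. Here we focus on how pools are implemented and how they are built and managed.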
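Pool Implementation
The Pool class has two important data structures: one stores the schedulable objects in the pool (mainly TaskSetManagers), and the other maps schedulable names to the schedulables themselves: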
val schedulableQueue = new ConcurrentLinkedQueue[Schedulable]
val schedulableNameToSchedulable = new ConcurrentHashMap[String, Schedulable]
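Pool also wraps methods for adding, removing, and looking up schedulables. In addition, when an executor that is running tasks is lost, the pool passes the event down to each of its schedulables, so that every TaskSetManager can reschedule the affected tasks and keep them running correctly. The TaskSetManager side of this is implemented as follows: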
/** Called by TaskScheduler when an executor is lost so we can re-enqueue our tasks */
override def executorLost(execId: String, host: String, reason: ExecutorLossReason) {
  // Re-enqueue any tasks that ran on the failed executor if this is a shuffle map stage,
  // and we are not using an external shuffle server which could serve the shuffle outputs.
  // The reason is the next stage wouldn't be able to fetch the data from this dead executor
  // so we would need to rerun these tasks on other executors.
  if (tasks(0).isInstanceOf[ShuffleMapTask] && !env.blockManager.externalShuffleServiceEnabled
    && !isZombie) {
    for ((tid, info) <- taskInfos if info.executorId == execId) {
      val index = taskInfos(tid).index
      if (successful(index) && !killedByOtherAttempt(index)) {
        successful(index) = false
        copiesRunning(index) -= 1
        tasksSuccessful -= 1
        addPendingTask(index)
        // Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
        // stage finishes when a total of tasks.size tasks finish.
        sched.dagScheduler.taskEnded(
          tasks(index), Resubmitted, null, Seq.empty, info)
      }
    }
  }
  for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
    val exitCausedByApp: Boolean = reason match {
      case exited: ExecutorExited => exited.exitCausedByApp
      case ExecutorKilled => false
      case _ => true
    }
    handleFailedTask(tid, TaskState.FAILED, ExecutorLostFailure(info.executorId, exitCausedByApp,
      Some(reason.toString)))
  }
  // recalculate valid locality levels and waits when executor is lost
  recomputeLocality()
}
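The two loops in executorLost make two separate decisions, which the following toy model isolates. It is a simplified sketch, not Spark code: TaskRecord and Outcome are hypothetical types, every unfinished task is treated as running, and the isZombie and killedByOtherAttempt checks are left out.

```scala
object ExecutorLostSketch {
  // Hypothetical record of one task attempt: where it ran and whether it finished successfully.
  final case class TaskRecord(index: Int, executorId: String, finished: Boolean)

  // Which task indexes are re-enqueued and which running attempts are marked as failed.
  final case class Outcome(resubmitted: Seq[Int], failedRunning: Seq[Int])

  def onExecutorLost(
      tasks: Seq[TaskRecord],
      lostExecutor: String,
      isShuffleMapStage: Boolean,
      externalShuffleService: Boolean): Outcome = {
    val onLost = tasks.filter(_.executorId == lostExecutor)
    // Finished map outputs stored on the dead executor are gone, so those tasks must rerun,
    // unless an external shuffle service can still serve the outputs.
    val resubmitted: Seq[Int] =
      if (isShuffleMapStage && !externalShuffleService) onLost.filter(_.finished).map(_.index)
      else Seq.empty
    // Attempts that were still running on the lost executor are reported as failed.
    val failedRunning: Seq[Int] = onLost.filterNot(_.finished).map(_.index)
    Outcome(resubmitted, failedRunning)
  }

  val sample: Seq[TaskRecord] = Seq(
    TaskRecord(0, "exec-1", finished = true),
    TaskRecord(1, "exec-1", finished = false),
    TaskRecord(2, "exec-2", finished = true))
}
```

With the sample data, losing exec-1 in a shuffle map stage without an external shuffle service resubmits task 0 and fails task 1. With an external shuffle service, only task 1 is affected.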
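Pool Construction
Spark manages pools by building a scheduling tree; the schedulables of each pool are stored in a concurrent linked queue (ConcurrentLinkedQueue).
Building the pools is the job of a SchedulableBuilder, and each scheduling mode has its own builder.
Pool initialization is done by TaskSchedulerImpl, which creates the SchedulableBuilder that matches the configured scheduling mode and calls the builder's buildPools() method to add the initial pools to the root pool: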
val rootPool: Pool = new Pool("", schedulingMode, 0, 0)

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
    }
  }
  schedulableBuilder.buildPools()
}
FIFO Pool Construction
The FIFOSchedulableBuilder implementation is very simple: its only real method adds a TaskSetManager to the root pool:
private[spark] class FIFOSchedulableBuilder(val rootPool: Pool)
  extends SchedulableBuilder with Logging {

  override def buildPools() {
    // nothing
  }

  override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    rootPool.addSchedulable(manager)
  }
}
Under the FIFO policy there is only the root pool, so there are no second-level pools to manage. FairSchedulableBuilder is considerably more complex by comparison.
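Fair Pool Construction
FairSchedulableBuilder does much of the pool initialization work, including reading the configuration file, creating the configured pools, and creating the default pool: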
override def buildPools() {
  var fileData: Option[(InputStream, String)] = None
  try {
    fileData = schedulerAllocFile.map { f =>
      val fis = new FileInputStream(f)
      logInfo(s"Creating Fair Scheduler pools from $f")
      Some((fis, f))
    }.getOrElse {
      val is = Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
      if (is != null) {
        logInfo(s"Creating Fair Scheduler pools from default file: $DEFAULT_SCHEDULER_FILE")
        Some((is, DEFAULT_SCHEDULER_FILE))
      } else {
        logWarning("Fair Scheduler configuration file not found so jobs will be scheduled in " +
          s"FIFO order. To use fair scheduling, configure pools in $DEFAULT_SCHEDULER_FILE or " +
          s"set $SCHEDULER_ALLOCATION_FILE_PROPERTY to a file that contains the configuration.")
        None
      }
    }

    fileData.foreach { case (is, fileName) => buildFairSchedulerPool(is, fileName) }
  } catch {
    case NonFatal(t) =>
      val defaultMessage = "Error while building the fair scheduler pools"
      val message = fileData.map { case (is, fileName) => s"$defaultMessage from $fileName" }
        .getOrElse(defaultMessage)
      logError(message, t)
      throw t
  } finally {
    fileData.foreach { case (is, fileName) => is.close() }
  }

  // finally create "default" pool
  buildDefaultPool()
}
private def buildDefaultPool() {
  if (rootPool.getSchedulableByName(DEFAULT_POOL_NAME) == null) {
    val pool = new Pool(DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE,
      DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
    rootPool.addSchedulable(pool)
    logInfo("Created default pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
      DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT))
  }
}
// Create the pools described in the configuration file; each pool's own
// scheduling mode is configurable as well (FIFO or Fair).
private def buildFairSchedulerPool(is: InputStream, fileName: String) {
  val xml = XML.load(is)
  for (poolNode <- (xml \\ POOLS_PROPERTY)) {

    val poolName = (poolNode \ POOL_NAME_PROPERTY).text

    val schedulingMode = getSchedulingModeValue(poolNode, poolName,
      DEFAULT_SCHEDULING_MODE, fileName)
    val minShare = getIntValue(poolNode, poolName, MINIMUM_SHARES_PROPERTY,
      DEFAULT_MINIMUM_SHARE, fileName)
    val weight = getIntValue(poolNode, poolName, WEIGHT_PROPERTY,
      DEFAULT_WEIGHT, fileName)

    rootPool.addSchedulable(new Pool(poolName, schedulingMode, minShare, weight))

    logInfo("Created pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
      poolName, schedulingMode, minShare, weight))
  }
}
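Adding a TaskSetManager is also a little more involved here. The builder reads the configuration property to find the pool the TaskSetManager belongs to and adds the TaskSetManager to that pool. If no such pool exists, it creates a new pool with the default settings, attaches it to rootPool, and adds the TaskSetManager to the new pool.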
override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  val poolName = if (properties != null) {
    properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
  } else {
    DEFAULT_POOL_NAME
  }
  var parentPool = rootPool.getSchedulableByName(poolName)
  if (parentPool == null) {
    // we will create a new pool that user has configured in app
    // instead of being defined in xml file
    parentPool = new Pool(poolName, DEFAULT_SCHEDULING_MODE,
      DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
    rootPool.addSchedulable(parentPool)
    logWarning(s"A job was submitted with scheduler pool $poolName, which has not been " +
      "configured. This can happen when the file that pools are read from isn't set, or " +
      s"when that file doesn't contain $poolName. Created $poolName with default " +
      s"configuration (schedulingMode: $DEFAULT_SCHEDULING_MODE, " +
      s"minShare: $DEFAULT_MINIMUM_SHARE, weight: $DEFAULT_WEIGHT)")
  }
  parentPool.addSchedulable(manager)
  logInfo("Added task set " + manager.name + " tasks to pool " + poolName)
}
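The lookup-or-create behaviour of addTaskSetManager can be modelled in a few lines. This is a toy sketch, not Spark code: ToyPool and ToyRoot are hypothetical classes, task sets are plain strings, and the name "default" is assumed to be the value of DEFAULT_POOL_NAME.

```scala
import scala.collection.mutable

object PoolLookupSketch {
  // Hypothetical pool that only remembers the names of the task sets added to it.
  final class ToyPool(val name: String) {
    val taskSets: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty[String]
  }

  // Hypothetical root pool that keeps its child pools by name, in creation order.
  final class ToyRoot {
    private val pools = mutable.LinkedHashMap.empty[String, ToyPool]

    // Stands in for a pool that was declared in the allocation file.
    def define(name: String): Unit = {
      pools.getOrElseUpdate(name, new ToyPool(name))
      ()
    }

    // No requested pool means the default pool; an unknown pool is created on the fly.
    def add(taskSet: String, requestedPool: Option[String]): ToyPool = {
      val poolName = requestedPool.getOrElse("default")
      val pool = pools.getOrElseUpdate(poolName, new ToyPool(poolName))
      pool.taskSets += taskSet
      pool
    }

    def poolNames: Seq[String] = pools.keys.toSeq
  }

  def main(args: Array[String]): Unit = {
    val root = new ToyRoot
    root.define("1")
    root.add("taskset-0", None)          // goes to "default"
    root.add("taskset-1", Some("1"))     // goes to the configured pool "1"
    root.add("taskset-2", Some("adhoc")) // "adhoc" is unknown, so it is created
    // Prints: 1, default, adhoc
    println(root.poolNames.mkString(", "))
  }
}
```

A misspelled pool name therefore does not fail. The job quietly runs in a new pool with default settings, and the warning in the Spark code above exists to flag that case.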
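Priority Sorting and Task Scheduling
Once the pools are initialized, the TaskScheduler calls Pool's getSortedTaskSetQueue method to sort the TaskSetManagers with the pool's scheduling algorithm, and then schedules their tasks in that order. The sort is implemented as follows: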
override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
  val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
  val sortedSchedulableQueue =
    schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
  for (schedulable <- sortedSchedulableQueue) {
    sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
  }
  sortedTaskSetQueue
}
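Because each child is asked for its own sorted queue, the sort is recursive: pools are ordered first, and the task sets inside each pool follow. The toy tree below illustrates this. It is a sketch, not Spark code: Node, ToyTaskSet, and ToyPool are hypothetical, a leaf is assumed to return just itself (as a TaskSetManager does), and the root orders its pools by name only as a placeholder for the fair comparator.

```scala
object SortedQueueSketch {
  // Hypothetical stand-in for Schedulable: either a task set (leaf) or a pool (inner node).
  sealed trait Node {
    def name: String
    def sortedTaskSets: Seq[String]
  }

  // Leaf: assumed to contribute only itself to the sorted queue.
  final case class ToyTaskSet(name: String, priority: Int, stageId: Int) extends Node {
    def sortedTaskSets: Seq[String] = Seq(name)
  }

  // Inner node: sort the direct children, then flatten each child's own sorted queue.
  final class ToyPool(val name: String, children: Seq[Node], before: (Node, Node) => Boolean)
      extends Node {
    def sortedTaskSets: Seq[String] = children.sortWith(before).flatMap(_.sortedTaskSets)
  }

  // FIFO rule for task sets; anything else falls back to name order.
  def fifo(a: Node, b: Node): Boolean = (a, b) match {
    case (x: ToyTaskSet, y: ToyTaskSet) =>
      if (x.priority != y.priority) x.priority < y.priority else x.stageId < y.stageId
    case _ => a.name < b.name
  }

  // Placeholder ordering for the root pool (NOT the fair comparator).
  def byName(a: Node, b: Node): Boolean = a.name < b.name

  def example: Seq[String] = {
    val pool1 = new ToyPool("1",
      Seq(ToyTaskSet("job2-stage5", 2, 5), ToyTaskSet("job1-stage3", 1, 3)), fifo)
    val pool2 = new ToyPool("2", Seq(ToyTaskSet("job0-stage1", 0, 1)), fifo)
    new ToyPool("root", Seq(pool2, pool1), byName).sortedTaskSets
  }

  def main(args: Array[String]): Unit =
    // Prints: job1-stage3, job2-stage5, job0-stage1
    println(example.mkString(", "))
}
```

Note that job0-stage1 comes last even though it belongs to the earliest job: the order of the pools takes precedence over the order of the task sets inside them.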
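As for the TaskScheduler implementation, I walked through that part of the source code in an earlier article in this series, 菜鸟的Spark 源码学习之路 -3 TaskScheduler源码. Corrections are welcome if anything here is wrong.
Summary
That covers the source code implementation of Spark's job pool configuration and scheduling. Discussion is welcome.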