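Studying the Source Code of Spark Job Scheduling Modes

Scheduling Modes

For jobs submitted to the same SparkContext, Spark supports two scheduling modes: FIFO and Fair. The mode is set with the spark.scheduler.mode configuration property and defaults to FIFO. Spark abstracts the scheduling algorithm behind a trait named SchedulingAlgorithm, and the FIFO and Fair algorithms each implement it.
[Figure: class diagram of the scheduling algorithms]

FIFO Scheduling

FIFO is a very common scheduling algorithm, and Spark's implementation of it is very simple.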

private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    res < 0
  }
}

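This is just a comparator: the schedulable with the smaller priority value is scheduled first, and when the priorities are equal the one with the smaller stageId goes first.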
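To see the ordering rule in isolation, here is a minimal sketch that applies the same comparison to a simplified record. The names FifoSketch, Item and order are hypothetical stand-ins for illustration; Spark's real Schedulable trait has many more members.

```scala
object FifoSketch {
  // Hypothetical stand-in for a Schedulable: only the two fields FIFO reads.
  final case class Item(name: String, priority: Int, stageId: Int)

  // Same rule as FIFOSchedulingAlgorithm: smaller priority first,
  // and on a tie the smaller stageId first.
  def comparator(s1: Item, s2: Item): Boolean = {
    var res = math.signum(s1.priority - s2.priority)
    if (res == 0) {
      res = math.signum(s1.stageId - s2.stageId)
    }
    res < 0
  }

  // Sort a queue with the comparator and return the names in scheduling order.
  def order(items: Seq[Item]): Seq[String] =
    items.sortWith(comparator).map(_.name)

  def main(args: Array[String]): Unit = {
    val queue = Seq(Item("c", 1, 4), Item("a", 0, 2), Item("b", 0, 3))
    println(order(queue)) // prints List(a, b, c)
  }
}
```

Items a and b share priority 0, so their stageId decides between them, and c with priority 1 comes last.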

Fair Scheduling

Algorithm Design

The Fair scheduling algorithm is somewhat more complex:

private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble

    var compare = 0
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}

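For two pools, the comparator first decides priority from each pool's minimum share (minShare) and its number of running tasks. A pool is "needy" when its running tasks are fewer than its minShare, and a needy pool always goes before a non-needy one. Combining the two needy flags gives four cases; the interesting ones are the last two, plus the tie-break:

  1. If both pools are needy (running tasks below minShare), priority is decided by the ratio of running tasks to minShare (minShareRatio); the smaller ratio has the higher priority.
  2. If neither pool is needy, priority is decided by the ratio of running tasks to weight (taskToWeightRatio); the smaller ratio has the higher priority.
  3. Finally, if none of these cases separates the two pools, their names are compared to decide the priority.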
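The cases above can be worked through on concrete numbers. The following is a minimal sketch that applies the same comparison to a simplified record; FairSketch, PoolState and order are hypothetical names for illustration, and the pool values are made up.

```scala
object FairSketch {
  // Hypothetical stand-in for a pool: only the fields the comparator reads.
  final case class PoolState(name: String, minShare: Int, weight: Int, runningTasks: Int)

  def comparator(s1: PoolState, s2: PoolState): Boolean = {
    val s1Needy = s1.runningTasks < s1.minShare
    val s2Needy = s2.runningTasks < s2.minShare
    val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1.0)
    val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1.0)
    val taskToWeightRatio1 = s1.runningTasks.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = s2.runningTasks.toDouble / s2.weight.toDouble

    val compare =
      if (s1Needy && !s2Needy) -1 // a needy pool always goes first
      else if (!s1Needy && s2Needy) 1
      else if (s1Needy && s2Needy) java.lang.Double.compare(minShareRatio1, minShareRatio2)
      else java.lang.Double.compare(taskToWeightRatio1, taskToWeightRatio2)

    // Fall back to the name when the ratios are equal.
    if (compare != 0) compare < 0 else s1.name < s2.name
  }

  def order(pools: Seq[PoolState]): Seq[String] =
    pools.sortWith(comparator).map(_.name)

  def main(args: Array[String]): Unit = {
    val pools = Seq(
      PoolState("D", minShare = 0, weight = 2, runningTasks = 4), // not needy, 4 / 2 = 2.0
      PoolState("A", minShare = 2, weight = 1, runningTasks = 1), // needy, 1 / 2 = 0.5
      PoolState("C", minShare = 0, weight = 1, runningTasks = 0), // not needy, 0 / 1 = 0.0
      PoolState("B", minShare = 3, weight = 1, runningTasks = 1)) // needy, 1 / 3 = 0.33
    println(order(pools)) // prints List(B, A, C, D)
  }
}
```

The two needy pools come first, ordered by minShareRatio (B before A), followed by the pools that already have their minimum share, ordered by taskToWeightRatio (C before D).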

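Scheduling Configuration

To use fair scheduling you need to provide a resource allocation file, named fairscheduler.xml by default. Below is the demo configuration from Spark's test cases: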

<?xml version="1.0"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one or more
  ~ contributor license agreements.  See the NOTICE file distributed with
  ~ this work for additional information regarding copyright ownership.
  ~ The ASF licenses this file to You under the Apache License, Version 2.0
  ~ (the "License"); you may not use this file except in compliance with
  ~ the License.  You may obtain a copy of the License at
  ~
  ~    http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS,
  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~ See the License for the specific language governing permissions and
  ~ limitations under the License.
  -->

<allocations>
<pool name="1">
    <minShare>2</minShare>
    <weight>1</weight>
    <schedulingMode>FIFO</schedulingMode>
</pool>
<pool name="2">
    <minShare>3</minShare>
    <weight>1</weight>
    <schedulingMode>FIFO</schedulingMode>
</pool>
<pool name="3">
</pool>
</allocations>

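The file contains some of the parameters the algorithm uses (minShare and weight), along with each pool's schedulingMode.
By default the file name above is used; alternatively, set spark.scheduler.allocation.file to point at your own allocation file. To choose the pool that submitted jobs go to, call SparkContext's setLocalProperty to set the spark.scheduler.pool property. For example, to use pool 1:
sparkContext.setLocalProperty("spark.scheduler.pool", "1")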

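Pool Implementation and Construction

Spark has two main kinds of schedulable objects, Pool and TaskSetManager.
[Figure: schedulable objects]
Pool wraps a resource pool, while TaskSetManager tracks and manages tasks. Here we focus on how a pool is implemented and how pools are built and managed.

Pool Implementation

The class structure of Pool is as follows:
[Figure: class structure of Pool]
There are two important data structures here: one stores the schedulables in the pool (mainly TaskSetManagers), and the other maps schedulable names to schedulables: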

  val schedulableQueue = new ConcurrentLinkedQueue[Schedulable]
  val schedulableNameToSchedulable = new ConcurrentHashMap[String, Schedulable]
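As a rough illustration of how these two structures work together, here is a minimal sketch of a pool that keeps them in sync. PoolSketch, Entry and MiniPool are hypothetical names; unlike this sketch, Spark's Pool also tracks parents and running tasks, and I believe its lookup searches child pools as well and returns null rather than an Option.

```scala
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}

object PoolSketch {
  // Hypothetical minimal schedulable: just a name.
  final case class Entry(name: String)

  class MiniPool {
    val schedulableQueue = new ConcurrentLinkedQueue[Entry]
    val schedulableNameToSchedulable = new ConcurrentHashMap[String, Entry]

    // Every add and remove touches both structures so they stay consistent.
    def addSchedulable(e: Entry): Unit = {
      schedulableQueue.add(e)
      schedulableNameToSchedulable.put(e.name, e)
    }

    def removeSchedulable(e: Entry): Unit = {
      schedulableQueue.remove(e)
      schedulableNameToSchedulable.remove(e.name)
    }

    // ConcurrentHashMap.get returns null for a missing key; wrap it in Option here.
    def getSchedulableByName(name: String): Option[Entry] =
      Option(schedulableNameToSchedulable.get(name))
  }
}
```

The queue preserves insertion order for later sorting, while the map gives direct lookup by name.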

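Pool also wraps methods for adding, removing and looking up schedulables. In addition, each TaskSetManager implements executorLost, which re-schedules tasks when the executor that was running them is lost, so that the tasks still complete normally. The TaskSetManager implementation is as follows: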

/** Called by TaskScheduler when an executor is lost so we can re-enqueue our tasks */
  override def executorLost(execId: String, host: String, reason: ExecutorLossReason) {
    // Re-enqueue any tasks that ran on the failed executor if this is a shuffle map stage,
    // and we are not using an external shuffle server which could serve the shuffle outputs.
    // The reason is the next stage wouldn't be able to fetch the data from this dead executor
    // so we would need to rerun these tasks on other executors.
    if (tasks(0).isInstanceOf[ShuffleMapTask] && !env.blockManager.externalShuffleServiceEnabled
        && !isZombie) {
      for ((tid, info) <- taskInfos if info.executorId == execId) {
        val index = taskInfos(tid).index
        if (successful(index) && !killedByOtherAttempt(index)) {
          successful(index) = false
          copiesRunning(index) -= 1
          tasksSuccessful -= 1
          addPendingTask(index)
          // Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
          // stage finishes when a total of tasks.size tasks finish.
          sched.dagScheduler.taskEnded(
            tasks(index), Resubmitted, null, Seq.empty, info)
        }
      }
    }
    for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
      val exitCausedByApp: Boolean = reason match {
        case exited: ExecutorExited => exited.exitCausedByApp
        case ExecutorKilled => false
        case _ => true
      }
      handleFailedTask(tid, TaskState.FAILED, ExecutorLostFailure(info.executorId, exitCausedByApp,
        Some(reason.toString)))
    }
    // recalculate valid locality levels and waits when executor is lost
    recomputeLocality()
  }
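The core bookkeeping of the method above can be reduced to a small model. The sketch below is a heavily simplified illustration, not Spark's implementation: ExecutorLostSketch, Attempt and MiniTaskSet are hypothetical names, and it leaves out the zombie check, killedByOtherAttempt, the DAGScheduler notification, failure counting and locality recomputation.

```scala
import scala.collection.mutable

object ExecutorLostSketch {
  // Hypothetical record of one task attempt: task index, executor, and whether it still runs.
  final case class Attempt(index: Int, executorId: String, running: Boolean)

  class MiniTaskSet(numTasks: Int, isShuffleMapStage: Boolean, externalShuffleService: Boolean) {
    val successful: Array[Boolean] = Array.fill(numTasks)(false)
    val attempts = mutable.LinkedHashMap.empty[Long, Attempt]
    val pending = mutable.ArrayBuffer.empty[Int]

    def executorLost(execId: String): Unit = {
      // Snapshot first so the map is not modified while it is being traversed.
      val onLostExecutor = attempts.toList.filter { case (_, a) => a.executorId == execId }

      // Map output written by the dead executor cannot be fetched any more,
      // so tasks that already succeeded there must run again.
      if (isShuffleMapStage && !externalShuffleService) {
        for ((_, a) <- onLostExecutor if successful(a.index)) {
          successful(a.index) = false
          pending += a.index
        }
      }

      // Attempts still running on the dead executor are treated as failed and queued again.
      for ((tid, a) <- onLostExecutor if a.running) {
        attempts(tid) = a.copy(running = false)
        pending += a.index
      }
    }
  }
}
```

With an external shuffle service, finished map output survives the executor, which is why the first loop is skipped in that case.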

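Pool Construction

Spark manages pools by building a scheduling tree; each pool stores its schedulables in a concurrent linked queue (the ConcurrentLinkedQueue shown above).
Building the pools is the job of SchedulableBuilder, and there is a different builder for each scheduling algorithm:
[Figure: SchedulableBuilder class hierarchy]
Pool initialization is done by TaskSchedulerImpl. It creates the SchedulableBuilder that matches the configured scheduling mode and calls the builder's buildPools() method to add the initial pools to the root pool: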

val rootPool: Pool = new Pool("", schedulingMode, 0, 0)

def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

Building FIFO Pools

FIFOSchedulableBuilder is very simple: buildPools() does nothing, and the only real work is the method that adds a TaskSetManager to the pool:

private[spark] class FIFOSchedulableBuilder(val rootPool: Pool)
  extends SchedulableBuilder with Logging {

  override def buildPools() {
    // nothing
  }

  override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    rootPool.addSchedulable(manager)
  }
}

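Under the FIFO policy there is only the root pool, so there are no operations on second-level pools. FairSchedulableBuilder is much more complex by comparison.

Building Fair Pools

The structure of FairSchedulableBuilder is as follows:
[Figure: structure of FairSchedulableBuilder]
It contains a lot of pool initialization work, including reading the configuration file, creating the configured pools, and creating the default pool: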

 override def buildPools() {
    var fileData: Option[(InputStream, String)] = None
    try {
      fileData = schedulerAllocFile.map { f =>
        val fis = new FileInputStream(f)
        logInfo(s"Creating Fair Scheduler pools from $f")
        Some((fis, f))
      }.getOrElse {
        val is = Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
        if (is != null) {
          logInfo(s"Creating Fair Scheduler pools from default file: $DEFAULT_SCHEDULER_FILE")
          Some((is, DEFAULT_SCHEDULER_FILE))
        } else {
          logWarning("Fair Scheduler configuration file not found so jobs will be scheduled in " +
            s"FIFO order. To use fair scheduling, configure pools in $DEFAULT_SCHEDULER_FILE or " +
            s"set $SCHEDULER_ALLOCATION_FILE_PROPERTY to a file that contains the configuration.")
          None
        }
      }

      fileData.foreach { case (is, fileName) => buildFairSchedulerPool(is, fileName) }
    } catch {
      case NonFatal(t) =>
        val defaultMessage = "Error while building the fair scheduler pools"
        val message = fileData.map { case (is, fileName) => s"$defaultMessage from $fileName" }
          .getOrElse(defaultMessage)
        logError(message, t)
        throw t
    } finally {
      fileData.foreach { case (is, fileName) => is.close() }
    }

    // finally create "default" pool
    buildDefaultPool()
  }

  private def buildDefaultPool() {
    if (rootPool.getSchedulableByName(DEFAULT_POOL_NAME) == null) {
      val pool = new Pool(DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE,
        DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
      rootPool.addSchedulable(pool)
      logInfo("Created default pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
        DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT))
    }
  }
  // Creates the pools from the configuration file; each pool's scheduling mode is also configurable (FIFO or FAIR)
  private def buildFairSchedulerPool(is: InputStream, fileName: String) {
    val xml = XML.load(is)
    for (poolNode <- (xml \\ POOLS_PROPERTY)) {

      val poolName = (poolNode \ POOL_NAME_PROPERTY).text

      val schedulingMode = getSchedulingModeValue(poolNode, poolName,
        DEFAULT_SCHEDULING_MODE, fileName)
      val minShare = getIntValue(poolNode, poolName, MINIMUM_SHARES_PROPERTY,
        DEFAULT_MINIMUM_SHARE, fileName)
      val weight = getIntValue(poolNode, poolName, WEIGHT_PROPERTY,
        DEFAULT_WEIGHT, fileName)

      rootPool.addSchedulable(new Pool(poolName, schedulingMode, minShare, weight))

      logInfo("Created pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
        poolName, schedulingMode, minShare, weight))
    }
  }
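To make the per-pool fallback to defaults concrete, here is a minimal sketch that reads an allocation file with the JDK's DOM parser instead of scala.xml. AllocationFileSketch, PoolConf and parse are hypothetical names, and the default values used (FIFO, minShare 0, weight 1) are my reading of the builder's DEFAULT_* constants, so check them against your Spark version.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import scala.util.Try

object AllocationFileSketch {
  final case class PoolConf(name: String, schedulingMode: String, minShare: Int, weight: Int)

  // Trimmed text of the first child element with this tag, if present and non-empty.
  private def childText(pool: Element, tag: String): Option[String] = {
    val nodes = pool.getElementsByTagName(tag)
    if (nodes.getLength == 0) None
    else Option(nodes.item(0).getTextContent).map(_.trim).filter(_.nonEmpty)
  }

  // Missing or non-numeric values fall back to the given default.
  private def intOrDefault(pool: Element, tag: String, default: Int): Int =
    childText(pool, tag).flatMap(s => Try(s.toInt).toOption).getOrElse(default)

  def parse(xml: String): Seq[PoolConf] = {
    val in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in)
    val pools = doc.getElementsByTagName("pool")
    (0 until pools.getLength).map { i =>
      val pool = pools.item(i).asInstanceOf[Element]
      PoolConf(
        name = pool.getAttribute("name"),
        schedulingMode = childText(pool, "schedulingMode").getOrElse("FIFO"), // assumed default
        minShare = intOrDefault(pool, "minShare", 0),                         // assumed default
        weight = intOrDefault(pool, "weight", 1))                             // assumed default
    }
  }
}
```

Applied to the demo file shown earlier, pool 3 has no child elements, so under these assumptions it would get the default mode, minShare and weight.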

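Adding a TaskSetManager is also a little more involved here. The builder reads the pool name from the job's properties and adds the TaskSetManager to that pool. If no such pool exists, it creates a new pool with the default settings, attaches it to rootPool, and adds the TaskSetManager to the new pool.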

override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    val poolName = if (properties != null) {
        properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
      } else {
        DEFAULT_POOL_NAME
      }
    var parentPool = rootPool.getSchedulableByName(poolName)
    if (parentPool == null) {
      // we will create a new pool that user has configured in app
      // instead of being defined in xml file
      parentPool = new Pool(poolName, DEFAULT_SCHEDULING_MODE,
        DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
      rootPool.addSchedulable(parentPool)
      logWarning(s"A job was submitted with scheduler pool $poolName, which has not been " +
        "configured. This can happen when the file that pools are read from isn't set, or " +
        s"when that file doesn't contain $poolName. Created $poolName with default " +
        s"configuration (schedulingMode: $DEFAULT_SCHEDULING_MODE, " +
        s"minShare: $DEFAULT_MINIMUM_SHARE, weight: $DEFAULT_WEIGHT)")
    }
    parentPool.addSchedulable(manager)
    logInfo("Added task set " + manager.name + " tasks to pool " + poolName)
  }
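The lookup-or-create behaviour can be sketched without Spark. In the sketch below, PoolLookupSketch and MiniRoot are hypothetical names, pools are reduced to lists of task set names, and the property key and default pool name are assumed to be "spark.scheduler.pool" and "default".

```scala
import java.util.Properties
import scala.collection.mutable

object PoolLookupSketch {
  val PoolProperty = "spark.scheduler.pool" // assumed value of FAIR_SCHEDULER_PROPERTIES
  val DefaultPoolName = "default"           // assumed value of DEFAULT_POOL_NAME

  class MiniRoot {
    // pool name -> names of the task sets added to it
    private val pools = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[String]]

    def addPool(name: String): Unit = {
      pools.getOrElseUpdate(name, mutable.ArrayBuffer.empty[String])
      ()
    }

    // Returns the name of the pool the task set ended up in.
    def addTaskSetManager(taskSet: String, properties: Properties): String = {
      val poolName =
        if (properties != null) properties.getProperty(PoolProperty, DefaultPoolName)
        else DefaultPoolName
      // An unknown pool name creates a new pool on the fly.
      pools.getOrElseUpdate(poolName, mutable.ArrayBuffer.empty[String]) += taskSet
      poolName
    }

    def contents: Map[String, List[String]] =
      pools.map { case (name, taskSets) => name -> taskSets.toList }.toMap
  }
}
```

A job submitted with no properties, or without the pool property, lands in the default pool, and a job naming an unconfigured pool still runs in a freshly created one.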

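Priority Sorting and Task Scheduling

Once the pools are initialized, the TaskScheduler calls Pool's getSortedTaskSetQueue method, which sorts the TaskSetManagers according to the pool's scheduling algorithm; the tasks are then scheduled for execution one by one in that order. The sorting is implemented as follows: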

  override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
    val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
    val sortedSchedulableQueue =
      schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
    for (schedulable <- sortedSchedulableQueue) {
      sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
    }
    sortedTaskSetQueue
  }
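Because a pool's children may themselves be pools, the method above effectively flattens the scheduling tree in sorted order. A minimal sketch of that recursion follows; SortedQueueSketch, Node, TaskSet and PoolNode are hypothetical names, and for brevity a single comparator is passed in, whereas each Spark pool uses its own configured algorithm.

```scala
object SortedQueueSketch {
  sealed trait Node { def name: String; def priority: Int }
  final case class TaskSet(name: String, priority: Int) extends Node
  final case class PoolNode(name: String, priority: Int, children: Seq[Node]) extends Node

  // Sort the children of each pool, then recurse; a leaf contributes itself.
  def sortedTaskSets(node: Node, lessThan: (Node, Node) => Boolean): Seq[TaskSet] =
    node match {
      case ts: TaskSet => Seq(ts)
      case pool: PoolNode =>
        pool.children.sortWith(lessThan).flatMap(child => sortedTaskSets(child, lessThan))
    }

  def main(args: Array[String]): Unit = {
    val root = PoolNode("root", 0, Seq(
      PoolNode("b", 2, Seq(TaskSet("t3", 1), TaskSet("t2", 0))),
      PoolNode("a", 1, Seq(TaskSet("t1", 5)))))
    val byPriority = (x: Node, y: Node) => x.priority < y.priority
    println(sortedTaskSets(root, byPriority).map(_.name)) // prints List(t1, t2, t3)
  }
}
```

Pool a sorts ahead of pool b, so its task set comes first; within pool b the task sets are ordered by their own priority.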

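For the TaskScheduler implementation, I covered that part of the source in an earlier article in this series, 菜鸟的Spark 源码学习之路 -3 TaskScheduler源码. If anything here is incorrect, corrections are welcome.

Summary

That covers the source code behind Spark's job pool configuration and scheduling. Discussion is welcome.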
