A Source-Code Study of Spark Job Scheduling Modes

彼岸枫雪非

Article link: https://blog.csdn.net/u012543819/article/details/106000035

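Contents

Scheduling modes
FIFO scheduling
Fair scheduling
- Scheduling algorithm design
- Scheduling configuration
Pool implementation and construction
Priority sorting and task scheduling
Summary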

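Scheduling modes

For jobs submitted to the same SparkContext, Spark offers two scheduling modes: FIFO and Fair. The mode is selected with the configuration property spark.scheduler.mode and defaults to FIFO. Spark abstracts the scheduling algorithm behind a trait named SchedulingAlgorithm, and the FIFO and Fair algorithms each implement it.
[Figure: class diagram of the scheduling algorithms]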

FIFO scheduling

FIFO is a very common scheduling algorithm, and Spark's implementation of it is very simple.

private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    res < 0
  }
}

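This is just a comparator. It orders schedulable entities by priority first (a smaller value runs earlier) and falls back to stageId when the priorities are equal. For a TaskSetManager the priority comes from its TaskSet, which the DAGScheduler sets to the job ID, so earlier jobs are scheduled first and, within a job, earlier stages first.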
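
To make the ordering concrete, here is a minimal, self-contained sketch of the same comparison rule. FifoEntry and FifoSketch are hypothetical stand-ins invented for this illustration; they are not Spark classes, and only the two fields the comparator reads are modeled.

```scala
// Hypothetical stand-in for Spark's Schedulable: only the fields FIFO looks at.
final case class FifoEntry(name: String, priority: Int, stageId: Int)

object FifoSketch {
  // Same rule as the comparator shown above: the lower priority value goes
  // first, and the lower stageId breaks ties.
  def comesBefore(s1: FifoEntry, s2: FifoEntry): Boolean = {
    var res = math.signum(s1.priority - s2.priority)
    if (res == 0) {
      res = math.signum(s1.stageId - s2.stageId)
    }
    res < 0
  }

  // Returns the entry names in the order FIFO would schedule them.
  def order(entries: Seq[FifoEntry]): Seq[String] =
    entries.sortWith(comesBefore).map(_.name)
}
```

With entries for (job 1, stage 3), (job 0, stage 1) and (job 1, stage 2), the job 0 entry sorts first, followed by the two job 1 entries in stage order.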

Fair scheduling

Scheduling algorithm design

The Fair scheduling algorithm is somewhat more complex:

private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble

    var compare = 0
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}

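Given two pools, the comparator first checks whether each one is "needy", that is, whether its number of running tasks is below its configured minimum share (minShare). A needy pool always goes before a pool that is not needy, which covers the first two branches. The remaining two cases are the interesting ones:

If both pools are needy, they are compared by the ratio of running tasks to minimum share (minShareRatio); the pool with the smaller ratio has the higher priority.
If neither pool is needy, they are compared by the ratio of running tasks to weight (taskToWeightRatio); again the smaller ratio wins, so a pool with a larger weight is allowed proportionally more running tasks.
Finally, if the ratios are equal, the pool names are compared to break the tie.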
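
The branches can be tried out with a small, self-contained sketch. PoolStats and FairSketch are hypothetical names made up for this example, not Spark classes; the comparison logic mirrors the comparator shown above, and the sample numbers are invented.

```scala
// Hypothetical stand-in for a pool: only the values the Fair comparator reads.
final case class PoolStats(name: String, minShare: Int, weight: Int, runningTasks: Int)

object FairSketch {
  def comesBefore(s1: PoolStats, s2: PoolStats): Boolean = {
    val s1Needy = s1.runningTasks < s1.minShare
    val s2Needy = s2.runningTasks < s2.minShare
    val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1.0)
    val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1.0)
    val taskToWeightRatio1 = s1.runningTasks.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = s2.runningTasks.toDouble / s2.weight.toDouble

    if (s1Needy && !s2Needy) {
      true                       // a needy pool beats a satisfied one
    } else if (!s1Needy && s2Needy) {
      false
    } else {
      val compare =
        if (s1Needy && s2Needy) java.lang.Double.compare(minShareRatio1, minShareRatio2)
        else java.lang.Double.compare(taskToWeightRatio1, taskToWeightRatio2)
      if (compare < 0) true
      else if (compare > 0) false
      else s1.name < s2.name     // final tie-breaker: pool name
    }
  }

  // Returns the pool names in the order Fair scheduling would visit them.
  def order(pools: Seq[PoolStats]): Seq[String] =
    pools.sortWith(comesBefore).map(_.name)
}
```

For example, take pools A (minShare 2, 1 running), B (minShare 3, 1 running), C (weight 2, 4 running) and D (weight 1, 3 running), with C and D having minShare 0. A and B are needy and go first, B ahead of A because 1/3 is smaller than 1/2; C then precedes D because 4/2 is smaller than 3/1.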

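Scheduling configuration

To use fair scheduling with your own pools, you need to provide a pool allocation file, named fairscheduler.xml by default. The following sample configuration comes from Spark's test cases: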

<?xml version="1.0"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one or more
  ~ contributor license agreements.  See the NOTICE file distributed with
  ~ this work for additional information regarding copyright ownership.
  ~ The ASF licenses this file to You under the Apache License, Version 2.0
  ~ (the "License"); you may not use this file except in compliance with
  ~ the License.  You may obtain a copy of the License at
  ~
  ~    http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS,
  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~ See the License for the specific language governing permissions and
  ~ limitations under the License.
  -->

<allocations>
<pool name="1">
    <minShare>2</minShare>
    <weight>1</weight>
    <schedulingMode>FIFO</schedulingMode>
</pool>
<pool name="2">
    <minShare>3</minShare>
    <weight>1</weight>
    <schedulingMode>FIFO</schedulingMode>
</pool>
<pool name="3">
</pool>
</allocations>

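The file contains some of the parameters the algorithm relies on: minShare, weight, and schedulingMode.
By default Spark looks for a file with the name above on the classpath; alternatively, set spark.scheduler.allocation.file to the path of your own allocation file. To choose the pool for jobs submitted from the current thread, call setLocalProperty on the SparkContext to set spark.scheduler.pool. For example, to use pool 1:
sparkContext.setLocalProperty("spark.scheduler.pool", "1")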

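Pool implementation and construction

Spark has two kinds of schedulable entities: Pool and TaskSetManager.
[Figure: class diagram of the schedulable entities]
A Pool wraps a resource pool, while a TaskSetManager tracks and manages the tasks of one task set. This section focuses on how pools are implemented and on how they are built and managed.

Pool implementation

[Figure: class structure of Pool]

Pool holds two important data structures: a queue of the schedulable entities in the pool (mainly TaskSetManagers), and a map from each entity's name to the entity itself: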

  val schedulableQueue = new ConcurrentLinkedQueue[Schedulable]
  val schedulableNameToSchedulable = new ConcurrentHashMap[String, Schedulable]

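Pool also wraps methods for adding, removing, and looking up schedulable entities. When an executor that was running tasks is lost, the pool passes the notification down to its entities, and each TaskSetManager re-enqueues the affected tasks so that they still run to completion. The TaskSetManager side of this is implemented as follows: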

/** Called by TaskScheduler when an executor is lost so we can re-enqueue our tasks */
  override def executorLost(execId: String, host: String, reason: ExecutorLossReason) {
    // Re-enqueue any tasks that ran on the failed executor if this is a shuffle map stage,
    // and we are not using an external shuffle server which could serve the shuffle outputs.
    // The reason is the next stage wouldn't be able to fetch the data from this dead executor
    // so we would need to rerun these tasks on other executors.
    if (tasks(0).isInstanceOf[ShuffleMapTask] && !env.blockManager.externalShuffleServiceEnabled
        && !isZombie) {
      for ((tid, info) <- taskInfos if info.executorId == execId) {
        val index = taskInfos(tid).index
        if (successful(index) && !killedByOtherAttempt(index)) {
          successful(index) = false
          copiesRunning(index) -= 1
          tasksSuccessful -= 1
          addPendingTask(index)
          // Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
          // stage finishes when a total of tasks.size tasks finish.
          sched.dagScheduler.taskEnded(
            tasks(index), Resubmitted, null, Seq.empty, info)
        }
      }
    }
    for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
      val exitCausedByApp: Boolean = reason match {
        case exited: ExecutorExited => exited.exitCausedByApp
        case ExecutorKilled => false
        case _ => true
      }
      handleFailedTask(tid, TaskState.FAILED, ExecutorLostFailure(info.executorId, exitCausedByApp,
        Some(reason.toString)))
    }
    // recalculate valid locality levels and waits when executor is lost
    recomputeLocality()
  }

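Pool construction

Spark manages pools as a scheduling tree, and the schedulable entities of each pool are stored in a concurrent linked queue (ConcurrentLinkedQueue).
Building the pools is the job of SchedulableBuilder, which has a different implementation for each scheduling mode: FIFOSchedulableBuilder and FairSchedulableBuilder.
Pool initialization is driven by TaskSchedulerImpl. It creates the SchedulableBuilder that matches the configured scheduling mode and calls the builder's buildPools() method to attach the initial pools to the root pool: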

val rootPool: Pool = new Pool("", schedulingMode, 0, 0)

def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

FIFO pool construction

FIFOSchedulableBuilder is very simple: buildPools() does nothing, and addTaskSetManager adds the TaskSetManager directly to the root pool:

private[spark] class FIFOSchedulableBuilder(val rootPool: Pool)
  extends SchedulableBuilder with Logging {

  override def buildPools() {
    // nothing
  }

  override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    rootPool.addSchedulable(manager)
  }
}

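Under the FIFO policy there is only the root pool, so there are no second-level pools to manage. FairSchedulableBuilder is considerably more complex by comparison.

Fair pool construction

[Figure: structure of FairSchedulableBuilder]

It does most of the pool initialization work: reading the configuration file, creating the configured pools, and creating the default pool: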

 override def buildPools() {
    var fileData: Option[(InputStream, String)] = None
    try {
      fileData = schedulerAllocFile.map { f =>
        val fis = new FileInputStream(f)
        logInfo(s"Creating Fair Scheduler pools from $f")
        Some((fis, f))
      }.getOrElse {
        val is = Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
        if (is != null) {
          logInfo(s"Creating Fair Scheduler pools from default file: $DEFAULT_SCHEDULER_FILE")
          Some((is, DEFAULT_SCHEDULER_FILE))
        } else {
          logWarning("Fair Scheduler configuration file not found so jobs will be scheduled in " +
            s"FIFO order. To use fair scheduling, configure pools in $DEFAULT_SCHEDULER_FILE or " +
            s"set $SCHEDULER_ALLOCATION_FILE_PROPERTY to a file that contains the configuration.")
          None
        }
      }

      fileData.foreach { case (is, fileName) => buildFairSchedulerPool(is, fileName) }
    } catch {
      case NonFatal(t) =>
        val defaultMessage = "Error while building the fair scheduler pools"
        val message = fileData.map { case (is, fileName) => s"$defaultMessage from $fileName" }
          .getOrElse(defaultMessage)
        logError(message, t)
        throw t
    } finally {
      fileData.foreach { case (is, fileName) => is.close() }
    }

    // finally create "default" pool
    buildDefaultPool()
  }

  private def buildDefaultPool() {
    if (rootPool.getSchedulableByName(DEFAULT_POOL_NAME) == null) {
      val pool = new Pool(DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE,
        DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
      rootPool.addSchedulable(pool)
      logInfo("Created default pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
        DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT))
    }
  }
  // Create the pools from the configuration file; each pool's own scheduling
  // mode is configurable as well (FIFO or FAIR).
  private def buildFairSchedulerPool(is: InputStream, fileName: String) {
    val xml = XML.load(is)
    for (poolNode <- (xml \\ POOLS_PROPERTY)) {

      val poolName = (poolNode \ POOL_NAME_PROPERTY).text

      val schedulingMode = getSchedulingModeValue(poolNode, poolName,
        DEFAULT_SCHEDULING_MODE, fileName)
      val minShare = getIntValue(poolNode, poolName, MINIMUM_SHARES_PROPERTY,
        DEFAULT_MINIMUM_SHARE, fileName)
      val weight = getIntValue(poolNode, poolName, WEIGHT_PROPERTY,
        DEFAULT_WEIGHT, fileName)

      rootPool.addSchedulable(new Pool(poolName, schedulingMode, minShare, weight))

      logInfo("Created pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
        poolName, schedulingMode, minShare, weight))
    }
  }
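
To see how the per-field defaults play out on the sample allocation file, here is a rough, self-contained sketch that reads pool definitions with the JDK's DOM parser instead of scala.xml. PoolConfig and AllocationFileSketch are hypothetical names for this illustration, and the defaults (FIFO, minShare 0, weight 1) are assumed to match Spark's DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE and DEFAULT_WEIGHT.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import scala.util.Try

// Hypothetical value class holding what one <pool> element configures.
final case class PoolConfig(name: String, schedulingMode: String, minShare: Int, weight: Int)

object AllocationFileSketch {
  // Text of the first child element with the given tag, if present and non-empty.
  private def childText(pool: Element, tag: String): Option[String] = {
    val nodes = pool.getElementsByTagName(tag)
    if (nodes.getLength == 0) None
    else Option(nodes.item(0).getTextContent).map(_.trim).filter(_.nonEmpty)
  }

  // Parses every <pool> element, falling back to the assumed defaults
  // when a field is missing or is not a valid integer.
  def parsePools(xml: String): Seq[PoolConfig] = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
    val pools = doc.getElementsByTagName("pool")
    (0 until pools.getLength).map { i =>
      val pool = pools.item(i).asInstanceOf[Element]
      PoolConfig(
        name = pool.getAttribute("name"),
        schedulingMode = childText(pool, "schedulingMode").map(_.toUpperCase).getOrElse("FIFO"),
        minShare = childText(pool, "minShare").flatMap(s => Try(s.toInt).toOption).getOrElse(0),
        weight = childText(pool, "weight").flatMap(s => Try(s.toInt).toOption).getOrElse(1))
    }
  }
}
```

Applied to the sample file, pool 3 has no child elements, so under these assumptions it ends up as a FIFO pool with minShare 0 and weight 1.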

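Adding a TaskSetManager is also slightly more involved here. The builder reads the pool name from the job's properties (spark.scheduler.pool), looks up that pool, and adds the TaskSetManager to it. If no such pool exists, it creates a new pool with default settings, attaches it to rootPool, and adds the TaskSetManager to the new pool.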

override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    val poolName = if (properties != null) {
        properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
      } else {
        DEFAULT_POOL_NAME
      }
    var parentPool = rootPool.getSchedulableByName(poolName)
    if (parentPool == null) {
      // we will create a new pool that user has configured in app
      // instead of being defined in xml file
      parentPool = new Pool(poolName, DEFAULT_SCHEDULING_MODE,
        DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
      rootPool.addSchedulable(parentPool)
      logWarning(s"A job was submitted with scheduler pool $poolName, which has not been " +
        "configured. This can happen when the file that pools are read from isn't set, or " +
        s"when that file doesn't contain $poolName. Created $poolName with default " +
        s"configuration (schedulingMode: $DEFAULT_SCHEDULING_MODE, " +
        s"minShare: $DEFAULT_MINIMUM_SHARE, weight: $DEFAULT_WEIGHT)")
    }
    parentPool.addSchedulable(manager)
    logInfo("Added task set " + manager.name + " tasks to pool " + poolName)
  }
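
The lookup-with-fallback behaviour can be sketched in a few lines. Everything in PoolLookupSketch is a hypothetical stand-in for illustration (pools and task sets are reduced to names), and the default pool name "default" is assumed to match Spark's DEFAULT_POOL_NAME.

```scala
import scala.collection.mutable

object PoolLookupSketch {
  val DefaultPoolName = "default" // assumed value of DEFAULT_POOL_NAME

  // Hypothetical stand-in for Pool: a name plus the task sets queued in it.
  final class MiniPool(val name: String) {
    val taskSets: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty[String]
  }

  // Hypothetical stand-in for the root pool together with the builder's lookup logic.
  final class MiniRootPool(configured: Seq[String]) {
    private val pools = mutable.LinkedHashMap.empty[String, MiniPool]
    configured.foreach(n => pools(n) = new MiniPool(n))

    def poolNames: Seq[String] = pools.keys.toSeq

    // No requested pool means the default pool; an unknown pool is created on demand.
    def addTaskSet(taskSet: String, requestedPool: Option[String]): MiniPool = {
      val poolName = requestedPool.getOrElse(DefaultPoolName)
      val pool = pools.getOrElseUpdate(poolName, new MiniPool(poolName))
      pool.taskSets += taskSet
      pool
    }
  }
}
```

Note that the real builder logs a warning when it has to create a pool that was not configured; the sketch leaves logging out.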

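Priority sorting and task scheduling

Once the pools are initialized, the task scheduler calls getSortedTaskSetQueue on the root Pool. The method sorts the pool's schedulable entities with that pool's scheduling algorithm and then recurses into each of them, which yields a flat, priority-ordered list of TaskSetManagers whose tasks are then scheduled one by one. The sorting is implemented as follows: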

  override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
    val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
    val sortedSchedulableQueue =
      schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
    for (schedulable <- sortedSchedulableQueue) {
      sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
    }
    sortedTaskSetQueue
  }
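
The recursion is easier to follow on a toy tree. The sketch below is an illustration under simplifying assumptions, not Spark code: Node, TaskSetLeaf, PoolNode and SchedulingTreeSketch are hypothetical names, task sets are reduced to their names, and the root orders its child pools by name as a stand-in for the Fair comparator.

```scala
// Hypothetical stand-in for Schedulable: a named node that can flatten itself.
sealed trait Node {
  def name: String
  def sortedTaskSets: Seq[String]
}

// Leaf of the tree, playing the role of a TaskSetManager.
final case class TaskSetLeaf(name: String, priority: Int, stageId: Int) extends Node {
  def sortedTaskSets: Seq[String] = Seq(name)
}

// Inner node, playing the role of a Pool with its own comparator.
final class PoolNode(
    val name: String,
    val children: Seq[Node],
    before: (Node, Node) => Boolean) extends Node {
  // Sort the direct children, then flatten each child's own sorted queue.
  def sortedTaskSets: Seq[String] =
    children.sortWith(before).flatMap(_.sortedTaskSets)
}

object SchedulingTreeSketch {
  // FIFO rule for leaves (priority, then stageId); any other pair falls back to names.
  val fifo: (Node, Node) => Boolean = (x, y) => (x, y) match {
    case (a: TaskSetLeaf, b: TaskSetLeaf) =>
      if (a.priority != b.priority) a.priority < b.priority else a.stageId < b.stageId
    case (a, b) => a.name < b.name
  }

  // Simplified ordering between pools, used in place of the Fair comparator.
  val byName: (Node, Node) => Boolean = (a, b) => a.name < b.name
}
```

If the root holds pool b (with task sets ts3 and ts1) and pool a (with ts2), pool a is visited first and pool b's task sets come out in FIFO order, giving ts2, ts1, ts3.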