spark中的动态executor分配

最新推荐文章于 2024-10-16 17:06:45 发布

weixin_33911824

最新推荐文章于 2024-10-16 17:06:45 发布

阅读量1.9k

点赞数

文章标签：大数据 python scala

原文链接：https://my.oschina.net/hblt147/blog/2876605

版权

2019独角兽企业重金招聘Python工程师标准>>>

动态分配executor的实例初始化部分
如果spark.executor.instances配置项设置为0或者没有设置,这个默认情况下是一个未设置的值,yarn的运行模式时,这个配置通过--num-executors来得到.

同时spark.dynamicAllocation.enabled配置项设置为true时.默认值为false,表示启用了动态分配executor.

在driver端SparkContext生成时,会检查上面两个配置项,如果这两个配置满足动态executor分配的要求时,会生成一个ExecutorAllocationManager实例.

_executorAllocationManager =
  if (dynamicAllocationEnabled) {
    Some(new ExecutorAllocationManager(this, listenerBus, _conf))
  } else {
    None
  }
_executorAllocationManager.foreach(_.start())

必要的配置项:

1,配置项spark.dynamicAllocation.minExecutors,默认值0,最少分配的executor的个数.

2,配置项spark.dynamicAllocation.maxExecutors,默认值int.maxvalue.最大可分配的executor的个数.

3,配置项spark.dynamicAllocation.initialExecutors,默认值为配置项1的值,初始化时启用的executor的个数,

4,1,配置项spark.dynamicAllocation.schedulerBacklogTimeout,默认值1s,如果未分配的task等待分配的时间超过了这个配置的时间,表示需要新启动executor.

4,2,配置项spark.dynamicAllocation.sustainedSchedulerBacklogTimeout,默认是4,1,配置项的值,这个配置用于设置在初始调度的executor调度延时后,每次的等待超时时间.

5,配置项spark.dynamicAllocation.executorIdleTimeout,默认值60s,executor的空闲回收时间.

6,配置项spark.executor.cores的配置(executor-cores)必须大于或等于配置项spark.task.cpus的值(这个配置默认是1,这是每个task需要的cpu的个数).

7,配置项spark.shuffle.service.enabled必须配置为true,默认为false.如果这个配置设置为true时,BlockManager实例生成时,需要读取spark.shuffle.service.port配置项配置的shuffle的端口,同时对应BlockManager的shuffleClient不在是默认的BlockTransferService实例,而是ExternalShuffleClient实例.

8,初始化时,ExecutorAllocationManager中的属性initializing默认值为true,表示定时调度时,什么都不做.

在执行ExecutorAllocationManager中的start函数时:

def start(): Unit = {

这里把ExecutorAllocationListener实例(内部实现类)添加到sparkContext中的listenerBus中,用于监听stage,task的启动与完成,并做对应的操作.
listenerBus.addListener(listener)

  val scheduleTask = new Runnable() {
    override def run(): Unit = {
      try {
        schedule()
      } catch {
        case ct: ControlThrowable =>
          throw ct
        case t: Throwable =>
          logWarning(s"Uncaught exception in thread

               ${Thread.currentThread().getName}", t)
      }
    }
  }

定时100ms执行一次schedule的调度函数,来进行task的分析.
executor.scheduleAtFixedRate(scheduleTask, 0, intervalMillis,

TimeUnit.MILLISECONDS)
}

对executor的个数分配的计算
针对task的调度主要由一个定时器每100ms进行一次schedule函数的调用.

private def schedule(): Unit = synchronized {

在这个函数中,首先得到当前的时间,
val now = clock.getTimeMillis

在调用这个函数时,初始情况下,initializing的属性值为true,这个时候,这个函数什么也不做.

这个函数的内容,后面在进行分析.
updateAndSyncNumExecutorsTarget(now)

这个removeTimes集合中记录有每一个executor没有被task占用后的时间,如果这个时间超过了上面配置的idle的时间,会移出掉这个executor,同时设置initializing属性为false,表示可以继续进行task的调度.retain函数只保留未超时的executor.
  removeTimes.retain { case (executorId, expireTime) =>
    val expired = now >= expireTime
    if (expired) {
      initializing = false
      removeExecutor(executorId)
    }
    !expired
  }
}

如何知道stage被提交?看下面,

在SparkContext中,执行runJob命令时,针对一个stage进行submit操作时,会调用listenerBus中所有的listener对应的onStageSubmitted函数.

而在ExecutorAllocationManager进行start操作时,生成了一个listener,实例为ExecutorAllocationListener,并把这个listener添加到了listenerBus中.

接下来看看ExecutorAllocationListener中对应stage提交的监听处理:

override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted)

: Unit = {

这里首先把initializing的属性值设置为false,表示下次定时调度时,需要执行executor的分配操作.
initializing = false

得到进行submit操作对应的stage的id与stage中对应的task的个数.
  val stageId = stageSubmitted.stageInfo.stageId
  val numTasks = stageSubmitted.stageInfo.numTasks
  allocationManager.synchronized {

通过对应的stageId设置这个stage的task的个数,存储到stageIdToNumTasks集合中.
stageIdToNumTasks(stageId) = numTasks

这里更新allocationManager中的addTime的时间,

由当前时间加上配置spark.dynamicAllocation.schedulerBacklogTimeout的超时时间.
allocationManager.onSchedulerBacklogged()

这里根据每个task对应的host,计算出每个host对应的task的个数,numTasksPending的个数原则上应该与stage中numTask的个数相同.
    // Compute the number of tasks requested by the stage on each host
    var numTasksPending = 0
    val hostToLocalTaskCountPerStage = new mutable.HashMap[String, Int]()
    stageSubmitted.stageInfo.taskLocalityPreferences.foreach { locality =>
      if (!locality.isEmpty) {
        numTasksPending += 1
        locality.foreach { location =>
          val count = hostToLocalTaskCountPerStage.getOrElse(location.host, 0) + 1
          hostToLocalTaskCountPerStage(location.host) = count
        }
      }
    }

在对应的集合中,根据stageid与pending的task的个数,对应的host与host对应的task的个数进行存储.
stageIdToExecutorPlacementHints.put(stageId,
(numTasksPending, hostToLocalTaskCountPerStage.toMap))

下面的函数迭代stageIdToExecutorPlacementHints集合中的values,并更新allocationManager中localityAwareTasks属性(存储待启动的task的个数)与hostToLocalTaskCount集合属性(存储host对应的task的个数)的值.添加到这里,主要是executor启动时对应的调度启动task
    // Update the executor placement hints
    updateExecutorPlacementHints()
  }
}

接面看看allocationManager中定时调度的updateAndSyncNumExecutorsTarget函数:

现在来说说updateAndSyncNumExecutorsTarget函数与addExecutors函数的作用:

示例说明:

假定这次的stage需要的executor的个数为5,numExecutorsTarget的配置保持默认值0,

如果是第一次调度启动时,在updateAndSyncNumExecutorsTarget函数中:

1,先计算出这个stage需要的executor的个数,

val maxNeeded = maxNumExecutorsNeeded

if (initializing) {

如果函数进行这里,表示还没有stage提交,也就是没有job被执行.不进行调度.
    // Do not change our target while we are still initializing,
    // Otherwise the first job may have to ramp up unnecessarily
    0
  }

2,进入的流程为else if (addTime != NOT_SET && now >= addTime)部分.这个时候执行addExecutors函数,(这里假定时间已经达到了addTime的超时时间)

这种情况下默认的初始executor的个数为0的情况下,在当前时间超过了等待超时时间后,会进入,第一次时需要等待一秒钟,每次执行会更新等待时间.

这里根据要stage对应的task需要的executor的个数,并执行addExecutors的函数.

else if (addTime != NOT_SET && now >= addTime) {
    val delta = addExecutors(maxNeeded)
    logDebug(s"Starting timer to add more executors (to " +
      s"expire in $sustainedSchedulerBacklogTimeoutS seconds)")
    addTime += sustainedSchedulerBacklogTimeoutS * 1000
    delta
}

在addExecutors函数中,先计算出目标的executor的个数(属性numExecutorsTarget),

// Do not request more executors if it would put our target over the upper bound
  if (numExecutorsTarget >= maxNumExecutors) {
    logDebug(s"Not adding executors because our current target total " +
      s"is already $numExecutorsTarget (limit $maxNumExecutors)")
    numExecutorsToAdd = 1
    return 0
  }

  val oldNumExecutorsTarget = numExecutorsTarget
  // There's no point in wasting time ramping up to the number of executors we already have, so
  // make sure our target is at least as much as our current allocation:
  numExecutorsTarget = math.max(numExecutorsTarget, executorIds.size)
  // Boost our target with the number to add for this round:
  numExecutorsTarget += numExecutorsToAdd
  // Ensure that our target doesn't exceed what we need at the present moment:
  numExecutorsTarget = math.min(numExecutorsTarget, maxNumExecutorsNeeded)
  // Ensure that our target fits within configured bounds:
  numExecutorsTarget = math.max(math.min(numExecutorsTarget, maxNumExecutors),

minNumExecutors)

此时executorIds的长度为0,集合是个空集合,这个时候numExecutorsToAdd的值为默认的1,根据上面的代码计算完成后(maxNumExecutorsNeeded为5就是tasks需要的executor的个数),numExecutorsTarget的值为1.接下来计算出来一个值,如果这次任务的目标executor的个数高于上次tasks启动的目标executor的个数,delta的值是一个大于0的值.根据上面的说明,下面代码中这个delta的值为1,
val delta = numExecutorsTarget - oldNumExecutorsTarget

  // If our target has not changed, do not send a message
  // to the cluster manager and reset our exponential growth
  if (delta == 0) {

如果delta等于0,表示这次的目标executor的个数,与上次任务的executor的个数相同,重置增量的个数为1.
    numExecutorsToAdd = 1
    return 0
  }
接下来,通过下面的代码通过SparkContext发起numExecutorsTarget的executor的启动,并在executor中加载对应的task的个数.

val addRequestAcknowledged = testing ||
client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks,

hostToLocalTaskCount)

接下来,由于我们的任务执行还需要的executor的个数还需要4个(共需要),同时这个时候,delta的值为1,与numExecutorsToAdd的属性值相同,因此numExecutorsToAdd的值会*2.

    numExecutorsToAdd = if (delta == numExecutorsToAdd) {
      numExecutorsToAdd * 2
    } else {
      1
    }

3,调度定时器开始执行第二次调度启动,这个时候执行updateAndSyncNumExecutorsTarget函数时,numExecutorsTarget的值为1,需要的executor的个数为3,因此,还是会执行时间超时的流程.

再次进入addExecutors函数,这个时候numExecutorsToAdd属性值为2,numExecutorsTarget属性值为1,executorsIds的size为1,已经有一个executor被启动,需要的executor的个数为3,最后计算完成后,numExecutorsTarget属性的值为3.计算出来当前的numExecutorsTarget与上一次的numExecutorsTarget的delta的值为2,开始根据这个值为3的numExecutorsTarget发起task的启动请求.

接下来,由于计算出来的delta的值为2,而属性numExecutorsToAdd的值也为2,

因此numExecutorsToAdd属性值现在还是需要*2操作.执行完成后,最后这个numExecutorsToAdd属性值修改成了4.

4,这个时候,由于还有部分task没有被执行,开始第三次的处理,此时,numExecutorTarget的值还是小于目标的executor的个数,接着执行addExecutors函数,此时,executorsIds的size为4,第一次执行一个,第二次启动了3个,这个时候,numExecutorsTarget的属性值变化情况:

首先先修改成4,(取numExecutorsTarget与executorIds.size中的最大值),

然后numExecutorsTarget += numExecutorsToAdd的值,这个时候值修改成了8.

最后与共需要的executor的个数5取最小值,把值修改成5.计算出当前的numExecutorsTarget与上一次的numExecutorsTarget的差值为2,numExecutorsToAdd的值为4,因此重新修改numExecutorsToAdd的值为1.

这个时候调度程序会修改addTime的值为NOT_SET,表示不在执行executor的调度.因为executor已经够了.

5,现在假定spark.dynamicAllocation.initialExecutors配置项配置有一个值,初始值为6.需要的executor的个数还是是5.这个时候,进入updateAndSyncNumExecutorsTarget函数时,执行如下的流程部分,因为初始的executor的个数大于了需要的executor的个数.这部分流程在设置有初始大小的executor个数或者说要执行的job的第二个stage的task的个数需要的executor的个数小于小次stage需要的executor的个数时,会被执行.

else if (maxNeeded < numExecutorsTarget) {
  // The target number exceeds the number we actually need, so stop adding new
  // executors and inform the cluster manager to cancel the extra pending requests
  val oldNumExecutorsTarget = numExecutorsTarget
  numExecutorsTarget = math.max(maxNeeded, minNumExecutors)
  numExecutorsToAdd = 1

  // If the new target has not changed, avoid sending a message to the cluster manager
  if (numExecutorsTarget < oldNumExecutorsTarget) {
    client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks,

hostToLocalTaskCount)
logDebug(s"Lowering target number of executors to $numExecutorsTarget

(previously " +
s"$oldNumExecutorsTarget) because not all requested executors are actually

      needed")
  }
  numExecutorsTarget - oldNumExecutorsTarget
}

在上面的代码中,重新根据需要的executor的个数,计算出numExecutorsTarget的值,这个时候,新的numExecutorsTarget的值为5,而老的numExecutorsTarget的值为6,因此通过新的numExecutorsTarget直接调用SparkContext中对应的启动executor的函数,发起对executor的调度与task的启动.

通过SparkContext调度executor
在allocationManager中,对executor进行动态的调用后,会执行如下的代码片断.

client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks,

hostToLocalTaskCount)

在上面的代码中,client就是SparkContext实例.

下面看看这个函数的处理流程:

函数的传入参数中:

numExecutors是目标的executor的个数,

第二个是共需要的task的个数,

第三个是host->taskCount的集合.

private[spark] override def requestTotalExecutors(
    numExecutors: Int,
    localityAwareTasks: Int,
    hostToLocalTaskCount: scala.collection.immutable.Map[String, Int]
  ): Boolean = {
  schedulerBackend match {
    case b: CoarseGrainedSchedulerBackend =>

这里直接调用了GoarseGrainedSchedulerBackend中对应的函数.
b.requestTotalExecutors(numExecutors, localityAwareTasks,

          hostToLocalTaskCount)
    case _ =>
      logWarning("Requesting executors is only supported in coarse-grained mode")
      false
  }
}

下面看看GoarseGrainedSchedulerBackend中requestTotalExecutors的函数实现:

final override def requestTotalExecutors(
    numExecutors: Int,
    localityAwareTasks: Int,
    hostToLocalTaskCount: Map[String, Int]
  ): Boolean = synchronized {
  if (numExecutors < 0) {
    throw new IllegalArgumentException(
      "Attempted to request a negative number of executor(s) " +
        s"$numExecutors from the cluster manager. Please specify a positive number!")
  }

this.localityAwareTasks = localityAwareTasks
this.hostToLocalTaskCount = hostToLocalTaskCount

每次执行时,计算出还需要的共需要的executor的个数与正在执行或者等待回收的executor的个数之间的差值,这个差值是还需要启动的executor的个数.
numPendingExecutors =
math.max(numExecutors - numExistingExecutors + executorsPendingToRemove.size,

这里根据具体的cluster的部署模式(yarn,standalone,mesos,等),调用对应的函数进行executor的启动操作.这里我们看看standalone的操作.由SparkDeploySchedulerBackend实现.

这个函数的实现主要是通过向master发送一个RequestExecutors消息,这个消息是一个需要响应的消息.

这个消息在Master中通过receiveAndReply函数中的RequestExecutors部分进行处理.
doRequestTotalExecutors(numExecutors)
}

Master中处理executor的申请:

case RequestExecutors(appId, requestedTotal) =>
context.reply(handleRequestExecutors(appId, requestedTotal))

看看这个的handleRequestExecutors函数

private def handleRequestExecutors(appId: String, requestedTotal: Int)

: Boolean = {
idToApp.get(appId) match {

这个函数中,根据传入的app对应的job共依赖的executor的个数,更新appInfo中executorLimit的值.并执行对executor的启动的调度.
case Some(appInfo) =>
logInfo(s"Application $appId requested to set total executors to

$requestedTotal.")
appInfo.executorLimit = requestedTotal

在这个调度的过程中通过startExecutorsOnWorkers函数来调度与启动executor在对应的worker中,

在判断要启动的executor的个数时,会根据scheduleExecutorsOnWorkers函数来判断executor的个数是否达到要求的appInfo.executorLimit的个数,如果达到指定的executor的个数时,调度不再执行executor的启动.判断worker是否有足够的资源启动executor时,通过对executor需要的cpu core的个数与executor需要的内存来判断worker是否有足够的对应资源启动executor,如果有,表示这个worker可以用来启动executor,迭代所有的worker进行executor的启动,当已经启动的executor的个数达到了appInfo的executorLimit的限制时,不在进行分配.
      schedule()
      true
    case None =>
      logWarning(s"Unknown application $appId requested

           $requestedTotal total executors.")
      false
  }
}

Driver端处理worker对executor的启动后的监听:

Worker端的executor启动后,会向driver发起一个RegisterExecutor消息.

这个消息在Driver端通过CoarseGrainedSchedulerBackend实例中的receiveAndReply函数中的RegisterExecutor部分来进行处理.

case RegisterExecutor(executorId, executorRef, hostPort, cores, logUrls) =>
  if (executorDataMap.contains(executorId)) {
    context.reply(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
  } else {
    // If the executor's rpc env is not listening for incoming connections, `hostPort`
    // will be null, and the client connection should be used to contact the executor.
    val executorAddress = if (executorRef.address != null) {
        executorRef.address
      } else {
        context.senderAddress
      }
    logInfo(s"Registered executor $executorRef ($executorAddress) with ID

$executorId")

这里把executor的信息存储到driver中对应的集合容器中.

并更新待启动的executor的个数(numPendingExecutors属性).
    addressToExecutorId(executorAddress) = executorId
    totalCoreCount.addAndGet(cores)
    totalRegisteredExecutors.addAndGet(1)
    val data = new ExecutorData(executorRef, executorRef.address,

      executorAddress.host,
      cores, cores, logUrls)
    // This must be synchronized because variables mutated
    // in this block are read when requesting executors
    CoarseGrainedSchedulerBackend.this.synchronized {
      executorDataMap.put(executorId, data)
      if (numPendingExecutors > 0) {
        numPendingExecutors -= 1
        logDebug(s"Decremented number of pending executors

             ($numPendingExecutors left)")
      }
    }
    // Note: some tests expect the reply to come after we put the executor in the map
    context.reply(RegisteredExecutor(executorAddress.host))

这里通过向allocationManager中的ExecutorAllocationListener中executorAdded的监听函数.
listenerBus.post(
SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))

这里在启动后的executors中对task进行调度与启动.
makeOffers()
}

在ExecutorAllocationManager中,对executor注册,task启动与结束的监听处理:

对executor的注册处理:

private def onExecutorAdded(executorId: String): Unit = synchronized {
if (!executorIds.contains(executorId)) {

把新启动的executor对应的id存储到 executorIds的集合中.
executorIds.add(executorId)

这里检查下所有的executor是否包含有task在运行,如果不包含的executor,添加到remoteTimes的集合中,这个remoteTimes中对应此executor有一个超时的清理时间,

如果executor不包含blockCache,

超时清理时间通过spark.dynamicAllocation.executorIdleTimeout配置.默认为60s

如果executor中包含有blockCache时,

    超时清理时间通过spark.dynamicAllocation.cachedExecutorIdleTimeout配置.默认不控制超时.
    // If an executor (call this executor X) is not removed because the lower bound
    // has been reached, it will no longer be marked as idle. When new executors join,
    // however, we are no longer at the lower bound,

and so we must mark executor X
// as idle again so as not to forget that it is a candidate for removal.

         (see SPARK-4951)
    executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
    logInfo(s"New executor $executorId has registered (new total is

        ${executorIds.size})")
  } else {
    logWarning(s"Duplicate executor $executorId has registered")
  }
}

对task启动时的监听:

override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
  val stageId = taskStart.stageId
  val taskId = taskStart.taskInfo.taskId
  val taskIndex = taskStart.taskInfo.index
  val executorId = taskStart.taskInfo.executorId

allocationManager.synchronized {

首先更新正在运行的task的个数,
    numRunningTasks += 1
    // This guards against the race condition in which the `SparkListenerTaskStart`
    // event is posted before the `SparkListenerBlockManagerAdded` event, which is
    // possible because these events are posted in different threads. (see SPARK-4951)
    if (!allocationManager.executorIds.contains(executorId)) {
      allocationManager.onExecutorAdded(executorId)
    }

在stage与task的对应关系的集合中,设置stageId中包含的taskId的值.
// If this is the last pending task, mark the scheduler queue as empty
stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) +=

taskIndex

这里检查的stageid中对应的所有的正在运行的task的个数是否达到了stage需要运行的task的个数,

如果达到了需要运行的task的个数,设置allocationManager中addTime属性的值为NOT_SET.同时重置属性numExecutorsToAdd的值为1.
    if (totalPendingTasks() == 0) {
      allocationManager.onSchedulerQueueEmpty()
    }

把task运行的executor的对应关系存储到对应的集合中,这个存储的集合用于判断executor是否是闲置的状态.
// Mark the executor on which this task is scheduled as busy
executorIdToTaskIds.getOrElseUpdate(executorId, new mutable.HashSet[Long]) +=

taskId

由于executorId已经有task在运行,把这个executor从要移出的executor集合中移出

从removeTimes集合中移出,这个集合中存储了executor的闲置回收超时时间.

从executorsPendingToRemove集合中移出,这个集合中存储了准备执行移出操作的executor的集合.
allocationManager.onExecutorBusy(executorId)
}
}

对task停止时的监听:

override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
  val executorId = taskEnd.taskInfo.executorId
  val taskId = taskEnd.taskInfo.taskId
  val taskIndex = taskEnd.taskInfo.index
  val stageId = taskEnd.stageId
  allocationManager.synchronized {

把对应的运行的TASK的个数减去1,表示有一个task完成运行.
numRunningTasks -= 1

从对应的executor的关系集合中移出此task,如果这个executor已经没有了正在运行的task时,直接把这个executor执行idle的处理,也就是移动到removeTimes集合中.
    // If the executor is no longer running any scheduled tasks, mark it as idle
    if (executorIdToTaskIds.contains(executorId)) {
      executorIdToTaskIds(executorId) -= taskId
      if (executorIdToTaskIds(executorId).isEmpty) {
        executorIdToTaskIds -= executorId
        allocationManager.onExecutorIdle(executorId)
      }
    }

    // If the task failed, we expect it to be resubmitted later. To ensure we have
    // enough resources to run the resubmitted task, we need to mark the scheduler
    // as backlogged again if it's not already marked as such (SPARK-8366)
    if (taskEnd.reason != Success) {
      if (totalPendingTasks() == 0) {
        allocationManager.onSchedulerBacklogged()
      }
      stageIdToTaskIndices.get(stageId).foreach { _.remove(taskIndex) }
    }
  }
}

对stage停止时的监听:

override def onStageCompleted(stageCompleted: SparkListenerStageCompleted)

: Unit = {
val stageId = stageCompleted.stageInfo.stageId
allocationManager.synchronized {

从对应的集合中移出此stage.
    stageIdToNumTasks -= stageId
    stageIdToTaskIndices -= stageId
    stageIdToExecutorPlacementHints -= stageId

更新allocationManager中对应各个job的stage中task与location的对应关系.
// Update the executor placement hints
updateExecutorPlacementHints()

所有的job都运行完成,暂停调度.
    // If this is the last stage with pending tasks, mark the scheduler queue as empty
    // This is needed in case the stage is aborted for any reason
    if (stageIdToNumTasks.isEmpty) {
      allocationManager.onSchedulerQueueEmpty()
      if (numRunningTasks != 0) {
        logWarning("No stages are running, but numRunningTasks != 0")
        numRunningTasks = 0
      }
    }
  }
}

在调度过程中发现executor超时需要移出时的处理:

private def removeExecutor(executorId: String): Boolean = synchronized {
  // Do not kill the executor if we are not aware of it (should never happen)
  if (!executorIds.contains(executorId)) {
    logWarning(s"Attempted to remove unknown executor $executorId!")
    return false
  }

// Do not kill the executor again if it is already pending to be killed

      (should never happen)
  if (executorsPendingToRemove.contains(executorId)) {
    logWarning(s"Attempted to remove executor $executorId " +
      s"when it is already pending to be removed!")
    return false
  }

这里检查下现在存在的executor的个数是否小于了配置的最小的executor的个数,如果不小于才执行移出操作.
  // Do not kill the executor if we have already reached the lower bound
  val numExistingExecutors = executorIds.size - executorsPendingToRemove.size
  if (numExistingExecutors - 1 < minNumExecutors) {
    logDebug(s"Not removing idle executor $executorId because there are only " +
      s"$numExistingExecutors executor(s) left (limit $minNumExecutors)")
    return false
  }

这里发起对executor的停止操作.并把这个发起停止后的executor添加到executorsPendingToRemov集合中.等待回调此allocationManager中listener中的onExecutorRemoved函数.
  // Send a request to the backend to kill this executor
  val removeRequestAcknowledged = testing || client.killExecutor(executorId)
  if (removeRequestAcknowledged) {
    logInfo(s"Removing executor $executorId because it has been idle for " +
      s"$executorIdleTimeoutS seconds (new desired total will be

           ${numExistingExecutors - 1})")
    executorsPendingToRemove.add(executorId)
    true
  } else {
    logWarning(s"Unable to reach the cluster manager to kill executor $executorId!")
    false
  }
}

allocationManager中监听executor的移出操作:

override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved)

: Unit = {
allocationManager.onExecutorRemoved(executorRemoved.executorId)
}

其实这个函数就是从对应的集合中移出此executor.

private def onExecutorRemoved(executorId: String): Unit = synchronized {
  if (executorIds.contains(executorId)) {
    executorIds.remove(executorId)
    removeTimes.remove(executorId)
    logInfo(s"Existing executor $executorId has been removed (new total is

           ${executorIds.size})")
    if (executorsPendingToRemove.contains(executorId)) {
      executorsPendingToRemove.remove(executorId)
      logDebug(s"Executor $executorId is no longer pending to " +
        s"be removed (${executorsPendingToRemove.size} left)")
    }
  } else {
    logWarning(s"Unknown executor $executorId has been removed!")
  }
}

转载于:https://my.oschina.net/hblt147/blog/2876605