Spark HeartbeatReceiver Source Code Analysis


HeartbeatReceiver is an RpcEndpoint that exists only on the driver. Its main job is to periodically check whether every executor registered with this driver is still alive.
Let's look at the source, starting with a bit of context on how the driver wires this endpoint up.
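It is SparkContext that registers this endpoint on the driver's RpcEnv, before the TaskScheduler exists, and later tells it about the TaskScheduler. Roughly (an abridged sketch of SparkContext; the exact lines vary between Spark versions):

// In SparkContext, early during initialization, before the TaskScheduler is created:
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// ... later, once _taskScheduler exists:
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

The TaskSchedulerIsSet message is what populates the scheduler field we will see below. With that in mind, here is the class.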

class HeartbeatReceiver

private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
  extends SparkListener with ThreadSafeRpcEndpoint with Logging {

  def this(sc: SparkContext) {
    this(sc, new SystemClock)
  }

  sc.listenerBus.addToManagementQueue(this)

  override val rpcEnv: RpcEnv = sc.env.rpcEnv

  private[spark] var scheduler: TaskScheduler = null

  // executor ID -> timestamp of when the last heartbeat from this executor was received
  private val executorLastSeen = new mutable.HashMap[String, Long] // last heartbeat timestamp seen for each executor

  // "spark.network.timeout" uses "seconds", while `spark.storage.blockManagerSlaveTimeoutMs` uses
  // "milliseconds"
  private val slaveTimeoutMs =
    sc.conf.getTimeAsMs("spark.storage.blockManagerSlaveTimeoutMs", "120s") // block manager slave timeout, also the default for the executor timeout
  private val executorTimeoutMs =
    sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000 // executor heartbeat timeout

  // "spark.network.timeoutInterval" uses "seconds", while
  // "spark.storage.blockManagerTimeoutIntervalMs" uses "milliseconds"
  private val timeoutIntervalMs =
    sc.conf.getTimeAsMs("spark.storage.blockManagerTimeoutIntervalMs", "60s") // block manager timeout check interval, also the default for the heartbeat check interval
  private val checkTimeoutIntervalMs =
    sc.conf.getTimeAsSeconds("spark.network.timeoutInterval", s"${timeoutIntervalMs}ms") * 1000 // how often to run the ExpireDeadHosts check

  private var timeoutCheckingTask: ScheduledFuture[_] = null

  // "eventLoopThread" is used to run some pretty fast actions. The actions running in it should not
  // block the thread for a long time.
  private val eventLoopThread =
    ThreadUtils.newDaemonSingleThreadScheduledExecutor("heartbeat-receiver-event-loop-thread") // thread on which the driver checks executor heartbeats

  private val killExecutorThread = ThreadUtils.newDaemonSingleThreadExecutor("kill-executor-thread") // thread that kills executors whose heartbeats have timed out

  override def onStart(): Unit = { // schedule the periodic ExpireDeadHosts check
    timeoutCheckingTask = eventLoopThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        Option(self).foreach(_.ask[Boolean](ExpireDeadHosts)) // handled by the ExpireDeadHosts case in receiveAndReply
      }
    }, 0, checkTimeoutIntervalMs, TimeUnit.MILLISECONDS)
  }

  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {

    // Messages sent and received locally
    case ExecutorRegistered(executorId) => // an executor was added
      executorLastSeen(executorId) = clock.getTimeMillis() // record its first heartbeat timestamp
      context.reply(true)
    case ExecutorRemoved(executorId) => // an executor was removed
      executorLastSeen.remove(executorId) // forget its last heartbeat timestamp
      context.reply(true)
    case TaskSchedulerIsSet =>
      scheduler = sc.taskScheduler // stash the TaskScheduler reference on this endpoint
      context.reply(true)
    case ExpireDeadHosts => // periodic check for executors whose heartbeats have expired
      expireDeadHosts() // checks for timed-out executors and handles them
      context.reply(true)

    // Messages received from executors: handle an executor's heartbeat
    case heartbeat @ Heartbeat(executorId, accumUpdates, blockManagerId) =>
      if (scheduler != null) { // the TaskScheduler is created later; SparkContext's _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet) is what sets it
        if (executorLastSeen.contains(executorId)) { // this executor has reported before
          executorLastSeen(executorId) = clock.getTimeMillis() // refresh its last heartbeat timestamp
          eventLoopThread.submit(new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              // If unknownExecutor ends up true, the executor will re-register its BlockManager;
              // see org.apache.spark.executor.Executor#reportHeartBeat.
              // If the driver's BlockManagerMasterEndpoint already knows this blockManagerId,
              // unknownExecutor is false and the executor does not need to re-register its BlockManager.
              val unknownExecutor: Boolean = !scheduler.executorHeartbeatReceived(
                executorId, accumUpdates, blockManagerId)
              val response = HeartbeatResponse(reregisterBlockManager = unknownExecutor)
              context.reply(response)
            }
          })
        } else { // this executor is not (or no longer) registered with the driver
          // This may happen if we get an executor's in-flight heartbeat immediately
          // after we just removed it. It's not really an error condition so we should
          // not log warning here. Otherwise there may be a lot of noise especially if
          // we explicitly remove executors (SPARK-4134).
          logDebug(s"Received heartbeat from unknown executor $executorId")
          context.reply(HeartbeatResponse(reregisterBlockManager = true)) // ask the executor to re-register its BlockManager
        }
      } else { // the TaskScheduler has not been set on this endpoint yet
        // Because Executor will sleep several seconds before sending the first "Heartbeat", this
        // case rarely happens. However, if it really happens, log it and ask the executor to
        // register itself again.
        logWarning(s"Dropping $heartbeat because TaskScheduler is not ready yet")
        context.reply(HeartbeatResponse(reregisterBlockManager = true))
      }
  }
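  // Summary of the Heartbeat case above:
  //   - scheduler not set yet               -> reply reregisterBlockManager = true
  //   - executorId not in executorLastSeen  -> reply reregisterBlockManager = true
  //   - known executor                      -> refresh executorLastSeen and reply with
  //     reregisterBlockManager = !scheduler.executorHeartbeatReceived(...)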

  /**
   * Send ExecutorRegistered to the event loop to add a new executor. Only for test.
   *
   * @return if HeartbeatReceiver is stopped, return None. Otherwise, return a Some(Future) that
   *         indicate if this operation is successful.
   */
  // add an executor
  def addExecutor(executorId: String): Option[Future[Boolean]] = {
    Option(self).map(_.ask[Boolean](ExecutorRegistered(executorId)))
  }

  /**
   * If the heartbeat receiver is not stopped, notify it of executor registrations.
   */
  // SparkListener callback: an executor was added
  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = {
    addExecutor(executorAdded.executorId)
  }

  /**
   * Send ExecutorRemoved to the event loop to remove an executor. Only for test.
   *
   * @return if HeartbeatReceiver is stopped, return None. Otherwise, return a Some(Future) that
   *         indicate if this operation is successful.
   */
  // remove an executor
  def removeExecutor(executorId: String): Option[Future[Boolean]] = {
    Option(self).map(_.ask[Boolean](ExecutorRemoved(executorId)))
  }

  /**
   * If the heartbeat receiver is not stopped, notify it of executor removals so it doesn't
   * log superfluous errors.
   *
   * Note that we must do this after the executor is actually removed to guard against the
   * following race condition: if we remove an executor's metadata from our data structure
   * prematurely, we may get an in-flight heartbeat from the executor before the executor is
   * actually removed, in which case we will still mark the executor as a dead host later
   * and expire it with loud error messages.
   */
  // SparkListener callback: an executor was removed
  override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit = {
    removeExecutor(executorRemoved.executorId)
  }
  // checks which executors have timed out and handles the expired ones
  private def expireDeadHosts(): Unit = {
    logTrace("Checking for hosts with no recent heartbeats in HeartbeatReceiver.")
    val now = clock.getTimeMillis()
    for ((executorId, lastSeenMs) <- executorLastSeen) { // executorLastSeen holds each executor's most recent heartbeat time
      if (now - lastSeenMs > executorTimeoutMs) { // this executor has exceeded the configured timeout
        logWarning(s"Removing executor $executorId with no recent heartbeats: " +
          s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
        // let the TaskScheduler run its executor-lost handling
        scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " +
          s"timed out after ${now - lastSeenMs} ms"))
        // Asynchronously kill the executor to avoid blocking the current thread
        killExecutorThread.submit(new Runnable { // the kill-executor thread kills and replaces this executor
          override def run(): Unit = Utils.tryLogNonFatalError {
            // Note: we want to get an executor back after expiring this one,
            // so do not simply call `sc.killExecutor` here (SPARK-8119)
            sc.killAndReplaceExecutor(executorId) // delegates to the scheduler backend
          }
        })
        executorLastSeen.remove(executorId) // drop this executor from executorLastSeen
      }
    }
  }

  override def onStop(): Unit = {
    if (timeoutCheckingTask != null) {
      timeoutCheckingTask.cancel(true)
    }
    eventLoopThread.shutdownNow()
    killExecutorThread.shutdownNow()
  }
}
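The executor side of this protocol lives in org.apache.spark.executor.Executor#reportHeartBeat, which the comments above reference. As a rough, abridged sketch (from memory, not a verbatim quote; details vary between Spark versions), each executor periodically does something like:

// abridged sketch of Executor#reportHeartBeat -- consult the real source for details
val message = Heartbeat(executorId, accumUpdates.toArray, env.blockManager.blockManagerId)
val response = heartbeatReceiverRef.askSync[HeartbeatResponse](message)
if (response.reregisterBlockManager) {
  // the driver replied reregisterBlockManager = true, so re-register the BlockManager
  env.blockManager.reregister()
}

So a reply of reregisterBlockManager = true (unknown executor, or TaskScheduler not ready yet) is what makes an executor re-register its BlockManager with the driver.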

object HeartbeatReceiver

private[spark] object HeartbeatReceiver {
  val ENDPOINT_NAME = "HeartbeatReceiver"
}
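To make the expiry mechanism concrete, here is a hypothetical, self-contained Scala sketch of the same pattern: a map of last-seen timestamps plus a scheduled task that expires stale entries. The object name, timeouts, and println are made up for illustration; this is not Spark code.

import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

object HeartbeatExpiryDemo {
  private val lastSeen = TrieMap.empty[String, Long] // executor id -> last heartbeat (ms), like executorLastSeen
  private val timeoutMs = 5000L                      // plays the role of executorTimeoutMs
  private val checkIntervalMs = 1000L                // plays the role of checkTimeoutIntervalMs

  def heartbeat(executorId: String): Unit =
    lastSeen(executorId) = System.currentTimeMillis()

  def main(args: Array[String]): Unit = {
    val checker = Executors.newSingleThreadScheduledExecutor()
    checker.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val now = System.currentTimeMillis()
        for ((id, seen) <- lastSeen if now - seen > timeoutMs) {
          // In Spark this is where scheduler.executorLost(...) and
          // sc.killAndReplaceExecutor(...) would be triggered.
          println(s"executor $id expired: ${now - seen} ms since last heartbeat")
          lastSeen.remove(id)
        }
      }
    }, 0, checkIntervalMs, TimeUnit.MILLISECONDS)

    heartbeat("1") // one heartbeat, then silence: "1" is expired after roughly 5 seconds
    Thread.sleep(8000)
    checker.shutdownNow()
  }
}

On a real cluster the knobs that matter here are spark.network.timeout (the heartbeat timeout, 120s by default) and spark.executor.heartbeatInterval (how often executors send Heartbeat, 10s by default); the interval should stay much smaller than the timeout, otherwise healthy executors get expired.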