Spark Source Code: Job Submission Flow, Part 5: CoarseGrainedExecutorBackend

zdaiqing

Last modified: 2022-08-02 15:05:07

Categories: Spark, Big Data, Source Code. Tags: spark, big data, distributed

First published: 2022-07-28 16:16:28

Article link: https://blog.csdn.net/m0_37817767/article/details/126037618

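1. Overview

Part 4 of this series (job submission flow: starting the executor in a container) showed that after the AM obtains resources from the RM, it iterates over the allocated containers and asks the NM to start an org.apache.spark.executor.CoarseGrainedExecutorBackend process in each container via the /bin/java command.

This article analyzes the execution flow inside org.apache.spark.executor.CoarseGrainedExecutorBackend.

2. Entry point

Once the CoarseGrainedExecutorBackend process has been created via /bin/java, execution starts at its main method.

main parses the arguments. If a required argument is missing, it terminates the JVM; if the arguments are complete, it calls run to continue.

The CoarseGrainedExecutorBackend class extends the ThreadSafeRpcEndpoint trait, which in turn extends the RpcEndpoint trait.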

private[spark] object CoarseGrainedExecutorBackend extends Logging {
  def main(args: Array[String]) {
    var driverUrl: String = null
    var executorId: String = null
    var hostname: String = null
    var cores: Int = 0
    var appId: String = null
    var workerUrl: Option[String] = None
    val userClassPath = new mutable.ListBuffer[URL]()
    // Parse the command-line arguments
    var argv = args.toList
    while (!argv.isEmpty) {
      argv match {
        case ("--driver-url") :: value :: tail =>
          driverUrl = value
          argv = tail
        case ("--executor-id") :: value :: tail =>
          executorId = value
          argv = tail
        case ("--hostname") :: value :: tail =>
          hostname = value
          argv = tail
        case ("--cores") :: value :: tail =>
          cores = value.toInt
          argv = tail
        case ("--app-id") :: value :: tail =>
          appId = value
          argv = tail
        case ("--worker-url") :: value :: tail =>
          // Worker url is used in spark standalone mode to enforce fate-sharing with worker
          workerUrl = Some(value)
          argv = tail
        case ("--user-class-path") :: value :: tail =>
          userClassPath += new URL(value)
          argv = tail
        case Nil =>
        case tail =>
          // scalastyle:off println
          System.err.println(s"Unrecognized options: ${tail.mkString(" ")}")
          // scalastyle:on println
          printUsageAndExit()
      }
    }

    if (hostname == null) {
      hostname = Utils.localHostName()
      log.info(s"Executor hostname is not provided, will use '$hostname' to advertise itself")
    }
    
    // A required argument is missing: print usage and terminate the JVM
    if (driverUrl == null || executorId == null || cores <= 0 || appId == null) {
      printUsageAndExit()
    }
    // Call run to continue
    run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath)
    System.exit(0)
  }
}
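
The parsing loop in main can be reduced to a small recursive function. The following is a minimal sketch, not Spark code: BackendArgs, parse, and isComplete are hypothetical names introduced for illustration, and only four of the options are modeled.

```scala
object ArgParsingSketch {
  // Hypothetical container for the parsed options (not a Spark class).
  final case class BackendArgs(
      driverUrl: Option[String] = None,
      executorId: Option[String] = None,
      cores: Int = 0,
      appId: Option[String] = None)

  // Consume the argument list two tokens at a time.
  // Left carries the unrecognized remainder, like the "Unrecognized options" branch in main.
  @annotation.tailrec
  def parse(argv: List[String], acc: BackendArgs = BackendArgs()): Either[String, BackendArgs] =
    argv match {
      case "--driver-url" :: value :: tail => parse(tail, acc.copy(driverUrl = Some(value)))
      case "--executor-id" :: value :: tail => parse(tail, acc.copy(executorId = Some(value)))
      case "--cores" :: value :: tail => parse(tail, acc.copy(cores = value.toInt))
      case "--app-id" :: value :: tail => parse(tail, acc.copy(appId = Some(value)))
      case Nil => Right(acc)
      case unknown => Left(s"Unrecognized options: ${unknown.mkString(" ")}")
    }

  // Same completeness rule as main: every required option is present and cores > 0.
  def isComplete(a: BackendArgs): Boolean =
    a.driverUrl.isDefined && a.executorId.isDefined && a.cores > 0 && a.appId.isDefined
}
```

Returning an Either instead of calling System.exit keeps the sketch testable; the real main exits the JVM in both failure cases.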

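3. run

run assembles the configuration, builds the SparkEnv, registers a CoarseGrainedExecutorBackend instance, and then blocks the main thread so that the process stays alive while the dispatcher delivers messages to that instance.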

private[spark] object CoarseGrainedExecutorBackend extends Logging {

  private def run(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      workerUrl: Option[String],
      userClassPath: Seq[URL]) {

    // Initialize the daemon process
    Utils.initDaemon(log)

    // Run as the Spark user
    SparkHadoopUtil.get.runAsSparkUser { () =>
      // Validate the hostname format
      Utils.checkHost(hostname)

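      // Create the bootstrap configuration for the executor
      val executorConf = new SparkConf
      // Create a temporary RPC environment, used only to fetch properties from the driver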
      val fetcher = RpcEnv.create(
        "driverPropsFetcher",
        hostname,
        -1,
        executorConf,
        new SecurityManager(executorConf),
        clientMode = true)
      // Build an RPC reference to the driver endpoint from the --driver-url argument
      val driver = fetcher.setupEndpointRefByURI(driverUrl)
      // Fetch SparkAppConfig from the driver
      val cfg = driver.askSync[SparkAppConfig](RetrieveSparkAppConfig)
      // Take the Spark properties from SparkAppConfig and append the application id
      val props = cfg.sparkProperties ++ Seq[(String, String)](("spark.app.id", appId))
      // Shut down the temporary RPC environment
      fetcher.shutdown()

      // Copy the properties fetched from the driver into a SparkConf
      val driverConf = new SparkConf()
      for ((key, value) <- props) {
        // this is required for SSL in standalone mode
        if (SparkConf.isExecutorStartupConf(key)) {
          driverConf.setIfMissing(key, value)
        } else {
          driverConf.set(key, value)
        }
      }
      // Add the delegation tokens obtained from the driver
      cfg.hadoopDelegationCreds.foreach { tokens =>
        SparkHadoopUtil.get.addDelegationTokens(tokens, driverConf)
      }

      // Create the executor's SparkEnv from the parameters in the SparkConf.
      // This also initializes the env's rpcEnv field with a NettyRpcEnv instance.
      val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)

      // Build a CoarseGrainedExecutorBackend instance and register it with the dispatcher under the name Executor
      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
        env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
      workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
      }
      // Block the main thread until rpcEnv terminates: the wait is on the dispatcher's thread pool, which keeps the main thread from exiting
      env.rpcEnv.awaitTermination()
    }
  }
}
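
The loop that copies driver properties into driverConf treats executor startup settings differently from everything else: a startup setting already present locally is kept (setIfMissing), while any other key takes the driver's value (set). A minimal sketch of that merge rule on plain maps follows. The isStartupConf predicate is an assumed stand-in for SparkConf.isExecutorStartupConf, and the prefixes it checks are illustrative only.

```scala
object ConfMergeSketch {
  // Assumed stand-in for SparkConf.isExecutorStartupConf; these prefixes are illustrative.
  def isStartupConf(key: String): Boolean =
    key.startsWith("spark.ssl") || key.startsWith("spark.rpc")

  // local: values the executor already has; fromDriver: properties fetched from the driver.
  def merge(local: Map[String, String], fromDriver: Seq[(String, String)]): Map[String, String] =
    fromDriver.foldLeft(local) { case (conf, (key, value)) =>
      if (isStartupConf(key) && conf.contains(key)) conf // setIfMissing: keep the local value
      else conf.updated(key, value)                      // set: the driver's value wins
    }
}
```

For example, merging a local spark.ssl.enabled=true with a driver-side spark.ssl.enabled=false leaves the local value in place, while an ordinary key such as spark.app.name is taken from the driver.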

3.1. Registering the backend with the dispatcher

NettyRpcEnv is the implementation of RpcEnv.

Its setupEndpoint method delegates to the dispatcher's registerRpcEndpoint method.

private[netty] class NettyRpcEnv(
    val conf: SparkConf,
    javaSerializerInstance: JavaSerializerInstance,
    host: String,
    securityManager: SecurityManager,
    numUsableCores: Int) extends RpcEnv(conf) with Logging {
  
  private val dispatcher: Dispatcher = new Dispatcher(this, numUsableCores)
  
  override def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef = {
    dispatcher.registerRpcEndpoint(name, endpoint)
  }
}

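3.1.1. Registering an RPC endpoint in the dispatcher

The RPC endpoint here is a CoarseGrainedExecutorBackend instance.

EndpointData wraps the endpoint information: name, endpoint, endpoint reference, and the inbox bound to the endpoint.

The dispatcher registers the endpoint by caching the name-to-EndpointData mapping in a ConcurrentHashMap.

The EndpointData is also placed in a LinkedBlockingQueue that serves as the queue of receivers the dispatcher delivers messages to, comparable to a recipient list in a mail system.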

private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {
  def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
    // Address identifier of the endpoint: nettyEnv.address holds the host and port, name is the endpoint name
    val addr = RpcEndpointAddress(nettyEnv.address, name)
    // Reference to the RPC endpoint
    val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
    synchronized {
      if (stopped) {
        throw new IllegalStateException("RpcEnv has been stopped")
      }
      // Register the RPC endpoint (the CoarseGrainedExecutorBackend instance) with the dispatcher:
      // the inner class EndpointData wraps the endpoint, and a ConcurrentHashMap caches it by name
      if (endpoints.putIfAbsent(name, new EndpointData(name, endpoint, endpointRef)) != null) {
        throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
      }
      // Cache the mapping from endpoint to endpoint reference
      val data = endpoints.get(name)
      endpointRefs.put(data.endpoint, data.ref)
      // Enqueue the EndpointData in the receivers LinkedBlockingQueue so that a message loop will process its inbox
      receivers.offer(data)  // for the OnStart message
    }
    endpointRef
  }
}
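
The registration step relies on two standard JDK structures: ConcurrentHashMap.putIfAbsent to reject duplicate names, and a LinkedBlockingQueue to hand the new endpoint to the message loops. The sketch below isolates that pattern. EndpointRegistrySketch and Entry are hypothetical names, and Entry keeps only the name rather than the endpoint, reference, and inbox that EndpointData holds.

```scala
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

final class EndpointRegistrySketch {
  // Hypothetical stand-in for EndpointData: only the name is kept here.
  final class Entry(val name: String)

  private val endpoints = new ConcurrentHashMap[String, Entry]()
  private val receivers = new LinkedBlockingQueue[Entry]()

  def register(name: String): Entry = {
    val entry = new Entry(name)
    // putIfAbsent returns the previous value, so a non-null result means the name is taken.
    if (endpoints.putIfAbsent(name, entry) != null) {
      throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
    }
    // Queue the endpoint so that a message loop picks it up and processes its first message.
    receivers.offer(entry)
    entry
  }

  def pendingReceivers: Int = receivers.size()
}
```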

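3.1.1.1. EndpointData: the endpoint wrapper class

It wraps the endpoint information: name, endpoint, endpoint reference, and the inbox bound to the endpoint.

The inbox stores messages for an RpcEndpoint and delivers them to it in a thread-safe way.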

private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {

  private class EndpointData(
      val name: String,
      val endpoint: RpcEndpoint,
      val ref: NettyRpcEndpointRef) {
    // Bind an inbox to the endpoint
    val inbox = new Inbox(ref, endpoint)
  }
}

3.2. Blocking in rpcEnv

NettyRpcEnv#awaitTermination delegates to the dispatcher's awaitTermination method.

private[netty] class NettyRpcEnv(
    val conf: SparkConf,
    javaSerializerInstance: JavaSerializerInstance,
    host: String,
    securityManager: SecurityManager,
    numUsableCores: Int) extends RpcEnv(conf) with Logging {
  // The dispatcher field is initialized when NettyRpcEnv is instantiated, which creates the Dispatcher
  private val dispatcher: Dispatcher = new Dispatcher(this, numUsableCores)
  
  override def awaitTermination(): Unit = {
    dispatcher.awaitTermination()
  }
}

3.2.1. Blocking in the dispatcher

The dispatcher relies on the thread pool's own awaitTermination to block.

private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {
  // Thread pool
  private val threadpool: ThreadPoolExecutor = {
    //.......
  }
  
  def awaitTermination(): Unit = {
    // Block until the thread pool terminates
    threadpool.awaitTermination(Long.MaxValue, TimeUnit.MILLISECONDS)
  }
}

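3.2.1.1. How the thread pool is initialized

While the dispatcher's thread pool is being instantiated, one message-loop thread is started per pool thread to dispatch messages.

The pool is instantiated after the CoarseGrainedExecutorBackend process starts, during run, as part of creating the executor's SparkEnv from the parameters in SparkConf.

The chain is: the CoarseGrainedExecutorBackend process starts and executes run; run creates the executor's SparkEnv; creating the SparkEnv initializes its rpcEnv field with a new NettyRpcEnv; instantiating NettyRpcEnv initializes its dispatcher field with new Dispatcher; and instantiating Dispatcher initializes its threadpool field. At that point the dispatcher's thread pool is ready.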

private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {
  // Thread pool used to dispatch messages.
  // It is initialized when the Dispatcher is instantiated, so the message loops start running at that point.
  private val threadpool: ThreadPoolExecutor = {
    // Determine the number of threads in the pool
    val availableCores =
      if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
    val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
      math.max(2, availableCores))
    // Create the thread pool
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
    
    // Start one message loop per thread to dispatch messages
    for (i <- 0 until numThreads) {
      pool.execute(new MessageLoop)
    }
    pool
  }
  
}
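
The thread-count rule in the listing above can be written as a pure function. In this sketch the function name and the hostCores parameter are introduced for illustration; configured stands for an explicit spark.rpc.netty.dispatcher.numThreads value, if one is set.

```scala
object DispatcherThreadsSketch {
  def numThreads(
      numUsableCores: Int,
      configured: Option[Int],
      hostCores: Int = Runtime.getRuntime.availableProcessors()): Int = {
    // Use the cores granted to the executor when given, otherwise every core on the host.
    val availableCores = if (numUsableCores > 0) numUsableCores else hostCores
    // An explicit setting wins; otherwise at least two threads are used.
    configured.getOrElse(math.max(2, availableCores))
  }
}
```

With a single usable core and no explicit setting the result is 2, so a one-core executor still gets two message loops.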

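3.2.1.1.1. MessageLoop: the message-loop thread

MessageLoop is an inner class of the dispatcher.

Each loop repeatedly takes an endpoint from the receivers queue and processes the messages in the inbox bound to that endpoint.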

private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {
  private class MessageLoop extends Runnable {
    override def run(): Unit = {
      try {
        while (true) {
          try {
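            // Take the next receiver (an endpoint with pending messages) from the receivers queue; take() blocks while the queue is empty
            val data = receivers.take()
            // PoisonPill signals that the dispatcher is stopping: exit the loop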
            if (data == PoisonPill) {
              // Put PoisonPill back so that other MessageLoops can see it.
              receivers.offer(PoisonPill)
              return
            }
            // Call process on the endpoint's inbox; the first message it handles is the OnStart added when the inbox was created
            data.inbox.process(Dispatcher.this)
          } catch {
            case NonFatal(e) => logError(e.getMessage, e)
          }
        }
      } catch {
        case _: InterruptedException => // exit
        case t: Throwable =>
          try {
            // Re-submit a MessageLoop so that Dispatcher will still work if
            // UncaughtExceptionHandler decides to not kill JVM.
            threadpool.execute(new MessageLoop)
          } finally {
            throw t
          }
      }
    }
  }
  
  // A poison-pill endpoint that tells a MessageLoop to exit its message loop
  private val PoisonPill = new EndpointData(null, null, null)
}
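
The poison-pill shutdown used by MessageLoop works with any blocking queue shared by several workers: the first worker that sees the sentinel puts it back before exiting, so every other worker sees it too. A minimal sketch under simplified assumptions follows. The queue holds plain strings instead of EndpointData, and runLoops is a hypothetical helper.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object PoisonPillSketch {
  // A unique sentinel instance, compared by reference so that no real item can match it.
  private val PoisonPill = new String("poison-pill")

  // Starts numThreads loops, feeds them the items, then stops them with one poison pill.
  // Returns how many items were processed.
  def runLoops(numThreads: Int, items: Seq[String]): Int = {
    val queue = new LinkedBlockingQueue[String]()
    val processed = new AtomicInteger(0)
    val pool = Executors.newFixedThreadPool(numThreads)
    for (_ <- 0 until numThreads) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          var running = true
          while (running) {
            val item = queue.take() // blocks while the queue is empty
            if (item eq PoisonPill) {
              queue.offer(PoisonPill) // put it back so the other loops can see it
              running = false
            } else {
              processed.incrementAndGet()
            }
          }
        }
      })
    }
    items.foreach(item => queue.offer(item))
    queue.offer(PoisonPill)
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    processed.get()
  }
}
```

A single sentinel is enough for any number of loops because it is never consumed, only passed along.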

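3.2.1.1.2. Inbox.process: how the inbox handles messages

Inbox instantiation: when EndpointData wraps the endpoint information, it creates an Inbox and assigns it to its inbox field.

When the Inbox is instantiated, it creates a message queue to buffer messages and then adds an OnStart message to that queue.

When the dispatcher is instantiated and its thread pool is initialized, message-loop threads are started according to the pool's thread limit and their run methods begin executing. During run, the process method of an endpoint's inbox is called; the first message it handles is the first one that was added, namely OnStart.

Handling OnStart invokes CoarseGrainedExecutorBackend's onStart method. The switch that lets multiple threads process an inbox concurrently is turned on only for endpoints that are not a ThreadSafeRpcEndpoint; CoarseGrainedExecutorBackend extends ThreadSafeRpcEndpoint, so its messages continue to be processed by one thread at a time.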

private[netty] case object OnStart extends InboxMessage

private[netty] class Inbox(
    val endpointRef: NettyRpcEndpointRef,
    val endpoint: RpcEndpoint)
  extends Logging {

  inbox =>  // Give this an alias so we can use it more clearly in closures.

  // Message queue: buffers the messages for this endpoint
  @GuardedBy("this")
  protected val messages = new java.util.LinkedList[InboxMessage]()
      
  // Whether multiple threads may process messages at the same time
  @GuardedBy("this")
  private var enableConcurrent = false

  // Number of threads currently processing this inbox
  @GuardedBy("this")
  private var numActiveThreads = 0


  // OnStart is the first message added and therefore the first processed; this block runs when the Inbox is instantiated
  inbox.synchronized {
    messages.add(OnStart)
  }

  /**
   * Process stored messages.
   */
  def process(dispatcher: Dispatcher): Unit = {
    // The message to process
    var message: InboxMessage = null
    inbox.synchronized {
      if (!enableConcurrent && numActiveThreads != 0) {
        return
      }
      // Take a message from the queue
      message = messages.poll()
      if (message != null) {
        // One more thread is now processing messages
        numActiveThreads += 1
      } else {
        return
      }
    }
    // Keep looping until the queue is drained (the return statements below exit the loop)
    while (true) {
      // Per the code above, endpoint here is a CoarseGrainedExecutorBackend instance
      safelyCall(endpoint) {
        message match {
          case RpcMessage(_sender, content, context) =>
            try {
              // Invoke CoarseGrainedExecutorBackend's receiveAndReply method
              endpoint.receiveAndReply(context).applyOrElse[Any, Unit](content, { msg =>
                throw new SparkException(s"Unsupported message $message from ${_sender}")
              })
            } catch {
              case e: Throwable =>
                context.sendFailure(e)
                // Throw the exception -- this exception will be caught by the safelyCall function.
                // The endpoint's onError function will be called.
                throw e
            }

          case OneWayMessage(_sender, content) =>
            // Invoke CoarseGrainedExecutorBackend's receive method
            endpoint.receive.applyOrElse[Any, Unit](content, { msg =>
              throw new SparkException(s"Unsupported message $message from ${_sender}")
            })
          // OnStart is the first message in the inbox, so this case runs before any other message is handled
          case OnStart =>
            // Invoke CoarseGrainedExecutorBackend's onStart method
            endpoint.onStart()
            if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
              inbox.synchronized {
                if (!stopped) {
                  // Allow multiple threads to process messages (reached only for endpoints that are not thread-safe)
                  enableConcurrent = true
                }
              }
            }

          case OnStop =>
            val activeThreads = inbox.synchronized { inbox.numActiveThreads }
            assert(activeThreads == 1,
              s"There should be only a single active thread but found $activeThreads threads.")
            // Remove the endpoint's registration from the dispatcher
            dispatcher.removeRpcEndpointRef(endpoint)
            // Invoke CoarseGrainedExecutorBackend's onStop method
            endpoint.onStop()
            assert(isEmpty, "OnStop should be the last message")

          case RemoteProcessConnected(remoteAddress) =>
            endpoint.onConnected(remoteAddress)

          case RemoteProcessDisconnected(remoteAddress) =>
            endpoint.onDisconnected(remoteAddress)

          case RemoteProcessConnectionError(cause, remoteAddress) =>
            endpoint.onNetworkError(cause, remoteAddress)
        }
      }

      inbox.synchronized {
        // "enableConcurrent" will be set to false after `onStop` is called, so we should check it
        // every time.
        if (!enableConcurrent && numActiveThreads != 1) {
          // If we are not the only one worker, exit
          numActiveThreads -= 1
          return
        }
        message = messages.poll()
        if (message == null) {
          numActiveThreads -= 1
          return
        }
      }
    }
  }
}
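
The two properties that matter most in Inbox.process are that OnStart is always handled first and that concurrent processing is enabled only for endpoints that are not thread-safe. The sketch below is a heavily simplified model of that behavior, not the Spark class: MiniInbox, Msg, and OneWay are hypothetical, only two message types exist, the handled log is meant for single-threaded demonstration, and the extra re-check after OnStop is omitted.

```scala
import scala.collection.mutable

object InboxSketch {
  sealed trait Msg
  case object OnStart extends Msg
  final case class OneWay(content: String) extends Msg

  final class MiniInbox(threadSafeEndpoint: Boolean) {
    private val messages = new java.util.LinkedList[Msg]()
    private var enableConcurrent = false
    private var numActiveThreads = 0
    // Log of handled messages; only safe for the single-threaded demonstration here.
    val handled = mutable.ArrayBuffer.empty[String]

    // OnStart is queued at construction time, so it is always the first message.
    synchronized { messages.add(OnStart) }

    def post(m: Msg): Unit = synchronized { messages.add(m) }

    def concurrentEnabled: Boolean = synchronized(enableConcurrent)

    // Entry check: refuse a second thread unless concurrent processing is on.
    private def acquire(): Option[Msg] = synchronized {
      if (!enableConcurrent && numActiveThreads != 0) None
      else Option(messages.poll()).map { m => numActiveThreads += 1; m }
    }

    // Fetch the next message, or release this thread's slot when the queue is empty.
    private def next(): Option[Msg] = synchronized {
      val m = Option(messages.poll())
      if (m.isEmpty) numActiveThreads -= 1
      m
    }

    def process(): Unit = {
      var current = acquire()
      while (current.isDefined) {
        current.get match {
          case OnStart =>
            handled += "onStart"
            if (!threadSafeEndpoint) synchronized { enableConcurrent = true }
          case OneWay(content) =>
            handled += s"receive:$content"
        }
        current = next()
      }
    }
  }
}
```

With threadSafeEndpoint = true, which corresponds to CoarseGrainedExecutorBackend, the concurrent switch stays off after OnStart has been handled.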

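3.2.1.2. How the thread pool blocks

awaitTermination loops while holding mainLock. If runState has reached the final state TERMINATED, it returns true; if the remaining timeout is used up, it returns false; otherwise it waits in termination.awaitNanos(nanos) and checks again after waking. Because the Dispatcher passes a timeout of Long.MaxValue milliseconds, the call effectively blocks until the pool terminates.

Reference: ThreadPoolExecutor source walkthrough (3): how to shut down a thread pool gracefully (shutdown, shutdownNow, awaitTermination)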

public class ThreadPoolExecutor extends AbstractExecutorService {
  
  public boolean awaitTermination(long timeout, TimeUnit unit)
        throws InterruptedException {
        long nanos = unit.toNanos(timeout);
        final ReentrantLock mainLock = this.mainLock;
        mainLock.lock();
        try {
            for (;;) {
                if (runStateAtLeast(ctl.get(), TERMINATED))
                    return true;
                if (nanos <= 0)
                    return false;
                nanos = termination.awaitNanos(nanos);
            }
        } finally {
            mainLock.unlock();
        }
    }
}
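
The two return values of awaitTermination are easy to observe with a small pool. This sketch uses only the standard java.util.concurrent API; the object and method names are chosen for illustration.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object AwaitTerminationSketch {
  // Returns (result before shutdown, result after shutdown).
  def demo(): (Boolean, Boolean) = {
    val pool = Executors.newFixedThreadPool(1)
    // The pool is still running, so the call times out and returns false.
    val beforeShutdown = pool.awaitTermination(50, TimeUnit.MILLISECONDS)
    pool.shutdown()
    // With no pending work the pool reaches TERMINATED after shutdown, so the call returns true.
    val afterShutdown = pool.awaitTermination(5, TimeUnit.SECONDS)
    (beforeShutdown, afterShutdown)
  }
}
```

This is why the executor's main thread stays parked in awaitTermination: nothing returns true until something shuts the dispatcher's pool down.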

4. Registering the executor

4.1. The backend sends a registration message to the driver

In its onStart method, the backend sends a message to the driver to register the executor.

private[spark] class CoarseGrainedExecutorBackend(
    override val rpcEnv: RpcEnv,
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    userClassPath: Seq[URL],
    env: SparkEnv)
  extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {
  
  override def onStart() {
    logInfo("Connecting to driver: " + driverUrl)
    // Asynchronously resolve the driver endpoint reference from driverUrl
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      driver = Some(ref)
      // Ask the driver, through its endpoint reference, to register this executor
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) =>
        // Always receive `true`. Just ignore it
      case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
  }
}
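
The onStart method chains two asynchronous steps with flatMap and reacts to the combined outcome in onComplete. The sketch below reproduces that shape with plain Scala futures. lookupDriver and askRegister are hypothetical stand-ins for asyncSetupEndpointRefByURI and ref.ask[Boolean](RegisterExecutor(...)), and the failure branch returns a string where the real code calls exitExecutor.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object RegistrationHandshakeSketch {
  // Hypothetical stand-in for resolving the driver endpoint reference.
  def lookupDriver(url: String): Future[String] =
    if (url.startsWith("spark://")) Future.successful(url)
    else Future.failed(new IllegalArgumentException(s"Bad driver URL: $url"))

  // Hypothetical stand-in for the ask that registers the executor; the driver always replies true.
  def askRegister(driverRef: String): Future[Boolean] = Future.successful(true)

  def register(url: String): String = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val outcome = Promise[String]()
    lookupDriver(url).flatMap(askRegister).onComplete {
      case Success(_) => outcome.success("registered")
      case Failure(_) => outcome.success(s"exit(1): Cannot register with driver: $url")
    }
    Await.result(outcome.future, 5.seconds)
  }
}
```

A failure in either step, the lookup or the ask, ends up in the same Failure branch, which is why a single handler in onStart covers both.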

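4.2. The driver handles the backend's registration message

4.2.1. How the driver endpoint is registered

After spark-submit submits the application and the driver thread starts, the driver thread registers the driver endpoint with rpcEnv.

Submitting an application with spark-submit triggers a series of steps, one of which starts a driver thread [see Spark source: job submission flow, ApplicationMaster]. The driver thread runs the application starting from the main method of the user class.

Early in the application, SparkContext is instantiated. Instantiation involves initializing the object's fields, including the _schedulerBackend variable [see Spark source: SparkContext initialization].

For how _schedulerBackend is initialized, see [Spark source: SparkContext initialization, the TaskScheduler]. That article shows _schedulerBackend as an instance of StandaloneSchedulerBackend, which is the standalone-mode case; the source shows that StandaloneSchedulerBackend is a subclass of CoarseGrainedSchedulerBackend.

After sparkContext#_schedulerBackend and sparkContext#_taskScheduler are initialized, _taskScheduler.start() is called to start the task scheduler. The code is: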

private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false)
  extends TaskScheduler with Logging {
  override def start() {
    // Call the backend's start method
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleWithFixedDelay(new Runnable {
        override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
          checkSpeculatableTasks()
        }
      }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }
}

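The task scheduler's start method calls start() on whichever scheduler backend is in use. The listing below shows StandaloneSchedulerBackend#start(), the standalone-mode backend; in YARN cluster mode the backend is YarnClusterSchedulerBackend, which also extends CoarseGrainedSchedulerBackend (through YarnSchedulerBackend) and likewise calls super.start():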

private[spark] class StandaloneSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
  with StandaloneAppClientListener
  with Logging {

  override def start() {
    // Call the parent class's start method
    super.start()

    // ... other code omitted
  }
}

The first thing StandaloneSchedulerBackend#start() does is call the parent method CoarseGrainedSchedulerBackend#start():

private[spark]
class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging {
  var driverEndpoint: RpcEndpointRef = null
    
  override def start() {
    val properties = new ArrayBuffer[(String, String)]
    for ((key, value) <- scheduler.sc.conf.getAll) {
      if (key.startsWith("spark.")) {
        properties += ((key, value))
      }
    }

    // Create the driver endpoint and register it with rpcEnv
    driverEndpoint = createDriverEndpointRef(properties)
  }

  protected def createDriverEndpointRef(
      properties: ArrayBuffer[(String, String)]): RpcEndpointRef = {
    // Register the endpoint with rpcEnv
    rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))
  }

  // Build the driver endpoint
  protected def createDriverEndpoint(properties: Seq[(String, String)]): DriverEndpoint = {
    new DriverEndpoint(rpcEnv, properties)
  }
}

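Once the driver endpoint is registered with rpcEnv, a message-loop thread in the dispatcher's thread pool runs the process() method of the inbox bound to the driver endpoint, and that method calls DriverEndpoint's onStart() method: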

class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging {

  // Single-threaded scheduled executor on the driver
  private val reviveThread =
    ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-revive-thread")

  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
    extends ThreadSafeRpcEndpoint with Logging {
    // Schedule a recurring job on reviveThread that periodically assigns tasks to executors
    override def onStart() {
      // Periodically revive offers to allow delay scheduling to work
      val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
			
      reviveThread.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          // Send ReviveOffers to the driver endpoint itself, which then assigns tasks to executors
          Option(self).foreach(_.send(ReviveOffers))
        }
      }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
    } 
  }
}

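4.2.2. Handling the backend's executor registration message

When the driver endpoint DriverEndpoint receives the backend's ask message to register an executor, DriverEndpoint#receiveAndReply handles it.

An executor whose ID is already registered, or whose host is on the node blacklist, is rejected: the driver uses send to deliver a RegisterExecutorFailed message to the executor endpoint as a OneWayMessage. Otherwise, once registration completes, the driver delivers a RegisteredExecutor message to the executor endpoint in the same way.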

class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging {

  private val executorDataMap = new HashMap[String, ExecutorData]
    
  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
    extends ThreadSafeRpcEndpoint with Logging {
     
    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      // Handle RegisterExecutor messages
      case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
        // An executor ID that is already registered is rejected: send RegisterExecutorFailed to the executor endpoint
        if (executorDataMap.contains(executorId)) {
          executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
          context.reply(true)
        } 
        // A host on the node blacklist is rejected: send RegisterExecutorFailed to the executor endpoint
      	else if (scheduler.nodeBlacklist.contains(hostname)) {
          logInfo(s"Rejecting $executorId as it has been blacklisted.")
          executorRef.send(RegisterExecutorFailed(s"Executor is blacklisted: $executorId"))
          context.reply(true)
        } else {
          // If the executor's rpc env is not listening for incoming connections, `hostPort`
          // will be null, and the client connection should be used to contact the executor.
          val executorAddress = if (executorRef.address != null) {
              executorRef.address
            } else {
              context.senderAddress
            }
          logInfo(s"Registered executor $executorRef ($executorAddress) with ID $executorId")
          // Cache the executor's information
          addressToExecutorId(executorAddress) = executorId
          totalCoreCount.addAndGet(cores)
          totalRegisteredExecutors.addAndGet(1)
          val data = new ExecutorData(executorRef, executorAddress, hostname,
            cores, cores, logUrls)
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          CoarseGrainedSchedulerBackend.this.synchronized {
            // Store the executor in the HashMap; this completes its registration on the driver
            executorDataMap.put(executorId, data)
            if (currentExecutorIdCounter < executorId.toInt) {
              currentExecutorIdCounter = executorId.toInt
            }
            if (numPendingExecutors > 0) {
              numPendingExecutors -= 1
              logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
            }
          }
          // Registration is done: send RegisteredExecutor to the executor endpoint
          executorRef.send(RegisteredExecutor)
          // Note: some tests expect the reply to come after we put the executor in the map
          context.reply(true)
          listenerBus.post(
            SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
          // The driver assigns tasks to executors
          makeOffers()
        }

      // ... other cases omitted
    }
  }
}
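
Stripped of RPC details, the registration branch is a three-way decision over the executor ID and its host. The following sketch models only that decision. Registry, Reply, Registered, and Failed are hypothetical names, and the bookkeeping is reduced to a map and a core counter.

```scala
import scala.collection.mutable

object RegisterExecutorSketch {
  sealed trait Reply
  case object Registered extends Reply
  final case class Failed(reason: String) extends Reply

  final class Registry(nodeBlacklist: Set[String]) {
    // executorId -> (hostname, cores); a reduced stand-in for executorDataMap.
    private val executors = mutable.HashMap.empty[String, (String, Int)]
    private var totalCores = 0

    def register(executorId: String, hostname: String, cores: Int): Reply = synchronized {
      if (executors.contains(executorId)) {
        Failed(s"Duplicate executor ID: $executorId")
      } else if (nodeBlacklist.contains(hostname)) {
        Failed(s"Executor is blacklisted: $executorId")
      } else {
        executors.put(executorId, (hostname, cores))
        totalCores += cores
        Registered
      }
    }

    def totalCoreCount: Int = synchronized(totalCores)
  }
}
```

Note that the blacklist check is keyed by hostname, not by executor ID, even though the failure message names the executor.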

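4.2.3. The driver assigns tasks to executors

Once the driver has registered the executor, it calls DriverEndpoint#makeOffers to assign tasks to executors.

makeOffers builds a resource offer for each live executor and passes the offers to scheduler.resourceOffers, which returns the tasks to launch, each already bound to an executor ID. launchTasks then iterates over those tasks, looks up each task's executor, and sends the task to it.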

class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging {

  private val executorDataMap = new HashMap[String, ExecutorData]
    
  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
    extends ThreadSafeRpcEndpoint with Logging {
     
    private def makeOffers() {
      // Make sure no executor is killed while some task is launching on it
      val taskDescs = withLock {
        // Filter out executors under killing
        val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
        val workOffers = activeExecutors.map {
          case (id, executorData) =>
            new WorkerOffer(id, executorData.executorHost, executorData.freeCores,
              Some(executorData.executorAddress.hostPort))
        }.toIndexedSeq
        // Ask the scheduler for the list of tasks to launch
        scheduler.resourceOffers(workOffers)
      }
      if (!taskDescs.isEmpty) {
        // Send the tasks to their executors
        launchTasks(taskDescs)
      }
    }
     
    private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
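      // Iterate over the flattened task list
      for (task <- tasks.flatten) {
        // Serialize the task
        val serializedTask = TaskDescription.encode(task)
        // The serialized task must be smaller than the maximum RPC message size; otherwise its task set is aborted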
        if (serializedTask.limit() >= maxRpcMessageSize) {
          Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
                "spark.rpc.message.maxSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        }
        else {
          // Look up the executor that this task was assigned to
          val executorData = executorDataMap(task.executorId)
          // Launching a task reserves CPUS_PER_TASK cores on that executor (one core per task by default)
          executorData.freeCores -= scheduler.CPUS_PER_TASK

          logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
            s"${executorData.executorHost}.")
          // Assign the task to the executor: send a LaunchTask message to it as a OneWayMessage
          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }
  }
}
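
The per-task logic in launchTasks has two outcomes: abort when the serialized task is too large for one RPC message, otherwise reserve cores and send. The sketch below models that split with plain data. Task, launch, and the string payload are hypothetical, and UTF-8 encoding of the payload stands in for TaskDescription.encode.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import scala.collection.mutable

object LaunchTasksSketch {
  final case class Task(taskId: Long, executorId: String, payload: String)

  // Returns (ids of launched tasks, ids of aborted tasks) and updates freeCores in place.
  def launch(
      tasks: Seq[Task],
      freeCores: mutable.Map[String, Int],
      maxMessageSize: Int,
      cpusPerTask: Int = 1): (Seq[Long], Seq[Long]) = {
    val launched = mutable.ArrayBuffer.empty[Long]
    val aborted = mutable.ArrayBuffer.empty[Long]
    for (task <- tasks) {
      // Stand-in for TaskDescription.encode: the payload bytes are the serialized task.
      val serialized = ByteBuffer.wrap(task.payload.getBytes(StandardCharsets.UTF_8))
      if (serialized.limit() >= maxMessageSize) {
        aborted += task.taskId // too large for a single RPC message
      } else {
        freeCores(task.executorId) -= cpusPerTask // reserve cores on the assigned executor
        launched += task.taskId
      }
    }
    (launched.toList, aborted.toList)
  }
}
```

Cores are reserved only on the launch path, so an aborted task leaves the executor's free core count untouched.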

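4.3. The backend receives the driver's reply to the registration

After the driver has processed the executor's registration, it sends a OneWayMessage to the executor endpoint. OneWayMessage messages are handled by CoarseGrainedExecutorBackend#receive().

After the driver has registered the executor successfully, the backend constructs an Executor.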

private[spark] class CoarseGrainedExecutorBackend(
    override val rpcEnv: RpcEnv,
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    userClassPath: Seq[URL],
    env: SparkEnv)
  extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {
  
  var executor: Executor = null
    
  override def receive: PartialFunction[Any, Unit] = {
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        // Registration with the driver succeeded: construct an Executor
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }

    case RegisterExecutorFailed(message) =>
      exitExecutor(1, "Slave registration failed: " + message)
    // ... handling of other messages omitted
  }
}

5. Task processing

5.1. Launching a task

CoarseGrainedExecutorBackend#receive() handles this step; the LaunchTask case of that method contains the logic.

private[spark] class CoarseGrainedExecutorBackend(
    override val rpcEnv: RpcEnv,
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    userClassPath: Seq[URL],
    env: SparkEnv)
  extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {
  
  var executor: Executor = null
    
  override def receive: PartialFunction[Any, Unit] = {
     case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        val taskDesc = TaskDescription.decode(data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        // Launch the task on the executor
        executor.launchTask(this, taskDesc)
      }
      // ... handling of other messages omitted
  }
}

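5.2. Running a task

launchTask wraps the task in a TaskRunner (a Runnable), records it in the executor's map of running tasks, and hands it to the executor's thread pool for execution.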

private[spark] class Executor(
    executorId: String,
    executorHostname: String,
    env: SparkEnv,
    userClassPath: Seq[URL] = Nil,
    isLocal: Boolean = false,
    uncaughtExceptionHandler: UncaughtExceptionHandler = new SparkUncaughtExceptionHandler)
  extends Logging {
    
  // Tasks currently running, keyed by task id
  private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]
    
  // Thread pool that runs the tasks
  private val threadPool = {
    val threadFactory = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat("Executor task launch worker-%d")
      .setThreadFactory(new ThreadFactory {
        override def newThread(r: Runnable): Thread =
          // Use UninterruptibleThread to run tasks so that we can allow running codes without being
          // interrupted by `Thread.interrupt()`. Some issues, such as KAFKA-1894, HADOOP-10622,
          // will hang forever if some methods are interrupted.
          new UninterruptibleThread(r, "unused") // thread name will be set by ThreadFactoryBuilder
      })
      .build()
    Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
  }
    
  def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    // Create a TaskRunner (a Runnable) for the task
    val tr = new TaskRunner(context, taskDescription)
    // Record the runner in the executor's map of running tasks
    runningTasks.put(taskDescription.taskId, tr)
    // Submit the runner to the thread pool
    threadPool.execute(tr)
  }
}
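
The launchTask pattern, register a runner in a concurrent map and then submit it to a cached thread pool, can be shown without any Spark types. MiniExecutor below is a hypothetical model: the task body is a plain function, and the runner removes itself from the map when it finishes, which is an assumption of this sketch rather than something shown in the listing above.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object TaskLaunchSketch {
  final class MiniExecutor {
    // Runners currently in flight, keyed by task id.
    private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
    // A cached pool creates threads on demand, so every launched task can start immediately.
    private val threadPool = Executors.newCachedThreadPool()

    def launchTask(taskId: Long, body: () => Unit): Unit = {
      val runner = new Runnable {
        // Sketch assumption: the runner clears its own entry once the body has finished.
        override def run(): Unit =
          try body() finally runningTasks.remove(taskId)
      }
      runningTasks.put(taskId, runner) // record the runner before it starts
      threadPool.execute(runner)       // hand it to the pool
    }

    def runningCount: Int = runningTasks.size()

    // Stop accepting work and wait for in-flight tasks; true if the pool terminated in time.
    def shutdown(): Boolean = {
      threadPool.shutdown()
      threadPool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }
}
```

Registering the runner before submitting it means a lookup by task id can find the task as soon as launchTask returns, even if the pool has not scheduled it yet.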