Spark Job Submission Flow: A Source Code Walkthrough

Preface

This post walks through the source code involved in submitting a Spark on YARN job, with the goal of understanding the overall submission flow. By knowing which classes and methods are involved at each stage, you can quickly pinpoint which step a problem comes from when something goes wrong, and troubleshoot and fix it faster.

Source Code Flow Overview

Spark on YARN Job Submission Source Flow

[Flow diagram: Spark on YARN job submission source flow]

  1. When a Spark job is submitted from the local machine (in cluster mode), the main method of SparkSubmit runs first; it loads the Client class via reflection and invokes its main method, which creates a yarnClient used to communicate with the YARN cluster (see the SparkLauncher sketch right after this list for what such a submission looks like from the outside);
  2. Client.run() then submits the application to the ResourceManager of the YARN cluster (by calling submitApplication()); the submission sent to the ResourceManager mainly includes the container specification, the Java launch environment, the command used to start the ApplicationMaster, and so on;
  3. After the Client has connected to the ResourceManager and issued the ApplicationMaster launch request, the ResourceManager starts a container and the ApplicationMaster process on a suitable NodeManager;
  4. ApplicationMaster.main starts the Driver thread, which runs the main method of the specified class (initializing the SparkContext, splitting the job into Stages, and so on); at the same time it creates a YarnRMClient (used to communicate with the ResourceManager and request resources; communication between server processes goes through the RPC framework); the ApplicationMaster then registers with the ResourceManager and requests resources, and the ResourceManager checks what resources are available in the YARN cluster;
  5. The ApplicationMaster receives the list of available containers returned by the ResourceManager and starts assigning containers (allocation principle: moving computation is cheaper than moving data, preferring process-local placement);
  6. The ApplicationMaster then creates an NMClient (the NodeManager client) to connect to the NodeManagers and tell each relevant NodeManager to start a container and a CoarseGrainedExecutorBackend (the Executor process);
  7. CoarseGrainedExecutorBackend.run then performs the reverse registration with the Driver inside the ApplicationMaster (its purpose is to tell the Driver that the Executor is ready, and, if an Executor dies, the ApplicationMaster can request resources again to rerun its tasks);
  8. Once the registration-success message comes back, the Executor is created and starts running compute tasks.
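
To see what step 1 looks like from the caller's side, the sketch below submits a job in yarn-cluster mode programmatically through org.apache.spark.launcher.SparkLauncher, which runs SparkSubmit under the hood. This is only a minimal illustration: it assumes SPARK_HOME points at a Spark installation, and the jar path, class name, and app name are made-up placeholders.

import org.apache.spark.launcher.SparkLauncher

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    // Roughly equivalent to: ./bin/spark-submit --master yarn --deploy-mode cluster --class com.example.SparkTest spark-test.jar
    val process = new SparkLauncher()
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppName("SparkTest")
      .setMainClass("com.example.SparkTest")        // hypothetical user class (--class)
      .setAppResource("/path/to/spark-test.jar")    // hypothetical application jar
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
      .launch()                                     // forks a spark-submit process

    // Wait for the spark-submit process itself to exit (the YARN application keeps running)
    val exitCode = process.waitFor()
    println(s"spark-submit exited with code $exitCode")
  }
}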

Detailed Source Code Walkthrough

Tip: this is best read alongside the source code on your own machine and the flow diagram above.

Submitting the Spark Job Locally

org.apache.spark.deploy.SparkSubmit  
/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @param args the arguments passed in, e.g. ./bin/spark-submit --master yarn-cluster --name SparkTest ....
   * @Description:
   * The entry point of the program
   */
  override def main(args: Array[String]): Unit = {
    // Parse and wrap the incoming arguments
    val appArgs = new SparkSubmitArguments(args)
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    appArgs.action match {
      // Start submitting the job
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

Program execution starts from the main method of SparkSubmit.

org.apache.spark.deploy.SparkSubmitArguments
  
  /**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Parses and wraps the arguments coming from the spark-submit script
   */
  private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
  extends SparkSubmitArgumentsParser {
  var master: String = null
  var deployMode: String = null
  var executorMemory: String = null
  var executorCores: String = null
  var totalExecutorCores: String = null
  var propertiesFile: String = null
  var driverMemory: String = null
  var driverExtraClassPath: String = null
  var driverExtraLibraryPath: String = null
  var driverExtraJavaOptions: String = null
  var queue: String = null
  var numExecutors: String = null
  var files: String = null
  var archives: String = null
  var mainClass: String = null
  var primaryResource: String = null
  var name: String = null
................. some code omitted ....................
// Get the main class (falling back to the Main-Class attribute in the jar's manifest when --class is not given)
  mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")  

The SparkSubmitArguments class parses the spark-submit script arguments and wraps them into fields.
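
The manifest fallback at the end of the snippet can be reproduced with the standard java.util.jar API. A minimal sketch, where the jar path is a placeholder:

import java.util.jar.JarFile

object ManifestMainClass {
  def main(args: Array[String]): Unit = {
    // Read the Main-Class attribute the same way SparkSubmitArguments falls back to it
    val jar = new JarFile("/path/to/spark-test.jar")   // hypothetical primary resource
    try {
      val mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
      println(s"Main-Class from manifest: $mainClass")
    } finally {
      jar.close()
    }
  }
}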

org.apache.spark.deploy.SparkSubmit

/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Submits the application using the provided arguments
   */
private def submit(args: SparkSubmitArguments): Unit = {
    // Prepare the submission environment, e.g. the main class to run (childMainClass)
    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

    def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          })
        } catch {
          case e: Exception =>
            // Hadoop's AuthorizationException suppresses the exception's stack trace, which
            // makes the message printed to the output by the JVM not very helpful. Instead,
            // detect exceptions with empty stack traces here, and treat them differently.
            if (e.getStackTrace().length == 0) {
              // scalastyle:off println
              printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
              // scalastyle:on println
              exitFn(1)
            } else {
              throw e
            }
        }
      } else {
        // Run the main class
        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      }
    }
org.apache.spark.deploy.SparkSubmit

/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Prepares the submission environment, e.g. the main class to run (childMainClass)
   */
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
      : (Seq[String], Seq[String], Map[String, String], String) = {
    ................. some code omitted ....................
    // In client mode (and as the initial value for yarn-cluster), the main class is the user-specified class (--class SparkTest)
    if (deployMode == CLIENT || isYarnCluster) {
      childMainClass = args.mainClass
      if (isUserJar(args.primaryResource)) {
        childClasspath += args.primaryResource
      }
      if (args.jars != null) { childClasspath ++= args.jars.split(",") }
    }
    ................. some code omitted ....................
    // In yarn-cluster mode, the main class becomes org.apache.spark.deploy.yarn.Client
    if (isYarnCluster) {
      childMainClass = "org.apache.spark.deploy.yarn.Client"
      if (args.isPython) {
        childArgs += ("--primary-py-file", args.primaryResource)
        childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
      } else if (args.isR) {
        val mainFile = new Path(args.primaryResource).getName
        childArgs += ("--primary-r-file", mainFile)
        childArgs += ("--class", "org.apache.spark.deploy.RRunner")
      } else {
        if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
          childArgs += ("--jar", args.primaryResource)
        }
        childArgs += ("--class", args.mainClass)
      }
      if (args.childArgs != null) {
        args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
      }
    }

The prepareSubmitEnvironment() method prepares the environment needed to submit the application.

org.apache.spark.deploy.SparkSubmit

/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Loads the main class via reflection and invokes its main method
   */  
private def runMain(
      childArgs: Seq[String],
      childClasspath: Seq[String],
      sysProps: Map[String, String],
      childMainClass: String,
      verbose: Boolean): Unit = {
    ................. some code omitted ....................
    // Set the context class loader for the current thread
    Thread.currentThread.setContextClassLoader(loader)

    ................. some code omitted ....................
    var mainClass: Class[_] = null

    try {
      // Load the class via reflection
      mainClass = Utils.classForName(childMainClass)
    } catch {
        ................. some code omitted ....................
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    }

    ................. some code omitted ....................
    // Look up the main method of the specified class and make sure it is static
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }
    ................. some code omitted ....................
    try {
      // Invoke the main method of the specified class
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      ................. some code omitted ....................
    }
  }

The runMain() method runs the main method of the child class using the provided launch environment.
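
The reflective dispatch in runMain() boils down to plain JDK reflection: load a class by name, look up its static main(Array[String]), and invoke it. A stripped-down sketch; Greeter is a made-up stand-in for the child main class:

import java.lang.reflect.Modifier

// A made-up stand-in for what --class / childMainClass would point at
object Greeter {
  def main(args: Array[String]): Unit = println(s"Hello, ${args.mkString(" ")}")
}

object ReflectiveRunMain {
  def main(args: Array[String]): Unit = {
    // Load the class by name, as runMain does with Utils.classForName(childMainClass)
    val mainClass = Class.forName("Greeter")

    // Look up main(Array[String]) and check that it is static, as runMain does
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers), "main must be static")

    // Invoke it with the child arguments
    mainMethod.invoke(null, Array("Spark", "on", "YARN"))
  }
}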

org.apache.spark.deploy.yarn.Client
  
/**
   * @Author: Small_Ran
   * @Date: 2022/5/25
   * @param argStrings
   * @Description: The main method of YARN's Client class
   */
def main(argStrings: Array[String]) {
    if (!sys.props.contains("SPARK_SUBMIT")) {
      logWarning("WARNING: This client is deprecated and will be removed in a " +
        "future version of Spark. Use ./bin/spark-submit with \"--master yarn\"")
    }

    // Set an env variable indicating we are running in YARN mode.
    // Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
    System.setProperty("SPARK_YARN_MODE", "true")
    val sparkConf = new SparkConf
    // SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
    // so remove them from sparkConf here for yarn mode.
    sparkConf.remove("spark.jars")
    sparkConf.remove("spark.files")
    val args = new ClientArguments(argStrings)
    // Create the Client, whose YarnClient (createYarnClient) establishes the connection to the YARN cluster
    new Client(args, sparkConf).run()
  }

Submitting the Application Request

org.apache.spark.deploy.yarn.Client
  
def run(): Unit = {
  
    // Start submitting the application
    this.appId = submitApplication()
    if (!launcherBackend.isConnected() && fireAndForget) {
      val report = getApplicationReport(appId)
      val state = report.getYarnApplicationState
      logInfo(s"Application report for $appId (state: $state)")
      logInfo(formatReportDetails(report))
      if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
        throw new SparkException(s"Application $appId finished with status: $state")
      }
    } else {
      val (yarnApplicationState, finalApplicationStatus) = monitorApplication(appId)
      if (yarnApplicationState == YarnApplicationState.FAILED ||
        finalApplicationStatus == FinalApplicationStatus.FAILED) {
        throw new SparkException(s"Application $appId finished with failed status")
      }
      if (yarnApplicationState == YarnApplicationState.KILLED ||
        finalApplicationStatus == FinalApplicationStatus.KILLED) {
        throw new SparkException(s"Application $appId is killed")
      }
      if (finalApplicationStatus == FinalApplicationStatus.UNDEFINED) {
        throw new SparkException(s"The final status of application $appId is undefined")
      }
    }
  }

The run() method submits the application to the ResourceManager and then monitors its state.
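
The status checks in run() go through YARN application reports. The polling can be reproduced with Hadoop's YarnClient API on its own; a minimal sketch, assuming a hadoop-yarn-client dependency and using a made-up application id:

import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object PollAppState {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Hypothetical application id (clusterTimestamp, id)
    val appId = ApplicationId.newInstance(1653350400000L, 42)

    // Poll the ResourceManager until the application reaches a terminal state
    var state = yarnClient.getApplicationReport(appId).getYarnApplicationState
    while (state != YarnApplicationState.FINISHED &&
           state != YarnApplicationState.FAILED &&
           state != YarnApplicationState.KILLED) {
      println(s"Application $appId state: $state")
      Thread.sleep(3000)
      state = yarnClient.getApplicationReport(appId).getYarnApplicationState
    }
    println(s"Application $appId finished with state $state")
    yarnClient.stop()
  }
}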

org.apache.spark.deploy.yarn.Client
  
def submitApplication(): ApplicationId = {
    var appId: ApplicationId = null
    try {
      launcherBackend.connect()
      // Setup the credentials before doing anything else,
      // so we have don't have issues at any point.
      setupCredentials()
      yarnClient.init(yarnConf)
      yarnClient.start()

      logInfo("Requesting a new application from cluster with %d NodeManagers"
        .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))

      ................. some code omitted ....................

      // Set up the appropriate contexts to launch our AM
      // Build what will be submitted: the container launch context, the Java environment, the ApplicationMaster start command, etc.
      val containerContext = createContainerLaunchContext(newAppResponse)
      val appContext = createApplicationSubmissionContext(newApp, containerContext)

      // Finally, submit and monitor the application
      // Submit the application through the YarnClient
      logInfo(s"Submitting application $appId to ResourceManager")
      yarnClient.submitApplication(appContext)
      launcherBackend.setAppId(appId.toString)
      reportLauncherState(SparkAppHandle.State.SUBMITTED)

      appId
    } catch {
      ................. some code omitted ....................
    }
  }

The submitApplication() method submits the application that will run the ApplicationMaster to the ResourceManager.
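
For comparison, this is roughly what a bare-bones submission looks like against Hadoop's YarnClient API, independent of Spark: request a new application, describe the AM container (the job of createContainerLaunchContext below), and submit it. A minimal sketch; the AM class, command, and resource figures are placeholders:

import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object MiniYarnSubmit {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the ResourceManager for a new application id
    val app = yarnClient.createApplication()
    val appContext = app.getApplicationSubmissionContext

    // Describe the ApplicationMaster container: here just a launch command
    val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
    amContainer.setCommands(List(
      "$JAVA_HOME/bin/java -Xmx512m com.example.MyApplicationMaster" +  // hypothetical AM class
        " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr").asJava)

    // Resources requested for the AM container: 512 MB, 1 vcore
    val capability = Resource.newInstance(512, 1)

    appContext.setApplicationName("mini-yarn-app")
    appContext.setAMContainerSpec(amContainer)
    appContext.setResource(capability)

    val appId = yarnClient.submitApplication(appContext)
    println(s"Submitted application $appId to the ResourceManager")
  }
}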

  private def createContainerLaunchContext(newAppResponse: GetNewApplicationResponse)
    : ContainerLaunchContext = {
    ................. some code omitted ....................
    val useConcurrentAndIncrementalGC = launchEnv.get("SPARK_USE_CONC_INCR_GC").exists(_.toBoolean)
    if (useConcurrentAndIncrementalGC) {
      // In our expts, using (default) throughput collector has severe perf ramifications in
      // multi-tenant machines
      javaOpts += "-XX:+UseConcMarkSweepGC"
      javaOpts += "-XX:MaxTenuringThreshold=31"
      javaOpts += "-XX:SurvivorRatio=8"
      javaOpts += "-XX:+CMSIncrementalMode"
      javaOpts += "-XX:+CMSIncrementalPacing"
      javaOpts += "-XX:CMSIncrementalDutyCycleMin=0"
      javaOpts += "-XX:CMSIncrementalDutyCycle=10"
    }

    ................. some code omitted ....................
    val userClass =
      if (isClusterMode) {
        Seq("--class", YarnSparkHadoopUtil.escapeForShell(args.userClass))
      } else {
        Nil
      }
    val userJar =
      if (args.userJar != null) {
        Seq("--jar", args.userJar)
      } else {
        Nil
      }
    val primaryPyFile =
      if (isClusterMode && args.primaryPyFile != null) {
        Seq("--primary-py-file", new Path(args.primaryPyFile).getName())
      } else {
        Nil
      }
    val primaryRFile =
      if (args.primaryRFile != null) {
        Seq("--primary-r-file", args.primaryRFile)
      } else {
        Nil
      }
    val amClass =
      if (isClusterMode) {
        // yarn-cluster submission: the AM class is ApplicationMaster
        Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
      } else {
        // yarn-client submission: the AM class is ExecutorLauncher
        Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
      }
    ................. some code omitted ....................

    // The command that launches the ApplicationMaster
    val commands = prefixEnv ++
      Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
      javaOpts ++ amArgs ++
      Seq(
        "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
        "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")

    // TODO: it would be nicer to just make sure there are no null commands here
    val printableCommands = commands.map(s => if (s == null) "null" else s).toList
    amContainer.setCommands(printableCommands.asJava)
    ................. some code omitted ....................
  }

The createContainerLaunchContext() method sets up the launch environment, the Java options, and the command used to start the ApplicationMaster.

Launching the ApplicationMaster

org.apache.spark.deploy.yarn.ApplicationMaster
  
def main(args: Array[String]): Unit = {
    SignalUtils.registerLogger(log)
    // Wrap the incoming arguments
    val amArgs = new ApplicationMasterArguments(args)

    // Load the properties file with the Spark configuration and set entries as system properties,
    // so that user code run inside the AM also has access to them.
    // Note: we must do this before SparkHadoopUtil instantiated
    if (amArgs.propertiesFile != null) {
      Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
        sys.props(k) = v
      }
    }
    SparkHadoopUtil.get.runAsSparkUser { () =>
      // Create the YarnRMClient, which connects to the ResourceManager
      master = new ApplicationMaster(amArgs, new YarnRMClient)
      System.exit(master.run())
    }
  }
org.apache.spark.deploy.yarn.ApplicationMaster
  
final def run(): Int = {
    try {
      val appAttemptId = client.getAttemptId()

      var attemptID: Option[String] = None

      ................. some code omitted ....................
      // Start the Driver
      if (isClusterMode) {
        // yarn-cluster mode
        runDriver(securityMgr)
      } else {
        // yarn-client mode
        runExecutorLauncher(securityMgr)
      }
    } catch {
      ................. some code omitted ....................
    }
    exitCode
  }

Requesting Resources from the ResourceManager

org.apache.spark.deploy.yarn.ApplicationMaster
  
private def runDriver(securityMgr: SecurityManager): Unit = {
    addAmIpFilter()
    // Start the user-specified class (on the Driver thread)
    userClassThread = startUserApplication()

    ................. some code omitted ....................
    try {
      val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
        Duration(totalWaitTime, TimeUnit.MILLISECONDS))
      if (sc != null) {
        rpcEnv = sc.env.rpcEnv
        val driverRef = runAMEndpoint(
          sc.getConf.get("spark.driver.host"),
          sc.getConf.get("spark.driver.port"),
          isClusterMode = true)
        registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.webUrl), securityMgr)
      } else {
        // Sanity check; should never happen in normal operation, since sc should only be null
        // if the user app did not create a SparkContext.
        if (!finished) {
          throw new IllegalStateException("SparkContext is null but app is still running!")
        }
      }
      // join: wait for the user thread to finish before continuing
      userClassThread.join()
    } catch {
      ................. some code omitted ....................
    }
  }
org.apache.spark.deploy.yarn.ApplicationMaster
  
/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * The ApplicationMaster mainly interacts with YARN (for resources and containers) and with the Driver
   */  
private def startUserApplication(): Thread = {
    logInfo("Starting the user application in a separate Thread")

    ................. some code omitted ....................
    // Look up the main method of the class specified by --class
    val mainMethod = userClassLoader.loadClass(args.userClass)
      .getMethod("main", classOf[Array[String]])

    // Start the Driver thread, which runs the main method of the specified class
    val userThread = new Thread {
      override def run() {
        try {
          mainMethod.invoke(null, userArgs.toArray)
          finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
          logDebug("Done running users class")
        } catch {
          ................. some code omitted ....................
            }
            sparkContextPromise.tryFailure(e.getCause())
        } finally {
          ................. some code omitted ....................
        }
      }
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread
  }

The startUserApplication() method starts the Driver thread.
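
The Driver thread here is ordinary JDK threading plus reflection: load the user class with a dedicated class loader, look up its static main, and run it on a thread named "Driver". A simplified sketch, where the jar URL and class name are placeholders:

import java.net.{URL, URLClassLoader}

object DriverThreadSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical user jar and class, standing in for args.userClass
    val userClassLoader =
      new URLClassLoader(Array(new URL("file:/path/to/spark-test.jar")), getClass.getClassLoader)
    val mainMethod = userClassLoader
      .loadClass("com.example.SparkTest")
      .getMethod("main", classOf[Array[String]])

    // Run the user's main on a dedicated thread, like startUserApplication does
    val userThread = new Thread {
      override def run(): Unit = mainMethod.invoke(null, Array.empty[String])
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread.join()   // the AM waits for this thread before shutting down
  }
}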

org.apache.spark.deploy.yarn.ApplicationMaster

private def registerAM(
      _sparkConf: SparkConf,
      _rpcEnv: RpcEnv,
      driverRef: RpcEndpointRef,
      uiAddress: Option[String],
      securityMgr: SecurityManager) = {
    ................. some code omitted ....................

    // The ApplicationMaster registers with the ResourceManager and starts requesting resources
    allocator = client.register(driverUrl,
      driverRef,
      yarnConf,
      _sparkConf,
      uiAddress,
      historyAddress,
      securityMgr,
      localResources)

    // Allocate the available resources and launch containers
    allocator.allocateResources()
    reporterThread = launchReporterThread()
  }

The registerAM() method registers the ApplicationMaster with the ResourceManager and starts requesting the resources the job needs.

The ResourceManager Returns the Cluster's Available Containers

org.apache.spark.deploy.yarn.YarnAllocator
  
def allocateResources(): Unit = synchronized {
    updateResourceRequests()

    val progressIndicator = 0.1f
    // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
    // requests.
    val allocateResponse = amClient.allocate(progressIndicator)

    // Get the containers that were allocated
    val allocatedContainers = allocateResponse.getAllocatedContainers()

    if (allocatedContainers.size > 0) {
      logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
        .format(
          allocatedContainers.size,
          numExecutorsRunning,
          allocateResponse.getAvailableResources))
        
      // Start handling the allocated containers
      handleAllocatedContainers(allocatedContainers.asScala)
    }

The allocateResources() method sends a resource request to the ResourceManager, which responds with a list of containers that can be used.
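
The AM side of this exchange (register once, then heartbeat with allocate()) can be written against Hadoop's AMRMClient directly. A minimal sketch, intended to run inside a YARN ApplicationMaster container; the host, port, tracking URL, and resource sizes are placeholders:

import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object MiniAllocatorLoop {
  def main(args: Array[String]): Unit = {
    val amRMClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amRMClient.init(new YarnConfiguration())
    amRMClient.start()

    // Register this ApplicationMaster with the ResourceManager
    amRMClient.registerApplicationMaster("localhost", -1, "")   // placeholder host/port/tracking URL

    // Ask for one 1 GB / 1 core container
    val capability = Resource.newInstance(1024, 1)
    amRMClient.addContainerRequest(
      new ContainerRequest(capability, null, null, Priority.newInstance(1)))

    // allocate() doubles as the heartbeat; poll until the RM grants a container
    var granted = amRMClient.allocate(0.1f).getAllocatedContainers.asScala
    while (granted.isEmpty) {
      Thread.sleep(1000)
      granted = amRMClient.allocate(0.1f).getAllocatedContainers.asScala
    }
    println(s"Granted containers: ${granted.map(_.getId).mkString(", ")}")
  }
}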

Launching Containers and Executors

org.apache.spark.deploy.yarn.YarnAllocator
  
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
    val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)

    ................. some code omitted ....................

    // Run the allocated containers
    runAllocatedContainers(containersToUse)

    logInfo("Received %d containers from YARN, launching executors on %d of them."
      .format(allocatedContainers.size, containersToUse.size))
  }  

The handleAllocatedContainers() method launches Executors in the containers granted by the ResourceManager.

org.apache.spark.deploy.yarn.YarnAllocator
  
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
    for (container <- containersToUse) {
      ................. some code omitted ....................

      // Launch the Executor for this container on the launcher thread pool
      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
          launcherPool.execute(new Runnable {
            override def run(): Unit = {
              try {
                new ExecutorRunnable(
                  Some(container),
                  conf,
                  sparkConf,
                  driverUrl,
                  executorId,
                  executorHostname,
                  executorMemory,
                  executorCores,
                  appAttemptId.getApplicationId.toString,
                  securityMgr,
                  localResources
                ).run()
                updateInternalState()
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
                  // Assigned container should be released immediately to avoid unnecessary resource
                  // occupation.
                  amClient.releaseAssignedContainer(containerId)
              }
            }
          })
          ................. some code omitted ....................
    }
  }

The runAllocatedContainers() method launches the executor process inside each allocated container.
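
Starting a process inside a granted container comes down to Hadoop's NMClient plus a ContainerLaunchContext carrying the command to run, which is exactly what ExecutorRunnable does next. A bare sketch; the container is assumed to come from an allocate() call, and the executor class and command are placeholders:

import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext}
import org.apache.hadoop.yarn.client.api.NMClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object MiniContainerLauncher {
  // `container` would be one of the Container objects returned by AMRMClient.allocate()
  def launch(container: Container): Unit = {
    val nmClient = NMClient.createNMClient()
    nmClient.init(new YarnConfiguration())
    nmClient.start()

    // Describe what to run in the container: here just a placeholder command
    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
    ctx.setCommands(List(
      "$JAVA_HOME/bin/java -Xmx1g com.example.MyExecutorBackend" +   // hypothetical executor class
        " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr").asJava)

    // Ask the NodeManager that owns this container to start it
    nmClient.startContainer(container, ctx)
  }
}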

org.apache.spark.deploy.yarn.ExecutorRunnable
  
def run(): Unit = {
    logDebug("Starting Executor Container")
    nmClient = NMClient.createNMClient()
    nmClient.init(conf)
    nmClient.start()
    // Start the container
    startContainer()
  }
def startContainer(): java.util.Map[String, ByteBuffer] = {
    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
      .asInstanceOf[ContainerLaunchContext]
    val env = prepareEnvironment().asJava

    ctx.setLocalResources(localResources.asJava)
    ctx.setEnvironment(env)

    val credentials = UserGroupInformation.getCurrentUser().getCredentials()
    val dob = new DataOutputBuffer()
    credentials.writeTokenStorageToStream(dob)
    ctx.setTokens(ByteBuffer.wrap(dob.getData()))

    // Build the command that starts the CoarseGrainedExecutorBackend process
    val commands = prepareCommand()

    ctx.setCommands(commands.asJava)
    ................. some code omitted ....................
  }
private def prepareCommand(): List[String] = {
    // Extra options for the JVM
    val javaOpts = ListBuffer[String]()

    ................. some code omitted ....................
      
    javaOpts += ("-Dspark.yarn.app.container.log.dir=" + ApplicationConstants.LOG_DIR_EXPANSION_VAR)

    val userClassPath = Client.getUserClasspath(sparkConf).flatMap { uri =>
      val absPath =
        if (new File(uri.getPath()).isAbsolute()) {
          Client.getClusterPath(sparkConf, uri.getPath())
        } else {
          Client.buildPath(Environment.PWD.$(), uri.getPath())
        }
      Seq("--user-class-path", "file:" + absPath)
    }.toSeq

    YarnSparkHadoopUtil.addOutOfMemoryErrorArgument(javaOpts)
       
    // Assemble the CoarseGrainedExecutorBackend launch command
    val commands = prefixEnv ++
      Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
      javaOpts ++
      Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
        "--driver-url", masterAddress,
        "--executor-id", executorId,
        "--hostname", hostname,
        "--cores", executorCores.toString,
        "--app-id", appId) ++
      userClassPath ++
      Seq(
        s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
        s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")

    // TODO: it would be nicer to just make sure there are no null commands here
    commands.map(s => if (s == null) "null" else s).toList
  }

The prepareCommand() method builds the command that launches CoarseGrainedExecutorBackend. Note that the Executor in the flow diagram is actually a CoarseGrainedExecutorBackend process: "Executor" is only the name used for inter-process communication, while the object actually created is a CoarseGrainedExecutorBackend. Tasks are first sent to the CoarseGrainedExecutorBackend, which then hands them to its Executor field to run.

org.apache.spark.executor.CoarseGrainedExecutorBackend
  
private def run(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      workerUrl: Option[String],
      userClassPath: Seq[URL]) {

    ................. some code omitted ....................

      val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, port, cores, cfg.ioEncryptionKey, isLocal = false)

      // "Executor" is only the endpoint name used for inter-process communication; the object actually created is a CoarseGrainedExecutorBackend.
      // Tasks are first sent to the CoarseGrainedExecutorBackend and are then executed by its Executor field.
      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
        env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
      workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
      }
      env.rpcEnv.awaitTermination()
      SparkHadoopUtil.get.stopCredentialUpdater()
    }
  }

The run() method starts the CoarseGrainedExecutorBackend (run() is invoked from the main method of CoarseGrainedExecutorBackend).

Executor Reverse Registration

org.apache.spark.executor.CoarseGrainedExecutorBackend

override def onStart() {
    logInfo("Connecting to driver: " + driverUrl)
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      // Register back with the Driver: this tells the Driver that the Executor has started, and if an Executor later dies the Driver can promptly request new resources to rerun its tasks
      driver = Some(ref)
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) =>
        // Always receive `true`. Just ignore it
      case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
  }

The onStart() method begins the reverse registration with the Driver. Because CoarseGrainedExecutorBackend extends ThreadSafeRpcEndpoint, it overrides that endpoint's methods (lifecycle: constructor -> onStart -> receive -> onStop).
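
Spark's RPC classes are private to the project, so the following is only a toy illustration of that lifecycle and of the message handling shown in the next section, not Spark's actual RpcEndpoint API; every name in it is made up:

// Toy endpoint with the same lifecycle shape (constructor -> onStart -> receive -> onStop)
trait ToyEndpoint {
  def onStart(): Unit = {}
  def receive: PartialFunction[Any, Unit]
  def onStop(): Unit = {}
}

class ToyExecutorBackend extends ToyEndpoint {
  override def onStart(): Unit =
    println("onStart: asking the driver to register this executor")

  override def receive: PartialFunction[Any, Unit] = {
    case "RegisteredExecutor"    => println("receive: registration confirmed, creating Executor")
    case ("LaunchTask", id: Int) => println(s"receive: running task $id")
  }

  override def onStop(): Unit = println("onStop: shutting down")
}

object ToyEndpointDemo {
  def main(args: Array[String]): Unit = {
    val backend = new ToyExecutorBackend          // constructor
    backend.onStart()                             // onStart
    Seq("RegisteredExecutor", ("LaunchTask", 1))  // incoming messages
      .foreach(backend.receive)
    backend.onStop()                              // onStop
  }
}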

Assigning Tasks

org.apache.spark.executor.CoarseGrainedExecutorBackend
  
override def receive: PartialFunction[Any, Unit] = {
    // The reverse registration succeeded
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        // After the Executor's reverse registration completes, the Driver sends back a confirmation; the Executor object is then created and prepares to compute
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }

    case RegisterExecutorFailed(message) =>
      exitExecutor(1, "Slave registration failed: " + message)

    // A task has been dispatched to this Executor
    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        val taskDesc = TaskDescription.decode(data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        // Start running the task
        executor.launchTask(this, taskDesc)
      }

    case KillTask(taskId, _, interruptThread, reason) =>
      if (executor == null) {
        exitExecutor(1, "Received KillTask command but executor was null")
      } else {
        executor.killTask(taskId, interruptThread, reason)
      }

    case StopExecutor =>
      stopping.set(true)
      logInfo("Driver commanded a shutdown")
      // Cannot shutdown here because an ack may need to be sent back to the caller. So send
      // a message to self to actually do the shutdown.
      self.send(Shutdown)

    case Shutdown =>
      stopping.set(true)
      new Thread("CoarseGrainedExecutorBackend-stop-executor") {
        override def run(): Unit = {
        ................. some code omitted ....................
          executor.stop()
        }
      }.start()
  }

The receive() method handles the registration-success message returned by the Driver and then has the Executor run the tasks it is assigned.

==> SparkContext Initialization: Source Code Walkthrough


Reference: https://www.bilibili.com/video/BV1Si4y1M7N6?p=2&spm_id_from=pageDriver
