Spark Job Submission Flow: A Source Code Walkthrough

Preface

This post walks through the source code involved in submitting a Spark on YARN job, with the goal of understanding the overall submission flow. By knowing which classes and methods are involved at each stage, you can quickly pinpoint which step a problem comes from when something goes wrong, and troubleshoot and fix it faster.

Source Code Flow Overview

Spark on YARN Job Submission Source Flow

[Flow diagram: Spark on YARN job submission source flow]

  1. When a Spark job is submitted from the local machine (in cluster mode), the main method of SparkSubmit runs first; it loads the Client class via reflection and invokes its main method, which creates a yarnClient used to communicate with the YARN cluster (see the SparkLauncher sketch right after this list for what such a submission looks like from the outside);
  2. Client.run() then submits the application to the ResourceManager of the YARN cluster (by calling submitApplication()); the submission sent to the ResourceManager mainly includes the container specification, the Java launch environment, the command used to start the ApplicationMaster, and so on;
  3. After the Client has connected to the ResourceManager and issued the ApplicationMaster launch request, the ResourceManager starts a container and the ApplicationMaster process on a suitable NodeManager;
  4. ApplicationMaster.main starts the Driver thread, which runs the main method of the specified class (initializing the SparkContext, splitting the job into Stages, and so on); at the same time it creates a YarnRMClient (used to communicate with the ResourceManager and request resources; communication between server processes goes through the RPC framework); the ApplicationMaster then registers with the ResourceManager and requests resources, and the ResourceManager checks what resources are available in the YARN cluster;
  5. The ApplicationMaster receives the list of available containers returned by the ResourceManager and starts assigning containers (allocation principle: moving computation is cheaper than moving data, preferring process-local placement);
  6. The ApplicationMaster then creates an NMClient (the NodeManager client) to connect to the NodeManagers and tell each relevant NodeManager to start a container and a CoarseGrainedExecutorBackend (the Executor process);
  7. CoarseGrainedExecutorBackend.run then performs the reverse registration with the Driver inside the ApplicationMaster (its purpose is to tell the Driver that the Executor is ready, and, if an Executor dies, the ApplicationMaster can request resources again to rerun its tasks);
  8. Once the registration-success message comes back, the Executor is created and starts running compute tasks.
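
To see what step 1 looks like from the caller's side, the sketch below submits a job in yarn-cluster mode programmatically through org.apache.spark.launcher.SparkLauncher, which runs SparkSubmit under the hood. This is only a minimal illustration: it assumes SPARK_HOME points at a Spark installation, and the jar path, class name, and app name are made-up placeholders.

import org.apache.spark.launcher.SparkLauncher

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    // Roughly equivalent to: ./bin/spark-submit --master yarn --deploy-mode cluster --class com.example.SparkTest spark-test.jar
    val process = new SparkLauncher()
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppName("SparkTest")
      .setMainClass("com.example.SparkTest")        // hypothetical user class (--class)
      .setAppResource("/path/to/spark-test.jar")    // hypothetical application jar
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
      .launch()                                     // forks a spark-submit process

    // Wait for the spark-submit process itself to exit (the YARN application keeps running)
    val exitCode = process.waitFor()
    println(s"spark-submit exited with code $exitCode")
  }
}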

Detailed Source Code Walkthrough

Tip: this is best read alongside the source code on your own machine and the flow diagram above.

Submitting the Spark Job Locally

org.apache.spark.deploy.SparkSubmit  
/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @param args the arguments passed in, e.g. ./bin/spark-submit --master yarn-cluster --name SparkTest ....
   * @Description:
   * The entry point of the program
   */
  override def main(args: Array[String]): Unit = {
    // Parse and wrap the incoming arguments
    val appArgs = new SparkSubmitArguments(args)
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    appArgs.action match {
      // Start submitting the job
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

Program execution starts from the main method of SparkSubmit.

org.apache.spark.deploy.SparkSubmitArguments
  
  /**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Parses and wraps the arguments coming from the spark-submit script
   */
  private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
  extends SparkSubmitArgumentsParser {
  var master: String = null
  var deployMode: String = null
  var executorMemory: String = null
  var executorCores: String = null
  var totalExecutorCores: String = null
  var propertiesFile: String = null
  var driverMemory: String = null
  var driverExtraClassPath: String = null
  var driverExtraLibraryPath: String = null
  var driverExtraJavaOptions: String = null
  var queue: String = null
  var numExecutors: String = null
  var files: String = null
  var archives: String = null
  var mainClass: String = null
  var primaryResource: String = null
  var name: String = null
................. some code omitted ....................
// Get the main class (falling back to the Main-Class attribute in the jar's manifest when --class is not given)
  mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")  

The SparkSubmitArguments class parses the spark-submit script arguments and wraps them into fields.
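
The manifest fallback at the end of the snippet can be reproduced with the standard java.util.jar API. A minimal sketch, where the jar path is a placeholder:

import java.util.jar.JarFile

object ManifestMainClass {
  def main(args: Array[String]): Unit = {
    // Read the Main-Class attribute the same way SparkSubmitArguments falls back to it
    val jar = new JarFile("/path/to/spark-test.jar")   // hypothetical primary resource
    try {
      val mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
      println(s"Main-Class from manifest: $mainClass")
    } finally {
      jar.close()
    }
  }
}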

org.apache.spark.deploy.SparkSubmit

/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Submits the application using the provided arguments
   */
private def submit(args: SparkSubmitArguments): Unit = {
    // Prepare the submission environment, e.g. the main class to run (childMainClass)
    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

    def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          })
        } catch {
          case e: Exception =>
            // Hadoop's AuthorizationException suppresses the exception's stack trace, which
            // makes the message printed to the output by the JVM not very helpful. Instead,
            // detect exceptions with empty stack traces here, and treat them differently.
            if (e.getStackTrace().length == 0) {
              // scalastyle:off println
              printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
              // scalastyle:on println
              exitFn(1)
            } else {
              throw e
            }
        }
      } else {
        // Run the main class
        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      }
    }
org.apache.spark.deploy.SparkSubmit

/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Prepares the submission environment, e.g. the main class to run (childMainClass)
   */
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
      : (Seq[String], Seq[String], Map[String, String], String) = {
    ................. some code omitted ....................
    // In client mode (and as the initial value for yarn-cluster), the main class is the user-specified class (--class SparkTest)
    if (deployMode == CLIENT || isYarnCluster) {
      childMainClass = args.mainClass
      if (isUserJar(args.primaryResource)) {
        childClasspath += args.primaryResource
      }
      if (args.jars != null) { childClasspath ++= args.jars.split(",") }
    }
    ................. some code omitted ....................
    // In yarn-cluster mode, the main class becomes org.apache.spark.deploy.yarn.Client
    if (isYarnCluster) {
      childMainClass = "org.apache.spark.deploy.yarn.Client"
      if (args.isPython) {
        childArgs += ("--primary-py-file", args.primaryResource)
        childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
      } else if (args.isR) {
        val mainFile = new Path(args.primaryResource).getName
        childArgs += ("--primary-r-file", mainFile)
        childArgs += ("--class", "org.apache.spark.deploy.RRunner")
      } else {
        if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
          childArgs += ("--jar", args.primaryResource)
        }
        childArgs += ("--class", args.mainClass)
      }
      if (args.childArgs != null) {
        args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
      }
    }

The prepareSubmitEnvironment() method prepares the environment needed to submit the application.

org.apache.spark.deploy.SparkSubmit

/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * Loads the main class via reflection and invokes its main method
   */  
private def runMain(
      childArgs: Seq[String],
      childClasspath: Seq[String],
      sysProps: Map[String, String],
      childMainClass: String,
      verbose: Boolean): Unit = {
    ................. some code omitted ....................
    // Set the context class loader for the current thread
    Thread.currentThread.setContextClassLoader(loader)

    ................. some code omitted ....................
    var mainClass: Class[_] = null

    try {
      // Load the class via reflection
      mainClass = Utils.classForName(childMainClass)
    } catch {
        ................. some code omitted ....................
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    }

    ................. some code omitted ....................
    // Look up the main method of the specified class and make sure it is static
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }
    ................. some code omitted ....................
    try {
      // Invoke the main method of the specified class
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      ................. some code omitted ....................
    }
  }

The runMain() method runs the main method of the child class using the provided launch environment.
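
The reflective dispatch in runMain() boils down to plain JDK reflection: load a class by name, look up its static main(Array[String]), and invoke it. A stripped-down sketch; Greeter is a made-up stand-in for the child main class:

import java.lang.reflect.Modifier

// A made-up stand-in for what --class / childMainClass would point at
object Greeter {
  def main(args: Array[String]): Unit = println(s"Hello, ${args.mkString(" ")}")
}

object ReflectiveRunMain {
  def main(args: Array[String]): Unit = {
    // Load the class by name, as runMain does with Utils.classForName(childMainClass)
    val mainClass = Class.forName("Greeter")

    // Look up main(Array[String]) and check that it is static, as runMain does
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers), "main must be static")

    // Invoke it with the child arguments
    mainMethod.invoke(null, Array("Spark", "on", "YARN"))
  }
}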

org.apache.spark.deploy.yarn.Client
  
/**
   * @Author: Small_Ran
   * @Date: 2022/5/25
   * @param argStrings
   * @Description: The main method of YARN's Client class
   */
def main(argStrings: Array[String]) {
    if (!sys.props.contains("SPARK_SUBMIT")) {
      logWarning("WARNING: This client is deprecated and will be removed in a " +
        "future version of Spark. Use ./bin/spark-submit with \"--master yarn\"")
    }

    // Set an env variable indicating we are running in YARN mode.
    // Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
    System.setProperty("SPARK_YARN_MODE", "true")
    val sparkConf = new SparkConf
    // SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
    // so remove them from sparkConf here for yarn mode.
    sparkConf.remove("spark.jars")
    sparkConf.remove("spark.files")
    val args = new ClientArguments(argStrings)
    // Create the Client, whose YarnClient (createYarnClient) establishes the connection to the YARN cluster
    new Client(args, sparkConf).run()
  }

Submitting the Application Request

org.apache.spark.deploy.yarn.Client
  
def run(): Unit = {
  
    // Start submitting the application
    this.appId = submitApplication()
    if (!launcherBackend.isConnected() && fireAndForget) {
      val report = getApplicationReport(appId)
      val state = report.getYarnApplicationState
      logInfo(s"Application report for $appId (state: $state)")
      logInfo(formatReportDetails(report))
      if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
        throw new SparkException(s"Application $appId finished with status: $state")
      }
    } else {
      val (yarnApplicationState, finalApplicationStatus) = monitorApplication(appId)
      if (yarnApplicationState == YarnApplicationState.FAILED ||
        finalApplicationStatus == FinalApplicationStatus.FAILED) {
        throw new SparkException(s"Application $appId finished with failed status")
      }
      if (yarnApplicationState == YarnApplicationState.KILLED ||
        finalApplicationStatus == FinalApplicationStatus.KILLED) {
        throw new SparkException(s"Application $appId is killed")
      }
      if (finalApplicationStatus == FinalApplicationStatus.UNDEFINED) {
        throw new SparkException(s"The final status of application $appId is undefined")
      }
    }
  }

The run() method submits the application to the ResourceManager and then monitors its state.
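
The status checks in run() go through YARN application reports. The polling can be reproduced with Hadoop's YarnClient API on its own; a minimal sketch, assuming a hadoop-yarn-client dependency and using a made-up application id:

import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object PollAppState {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Hypothetical application id (clusterTimestamp, id)
    val appId = ApplicationId.newInstance(1653350400000L, 42)

    // Poll the ResourceManager until the application reaches a terminal state
    var state = yarnClient.getApplicationReport(appId).getYarnApplicationState
    while (state != YarnApplicationState.FINISHED &&
           state != YarnApplicationState.FAILED &&
           state != YarnApplicationState.KILLED) {
      println(s"Application $appId state: $state")
      Thread.sleep(3000)
      state = yarnClient.getApplicationReport(appId).getYarnApplicationState
    }
    println(s"Application $appId finished with state $state")
    yarnClient.stop()
  }
}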

org.apache.spark.deploy.yarn.Client
  
def submitApplication(): ApplicationId = {
    var appId: ApplicationId = null
    try {
      launcherBackend.connect()
      // Setup the credentials before doing anything else,
      // so we have don't have issues at any point.
      setupCredentials()
      yarnClient.init(yarnConf)
      yarnClient.start()

      logInfo("Requesting a new application from cluster with %d NodeManagers"
        .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))

      ................. some code omitted ....................

      // Set up the appropriate contexts to launch our AM
      // Build what will be submitted: the container launch context, the Java environment, the ApplicationMaster start command, etc.
      val containerContext = createContainerLaunchContext(newAppResponse)
      val appContext = createApplicationSubmissionContext(newApp, containerContext)

      // Finally, submit and monitor the application
      // Submit the application through the YarnClient
      logInfo(s"Submitting application $appId to ResourceManager")
      yarnClient.submitApplication(appContext)
      launcherBackend.setAppId(appId.toString)
      reportLauncherState(SparkAppHandle.State.SUBMITTED)

      appId
    } catch {
      ................. some code omitted ....................
    }
  }

The submitApplication() method submits the application that will run the ApplicationMaster to the ResourceManager.
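
For comparison, this is roughly what a bare-bones submission looks like against Hadoop's YarnClient API, independent of Spark: request a new application, describe the AM container (the job of createContainerLaunchContext below), and submit it. A minimal sketch; the AM class, command, and resource figures are placeholders:

import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object MiniYarnSubmit {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the ResourceManager for a new application id
    val app = yarnClient.createApplication()
    val appContext = app.getApplicationSubmissionContext

    // Describe the ApplicationMaster container: here just a launch command
    val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
    amContainer.setCommands(List(
      "$JAVA_HOME/bin/java -Xmx512m com.example.MyApplicationMaster" +  // hypothetical AM class
        " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr").asJava)

    // Resources requested for the AM container: 512 MB, 1 vcore
    val capability = Resource.newInstance(512, 1)

    appContext.setApplicationName("mini-yarn-app")
    appContext.setAMContainerSpec(amContainer)
    appContext.setResource(capability)

    val appId = yarnClient.submitApplication(appContext)
    println(s"Submitted application $appId to the ResourceManager")
  }
}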

  private def createContainerLaunchContext(newAppResponse: GetNewApplicationResponse)
    : ContainerLaunchContext = {
    ................. some code omitted ....................
    val useConcurrentAndIncrementalGC = launchEnv.get("SPARK_USE_CONC_INCR_GC").exists(_.toBoolean)
    if (useConcurrentAndIncrementalGC) {
      // In our expts, using (default) throughput collector has severe perf ramifications in
      // multi-tenant machines
      javaOpts += "-XX:+UseConcMarkSweepGC"
      javaOpts += "-XX:MaxTenuringThreshold=31"
      javaOpts += "-XX:SurvivorRatio=8"
      javaOpts += "-XX:+CMSIncrementalMode"
      javaOpts += "-XX:+CMSIncrementalPacing"
      javaOpts += "-XX:CMSIncrementalDutyCycleMin=0"
      javaOpts += "-XX:CMSIncrementalDutyCycle=10"
    }

    ................. some code omitted ....................
    val userClass =
      if (isClusterMode) {
        Seq("--class", YarnSparkHadoopUtil.escapeForShell(args.userClass))
      } else {
        Nil
      }
    val userJar =
      if (args.userJar != null) {
        Seq("--jar", args.userJar)
      } else {
        Nil
      }
    val primaryPyFile =
      if (isClusterMode && args.primaryPyFile != null) {
        Seq("--primary-py-file", new Path(args.primaryPyFile).getName())
      } else {
        Nil
      }
    val primaryRFile =
      if (args.primaryRFile != null) {
        Seq("--primary-r-file", args.primaryRFile)
      } else {
        Nil
      }
    val amClass =
      if (isClusterMode) {
        // yarn-cluster submission: the AM class is ApplicationMaster
        Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
      } else {
        // yarn-client submission: the AM class is ExecutorLauncher
        Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
      }
    ................. some code omitted ....................

    // The command that launches the ApplicationMaster
    val commands = prefixEnv ++
      Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
      javaOpts ++ amArgs ++
      Seq(
        "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
        "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")

    // TODO: it would be nicer to just make sure there are no null commands here
    val printableCommands = commands.map(s => if (s == null) "null" else s).toList
    amContainer.setCommands(printableCommands.asJava)
    ................. some code omitted ....................
  }

The createContainerLaunchContext() method sets up the launch environment, the Java options, and the command used to start the ApplicationMaster.

Launching the ApplicationMaster

org.apache.spark.deploy.yarn.ApplicationMaster
  
def main(args: Array[String]): Unit = {
    SignalUtils.registerLogger(log)
    // Wrap the incoming arguments
    val amArgs = new ApplicationMasterArguments(args)

    // Load the properties file with the Spark configuration and set entries as system properties,
    // so that user code run inside the AM also has access to them.
    // Note: we must do this before SparkHadoopUtil instantiated
    if (amArgs.propertiesFile != null) {
      Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
        sys.props(k) = v
      }
    }
    SparkHadoopUtil.get.runAsSparkUser { () =>
      // Create the YarnRMClient, which connects to the ResourceManager
      master = new ApplicationMaster(amArgs, new YarnRMClient)
      System.exit(master.run())
    }
  }
org.apache.spark.deploy.yarn.ApplicationMaster
  
final def run(): Int = {
    try {
      val appAttemptId = client.getAttemptId()

      var attemptID: Option[String] = None

      ................. some code omitted ....................
      // Start the Driver
      if (isClusterMode) {
        // yarn-cluster mode
        runDriver(securityMgr)
      } else {
        // yarn-client mode
        runExecutorLauncher(securityMgr)
      }
    } catch {
      ................. some code omitted ....................
    }
    exitCode
  }

Requesting Resources from the ResourceManager

org.apache.spark.deploy.yarn.ApplicationMaster
  
private def runDriver(securityMgr: SecurityManager): Unit = {
    addAmIpFilter()
    // Start the user-specified class (on the Driver thread)
    userClassThread = startUserApplication()

    ................. some code omitted ....................
    try {
      val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
        Duration(totalWaitTime, TimeUnit.MILLISECONDS))
      if (sc != null) {
        rpcEnv = sc.env.rpcEnv
        val driverRef = runAMEndpoint(
          sc.getConf.get("spark.driver.host"),
          sc.getConf.get("spark.driver.port"),
          isClusterMode = true)
        registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.webUrl), securityMgr)
      } else {
        // Sanity check; should never happen in normal operation, since sc should only be null
        // if the user app did not create a SparkContext.
        if (!finished) {
          throw new IllegalStateException("SparkContext is null but app is still running!")
        }
      }
      // join: wait for the user thread to finish before continuing
      userClassThread.join()
    } catch {
      ................. some code omitted ....................
    }
  }
org.apache.spark.deploy.yarn.ApplicationMaster
  
/**
   * @Author: Small_Ran
   * @Date: 2022/5/24
   * @Description:
   * The ApplicationMaster mainly interacts with YARN (for resources and containers) and with the Driver
   */  
private def startUserApplication(): Thread = {
    logInfo("Starting the user application in a separate Thread")

    ................. some code omitted ....................
    // Look up the main method of the class specified by --class
    val mainMethod = userClassLoader.loadClass(args.userClass)
      .getMethod("main", classOf[Array[String]])

    // Start the Driver thread, which runs the main method of the specified class
    val userThread = new Thread {
      override def run() {
        try {
          mainMethod.invoke(null, userArgs.toArray)
          finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
          logDebug("Done running users class")
        } catch {
          ................. some code omitted ....................
            }
            sparkContextPromise.tryFailure(e.getCause())
        } finally {
          ................. some code omitted ....................
        }
      }
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread
  }

The startUserApplication() method starts the Driver thread.
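
The Driver thread here is ordinary JDK threading plus reflection: load the user class with a dedicated class loader, look up its static main, and run it on a thread named "Driver". A simplified sketch, where the jar URL and class name are placeholders:

import java.net.{URL, URLClassLoader}

object DriverThreadSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical user jar and class, standing in for args.userClass
    val userClassLoader =
      new URLClassLoader(Array(new URL("file:/path/to/spark-test.jar")), getClass.getClassLoader)
    val mainMethod = userClassLoader
      .loadClass("com.example.SparkTest")
      .getMethod("main", classOf[Array[String]])

    // Run the user's main on a dedicated thread, like startUserApplication does
    val userThread = new Thread {
      override def run(): Unit = mainMethod.invoke(null, Array.empty[String])
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread.join()   // the AM waits for this thread before shutting down
  }
}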

org.apache.spark.deploy.yarn.ApplicationMaster

private def registerAM(
      _sparkConf: SparkConf,
      _rpcEnv: RpcEnv,
      driverRef: RpcEndpointRef,
      uiAddress: Option[String],
      securityMgr: SecurityManager) = {
    ................. some code omitted ....................

    // The ApplicationMaster registers with the ResourceManager and starts requesting resources
    allocator = client.register(driverUrl,
      driverRef,
      yarnConf,
      _sparkConf,
      uiAddress,
      historyAddress,
      securityMgr,
      localResources)

    // Allocate the available resources and launch containers
    allocator.allocateResources()
    reporterThread = launchReporterThread()
  }

The registerAM() method registers the ApplicationMaster with the ResourceManager and starts requesting the resources the job needs.

The ResourceManager Returns the Cluster's Available Containers

org.apache.spark.deploy.yarn.YarnAllocator
  
def allocateResources(): Unit = synchronized {
    updateResourceRequests()

    val progressIndicator = 0.1f
    // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
    // requests.
    val allocateResponse = amClient.allocate(progressIndicator)

    // Get the containers that were allocated
    val allocatedContainers = allocateResponse.getAllocatedContainers()

    if (allocatedContainers.size > 0) {
      logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
        .format(
          allocatedContainers.size,
          numExecutorsRunning,
          allocateResponse.getAvailableResources))
        
      // Start handling the allocated containers
      handleAllocatedContainers(allocatedContainers.asScala)
    }

The allocateResources() method sends a resource request to the ResourceManager, which responds with a list of containers that can be used.
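
The AM side of this exchange (register once, then heartbeat with allocate()) can be written against Hadoop's AMRMClient directly. A minimal sketch, intended to run inside a YARN ApplicationMaster container; the host, port, tracking URL, and resource sizes are placeholders:

import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object MiniAllocatorLoop {
  def main(args: Array[String]): Unit = {
    val amRMClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amRMClient.init(new YarnConfiguration())
    amRMClient.start()

    // Register this ApplicationMaster with the ResourceManager
    amRMClient.registerApplicationMaster("localhost", -1, "")   // placeholder host/port/tracking URL

    // Ask for one 1 GB / 1 core container
    val capability = Resource.newInstance(1024, 1)
    amRMClient.addContainerRequest(
      new ContainerRequest(capability, null, null, Priority.newInstance(1)))

    // allocate() doubles as the heartbeat; poll until the RM grants a container
    var granted = amRMClient.allocate(0.1f).getAllocatedContainers.asScala
    while (granted.isEmpty) {
      Thread.sleep(1000)
      granted = amRMClient.allocate(0.1f).getAllocatedContainers.asScala
    }
    println(s"Granted containers: ${granted.map(_.getId).mkString(", ")}")
  }
}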

Launching Containers and Executors

org.apache.spark.deploy.yarn.YarnAllocator
  
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
    val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)

    ................. some code omitted ....................

    // Run the allocated containers
    runAllocatedContainers(containersToUse)

    logInfo("Received %d containers from YARN, launching executors on %d of them."
      .format(allocatedContainers.size, containersToUse.size))
  }  

The handleAllocatedContainers() method launches Executors in the containers granted by the ResourceManager.

org.apache.spark.deploy.yarn.YarnAllocator
  
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
    for (container <- containersToUse) {
      ................. some code omitted ....................

      // Launch the Executor for this container on the launcher thread pool
      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
          launcherPool.execute(new Runnable {
            override def run(): Unit = {
              try {
                new ExecutorRunnable(
                  Some(container),
                  conf,
                  sparkConf,
                  driverUrl,
                  executorId,
                  executorHostname,
                  executorMemory,
                  executorCores,
                  appAttemptId.getApplicationId.toString,
                  securityMgr,
                  localResources
                ).run()
                updateInternalState()
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
                  // Assigned container should be released immediately to avoid unnecessary resource
                  // occupation.
                  amClient.releaseAssignedContainer(containerId)
              }
            }
          })
          ................. some code omitted ....................
    }
  }

The runAllocatedContainers() method launches the executor process inside each allocated container.
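
Starting a process inside a granted container comes down to Hadoop's NMClient plus a ContainerLaunchContext carrying the command to run, which is exactly what ExecutorRunnable does next. A bare sketch; the container is assumed to come from an allocate() call, and the executor class and command are placeholders:

import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext}
import org.apache.hadoop.yarn.client.api.NMClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object MiniContainerLauncher {
  // `container` would be one of the Container objects returned by AMRMClient.allocate()
  def launch(container: Container): Unit = {
    val nmClient = NMClient.createNMClient()
    nmClient.init(new YarnConfiguration())
    nmClient.start()

    // Describe what to run in the container: here just a placeholder command
    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
    ctx.setCommands(List(
      "$JAVA_HOME/bin/java -Xmx1g com.example.MyExecutorBackend" +   // hypothetical executor class
        " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr").asJava)

    // Ask the NodeManager that owns this container to start it
    nmClient.startContainer(container, ctx)
  }
}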

org.apache.spark.deploy.yarn.ExecutorRunnable
  
def run(): Unit = {
    logDebug("Starting Executor Container")
    nmClient = NMClient.createNMClient()
    nmClient.init(conf)
    nmClient.start()
    // Start the container
    startContainer()
  }
def startContainer(): java.util.Map[String, ByteBuffer] = {
    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
      .asInstanceOf[ContainerLaunchContext]
    val env = prepareEnvironment().asJava

    ctx.setLocalResources(localResources.asJava)
    ctx.setEnvironment(env)

    val credentials = UserGroupInformation.getCurrentUser().getCredentials()
    val dob = new DataOutputBuffer()
    credentials.writeTokenStorageToStream(dob)
    ctx.setTokens(ByteBuffer.wrap(dob.getData()))

    // Build the command that starts the CoarseGrainedExecutorBackend process
    val commands = prepareCommand()

    ctx.setCommands(commands.asJava)
    ................. some code omitted ....................
  }
private def prepareCommand(): List[String] = {
    // Extra options for the JVM
    val javaOpts = ListBuffer[String]()

    ................. some code omitted ....................
      
    javaOpts += ("-Dspark.yarn.app.container.log.dir=" + ApplicationConstants.LOG_DIR_EXPANSION_VAR)

    val userClassPath = Client.getUserClasspath(sparkConf).flatMap { uri =>
      val absPath =
        if (new File(uri.getPath()).isAbsolute()) {
          Client.getClusterPath(sparkConf, uri.getPath())
        } else {
          Client.buildPath(Environment.PWD.$(), uri.getPath())
        }
      Seq("--user-class-path", "file:" + absPath)
    }.toSeq

    YarnSparkHadoopUtil.addOutOfMemoryErrorArgument(javaOpts)
       
    // Assemble the CoarseGrainedExecutorBackend launch command
    val commands = prefixEnv ++
      Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
      javaOpts ++
      Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
        "--driver-url", masterAddress,
        "--executor-id", executorId,
        "--hostname", hostname,
        "--cores", executorCores.toString,
        "--app-id", appId) ++
      userClassPath ++
      Seq(
        s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
        s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")

    // TODO: it would be nicer to just make sure there are no null commands here
    commands.map(s => if (s == null) "null" else s).toList
  }

The prepareCommand() method builds the command that launches CoarseGrainedExecutorBackend. Note that the Executor in the flow diagram is actually a CoarseGrainedExecutorBackend process: "Executor" is only the name used for inter-process communication, while the object actually created is a CoarseGrainedExecutorBackend. Tasks are first sent to the CoarseGrainedExecutorBackend, which then hands them to its Executor field to run.

org.apache.spark.executor.CoarseGrainedExecutorBackend
  
private def run(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      workerUrl: Option[String],
      userClassPath: Seq[URL]) {

    ................. some code omitted ....................

      val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, port, cores, cfg.ioEncryptionKey, isLocal = false)

      // "Executor" is only the endpoint name used for inter-process communication; the object actually created is a CoarseGrainedExecutorBackend.
      // Tasks are first sent to the CoarseGrainedExecutorBackend and are then executed by its Executor field.
      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
        env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
      workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
      }
      env.rpcEnv.awaitTermination()
      SparkHadoopUtil.get.stopCredentialUpdater()
    }
  }

The run() method starts the CoarseGrainedExecutorBackend (run() is invoked from the main method of CoarseGrainedExecutorBackend).

Executor Reverse Registration

org.apache.spark.executor.CoarseGrainedExecutorBackend

override def onStart() {
    logInfo("Connecting to driver: " + driverUrl)
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      // Register back with the Driver: this tells the Driver that the Executor has started, and if an Executor later dies the Driver can promptly request new resources to rerun its tasks
      driver = Some(ref)
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) =>
        // Always receive `true`. Just ignore it
      case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
  }

The onStart() method begins the reverse registration with the Driver. Because CoarseGrainedExecutorBackend extends ThreadSafeRpcEndpoint, it overrides that endpoint's methods (lifecycle: constructor -> onStart -> receive -> onStop).
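
Spark's RPC classes are private to the project, so the following is only a toy illustration of that lifecycle and of the message handling shown in the next section, not Spark's actual RpcEndpoint API; every name in it is made up:

// Toy endpoint with the same lifecycle shape (constructor -> onStart -> receive -> onStop)
trait ToyEndpoint {
  def onStart(): Unit = {}
  def receive: PartialFunction[Any, Unit]
  def onStop(): Unit = {}
}

class ToyExecutorBackend extends ToyEndpoint {
  override def onStart(): Unit =
    println("onStart: asking the driver to register this executor")

  override def receive: PartialFunction[Any, Unit] = {
    case "RegisteredExecutor"    => println("receive: registration confirmed, creating Executor")
    case ("LaunchTask", id: Int) => println(s"receive: running task $id")
  }

  override def onStop(): Unit = println("onStop: shutting down")
}

object ToyEndpointDemo {
  def main(args: Array[String]): Unit = {
    val backend = new ToyExecutorBackend          // constructor
    backend.onStart()                             // onStart
    Seq("RegisteredExecutor", ("LaunchTask", 1))  // incoming messages
      .foreach(backend.receive)
    backend.onStop()                              // onStop
  }
}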

Assigning Tasks

org.apache.spark.executor.CoarseGrainedExecutorBackend
  
override def receive: PartialFunction[Any, Unit] = {
    // The reverse registration succeeded
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        // After the Executor's reverse registration completes, the Driver sends back a confirmation; the Executor object is then created and prepares to compute
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }

    case RegisterExecutorFailed(message) =>
      exitExecutor(1, "Slave registration failed: " + message)

    // A task has been dispatched to this Executor
    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        val taskDesc = TaskDescription.decode(data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        // Start running the task
        executor.launchTask(this, taskDesc)
      }

    case KillTask(taskId, _, interruptThread, reason) =>
      if (executor == null) {
        exitExecutor(1, "Received KillTask command but executor was null")
      } else {
        executor.killTask(taskId, interruptThread, reason)
      }

    case StopExecutor =>
      stopping.set(true)
      logInfo("Driver commanded a shutdown")
      // Cannot shutdown here because an ack may need to be sent back to the caller. So send
      // a message to self to actually do the shutdown.
      self.send(Shutdown)

    case Shutdown =>
      stopping.set(true)
      new Thread("CoarseGrainedExecutorBackend-stop-executor") {
        override def run(): Unit = {
        ................. some code omitted ....................
          executor.stop()
        }
      }.start()
  }

The receive() method handles the registration-success message returned by the Driver and then has the Executor run the tasks it is assigned.

==> SparkContext Initialization: Source Code Walkthrough


Reference: https://www.bilibili.com/video/BV1Si4y1M7N6?p=2&spm_id_from=pageDriver
