Spark on YARN Cluster Submission Flow Analysis (Part 3)


1. In the previous part we reached step ②: an ApplicationMaster process was created on one of the cluster's nodes to manage the whole Spark application

2. This part looks at what that ApplicationMaster actually does

  • A quick recap: after submission, a Spark-on-YARN application runs in one of two deploy modes

    • Client mode: the Driver is started on the node where spark-submit was run; wherever you submit is where the Driver lives
    • Cluster mode: the Driver is started on some node in the cluster, and which node is decided by the ApplicationMaster; wherever the AM runs is where the Driver runs

Speaking of where the Driver starts, a quick note on where the ApplicationMaster itself starts: when spark-submit is run on a cluster node, it is the ResourceManager that decides which node hosts the ApplicationMaster.
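For reference, here is a minimal sketch of submitting the same application programmatically in either deploy mode, assuming the spark-launcher module is on the classpath; the jar path and class name are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitModesSketch {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("/path/to/wordcount.jar")   // placeholder application jar
      .setMainClass("com.example.WordCount")      // placeholder user class
      .setMaster("yarn")
      // "client": the Driver runs on the submitting node, next to this launcher.
      // "cluster": the Driver runs inside the ApplicationMaster on a node the RM chooses.
      .setDeployMode("cluster")
      .setAppName("wordcount-demo")
      .startApplication()                         // returns a SparkAppHandle for monitoring

    println(s"submitted, current state: ${handle.getState}")
  }
}
```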

  • ApplicationMaster companion object: main()

    • def main(args: Array[String]): Unit = {
          SignalUtils.registerLogger(log)
          
          val amArgs = new ApplicationMasterArguments(args)
      
          if (amArgs.propertiesFile != null) {
            Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
              sys.props(k) = v
            }
          }
          SparkHadoopUtil.get.runAsSparkUser { () =>
            master = new ApplicationMaster(amArgs, new YarnRMClient)
            System.exit(master.run())
          }
        }
      
    • 1. This first wraps the AM's arguments in an ApplicationMasterArguments object, which parses them for the AM to use. A few fields of this class are worth remembering, since they come up again later (a standalone sketch of this parsing pattern follows at the end of this list)

    class ApplicationMasterArguments(val args: Array[String]) {
    var userJar: String = null
    var userClass: String = null
    var primaryPyFile: String = null
    var primaryRFile: String = null
    var userArgs: Seq[String] = Nil
    var propertiesFile: String = null

    parseArgs(args.toList)
    
    private def parseArgs(inputArgs: List[String]): Unit = {
      val userArgsBuffer = new ArrayBuffer[String]()
    
      var args = inputArgs
    
      while (!args.isEmpty) {
        // --num-workers, --worker-memory, and --worker-cores are deprecated since 1.0,
        // the properties with executor in their names are preferred.
        args match {
          case ("--jar") :: value :: tail =>
            userJar = value
            args = tail
    
          case ("--class") :: value :: tail =>
            userClass = value
            args = tail
    
          case ("--primary-py-file") :: value :: tail =>
            primaryPyFile = value
            args = tail
    
          case ("--primary-r-file") :: value :: tail =>
            primaryRFile = value
            args = tail
    
          case ("--arg") :: value :: tail =>
            userArgsBuffer += value
            args = tail
    
          case ("--properties-file") :: value :: tail =>
            propertiesFile = value
            args = tail
    
          case _ =>
            printUsageAndExit(1, args)
        }
      }
    
      if (primaryPyFile != null && primaryRFile != null) {
        // scalastyle:off println
        System.err.println("Cannot have primary-py-file and primary-r-file at the same time")
        // scalastyle:on println
        System.exit(-1)
      }
    
      userArgs = userArgsBuffer.toList
    }
    
    
    • 2. Here the parser checks whether the split arguments contain a `--class` option and stores its value in the `userClass` field

    • 3. Continuing through `main()`, we find that an `ApplicationMaster` object is created

    • master = new ApplicationMaster(amArgs, new YarnRMClient)

    • 4. When the AM object is created, a YarnRMClient is passed in. This object plays a role similar to the YarnClient we met earlier, so let's dig into it

    private[spark] class YarnRMClient extends Logging {

    private var amClient: AMRMClient[ContainerRequest] = _
    private var uiHistoryAddress: String = _
    private var registered: Boolean = false
    
    
    • 5. It maintains an `AMRMClient`, whose role is analogous to `YarnClient`: it handles the communication between the AM (running in a container on a NodeManager node) and the RM; in that exchange the AM side acts as the client and the RM as the server

    • 6. Back in the main() method, the `ApplicationMaster`'s `run()` method is then executed
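    • As promised above, here is a condensed, self-contained imitation of the `parseArgs` pattern match. It is not Spark source, just an illustration of how each `--flag value` pair is consumed from the argument list; the sample command line in the comment is made up:

      ```scala
      // Run with: --jar file:/tmp/wc.jar --class com.example.WordCount --arg hdfs:///input
      object MiniAmArgs {
        def main(argv: Array[String]): Unit = {
          var userJar: String = null
          var userClass: String = null
          val userArgs = scala.collection.mutable.ArrayBuffer[String]()

          var args = argv.toList
          while (args.nonEmpty) {
            args match {
              case "--jar" :: value :: tail   => userJar = value;   args = tail
              case "--class" :: value :: tail => userClass = value; args = tail
              case "--arg" :: value :: tail   => userArgs += value; args = tail
              case unknown :: _               => sys.error(s"Unknown option: $unknown")
            }
          }
          println(s"userClass = $userClass, userJar = $userJar, userArgs = $userArgs")
        }
      }
      ```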
    
    
  • ApplicationMaster instance: run()

    • final def run(): Int = {
          try {
            val appAttemptId = client.getAttemptId()
      
            var attemptID: Option[String] = None
      
      
            new CallerContext("APPMASTER",
              Option(appAttemptId.getApplicationId.toString), attemptID).setCurrentContext()
      
            logInfo("ApplicationAttemptId: " + appAttemptId)
      
            val fs = FileSystem.get(yarnConf)
      
            // This shutdown hook should run *after* the SparkContext is shut down.
            val priority = ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY - 1
            ShutdownHookManager.addShutdownHook(priority) { () =>
              val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf)
              val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts
      
      	.....
      
            // Call this to force generation of secret so it gets populated into the
            // Hadoop UGI. This has to happen before the startUserApplication which does a
            // doAs in order for the credentials to be passed on to the executor containers.
            val securityMgr = new SecurityManager(sparkConf)
      
            // If the credentials file config is present, we must periodically renew tokens. So create
            // a new AMDelegationTokenRenewer
            if (sparkConf.contains(CREDENTIALS_FILE_PATH.key)) {
              // If a principal and keytab have been set, use that to create new credentials for executors
              // periodically
              credentialRenewer =
                new ConfigurableCredentialManager(sparkConf, yarnConf).credentialRenewer()
              credentialRenewer.scheduleLoginFromKeytab()
            }
      
            if (isClusterMode) {
              runDriver(securityMgr)
            } else {
              runExecutorLauncher(securityMgr)
            }
          } catch {
            case e: Exception =>
              // catch everything else if not specifically handled
              logError("Uncaught exception: ", e)
              finish(FinalApplicationStatus.FAILED,
                ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION,
                "Uncaught exception: " + e)
          }
          exitCode
        }
      
    • 1. Most of what comes before is checks and setup (shutdown hooks, a SecurityManager, optional credential renewal), and on a first submission these rarely fail. The interesting call is runDriver() in the isClusterMode branch; the argument passed in is the SecurityManager built from the security-checked sparkConf

  • runDriver(securityMgr)

    •   private def runDriver(securityMgr: SecurityManager): Unit = {
          addAmIpFilter()
          userClassThread = startUserApplication()
        
          // This a bit hacky, but we need to wait until the spark.driver.port property has
          // been set by the Thread executing the user class.
          logInfo("Waiting for spark context initialization...")
          val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
          try {
            val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
              Duration(totalWaitTime, TimeUnit.MILLISECONDS))
            if (sc != null) {
              rpcEnv = sc.env.rpcEnv
              val driverRef = runAMEndpoint(
                sc.getConf.get("spark.driver.host"),
                sc.getConf.get("spark.driver.port"),
                isClusterMode = true)
              registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),
                securityMgr)
            } else {
              // Sanity check; should never happen in normal operation, since sc should only be null
              // if the user app did not create a SparkContext.
              if (!finished) {
                throw new IllegalStateException("SparkContext is null but app is still running!")
              }
            }
            userClassThread.join()
          } catch {
            case e: SparkException if e.getCause().isInstanceOf[TimeoutException] =>
              logError(
                s"SparkContext did not initialize after waiting for $totalWaitTime ms. " +
                 "Please check earlier log output for errors. Failing the application.")
              finish(FinalApplicationStatus.FAILED,
                ApplicationMaster.EXIT_SC_NOT_INITED,
                "Timed out waiting for SparkContext.")
          }
        }
      
    • 1. The third line here immediately caught my attention: in the Spark code base, methods named startXxxApplication() are usually where the important work happens

  • userClassThread = startUserApplication()

    • private def startUserApplication(): Thread = {
          logInfo("Starting the user application in a separate Thread")
      
          val classpath = Client.getUserClasspath(sparkConf)
          val urls = classpath.map { entry =>
            new URL("file:" + new File(entry.getPath()).getAbsolutePath())
          }
          val userClassLoader =
            if (Client.isUserClassPathFirst(sparkConf, isDriver = true)) {
              new ChildFirstURLClassLoader(urls, Utils.getContextOrSparkClassLoader)
            } else {
              new MutableURLClassLoader(urls, Utils.getContextOrSparkClassLoader)
            }
      
          var userArgs = args.userArgs
          if (args.primaryPyFile != null && args.primaryPyFile.endsWith(".py")) {
            // When running pyspark, the app is run using PythonRunner. The second argument is the list
            // of files to add to PYTHONPATH, which Client.scala already handles, so it's empty.
            userArgs = Seq(args.primaryPyFile, "") ++ userArgs
          }
          if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
            // TODO(davies): add R dependencies here
          }
          val mainMethod = userClassLoader.loadClass(args.userClass)
            .getMethod("main", classOf[Array[String]])
      
          val userThread = new Thread {
            override def run() {
              try {
                mainMethod.invoke(null, userArgs.toArray)
                finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
                logDebug("Done running users class")
              } catch {
                case e: InvocationTargetException =>
                  e.getCause match {
                    case _: InterruptedException =>
                      // Reporter thread can interrupt to stop user class
                    case SparkUserAppException(exitCode) =>
                      val msg = s"User application exited with status $exitCode"
                      logError(msg)
                      finish(FinalApplicationStatus.FAILED, exitCode, msg)
                    case cause: Throwable =>
                      logError("User class threw exception: " + cause, cause)
                      finish(FinalApplicationStatus.FAILED,
                        ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
                        "User class threw exception: " + cause)
                  }
                  sparkContextPromise.tryFailure(e.getCause())
              } finally {
                // Notify the thread waiting for the SparkContext, in case the application did not
                // instantiate one. This will do nothing when the user code instantiates a SparkContext
                // (with the correct master), or when the user code throws an exception (due to the
                // tryFailure above).
                sparkContextPromise.trySuccess(null)
              }
            }
          }
          userThread.setContextClassLoader(userClassLoader)
          userThread.setName("Driver")
          userThread.start()
          userThread
        }
      
    • 1. The very first statement logs "Starting the user application in a separate Thread". A user thread? The only component I know of that runs the user application is the Driver, yet the Driver is usually described as a process. Let's keep reading with that question in mind

    • 2. The main work here is building a class loader, apparently so that some class can be loaded. Which class?

    •     val mainMethod = userClassLoader.loadClass(args.userClass)
            .getMethod("main", classOf[Array[String]])
      
    • 3. The class being loaded is args.userClass. Careful readers will notice that args is of type ApplicationMasterArguments, which was created before run() was executed, and whose userClass field we said to remember: it holds the user's own Spark application class (if you wrote a WordCount, it is the WordCount class). With the class loaded, set that aside and keep reading

    • 4. A little further down a Thread object is created, and creating one means a new thread is about to be started

    •     val userThread = new Thread {
            override def run() {
              try {
                mainMethod.invoke(null, userArgs.toArray)
                finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
                logDebug("Done running users class")
              } catch {
                case _: Throwable => // error handling omitted here; see the full listing above
              }
            }
          }
          userThread.setContextClassLoader(userClassLoader)
          userThread.setName("Driver")
          userThread.start()
          userThread
      
    • 5. Here the user application class we just loaded with the class loader has its main method invoked

    • 6. As far as I know, the component that runs the user program is the Driver, but here it is a thread. Let's keep going

    • 7. The third-to-last line above, userThread.setName("Driver"), names this thread "Driver". So in yarn-cluster mode the Driver turns out to be a thread inside the ApplicationMaster process; a condensed sketch of this load-and-invoke pattern follows below
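    • A minimal, self-contained sketch of that pattern (the jar path and class name are placeholders, and a plain URLClassLoader stands in for Spark's MutableURLClassLoader/ChildFirstURLClassLoader):

      ```scala
      import java.io.File
      import java.net.{URL, URLClassLoader}

      object DriverThreadSketch {
        def main(args: Array[String]): Unit = {
          val userJar   = new File("/path/to/wordcount.jar") // placeholder
          val userClass = "com.example.WordCount"            // placeholder

          // Dedicated loader for the user's classes, parented to the current loader.
          val loader = new URLClassLoader(Array[URL](userJar.toURI.toURL),
            Thread.currentThread().getContextClassLoader)
          val mainMethod = loader.loadClass(userClass)
            .getMethod("main", classOf[Array[String]])

          val userThread = new Thread {
            override def run(): Unit = {
              // main() is static, so the receiver is null; the user code
              // (and the SparkContext it creates) lives entirely in this thread.
              mainMethod.invoke(null, Array.empty[String])
            }
          }
          userThread.setContextClassLoader(loader)
          userThread.setName("Driver")
          userThread.start()
          userThread.join()
        }
      }
      ```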

  • Back to runDriver(securityMgr)

    • private def runDriver(securityMgr: SecurityManager): Unit = {
          addAmIpFilter()
          userClassThread = startUserApplication()
      
          // This a bit hacky, but we need to wait until the spark.driver.port property has
          // been set by the Thread executing the user class.
          logInfo("Waiting for spark context initialization...")
          val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
          try {
            val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
              Duration(totalWaitTime, TimeUnit.MILLISECONDS))
            if (sc != null) {
              rpcEnv = sc.env.rpcEnv
              val driverRef = runAMEndpoint(
                sc.getConf.get("spark.driver.host"),
                sc.getConf.get("spark.driver.port"),
                isClusterMode = true)
              registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),
                securityMgr)
            } else {
              // Sanity check; should never happen in normal operation, since sc should only be null
              // if the user app did not create a SparkContext.
              if (!finished) {
                throw new IllegalStateException("SparkContext is null but app is still running!")
              }
            }
            userClassThread.join()
          } catch {
            case e: SparkException if e.getCause().isInstanceOf[TimeoutException] =>
              logError(
                s"SparkContext did not initialize after waiting for $totalWaitTime ms. " +
                 "Please check earlier log output for errors. Failing the application.")
              finish(FinalApplicationStatus.FAILED,
                ApplicationMaster.EXIT_SC_NOT_INITED,
                "Timed out waiting for SparkContext.")
          }
        }
      
    • 1. Line 10 becomes clear once you read the English comment: the AM thread waits, up to a timeout, for the SparkContext created by the user thread to become available (a standalone sketch of this promise/await handshake follows at the end of this list). Since we are only tracing the submission flow, we won't dig deeper here

        val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
          Duration(totalWaitTime, TimeUnit.MILLISECONDS))
      
    • 2. Line 18 then calls a registration method. Judging by its name it registers the AM, and the Driver reference we just obtained is passed along as well. My guess is that the AM is being registered with the RM; let's keep going
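    • As mentioned above, here is a self-contained sketch of that handshake: the AM thread blocks on a Promise until the user ("Driver") thread publishes its result, with a plain String standing in for the SparkContext and Await.result standing in for ThreadUtils.awaitResult:

      ```scala
      import scala.concurrent.{Await, Promise}
      import scala.concurrent.duration._

      object ContextHandshakeSketch {
        def main(args: Array[String]): Unit = {
          val contextPromise = Promise[String]()        // stands in for sparkContextPromise

          val userThread = new Thread {
            override def run(): Unit = {
              Thread.sleep(500)                         // simulate the user code starting up
              contextPromise.trySuccess("SparkContext") // what the user thread reports back
            }
          }
          userThread.setName("Driver")
          userThread.start()

          // The AM thread waits here, just like runDriver() does, until the value arrives
          // or the timeout (spark.yarn.am.waitTime in the real code) expires.
          val sc = Await.result(contextPromise.future, 10.seconds)
          println(s"got $sc, the AM can now register itself and request containers")
          userThread.join()
        }
      }
      ```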

  • registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),securityMgr)

    • private def registerAM(
            _sparkConf: SparkConf,
            _rpcEnv: RpcEnv,
            driverRef: RpcEndpointRef,
            uiAddress: String,
            securityMgr: SecurityManager) = {
          val appId = client.getAttemptId().getApplicationId().toString()
          val attemptId = client.getAttemptId().getAttemptId().toString()
          val historyAddress =
            _sparkConf.get(HISTORY_SERVER_ADDRESS)
              .map { text => SparkHadoopUtil.get.substituteHadoopVariables(text, yarnConf) }
              .map { address => s"${address}${HistoryServer.UI_PATH_PREFIX}/${appId}/${attemptId}" }
              .getOrElse("")
      
          val driverUrl = RpcEndpointAddress(
            _sparkConf.get("spark.driver.host"),
            _sparkConf.get("spark.driver.port").toInt,
            CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
      
          // Before we initialize the allocator, let's log the information about how executors will
          // be run up front, to avoid printing this out for every single executor being launched.
          // Use placeholders for information that changes such as executor IDs.
          logInfo {
            val executorMemory = sparkConf.get(EXECUTOR_MEMORY).toInt
            val executorCores = sparkConf.get(EXECUTOR_CORES)
            val dummyRunner = new ExecutorRunnable(None, yarnConf, sparkConf, driverUrl, "<executorId>",
              "<hostname>", executorMemory, executorCores, appId, securityMgr, localResources)
            dummyRunner.launchContextDebugInfo()
          }
      
          allocator = client.register(driverUrl,
            driverRef,
            yarnConf,
            _sparkConf,
            uiAddress,
            historyAddress,
            securityMgr,
            localResources)
      
          allocator.allocateResources()
          reporterThread = launchReporterThread()
        }
      
    • 1. The important code here comes in two parts

    • allocator = client.register(driverUrl,
        driverRef,
        yarnConf,
        _sparkConf,
        uiAddress,
        historyAddress,
        securityMgr,
        localResources)
      
    • 2. Its main job is to register the current client with the RM, and it also hands back an allocator object. But what exactly is this client? Let's look one level deeper

    • def register(
          driverUrl: String,
          driverRef: RpcEndpointRef,
          conf: YarnConfiguration,
          sparkConf: SparkConf,
          uiAddress: String,
          uiHistoryAddress: String,
          securityMgr: SecurityManager,
          localResources: Map[String, LocalResource]
        ): YarnAllocator = {
        amClient = AMRMClient.createAMRMClient()
        amClient.init(conf)
        amClient.start()
        this.uiHistoryAddress = uiHistoryAddress
      
        logInfo("Registering the ApplicationMaster")
        synchronized {
          amClient.registerApplicationMaster(Utils.localHostName(), 0, uiAddress)
          registered = true
        }
        new YarnAllocator(driverUrl, driverRef, conf, sparkConf, amClient, getAttemptId(), securityMgr,
          localResources, new SparkRackResolver())
      }
      
    • 3. The amClient here is exactly the AMRMClient field we saw earlier inside YarnRMClient, which explains what that field is for: maintaining the communication with the RM (a bare-bones sketch of this register/allocate cycle follows at the end of this list). There is no need to analyze it further, since we only care about the submission flow

    • 4. Back to the allocator object: it looks like the handle returned once registration with the RM succeeds. What does it offer? Let's keep reading

    • allocator.allocateResources()
      
    • 5. The next call confirms that idea: this is not just registering a connection, it also asks for the resources we have been granted (the method name gives it away). Let's step inside

    • def allocateResources(): Unit = synchronized {
        updateResourceRequests()
      
        val progressIndicator = 0.1f
        // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
        // requests.
        val allocateResponse = amClient.allocate(progressIndicator)
      
        val allocatedContainers = allocateResponse.getAllocatedContainers()
      
        if (allocatedContainers.size > 0) {
          logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
            .format(
              allocatedContainers.size,
              numExecutorsRunning,
              allocateResponse.getAvailableResources))
      
          handleAllocatedContainers(allocatedContainers.asScala)
        }
      
        val completedContainers = allocateResponse.getCompletedContainersStatuses()
        if (completedContainers.size > 0) {
          logDebug("Completed %d containers".format(completedContainers.size))
          processCompletedContainers(completedContainers.asScala)
          logDebug("Finished processing %d completed containers. Current running executor count: %d."
            .format(completedContainers.size, numExecutorsRunning))
        }
      }
      
    • 6. This part can look a little confusing at first

    •   // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
        // requests.
        val allocateResponse = amClient.allocate(progressIndicator)
      
        val allocatedContainers = allocateResponse.getAllocatedContainers()
      
    • 7. As the comments say, these two lines poll the RM (doubling as the AM heartbeat) and fetch the containers that have been allocated to us

    • 8. Once we have the containers, the next step is of course to use them, and right below there is a method for handling exactly that

    •     if (allocatedContainers.size > 0) {
            logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
              .format(
                allocatedContainers.size,
                numExecutorsRunning,
                allocateResponse.getAvailableResources))
      
            handleAllocatedContainers(allocatedContainers.asScala)
          }
      
    • 9. Naturally, that only happens when the number of allocated containers is greater than zero
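    • As referenced earlier, a bare-bones sketch of the raw AMRMClient register/allocate cycle that YarnRMClient and YarnAllocator wrap. It would only actually run inside an AM container that already holds the AMRM token; the memory, core and host values are placeholders:

      ```scala
      import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
      import org.apache.hadoop.yarn.client.api.AMRMClient
      import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
      import org.apache.hadoop.yarn.conf.YarnConfiguration
      import scala.collection.JavaConverters._

      object AmRmCycleSketch {
        def main(args: Array[String]): Unit = {
          val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
          amClient.init(new YarnConfiguration())
          amClient.start()

          // 1. Register this AM with the RM (host, RPC port, tracking URL).
          amClient.registerApplicationMaster("localhost", 0, "")

          // 2. Ask for one container of 1 GiB / 1 vcore, anywhere in the cluster.
          val capability = Resource.newInstance(1024, 1)
          amClient.addContainerRequest(
            new ContainerRequest(capability, null, null, Priority.newInstance(1)))

          // 3. Poll the RM; allocate() doubles as the AM heartbeat.
          val response = amClient.allocate(0.1f)
          response.getAllocatedContainers.asScala.foreach { c =>
            println(s"granted container ${c.getId} on ${c.getNodeId.getHost}")
          }

          amClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
          amClient.stop()
        }
      }
      ```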

  • handleAllocatedContainers(allocatedContainers.asScala)

    • def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
          val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)
      
          // Match incoming requests by host
          val remainingAfterHostMatches = new ArrayBuffer[Container]
          for (allocatedContainer <- allocatedContainers) {
            matchContainerToRequest(allocatedContainer, allocatedContainer.getNodeId.getHost,
              containersToUse, remainingAfterHostMatches)
          }
      
          // Match remaining by rack
          val remainingAfterRackMatches = new ArrayBuffer[Container]
          for (allocatedContainer <- remainingAfterHostMatches) {
            val rack = resolver.resolve(conf, allocatedContainer.getNodeId.getHost)
            matchContainerToRequest(allocatedContainer, rack, containersToUse,
              remainingAfterRackMatches)
          }
      
          // Assign remaining that are neither node-local nor rack-local
          val remainingAfterOffRackMatches = new ArrayBuffer[Container]
          for (allocatedContainer <- remainingAfterRackMatches) {
            matchContainerToRequest(allocatedContainer, ANY_HOST, containersToUse,
              remainingAfterOffRackMatches)
          }
      
          if (!remainingAfterOffRackMatches.isEmpty) {
            logDebug(s"Releasing ${remainingAfterOffRackMatches.size} unneeded containers that were " +
              s"allocated to us")
            for (container <- remainingAfterOffRackMatches) {
              internalReleaseContainer(container)
            }
          }
      
          runAllocatedContainers(containersToUse)
      
          logInfo("Received %d containers from YARN, launching executors on %d of them."
            .format(allocatedContainers.size, containersToUse.size))
        }
      
    • 1. The log statement at the very end ("Received %d containers from YARN, launching executors on %d of them.") makes it clear that the actual launching happens in the line right above it

    • 2. Everything before that is locality matching (rack awareness), which is worth reading up on when you have time. The idea is that the closer a container runs to the data (same node, then same rack), the less network IO the job pays and the better it performs; a toy version of the three matching passes follows after this list

    • 3. Step into runAllocatedContainers(containersToUse)
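    • Before stepping in, here is a toy, self-contained version of those three matching passes (host, then rack, then anywhere), with a hard-coded rack table standing in for the real RackResolver:

      ```scala
      import scala.collection.mutable.ArrayBuffer

      object LocalityMatchSketch {
        final case class Container(id: Int, host: String)
        val ANY_HOST = "*"

        // Outstanding requests per location (host, rack, or ANY_HOST) and a toy rack table.
        val pendingRequests = scala.collection.mutable.Map("node1" -> 1, "/rack2" -> 1, ANY_HOST -> 1)
        val rackOf = Map("node1" -> "/rack1", "node2" -> "/rack2", "node3" -> "/rack3")

        def matchAt(location: String, c: Container,
                    use: ArrayBuffer[Container], remaining: ArrayBuffer[Container]): Unit = {
          if (pendingRequests.getOrElse(location, 0) > 0) {
            pendingRequests(location) -= 1
            use += c
          } else {
            remaining += c
          }
        }

        def main(args: Array[String]): Unit = {
          val allocated = Seq(Container(1, "node1"), Container(2, "node2"), Container(3, "node3"))
          val use = new ArrayBuffer[Container]
          val afterHost = new ArrayBuffer[Container]
          val afterRack = new ArrayBuffer[Container]
          val afterAny = new ArrayBuffer[Container]

          allocated.foreach(c => matchAt(c.host, c, use, afterHost))         // pass 1: node-local
          afterHost.foreach(c => matchAt(rackOf(c.host), c, use, afterRack)) // pass 2: rack-local
          afterRack.foreach(c => matchAt(ANY_HOST, c, use, afterAny))        // pass 3: anywhere
          afterAny.foreach(c => println(s"releasing unneeded container ${c.id}"))

          println(s"launching executors on containers: ${use.map(_.id).mkString(", ")}")
        }
      }
      ```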

  • runAllocatedContainers(containersToUse)

    •  private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
          for (container <- containersToUse) {
            executorIdCounter += 1
            val executorHostname = container.getNodeId.getHost
            val containerId = container.getId
            val executorId = executorIdCounter.toString
            assert(container.getResource.getMemory >= resource.getMemory)
            logInfo(s"Launching container $containerId on host $executorHostname")
      
            def updateInternalState(): Unit = synchronized {
              numExecutorsRunning += 1
              executorIdToContainer(executorId) = container
              containerIdToExecutorId(container.getId) = executorId
      
              val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
                new HashSet[ContainerId])
              containerSet += containerId
              allocatedContainerToHostMap.put(containerId, executorHostname)
            }
      
            if (numExecutorsRunning < targetNumExecutors) {
              if (launchContainers) {
                launcherPool.execute(new Runnable {
                  override def run(): Unit = {
                    try {
                      new ExecutorRunnable(
                        Some(container),
                        conf,
                        sparkConf,
                        driverUrl,
                        executorId,
                        executorHostname,
                        executorMemory,
                        executorCores,
                        appAttemptId.getApplicationId.toString,
                        securityMgr,
                        localResources
                      ).run()
                      updateInternalState()
                    } catch {
                      case NonFatal(e) =>
                        logError(s"Failed to launch executor $executorId on container $containerId", e)
                        // Assigned container should be released immediately to avoid unnecessary resource
                        // occupation.
                        amClient.releaseAssignedContainer(containerId)
                    }
                  }
                })
              } else {
                // For test only
                updateInternalState()
              }
            } else {
              logInfo(("Skip launching executorRunnable as runnning Excecutors count: %d " +
                "reached target Executors count: %d.").format(numExecutorsRunning, targetNumExecutors))
            }
          }
        }
      
    • 1. The key check is if (numExecutorsRunning < targetNumExecutors): if the number of running executors is still below the target, more executors need to be launched, using the containers just obtained from the RM

    • 2. Notice that an ExecutorRunnable object is created and its run() method is called directly inside a Runnable handed to launcherPool, rather than start() as you would on a plain Thread (a stripped-down sketch of this launch loop follows after this list)

    • 3. Let's look inside; the name alone gives away that it has to do with starting an Executor
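    • A stripped-down, runnable sketch of that launch loop. The container IDs are made up, and the real code builds an ExecutorRunnable inside each Runnable and only updates its bookkeeping in a synchronized updateInternalState() after the launch succeeds:

      ```scala
      import java.util.concurrent.Executors
      import java.util.concurrent.atomic.AtomicInteger

      object LauncherPoolSketch {
        def main(args: Array[String]): Unit = {
          val targetNumExecutors = 3
          val numExecutorsRunning = new AtomicInteger(0)
          val launcherPool = Executors.newCachedThreadPool()

          val grantedContainers = Seq("container_01", "container_02", "container_03", "container_04")
          grantedContainers.foreach { containerId =>
            if (numExecutorsRunning.get() < targetNumExecutors) {
              numExecutorsRunning.incrementAndGet() // simplified: counted eagerly for determinism
              launcherPool.execute(new Runnable {
                override def run(): Unit = {
                  // The real code constructs an ExecutorRunnable here and calls .run(),
                  // which asks the container's NodeManager to start it.
                  println(s"launching executor in $containerId on ${Thread.currentThread().getName}")
                }
              })
            } else {
              println(s"skipping $containerId: target executor count already reached")
            }
          }
          launcherPool.shutdown()
        }
      }
      ```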

  • new ExecutorRunnable().run()

    •   def run(): Unit = {
          logDebug("Starting Executor Container")
          nmClient = NMClient.createNMClient()
          nmClient.init(conf)
          nmClient.start()
          startContainer()
        }
      
    • 1. That makes things simple: the goal is to start an Executor, which must run inside a Container, so let's look at startContainer(). A bare sketch of the NMClient handshake it boils down to follows below
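    • This sketch only makes sense inside a real AM, because the Container must be one the RM actually granted; the single command string stands in for the full command list that prepareCommand() builds:

      ```scala
      import java.util.Collections
      import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext}
      import org.apache.hadoop.yarn.client.api.NMClient
      import org.apache.hadoop.yarn.conf.YarnConfiguration
      import org.apache.hadoop.yarn.util.Records

      object NmLaunchSketch {
        def launch(container: Container, command: String): Unit = {
          val nmClient = NMClient.createNMClient()
          nmClient.init(new YarnConfiguration())
          nmClient.start()

          // Describe what should run inside the granted container.
          val ctx = Records.newRecord(classOf[ContainerLaunchContext])
          ctx.setCommands(Collections.singletonList(command)) // e.g. the java command prepareCommand() builds

          // Ask the NodeManager hosting this container to start it.
          nmClient.startContainer(container, ctx)
        }
      }
      ```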

  • startContainer()

    • def startContainer(): java.util.Map[String, ByteBuffer] = {
          val ctx = Records.newRecord(classOf[ContainerLaunchContext])
            .asInstanceOf[ContainerLaunchContext]
          val env = prepareEnvironment().asJava
      
          ctx.setLocalResources(localResources.asJava)
          ctx.setEnvironment(env)
      
          val credentials = UserGroupInformation.getCurrentUser().getCredentials()
          val dob = new DataOutputBuffer()
          credentials.writeTokenStorageToStream(dob)
          ctx.setTokens(ByteBuffer.wrap(dob.getData()))
      
          val commands = prepareCommand()
      
          ctx.setCommands(commands.asJava)
          ctx.setApplicationACLs(
            YarnSparkHadoopUtil.getApplicationAclsForYarn(securityMgr).asJava)
      
          // If external shuffle service is enabled, register with the Yarn shuffle service already
          // started on the NodeManager and, if authentication is enabled, provide it with our secret
          // key for fetching shuffle files later
          if (sparkConf.get(SHUFFLE_SERVICE_ENABLED)) {
            val secretString = securityMgr.getSecretKey()
            val secretBytes =
              if (secretString != null) {
                // This conversion must match how the YarnShuffleService decodes our secret
                JavaUtils.stringToBytes(secretString)
              } else {
                // Authentication is not enabled, so just provide dummy metadata
                ByteBuffer.allocate(0)
              }
            ctx.setServiceData(Collections.singletonMap("spark_shuffle", secretBytes))
          }

          // (remainder omitted: the method then sends the launch request to this
          // container's NodeManager via nmClient.startContainer(container.get, ctx))
        }
      
    • 1. The two lines that matter most here are the prepareCommand() call and ctx.setCommands(...)

    • 2. prepareCommand() assembles the launch command

    • 3. ctx.setCommands(...) places that command (a java command line) into the ContainerLaunchContext that will be sent to the NodeManager

    • 4. Step into prepareCommand()

    •   private def prepareCommand(): List[String] = {
          // Extra options for the JVM
      	.....
          // Set the JVM memory
          val executorMemoryString = executorMemory + "m"
          javaOpts += "-Xmx" + executorMemoryString
      
          // Set extra Java options for the executor, if defined
          sparkConf.get(EXECUTOR_JAVA_OPTIONS).foreach { opts =>
            javaOpts ++= Utils.splitCommandString(opts).map(YarnSparkHadoopUtil.escapeForShell)
          }
          sys.env.get("SPARK_JAVA_OPTS").foreach { opts =>
            javaOpts ++= Utils.splitCommandString(opts).map(YarnSparkHadoopUtil.escapeForShell)
          }
          sparkConf.get(EXECUTOR_LIBRARY_PATH).foreach { p =>
            prefixEnv = Some(Client.getClusterPath(sparkConf, Utils.libraryPathEnvPrefix(Seq(p))))
          }
      
          javaOpts += "-Djava.io.tmpdir=" +
            new Path(
              YarnSparkHadoopUtil.expandEnvironment(Environment.PWD),
              YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR
            )
      
      
      
          YarnSparkHadoopUtil.addOutOfMemoryErrorArgument(javaOpts)
          val commands = prefixEnv ++ Seq(
            YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java",
            "-server") ++
            javaOpts ++
            Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
              "--driver-url", masterAddress,
              "--executor-id", executorId,
              "--hostname", hostname,
              "--cores", executorCores.toString,
              "--app-id", appId) ++
            userClassPath ++
            Seq(
              s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
              s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")
      
          // TODO: it would be nicer to just make sure there are no null commands here
          commands.map(s => if (s == null) "null" else s).toList
        }
      
    • 5. By this point many readers will recognize the pattern: once again a java command is assembled to run some class's main method. Exactly so, and what we need to find out is which class

    • 6. There it is in the command: org.apache.spark.executor.CoarseGrainedExecutorBackend, clearly our target. A worked example of the assembled command line follows below
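    • A worked example of what the assembled command line ends up looking like. The driver URL, host name, application ID and memory size are made up; only the main class and the flag names come from the code above:

      ```scala
      object ExecutorCommandSketch {
        def main(args: Array[String]): Unit = {
          val driverUrl     = "spark://CoarseGrainedScheduler@192.168.1.10:40655" // placeholder
          val executorId    = "1"
          val hostname      = "node2"                                             // placeholder
          val executorCores = 2
          val appId         = "application_1615000000000_0001"                    // placeholder

          val command = Seq(
            "$JAVA_HOME/bin/java", "-server", "-Xmx2048m",
            "org.apache.spark.executor.CoarseGrainedExecutorBackend",
            "--driver-url", driverUrl,
            "--executor-id", executorId,
            "--hostname", hostname,
            "--cores", executorCores.toString,
            "--app-id", appId,
            "1><LOG_DIR>/stdout", "2><LOG_DIR>/stderr") // <LOG_DIR> is expanded by YARN

          println(command.mkString(" "))
        }
      }
      ```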

  • Summary of steps ③④⑤⑥

    • 1. This article covered how the Driver thread is started inside the AM
    • 2. Once the Driver thread starts, it runs the user-written program
    • 3. While the program runs, the AM (together with the Driver thread and related components) registers itself with the RM in order to request container resources for launching Executors
    • 4. After the resources come back, the AM checks whether more executors are needed; if so it performs locality (rack-aware) matching and then starts a CoarseGrainedExecutorBackend on a suitable node, which is how the Spark work gets handed out


Continue to the next part

(Part 4) https://blog.csdn.net/long_World/article/details/114984394
