3. Spark Source Code Analysis (YARN Cluster Mode): Launching the YARN ApplicationMaster

This series is based on Spark 2.4.6.
In the previous section we saw that Spark, through the YARN client, submits an ApplicationMaster to the ResourceManager (RM) and has it launched. Let's now look at the details, starting from ApplicationMaster.main:

```scala
  def main(args: Array[String]): Unit = {
    SignalUtils.registerLogger(log)
    val amArgs = new ApplicationMasterArguments(args)
    master = new ApplicationMaster(amArgs)
    System.exit(master.run())
  }
  final def run(): Int = {
    doAsUser {
      runImpl()
    }
    exitCode
  }
  private def runImpl(): Unit = {
    try {
      val appAttemptId = client.getAttemptId()

      var attemptID: Option[String] = None

      if (isClusterMode) {
        // Set the web ui port to be ephemeral for yarn so we don't conflict with
        // other spark processes running on the same box
        System.setProperty("spark.ui.port", "0")

        // Set the master and deploy mode property to match the requested mode.
        System.setProperty("spark.master", "yarn")
        System.setProperty("spark.submit.deployMode", "cluster")

        // Set this internal configuration if it is running on cluster mode, this
        // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
        System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())

        attemptID = Option(appAttemptId.getAttemptId.toString)
      }

      new CallerContext(
        "APPMASTER", sparkConf.get(APP_CALLER_CONTEXT),
        Option(appAttemptId.getApplicationId.toString), attemptID).setCurrentContext()

      logInfo("ApplicationAttemptId: " + appAttemptId)

      // This shutdown hook should run *after* the SparkContext is shut down.
      val priority = ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY - 1
      ShutdownHookManager.addShutdownHook(priority) { () =>
        val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf)
        val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts

        if (!finished) {
          // The default state of ApplicationMaster is failed if it is invoked by shut down hook.
          // This behavior is different compared to 1.x version.
          // If user application is exited ahead of time by calling System.exit(N), here mark
          // this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call
          // System.exit(0) to terminate the application.
          finish(finalStatus,
            ApplicationMaster.EXIT_EARLY,
            "Shutdown hook called before final status was reported.")
        }

        if (!unregistered) {
          // we only want to unregister if we don't want the RM to retry
          if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
            unregister(finalStatus, finalMsg)
            cleanupStagingDir()
          }
        }
      }

      if (isClusterMode) {
        runDriver()
      } else {
        runExecutorLauncher()
      }
    } catch { ... } finally {
      try {
        metricsSystem.foreach { ms =>
          ms.report()
          ms.stop()
        }
      } catch {...}
    }
  }
```
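One detail worth noting in run(): runImpl() is wrapped in doAsUser, so everything the AM does against Hadoop (HDFS, the RM) runs as the submitting user. The snippet below is a minimal sketch of that pattern using Hadoop's UserGroupInformation API, not Spark's actual helper; the SPARK_USER fallback value is a placeholder.

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation

object DoAsUserSketch {
  def main(args: Array[String]): Unit = {
    // Build a UGI for the submitting user (SPARK_USER is set in the container environment
    // by the YARN client); "spark" is only a fallback for this sketch.
    val ugi = UserGroupInformation.createRemoteUser(sys.env.getOrElse("SPARK_USER", "spark"))

    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // In the real ApplicationMaster, runImpl() executes here.
        println(s"Running as ${UserGroupInformation.getCurrentUser.getShortUserName}")
      }
    })
  }
}
```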

Since we are running in cluster mode, runDriver is what gets invoked:

```scala
  private def runDriver(): Unit = {
    addAmIpFilter(None)
    userClassThread = startUserApplication()

    // This a bit hacky, but we need to wait until the spark.driver.port property has
    // been set by the Thread executing the user class.
    logInfo("Waiting for spark context initialization...")
    val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
    try {
      val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
        Duration(totalWaitTime, TimeUnit.MILLISECONDS))
      if (sc != null) {
        rpcEnv = sc.env.rpcEnv

        val userConf = sc.getConf
        val host = userConf.get("spark.driver.host")
        val port = userConf.get("spark.driver.port").toInt
        registerAM(host, port, userConf, sc.ui.map(_.webUrl))

        val driverRef = rpcEnv.setupEndpointRef(
          RpcAddress(host, port),
          YarnSchedulerBackend.ENDPOINT_NAME)
        createAllocator(driverRef, userConf)
      } else {
        // Sanity check; should never happen in normal operation, since sc should only be null
        // if the user app did not create a SparkContext.
        throw new IllegalStateException("User did not initialize spark context!")
      }
      resumeDriver()
      userClassThread.join()
    } catch {
      case e: SparkException if e.getCause().isInstanceOf[TimeoutException] =>
        logError(
          s"SparkContext did not initialize after waiting for $totalWaitTime ms. " +
           "Please check earlier log output for errors. Failing the application.")
        finish(FinalApplicationStatus.FAILED,
          ApplicationMaster.EXIT_SC_NOT_INITED,
          "Timed out waiting for SparkContext.")
    } finally {
      resumeDriver()
    }
  }
```
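The "hacky" wait in the comment is a Promise/Future handshake: the user-class thread eventually creates a SparkContext, which in cluster mode reports itself back to the ApplicationMaster (via ApplicationMaster.sparkContextInitialized, completing sparkContextPromise), while runDriver blocks on the future for up to spark.yarn.am.waitTime. Below is a minimal, self-contained sketch of just that mechanism; the names and timings are illustrative, not Spark's.

```scala
import java.util.concurrent.TimeoutException

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object ContextHandshakeSketch {
  def main(args: Array[String]): Unit = {
    // Stands in for ApplicationMaster.sparkContextPromise.
    val scPromise = Promise[String]()

    // Stands in for the "Driver" thread running the user class; creating the
    // SparkContext is what completes the promise in the real code.
    val userThread = new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(500) // user code doing its setup
        scPromise.success("SparkContext ready")
      }
    })
    userThread.setName("Driver")
    userThread.start()

    // The AM side: wait for the context, failing the application if it never shows up.
    try {
      val sc = Await.result(scPromise.future, 100.seconds)
      println(sc) // here the real code registers the AM and creates the allocator
    } catch {
      case _: TimeoutException => println("SparkContext did not initialize in time")
    }
    userThread.join()
  }
}
```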

The first step in runDriver is startUserApplication, which launches the class we wrote ourselves (the main class specified with --class when the job was submitted) on a separate, dedicated thread.
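A rough sketch of what startUserApplication boils down to (a reconstruction, not the literal Spark code; the real method also builds a dedicated user class loader and installs exception handling that feeds into the final YARN application status): load the user class, look up its static main method, and invoke it on a thread named "Driver". HelloApp and its arguments below are hypothetical stand-ins.

```scala
// Stands in for the user's --class (a hypothetical application).
object HelloApp {
  def main(args: Array[String]): Unit = println("user main called with: " + args.mkString(", "))
}

object StartUserAppSketch {
  def startUserApplication(userClass: String, userArgs: Array[String]): Thread = {
    val mainMethod = Thread.currentThread().getContextClassLoader
      .loadClass(userClass)
      .getMethod("main", classOf[Array[String]])

    val userThread = new Thread(new Runnable {
      override def run(): Unit = {
        // main is static, so the receiver is null; userArgs is the single String[] argument.
        mainMethod.invoke(null, userArgs)
      }
    })
    userThread.setName("Driver") // this is why the user code shows up on a thread named "Driver"
    userThread.start()
    userThread
  }

  def main(args: Array[String]): Unit = {
    val t = startUserApplication("HelloApp", Array("--input", "/tmp/data")) // hypothetical args
    t.join()
  }
}
```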
It then requests resources from the RM via createAllocator, where the concrete work is done by YarnAllocator:

```scala
  private def createAllocator(driverRef: RpcEndpointRef, _sparkConf: SparkConf): Unit = {
    val appId = client.getAttemptId().getApplicationId().toString()
    val driverUrl = RpcEndpointAddress(driverRef.address.host, driverRef.address.port,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    allocator = client.createAllocator(
      yarnConf,
      _sparkConf,
      driverUrl,
      driverRef,
      securityMgr,
      localResources)
    credentialRenewer.foreach(_.setDriverRef(driverRef))
    rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef))
    allocator.allocateResources()
    val ms = MetricsSystem.createMetricsSystem("applicationMaster", sparkConf, securityMgr)
    val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId)
    ms.registerSource(new ApplicationMasterSource(prefix, allocator))
    // do not register static sources in this case as per SPARK-25277
    ms.start(false)
    metricsSystem = Some(ms)
    reporterThread = launchReporterThread()
  }
```
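The reporterThread started on the last line is what keeps the allocation loop alive: it periodically calls allocator.allocateResources() and heartbeats the RM. The sketch below captures only that core loop; the real launchReporterThread also tracks executor failures, adjusts the interval, and finishes the AM on fatal errors.

```scala
object ReporterThreadSketch {

  // `allocate` stands in for allocator.allocateResources().
  def launchReporterThread(allocate: () => Unit, heartbeatIntervalMs: Long): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        while (!Thread.currentThread().isInterrupted) {
          allocate()                        // heartbeat the RM and pick up newly granted containers
          Thread.sleep(heartbeatIntervalMs) // spark.yarn.scheduler.heartbeat.interval-ms
        }
      }
    })
    t.setDaemon(true)
    t.setName("Reporter")
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    val reporter = launchReporterThread(() => println("allocateResources()"), 1000L)
    Thread.sleep(3500L) // let a few heartbeats run; the daemon thread dies with the JVM
  }
}
```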

Inside allocateResources, containers are requested from the RM:

```scala
  def allocateResources(): Unit = synchronized {
    updateResourceRequests()
    val progressIndicator = 0.1f
    val allocateResponse = amClient.allocate(progressIndicator)

    val allocatedContainers = allocateResponse.getAllocatedContainers()
    allocatorBlacklistTracker.setNumClusterNodes(allocateResponse.getNumClusterNodes)

    if (allocatedContainers.size > 0) {
      handleAllocatedContainers(allocatedContainers.asScala)
    }

    val completedContainers = allocateResponse.getCompletedContainersStatuses()
    // ... (processing of completed containers elided)
  }
```
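updateResourceRequests, called at the top, is what turns the Spark-side executor settings into YARN resource asks. The sketch below shows the shape of such a request; it is a simplified illustration (the real code also handles locality preferences, node labels and blacklisting), and the memory/core numbers are placeholders.

```scala
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

object ContainerRequestSketch {
  def main(args: Array[String]): Unit = {
    // spark.executor.memory + memory overhead, and spark.executor.cores (illustrative values).
    val executorMemoryMb = 4096 + 384
    val executorCores    = 2
    val capability = Resource.newInstance(executorMemoryMb, executorCores)

    // No node/rack preference in this sketch; YarnAllocator derives those from pending tasks.
    val request = new ContainerRequest(capability, null /* nodes */, null /* racks */, Priority.newInstance(1))

    // The real code registers this with amClient.addContainerRequest(request); nothing is sent
    // until the next amClient.allocate(...) heartbeat, and the granted containers come back
    // in later allocate responses.
    println(s"would request: $request")
  }
}
```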

updateResourceRequests first reconciles our outstanding container requests with what we still need, sizing each request from the executor memory and CPU settings passed in at submission time. The amClient.allocate call is then essentially a heartbeat to the RM: it does not launch anything itself, it simply returns whatever containers the RM has granted us since the last heartbeat, plus the statuses of completed containers. handleAllocatedContainers then matches the granted containers against our requests and weeds out the ones we cannot use:

```scala
  def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
    val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)
    val remainingAfterHostMatches = new ArrayBuffer[Container]
    for (allocatedContainer <- allocatedContainers) {
      matchContainerToRequest(allocatedContainer, allocatedContainer.getNodeId.getHost,
        containersToUse, remainingAfterHostMatches)
    }
    val remainingAfterRackMatches = new ArrayBuffer[Container]
    if (remainingAfterHostMatches.nonEmpty) {
      var exception: Option[Throwable] = None
      val thread = new Thread("spark-rack-resolver") {
        override def run(): Unit = {
          try {
            for (allocatedContainer <- remainingAfterHostMatches) {
              val rack = resolver.resolve(conf, allocatedContainer.getNodeId.getHost)
              matchContainerToRequest(allocatedContainer, rack, containersToUse,
                remainingAfterRackMatches)
            }
          } catch {
            case e: Throwable =>
              exception = Some(e)
          }
        }
      }
      thread.setDaemon(true)
      thread.start()
      try {
        thread.join()
      } catch { ... }
    }

    val remainingAfterOffRackMatches = new ArrayBuffer[Container]
    for (allocatedContainer <- remainingAfterRackMatches) {
      matchContainerToRequest(allocatedContainer, ANY_HOST, containersToUse,
        remainingAfterOffRackMatches)
    }

    if (!remainingAfterOffRackMatches.isEmpty) {
      logDebug(s"Releasing ${remainingAfterOffRackMatches.size} unneeded containers that were " +
        s"allocated to us")
      for (container <- remainingAfterOffRackMatches) {
        internalReleaseContainer(container)
      }
    }

    runAllocatedContainers(containersToUse)

    logInfo("Received %d containers from YARN, launching executors on %d of them."
      .format(allocatedContainers.size, containersToUse.size))
  }
```
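The matching above runs in three passes: exact host first, then rack (resolved on a short-lived "spark-rack-resolver" thread), then ANY_HOST; whatever is still unmatched is released back to the RM. The toy model below reproduces only that ordering with made-up data; the types and fields are illustrative, not Spark's.

```scala
import scala.collection.mutable.ArrayBuffer

object ContainerMatchingSketch {
  final case class Container(host: String, rack: String)

  def main(args: Array[String]): Unit = {
    // Outstanding requests by locality level (toy data).
    val hostRequests = ArrayBuffer("host-1")
    val rackRequests = ArrayBuffer("rack-a")
    var anyHostRequests = 1

    val allocated = Seq(
      Container("host-1", "rack-a"),
      Container("host-2", "rack-a"),
      Container("host-7", "rack-b"),
      Container("host-9", "rack-z"))

    val toUse = ArrayBuffer[Container]()

    // Pass 1: host-level matches.
    val afterHost = ArrayBuffer[Container]()
    for (c <- allocated) {
      if (hostRequests.contains(c.host)) { hostRequests -= c.host; toUse += c } else afterHost += c
    }
    // Pass 2: rack-level matches (Spark resolves the rack on a separate thread).
    val afterRack = ArrayBuffer[Container]()
    for (c <- afterHost) {
      if (rackRequests.contains(c.rack)) { rackRequests -= c.rack; toUse += c } else afterRack += c
    }
    // Pass 3: ANY_HOST matches; leftovers get released (internalReleaseContainer).
    val toRelease = ArrayBuffer[Container]()
    for (c <- afterRack) {
      if (anyHostRequests > 0) { anyHostRequests -= 1; toUse += c } else toRelease += c
    }

    println(s"launch executors on: $toUse")
    println(s"release back to RM:  $toRelease")
  }
}
```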

Next, runAllocatedContainers launches executors on the containers that passed the matching step:

```scala
  private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
    for (container <- containersToUse) {
      executorIdCounter += 1
      val executorHostname = container.getNodeId.getHost
      val containerId = container.getId
      val executorId = executorIdCounter.toString
      def updateInternalState(): Unit = synchronized {
        runningExecutors.add(executorId)
        numExecutorsStarting.decrementAndGet()
        executorIdToContainer(executorId) = container
        containerIdToExecutorId(container.getId) = executorId

        val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
          new HashSet[ContainerId])
        containerSet += containerId
        allocatedContainerToHostMap.put(containerId, executorHostname)
      }

      if (runningExecutors.size() < targetNumExecutors) {
        numExecutorsStarting.incrementAndGet()
        if (launchContainers) {
          launcherPool.execute(new Runnable {
            override def run(): Unit = {
              try {
                new ExecutorRunnable(
                  Some(container),
                  conf,
                  sparkConf,
                  driverUrl,
                  executorId,
                  executorHostname,
                  executorMemory,
                  executorCores,
                  appAttemptId.getApplicationId.toString,
                  securityMgr,
                  localResources
                ).run()
                updateInternalState()
              } catch {
                case e: Throwable =>
                  numExecutorsStarting.decrementAndGet()
                  if (NonFatal(e)) {
                    amClient.releaseAssignedContainer(containerId)
                  } else {
                    throw e
                  }
              }
            }
          })
        } else {
          // For test only
          updateInternalState()
        }
      }
    }
  }
```
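launcherPool, used above, is a bounded pool of daemon threads (capped by spark.yarn.containerLauncherMaxThreads, default 25), so that slow NodeManager calls do not stall the allocation heartbeat and in-flight launches do not keep the AM JVM alive. Spark builds it through its own ThreadUtils helper; the sketch below is a rough equivalent using plain java.util.concurrent.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object LauncherPoolSketch {
  def main(args: Array[String]): Unit = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"ContainerLauncher-${counter.incrementAndGet()}")
        t.setDaemon(true) // don't keep the AM JVM alive because of in-flight launches
        t
      }
    }

    // 25 is the default of spark.yarn.containerLauncherMaxThreads.
    val launcherPool = Executors.newFixedThreadPool(25, factory)

    launcherPool.execute(new Runnable {
      override def run(): Unit =
        println(s"would run ExecutorRunnable on ${Thread.currentThread().getName}")
    })
    launcherPool.shutdown()
    launcherPool.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```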

Each matched container is handed to the launcher thread pool (sketched above), and the actual launch of an executor inside the container is carried out by ExecutorRunnable:

```scala
  def run(): Unit = {
    logDebug("Starting Executor Container")
    nmClient = NMClient.createNMClient()
    nmClient.init(conf)
    nmClient.start()
    startContainer()
  }

  def startContainer(): java.util.Map[String, ByteBuffer] = {
    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
      .asInstanceOf[ContainerLaunchContext]
    val env = prepareEnvironment().asJava
    ctx.setLocalResources(localResources.asJava)
    ctx.setEnvironment(env)
    val credentials = UserGroupInformation.getCurrentUser().getCredentials()
    val dob = new DataOutputBuffer()
    credentials.writeTokenStorageToStream(dob)
    ctx.setTokens(ByteBuffer.wrap(dob.getData()))

    val commands = prepareCommand()

    ctx.setCommands(commands.asJava)
    ctx.setApplicationACLs(
      YarnSparkHadoopUtil.getApplicationAclsForYarn(securityMgr).asJava)

    if (sparkConf.get(SHUFFLE_SERVICE_ENABLED)) {
      val secretString = securityMgr.getSecretKey()
      val secretBytes =
        if (secretString != null) {
          JavaUtils.stringToBytes(secretString)
        } else {
          ByteBuffer.allocate(0)
        }
      ctx.setServiceData(Collections.singletonMap("spark_shuffle", secretBytes))
    }

    try {
      nmClient.startContainer(container.get, ctx)
    } catch { ... }
  }
```

In startContainer we can see that the command built by prepareCommand launches org.apache.spark.executor.CoarseGrainedExecutorBackend inside the container; we will look at that class in detail in the next section.
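For orientation before the next section, here is a hedged reconstruction of the kind of launch command prepareCommand assembles. It is not the literal Spark code: Java options, the user class path and log redirection are simplified, and every value below is a placeholder.

```scala
object ExecutorCommandSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder values standing in for what ExecutorRunnable receives.
    val driverUrl     = "spark://CoarseGrainedScheduler@driver-host:43211"
    val executorId    = "1"
    val hostname      = "nm-host-1"
    val executorCores = 2
    val appId         = "application_1700000000000_0001"

    val command: Seq[String] =
      Seq("{{JAVA_HOME}}/bin/java", "-server", "-Xmx4096m") ++
      Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
        "--driver-url", driverUrl,
        "--executor-id", executorId,
        "--hostname", hostname,
        "--cores", executorCores.toString,
        "--app-id", appId) ++
      Seq("1><LOG_DIR>/stdout", "2><LOG_DIR>/stderr")

    // YARN expands {{JAVA_HOME}} and <LOG_DIR> on the NodeManager before running the command.
    println(command.mkString(" "))
  }
}
```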
