Spark内核原理和Spark on yarn

Schedule模块

Schedule模块主要分成了两大部分,即DAGScheduler和TaskScheduler。

Schedule模块的职责是:将用户提交的计算任务按照DAG划分为不同的Stage,并将不同Stage的计算任务提交到集群进行最终的计算。

三大类:

1)org.apache.spark.scheduler.DAGScheduler

2)org.apache.spark.scheduler.SchedulerBackend

3)org.apache.spark.scheduler.TaskScheduler

SchedulerBackend

org.apache.spark.scheduler.SchedulerBackend是一个trait,作用是分配当前可用的资源:为当前等待分配计算资源的Task分配计算资源(即Executor),并在所分配的Executor上启动Task,完成计算的调度过程。

计算的调度(资源)过程步骤:

① 为Task分配Executor

② 在分配到的Executor上启动Task

计算的调度过程实现:reviveOffers方法

image-20220616101201543

org.apache.spark.scheduler.SchedulerBackend有多个实现类,其中YARN、Standalone和Mesos模式都基于org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend,并在其之上加入了各自特有的逻辑。
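为了更直观地理解这个trait的职责,下面给出SchedulerBackend核心方法的一个精简示意(省略了其余方法和默认实现,具体签名以所用Spark版本的源码为准):

// 精简示意:SchedulerBackend 的核心方法(非完整定义)
private[spark] trait SchedulerBackend {
  def start(): Unit                 // 启动后端,例如向集群申请并启动Executor
  def stop(): Unit                  // 停止后端,释放资源
  def reviveOffers(): Unit          // 触发一轮资源供给:把当前可用的Executor资源交给TaskScheduler去分配Task
  def defaultParallelism(): Int     // 默认并行度,用于未显式指定分区数的场景
}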

TaskScheduler

org.apache.spark.scheduler.TaskScheduler是一个trait,作用是为创建它的SparkContext调度任务,即从DAGScheduler接收不同Stage的任务,并向集群提交这些任务,同时为执行特别慢的任务启动备份(speculative)任务。

该trait只有唯一的一个实现类:org.apache.spark.scheduler.TaskSchedulerImpl。

TaskSchedulerImpl的使用场景

1)有新任务提交时。

2)有任务执行失败时。

3)计算节点(即Executor)不可用时。

4)某些任务执行过慢而需要为它重新分配资源时。

每个SchedulerBackend都对应一个唯一的TaskScheduler,它们都由SparkContext创建并持有。

任务调度逻辑图

image-20220616105155928

DAGScheduler

Job的提交

用户提交的Job最终会调用DAGScheduler的runJob,它又会调用submitJob。以RDD的动作算子collect为例,调用过程如下:

org.apache.spark.rdd.RDD#collect
org.apache.spark.SparkContext#runJob  -> 这里有多个重载的runJob,最终调用dagScheduler.runJob
org.apache.spark.scheduler.DAGScheduler#runJob 
org.apache.spark.scheduler.DAGScheduler#submitJob -> 会生成jobId

# submitJob 会向 eventProcessLoop 投递 JobSubmitted 事件,后续流程如下
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop#onReceive(JobSubmitted)
org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted
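例如,下面这个最小的示例程序(仅为示意,setMaster("local[*]")只是为了本地演示)中,collect()这个action就会触发上述调用链:

import org.apache.spark.{SparkConf, SparkContext}

object CollectDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("collect-demo").setMaster("local[*]"))
    val doubled = sc.parallelize(1 to 100)   // 只有窄依赖,整个Job只有一个ResultStage
      .map(_ * 2)
      .collect()                             // action:SparkContext.runJob -> DAGScheduler.runJob -> submitJob
    println(doubled.take(5).mkString(","))
    sc.stop()
  }
}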

在调用栈第3步 org.apache.spark.scheduler.DAGScheduler#runJob 中,实际上是通过调用submitJob来完成任务的提交:

image-20220616135546565

submitJob函数会把action算子触发的任务提交给Scheduler,其中还会创建一个JobWaiter。这个JobWaiter(作为listener)会与作业信息一起封装进JobSubmitted事件(该事件类型继承自trait DAGSchedulerEvent),再通过eventProcessLoop.post(...)投递出去。

image-20220616143604231

image-20220616182021619

JobWaiter会监听Job的执行状态。由于Job是由多个Task组成的,因此只有Job的所有Task都成功完成,Job才会被标记为成功;任意一个Task失败都会标记该Job失败,这是DAGScheduler通过调用org.apache.spark.scheduler.JobWaiter#jobFailed实现的。

最后,DAGScheduler会向eventProcessLoop提交该Job,最终调用org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted来处理这次提交的Job。

image-20220616180330894

Stage的划分

image-20220616222400287
1)org.apache.spark.SparkContext#runJob
2)org.apache.spark.scheduler.DAGScheduler#runJob
3)org.apache.spark.scheduler.DAGScheduler#submitJob
4)org.apache.spark.scheduler.DAGSchedulerEventProcessLoop#onReceive(JobSubmitted)
5)org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted

handleJobSubmitted通过调用org.apache.spark.scheduler.DAGScheduler#createResultStage来创建finalStage

image-20220616221951012

createResultStage函数首先会获取当前Stage的parent Stage,然后再通过ResultStage的构造函数创建出当前的Stage:

image-20220616222150459

通过调用getOrCreateParentStages,上面图中的Stage 1和Stage 2就创建出来了。然后根据它们来创建Stage 3。

image-20220616223335840
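在继续看getOrCreateParentStages的源码之前,先用一个假设的小例子直观感受Stage的划分:两次shuffle(两个reduceByKey)会把一个Job切分成3个Stage(2个ShuffleMapStage + 1个ResultStage)。其中sc为已创建好的SparkContext,输入路径仅为示意。

val lines = sc.textFile("hdfs://namenode:8020/input/words.txt")

val wordCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)            // 第1次shuffle:Stage 1(ShuffleMapStage)与 Stage 2 的边界

val countHist = wordCounts
  .map { case (_, cnt) => (cnt, 1) }
  .reduceByKey(_ + _)            // 第2次shuffle:Stage 2(ShuffleMapStage)与 Stage 3 的边界

countHist.collect()              // action:最后的 Stage 3 是 ResultStage

下面回到getOrCreateParentStages相关的源码: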

private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
    getShuffleDependencies(rdd).map { shuffleDep =>
        getOrCreateShuffleMapStage(shuffleDep, firstJobId)
    }.toList
}
// 接下来,我们将去看上面函数中内部的两个函数:getShuffleDependencies 和 getOrCreateShuffleMapStage
// 1. getShuffleDependencies
/**
   * Returns shuffle dependencies that are immediate parents of 【the given RDD】.
   *
   * This function will not return more distant ancestors.  For example, if C has a shuffle
   * dependency on B which has a shuffle dependency on A:
   *
   * A <-- B <-- C
   *
   * calling this function with rdd C will only return the B <-- C dependency.
   *
   * This function is scheduler-visible for the purpose of unit testing.
   */
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
    val parents = new HashSet[ShuffleDependency[_, _, _]] // 存储直接父级的ShuffleDependency(即parent stage的来源)
    val visited = new HashSet[RDD[_]]					  // 存储已经访问过的RDD
    val waitingForVisit = new ListBuffer[RDD[_]]		  // 待访问队列:宽度优先遍历RDD的依赖树
    waitingForVisit += rdd								  // 初始化一个起始点
    while (waitingForVisit.nonEmpty) {
        val toVisit = waitingForVisit.remove(0)
        if (!visited(toVisit)) {
            visited += toVisit
            toVisit.dependencies.foreach {
                case shuffleDep: ShuffleDependency[_, _, _] =>
                parents += shuffleDep
                case dependency =>
                // 不是shuffleDependency就说明当前rdd和搜索到的这个rdd属于同一个stage
                waitingForVisit.prepend(dependency.rdd)
            }
        }
    }
    parents
}


// 2. getOrCreateShuffleMapStage
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
    shuffleIdToMapStage.get(shuffleDep.shuffleId) match {  // 根据shuffleId去寻找stage,看stage是否存在
        case Some(stage) => // stage存在(被创建了)则直接返回
        stage

        case None =>
        // 为所有缺失的 ancestor shuffle dependencies 创建 stage
        getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { 
            // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
            // that were not already in shuffleIdToMapStage, it's possible that by the time we
            // get to a particular dependency in the foreach loop, it's been added to
            // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
            // SPARK-13902 for more information.
            // 如果这个Stage已存在,那么将恢复这个Stage的结果,从而避免了重复计算。
            dep =>
            if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
                createShuffleMapStage(dep, firstJobId)
            }
        }
        // Finally, create a stage for the given shuffle dependency.
        createShuffleMapStage(shuffleDep, firstJobId)
    }
}
// 上面中有2个重要的方法:getMissingAncestorShuffleDependencies 和 createShuffleMapStage
// getMissingAncestorShuffleDependencies
/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getMissingAncestorShuffleDependencies(
    rdd: RDD[_]): ListBuffer[ShuffleDependency[_, _, _]] = {
    val ancestors = new ListBuffer[ShuffleDependency[_, _, _]]
    val visited = new HashSet[RDD[_]]
    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    val waitingForVisit = new ListBuffer[RDD[_]]
    waitingForVisit += rdd
    while (waitingForVisit.nonEmpty) {
        val toVisit = waitingForVisit.remove(0)
        if (!visited(toVisit)) {
            visited += toVisit
            getShuffleDependencies(toVisit).foreach { shuffleDep =>
                if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
                    ancestors.prepend(shuffleDep)
                    waitingForVisit.prepend(shuffleDep.rdd)
                } // Otherwise, the dependency and its ancestors have already been registered.
            }
        }
    }
    ancestors
}

/**
   * Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a
   * previously run stage generated the same shuffle data, this function will copy the output
   * locations that are still available from the previous shuffle to avoid unnecessarily
   * regenerating data.
   */
// createShuffleMapStage
def createShuffleMapStage[K, V, C](
    shuffleDep: ShuffleDependency[K, V, C], jobId: Int): ShuffleMapStage = {
    val rdd = shuffleDep.rdd
    checkBarrierStageWithDynamicAllocation(rdd)
    checkBarrierStageWithNumSlots(rdd)
    checkBarrierStageWithRDDChainPattern(rdd, rdd.getNumPartitions)
    val numTasks = rdd.partitions.length
    val parents = getOrCreateParentStages(rdd, jobId)
    val id = nextStageId.getAndIncrement()
    val stage = new ShuffleMapStage(
        id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

    stageIdToStage(id) = stage
    shuffleIdToMapStage(shuffleDep.shuffleId) = stage
    updateJobIdStageIdMaps(jobId, stage)

    if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
        // Kind of ugly: need to register RDDs with the cache and map output tracker here
        // since we can't do it in the RDD constructor because # of partitions is unknown
        logInfo(s"Registering RDD ${rdd.id} (${rdd.getCreationSite}) as input to " +
                s"shuffle ${shuffleDep.shuffleId}")
        mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
    }
    stage
}

Task的生成

小结

要理解什么是Stage,首先要搞明白什么是Task。Task是在集群上运行的基本单位。一个Task负责处理RDD的一个Partition。RDD的多个Partition会分别由不同的Task去处理。当然,这些Task的处理逻辑是完全一致的。这一组Task就组成了一个Stage。有两种Task:

1)org.apache.spark.scheduler.ShuffleMapTask

2)org.apache.spark.scheduler.ResultTask

ShuffleMapTask根据Task的partitioner将计算结果放到不同的bucket中,而ResultTask将计算结果发送回Driver Application。一个Job包含了多个Stage,而Stage是由一组完全相同的Task组成的,最后的Stage包含了一组ResultTask。在用户触发了一个动作(比如count、collect)后,SparkContext会通过runJob函数开始进行任务提交,最后通过DAG的事件处理器传递到DAGScheduler本身的handleJobSubmitted:它首先划分Stage,提交Stage,然后提交Task,至此,Task就开始在集群上运行了。一个Stage的开始就是从外部存储或者shuffle结果中读取数据;一个Stage的结束则是因为发生了shuffle或者生成了最终结果。在DAGScheduler将用户提交的应用划分为不同的Stage后,TaskScheduler模块负责为Stage中的Task分配计算资源,而这个计算资源的分配实际上是Cluster Manager的职责。接下来的章节将以Standalone模式为例,分析Spark是如何为应用分配计算资源的。

Deploy模块

Deploy模块主要包含3个子模块:Master、Worker、Client,它们之间的通信通过AKKA完成。Master和Worker本身就是一个Actor,可以直接通过AKKA实现通信;Client本身虽然不是一个Actor,但它内部会有一个AppClient来完成与Master的通信。这三者的主要职责如下:

1)Master:接收Worker的注册并管理所有的Worker,接收Client提交的Application,FIFO调度等待的Application并向Worker提交。

2)Worker:向Master注册自己,根据Master发送的Application配置进程环境,并启动StandaloneExecutorBackend。

3)Client:向Master注册并监控Application。当用户创建SparkContext时会实例化SparkDeploySchedulerBackend,而实例化SparkDeploySchedulerBackend的同时会启动Client,通过向Client传递启动参数和Application的有关信息,Client向Master发送请求注册Application,并且在计算节点上启动StandaloneExecutorBackend。

Executor模块

Executor模块负责运行Task计算任务,并将计算结果回传到Driver。Spark支持多种资源调度框架,这些资源框架在为计算任务分配了资源后,最后都会使用Executor模块完成最终的计算。

Shuffle模块

Storage模块

Master和Worker两端都有各自的Storage模块,但其功能有所不同。

Storage模块采用的是Master/Slave的架构。Master负责整个Application的Block元数据信息的管理和维护;而Slave需要将Block的更新等状态上报给Master,同时接收Master的命令,比如删除一个RDD、Shuffle相关的数据或者是广播变量。Master与Slave之间通过AKKA消息传递机制进行通信。

image-20220617151342544

image-20220617151409431

小结

Storage模块负责管理Spark计算过程中产生的数据,包括基于Disk和基于Memory的数据。用户在实际编程中面对的是RDD,可以通过调用org.apache.spark.rdd.RDD#cache将RDD的数据持久化,持久化的动作都是由Storage模块完成的;包括Shuffle过程中的数据,也都由Storage模块管理。可以说,RDD实现了用户的逻辑,而Storage管理了用户的数据。
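下面是一个简单的持久化示例(仅为示意,sc为已创建好的SparkContext,路径为假设):cache/persist产生的数据块(Block)由各节点的Storage模块管理,其元数据则汇报给Driver端的Master。

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://namenode:8020/logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)             // cache() 等价于 persist(StorageLevel.MEMORY_ONLY)
println(errors.count())                                  // 第一次action:真正计算,并由Storage模块缓存Block
println(errors.filter(_.contains("timeout")).count())    // 第二次action:直接读取已缓存的Block
errors.unpersist()                                       // 通知各节点删除对应的Block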

Spark On Yarn

image-20220619172624506

Spark On Yarn 基本概念

一般来说,“yarn-cluster”模式适用于生产作业,而“yarn-client”模式适用于交互式和调试场景,可以马上看到应用程序的输出。

在YARN中,每个Application实例都拥有一个ApplicationMaster进程,为该Application准备的第一个Container就是用来运行ApplicationMaster的。之后,AM向RM申请任务所需的资源,资源分配完毕后,再由对应的NodeManager启动相应的Container进程。

ApplicationMaster进程:负责向ResourceManager请求资源。它的存在使得不再需要一个始终存活的Client:Client在启动Application之后就可以关闭,接下来的协调工作由这个运行在集群上、受YARN管理的进程(即AM)继续完成。

Spark On Yarn架构

image-20220620114331086

Yarn-Cluster部署模式

image-20220622140835356

Client:A Program,一个可执行文件;

ApplicationMaster(在client模式中对应的是ExecutorLauncher):负责申请任务所需的资源。

Spark Driver:在Cluster模式下,它运行在ApplicationMaster所在的节点上,是一个JVM process,声明了RDD的转换和行动操作。Spark Driver的作用:执行来自Client的代码,初始化SparkContext,进行任务切分并分配任务。

SparkContext:创建SC是使用RDDs和连接YARN集群的前提。

YarnClusterScheduler:是Yarn-cluster部署模式下的任务调度器(TaskScheduler),作用是确保ApplicationMaster的初始化顺序正确,即SparkContext的初始化和停止这两个操作被正确执行。

**CoarseGrainedSchedulerBackend:**负责与Executor进行通信,在整个Application的生命周期内一直被持有,无论当前任务是否已经完成,或者是否正在请求一个新的Executor。

CoarseGrainedExecutorBackend:负责与Driver进行通信,汇报Executor的任务执行状态。Executor创建后,由CoarseGrainedExecutorBackend主动去连接Driver端SchedulerBackend(即CoarseGrainedSchedulerBackend)持有的reference,
此后,就由CoarseGrainedExecutorBackend和CoarseGrainedSchedulerBackend作为Executor和Driver之间的联络工具。

Note:

① SparkSubmit、ApplicationMaster和CoarseGrainedExecutorBackend是独立的进程;Driver是独立的线程(ApplicationMaster进程内);Executor和YarnClusterApplication是对象。

② ExecutorLauncher进程和ApplicationMaster进程虽然进程的名字不一样,但是在本质上并没有什么区别,可以查看ExecutorLauncher的源码的注释:

/**
  * This object does not provide any special functionality. It exists so that it's easy to tell
  * apart the client-mode AM from the cluster-mode AM when using tools such as ps or jps.
  */
object ExecutorLauncher {

    def main(args: Array[String]): Unit = {
        ApplicationMaster.main(args)
    }

}

目的:在Linux中使用ps或者jps指令的时候,能够区分出是Client运行模式下的AM或者是Cluster运行模式下的AM。

Yarn-Client部署模式

(图:https://cloud-image-1307408192.cos.ap-shanghai.myqcloud.com/img/202206221415275.png)

下面基于源码分析,对上面的执行流程做具体解释:

image-20220619173934083

Spark On Yarn任务提交流程源码分析

1、Entry Class
  • SparkSubmit

image-20220621113425209

最初提交作业的时候,执行下面的命令,把任务提交给Yarn集群。

 ./bin/spark-submit \
 --class org.apache.spark.examples.SparkPi \
 --master yarn \
 --deploy-mode cluster examples/jars/spark-examples*.jar 10

其中,./bin/spark-submit脚本的核心内容如下(vim ./bin/spark-submit):

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

可以看到,spark-submit脚本最终启动的入口类是org.apache.spark.deploy.SparkSubmit,而命令中的用户类org.apache.spark.examples.SparkPi则作为--class参数传入:

org.apache.spark.deploy.SparkSubmit

// 一、SparkSubmit类

/**
 * Main gateway of launching a Spark application. 【main 方法】
 *
 * This program handles setting up the classpath with relevant Spark dependencies and provides
 * a layer over the different cluster managers and deploy modes that Spark supports.
 */
private[spark] class SparkSubmit extends Logging {

  import DependencyUtils._
  import SparkSubmit._

  def doSubmit(args: Array[String]): Unit = {
    // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
    // be reset before the application starts.
    val uninitLog = initializeLogIfNecessary(true, silent = true)

    val appArgs = parseArguments(args) // 输入参数的解析
    if (appArgs.verbose) {
      logInfo(appArgs.toString)
    }
    appArgs.action match {
      case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog) // 从这里进入往下看
    }
  }
    
  /**
   * Submit the application using the provided parameters, ensuring to first wrap
   * in a doAs when --proxy-user is specified.
   */
  @tailrec
  private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
      //args: 包含之前spark-submit时--带入的参数,org.apache.spark.examples.SparkPi,yarn,等等
	// 这里仅仅是定义了一个方法,该方法的真正调用还在后面
    def doRunMain(): Unit = {
      // 源码中有多个if语句判断(如 --proxy-user 的处理)
      // 但最终都会执行runMain()
      runMain(args, uninitLog)
    }
    // ......(省略)随后调用doRunMain()
  }


  private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
    val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
  	// 只展示关键代码
	mainClass = Utils.classForName(childMainClass) // 对应流程中,Driver通过反射调用自定义类的main方法
  	
    // 这里判断sparkapplication的类型,如果是mainclass是yarn application or java application or *** ?
	val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
	  mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
	} else {
	  new JavaMainApplication(mainClass)
	}
		
	// 928行		
	app.start(childArgs.toArray, sparkConf)
	}
2、SparkApplication startup
  • JavaMainApplication, YarnClusterApplication

JavaMainApplication

如果mainClass就是--class参数指定的用户主类(即deployMode为client),该类一般没有实现SparkApplication接口,于是会被包装成一个JavaMainApplication;app.start()之后就来到这里:

private[spark] trait SparkApplication {

  def start(args: Array[String], conf: SparkConf): Unit

}

/**
 * Implementation of SparkApplication that wraps a standard Java class with a "main" method.
 *
 * Configuration is propagated to the application via system properties, so running multiple
 * of these in the same JVM may lead to undefined behavior due to configuration leaks.
 */
private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {

  override def start(args: Array[String], conf: SparkConf): Unit = {
    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }

    val sysProps = conf.getAll.toMap
    sysProps.foreach { case (k, v) =>
      sys.props(k) = v
    }

    mainMethod.invoke(null, args)
  }

}

YarnClusterApplication

如果deployMode为cluster,prepareSubmitEnvironment得到的childMainClass是org.apache.spark.deploy.yarn.YarnClusterApplication(它实现了SparkApplication接口);app.start()之后就来到这里:

它会新建一个Client,然后执行run(),把任务提交给ResourceManager。

private[spark] class YarnClusterApplication extends SparkApplication {

  override def start(args: Array[String], conf: SparkConf): Unit = {
    // SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
    // so remove them from sparkConf here for yarn mode.
    conf.remove(JARS)
    conf.remove(FILES)

    new Client(new ClientArguments(args), conf, null).run() // 重点代码
  }

}


/**
   * Submit an application to the ResourceManager.
   * If set spark.yarn.submit.waitAppCompletion to true, it will stay alive
   * reporting the application's status until the application has exited for any reason.
   * Otherwise, the client process will exit after submission.
   * If the application finishes with a failed, killed, or undefined status,
   * throw an appropriate SparkException.
   */
  def run(): Unit = {
    // 仅展示部分代码
    this.appId = submitApplication()
  }

至此,流程图如下:

image-20220621144717139

3、SparkContext initialization

image-20220630102448977

当前面两步完成之后,就会来到用户的逻辑代码部分,即从下面的代码开始:

val sparkContext = new SparkContext(conf)

进入逻辑代码部分

// 在逻辑代码部分中,先进入SparkContext类
// 441行开始
// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

	// 进入createSparkEnv方法,来到SparkEnv类
	val rpcEnv = RpcEnv.create(systemName, bindAddress, advertiseAddress, port.getOrElse(-1), conf,
      securityManager, numUsableCores, !isDriver)
	  
	def registerOrLookupEndpoint(
			name: String, endpointCreator: => RpcEndpoint):
		  RpcEndpointRef = {
		  if (isDriver) {
			logInfo("Registering " + name)
			rpcEnv.setupEndpoint(name, endpointCreator)
		  } else {
			RpcUtils.makeDriverRef(name, conf, rpcEnv)
		  }
		}

// 521行开始
// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// 527行开始  
// 创建并运行scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode) // 至关重要的一行

_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

_heartbeater.start()
_taskScheduler.start()

上述代码最重要的就是val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode),这个方法完成了TaskScheduler和SchedulerBackend对象的创建。该方法的主要代码如下:

private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1
 	
    // Ensure that executor's resources satisfies one or more tasks requirement.
    def checkResourcesPerTask(clusterMode: Boolean, executorCores: Option[Int]): Unit = {...}
    
    master match {
     case "local" =>
        checkResourcesPerTask(clusterMode = false, Some(1))
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

	  // ......中间省略了别的case情况
	  case masterUrl =>
        checkResourcesPerTask(clusterMode = true, None)
        val cm = getClusterManager(masterUrl) match {
          case Some(clusterMgr) => clusterMgr
          case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
        }
        try {
          val scheduler = cm.createTaskScheduler(sc, masterUrl)
          val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
          cm.initialize(scheduler, backend)
          (backend, scheduler)
        } catch {
          case se: SparkException => throw se
          case NonFatal(e) =>
            throw new SparkException("External scheduler cannot be instantiated", e)
        }
    }
}

createTaskScheduler会根据不同的master参数创建相应的scheduler和backend实例。

最下面的case分支:当master是诸如"yarn"这样的外部集群管理器URL时,会先执行getClusterManager方法,它通过java.util.ServiceLoader去加载ExternalClusterManager的实现类。

最终根据cm创建对应的scheduler和backend,并将二者返回。

The getClusterManager() method uses Java's ServiceLoader mechanism to load the implementation class of ExternalClusterManager. The service file lives under the META-INF/services directory inside the spark-yarn_xxx.jar package, and it specifies the implementation class as org.apache.spark.scheduler.cluster.YarnClusterManager (this class is in the resource-managers/yarn directory of the Spark source tree).
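getClusterManager的大致实现如下(示意,细节以实际版本的源码为准;依赖java.util.ServiceLoader和scala.collection.JavaConverters._):

private def getClusterManager(url: String): Option[ExternalClusterManager] = {
  val loader = Utils.getContextOrSparkClassLoader
  // 通过ServiceLoader加载所有ExternalClusterManager实现,再用canCreate过滤出能处理该master URL的实现
  val serviceLoaders =
    ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala.filter(_.canCreate(url))
  if (serviceLoaders.size > 1) {
    throw new SparkException(
      s"Multiple external cluster managers registered for the url $url: $serviceLoaders")
  }
  serviceLoaders.headOption
}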

Before looking further at the source code, you should first understand the relationship between TaskScheduler and SchedulerBackend. In the network communication among Spark's components (Driver, Executor and ApplicationMaster), large data transfers (data shuffle) go through the Netty-based block transfer service, while small-scale communication (commands or status messages between components) goes through the Netty-based RPC framework. TaskScheduler is responsible for task scheduling; when it wants to send RDD tasks to an Executor for execution, it does not communicate with the Executor directly, but hands them over to the SchedulerBackend for processing.

/**
 * Cluster Manager for creation of Yarn scheduler and backend
 */
private[spark] class YarnClusterManager extends ExternalClusterManager {

  override def canCreate(masterURL: String): Boolean = {
    masterURL == "yarn"
  }

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler = {
    sc.deployMode match {
      case "cluster" => new YarnClusterScheduler(sc)
      case "client" => new YarnScheduler(sc)
      case _ => throw new SparkException(s"Unknown deploy mode '${sc.deployMode}' for Yarn")
    }
  }

  override def createSchedulerBackend(sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend = {
    sc.deployMode match {
      case "cluster" =>
        new YarnClusterSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc)
      case "client" =>
        new YarnClientSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc)
      case  _ =>
        throw new SparkException(s"Unknown deploy mode '${sc.deployMode}' for Yarn")
    }
  }

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit = {
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
  }
}

createTaskScheduler 方法会根据不同的部署模式,创建不同的 TaskScheduler 实现类

下面是YarnClusterScheduler类的继承关系:

image-20220621164047471

YarnScheduler和YarnClusterScheduler这两个类并没有定义核心的调度方法,所有核心功能都实现在TaskSchedulerImpl中,因此重点关注TaskSchedulerImpl这个类。

createSchedulerBackend 方法会根据不同的部署模式,创建不同的 SchedulerBackend 实现类

image-20220621165021547

4、YarnClientSchedulerBackend and YarnClusterSchedulerBackend initialization

image-20220629163508008

这一部分的内容是从SparkContext初始化部分进入的

// SparkContext类中的代码createTaskScheduler方法往后
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)

_taskScheduler.start()

这里创建一个DAGScheduler 对象(用于生成任务的DAG图),然后调用_taskScheduler的start()方法;

TaskScheduler在Yarn模式下的子类是YarnScheduler和YarnClusterScheduler,它们都没有重写start()方法的具体细节,start()方法最终在TaskSchedulerImpl类中实现。

override def start(): Unit = {
    backend.start()

    if (!isLocal && conf.get(SPECULATION_ENABLED)) {
        logInfo("Starting speculative execution thread")
        speculationScheduler.scheduleWithFixedDelay(
            () => Utils.tryOrStopSparkContext(sc) { checkSpeculatableTasks() },
            SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
}
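作为补充,下面给出一个开启推测执行的配置示意(这些都是Spark公开的配置项,取值仅为示例):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")            // 对应上面代码中的 SPECULATION_ENABLED
  .set("spark.speculation.interval", "100ms")  // 检查慢任务的周期,对应 SPECULATION_INTERVAL_MS
  .set("spark.speculation.multiplier", "1.5")  // 任务耗时超过中位数的多少倍才被视为"慢任务"
  .set("spark.speculation.quantile", "0.75")   // 同一Stage中有多大比例的任务完成后才开始检测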

backend变量的类型是SchedulerBackend;从前面的图中可以得知SchedulerBackend具体的实现类有YarnClientSchedulerBackend和YarnClusterSchedulerBackend两种,通过backend.start()即可调用到具体实现类中的start()方法。

下面,关注这两个类中的start()方法;

① YarnClientSchedulerBackend 中的start()方法
private[spark] class YarnClientSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext)
  extends YarnSchedulerBackend(scheduler, sc)
  with Logging {

  private var client: Client = null
  private var monitorThread: MonitorThread = null

  /**
   * Create a Yarn client to submit an application to the ResourceManager.
   * This waits until the application is running.
   */
  override def start(): Unit = {
    super.start()

    val driverHost = conf.get(config.DRIVER_HOST_ADDRESS)
    val driverPort = conf.get(config.DRIVER_PORT)
    val hostport = driverHost + ":" + driverPort
    sc.ui.foreach { ui => conf.set(DRIVER_APP_UI_ADDRESS, ui.webUrl) }

    val argsArrayBuf = new ArrayBuffer[String]()
    argsArrayBuf += ("--arg", hostport)

    logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
    val args = new ClientArguments(argsArrayBuf.toArray)
    totalExpectedExecutors = SchedulerBackendUtils.getInitialTargetExecutorNumber(conf)
    client = new Client(args, conf, sc.env.rpcEnv)
    bindToYarn(client.submitApplication(), None)

    waitForApplication()

    monitorThread = asyncMonitorApplication()
    monitorThread.start()

    startBindings()
  }
yarnClient

上面start()中new出来的Client是Spark面向Yarn的客户端封装(org.apache.spark.deploy.yarn.Client),其内部持有一个yarnClient:

private[spark] class Client(val args: ClientArguments,
    val sparkConf: SparkConf) extends Logging {
    
  def run(): Unit = {
    this.appId = submitApplication()
  }

  def submitApplication(): ApplicationId = {
    var appId: ApplicationId = null
    try {
      launcherBackend.connect()
      // Setup the credentials before doing anything else,
      // so we have don't have issues at any point.
      setupCredentials()
      // Initialize YarnClient
      yarnClient.init(hadoopConf)
      // Start YarnClient
      yarnClient.start()

      logInfo("Requesting a new application from cluster with %d NodeManagers"
        .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))

      // Get a new application from our RM
      val newApp = yarnClient.createApplication()
      val newAppResponse = newApp.getNewApplicationResponse()
      appId = newAppResponse.getApplicationId()

      new CallerContext("CLIENT", sparkConf.get(APP_CALLER_CONTEXT),
        Option(appId.toString)).setCurrentContext()

      // Verify whether the cluster has enough resources for our AM
      verifyClusterResources(newAppResponse)

      // Set up the appropriate contexts to launch our AM
      val containerContext = createContainerLaunchContext(newAppResponse)
      val appContext = createApplicationSubmissionContext(newApp, containerContext)

      // Finally, submit and monitor the application
      logInfo(s"Submitting application $appId to ResourceManager")
      yarnClient.submitApplication(appContext)
      launcherBackend.setAppId(appId.toString)
      reportLauncherState(SparkAppHandle.State.SUBMITTED)

      appId
    } catch {
      case e: Throwable =>
        if (appId != null) {
          cleanupStagingDir(appId)
        }
        throw e
    }
  }
}

其中先调用了yarnClient的init和start方法,完成Yarn客户端的初始化和启动;

createApplication会创建一个YarnClientApplication,它包含两部分内容:

  • applicationId
  • ApplicationSubmissionContext object
createContainerLaunchContext()

主要部分如下:

private def createContainerLaunchContext(newAppResponse: GetNewApplicationResponse)
    : ContainerLaunchContext = {
    logInfo("Setting up container launch context for our AM")
    val appId = newAppResponse.getApplicationId
    val launchEnv = setupLaunchEnv(appStagingDirPath, pySparkArchives)
    val localResources = prepareLocalResources(appStagingDirPath, pySparkArchives)

    val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
    amContainer.setLocalResources(localResources.asJava)
    amContainer.setEnvironment(launchEnv.asJava)

    val userClass =
      if (isClusterMode) {
        Seq("--class", YarnSparkHadoopUtil.escapeForShell(args.userClass))
      } else {
        Nil
     }
   
    val amClass =
      if (isClusterMode) {
        Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
      } else {
        Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
      }
  
    val amArgs =
      Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ primaryRFile ++ userArgs ++
      Seq("--properties-file", buildPath(Environment.PWD.$$(), LOCALIZED_CONF_DIR, SPARK_CONF_FILE))

    // Command for the ApplicationMaster
    val commands = prefixEnv ++
      Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
      javaOpts ++ amArgs ++
      Seq(
        "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
        "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")

    // TODO: it would be nicer to just make sure there are no null commands here
    val printableCommands = commands.map(s => if (s == null) "null" else s).toList
    amContainer.setCommands(printableCommands.asJava)

    amContainer
}

createContainerLaunchContext方法用于准备启动ApplicationMaster container所需的context。

其中,

  1. userClass: 如果是Cluster部署模式,就是--class参数指定的用户类;如果是Client部署模式,这里为空,因为在Client模式中userClass已经在本地先启动了。
  2. amClass: 启动ApplicationMaster进程的main class。如果是Cluster部署模式,这里是ApplicationMaster;如果是Client部署模式,这里是ExecutorLauncher。
createApplicationSubmissionContext()

用于准备应用提交的上下文,记录了appName、队列(queue)、appType等参数。代码如下:

def createApplicationSubmissionContext(
      newApp: YarnClientApplication,
      containerContext: ContainerLaunchContext): ApplicationSubmissionContext = {
    val appContext = newApp.getApplicationSubmissionContext
    appContext.setApplicationName(sparkConf.get("spark.app.name", "Spark"))
    appContext.setQueue(sparkConf.get(QUEUE_NAME))
    appContext.setAMContainerSpec(containerContext)
    appContext.setApplicationType("SPARK")

    sparkConf.get(APPLICATION_TAGS).foreach { tags =>
      appContext.setApplicationTags(new java.util.HashSet[String](tags.asJava))
    }
    sparkConf.get(MAX_APP_ATTEMPTS) match {
      case Some(v) => appContext.setMaxAppAttempts(v)
      case None => logDebug(s"${MAX_APP_ATTEMPTS.key} is not set. " +
          "Cluster's default value will be used.")
    }

    sparkConf.get(AM_ATTEMPT_FAILURE_VALIDITY_INTERVAL_MS).foreach { interval =>
      appContext.setAttemptFailuresValidityInterval(interval)
    }

    val capability = Records.newRecord(classOf[Resource])
    capability.setMemory(amMemory + amMemoryOverhead)
    capability.setVirtualCores(amCores)

    sparkConf.get(AM_NODE_LABEL_EXPRESSION) match {
      case Some(expr) =>
        val amRequest = Records.newRecord(classOf[ResourceRequest])
        amRequest.setResourceName(ResourceRequest.ANY)
        amRequest.setPriority(Priority.newInstance(0))
        amRequest.setCapability(capability)
        amRequest.setNumContainers(1)
        amRequest.setNodeLabelExpression(expr)
        appContext.setAMContainerResourceRequest(amRequest)
      case None =>
        appContext.setResource(capability)
    }

    appContext
  }

在appContext被创建后,接下来执行下面的代码:

// Finally, submit and monitor the application
logInfo(s"Submitting application $appId to ResourceManager")
yarnClient.submitApplication(appContext)
launcherBackend.setAppId(appId.toString)

至此,Yarn任务的提交已经完成。

② YarnClusterSchedulerBackend中的start()方法
private[spark] class YarnClusterSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext)
  extends YarnSchedulerBackend(scheduler, sc) {

  override def start(): Unit = {
    val attemptId = ApplicationMaster.getAttemptId
    bindToYarn(attemptId.getApplicationId(), Some(attemptId))
    super.start()
    totalExpectedExecutors = SchedulerBackendUtils.getInitialTargetExecutorNumber(sc.conf)
    startBindings()
  }

  override def getDriverLogUrls: Option[Map[String, String]] = {
    YarnContainerInfoHelper.getLogUrls(sc.hadoopConfiguration, container = None)
  }

  override def getDriverAttributes: Option[Map[String, String]] = {
    YarnContainerInfoHelper.getAttributes(sc.hadoopConfiguration, container = None)
  }
}

其中没有提交应用的具体逻辑,因为在cluster部署模式下,任务提交在最开始的YarnClusterApplication中调用Client的run()方法时就已经完成了。

5、ApplicationMaster startup

image-20220630102430965

Main方法

ApplicationMaster和ExecutorLauncher这两个object的main方法主要内容如下:

object ApplicationMaster extends Logging {
  private var master: ApplicationMaster = _

  def main(args: Array[String]): Unit = {
    SignalUtils.registerLogger(log)
    val amArgs = new ApplicationMasterArguments(args)
    master = new ApplicationMaster(amArgs)
    System.exit(master.run())
  }
}

/**
 * This object does not provide any special functionality. It exists so that it's easy to tell
 * apart the client-mode AM from the cluster-mode AM when using tools such as ps or jps.
 */
object ExecutorLauncher {

  def main(args: Array[String]): Unit = {
    ApplicationMaster.main(args)
  }

}

因此从这里可以看出,ApplicationMaster和ExecutorLauncher其实没有区别,只是为了在使用ps或jps命令查看进程时,便于区分当前的部署模式。

master.run()
final def run(): Int = {
    try {
      val attemptID = if (isClusterMode) {
        // Set the web ui port to be ephemeral for yarn so we don't conflict with
        // other spark processes running on the same box
        System.setProperty(UI_PORT.key, "0")

        // Set the master and deploy mode property to match the requested mode.
        System.setProperty("spark.master", "yarn")
        System.setProperty(SUBMIT_DEPLOY_MODE.key, "cluster")

        // Set this internal configuration if it is running on cluster mode, this
        // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
        System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())

        Option(appAttemptId.getAttemptId.toString)
      } else {
        None
      }

 	// ......中间省略了一部分
        
        
	// 关注这一部分
      if (isClusterMode) {
        runDriver()
      } else {
        runExecutorLauncher()
      }
    } catch {
      case e: Exception =>
        // catch everything else if not specifically handled
        logError("Uncaught exception: ", e)
        finish(FinalApplicationStatus.FAILED,
          ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION,
          "Uncaught exception: " + StringUtils.stringifyException(e))
    } finally {
      try {
        metricsSystem.foreach { ms =>
          ms.report()
          ms.stop()
        }
      } catch {
        case e: Exception =>
          logWarning("Exception during stopping of the metric system: ", e)
      }
    }

    exitCode
  }

上面,if(isClusterMode),判断当前的部署模式。如果是cluster模式则调用runDriver();反之,调用runExecutorLauncher();

runDriver()
private def runDriver(): Unit = {
    addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
    // 关注这里
    userClassThread = startUserApplication() // 关注这里

    // This a bit hacky, but we need to wait until the spark.driver.port property has
    // been set by the Thread executing the user class.
    logInfo("Waiting for spark context initialization...")
    val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
    try {
      val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
        Duration(totalWaitTime, TimeUnit.MILLISECONDS))
      if (sc != null) {
        val rpcEnv = sc.env.rpcEnv

        val userConf = sc.getConf
        val host = userConf.get(DRIVER_HOST_ADDRESS)
        val port = userConf.get(DRIVER_PORT)
          
          // 关注这里
        registerAM(host, port, userConf, sc.ui.map(_.webUrl), appAttemptId) // 关注这里

        val driverRef = rpcEnv.setupEndpointRef(
          RpcAddress(host, port),
          YarnSchedulerBackend.ENDPOINT_NAME)
        createAllocator(driverRef, userConf, rpcEnv, appAttemptId, distCacheConf)
      } else {
        // Sanity check; should never happen in normal operation, since sc should only be null
        // if the user app did not create a SparkContext.
        throw new IllegalStateException("User did not initialize spark context!")
      }
      resumeDriver()
      userClassThread.join()
    } catch {
      case e: SparkException if e.getCause().isInstanceOf[TimeoutException] =>
        logError(
          s"SparkContext did not initialize after waiting for $totalWaitTime ms. " +
           "Please check earlier log output for errors. Failing the application.")
        finish(FinalApplicationStatus.FAILED,
          ApplicationMaster.EXIT_SC_NOT_INITED,
          "Timed out waiting for SparkContext.")
    } finally {
      resumeDriver()
    }
  }

runDriver最先调用startUserApplication方法,启动用户的程序,并返回一个userClassThread:

  private def startUserApplication(): Thread = {
    logInfo("Starting the user application in a separate Thread")

    var userArgs = args.userArgs
    if (args.primaryPyFile != null && args.primaryPyFile.endsWith(".py")) {
      // When running pyspark, the app is run using PythonRunner. The second argument is the list
      // of files to add to PYTHONPATH, which Client.scala already handles, so it's empty.
      userArgs = Seq(args.primaryPyFile, "") ++ userArgs
    }
    if (args.primaryRFile != null &&
        (args.primaryRFile.endsWith(".R") || args.primaryRFile.endsWith(".r"))) {
      // TODO(davies): add R dependencies here
    }

    val mainMethod = userClassLoader.loadClass(args.userClass)
      .getMethod("main", classOf[Array[String]])
	
    // 关注一下这里
    val userThread = new Thread {
      override def run(): Unit = {
        try {
          if (!Modifier.isStatic(mainMethod.getModifiers)) {
            logError(s"Could not find static main method in object ${args.userClass}")
            finish(FinalApplicationStatus.FAILED, ApplicationMaster.EXIT_EXCEPTION_USER_CLASS)
          } else {
            mainMethod.invoke(null, userArgs.toArray) // 关注这里
            finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
            logDebug("Done running user class")
          }
        } catch {
          case e: InvocationTargetException =>
            e.getCause match {
              case _: InterruptedException =>
                // Reporter thread can interrupt to stop user class
              case SparkUserAppException(exitCode) =>
                val msg = s"User application exited with status $exitCode"
                logError(msg)
                finish(FinalApplicationStatus.FAILED, exitCode, msg)
              case cause: Throwable =>
                logError("User class threw exception: " + cause, cause)
                finish(FinalApplicationStatus.FAILED,
                  ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
                  "User class threw exception: " + StringUtils.stringifyException(cause))
            }
            sparkContextPromise.tryFailure(e.getCause())
        } finally {
          // Notify the thread waiting for the SparkContext, in case the application did not
          // instantiate one. This will do nothing when the user code instantiates a SparkContext
          // (with the correct master), or when the user code throws an exception (due to the
          // tryFailure above).
          sparkContextPromise.trySuccess(null)
        }
      }
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread
  }

该方法会创建一个thread,在其中通过反射的方式调用用户程序的main方法。

在runDriver中执行完startUserApplication之后,会接着执行registerAM方法,向Yarn集群注册,告知ApplicationMaster进程已经启动。

runExecutorLauncher()
 private def runExecutorLauncher(): Unit = {
    val hostname = Utils.localHostName
    val amCores = sparkConf.get(AM_CORES)
     // 关注这里
    val rpcEnv = RpcEnv.create("sparkYarnAM", hostname, hostname, -1, sparkConf, securityMgr,
      amCores, true)

    // The client-mode AM doesn't listen for incoming connections, so report an invalid port.
     // 关注这里
    registerAM(hostname, -1, sparkConf, sparkConf.get(DRIVER_APP_UI_ADDRESS), appAttemptId)

    // The driver should be up and listening, so unlike cluster mode, just try to connect to it
    // with no waiting or retrying.
    val (driverHost, driverPort) = Utils.parseHostPort(args.userArgs(0))
    val driverRef = rpcEnv.setupEndpointRef(
      RpcAddress(driverHost, driverPort),
      YarnSchedulerBackend.ENDPOINT_NAME)
    addAmIpFilter(Some(driverRef),
      System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
     
     //关注这里
    createAllocator(driverRef, sparkConf, rpcEnv, appAttemptId, distCacheConf)

    // In client mode the actor will stop the reporter thread.
    reporterThread.join()
  }

这里先调用RpcEnv.create创建AM端的RPC环境,然后调用registerAM方法向Yarn集群注册ApplicationMaster,随后通过setupEndpointRef拿到Driver的引用并与Driver进程建立通信,最后调用createAllocator申请资源。

private def registerAM(
      host: String,
      port: Int,
      _sparkConf: SparkConf,
      uiAddress: Option[String],
      appAttempt: ApplicationAttemptId): Unit = {
    val appId = appAttempt.getApplicationId().toString()
    val attemptId = appAttempt.getAttemptId().toString()
    val historyAddress = ApplicationMaster
      .getHistoryServerAddress(_sparkConf, yarnConf, appId, attemptId)

    client.register(host, port, yarnConf, _sparkConf, uiAddress, historyAddress)
    registered = true
  }

其中client.register里的client是YarnRMClient类的一个实例:

private[spark] class YarnRMClient extends Logging {
    /**
   * Registers the application master with the RM.
   */
    
    def register(
      driverUrl: String,
      driverRef: RpcEndpointRef,
      conf: YarnConfiguration,
      sparkConf: SparkConf,
      uiAddress: Option[String],
      uiHistoryAddress: String,
      securityMgr: SecurityManager,
      localResources: Map[String, LocalResource]
    ): YarnAllocator = {
    amClient = AMRMClient.createAMRMClient()
    amClient.init(conf)
    amClient.start()
    this.uiHistoryAddress = uiHistoryAddress

    val trackingUrl = uiAddress.getOrElse {
      if (sparkConf.get(ALLOW_HISTORY_SERVER_TRACKING_URL)) uiHistoryAddress else ""
    }

    logInfo("Registering the ApplicationMaster")
    synchronized {
      amClient.registerApplicationMaster(Utils.localHostName(), 0, trackingUrl)
      registered = true
    }
        
  }
}
runDriver()/runExecutorLauncher()内的createAllocator()

在runDriver()/runExecutorLauncher()中,随后都会执行createAllocator方法,创建一个YarnAllocator并完成资源的申请与分配:

private def createAllocator(
      driverRef: RpcEndpointRef,
      _sparkConf: SparkConf,
      rpcEnv: RpcEnv,
      appAttemptId: ApplicationAttemptId,
      distCacheConf: SparkConf): Unit = {
    // In client mode, the AM may be restarting after delegation tokens have reached their TTL. So
    // always contact the driver to get the current set of valid tokens, so that local resources can
    // be initialized below.
    if (!isClusterMode) {
      val tokens = driverRef.askSync[Array[Byte]](RetrieveDelegationTokens)
      if (tokens != null) {
        SparkHadoopUtil.get.addDelegationTokens(tokens, _sparkConf)
      }
    }

    val appId = appAttemptId.getApplicationId().toString()
    val driverUrl = RpcEndpointAddress(driverRef.address.host, driverRef.address.port,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    val localResources = prepareLocalResources(distCacheConf)

    // Before we initialize the allocator, let's log the information about how executors will
    // be run up front, to avoid printing this out for every single executor being launched.
    // Use placeholders for information that changes such as executor IDs.
    logInfo {
      val executorMemory = _sparkConf.get(EXECUTOR_MEMORY).toInt
      val executorCores = _sparkConf.get(EXECUTOR_CORES)
      val dummyRunner = new ExecutorRunnable(None, yarnConf, _sparkConf, driverUrl, "<executorId>",
        "<hostname>", executorMemory, executorCores, appId, securityMgr, localResources,
        ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
      dummyRunner.launchContextDebugInfo()
    }
	
    // 关注这里1
    allocator = client.createAllocator(
      yarnConf,
      _sparkConf,
      appAttemptId,
      driverUrl,
      driverRef,
      securityMgr,
      localResources)

    // Initialize the AM endpoint *after* the allocator has been initialized. This ensures
    // that when the driver sends an initial executor request (e.g. after an AM restart),
    // the allocator is ready to service requests.
    rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef))
	
    // 关注这里2
    allocator.allocateResources()
    val ms = MetricsSystem.createMetricsSystem(MetricsSystemInstances.APPLICATION_MASTER,
      sparkConf, securityMgr)
    val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId)
    ms.registerSource(new ApplicationMasterSource(prefix, allocator))
    // do not register static sources in this case as per SPARK-25277
    ms.start(false)
    metricsSystem = Some(ms)
    reporterThread = launchReporterThread()
  }
allocateResources()

createAllocator()方法内部会调用allocateResources()方法进行资源分配。该方法具体如下:

  def allocateResources(): Unit = synchronized {
    updateResourceRequests()

    val progressIndicator = 0.1f
    // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
    // requests.
      // 关注这里1
    val allocateResponse = amClient.allocate(progressIndicator)

    val allocatedContainers = allocateResponse.getAllocatedContainers()
    allocatorBlacklistTracker.setNumClusterNodes(allocateResponse.getNumClusterNodes)

    if (allocatedContainers.size > 0) {
      logDebug(("Allocated containers: %d. Current executor count: %d. " +
        "Launching executor count: %d. Cluster resources: %s.")
        .format(
          allocatedContainers.size,
          runningExecutors.size,
          numExecutorsStarting.get,
          allocateResponse.getAvailableResources))

      handleAllocatedContainers(allocatedContainers.asScala)
    }

    val completedContainers = allocateResponse.getCompletedContainersStatuses()
    if (completedContainers.size > 0) {
      logDebug("Completed %d containers".format(completedContainers.size))
      processCompletedContainers(completedContainers.asScala)
      logDebug("Finished processing %d completed containers. Current running executor count: %d."
        .format(completedContainers.size, runningExecutors.size))
    }
  }

先执行amClient.allocate(progressIndicator),向Yarn集群申请资源(其中有些资源在作业提交的时候就已经确定,如--executor-memory、--executor-cores),该方法返回一个allocateResponse。

allocateResponse包含两类container:一类是新分配的container,另一类是已经完成任务的container。

launchReporterThread()

Yarn可能无法一次性返回足够数量的container。为了解决这个问题,createAllocator()随后又调用了reporterThread = launchReporterThread(),在单独的线程中循环申请资源:

private def launchReporterThread(): Thread = {
    // The number of failures in a row until Reporter thread give up
    val reporterMaxFailures = sparkConf.get(MAX_REPORTER_THREAD_FAILURES)

    val t = new Thread {
      override def run() {
        var failureCount = 0
        while (!finished) {
          try {
            if (allocator.getNumExecutorsFailed >= maxNumExecutorFailures) {
              finish(FinalApplicationStatus.FAILED,
                ApplicationMaster.EXIT_MAX_EXECUTOR_FAILURES,
                s"Max number of executor failures ($maxNumExecutorFailures) reached")
            } else {
              logDebug("Sending progress")
              allocator.allocateResources()
            }
            failureCount = 0
          } catch {
            case i: InterruptedException => // do nothing
            case e: ApplicationAttemptNotFoundException =>
              failureCount += 1
              logError("Exception from Reporter thread.", e)
              finish(FinalApplicationStatus.FAILED, ApplicationMaster.EXIT_REPORTER_FAILURE,
                e.getMessage)
            case e: Throwable =>
              failureCount += 1
              if (!NonFatal(e) || failureCount >= reporterMaxFailures) {
                finish(FinalApplicationStatus.FAILED,
                  ApplicationMaster.EXIT_REPORTER_FAILURE, "Exception was thrown " +
                    s"$failureCount time(s) from Reporter thread.")
              } else {
                logWarning(s"Reporter thread fails $failureCount time(s) in a row.", e)
              }
          }
          try {
            val numPendingAllocate = allocator.getPendingAllocate.size
            var sleepStart = 0L
            var sleepInterval = 200L // ms
            allocatorLock.synchronized {
              sleepInterval =
                if (numPendingAllocate > 0 || allocator.getNumPendingLossReasonRequests > 0) {
                  val currentAllocationInterval =
                    math.min(heartbeatInterval, nextAllocationInterval)
                  nextAllocationInterval = currentAllocationInterval * 2 // avoid overflow
                  currentAllocationInterval
                } else {
                  nextAllocationInterval = initialAllocationInterval
                  heartbeatInterval
                }
              sleepStart = System.currentTimeMillis()
              allocatorLock.wait(sleepInterval)
            }
            val sleepDuration = System.currentTimeMillis() - sleepStart
            if (sleepDuration < sleepInterval) {
              // log when sleep is interrupted
              logDebug(s"Number of pending allocations is $numPendingAllocate. " +
                  s"Slept for $sleepDuration/$sleepInterval ms.")
              // if sleep was less than the minimum interval, sleep for the rest of it
              val toSleep = math.max(0, initialAllocationInterval - sleepDuration)
              if (toSleep > 0) {
                logDebug(s"Going back to sleep for $toSleep ms")
                // use Thread.sleep instead of allocatorLock.wait. there is no need to be woken up
                // by the methods that signal allocatorLock because this is just finishing the min
                // sleep interval, which should happen even if this is signalled again.
                Thread.sleep(toSleep)
              }
            } else {
              logDebug(s"Number of pending allocations is $numPendingAllocate. " +
                  s"Slept for $sleepDuration/$sleepInterval.")
            }
          } catch {
            case e: InterruptedException =>
          }
        }
      }
    }
    // setting to daemon status, though this is usually not a good idea.
    t.setDaemon(true)
    t.setName("Reporter")
    t.start()
    logInfo(s"Started progress reporter thread with (heartbeat : $heartbeatInterval, " +
            s"initial allocation : $initialAllocationInterval) intervals")
    t
  }

Yarn-Cluster和Yarn-Client对比

在YARN中,每个Application 实例都有一个ApplicationMaster进程,它是 Application 启动的第一个容器。它负责和ResourceManager 打交道并请求资源,获取资源之后告诉 NodeManager 为其启动 Container。从深层次的含义讲 YARN-Cluster和 YARN-Client 模式的区别其实就是 ApplicationMaster 进程的区别。

YARN-Cluster模式下,Driver运行在AM(Application Master)中,它负责向YARN申请资源,并监督作业的运行状况。当用户提交了作业之后,就可以关掉Client,作业会继续在YARN上运行,因而YARN-Cluster模式不适合运行交互类型的作业;

YARN-Client模式下,Application Master仅仅向YARN请求Executor,Client会和申请到的Container通信来调度它们工作,也就是说Client不能退出;

具体说明

(1)Driver的运行位置不同

在cluster模式下,① Driver运行在ApplicationMaster进程中,位于Hadoop集群中的某一个节点上(这意味着ApplicationMaster进程既负责运行Application,又负责向YARN请求资源);② 在完成Application的初始化/启动之后,Client就可以关闭了。

在client模式下,① Driver运行在Client进程中;② ApplicationMaster仅用于向YARN集群请求资源。

(2)是否可以查看历史日志

在client模式下,日志全部打印在本机的输出控制台中,而通过YARN的8088端口页面查看Executor执行情况时无法查看历史日志;在cluster模式下则可以通过WebUI来查看运行的历史日志信息。

(3)被YARN监管的资源

在Cluster模式下,Driver和Executor都会受到Yarn的监管;在Client模式下,只有Executor会受到Yarn的监管。

Spark On Yarn 小结

1、在Yarn中,Executor和Container的关系

**结论:**一个Container对应一个JVM进程(即,Executor);

解释:

① 当在集群上执行应用时,Job会被切分成多个Stage,每个Stage切分成多个Task,每个Task单独调度。可以把Executor的JVM进程看做task执行池,每个Executor有 spark.executor.cores / spark.task.cpus 个执行槽(execution slot),计算示意见下方代码。

② Task基本上就是Spark的一个工作单元,作为Executor的JVM进程中的一个线程执行,这也是Spark的Job启动时间快的原因:在JVM中启动一个线程比启动一个单独的JVM进程更快(在Hadoop中执行MapReduce应用会启动多个JVM进程)。
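下面用一小段Scala示意执行槽数量的计算(数值为假设):

// 每个Executor的并发执行槽数 = spark.executor.cores / spark.task.cpus
val executorCores = 5    // 假设 spark.executor.cores = 5
val taskCpus = 1         // 假设 spark.task.cpus = 1(默认值)
val slotsPerExecutor = executorCores / taskCpus
println(s"每个Executor可同时运行 $slotsPerExecutor 个task")   // 5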

2、YARN的container

解释:

YARN的container是一个资源单元,是对YARN中资源的抽象,封装了某个节点上按配额划分出来的一组资源(CPU+内存);

从实现上看,可看做一个可序列化/反序列化的Java类。
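结合前面createApplicationSubmissionContext中对Resource的用法,可以用下面的小片段来理解这种资源抽象(数值为假设):

import org.apache.hadoop.yarn.api.records.Resource
import org.apache.hadoop.yarn.util.Records

// YARN对资源的抽象就是"内存 + vcore"的组合,该对象可序列化后在RM/NM之间传递
val capability = Records.newRecord(classOf[Resource])
capability.setMemory(4096)        // 单位MB;较新版本的Hadoop中对应setMemorySize
capability.setVirtualCores(2)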

3、Worker结点和Executor的关系

① 一个worker默认为一个Application启动一个Executor

② 启动的Executor默认占用这个worker的全部资源

③ 如果要在一个worker上启动多个Executor(前提:在内存充足的情况下),需要设置--executor-cores参数

4、YARN中,如何计算任务需要的资源

image-20220627145446634

确定YARN配置文件的参数

yarn.nodemanager.resource.memory-mb:每台节点的containers总计可用最大内存;

yarn.nodemanager.resource.cpu-vcores:每台节点的containers总计可用最大核数;

确定Spark Application参数

--executor-cores / spark.executor.cores:确定每个Executor可用的core数,即可并发执行的task数。

--executor-memory / spark.executor.memory:确定每个Executor的堆内存(heap memory)大小。

--num-executors / spark.executor.instances:确定为每个Application分配的Executor数量。

分配资源计算过程

假设,现在我们拥有6台节点,每台节点配备16 cores and 64GB of memory.

Yarn Configuration

在每台节点上的配置

yarn.nodemanager.resource.memory-mb = 63 * 1024 = 64512 (megabytes)
yarn.nodemanager.resource.cpu-vcores = 15

每台节点预留1GB内存和1个core,用于运行OS和Hadoop daemons。

每个spark application可用的资源

Total Number of Nodes = 6

Total Number of Cores = 6 * 15 = 90

Total Memory = 6 * 63 = 378 GB

每个Executor请求所需的内存数量必须满足:

spark.executor.memory + spark.executor.memoryOverhead < yarn.nodemanager.resource.memory-mb

假设,每个Executor分配5个core

Total Number Executor = Total Number Of Cores / 5 => 90/5 = 18.

此时我们在每个node上有3个executor,且每个node上有63G内存,所以每个executor有63/3=21G内存

Overhead Memory = max(384 , 0.1 * 21) ~ 2 GB (roughly)

Heap Memory = 21 – 2 ~ 19 GB

(实际还需要再调整,以满足 heapMemory + overheadMemory 不超过单个container/executor可用内存的上限。)

另外,Spark on Yarn中还要预留一个executor的资源给ApplicationMaster,因此num-executors为 18 - 1 = 17。

最终确定如下:

--executor-cores / spark.executor.cores = 5

--executor-memory / spark.executor.memory = 19

--num-executors / spark.executor.instances = 17
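下面用一小段Scala把上述推算过程串起来(仅为示意,数值与上文的假设一致):

// 按上文假设(6个节点,每节点16核/64GB)推算资源分配
val nodes = 6
val coresPerNode = 15                  // 每节点预留1核给OS/Hadoop daemons
val memPerNodeGB = 63                  // 每节点预留1GB内存
val coresPerExecutor = 5

val executorsPerNode = coresPerNode / coresPerExecutor            // 3
val memPerExecutorGB = memPerNodeGB / executorsPerNode            // 21
val overheadGB = math.max(0.384, 0.1 * memPerExecutorGB)          // 约2
val heapGB = memPerExecutorGB - math.round(overheadGB).toInt      // 约19
val numExecutors = nodes * executorsPerNode - 1                   // 预留1个给ApplicationMaster,即17

println(s"--executor-cores=$coresPerExecutor --executor-memory=${heapGB}G --num-executors=$numExecutors")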
确定Executors数量

在SparkContext类中,

执行val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)

执行_taskScheduler.start()(具体的start要看YarnClientSchedulerBackend或者YarnClusterSchedulerBackend类中的start方法)

① YarnClientSchedulerBackend的start()方法

start()方法进入后,关注这行代码

totalExpectedExecutors = SchedulerBackendUtils.getInitialTargetExecutorNumber(conf)

进入到getInitialTargetExecutorNumber()方法中

private[spark] object SchedulerBackendUtils {
  val DEFAULT_NUMBER_EXECUTORS = 2

  /**
   * Getting the initial target number of executors depends on whether dynamic allocation is
   * enabled.
   * If not using dynamic allocation it gets the number of executors requested by the user.
   如果启用了动态分配策略,则会给一个初始所需的executor的数量;若未开启,则按照用户指定的数量进行分配。
   */
  def getInitialTargetExecutorNumber(
      conf: SparkConf,
      numExecutors: Int = DEFAULT_NUMBER_EXECUTORS): Int = {
      
    if (Utils.isDynamicAllocationEnabled(conf)) {
      val minNumExecutors = conf.get(DYN_ALLOCATION_MIN_EXECUTORS)
      val initialNumExecutors = Utils.getDynamicAllocationInitialExecutors(conf)
      val maxNumExecutors = conf.get(DYN_ALLOCATION_MAX_EXECUTORS)
      require(initialNumExecutors >= minNumExecutors && initialNumExecutors <= maxNumExecutors,
        s"initial executor number $initialNumExecutors must between min executor number " +
          s"$minNumExecutors and max executor number $maxNumExecutors")

      initialNumExecutors
    } else {
      conf.get(EXECUTOR_INSTANCES).getOrElse(numExecutors)
    }
  }
}

  /**
   * Return the initial number of executors for dynamic allocation.
   */
  def getDynamicAllocationInitialExecutors(conf: SparkConf): Int = {
    if (conf.get(DYN_ALLOCATION_INITIAL_EXECUTORS) < conf.get(DYN_ALLOCATION_MIN_EXECUTORS)) {
      logWarning(s"${DYN_ALLOCATION_INITIAL_EXECUTORS.key} less than " +
        s"${DYN_ALLOCATION_MIN_EXECUTORS.key} is invalid, ignoring its setting, " +
          "please update your configs.")
    }

    if (conf.get(EXECUTOR_INSTANCES).getOrElse(0) < conf.get(DYN_ALLOCATION_MIN_EXECUTORS)) {
      logWarning(s"${EXECUTOR_INSTANCES.key} less than " +
        s"${DYN_ALLOCATION_MIN_EXECUTORS.key} is invalid, ignoring its setting, " +
          "please update your configs.")
    }

    val initialExecutors = Seq(
      conf.get(DYN_ALLOCATION_MIN_EXECUTORS),
      conf.get(DYN_ALLOCATION_INITIAL_EXECUTORS),
      conf.get(EXECUTOR_INSTANCES).getOrElse(0)).max

    logInfo(s"Using initial executors = $initialExecutors, max of " +
      s"${DYN_ALLOCATION_INITIAL_EXECUTORS.key}, ${DYN_ALLOCATION_MIN_EXECUTORS.key} and " +
        s"${EXECUTOR_INSTANCES.key}")
    initialExecutors
  }

上面用到的conf来自于CoarseGrainedSchedulerBackend类。

确定Container的数量

最后申请的Container数量为Executor的数量加上Driver,即 num = executorNum + 1。