When running a Spark job, we may wonder what the execution flow looks like: how the DAG is generated, where tasks are launched, how the driver and executors communicate, and so on. Below, we use a simple Spark word count job to get a rough look at what happens under the hood.
Creating the SparkSession
When developing a Spark job, the first thing we do is create a SparkSession object, the entry point of the application:
SparkSession spark = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();
Let's analyze what happens while this object is created.
SparkSession.builder() creates a Builder object, on which we set the required parameters such as the app name and the master; calling getOrCreate then returns a SparkSession object. Let's look at what getOrCreate mainly does.
First, a SparkConf object is created, and the options we just passed in (app name, master, and so on) are set on it:
val sparkConf = new SparkConf()
options.foreach { case (k, v) => sparkConf.set(k, v) }
Creating the SparkContext
Next, a SparkContext object is created:
SparkContext.getOrCreate(sparkConf)
setActiveContext(new SparkContext(config))
During the creation of the SparkContext, quite a bit of initialization happens. Here is a brief look at the important objects that get created and configured:
Object creation
- Configuration
First, the SparkConf we just built is passed in and cloned, and further configuration is added to it,
_conf = config.clone()
and some required settings are validated up front; for example, if spark.master or spark.app.name is missing, an exception is thrown immediately.
if (!_conf.contains("spark.master")) {
throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
throw new SparkException("An application name must be set in your configuration")
}
- Driver
Next, the driver host and port are set:
_conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
_conf.setIfMissing(DRIVER_PORT, 0)
- Files
Next, the user jar paths and file paths are collected:
_jars = Utils.getUserJars(_conf)
_files = _conf.getOption(FILES.key).map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
- Environment
The Spark execution environment is created (cache, map output tracker, etc.):
// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
- Resources
_executorMemory = _conf.getOption(EXECUTOR_MEMORY.key)
- Communication
_heartbeatReceiver = env.rpcEnv.setupEndpoint(HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
- Schedulers
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
Starting the schedulers, the heartbeater, and so on
- Creating and starting the heartbeater
The main purpose of this heartbeater is to collect memory metrics.
// create and start the heartbeater for collecting memory metrics
_heartbeater = new Heartbeater(
() => SparkContext.this.reportHeartBeat(_executorMetricsSource),
"driver-heartbeater",
conf.get(EXECUTOR_HEARTBEAT_INTERVAL))
_heartbeater.start()
- Starting the TaskScheduler
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
Other components are also initialized and started here, but we will skip them for now.
Constructing the SparkSession object
Once the SparkContext has been created and initialized, the sc object is returned:
val sparkContext = userSuppliedContext.getOrElse {
// set a random app name if not given.
if (!sparkConf.contains("spark.app.name")) {
sparkConf.setAppName(java.util.UUID.randomUUID().toString)
}
// create the SparkContext
SparkContext.getOrCreate(sparkConf)
// Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
}
Then the SparkSession object is created with the sc passed in, and the SparkSession is returned.
session = new SparkSession(sparkContext, None, None, extensions)
Triggering an action
What exactly happens after an action is triggered, and how the DAG gets generated, is what we are curious about. The word count job ends with the action collect, which returns the results to the driver, so let's dig into what happens inside collect.
List<Tuple2<String, Integer>> output = counts.collect();
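For context, here is a rough Scala equivalent of the word-count pipeline (the post actually uses the Java JavaWordCount program, so the input path and exact chain of transformations below are only illustrative; judging by the debug output further down, the author's real job contains a second shuffle):
import org.apache.spark.sql.SparkSession
// Illustrative sketch only: classic word count, ending in the collect action traced below.
val spark = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate()
val sc = spark.sparkContext
val counts = sc.textFile("input.txt")          // hypothetical input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                          // shuffle boundary: a new stage starts here
val output = counts.collect()                  // the action that submits the job
output.foreach { case (word, count) => println(s"$word: $count") }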
Stepping into the collect operator, we see the RDD's collect method being called:
def collect(): JList[T] =
rdd.collect().toSeq.asJava
Stepping into that collect, we see sc.runJob being called:
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
In org.apache.spark.SparkContext#runJob we can see several parameters being passed in:
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: Iterator[T] => U,
partitions: Seq[Int]): Array[U] = {
val cleanedFunc = clean(func)
runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}
parameter | functionality |
---|---|
rdd | the RDD to compute |
func | the function used to compute the RDD |
partitions | the partitions of the RDD |
Stepping in a few more layers, we see that dagScheduler.runJob is called.
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
Stepping in once more, the job is submitted:
val waiter: JobWaiter[U] = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
Inside submitJob, the job is actually submitted:
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
Utils.cloneProperties(properties)))
After eventProcessLoop posts the job-submission event, the event is received and processed:
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event)
} finally {
timerContext.stop()
}
}
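The eventProcessLoop is essentially a single-threaded event loop: post enqueues an event, and a dedicated thread dequeues events and feeds them to onReceive. A toy, self-contained sketch of that pattern (this is not Spark's actual EventLoop class; the names are illustrative):
import java.util.concurrent.LinkedBlockingQueue
// Toy event loop: post() enqueues, a daemon thread dequeues and calls the handler sequentially.
class ToyEventLoop[E](handle: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  private val worker = new Thread(new Runnable {
    override def run(): Unit = while (true) handle(queue.take())
  }, "toy-event-loop")
  worker.setDaemon(true)
  worker.start()
  def post(event: E): Unit = queue.put(event)
}
// Usage: events posted from any thread are handled one at a time on the loop's own thread.
val loop = new ToyEventLoop[String](e => println(s"handling $e"))
loop.post("JobSubmitted")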
Inside doOnReceive, pattern matching takes the job-submission branch:
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
log.info("提交作业,处理提交的作业")
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
In org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted the job is divided into stages: this method creates the ResultStage and, along the way, all ShuffleMapStages, i.e., every stage of the job gets created here.
var finalStage: ResultStage = null
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
Note: stages come in two flavours, ShuffleMapStage and ResultStage. The ResultStage is the stage that carries the action, i.e., the last stage of a job; every other stage in the job is a ShuffleMapStage. The name is not about shuffle write versus shuffle read: a ShuffleMapStage is a complete stage, while the shuffle itself splits into a write part and a read part (before and after the shuffle). It is called ShuffleMapStage because every stage boundary implies a shuffle, so carrying "shuffle" in the name is fair. A job consists of many stages, structured like shuffleMapStage1 -> shuffleMapStage2 -> shuffleMapStage3 -> resultStage.
Dividing stages
In org.apache.spark.scheduler.DAGScheduler#createResultStage, all parent ShuffleMapStages are created first, returning a list of stages.
val parents: List[Stage] = getOrCreateParentStages(rdd, jobId)
/**
* Get or create the list of parent stages for a given RDD. The new Stages will be created with
* the provided firstJobId.
*/
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
val shuffleDependencies: mutable.HashSet[ShuffleDependency[_, _, _]] = getShuffleDependencies(rdd)
val shuffleMapStages: List[ShuffleMapStage] = shuffleDependencies.map(getOrCreateShuffleMapStage(_, firstJobId)).toList
shuffleMapStages
}
org.apache.spark.scheduler.DAGScheduler#getOrCreateParentStages does two things:
- get all shuffle dependencies;
- create all ShuffleMapStages.
First, the logic for collecting the shuffle dependencies:
private[scheduler] def getShuffleDependencies(
rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
val parents = new HashSet[ShuffleDependency[_, _, _]]
val visited: mutable.HashSet[RDD[_]] = new HashSet[RDD[_]]
val waitingForVisit: ListBuffer[RDD[_]] = new ListBuffer[RDD[_]]
waitingForVisit += rdd
while (waitingForVisit.nonEmpty) {
val toVisit: RDD[_] = waitingForVisit.remove(0)
if (!visited(toVisit)) {
visited += toVisit
val dependencies: Seq[Dependency[_]] = toVisit.dependencies
dependencies.foreach {
case shuffleDep: ShuffleDependency[_, _, _] =>
parents += shuffleDep
case dependency =>
waitingForVisit.prepend(dependency.rdd)
}
}
}
parents
}
In short: take the dependencies of the RDD that was passed in and check each one. If a dependency is a shuffle dependency, it is added to the set; otherwise we take that dependency's RDD, fetch its dependencies, and check again, repeating until a shuffle dependency is reached. The set containing all the shuffle dependencies found this way is returned.
The second step is creating all the ShuffleMapStages. The DAG scheduler keeps a shuffleIdToMapStage map that records the mapping from shuffle dependency id to ShuffleMapStage. The main job of getOrCreateShuffleMapStage is to build a ShuffleMapStage from a shuffle dependency and register it in shuffleIdToMapStage.
/**
* Mapping from shuffle dependency ID to the ShuffleMapStage that will generate the data for
* that dependency. Only includes stages that are part of currently running job (when the job(s)
* that require the shuffle stage complete, the mapping will be removed, and the only record of
* the shuffle data will be in the MapOutputTracker).
*/
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]
private def getOrCreateShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) =>
stage
case None =>
// Create stages for all missing ancestor shuffle dependencies.
val ancestorShuffleDeps: ListBuffer[ShuffleDependency[_, _, _]] = getMissingAncestorShuffleDependencies(shuffleDep.rdd)
ancestorShuffleDeps.foreach { ancestorDep =>
if (!shuffleIdToMapStage.contains(ancestorDep.shuffleId)) createShuffleMapStage(ancestorDep, firstJobId)
}
createShuffleMapStage(shuffleDep, firstJobId)
}
}
The ShuffleMapStage itself is created in org.apache.spark.scheduler.DAGScheduler#createShuffleMapStage:
val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)
In the debug output, the stageIdToStage variable indeed contains two ShuffleMapStages and one ResultStage, which matches the code: of all the stages in a job, only the last one is a ResultStage, and every other stage is a ShuffleMapStage.
stageIdToStage = {HashMap@9362} "HashMap" size = 3
0 = {Tuple2@12636} "(2,ResultStage 2)"
1 = {Tuple2@12637} "(1,ShuffleMapStage信息(ShuffleMapStageId=1, rdd=MapPartitionsRDD[8] at mapToPair at JavaWordCount.java:54, numTasks=1, parents=List(ShuffleMapStage信息(ShuffleMapStageId=0, rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, numTasks=1, parents=List(), firstJobId=0, mapOutputTrackerMaster=org.apache.spark.MapOutputTrackerMaster@4ae59b3b, pendingPartitions=Set(), shuffleDep=org.apache.spark.ShuffleDependency@3bb1de8a, mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0))), firstJobId=0, mapOutputTrackerMaster=org.apache.spark.MapOutputTrackerMaster@4ae59b3b, pendingPartitions=Set(), shuffleDep=org.apache.spark.ShuffleDependency@2069dd1, mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0)))"
2 = {Tuple2@12638} "(0,ShuffleMapStage信息(ShuffleMapStageId=0, rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, numTasks=1, parents=List(), firstJobId=0, mapOutputTrackerMaster=org.apache.spark.MapOutputTrackerMaster@4ae59b3b, pendingPartitions=Set(), shuffleDep=org.apache.spark.ShuffleDependency@3bb1de8a, mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0)))"
Note: rdd.dependencies are the dependencies of the current RDD; an RDD may depend on several RDDs, so a list is returned. dependency.rdd is the single parent RDD that one dependency points to. Looking at the code, rdd.dependencies.map(_.rdd) therefore means: via the dependencies, collect all parent RDDs of this RDD. The dependency information thus encodes how the RDD was produced, all the way from the beginning: following the dependencies we can keep tracing backwards, rdd -> dependency -> parent rdd -> dependency -> grandparent rdd ... back to the very first RDD.
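A convenient way to see this lineage in practice is RDD.toDebugString, which prints an RDD together with all of its ancestors, indented at shuffle boundaries. A small sketch, assuming the word-count pipeline from earlier and a hypothetical input path:
// Print the lineage of the pair RDD; the output walks back through the parents to the source file.
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)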
After a stage is created, it is added to a stageId -> stage map,
private[scheduler] val stageIdToStage: mutable.HashMap[Int, Stage] = new HashMap[Int, Stage]
stageIdToStage(id) = stage
and its id is added to the jobId -> stageIds map.
private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]]
val stage: Stage = stages.head
stage.jobIds += jobId
jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += stage.id
Then we return to org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted. Once the ResultStage has been created, a job is created, added to the jobId -> ActiveJob map and to the activeJobs set, and finally the stage is submitted.
val job: ActiveJob = new ActiveJob(jobId, finalStage, callSite, listener, properties)
private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob]
private[scheduler] val activeJobs = new HashSet[ActiveJob]
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
submitStage(finalStage)
Submitting stages
In org.apache.spark.scheduler.DAGScheduler#submitStage, the earliest missing ShuffleMapStage is found recursively and then handed to submitMissingTasks.
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage): Unit = {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
log.debug(s"submitStage($stage (name=${stage.name};" +
s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
log.debug("missing: " + missing)
if (missing.isEmpty) {
log.info("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
The debug output shows that the stage passed to submitMissingTasks is the first ShuffleMapStage.
ShuffleMapStage信息(ShuffleMapStageId=0, rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, numTasks=1, parents=List(), firstJobId=0, mapOutputTrackerMaster=MapOutputTrackerMaster(minSizeForBroadcast=524288, shuffleLocalityEnabled=true, SHUFFLE_PREF_MAP_THRESHOLD=1000, SHUFFLE_PREF_REDUCE_THRESHOLD=1000, REDUCER_PREF_LOCS_FRACTION=0.2, shuffleStatuses=Map(0 -> org.apache.spark.ShuffleStatus@c8d15f4, 1 -> org.apache.spark.ShuffleStatus@4323919e), maxRpcMessageSize=134217728, mapOutputRequests=[], isLocal=true, getNumCachedSerializedBroadcast=0, getEpoch=0), pendingPartitions=Set(), shuffleDep=ShuffleDependency信息(keyClassName=java.lang.Object, valueClassName=java.lang.Object, combinerClassName=Some(java.lang.Object), shuffleId=1, _rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, partitioner=org.apache.spark.HashPartitioner@1, serializer=org.apache.spark.serializer.JavaSerializer@76d5a149, keyOrdering=None, aggregator=Some(Aggregator(org.apache.spark.rdd.PairRDDFunctions$$Lambda$1521/1852790850@73e25780,org.apache.spark.api.java.JavaPairRDD$$$Lambda$1519/493495005@1d535b78,org.apache.spark.api.java.JavaPairRDD$$$Lambda$1519/493495005@1d535b78)), mapSideCombine=true, shuffleWriterProcessor=org.apache.spark.shuffle.ShuffleWriteProcessor@610ef36f), mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0))
submitMissingTasks performs the actual stage submission; its main steps are:
- Get the partitions that need computing, and the job properties
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
// Use the scheduling pool, job group, description, etc. from an ActiveJob associated
// with this Stage
val properties: Properties = jobIdToActiveJob(jobId).properties
- Add the stage to the set of running stages
// Stages we are running right now
private[scheduler] val runningStages = new HashSet[Stage]
runningStages += stage
- Get the preferred locations of each partition, so the tasks can later be launched on the nodes where the data lives
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
}
- Serialize the task binary for the stage and broadcast it (a short user-level broadcast sketch follows the snippet)
var taskBinary: Broadcast[Array[Byte]] = null
var partitions: Array[Partition] = null
try {
// For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
// For ResultTask, serialize and broadcast (rdd, func).
var taskBinaryBytes: Array[Byte] = null
// taskBinaryBytes and partitions are both effected by the checkpoint status. We need
// this synchronization in case another concurrent job is checkpointing this RDD, so we get a
// consistent view of both variables.
RDDCheckpointData.synchronized {
taskBinaryBytes = stage match {
case stage: ShuffleMapStage =>
JavaUtils.bufferToArray(
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
partitions = stage.rdd.partitions
}
if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KIB * 1024) {
log.warn(s"Broadcasting large task binary with size " +
s"${Utils.bytesToString(taskBinaryBytes.length)}")
}
log.info("广播task")
taskBinary = sc.broadcast(taskBinaryBytes)
}
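The task binary is broadcast for the same reason user code broadcasts lookup data: the bytes are shipped to each executor once instead of being re-sent with every task. A minimal user-level broadcast sketch (the data here is made up):
// Ship a read-only lookup table to each executor once, then reference it from tasks via .value.
val lookup = sc.broadcast(Map("spark" -> 1, "hadoop" -> 2))
val scored = sc.parallelize(Seq("spark", "flink", "hadoop"))
  .map(word => (word, lookup.value.getOrElse(word, 0)))
println(scored.collect().mkString(", "))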
- Create the tasks according to the stage type
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
stage.rdd.isBarrier())
}
}
}
- Submit the tasks (i.e., the stage)
if (tasks.nonEmpty) {
log.info(s"提交 ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
log.info("taskScheduler提交task,传的是一批task,即一个taskSet")
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
}
In org.apache.spark.scheduler.TaskSchedulerImpl#submitTasks, a TaskSetManager is created. Its job is to schedule all the tasks within one TaskSet, retry tasks when they fail, and handle locality-aware scheduling for the TaskSet via delay scheduling.
/**
* Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of
* each task, retries tasks if they fail (up to a limited number of times), and
* handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces
* to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node,
* and handleSuccessfulTask/handleFailedTask, which tells it that one of its tasks changed state
* (e.g. finished/failed).
*
* THREADING: This class is designed to only be called from code with a lock on the
* TaskScheduler (e.g. its event handlers). It should not be called from other threads.
*
* @param sched the TaskSchedulerImpl associated with the TaskSetManager
* @param taskSet the TaskSet to manage scheduling for
* @param maxTaskFailures if any particular task fails this number of times, the entire
* task set will be aborted
*/
val manager: TaskSetManager = createTaskSetManager(taskSet, maxTaskFailures)
The scheduler then registers the TaskSetManager with its SchedulableBuilder; there are two scheduling modes, FIFO and FAIR (a small configuration sketch follows the next snippet).
private var schedulableBuilder: SchedulableBuilder = null
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
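Both the scheduling mode and the delay-scheduling wait used by the TaskSetManager are ordinary configuration; a small sketch (the values are examples only):
import org.apache.spark.SparkConf
// Switch from the default FIFO to FAIR, and tune how long delay scheduling waits for a
// better locality level before falling back to a worse one.
val conf = new SparkConf()
  .setAppName("JavaWordCount")
  .setMaster("local")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.locality.wait", "3s")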
In FIFO mode, the scheduling pool simply adds the TaskSetManager:
override def addTaskSetManager(manager: Schedulable, properties: Properties): Unit = {
rootPool.addSchedulable(manager)
}
And rootPool is just a pool that manages TaskSetManagers:
/**
* A Schedulable entity that represents collection of Pools or TaskSetManagers
*/
private[spark] class Pool(
val poolName: String,
val schedulingMode: SchedulingMode,
initMinShare: Int,
initWeight: Int)
extends Schedulable with Logging {
Once the pool has the TaskSetManager, the last step is to revive offers so that tasks can be launched.
backend.reviveOffers()
In org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend#reviveOffers, a ReviveOffers message is sent to the driver endpoint.
override def reviveOffers(): Unit = {
log.info("发送消息")
driverEndpoint.send(ReviveOffers)
}
driverEndpoint is the RPC endpoint that the driver registers in CoarseGrainedSchedulerBackend (executors talk to it through a reference):
val driverEndpoint: RpcEndpointRef = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint())
Receiving the message
Besides send, driverEndpoint also has a receive method, which pattern-matches on the incoming message:
case ReviveOffers =>
makeOffers()
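Putting the two sides together: an endpoint is basically a message loop whose send posts a message and whose receive pattern-matches on it. A toy sketch of that pattern (this is not Spark's RpcEnv API; the message types are stand-ins):
// Toy stand-ins for the messages exchanged between the scheduler backend and executors.
sealed trait ToyMessage
case object ToyReviveOffers extends ToyMessage
final case class ToyLaunchTask(serializedTask: String) extends ToyMessage
class ToyEndpoint {
  // receive pattern-matches on the incoming message, just like the snippets above
  def receive: PartialFunction[ToyMessage, Unit] = {
    case ToyReviveOffers           => println("make offers and launch tasks")
    case ToyLaunchTask(serialized) => println(s"decode and run: $serialized")
  }
  // a real endpoint enqueues the message and handles it asynchronously on its own thread
  def send(msg: ToyMessage): Unit = receive(msg)
}
new ToyEndpoint().send(ToyReviveOffers)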
Stepping into makeOffers, we can see that it:
- first gets the available (active) executors;
- iterates over each executor to collect its available resources into a WorkerOffer;
- offers those resources to the task scheduler, which assigns tasks to them;
- launches the tasks.
The code is worth quoting in full:
// Make fake resource offers on all executors
private def makeOffers(): Unit = {
// Make sure no executor is killed while some task is launching on it
val taskDescs = withLock {
// Filter out executors under killing
log.info("过滤出还在活跃,且没有在被杀死过程的executor")
val activeExecutors = executorDataMap.filterKeys(isExecutorActive)
val workOffers: IndexedSeq[WorkerOffer] = activeExecutors.map {
case (id, executorData) =>
val workerOffer: WorkerOffer = new WorkerOffer(
id,
executorData.executorHost,
executorData.freeCores,
Some(executorData.executorAddress.hostPort),
executorData.resourcesInfo.map { case (rName, rInfo) => (rName, rInfo.availableAddrs.toBuffer) }
)
log.debug("获取资源信息{}",workerOffer)
workerOffer
}.toIndexedSeq
val taskDesc: Seq[Seq[TaskDescription]] = scheduler.resourceOffers(workOffers)
taskDesc
}
if (taskDescs.nonEmpty) {
log.info("启动tasks")
launchTasks(taskDescs)
}
}
At the third step, offering the resources,
val taskDesc: Seq[Seq[TaskDescription]] = scheduler.resourceOffers(workOffers)
the main logic of resourceOffers is:
- Record the host -> executor -> tasks mappings;
hostToExecutors(o.host) += o.executorId
executorIdToHost(o.executorId) = o.host
executorIdToRunningTaskIds(o.executorId) = HashSet[Long]()
- Record the rack for each host;
for ((host, Some(rack)) <- hosts.zip(getRacksForHosts(hosts))) {
hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += host
}
- Filter out executors on blacklisted nodes;
val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
offers.filter { offer =>
!blacklistTracker.isNodeBlacklisted(offer.host) &&
!blacklistTracker.isExecutorBlacklisted(offer.executorId)
}
}.getOrElse(offers)
- Shuffle the offers so the tasks do not all land on the same worker;
val shuffledOffers = shuffleOffers(filteredOffers)
- Build the bookkeeping structures for the available resources;
// Build a list of tasks to assign to each worker.
val tasks: IndexedSeq[ArrayBuffer[TaskDescription]] = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
val availableResources: Array[Map[String, mutable.Buffer[String]]] = shuffledOffers.map(_.resources).toArray
val availableCpus: Array[Int] = shuffledOffers.map(o => o.cores).toArray
- Fetch the TaskSetManagers from the pool, sorted by the scheduling algorithm, either FIFO or FAIR (a toy sketch of the FIFO ordering follows the snippet);
val sortedTaskSets: ArrayBuffer[TaskSetManager] = rootPool.getSortedTaskSetQueue.filterNot(_.isZombie)
override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
val sortedSchedulableQueue =
schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
for (schedulable <- sortedSchedulableQueue) {
sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
}
sortedTaskSetQueue
}
private val taskSetSchedulingAlgorithm: SchedulingAlgorithm = {
schedulingMode match {
case SchedulingMode.FAIR =>
new FairSchedulingAlgorithm()
case SchedulingMode.FIFO =>
new FIFOSchedulingAlgorithm()
case _ =>
val msg = s"Unsupported scheduling mode: $schedulingMode. Use FAIR or FIFO instead."
throw new IllegalArgumentException(msg)
}
}
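To make the FIFO ordering concrete, here is a self-contained toy sketch; the assumption is that it mirrors the idea behind Spark's FIFOSchedulingAlgorithm, which prefers earlier jobs and, within a job, earlier stages:
// Toy model of FIFO ordering between task sets: earlier job first, then earlier stage.
case class ToyTaskSet(jobId: Int, stageId: Int)
def fifoLessThan(a: ToyTaskSet, b: ToyTaskSet): Boolean =
  if (a.jobId != b.jobId) a.jobId < b.jobId else a.stageId < b.stageId
val queue = Seq(ToyTaskSet(1, 3), ToyTaskSet(0, 2), ToyTaskSet(0, 1))
val sorted = queue.sortWith(fifoLessThan)   // ToyTaskSet(0,1), ToyTaskSet(0,2), ToyTaskSet(1,3)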
- Iterate over all TaskSetManagers in scheduling order (skipping the barrier-related checks here) and, for each locality level, keep offering resources, recording whether any task got launched;
var launchedAnyTask: Boolean = false
// Record all the executor IDs assigned barrier tasks on.
val addressesWithDescs: ArrayBuffer[(String, TaskDescription)] = ArrayBuffer[(String, TaskDescription)]()
for (currentMaxLocality <- taskSet.myLocalityLevels) {
var launchedTaskAtCurrentMaxLocality: Boolean = false
do {
launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
currentMaxLocality, shuffledOffers, availableCpus,
availableResources, tasks, addressesWithDescs)
launchedAnyTask |= launchedTaskAtCurrentMaxLocality
} while (launchedTaskAtCurrentMaxLocality)
}
The main logic is in org.apache.spark.scheduler.TaskSchedulerImpl#resourceOfferSingleTaskSet. There are two nested loops: the outer one iterates over the offered resources (the executors that can run tasks), the inner one over the tasks of the TaskSet; each launched task, together with its resource information, is wrapped into a TaskDescription.
for (task <- taskSet.resourceOffer(execId, host, maxLocality, availableResources(i))) {
tasks(i) += task
val tid = task.taskId
taskIdToTaskSetManager.put(tid, taskSet)
taskIdToExecutorId(tid) = execId
executorIdToRunningTaskIds(execId).add(tid)
availableCpus(i) -= CPUS_PER_TASK
assert(availableCpus(i) >= 0)
task.resources.foreach { case (rName, rInfo) =>
// Remove the first n elements from availableResources addresses, these removed
// addresses are the same as that we allocated in taskSet.resourceOffer() since it's
// synchronized. We don't remove the exact addresses allocated because the current
// approach produces the identical result with less time complexity.
// availableResources: Array[Map[String, Buffer[String]]],
availableResources(i)(rName).remove(0, rInfo.addresses.size)
}
org.apache.spark.scheduler.TaskSetManager#resourceOffer is where the task and its resources are packaged up:
// resource information
val extraResources = sched.resourcesReqsPerTask.map { taskReq =>
val rName = taskReq.resourceName
val count = taskReq.amount
val rAddresses = availableResources.getOrElse(rName, Seq.empty)
assert(rAddresses.size >= count, s"Required $count $rName addresses, but only " +
s"${rAddresses.size} available.")
// We'll drop the allocated addresses later inside TaskSchedulerImpl.
val allocatedAddresses = rAddresses.take(count)
(rName, new ResourceInformation(rName, allocatedAddresses.toArray))
}.toMap
val task = tasks(index)
// serialize the task
val serializedTask: ByteBuffer = try {
ser.serialize(task)
}
// build and return the TaskDescription
new TaskDescription(
taskId,
attemptNum,
execId,
taskName,
index,
task.partitionId,
addedFiles,
addedJars,
task.localProperties,
extraResources,
serializedTask)
- Return the TaskDescriptions
Launching tasks
Back in org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint#makeOffers, the task descriptions are launched at the end:
if (taskDescs.nonEmpty) {
log.info("启动tasks")
launchTasks(taskDescs)
}
In org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint#launchTasks, all the TaskDescriptions are iterated over. Each one is serialized; then the executor the task will run on is looked up, its bookkeeping is updated (the CPU cores needed by the task are subtracted and the required resources acquired), and finally the task is sent to that executor's endpoint.
val serializedTask: ByteBuffer = TaskDescription.encode(task)
val executorData: ExecutorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
task.resources.foreach { case (rName, rInfo) =>
assert(executorData.resourcesInfo.contains(rName))
executorData.resourcesInfo(rName).acquire(rInfo.addresses)
}
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
The container launch command starts the main class org.apache.spark.executor.YarnCoarseGrainedExecutorBackend, which calls org.apache.spark.executor.CoarseGrainedExecutorBackend#run to create the executor endpoint. The endpoint's receive method pattern-matches its way to the task-launch branch:
val backend: CoarseGrainedExecutorBackend = backendCreateFn(env.rpcEnv, arguments, env, cfg.resourceProfile)
env.rpcEnv.setupEndpoint("Executor", backend)
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
log.info("executor接收到task先进行反序列化")
val taskDesc: TaskDescription = TaskDescription.decode(data.value)
log.info("Got assigned task " + taskDesc.taskId)
taskResources(taskDesc.taskId) = taskDesc.resources
log.info("终于启动task了,描述",taskDesc.toString)
executor.launchTask(this, taskDesc)
}
executor.launchTask is where the task gets launched: a TaskRunner is created, and the taskId and runner are put into a map,
// Maintains the list of running tasks.
private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]
val tr: TaskRunner = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
and finally the executor's thread pool on the worker node runs that runner, so the task is finally started.
private val threadPool = {
val threadFactory = new ThreadFactoryBuilder()
.setDaemon(true)
.setNameFormat("Executor task launch worker-%d")
.setThreadFactory((r: Runnable) => new UninterruptibleThread(r, "unused"))
.build()
Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
}
threadPool.execute(tr)
At this point the task has finally been launched on the executor.
Executing the task
The TaskRunner's run method first deserializes the task, then runs it:
task = ser.deserialize[Task[Any]](
taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
val res = task.run(
taskAttemptId = taskId,
attemptNumber = taskDescription.attemptNumber,
metricsSystem = env.metricsSystem,
resources = taskDescription.resources)
Task is an abstract class with two subclasses, ShuffleMapTask and ResultTask; task.run() ends up calling the corresponding subclass's runTask method.
val taskContext: TaskContextImpl = new TaskContextImpl(
stageId,
stageAttemptId, // stageAttemptId and stageAttemptNumber are semantically equal
partitionId,
taskAttemptId,
attemptNumber,
taskMemoryManager,
localProperties,
metricsSystem,
metrics,
resources)
runTask(context)
In ShuffleMapTask's runTask, the RDD and the shuffle dependency are deserialized first, and then the data is written out:
val (rdd,dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition)
In ResultTask's runTask, the RDD and the function to apply are deserialized in the same way, then the computation runs and the result is returned. rdd.iterator ends up in computeOrReadCheckpoint, where the actual computation happens.
val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
val result: U = func(context, rdd.iterator(partition, context))
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
getOrCompute(split, context)
} else {
computeOrReadCheckpoint(split, context)
}
}
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
computeOrReadCheckpointCount+=1
if (isCheckpointedAndMaterialized) {
firstParent[T].iterator(split, context)
} else {
val preRDD = dependencies.map( dep=> if( dep.rdd != null ) dep.rdd.id else -1).filter(_ != -1).mkString(",")
log.debug("rdd id为{},依赖rdd为{},第{}次读取并且计算,分区信息为{},{}",this.id.toString,preRDD,computeOrReadCheckpointCount.toString,split.toString,context.toString)
val iterator: Iterator[T] = compute(split, context)
val list = iterator.toList
log.debug("返回{}条计算结果:{}",list.size,list.take(20).zipWithIndex.map(r=>"第"+r._2+"条数据: "+r._1).mkString("; "))
list.iterator
}
}
Once the result has been computed, it is serialized and sent back to the driver:
val value = Utils.tryWithSafeFinally {
log.debug("运行{}",task)
val res = task.run(
taskAttemptId = taskId,
attemptNumber = taskDescription.attemptNumber,
metricsSystem = env.metricsSystem,
resources = taskDescription.resources)
threwException = false
res
}
val valueBytes = resultSer.serialize(value)
val directResult = new DirectTaskResult(valueBytes, accumUpdates, metricPeaks)
val serializedDirectResult = ser.serialize(directResult)
// (in the full source, oversized results are stored in the block manager and replaced by an
// IndirectTaskResult; that branch is elided here, so serializedResult stands for the
// serializedDirectResult in the common small-result case)
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
In org.apache.spark.executor.CoarseGrainedExecutorBackend#statusUpdate, the executor backend sends the result to the driver's endpoint. At this point one task has finished, and its result has made it back to the driver.
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
val resources = taskResources.getOrElse(taskId, Map.empty[String, ResourceInformation])
val msg = StatusUpdate(executorId, taskId, state, data, resources)
if (TaskState.isFinished(state)) {
taskResources.remove(taskId)
}
driver match {
case Some(driverRef) => driverRef.send(msg)
case None => log.warn(s"Drop $msg because has not yet connected to driver")
}
}