I. Downloading the Spark Source Code
II. The SparkContext Initialization Process
1. SparkConf
The SparkConf object is Spark's configuration object; it holds Spark configuration settings as key-value pairs. As soon as an object is instantiated via new SparkConf(), all spark.* JVM system properties are loaded into it by default.
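For example, here is a minimal, self-contained sketch showing that a plain new SparkConf() picks up spark.* JVM system properties while new SparkConf(false) does not; the spark.custom.flag property name is made up purely for illustration:

import org.apache.spark.SparkConf

object ConfLoadingDemo {
  def main(args: Array[String]): Unit = {
    // Any JVM system property starting with "spark." is picked up by default.
    sys.props("spark.app.name")    = "conf-demo"         // standard Spark property
    sys.props("spark.custom.flag") = "true"              // hypothetical user property

    val conf = new SparkConf()                           // loadDefaults = true
    println(conf.get("spark.app.name"))                  // conf-demo
    println(conf.getBoolean("spark.custom.flag", false)) // true

    // new SparkConf(false) skips system properties entirely.
    val empty = new SparkConf(false)
    println(empty.contains("spark.app.name"))            // false
  }
}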
Note:
When a SparkContext is instantiated, it requires a SparkConf object. Inside the SparkContext, a copy (clone) of that SparkConf is taken, and it is this copy that is used from then on. In other words, once the SparkConf has been handed to a SparkContext, modifying the SparkConf's settings afterwards has no effect.
// SparkContext, around line 230: returns a copy of the SparkConf
def getConf: SparkConf = conf.clone()

// SparkConf, around line 430: makes a copy of the SparkConf object
override def clone: SparkConf = {
  val cloned = new SparkConf(false)
  settings.entrySet().asScala.foreach { e =>
    cloned.set(e.getKey(), e.getValue(), true)
  }
  cloned
}
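A minimal local-mode sketch of the consequence (spark.demo.flag is an illustrative property name, not a real Spark setting): changes made to the SparkConf after the SparkContext has been constructed are invisible to the context, because it only ever reads its own clone.

import org.apache.spark.{SparkConf, SparkContext}

object ConfCloneDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("clone-demo")
    conf.set("spark.demo.flag", "before")          // hypothetical property

    val sc = new SparkContext(conf)                // SparkContext keeps a clone of conf
    conf.set("spark.demo.flag", "after")           // modifying the original has no effect

    println(sc.getConf.get("spark.demo.flag"))     // still prints "before"
    sc.stop()
  }
}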
2. SparkContext
The SparkContext initialization process:
1. Initialize the SparkConf object and load the Spark configuration.
2. Pass the SparkConf object into the SparkContext, where the individual configuration parameters are initialized.
3. Instantiate the TaskScheduler and the DAGScheduler via the createTaskScheduler method.
SparkContext, around line 500:
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
SparkContext, around line 2692:
// Create the SchedulerBackend and the TaskScheduler based on the master URL passed in by the user
private def createTaskScheduler(
sc: SparkContext,
master: String,
deployMode: String): (SchedulerBackend, TaskScheduler) = {
import SparkMasterRegex._
// When running locally, don't try to re-execute tasks on failure.
val MAX_LOCAL_TASK_FAILURES = 1
master match {
// local mode: setMaster("local")
case "local" =>
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
scheduler.initialize(backend)
(backend, scheduler)
// local mode: setMaster("local[*]") or setMaster("local[N]")
case LOCAL_N_REGEX(threads) =>
def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
// local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
val threadCount = if (threads == "*") localCpuCount else threads.toInt
if (threadCount <= 0) {
throw new SparkException(s"Asked to run locally with $threadCount threads")
}
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
scheduler.initialize(backend)
(backend, scheduler)
// Standalone mode: setMaster("spark://host:port")
case SPARK_REGEX(sparkUrl) =>
val scheduler = new TaskSchedulerImpl(sc)
val masterUrls = sparkUrl.split(",").map("spark://" + _)
val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
(backend, scheduler)
// local-cluster mode, e.g. setMaster("local-cluster[numSlaves, coresPerSlave, memoryPerSlave]"):
// a pseudo-distributed standalone cluster launched on the local machine, mainly used for testing.
// (YARN, Mesos and other external cluster managers are handled by the cases elided below.)
case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
// Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
val memoryPerSlaveInt = memoryPerSlave.toInt
if (sc.executorMemory > memoryPerSlaveInt) {
throw new SparkException(
"Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
memoryPerSlaveInt, sc.executorMemory))
}
...
}
3. TaskScheduler
TaskScheduler is a low-level task scheduling interface; currently its only implementation is TaskSchedulerImpl. The interface can be plugged into different scheduler backends (SchedulerBackend implementations).
Each TaskScheduler serves exactly one SparkContext and handles the tasks of that application. If a new Spark application comes in, the current TaskScheduler must be torn down and a new one created.
The TaskScheduler receives the TaskSet for each stage from the DAGScheduler and is responsible for submitting those tasks to the cluster and running them.
If tasks fail, it resubmits them, mitigates straggler tasks, and reports the execution results back to the DAGScheduler.
Stragglers: some of the tasks submitted to the cluster may lag far behind the others. One or two straggling tasks should not be allowed to hold up the entire job, so Spark can speculatively re-launch them elsewhere.
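Straggler mitigation is what Spark calls speculative execution, and it is switched on purely through configuration. A minimal sketch; the threshold values below are just illustrative examples, and spark.speculation.interval is the same setting that SPECULATION_INTERVAL_MS reads in the next section:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")            // re-launch slow (straggler) tasks elsewhere
  .set("spark.speculation.interval", "100ms")  // how often to check for stragglers
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median task duration
  .set("spark.speculation.quantile", "0.75")   // only check after 75% of tasks have finished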
3.1. TaskSchedulerImpl
A client must first call the initialize and start methods; only then can it submit task sets through the submitTasks method.
// line 81: interval at which to check for speculatable (straggler) tasks; default 100ms
val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")
// line 92: how long to wait before warning that a TaskSet is starved (has not launched any task); default 15s
val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")
// line 95: number of CPUs allocated to each task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
// line 136: the default scheduling mode is FIFO
private val schedulingModeConf = conf.get(SCHEDULER_MODE_PROPERTY, SchedulingMode.FIFO.toString)
// CoarseGrainedSchedulerBackend (coarse-grained scheduling):
//   Executors live for the whole lifetime of the application.
//   When a task finishes, its executor is not released, and a new task does not
//   spin up a new executor either -- the existing executors are reused.
// FineGrainedSchedulerBackend (fine-grained scheduling, Mesos only):
//   The resources of an executor are released as soon as a task finishes,
//   and are acquired again for each new task.
//
// Standalone and YARN only support coarse-grained scheduling; Mesos also supports fine-grained scheduling.
// FIFO: first-in, first-out scheduling --
//   TaskSets (jobs) are scheduled in the order in which they were submitted.
// FAIR: fair scheduling --
//   resources are shared between scheduling pools so that concurrently running jobs
//   each get a roughly equal share.
def initialize(backend: SchedulerBackend) {
this.backend = backend
schedulableBuilder = {
schedulingMode match {
case SchedulingMode.FIFO =>
new FIFOSchedulableBuilder(rootPool)
case SchedulingMode.FAIR =>
new FairSchedulableBuilder(rootPool, conf)
case _ =>
throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
s"$schedulingMode")
}
}
schedulableBuilder.buildPools()
}
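From the application side, the scheduling mode and pool chosen above are driven entirely by configuration. A minimal sketch, assuming a fairscheduler.xml that defines a pool named "heavy"; both the file path and the pool name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("fair-demo")
  .set("spark.scheduler.mode", "FAIR")                                   // default is FIFO
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // assumed path

val sc = new SparkContext(conf)
// Jobs submitted from this thread go into the "heavy" pool defined in the XML above.
sc.setLocalProperty("spark.scheduler.pool", "heavy")
sc.parallelize(1 to 1000, 8).map(_ * 2).count()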
override def start() {
backend.start()
}
StandaloneSchedulerBackend
override def start() {
// Calls CoarseGrainedSchedulerBackend.start(),
// which builds the driver's RPC communication endpoint
super.start()
...
// Build an ApplicationDescription; the parameters passed here describe the resources the application needs to run
val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
// Build the client around the application description (which carries the job's resource information);
// StandaloneAppClient is the object used to communicate with the cluster manager (the Master)
client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
client.start()
launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
// Wait until registration has completed; the registration itself is performed in StandaloneAppClient
waitForRegistration()
launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
// CoarseGrainedSchedulerBackend
override def start() {
// Build the driver-side RPC communication endpoint
driverEndpoint = createDriverEndpointRef(properties)
}
4. DriverEndpoint
An inner class of CoarseGrainedSchedulerBackend; it is the driver-side RPC communication endpoint.
// The endpoint's lifecycle starts here
override def onStart() {
// Periodically revive offers to allow delay scheduling to work
val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
reviveThread.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
Option(self).foreach(_.send(ReviveOffers)) // send a ReviveOffers message to itself
}
}, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}
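The revive loop above is nothing more than a fixed-rate task on a single-threaded scheduler. Here is a self-contained sketch of the same pattern using only the JDK scheduler; the names and the printed message are illustrative, not Spark's:

import java.util.concurrent.{Executors, TimeUnit}

object ReviveLoopSketch {
  case object ReviveOffers                      // illustrative message type

  def main(args: Array[String]): Unit = {
    val reviveThread     = Executors.newSingleThreadScheduledExecutor()
    val reviveIntervalMs = 1000L                // spark.scheduler.revive.interval defaults to 1s

    reviveThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // In Spark this would be self.send(ReviveOffers); here we just log it.
        println(s"sending $ReviveOffers to the driver endpoint")
      }
    }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)

    Thread.sleep(3500)                          // let a few ticks fire
    reviveThread.shutdownNow()
  }
}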
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
// Ignoring the update since we don't know about the executor.
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
case ReviveOffers =>
makeOffers()
}
// Build resource offers from the free resources of each alive executor
private def makeOffers() {
// Make sure no executor is killed while some task is launching on it
val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
// Filter out executors under killing
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map { case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq
scheduler.resourceOffers(workOffers)
}
if (!taskDescs.isEmpty) {
launchTasks(taskDescs)
}
}
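Conceptually, makeOffers turns the free cores of every alive executor into a list of offers and asks the TaskScheduler to place pending tasks into them. A simplified, self-contained sketch of that matching step; the types and the greedy placement policy here are illustrative and much simpler than TaskSchedulerImpl.resourceOffers:

object OfferSketch {
  // Illustrative stand-ins for Spark's WorkerOffer and TaskDescription.
  case class Offer(executorId: String, host: String, freeCores: Int)
  case class TaskDesc(taskId: Int, executorId: String)

  val cpusPerTask = 1

  // Greedily place pending tasks onto executors that still have free cores.
  def resourceOffers(offers: Seq[Offer], pendingTasks: Seq[Int]): Seq[TaskDesc] = {
    var remaining = pendingTasks
    val placed = Seq.newBuilder[TaskDesc]
    for (offer <- offers) {
      var free = offer.freeCores
      while (free >= cpusPerTask && remaining.nonEmpty) {
        placed += TaskDesc(remaining.head, offer.executorId)
        remaining = remaining.tail
        free -= cpusPerTask
      }
    }
    placed.result()
  }

  def main(args: Array[String]): Unit = {
    val offers = Seq(Offer("exec-1", "node1", 2), Offer("exec-2", "node2", 3))
    println(resourceOffers(offers, pendingTasks = 1 to 4)) // 2 tasks on exec-1, 2 on exec-2
  }
}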
5. StandaloneAppClient
override def onStart(): Unit = {
try {
// Send a registration message to the Master.
// The argument 1 means this is the first attempt; every time registration fails, the method
// recursively calls itself with the counter incremented. Once the failure count reaches 3,
// the application registration is abandoned.
registerWithMaster(1)
} catch {
case e: Exception =>
logWarning("Failed to connect to master", e)
markDisconnected()
stop()
}
}
private def registerWithMaster(nthRetry: Int) {
registerMasterFutures.set(tryRegisterAllMasters()) // register the application with all known Masters
// if registration has not succeeded by the time the timer fires, retry
registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
override def run(): Unit = {
if (registered.get) {
registerMasterFutures.get.foreach(_.cancel(true))
registerMasterThreadPool.shutdownNow()
} else if (nthRetry >= REGISTRATION_RETRIES) {
markDead("All masters are unresponsive! Giving up.")
} else {
registerMasterFutures.get.foreach(_.cancel(true))
registerWithMaster(nthRetry + 1)
}
}
}, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}
6. Master
Master, around line 258:
case RegisterApplication(description, driver) =>
// TODO Prevent repeated registrations from some driver
if (state == RecoveryState.STANDBY) {
// ignore, don't send response
} else {
logInfo("Registering app " + description.name)
// Wrap the application description and the driver reference into an ApplicationInfo
val app = createApplication(description, driver)
// Register the application inside the Master
registerApplication(app)
logInfo("Registered app " + description.name + " with ID " + app.id)
// Persist the application metadata with the persistence engine (used for Master recovery)
persistenceEngine.addApplication(app)
// Tell the driver side that registration has completed
driver.send(RegisteredApplication(app.id, self))
schedule()
}
On the StandaloneAppClient side:
override def receive: PartialFunction[Any, Unit] = {
case RegisteredApplication(appId_, masterRef) =>
// FIXME How to handle the following cases?
// 1. A master receives multiple registrations and sends back multiple
// RegisteredApplications due to an unstable network.
// 2. Receive multiple RegisteredApplication from different masters because the master is
// changing.
appId.set(appId_)
registered.set(true)
master = Some(masterRef)
listener.connected(appId.get)
III. Stage and Task Execution in Spark
1. SparkContext.runJob
// Every Spark action triggers the runJob method and produces a job
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
// From here the job leaves the SparkContext and is handed over to the DAGScheduler for execution
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}
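A minimal driver-side sketch (local mode) that exercises this path: the transformations are lazy, and collect() is the action that ends up calling SparkContext.runJob:

import org.apache.spark.{SparkConf, SparkContext}

object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("runJob-demo"))

    val rdd = sc.parallelize(1 to 10, numSlices = 2).map(_ * 2)

    // Transformations above are lazy; collect() is the action that triggers
    // SparkContext.runJob -> DAGScheduler.runJob for all partitions.
    val result = rdd.collect()
    println(result.mkString(", "))
    sc.stop()
  }
}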
2. DAGScheduler.runJob
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
val start = System.nanoTime
// Submit the Spark job via submitJob.
// The job produced by the action is wrapped in a JobSubmitted event and posted to the DAGScheduler's event loop
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
waiter.completionFuture.value.get match {
case scala.util.Success(_) =>
logInfo("Job %d finished: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
case scala.util.Failure(exception) =>
logInfo("Job %d failed: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
// SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
val callerStackTrace = Thread.currentThread().getStackTrace.tail
exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
throw exception
}
}
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
...
// eventProcessLoop is the event loop processor; a JobSubmitted event is added to it here.
// The event is submitted via post(); a dedicated thread picks it up and processes it shortly afterwards.
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
3. EventLoop
override def run(): Unit = {
try {
while (!stopped.get) {
val event = eventQueue.take() // take one event from the event queue (blocks until one is available)
try {
onReceive(event) // handle the event; this dispatches to DAGSchedulerEventProcessLoop
} catch {
case NonFatal(e) =>
try {
onError(e)
} catch {
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
} catch {
case ie: InterruptedException => // exit even if eventQueue is not empty
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
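The EventLoop is simply a daemon thread draining a blocking queue. A stripped-down, self-contained sketch of the same pattern; this is not Spark's EventLoop class, only its shape:

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

class SimpleEventLoop[E](name: String)(handler: E => Unit) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit =
      try {
        while (!stopped.get) {
          val event = eventQueue.take()          // blocks until an event is posted
          try handler(event)
          catch { case e: Exception => println(s"Unexpected error in $name: $e") }
        }
      } catch { case _: InterruptedException => () } // exit even if the queue is not empty
  }

  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event)
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
}

// Usage: the DAGScheduler posts JobSubmitted-style events and handles them on this thread.
object EventLoopDemo extends App {
  val loop = new SimpleEventLoop[String]("demo-loop")(e => println(s"handling $e"))
  loop.start()
  loop.post("JobSubmitted(jobId = 0)")
  Thread.sleep(200)
  loop.stop()
}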
DAGSchedulerEventProcessLoop.onReceive
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event) // hook method: dispatch the event to the matching handler
} finally {
timerContext.stop()
}
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
// handle the JobSubmitted event
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
//
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
}
// This method is the heart of the DAG scheduler:
// it is where stages are split and tasks are created
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
// Create the ResultStage: the last stage of a job is the ResultStage, i.e. the final stage
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
...
// Submit the stages that have been split out, starting from the final stage
submitStage(finalStage)
}
private def createResultStage(
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
// Recursively find (or create) the parent stages of this ResultStage
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
getShuffleDependencies(rdd).map { shuffleDep =>
getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}
// Split stages along wide (shuffle) dependencies, walking backwards from the final RDD
// and recursing until no further parent stage can be found
private def getOrCreateShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) =>
stage
case None =>
// Create stages for all missing ancestor shuffle dependencies.
getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
// Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
// that were not already in shuffleIdToMapStage, it's possible that by the time we
// get to a particular dependency in the foreach loop, it's been added to
// shuffleIdToMapStage by the stage creation process for an earlier dependency. See
// SPARK-13902 for more information.
if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
createShuffleMapStage(dep, firstJobId)
}
}
// Finally, create a stage for the given shuffle dependency.
createShuffleMapStage(shuffleDep, firstJobId)
}
}
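The practical effect of this recursion is easy to observe from the driver: every shuffle (wide) dependency starts a new stage. A minimal local-mode sketch where reduceByKey introduces exactly one ShuffleMapStage in front of the ResultStage:

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("stage-demo"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2)
    val pairs  = words.map(w => (w, 1))      // narrow dependency: stays in the same stage
    val counts = pairs.reduceByKey(_ + _)    // wide (shuffle) dependency: new stage boundary

    // toDebugString prints the lineage; the ShuffledRDD marks where the DAGScheduler
    // will cut the job into a ShuffleMapStage followed by a ResultStage.
    println(counts.toDebugString)
    counts.collect()                         // action: 2 stages, one task per partition in each
    sc.stop()
  }
}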
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
// find the parent stages that have not been computed yet
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
// all parents are available: submit the tasks of this stage
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent) // submit the missing parent stages first, recursively
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
// Builds the tasks for a stage and submits them
private def submitMissingTasks(stage: Stage, jobId: Int) {
// Create one task per partition to compute, depending on the stage type
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = stage.rdd.partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = stage.rdd.partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
// Wrap the tasks into a TaskSet and hand it to the TaskScheduler for submission
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
markStageAsFinished(stage, None)
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
submitWaitingChildStages(stage)
}
}
override def submitTasks(taskSet: TaskSet) {
...
backend.reviveOffers()
}
override def reviveOffers() {
driverEndpoint.send(ReviveOffers)
}
// CoarseGrainedSchedulerBackend, around line 300
// On the driver side, a LaunchTask message containing the serialized task is sent to the executor
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
4. CoarseGrainedExecutorBackend
Search for CoarseGrainedExecutorBackend, since it is the executor-side endpoint that communicates with the driver.
Find the LaunchTask case, around line 92:
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
// Decode the TaskDescription (the task and the resources it depends on)
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
executor.launchTask(this, taskDesc)
}
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
val tr = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
threadPool.execute(tr)
}
// Open the TaskRunner class and look at its run method, around line 334:
// val value = xx
// this value is the result of executing the task
// Around line 407:
// val serializedResult: ByteBuffer = {
// this is the result that will be sent back to the driver
// Around line 429:
// execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
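Tying the last two pieces together: launchTask wraps the decoded TaskDescription in a TaskRunner, registers it in runningTasks, and hands it to the executor's thread pool; the runner then reports its status back when it finishes. A self-contained sketch of that pattern; the classes below are illustrative stand-ins, not Spark's:

import java.util.concurrent.{ConcurrentHashMap, Executors}

object LaunchTaskSketch {
  // Illustrative stand-ins for Spark's TaskDescription / TaskRunner.
  final case class TaskDescription(taskId: Long, name: String)

  final class TaskRunner(desc: TaskDescription, onFinish: Long => Unit) extends Runnable {
    override def run(): Unit = {
      println(s"running task ${desc.taskId} (${desc.name})")
      // ... deserialize the task, run it, then report the result to the driver ...
      onFinish(desc.taskId)
    }
  }

  private val threadPool   = Executors.newCachedThreadPool()           // like the executor's task pool
  private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]() // like Executor.runningTasks

  def launchTask(desc: TaskDescription): Unit = {
    val tr = new TaskRunner(desc, taskId => runningTasks.remove(taskId))
    runningTasks.put(desc.taskId, tr)
    threadPool.execute(tr)
  }

  def main(args: Array[String]): Unit = {
    (1L to 3L).foreach(id => launchTask(TaskDescription(id, s"ResultTask-$id")))
    threadPool.shutdown()
  }
}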