I. Downloading the Spark Source Code
II. The SparkContext Initialization Process
1. SparkConf
The SparkConf object is Spark's configuration object; it holds Spark configuration settings as key-value pairs. As soon as an object is instantiated via new SparkConf(), all spark.* JVM system properties are loaded into it by default.
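For example, here is a minimal, self-contained sketch showing that a plain new SparkConf() picks up spark.* JVM system properties while new SparkConf(false) does not; the spark.custom.flag property name is made up purely for illustration:

import org.apache.spark.SparkConf

object ConfLoadingDemo {
  def main(args: Array[String]): Unit = {
    // Any JVM system property starting with "spark." is picked up by default.
    sys.props("spark.app.name")    = "conf-demo"         // standard Spark property
    sys.props("spark.custom.flag") = "true"              // hypothetical user property

    val conf = new SparkConf()                           // loadDefaults = true
    println(conf.get("spark.app.name"))                  // conf-demo
    println(conf.getBoolean("spark.custom.flag", false)) // true

    // new SparkConf(false) skips system properties entirely.
    val empty = new SparkConf(false)
    println(empty.contains("spark.app.name"))            // false
  }
}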
Note:
When a SparkContext is instantiated, it requires a SparkConf object. Inside the SparkContext, a copy (clone) of that SparkConf is taken, and it is this copy that is used from then on. In other words, once the SparkConf has been handed to a SparkContext, modifying the SparkConf's settings afterwards has no effect.
// SparkContext, around line 230: returns a copy of the SparkConf
def getConf: SparkConf = conf.clone()

// SparkConf, around line 430: makes a copy of the SparkConf object
override def clone: SparkConf = {
  val cloned = new SparkConf(false)
  settings.entrySet().asScala.foreach { e =>
    cloned.set(e.getKey(), e.getValue(), true)
  }
  cloned
}
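A minimal local-mode sketch of the consequence (spark.demo.flag is an illustrative property name, not a real Spark setting): changes made to the SparkConf after the SparkContext has been constructed are invisible to the context, because it only ever reads its own clone.

import org.apache.spark.{SparkConf, SparkContext}

object ConfCloneDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("clone-demo")
    conf.set("spark.demo.flag", "before")          // hypothetical property

    val sc = new SparkContext(conf)                // SparkContext keeps a clone of conf
    conf.set("spark.demo.flag", "after")           // modifying the original has no effect

    println(sc.getConf.get("spark.demo.flag"))     // still prints "before"
    sc.stop()
  }
}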
2. SparkContext
The SparkContext initialization process:
1. Initialize the SparkConf object and load the Spark configuration.
2. Pass the SparkConf object into the SparkContext, where the individual configuration parameters are initialized.
3. Instantiate the TaskScheduler and the DAGScheduler via the createTaskScheduler method.
SparkContext, around line 500:
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
SparkContext, around line 2692:
// Create the SchedulerBackend and the TaskScheduler based on the master URL passed in by the user
private def createTaskScheduler(
sc: SparkContext,
master: String,
deployMode: String): (SchedulerBackend, TaskScheduler) = {
import SparkMasterRegex._
// When running locally, don't try to re-execute tasks on failure.
val MAX_LOCAL_TASK_FAILURES = 1
master match {
// local mode: setMaster("local")
case "local" =>
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
scheduler.initialize(backend)
(backend, scheduler)
// local mode: setMaster("local[*]") or setMaster("local[N]")
case LOCAL_N_REGEX(threads) =>
def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
// local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
val threadCount = if (threads == "*") localCpuCount else threads.toInt
if (threadCount <= 0) {
throw new SparkException(s"Asked to run locally with $threadCount threads")
}
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
scheduler.initialize(backend)
(backend, scheduler)
// Standalone mode: setMaster("spark://host:port")
case SPARK_REGEX(sparkUrl) =>
val scheduler = new TaskSchedulerImpl(sc)
val masterUrls = sparkUrl.split(",").map("spark://" + _)
val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
(backend, scheduler)
// local-cluster mode, e.g. setMaster("local-cluster[numSlaves, coresPerSlave, memoryPerSlave]"):
// a pseudo-distributed standalone cluster launched on the local machine, mainly used for testing.
// (YARN, Mesos and other external cluster managers are handled by the cases elided below.)
case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
// Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
val memoryPerSlaveInt = memoryPerSlave.toInt
if (sc.executorMemory > memoryPerSlaveInt) {
throw new SparkException(
"Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
memoryPerSlaveInt, sc.executorMemory))
}
...
}
3. TaskScheduler
TaskScheduler is a low-level task scheduling interface; currently its only implementation is TaskSchedulerImpl. The interface can be plugged into different scheduler backends (SchedulerBackend implementations).
Each TaskScheduler serves exactly one SparkContext and handles the tasks of that application. If a new Spark application comes in, the current TaskScheduler must be torn down and a new one created.
The TaskScheduler receives the TaskSet for each stage from the DAGScheduler and is responsible for submitting those tasks to the cluster and running them.
If tasks fail, it resubmits them, mitigates straggler tasks, and reports the execution results back to the DAGScheduler.
Stragglers: some of the tasks submitted to the cluster may lag far behind the others. One or two straggling tasks should not be allowed to hold up the entire job, so Spark can speculatively re-launch them elsewhere.
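Straggler mitigation is what Spark calls speculative execution, and it is switched on purely through configuration. A minimal sketch; the threshold values below are just illustrative examples, and spark.speculation.interval is the same setting that SPECULATION_INTERVAL_MS reads in the next section:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")            // re-launch slow (straggler) tasks elsewhere
  .set("spark.speculation.interval", "100ms")  // how often to check for stragglers
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median task duration
  .set("spark.speculation.quantile", "0.75")   // only check after 75% of tasks have finished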
3.1. TaskSchedulerImpl
A client must first call the initialize and start methods; only then can it submit task sets through the submitTasks method.
// line 81: interval at which to check for speculatable (straggler) tasks; default 100ms
val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")
// line 92: how long to wait before warning that a TaskSet is starved (has not launched any task); default 15s
val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")
// line 95: number of CPUs allocated to each task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
// line 136: the default scheduling mode is FIFO
private val schedulingModeConf = conf.get(SCHEDULER_MODE_PROPERTY, SchedulingMode.FIFO.toString)
// CoarseGrainedSchedulerBackend (coarse-grained scheduling):
//   Executors live for the whole lifetime of the application.
//   When a task finishes, its executor is not released, and a new task does not
//   spin up a new executor either -- the existing executors are reused.
// FineGrainedSchedulerBackend (fine-grained scheduling, Mesos only):
//   The resources of an executor are released as soon as a task finishes,
//   and are acquired again for each new task.
//
// Standalone and YARN only support coarse-grained scheduling; Mesos also supports fine-grained scheduling.
// FIFO: first-in, first-out scheduling --
//   TaskSets (jobs) are scheduled in the order in which they were submitted.
// FAIR: fair scheduling --
//   resources are shared between scheduling pools so that concurrently running jobs
//   each get a roughly equal share.
def initialize(backend: SchedulerBackend) {
this.backend = backend
schedulableBuilder = {
schedulingMode match {
case SchedulingMode.FIFO =>
new FIFOSchedulableBuilder(rootPool)
case SchedulingMode.FAIR =>
new FairSchedulableBuilder(rootPool, conf)
case _ =>
throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
s"$schedulingMode")
}
}
schedulableBuilder.buildPools()
}
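From the application side, the scheduling mode and pool chosen above are driven entirely by configuration. A minimal sketch, assuming a fairscheduler.xml that defines a pool named "heavy"; both the file path and the pool name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("fair-demo")
  .set("spark.scheduler.mode", "FAIR")                                   // default is FIFO
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // assumed path

val sc = new SparkContext(conf)
// Jobs submitted from this thread go into the "heavy" pool defined in the XML above.
sc.setLocalProperty("spark.scheduler.pool", "heavy")
sc.parallelize(1 to 1000, 8).map(_ * 2).count()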
override def start() {
backend.start()
}
StandaloneSchedulerBackend
override def start() {
// Calls CoarseGrainedSchedulerBackend.start(),
// which builds the driver's RPC communication endpoint
super.start()
...
// Build an ApplicationDescription; the parameters passed here describe the resources the application needs to run
val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
// Build the client around the application description (which carries the job's resource information);
// StandaloneAppClient is the object used to communicate with the cluster manager (the Master)
client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
client.start()
launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
// Wait until registration has completed; the registration itself is performed in StandaloneAppClient
waitForRegistration()
launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
// CoarseGrainedSchedulerBackend
override def start() {
// Build the driver-side RPC communication endpoint
driverEndpoint = createDriverEndpointRef(properties)
}
4. DriverEndpoint
An inner class of CoarseGrainedSchedulerBackend; it is the driver-side RPC communication endpoint.
// The endpoint's lifecycle starts here
override def onStart() {
// Periodically revive offers to allow delay scheduling to work
val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
reviveThread.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
Option(self).foreach(_.send(ReviveOffers)) // send a ReviveOffers message to itself
}
}, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}
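The revive loop above is nothing more than a fixed-rate task on a single-threaded scheduler. Here is a self-contained sketch of the same pattern using only the JDK scheduler; the names and the printed message are illustrative, not Spark's:

import java.util.concurrent.{Executors, TimeUnit}

object ReviveLoopSketch {
  case object ReviveOffers                      // illustrative message type

  def main(args: Array[String]): Unit = {
    val reviveThread     = Executors.newSingleThreadScheduledExecutor()
    val reviveIntervalMs = 1000L                // spark.scheduler.revive.interval defaults to 1s

    reviveThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // In Spark this would be self.send(ReviveOffers); here we just log it.
        println(s"sending $ReviveOffers to the driver endpoint")
      }
    }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)

    Thread.sleep(3500)                          // let a few ticks fire
    reviveThread.shutdownNow()
  }
}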
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
// Ignoring the update since we don't know about the executor.
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
case ReviveOffers =>
makeOffers()
}
// Build resource offers from the free resources of each alive executor
private def makeOffers() {
// Make sure no executor is killed while some task is launching on it
val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
// Filter out executors under killing
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map { case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq
scheduler.resourceOffers(workOffers)
}
if (!taskDescs.isEmpty) {
launchTasks(taskDescs)
}
}
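Conceptually, makeOffers turns the free cores of every alive executor into a list of offers and asks the TaskScheduler to place pending tasks into them. A simplified, self-contained sketch of that matching step; the types and the greedy placement policy here are illustrative and much simpler than TaskSchedulerImpl.resourceOffers:

object OfferSketch {
  // Illustrative stand-ins for Spark's WorkerOffer and TaskDescription.
  case class Offer(executorId: String, host: String, freeCores: Int)
  case class TaskDesc(taskId: Int, executorId: String)

  val cpusPerTask = 1

  // Greedily place pending tasks onto executors that still have free cores.
  def resourceOffers(offers: Seq[Offer], pendingTasks: Seq[Int]): Seq[TaskDesc] = {
    var remaining = pendingTasks
    val placed = Seq.newBuilder[TaskDesc]
    for (offer <- offers) {
      var free = offer.freeCores
      while (free >= cpusPerTask && remaining.nonEmpty) {
        placed += TaskDesc(remaining.head, offer.executorId)
        remaining = remaining.tail
        free -= cpusPerTask
      }
    }
    placed.result()
  }

  def main(args: Array[String]): Unit = {
    val offers = Seq(Offer("exec-1", "node1", 2), Offer("exec-2", "node2", 3))
    println(resourceOffers(offers, pendingTasks = 1 to 4)) // 2 tasks on exec-1, 2 on exec-2
  }
}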
5. StandaloneAppClient
override def onStart(): Unit = {
try {
// Send a registration message to the Master.
// The argument 1 means this is the first attempt; every time registration fails, the method
// recursively calls itself with the counter incremented. Once the failure count reaches 3,
// the application registration is abandoned.
registerWithMaster(1)
} catch {
case e: Exception =>
logWarning("Failed to connect to master", e)
markDisconnected()
stop()
}
}
private def registerWithMaster(nthRetry: Int) {
registerMasterFutures.set(tryRegisterAllMasters()) // register the application with all known Masters
// if registration has not succeeded by the time the timer fires, retry
registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
override def run(): Unit = {
if (registered.get) {
registerMasterFutures.get.foreach(_.cancel(true))
registerMasterThreadPool.shutdownNow()
} else if (nthRetry >= REGISTRATION_RETRIES) {
markDead("All masters are unresponsive! Giving up.")
} else {
registerMasterFutures.get.foreach(_.cancel(true))
registerWithMaster(nthRetry + 1)
}
}
}, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}
6. Master
Master, around line 258:
case RegisterApplication(description, driver) =>
// TODO Prevent repeated registrations from some driver
if (state == RecoveryState.STANDBY) {
// ignore, don't send response
} else {
logInfo("Registering app " + description.name)
// Wrap the application description and the driver reference into an ApplicationInfo
val app = createApplication(description, driver)
// Register the application inside the Master
registerApplication(app)
logInfo("Registered app " + description.name + " with ID " + app.id)
// Persist the application metadata with the persistence engine (used for Master recovery)
persistenceEngine.addApplication(app)
// Tell the driver side that registration has completed
driver.send(RegisteredApplication(app.id, self))
schedule()
}
On the StandaloneAppClient side:
override def receive: PartialFunction[Any, Unit] = {
case RegisteredApplication(appId_, masterRef) =>
// FIXME How to handle the following cases?
// 1. A master receives multiple registrations and sends back multiple
// RegisteredApplications due to an unstable network.
// 2. Receive multiple RegisteredApplication from different masters because the master is
// changing.
appId.set(appId_)
registered.set(true)
master = Some(masterRef)
listener.connected(appId.get)
III. Stage and Task Execution in Spark
1. SparkContext.runJob
// Every Spark action triggers the runJob method and produces a job
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
// From here the job leaves the SparkContext and is handed over to the DAGScheduler for execution
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}
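A minimal driver-side sketch (local mode) that exercises this path: the transformations are lazy, and collect() is the action that ends up calling SparkContext.runJob:

import org.apache.spark.{SparkConf, SparkContext}

object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("runJob-demo"))

    val rdd = sc.parallelize(1 to 10, numSlices = 2).map(_ * 2)

    // Transformations above are lazy; collect() is the action that triggers
    // SparkContext.runJob -> DAGScheduler.runJob for all partitions.
    val result = rdd.collect()
    println(result.mkString(", "))
    sc.stop()
  }
}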
2. DAGScheduler.runJob
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
val start = System.nanoTime
// Submit the Spark job via submitJob.
// The job produced by the action is wrapped in a JobSubmitted event and posted to the DAGScheduler's event loop
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
waiter.completionFuture.value.get match {
case scala.util.Success(_) =>
logInfo("Job %d finished: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
case scala.util.Failure(exception) =>
logInfo("Job %d failed: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
// SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
val callerStackTrace = Thread.currentThread().getStackTrace.tail
exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
throw exception
}
}
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
...
// eventProcessLoop is the event loop processor; a JobSubmitted event is added to it here.
// The event is submitted via post(); a dedicated thread picks it up and processes it shortly afterwards.
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
3. EventLoop
override def run(): Unit = {
try {
while (!stopped.get) {
val event = eventQueue.take() // take one event from the event queue (blocks until one is available)
try {
onReceive(event) // handle the event; this dispatches to DAGSchedulerEventProcessLoop
} catch {
case NonFatal(e) =>
try {
onError(e)
} catch {
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
} catch {
case ie: InterruptedException => // exit even if eventQueue is not empty
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
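The EventLoop is simply a daemon thread draining a blocking queue. A stripped-down, self-contained sketch of the same pattern; this is not Spark's EventLoop class, only its shape:

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

class SimpleEventLoop[E](name: String)(handler: E => Unit) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit =
      try {
        while (!stopped.get) {
          val event = eventQueue.take()          // blocks until an event is posted
          try handler(event)
          catch { case e: Exception => println(s"Unexpected error in $name: $e") }
        }
      } catch { case _: InterruptedException => () } // exit even if the queue is not empty
  }

  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event)
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
}

// Usage: the DAGScheduler posts JobSubmitted-style events and handles them on this thread.
object EventLoopDemo extends App {
  val loop = new SimpleEventLoop[String]("demo-loop")(e => println(s"handling $e"))
  loop.start()
  loop.post("JobSubmitted(jobId = 0)")
  Thread.sleep(200)
  loop.stop()
}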
DAGSchedulerEventProcessLoop.onReceive
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event) // hook method: dispatch the event to the matching handler
} finally {
timerContext.stop()
}
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
// handle the JobSubmitted event
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
//
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
}
// This method is the heart of the DAG scheduler:
// it is where stages are split and tasks are created
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
// Create the ResultStage: the last stage of a job is the ResultStage, i.e. the final stage
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
...
// Submit the stages that have been split out, starting from the final stage
submitStage(finalStage)
}
private def createResultStage(
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
// Recursively find (or create) the parent stages of this ResultStage
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
getShuffleDependencies(rdd).map { shuffleDep =>
getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}
// Split stages along wide (shuffle) dependencies, walking backwards from the final RDD
// and recursing until no further parent stage can be found
private def getOrCreateShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) =>
stage
case None =>
// Create stages for all missing ancestor shuffle dependencies.
getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
// Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
// that were not already in shuffleIdToMapStage, it's possible that by the time we
// get to a particular dependency in the foreach loop, it's been added to
// shuffleIdToMapStage by the stage creation process for an earlier dependency. See
// SPARK-13902 for more information.
if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
createShuffleMapStage(dep, firstJobId)
}
}
// Finally, create a stage for the given shuffle dependency.
createShuffleMapStage(shuffleDep, firstJobId)
}
}
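The practical effect of this recursion is easy to observe from the driver: every shuffle (wide) dependency starts a new stage. A minimal local-mode sketch where reduceByKey introduces exactly one ShuffleMapStage in front of the ResultStage:

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("stage-demo"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2)
    val pairs  = words.map(w => (w, 1))      // narrow dependency: stays in the same stage
    val counts = pairs.reduceByKey(_ + _)    // wide (shuffle) dependency: new stage boundary

    // toDebugString prints the lineage; the ShuffledRDD marks where the DAGScheduler
    // will cut the job into a ShuffleMapStage followed by a ResultStage.
    println(counts.toDebugString)
    counts.collect()                         // action: 2 stages, one task per partition in each
    sc.stop()
  }
}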
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
// find the parent stages that have not been computed yet
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
// all parents are available: submit the tasks of this stage
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent) // submit the missing parent stages first, recursively
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
// Builds the tasks for a stage and submits them
private def submitMissingTasks(stage: Stage, jobId: Int) {
// Create one task per partition to compute, depending on the stage type
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = stage.rdd.partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = stage.rdd.partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
// Wrap the tasks into a TaskSet and hand it to the TaskScheduler for submission
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
markStageAsFinished(stage, None)
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
submitWaitingChildStages(stage)
}
}
override def submitTasks(taskSet: TaskSet) {
...
backend.reviveOffers()
}
override def reviveOffers() {
driverEndpoint.send(ReviveOffers)
}
// CoarseGrainedSchedulerBackend, around line 300
// On the driver side, a LaunchTask message containing the serialized task is sent to the executor
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
4. CoarseGrainedExecutorBackend
Search for CoarseGrainedExecutorBackend, since it is the executor-side endpoint that communicates with the driver.
Find the LaunchTask case, around line 92:
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
// Decode the TaskDescription (the task and the resources it depends on)
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
executor.launchTask(this, taskDesc)
}
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
val tr = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
threadPool.execute(tr)
}
// Open the TaskRunner class and look at its run method, around line 334:
// val value = xx
// this value is the result of executing the task
// Around line 407:
// val serializedResult: ByteBuffer = {
// this is the result that will be sent back to the driver
// Around line 429:
// execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
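Tying the last two pieces together: launchTask wraps the decoded TaskDescription in a TaskRunner, registers it in runningTasks, and hands it to the executor's thread pool; the runner then reports its status back when it finishes. A self-contained sketch of that pattern; the classes below are illustrative stand-ins, not Spark's:

import java.util.concurrent.{ConcurrentHashMap, Executors}

object LaunchTaskSketch {
  // Illustrative stand-ins for Spark's TaskDescription / TaskRunner.
  final case class TaskDescription(taskId: Long, name: String)

  final class TaskRunner(desc: TaskDescription, onFinish: Long => Unit) extends Runnable {
    override def run(): Unit = {
      println(s"running task ${desc.taskId} (${desc.name})")
      // ... deserialize the task, run it, then report the result to the driver ...
      onFinish(desc.taskId)
    }
  }

  private val threadPool   = Executors.newCachedThreadPool()           // like the executor's task pool
  private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]() // like Executor.runningTasks

  def launchTask(desc: TaskDescription): Unit = {
    val tr = new TaskRunner(desc, taskId => runningTasks.remove(taskId))
    runningTasks.put(desc.taskId, tr)
    threadPool.execute(tr)
  }

  def main(args: Array[String]): Unit = {
    (1L to 3L).foreach(id => launchTask(TaskDescription(id, s"ResultTask-$id")))
    threadPool.shutdown()
  }
}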