The SparkContext initialization flow
- SparkConf is Spark's configuration object. It describes the application's configuration, loading settings primarily as key-value pairs.
- As soon as `new SparkConf()` finishes instantiating the object, it loads every `spark.*` JVM system property by default:

```scala
class SparkConf(loadDefaults: Boolean) {
  def this() = this(true)
}
```
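The default-loading behavior can be sketched with a toy class. `MiniConf` below is a hypothetical stand-in, not Spark's actual `SparkConf`; it only illustrates the mechanism of copying `spark.*` system properties when `loadDefaults` is true.

```scala
// Minimal sketch (hypothetical class, not Spark's SparkConf) of how
// loadDefaults works: with loadDefaults = true, every JVM system
// property whose key starts with "spark." is copied into the config map.
class MiniConf(loadDefaults: Boolean) {
  private val settings = scala.collection.mutable.Map[String, String]()

  if (loadDefaults) {
    // Copy spark.* system properties, mirroring SparkConf's behavior.
    for ((k, v) <- sys.props if k.startsWith("spark.")) settings(k) = v
  }

  // Auxiliary constructor mirroring `def this() = this(true)`.
  def this() = this(true)

  def set(key: String, value: String): MiniConf = { settings(key) = value; this }
  def get(key: String, default: String = null): String = settings.getOrElse(key, default)
}

sys.props("spark.app.name") = "demo"
val conf = new MiniConf()            // loadDefaults = true picks up spark.* properties
val bare = new MiniConf(false)       // skips the defaults entirely
```

Passing `false` is exactly what `SparkContext` does when it builds its internal clone, so the clone does not re-read system properties on top of the copied values.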
Notes
- Instantiating a SparkContext requires a SparkConf object as a parameter.
- Internally, SparkContext clones this SparkConf, producing an object whose property values are all identical, but which is not the same object as the one passed in.
- All of SparkContext's subsequent operations use this cloned SparkConf object.
- Note: after passing a SparkConf as a parameter to a SparkContext, later modifications to that SparkConf have no effect.

```scala
override def clone: SparkConf = {
  val cloned = new SparkConf(false)
  settings.entrySet().asScala.foreach { e =>
    cloned.set(e.getKey(), e.getValue(), true)
  }
  cloned
}
```
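The consequence of the clone can be demonstrated with a toy class. `ConfLike` below is illustrative only (not Spark's class), but its `clone` copies every entry into a fresh map just as `SparkConf.clone` does, which is why mutating the original afterwards has no effect on the context:

```scala
// Sketch (toy class, not Spark's) showing why mutating a SparkConf after
// passing it to SparkContext has no effect: the context works on a clone.
class ConfLike(private val settings: scala.collection.mutable.Map[String, String] =
                 scala.collection.mutable.Map()) {
  def set(k: String, v: String): ConfLike = { settings(k) = v; this }
  def get(k: String): Option[String] = settings.get(k)
  // Deep-copies every entry into a fresh map, as SparkConf.clone does.
  override def clone: ConfLike = new ConfLike(settings.clone())
}

val original = new ConfLike().set("spark.executor.memory", "1g")
val inContext = original.clone               // SparkContext keeps this copy
original.set("spark.executor.memory", "4g")  // later change: invisible to the context
```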
SparkContext
SparkContext's initialization process:

1. A SparkConf object is created, which reads the default configuration; additional settings can then be applied.
2. The SparkConf object is loaded into the SparkContext, which initializes each configuration property.
3. The createTaskScheduler method instantiates the SchedulerBackend and TaskScheduler; the DAGScheduler is created alongside them.
```scala
// Creates the SchedulerBackend and TaskScheduler based on the given master URL.
// SparkContext.scala, line 2692
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    // setMaster("local"): local mode with a single thread
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    // setMaster("local[2]") or setMaster("local[*]"): local mode with N threads
    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    // Standalone mode: spark://host:port
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)

    // local-cluster mode: simulates a standalone cluster inside a single JVM
    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }
      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
      val masterUrls = localCluster.start()
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
        localCluster.stop()
      }
      (backend, scheduler)
  }
}
```
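The dispatch logic above boils down to pattern-matching the master string against a set of regexes. The sketch below imitates that dispatch in isolation; the regex patterns follow `SparkMasterRegex` in spirit, and the returned strings are stand-ins for the real `(SchedulerBackend, TaskScheduler)` pair:

```scala
// Sketch of the master-string dispatch in createTaskScheduler.
// The regexes mirror SparkMasterRegex; the result strings merely
// describe which backend would be built.
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
val SPARK_REGEX = """spark://(.*)""".r

def dispatch(master: String): String = master match {
  case "local" =>
    "LocalSchedulerBackend with 1 thread"
  case LOCAL_N_REGEX(n) =>
    // local[*] resolves to the machine's core count; local[N] uses exactly N.
    val threads = if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt
    s"LocalSchedulerBackend with $threads threads"
  case SPARK_REGEX(url) =>
    s"StandaloneSchedulerBackend -> spark://$url"
  case other =>
    throw new IllegalArgumentException(s"Could not parse Master URL: '$other'")
}
```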
TaskScheduler
```scala
/**
 * Low-level task scheduler interface, currently implemented exclusively by
 * [[org.apache.spark.scheduler.TaskSchedulerImpl]].
 * This interface allows plugging in different task schedulers. Each TaskScheduler schedules tasks
 * for a single SparkContext. These schedulers get sets of tasks submitted to them from the
 * DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running
 * them, retrying if there are failures, and mitigating stragglers. They return events to the
 * DAGScheduler.
 */
```

TaskScheduler is a low-level task scheduling interface; currently its only implementation is TaskSchedulerImpl. The interface allows plugging in different backends (SchedulerBackend). Each TaskScheduler schedules tasks for a single SparkContext only: it handles the current application's tasks, and if a new Spark application is submitted, the current TaskScheduler is destroyed and a new one is created to handle the new application's tasks.

The TaskScheduler receives a set of tasks (a TaskSet) for each stage from the DAGScheduler, submits those tasks to the cluster for execution, retries them on failure, mitigates stragglers, and reports execution results back to the DAGScheduler. (Stragglers: among the tasks submitted to the cluster, a few may lag far behind; such tasks need to be dealt with so that one or two slow tasks do not hold back the overall execution.)
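The straggler-mitigation idea can be sketched numerically: a running task that has already taken much longer than the median runtime of finished tasks in the same stage is a candidate for a speculative duplicate. The data structures and the 1.5x multiplier below are illustrative, not Spark's internals:

```scala
// Sketch of straggler detection: a running task whose runtime exceeds
// (median finished runtime) * multiplier gets a speculative copy scheduled.
// RunningTask and the multiplier are illustrative, not Spark's internals.
case class RunningTask(id: Int, runtimeMs: Long)

def speculativeCandidates(finishedRuntimesMs: Seq[Long],
                          running: Seq[RunningTask],
                          multiplier: Double = 1.5): Seq[Int] = {
  if (finishedRuntimesMs.isEmpty) return Seq.empty
  val sorted = finishedRuntimesMs.sorted
  val median = sorted(sorted.length / 2)
  val threshold = median * multiplier
  // Only tasks far past the threshold are worth duplicating.
  running.filter(_.runtimeMs > threshold).map(_.id)
}
```

The duplicate and the original then race; whichever finishes first wins, and the other is killed, so a single slow node cannot stall the stage.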
TaskSchedulerImpl
Clients must first call the initialize() and start() methods; only then can they submit TaskSets via the submitTasks method.
```scala
// line 81
// Interval between checks for speculatable tasks, default 100ms
val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")

// line 92
// Timeout before warning that a TaskSet is starved of resources, default 15s
val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")

// line 95
// Number of CPU cores allocated to each task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)

// line 136
// Scheduling mode, FIFO by default
private val schedulingModeConf = conf.get(SCHEDULER_MODE_PROPERTY, SchedulingMode.FIFO.toString)
```
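The duration strings above ("100ms", "15s") are resolved to milliseconds by `conf.getTimeAsMs`. The parser below is a simplified stand-in for that behavior (Spark's real implementation lives in `Utils`, supports more units, and handles bare numbers); it only shows how such suffixed strings map to millisecond values:

```scala
// Simplified sketch of duration-string parsing in the spirit of
// conf.getTimeAsMs: "100ms" -> 100, "15s" -> 15000. Not Spark's parser.
def timeStringAsMs(s: String): Long = {
  val pattern = """(\d+)(ms|s|m|h)""".r
  s.trim match {
    case pattern(n, unit) =>
      val factor = unit match {
        case "ms" => 1L
        case "s"  => 1000L
        case "m"  => 60L * 1000
        case "h"  => 60L * 60 * 1000
      }
      n.toLong * factor
    case other =>
      throw new NumberFormatException(s"Invalid time string: $other")
  }
}
```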
CoarseGrainedSchedulerBackend (the coarse-grained scheduler backend):

- Executors are held for the entire lifecycle of the job.
- When a task finishes executing, its executor is not released immediately.
- When a new task comes in, no new executor is created; the existing executor is reused.
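The reuse behavior in the notes above can be sketched with a toy pool. `ExecutorPool` is illustrative only (not Spark's backend); it shows the coarse-grained idea that a finished task returns its executor for reuse instead of tearing it down:

```scala
// Sketch of the coarse-grained idea: executors are acquired once and
// reused across tasks instead of being torn down per task.
// ExecutorPool is illustrative only, not Spark's backend.
import scala.collection.mutable

class ExecutorPool {
  private val idle = mutable.Queue[String]()
  var launched = 0

  // Reuse an idle executor if one exists; launch a new one otherwise.
  def acquire(): String =
    if (idle.nonEmpty) idle.dequeue()
    else { launched += 1; s"executor-$launched" }

  // A finished task returns its executor to the pool instead of killing it.
  def release(e: String): Unit = idle.enqueue(e)
}

val pool = new ExecutorPool
val e1 = pool.acquire()   // first task launches executor-1
pool.release(e1)          // task done, executor kept alive
val e2 = pool.acquire()   // next task reuses executor-1, no new launch
```

This is why coarse-grained scheduling trades some idle resource usage for much lower task-launch latency: the JVM startup cost is paid once per executor, not once per task.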