The SparkContext initialization flow
- SparkConf is Spark's configuration object. It describes the application's configuration, loading settings primarily as key-value pairs.
- As soon as `new SparkConf()` finishes instantiating the object, it loads every `spark.*` JVM system property by default:

```scala
class SparkConf(loadDefaults: Boolean) {
  def this() = this(true)
}
```
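The default-loading behavior can be sketched with a toy class. `MiniConf` below is a hypothetical stand-in, not Spark's actual `SparkConf`; it only illustrates the mechanism of copying `spark.*` system properties when `loadDefaults` is true.

```scala
// Minimal sketch (hypothetical class, not Spark's SparkConf) of how
// loadDefaults works: with loadDefaults = true, every JVM system
// property whose key starts with "spark." is copied into the config map.
class MiniConf(loadDefaults: Boolean) {
  private val settings = scala.collection.mutable.Map[String, String]()

  if (loadDefaults) {
    // Copy spark.* system properties, mirroring SparkConf's behavior.
    for ((k, v) <- sys.props if k.startsWith("spark.")) settings(k) = v
  }

  // Auxiliary constructor mirroring `def this() = this(true)`.
  def this() = this(true)

  def set(key: String, value: String): MiniConf = { settings(key) = value; this }
  def get(key: String, default: String = null): String = settings.getOrElse(key, default)
}

sys.props("spark.app.name") = "demo"
val conf = new MiniConf()            // loadDefaults = true picks up spark.* properties
val bare = new MiniConf(false)       // skips the defaults entirely
```

Passing `false` is exactly what `SparkContext` does when it builds its internal clone, so the clone does not re-read system properties on top of the copied values.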
Notes
- Instantiating a SparkContext requires a SparkConf object as a parameter.
- Internally, SparkContext clones this SparkConf, producing an object whose property values are all identical, but which is not the same object as the one passed in.
- All of SparkContext's subsequent operations use this cloned SparkConf object.
- Note: after passing a SparkConf as a parameter to a SparkContext, later modifications to that SparkConf have no effect.

```scala
override def clone: SparkConf = {
  val cloned = new SparkConf(false)
  settings.entrySet().asScala.foreach { e =>
    cloned.set(e.getKey(), e.getValue(), true)
  }
  cloned
}
```
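The consequence of the clone can be demonstrated with a toy class. `ConfLike` below is illustrative only (not Spark's class), but its `clone` copies every entry into a fresh map just as `SparkConf.clone` does, which is why mutating the original afterwards has no effect on the context:

```scala
// Sketch (toy class, not Spark's) showing why mutating a SparkConf after
// passing it to SparkContext has no effect: the context works on a clone.
class ConfLike(private val settings: scala.collection.mutable.Map[String, String] =
                 scala.collection.mutable.Map()) {
  def set(k: String, v: String): ConfLike = { settings(k) = v; this }
  def get(k: String): Option[String] = settings.get(k)
  // Deep-copies every entry into a fresh map, as SparkConf.clone does.
  override def clone: ConfLike = new ConfLike(settings.clone())
}

val original = new ConfLike().set("spark.executor.memory", "1g")
val inContext = original.clone               // SparkContext keeps this copy
original.set("spark.executor.memory", "4g")  // later change: invisible to the context
```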
SparkContext
SparkContext's initialization process:

1. A SparkConf object is created, which reads the default configuration; additional settings can then be applied.
2. The SparkConf object is loaded into the SparkContext, which initializes each configuration property.
3. The createTaskScheduler method instantiates the SchedulerBackend and TaskScheduler; the DAGScheduler is created alongside them.
```scala
// Creates the SchedulerBackend and TaskScheduler based on the given master URL.
// SparkContext.scala, line 2692
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    // setMaster("local"): local mode with a single thread
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    // setMaster("local[2]") or setMaster("local[*]"): local mode with N threads
    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    // Standalone mode: spark://host:port
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)

    // local-cluster mode: simulates a standalone cluster inside a single JVM
    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }
      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
      val masterUrls = localCluster.start()
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
        localCluster.stop()
      }
      (backend, scheduler)
  }
}
```
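The dispatch logic above boils down to pattern-matching the master string against a set of regexes. The sketch below imitates that dispatch in isolation; the regex patterns follow `SparkMasterRegex` in spirit, and the returned strings are stand-ins for the real `(SchedulerBackend, TaskScheduler)` pair:

```scala
// Sketch of the master-string dispatch in createTaskScheduler.
// The regexes mirror SparkMasterRegex; the result strings merely
// describe which backend would be built.
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
val SPARK_REGEX = """spark://(.*)""".r

def dispatch(master: String): String = master match {
  case "local" =>
    "LocalSchedulerBackend with 1 thread"
  case LOCAL_N_REGEX(n) =>
    // local[*] resolves to the machine's core count; local[N] uses exactly N.
    val threads = if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt
    s"LocalSchedulerBackend with $threads threads"
  case SPARK_REGEX(url) =>
    s"StandaloneSchedulerBackend -> spark://$url"
  case other =>
    throw new IllegalArgumentException(s"Could not parse Master URL: '$other'")
}
```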
TaskScheduler
```scala
/**
 * Low-level task scheduler interface, currently implemented exclusively by
 * [[org.apache.spark.scheduler.TaskSchedulerImpl]].
 * This interface allows plugging in different task schedulers. Each TaskScheduler schedules tasks
 * for a single SparkContext. These schedulers get sets of tasks submitted to them from the
 * DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running
 * them, retrying if there are failures, and mitigating stragglers. They return events to the
 * DAGScheduler.
 */
```

TaskScheduler is a low-level task scheduling interface; currently its only implementation is TaskSchedulerImpl. The interface allows plugging in different backends (SchedulerBackend). Each TaskScheduler schedules tasks for a single SparkContext only: it handles the current application's tasks, and if a new Spark application is submitted, the current TaskScheduler is destroyed and a new one is created to handle the new application's tasks.

The TaskScheduler receives a set of tasks (a TaskSet) for each stage from the DAGScheduler, submits those tasks to the cluster for execution, retries them on failure, mitigates stragglers, and reports execution results back to the DAGScheduler. (Stragglers: among the tasks submitted to the cluster, a few may lag far behind; such tasks need to be dealt with so that one or two slow tasks do not hold back the overall execution.)
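The straggler-mitigation idea can be sketched numerically: a running task that has already taken much longer than the median runtime of finished tasks in the same stage is a candidate for a speculative duplicate. The data structures and the 1.5x multiplier below are illustrative, not Spark's internals:

```scala
// Sketch of straggler detection: a running task whose runtime exceeds
// (median finished runtime) * multiplier gets a speculative copy scheduled.
// RunningTask and the multiplier are illustrative, not Spark's internals.
case class RunningTask(id: Int, runtimeMs: Long)

def speculativeCandidates(finishedRuntimesMs: Seq[Long],
                          running: Seq[RunningTask],
                          multiplier: Double = 1.5): Seq[Int] = {
  if (finishedRuntimesMs.isEmpty) return Seq.empty
  val sorted = finishedRuntimesMs.sorted
  val median = sorted(sorted.length / 2)
  val threshold = median * multiplier
  // Only tasks far past the threshold are worth duplicating.
  running.filter(_.runtimeMs > threshold).map(_.id)
}
```

The duplicate and the original then race; whichever finishes first wins, and the other is killed, so a single slow node cannot stall the stage.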
TaskSchedulerImpl
Clients must first call the initialize() and start() methods; only then can they submit TaskSets via the submitTasks method.
```scala
// line 81
// Interval between checks for speculatable tasks, default 100ms
val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")

// line 92
// Timeout before warning that a TaskSet is starved of resources, default 15s
val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")

// line 95
// Number of CPU cores allocated to each task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)

// line 136
// Scheduling mode, FIFO by default
private val schedulingModeConf = conf.get(SCHEDULER_MODE_PROPERTY, SchedulingMode.FIFO.toString)
```
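The duration strings above ("100ms", "15s") are resolved to milliseconds by `conf.getTimeAsMs`. The parser below is a simplified stand-in for that behavior (Spark's real implementation lives in `Utils`, supports more units, and handles bare numbers); it only shows how such suffixed strings map to millisecond values:

```scala
// Simplified sketch of duration-string parsing in the spirit of
// conf.getTimeAsMs: "100ms" -> 100, "15s" -> 15000. Not Spark's parser.
def timeStringAsMs(s: String): Long = {
  val pattern = """(\d+)(ms|s|m|h)""".r
  s.trim match {
    case pattern(n, unit) =>
      val factor = unit match {
        case "ms" => 1L
        case "s"  => 1000L
        case "m"  => 60L * 1000
        case "h"  => 60L * 60 * 1000
      }
      n.toLong * factor
    case other =>
      throw new NumberFormatException(s"Invalid time string: $other")
  }
}
```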
CoarseGrainedSchedulerBackend (the coarse-grained scheduler backend):

- Executors are held for the entire lifecycle of the job.
- When a task finishes executing, its executor is not released immediately.
- When a new task comes in, no new executor is created; the existing executor is reused.
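The reuse behavior in the notes above can be sketched with a toy pool. `ExecutorPool` is illustrative only (not Spark's backend); it shows the coarse-grained idea that a finished task returns its executor for reuse instead of tearing it down:

```scala
// Sketch of the coarse-grained idea: executors are acquired once and
// reused across tasks instead of being torn down per task.
// ExecutorPool is illustrative only, not Spark's backend.
import scala.collection.mutable

class ExecutorPool {
  private val idle = mutable.Queue[String]()
  var launched = 0

  // Reuse an idle executor if one exists; launch a new one otherwise.
  def acquire(): String =
    if (idle.nonEmpty) idle.dequeue()
    else { launched += 1; s"executor-$launched" }

  // A finished task returns its executor to the pool instead of killing it.
  def release(e: String): Unit = idle.enqueue(e)
}

val pool = new ExecutorPool
val e1 = pool.acquire()   // first task launches executor-1
pool.release(e1)          // task done, executor kept alive
val e2 = pool.acquire()   // next task reuses executor-1, no new launch
```

This is why coarse-grained scheduling trades some idle resource usage for much lower task-launch latency: the JVM startup cost is paid once per executor, not once per task.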