SparkContext is the sole entry point through which a user application interacts with a Spark cluster. It serves two main purposes:
# Creating RDDs, accumulators, and broadcast variables
# Initializing the core components the application needs and preparing its runtime environment
Only one SparkContext instance may be active per JVM, so before creating a new SparkContext, make sure the previous one has been shut down by calling its stop method.
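A minimal sketch of this lifecycle, assuming a local run; the application name and master URL are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Only one active SparkContext is allowed per JVM
val conf = new SparkConf()
  .setAppName("SparkContextDemo")  // placeholder application name
  .setMaster("local[*]")           // placeholder master URL; point this at your cluster in production
val sc = new SparkContext(conf)
try {
  val rdd = sc.parallelize(1 to 10)
  println(rdd.sum())               // use the context to create and run an RDD
} finally {
  sc.stop()                        // free the JVM's single SparkContext slot before creating another
}

The later usage sketches in this article assume this sc is still active.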
1. SparkContext Core Properties
SparkConf conf: the Spark configuration object
SparkEnv env: the Spark runtime environment object
JobProgressListener jobProgressListener: the listener that tracks job progress
SparkStatusTracker statusTracker: the tracker for job and stage status
Configuration hadoopConfiguration: the Hadoop Configuration object
Int executorMemory: the executor memory, in MB
DAGScheduler dagScheduler: the high-level, stage-oriented scheduler. It receives jobs, splits each job into TaskSets (also called stages) according to the RDD lineage, submits the TaskSets one by one to the lower-level TaskScheduler, and also computes the preferred locations for tasks
TaskScheduler taskScheduler: submits TaskSets to the cluster for execution and reports the results back
SchedulerBackend schedulerBackend: manages and schedules cluster resources; it registers the application with the Master, handles executor registration and deregistration, and sends tasks to executors
RpcEndpointRef heartbeatReceiver: the reference to the heartbeat receiver RPC endpoint
String applicationId: the application ID
Option[String] applicationAttemptId: the application attempt ID
Option[ExecutorAllocationManager] executorAllocationManager: the manager for dynamic executor allocation
HashMap[String, String] executorEnvs: the environment variables to be passed to executors
Option[ContextCleaner] cleaner: the context cleaner, which cleans up unused RDDs, broadcasts, and shuffle data
Option[String] checkpointDir: the checkpoint directory
2. Important SparkContext Methods
2.1 SparkContext initialization: creating the core components and preparing the application runtime environment
// Resolve the event log directory
_eventLogDir =
  if (isEventLogEnabled) {
    val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
      .stripSuffix("/")
    Some(Utils.resolveURI(unresolvedDir))
  } else {
    None
  }
// Determine the compression codec for event logs
_eventLogCodec = {
  val compress = _conf.getBoolean("spark.eventLog.compress", false)
  if (compress && isEventLogEnabled) {
    Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
  } else {
    None
  }
}
// Create the job progress listener and register it with the listener bus
_jobProgressListener = new JobProgressListener(_conf)
listenerBus.addListener(jobProgressListener)
// Create the Spark execution environment (SparkEnv)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
// Create the status tracker
_statusTracker = new SparkStatusTracker(this)
// Determine the executor memory in MB; defaults to 1024 if not configured
_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM"))
  .map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)
// Register the HeartbeatReceiver before creating the TaskScheduler, because executors need to look it up
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
// Create the SchedulerBackend and the TaskScheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
// Keep a reference to the SchedulerBackend
_schedulerBackend = sched
// Keep a reference to the TaskScheduler
_taskScheduler = ts
// Create the DAGScheduler
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// Start the TaskScheduler
_taskScheduler.start()
// Initialize the BlockManager with the application ID
_env.blockManager.initialize(_applicationId)
// Create the ExecutorAllocationManager if dynamic allocation is enabled
_executorAllocationManager =
  if (dynamicAllocationEnabled) {
    schedulerBackend match {
      case b: ExecutorAllocationClient =>
        Some(new ExecutorAllocationManager(
          schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf))
      case _ =>
        None
    }
  } else {
    None
  }
// Start the ExecutorAllocationManager, if any
_executorAllocationManager.foreach(_.start())
// Create and start the ContextCleaner if reference tracking is enabled
_cleaner =
  if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())
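The flags read during this initialization map directly to SparkConf settings. Below is a configuration sketch showing the keys that appear in the code above; the values are illustrative only:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("InitDemo")
  .set("spark.eventLog.enabled", "true")                           // makes isEventLogEnabled true
  .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")  // resolved into _eventLogDir
  .set("spark.eventLog.compress", "true")                          // selects an _eventLogCodec
  .set("spark.executor.memory", "2g")                              // parsed into _executorMemory (in MB)
  .set("spark.dynamicAllocation.enabled", "true")                  // gates the ExecutorAllocationManager
  .set("spark.cleaner.referenceTracking", "true")                  // gates the ContextCleaner (default true)

Note that enabling dynamic allocation in practice usually also requires the external shuffle service (spark.shuffle.service.enabled).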
2.2 broadcast
Sends a read-only broadcast variable to every node in the cluster; the value is shipped to the cluster only once rather than with every task.
def broadcast[T: ClassTag](value: T): Broadcast[T] = {
  assertNotStopped()
  require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
    "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
  val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
  val callSite = getCallSite
  logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
  cleaner.foreach(_.registerBroadcastForCleanup(bc))
  bc
}
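A usage sketch, assuming the sc created in the introduction; the lookup map is a hypothetical example:

// Broadcast a small lookup table once and reuse it inside every task
val countryNames = Map("CN" -> "China", "US" -> "United States")
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("CN", "US", "CN"))
// Tasks read broadcastNames.value instead of capturing the map in each closure
val names = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown")).collect()
println(names.mkString(", "))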
2.3 makeRDD: converts a local sequence or collection into an RDD; this overload also takes a list of preferred host locations for each element
def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
  assertNotStopped()
  val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
  new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 1), indexToPrefs)
}
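A usage sketch, again assuming the sc from the introduction; the hostnames are hypothetical placeholders for nodes in your cluster:

// Each element is paired with the hosts on which its partition should preferably be scheduled
val data = Seq(
  (1, Seq("host1.example.com")),
  (2, Seq("host2.example.com")),
  (3, Seq("host1.example.com", "host2.example.com")))
val rdd = sc.makeRDD(data)  // RDD[Int] with one partition per element
println(rdd.collect().mkString(", "))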
2.4 textFile
Reads a text file from HDFS, the local file system, or any other Hadoop-supported file system URI, and returns it as an RDD of strings, one element per line.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
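A usage sketch with hypothetical paths; any Hadoop-supported URI works:

// Read from the local file system and from HDFS (paths are placeholders)
val localLines = sc.textFile("/tmp/input.txt")
val hdfsLines = sc.textFile("hdfs://namenode:8020/data/input.txt", minPartitions = 4)
println(hdfsLines.count())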
2.5 runJob
Runs a function on a given set of partitions of an RDD and passes each partition's result to the supplied result handler function.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
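A usage sketch, assuming the sc from the introduction; it counts the elements of the first two partitions only and collects the per-partition counts through the result handler:

import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)
val counts = new Array[Int](2)
sc.runJob(
  rdd,
  (ctx: TaskContext, it: Iterator[Int]) => it.size,   // func: runs on each selected partition
  Seq(0, 1),                                          // partitions to run on
  (index: Int, count: Int) => counts(index) = count)  // resultHandler: index is the position in the partitions seq
println(counts.mkString(", "))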
2.6 submitJob
Submits a job for asynchronous execution and returns a SimpleFutureAction that will eventually hold the result.
def submitJob[T, U, R](
    rdd: RDD[T],
    processPartition: Iterator[T] => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit,
    resultFunc: => R): SimpleFutureAction[R] =
{
  assertNotStopped()
  val cleanF = clean(processPartition)
  val callSite = getCallSite
  val waiter = dagScheduler.submitJob(
    rdd,
    (context: TaskContext, iter: Iterator[T]) => cleanF(iter),
    partitions,
    callSite,
    resultHandler,
    localProperties.get)
  new SimpleFutureAction(waiter, resultFunc)
}
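A usage sketch modeled on how RDD.countAsync uses this API, assuming the sc from the introduction:

import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.Await
import scala.concurrent.duration._

// Asynchronously sum every partition; the future completes once all partitions are processed
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val total = new AtomicLong(0L)
val future = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.map(_.toLong).sum,                    // processPartition
  rdd.partitions.indices,                                         // run on every partition
  (index: Int, partialSum: Long) => total.addAndGet(partialSum),  // resultHandler
  total.get())                                                    // resultFunc: evaluated when the job finishes
println(Await.result(future, 1.minute))                           // prints 5050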