A Walkthrough of Key Spark Source Code

Winyar Wen

Last modified 2024-05-20 20:06:41


First published 2019-07-21 09:06:56

Article link: https://blog.csdn.net/weixin_42394052/article/details/96599680


The SparkConf class

/**
 * Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
 *
 * Most of the time, you would create a SparkConf object with new SparkConf(), which will load
 * values from any spark.* Java system properties set in your application as well. In this case,
 * parameters you set directly on the SparkConf object take priority over system properties.
 *
 * For unit tests, you can also call new SparkConf(false) to skip loading external settings and
 * get the same configuration no matter what the system properties are.
 *
 * All setter methods in this class support chaining. For example, you can write
 * new SparkConf().setMaster("local").setAppName("My app").
 *
 * @param loadDefaults whether to also load values from Java system properties
 *
 * @note Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
 * by the user. Spark does not support modifying the configuration at runtime.
 */

A SparkContext takes a SparkConf as a constructor argument. SparkConf describes the configuration of the whole Spark application, and its setters can be chained:
new SparkConf().setMaster("local").setAppName("TestApp")

Part of the SparkConf source:
// Stores the configuration as key-value pairs
private val settings = new ConcurrentHashMap[String, String]()

// By default, configuration entries of the form "spark.*" are loaded
if (loadDefaults) {
  // Load any spark.* system properties
  for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
    set(key, value)
  }
}

/** Set a configuration variable. */
def set(key: String, value: String): SparkConf = {
  if (key == null) {
    throw new NullPointerException("null key")
  }
  if (value == null) {
    throw new NullPointerException("null value for " + key)
  }
  logDeprecationWarning(key)
  settings.put(key, value)
  // Every setter returns the SparkConf itself, which is what makes chaining possible
  this
}
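To make the chaining and the loadDefaults behaviour concrete without a Spark dependency, here is a minimal sketch. MiniConf is a hypothetical stand-in written for this article, not Spark's class; only the null checks, the "spark.*" system property loading, and the "return this" pattern are modeled.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for SparkConf (an assumption for illustration only).
class MiniConf(loadDefaults: Boolean = true) {
  private val settings = new ConcurrentHashMap[String, String]()

  if (loadDefaults) {
    // Copy every "spark.*" JVM system property into the settings map.
    val props = System.getProperties
    val names = props.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      if (key.startsWith("spark.")) {
        Option(props.getProperty(key)).foreach(value => set(key, value))
      }
    }
  }

  def set(key: String, value: String): MiniConf = {
    if (key == null) throw new NullPointerException("null key")
    if (value == null) throw new NullPointerException("null value for " + key)
    settings.put(key, value)
    this // returning this is what enables chaining
  }

  def setMaster(master: String): MiniConf = set("spark.master", master)
  def setAppName(name: String): MiniConf = set("spark.app.name", name)
  def getOption(key: String): Option[String] = Option(settings.get(key))
}
```

Because system properties are copied in the constructor and explicit setters run afterwards, a value set directly on the object overwrites the one that came from a system property.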

The SparkContext class

/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
 * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 * this config overrides the default configs as well as system properties.
 */

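SparkContext is the entry point to all Spark functionality and represents the application's connection to the cluster. A Spark application is submitted to the cluster through SparkContext, and the program runs under scheduling coordinated by SparkContext. When SparkContext crashes or stops, the application ends, which makes it one of the most important classes in Spark.

Selected SparkContext source (key parts only):

SparkContext has two main jobs: ① initialize the SparkEnv object; ② initialize and start the three scheduling modules DAGScheduler, TaskScheduler, and SchedulerBackend. It also sets up a heartbeat mechanism with the worker nodes for detection and monitoring.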

// Whether multiple SparkContexts may coexist; defaults to false
// If true, log warnings instead of throwing exceptions when multiple SparkContexts are active
private val allowMultipleContexts: Boolean =
  config.getBoolean("spark.driver.allowMultipleContexts", false)

// An asynchronous listener bus for Spark events
private[spark] val listenerBus = new LiveListenerBus

// Tracks every RDD that has been persisted (cached)
// Keeps track of all persisted RDDs
private[spark] val persistentRdds = new TimeStampedWeakValueHashMap[Int, RDD[_]]

// System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
  throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
    "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
}

// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

// Memory of each executor process; defaults to 1 GB (1024 MB)
_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM"))
  .map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)

// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor.
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

_env.blockManager.initialize(_applicationId)
_env.metricsSystem.start()
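The executor memory lookup in the excerpt above is a chain of fallbacks: the spark.executor.memory setting first, then the SPARK_EXECUTOR_MEMORY environment variable, then SPARK_MEM, and finally 1024 MB. The sketch below reproduces only that ordering with plain maps. ExecutorMemorySketch and its toMb parser are assumptions made for illustration; toMb understands only the "m" and "g" suffixes and is much narrower than Utils.memoryStringToMb.

```scala
object ExecutorMemorySketch {
  // Simplified stand-in parser: accepts only values such as "512m" or "2g".
  def toMb(s: String): Int = {
    val v = s.trim.toLowerCase
    if (v.endsWith("g")) v.dropRight(1).toInt * 1024
    else if (v.endsWith("m")) v.dropRight(1).toInt
    else throw new IllegalArgumentException("unsupported memory string: " + s)
  }

  // conf and env are plain maps standing in for SparkConf and System.getenv.
  def resolve(conf: Map[String, String], env: Map[String, String]): Int =
    conf.get("spark.executor.memory")
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))
      .orElse(env.get("SPARK_MEM"))
      .map(toMb)
      .getOrElse(1024) // default when nothing is configured
}
```

Each orElse is evaluated only when everything before it is empty, so an explicit configuration entry always wins over the environment.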

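The three core objects inside SparkContext are DAGScheduler, TaskScheduler, and SchedulerBackend.

1) DAGScheduler analyzes the dependencies and splits the DAG into Stages. Each Stage consists of a group of Tasks that can run concurrently; these Tasks share exactly the same logic and differ only in the data they operate on.
2) TaskScheduler schedules tasks for the SparkContext that created it: it receives the tasks of each Stage from DAGScheduler, submits them to the cluster, and launches backup (speculative) tasks for tasks that run unusually slowly.
3) SchedulerBackend takes the resources currently granted to the application and launches the Tasks inside Executor processes, which completes the scheduling of the computation.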
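As a rough illustration of how a lineage becomes Stages, the sketch below cuts a linear chain of operations at every shuffle (wide) dependency. It is a simplified model under stated assumptions: StageSketch, Op, and splitStages are hypothetical names, the lineage is a straight line rather than a DAG, and the real DAGScheduler walks backwards from the final RDD instead of folding forwards.

```scala
object StageSketch {
  // One step of a linear lineage; shuffle = true marks a wide dependency on the previous step.
  final case class Op(name: String, shuffle: Boolean)

  // A new stage starts at the first op and at every op that needs a shuffle.
  def splitStages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) { (stages, op) =>
      stages match {
        case current :: done if !op.shuffle => (op.name :: current) :: done
        case _ => List(op.name) :: stages
      }
    }.map(_.reverse).reverse
}
```

All operations inside one resulting stage can be pipelined within a task, which is why the Tasks of a Stage share the same logic and differ only in their partition of the data.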

SparkEnv

/**
 * :: DeveloperApi ::
 * Holds all the runtime environment objects for a running Spark instance (either master or worker),
 * including the serializer, RpcEnv, block manager, map output tracker, etc. Currently
 * Spark code finds the SparkEnv through a global variable, so all the threads can access the same
 * SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
 */

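SparkEnv is Spark's execution environment object. It includes, among others:
1) serializer 2) RpcEnv 3) BlockManager
4) MapOutputTracker (very important during shuffle)

In local mode the Driver creates the Executor; in Standalone deployment mode Executors are created on the Workers. A SparkEnv therefore exists in every Executor that takes part in task scheduling, and its environment information is visible to, and identical for, all Tasks of a job. This keeps the runtime environment consistent.

SparkEnv is constructed in the following steps:

1. Create the security manager, SecurityManager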

Spark currently supports authentication via a shared secret. Authentication can be configured to be on via the spark.authenticate configuration parameter. This parameter controls whether the Spark communication protocols do authentication using the shared secret. This authentication is a basic handshake to make sure both sides have the same shared secret and are allowed to communicate. If the shared secret is not identical they will not be allowed to communicate. The shared secret is created as follows:

For Spark on YARN deployments, configuring spark.authenticate to true will automatically handle generating and distributing the shared secret. Each application will use a unique shared secret.
For other types of Spark deployments, the Spark parameter spark.authenticate.secret should be configured on each of the nodes. This secret will be used by all the Master/Workers and applications.

In short, SecurityManager is Spark's authentication module. It is switched on with spark.authenticate and relies on a handshake that succeeds only when both sides hold the same shared secret:
① On YARN, setting spark.authenticate to true is enough; the secret is generated and distributed automatically, one per application.
② In other deployment modes, spark.authenticate.secret must be configured on every node and is shared by all Masters, Workers, and applications.
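The idea of a handshake that succeeds only when both sides hold the same secret can be sketched as a challenge and response over HMAC-SHA256. This is not Spark's actual wire protocol; SharedSecretSketch and its two methods are hypothetical, and the only library calls used are the standard javax.crypto and java.security ones.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object SharedSecretSketch {
  // The prover answers a challenge with an HMAC keyed by its copy of the secret.
  def respond(secret: String, challenge: Array[Byte]): Array[Byte] = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes(UTF_8), "HmacSHA256"))
    mac.doFinal(challenge)
  }

  // The verifier recomputes the answer with its own secret and compares in constant time.
  def verify(secret: String, challenge: Array[Byte], response: Array[Byte]): Boolean =
    MessageDigest.isEqual(respond(secret, challenge), response)
}
```

The secret itself never crosses the wire in this scheme; only the challenge and the keyed digest do, so a peer with a different secret simply fails verification.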

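2. Create the RpcEnv
Spark 1.6 introduced a new RPC architecture built around RpcEnv, RpcEndpoint, and RpcEndpointRef. It wraps Akka and Netty underneath and leaves room for other communication systems in the future.
① With Akka underneath, the roles roughly map as: RpcEnv = ActorSystem, RpcEndpoint = Actor, RpcEndpointRef = the reference used to talk to that Actor.
② With Netty underneath, the roles roughly map as: RpcEnv = NettyServer, RpcEndpoint = NettyClient, RpcEndpointRef = the reference used to talk to that NettyClient.
Before Spark 1.6 the transport was Akka; from 1.6 onwards the default is Netty.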
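The division of labour between the three roles can be shown with a tiny in-process model: the environment owns a registry, an endpoint handles messages, and callers only ever hold a reference. Everything in RpcSketch is hypothetical and synchronous; there is no network, serialization, or threading, all of which the real RpcEnv implementations provide.

```scala
import scala.collection.mutable

object RpcSketch {
  // An endpoint handles a message and produces a reply.
  trait Endpoint {
    def receiveAndReply(message: Any): Any
  }

  // A reference is the only handle callers get; it routes through the environment.
  final class EndpointRef(val name: String, env: Env) {
    def ask(message: Any): Any = env.deliver(name, message)
  }

  // The environment owns the registry and dispatches messages by endpoint name.
  final class Env {
    private val endpoints = mutable.Map.empty[String, Endpoint]

    def setupEndpoint(name: String, endpoint: Endpoint): EndpointRef = {
      endpoints(name) = endpoint
      new EndpointRef(name, this)
    }

    def deliver(name: String, message: Any): Any =
      endpoints.get(name) match {
        case Some(endpoint) => endpoint.receiveAndReply(message)
        case None => throw new IllegalStateException("no endpoint named " + name)
      }
  }
}
```

This mirrors the shape of the setupEndpoint call seen in the SparkContext excerpt, where registering HeartbeatReceiver returns the reference that is later used to ask it a question.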

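3. Create the ShuffleManager
ShuffleManager manages shuffle operations on local and remote Block data. By default it is an instance of SortShuffleManager, created through reflection. Sort mode is the default; setting the property spark.shuffle.manager to hash selects HashShuffleManager explicitly (only in releases before Spark 2.0, which removed HashShuffleManager).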

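4. Create MapOutputTracker, the tracker for Shuffle Map Task output
MapOutputTracker tracks the output status of Shuffle Map Tasks so that Result Tasks can find the addresses of the intermediate results. A Result Task pulls Blocks from the nodes where the Map Tasks ran; this process is the Shuffle. MapOutputTracker has two subclasses:
① MapOutputTrackerMaster (for driver)
② MapOutputTrackerWorker (for executors)
Before a shuffleReader reads shuffle files, it asks MapOutputTrackerMaster where the data it has to process is located. MapOutputTrackerMaster answers with the locations of the relevant map outputs (address, port, and so on), and the shuffleReader then fetches from those locations.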
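The request and answer between the two tracker roles can be sketched as a registry on the driver side and a caching client on the executor side. The names in MapOutputSketch and the fields of Location are assumptions for illustration; the real trackers exchange richer status objects over RPC.

```scala
import scala.collection.mutable

object MapOutputSketch {
  // Where one map task's output lives (hypothetical, simplified fields).
  final case class Location(host: String, port: Int)

  // Driver side: records the location of every finished map task, per shuffle.
  final class TrackerMaster {
    private val outputs = mutable.Map.empty[Int, mutable.Map[Int, Location]]

    def registerMapOutput(shuffleId: Int, mapId: Int, loc: Location): Unit =
      outputs.getOrElseUpdate(shuffleId, mutable.Map.empty[Int, Location]).update(mapId, loc)

    def locations(shuffleId: Int): Map[Int, Location] =
      outputs.get(shuffleId).map(_.toMap).getOrElse(Map.empty[Int, Location])
  }

  // Executor side: asks the master once per shuffle and caches the answer.
  final class TrackerWorker(master: TrackerMaster) {
    private val cache = mutable.Map.empty[Int, Map[Int, Location]]

    def locations(shuffleId: Int): Map[Int, Location] =
      cache.getOrElseUpdate(shuffleId, master.locations(shuffleId))
  }
}
```

Caching on the worker side keeps the many reduce tasks of one executor from each asking the driver for the same shuffle.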

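5. Create the memory manager, MemoryManager
Spark has two memory management schemes. The new one is implemented by UnifiedMemoryManager and the old one by StaticMemoryManager. The old scheme is static: storageMemory and executionMemory each own a fixed share that cannot be lent to the other side, so memory is wasted whenever one side has plenty and the other runs short. The new scheme manages memory in a unified way: each side starts with half, but a side that runs short can borrow from the other, which raises overall utilization.

Spark divides managed memory into two large regions, storageMemory and executionMemory. storageMemory caches RDDs, unrolled partitions, direct task results, broadcast variables, and so on. executionMemory holds the buffers used by shuffle, join, sort, and aggregation. Everything outside these two regions is reserved for the system. Every Executor process has one MemoryManager.

Which MemoryManager is used is controlled by spark.memory.useLegacyMode. The default is UnifiedMemoryManager, which manages memory dynamically: the storage and execution regions can borrow from each other. The advantage is that memory is used fully, and it never happens that one region is under pressure while the other sits idle.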
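The difference between the two schemes shows up in a small model. In the static pool each side is confined to its own region; in the unified pool either side may grow into whatever the other side is not using. MemorySketch is a hypothetical simplification: it lends only free memory and evicts nothing, whereas the real UnifiedMemoryManager can also evict cached blocks to make room for execution.

```scala
object MemorySketch {
  // Static split: storage and execution can never use each other's region.
  final class StaticPool(total: Long, storageFraction: Double = 0.5) {
    private val storageMax = (total * storageFraction).toLong
    private val executionMax = total - storageMax
    private var storageUsed = 0L
    private var executionUsed = 0L

    def acquireStorage(bytes: Long): Boolean =
      if (storageUsed + bytes <= storageMax) { storageUsed += bytes; true } else false

    def acquireExecution(bytes: Long): Boolean =
      if (executionUsed + bytes <= executionMax) { executionUsed += bytes; true } else false
  }

  // Unified pool: a request succeeds whenever enough memory is free overall.
  final class UnifiedPool(total: Long) {
    private var storageUsed = 0L
    private var executionUsed = 0L
    private def free: Long = total - storageUsed - executionUsed

    def acquireStorage(bytes: Long): Boolean =
      if (bytes <= free) { storageUsed += bytes; true } else false

    def acquireExecution(bytes: Long): Boolean =
      if (bytes <= free) { executionUsed += bytes; true } else false
  }
}
```

With a total of 100 units, a 70 unit execution request is refused by the static pool even though storage is empty, while the unified pool grants it.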

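6. Create the block transfer service, NettyBlockTransferService
NettyBlockTransferService is built on the Netty network application framework. It provides both a server and a client for fetching sets of Blocks from remote nodes. Its underlying fetchBlocks method is used to fetch data from remote shuffle files.

7. Create the BlockManagerMaster
BlockManagerMaster manages and coordinates the BlockManagers.

8. Create the block manager, BlockManager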
/**
 * Manager running on every node (driver and executors) which provides interfaces for putting and
 * retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
 */

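Every Executor process creates a BlockManager when it starts, and all BlockManagers are managed by BlockManagerMaster.

BlockManager mainly provides interfaces for reading and writing data. Data can be read or written locally or remotely, and it can live in memory, on disk, or off-heap.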

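9. Create the broadcast manager, BroadcastManager
BroadcastManager stores configuration information, serialized RDDs, ShuffleDependency information, and similar data locally. It also broadcasts data from one node to the others. For example, suppose the Driver holds a table that every Task running in parallel on the Executors (say, one million of them) needs to look up. With broadcasting, the table is sent to each Executor only once, and every Task running in that Executor queries that single copy instead of fetching the table from the Driver each time it runs. This keeps the Driver node from becoming a performance bottleneck.

Spark's broadcast mechanism
When a broadcast variable is declared, every node eventually receives it. The implementation works as follows:
① The driver splits the serialized object into small blocks and stores them in the driver's BlockManager.
② The spark.broadcast.compress property decides whether the broadcast data is compressed, and spark.broadcast.blockSize sets the block size, 4 MB by default.
③ A BroadcastBlockId is generated for each block. In other words, the driver cuts the broadcast data into pieces and stores each piece as a block in its own BlockManager.
④ Each executor tries to obtain all of the blocks and assemble them into the complete broadcast variable. To obtain a block it first looks in its own BlockManager; if the block is not there, it fetches it from another BlockManager. At the very beginning, therefore, the driver is the only source of blocks.
⑤ As the BlockManagers fetch different blocks from the driver (TorrentBroadcast deliberately keeps executors from fetching blocks in the same order), the number of sources for each block grows.
⑥ Each executor can then fetch a block from any one of several sources, including the driver and the BlockManagers of other executors. This spreads traffic more evenly across the cluster instead of leaving the driver as the sole source.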
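Steps ① to ⑤ can be sketched as two small functions: one that cuts a serialized value into fixed-size pieces, and one that fetches the pieces in a random order and reassembles them by index. BroadcastSketch is a hypothetical model; the fetch parameter stands in for "look in my own BlockManager, else ask another one", and compression and BroadcastBlockId naming are left out.

```scala
import scala.util.Random

object BroadcastSketch {
  // Driver side: cut the serialized value into pieces of at most blockSize bytes, keyed by index.
  def blockify(bytes: Array[Byte], blockSize: Int): Map[Int, Array[Byte]] =
    bytes.grouped(blockSize).zipWithIndex.map { case (piece, i) => i -> piece }.toMap

  // Executor side: fetch the pieces in a random order, then put them back together by index.
  def fetchAndAssemble(numPieces: Int, fetch: Int => Array[Byte], rnd: Random): Array[Byte] = {
    val fetched = rnd.shuffle((0 until numPieces).toList).map(i => i -> fetch(i)).toMap
    Array.concat((0 until numPieces).map(fetched): _*)
  }
}
```

Randomizing the fetch order is what lets different executors hold different pieces early on, so that they can serve each other instead of all waiting on the driver.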

10. Create the cache manager, CacheManager
CacheManager manages and persists RDDs.

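11. Create the listener bus ListenerBus and the metrics system MetricsSystem
Monitoring of the whole Spark system is handled by ListenerBus and MetricsSystem. The Spark listener bus (LiveListenerBus) listens for events inside Spark, such as a job starting, the memory usage of each Worker, or a BlockManager being added, and this information is surfaced to the UI through MetricsSystem.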
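The listener bus is an instance of the publish and subscribe pattern, which the sketch below shows in its simplest form. ListenerBusSketch and its event types are hypothetical, and the bus here is synchronous: post calls the listeners directly, whereas LiveListenerBus is described in the Spark source as asynchronous.

```scala
import scala.collection.mutable

object ListenerBusSketch {
  // Two example events; the names are illustrative only.
  sealed trait Event
  final case class JobStart(jobId: Int) extends Event
  final case class BlockManagerAdded(executorId: String) extends Event

  trait Listener {
    def onEvent(event: Event): Unit
  }

  // Synchronous bus: post() calls every registered listener in registration order.
  final class Bus {
    private val listeners = mutable.ArrayBuffer.empty[Listener]

    def addListener(listener: Listener): Unit = listeners += listener

    def post(event: Event): Unit = listeners.foreach(_.onEvent(event))
  }
}
```

Producers of events need to know only the bus, not who is listening, which is what allows monitoring components to be attached without touching the scheduler code.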

12. Create the SparkEnv
Once all of the components above are ready, the execution environment SparkEnv itself can finally be created.
