Deep Dive into the Spark 2.0.x Source Code: SparkEnv

WeChat: 519292115

Email: taosiyuan163@163.com

Please respect the original work; reposting is prohibited!


Spark is currently one of the hottest frameworks in the big data space; it efficiently handles offline batch processing, real-time computation, machine learning, and other diverse workloads. Reading the source code will deepen your understanding of the framework.

I will analyze the core components of Spark 2.0.x one by one, including, in later chapters, RpcEnv, NettyRpc, BlockManager, OutputTracker, TaskScheduler, DAGScheduler, and more.


SparkEnv, the runtime execution environment of a running Spark instance, is extremely important. It exists on both the Worker and the Driver (when created on the driver side it goes through createDriverEnv inside createSparkEnv, mentioned in an earlier chapter; on an executor, CoarseGrainedExecutorBackend calls createExecutorEnv to create it), and it builds key runtime components such as BlockManager, SerializerManager, RpcEnv, and MapOutputTracker. At present SparkEnv is reached through a global variable rather than being safely encapsulated, so all threads can access the same instance; it may become private in a future release.


/**
 * :: DeveloperApi ::
 * Holds all the runtime environment objects for a running Spark instance (either master or worker),
 * including the serializer, RpcEnv, block manager, map output tracker, etc. Currently
 * Spark code finds the SparkEnv through a global variable, so all the threads can access the same
 * SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
 *
 * NOTE: This is not intended for external use. This is exposed for Shark and may be made private
 *       in a future release.
 */
@DeveloperApi
class SparkEnv (
    val executorId: String,
    private[spark] val rpcEnv: RpcEnv,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val serializerManager: SerializerManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val metricsSystem: MetricsSystem,
    val memoryManager: MemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging {
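
Since SparkEnv is exposed through that global accessor, here is a minimal local-mode sketch of reaching it from user code via SparkEnv.get (the object name and app name are made up for illustration; the fields printed are ones declared in the constructor above):

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object SparkEnvAccessDemo {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext builds the driver-side SparkEnv behind the scenes
    val sc = new SparkContext(new SparkConf().setAppName("env-demo").setMaster("local[2]"))
    val env = SparkEnv.get                                 // the shared, globally visible instance
    println(s"executorId         = ${env.executorId}")     // "driver" on the driver side
    println(s"default serializer = ${env.serializer.getClass.getName}")
    println(s"shuffle manager    = ${env.shuffleManager.getClass.getName}")
    sc.stop()
  }
}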
Creating the driver's SparkEnv

/**
 * Create a SparkEnv for the driver.
 */
private[spark] def createDriverEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus,
    numCores: Int,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  // First assert that DRIVER_HOST_ADDRESS is set
  assert(conf.contains(DRIVER_HOST_ADDRESS),
    s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
  // Assert that spark.driver.port is set
  assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
  // Get the bind address
  val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
  // Get the host (advertise) address
  val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
  val port = conf.get("spark.driver.port").toInt
  // Create an I/O encryption key if transport encryption is enabled
  val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
    Some(CryptoStreamUtils.createKey(conf))
  } else {
    None
  }
  // Call the common create() method with the relevant parameters
  create(
    conf,
    SparkContext.DRIVER_IDENTIFIER,
    bindAddress,
    advertiseAddress,
    Option(port),
    isLocal,
    numCores,
    ioEncryptionKey,
    listenerBus = listenerBus,
    mockOutputCommitCoordinator = mockOutputCommitCoordinator
  )
}
Creating an executor's SparkEnv

An additional note: only in coarse-grained mode is this called from CoarseGrainedExecutorBackend; Mesos, for example, also supports a fine-grained mode with a different call path. Spark on YARN, however, only supports coarse-grained mode, because once a Container has been started its resources cannot be scaled dynamically.
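
As a hedged illustration of that fixed-resource model (names and numbers below are made up, not recommendations), a coarse-grained job on YARN requests its executor containers up front through configuration like this:

import org.apache.spark.SparkConf

// Each executor container is requested with fixed cores and memory; once granted,
// those resources stay with the container for its lifetime.
val yarnConf = new SparkConf()
  .setAppName("coarse-grained-demo")
  .setMaster("yarn")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")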

/**
 * Create a SparkEnv for an executor.
 * In coarse-grained mode, the executor provides an RpcEnv that is already instantiated.
 */
private[spark] def createExecutorEnv(
    conf: SparkConf,
    executorId: String,
    hostname: String,
    numCores: Int,
    ioEncryptionKey: Option[Array[Byte]],
    isLocal: Boolean): SparkEnv = {
  // Create the SparkEnv
  val env = create(
    conf,
    executorId,
    hostname,
    hostname,
    None,
    isLocal,
    numCores,
    ioEncryptionKey
  )
  // Save the reference in the global holder
  SparkEnv.set(env)
  // Return the env
  env
}
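
To see the executor-side environment from user code, here is a small sketch (reusing the sc from the earlier sketch; in local mode every task will simply report "driver"):

// Each task runs inside an executor JVM whose SparkEnv was built by createExecutorEnv,
// so SparkEnv.get inside the closure returns that executor's environment.
val idsSeenByTasks = sc.parallelize(1 to 4, numSlices = 4)
  .map(_ => org.apache.spark.SparkEnv.get.executorId)
  .distinct()
  .collect()
idsSeenByTasks.foreach(println)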

The common SparkEnv creation method uses the passed-in executorId to decide whether this is the driver (isDriver) and branches accordingly.

It first constructs the RpcEnv and the objects it depends on.

RpcEnv is the execution environment for communication across the cluster (mainly between an Endpoint and an EndpointRef). Since 2.0 its underlying implementation is entirely Netty-based; a later chapter will dissect it in depth.

/**
 * Helper method to create a SparkEnv for a driver or an executor.
 */
private def create(
    conf: SparkConf,
    executorId: String,
    bindAddress: String,
    advertiseAddress: String,
    port: Option[Int],
    isLocal: Boolean,
    numUsableCores: Int,
    ioEncryptionKey: Option[Array[Byte]],
    listenerBus: LiveListenerBus = null,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  // Determine whether this is the driver from the executorId
  val isDriver = executorId == SparkContext.DRIVER_IDENTIFIER

  // Listener bus is only used on the driver
  if (isDriver) {
    // Assert that the listener bus is present; it only exists on the driver
    assert(listenerBus != null, "Attempted to create driver SparkEnv with null listener bus!")
  }
  // Mainly used for security settings, e.g. generating a secret key for authentication when YARN is the resource scheduler
  val securityManager = new SecurityManager(conf, ioEncryptionKey)
  ioEncryptionKey.foreach { _ =>
    if (!securityManager.isEncryptionEnabled()) {
      logWarning("I/O encryption enabled without RPC encryption: keys will be visible on the " +
        "wire.")
    }
  }
  // Pick the corresponding system name for the driver or the executor
  val systemName = if (isDriver) driverSystemName else executorSystemName
  // Construct the RpcEnv
  val rpcEnv = RpcEnv.create(systemName, bindAddress, advertiseAddress, port.getOrElse(-1), conf,
    securityManager, clientMode = !isDriver)
Next the serializer is built, again by loading the configured serializer class via Java reflection. You can also configure Kryo up front (the serializer recommended by the official docs: often up to 10x faster than Java serialization, but it does not support all classes, and classes serialized with Kryo should be registered; it still runs without registration, but wastes a lot of space).

// Instantiate the serializer; the default is still JavaSerializer
val serializer = instantiateClassFromConf[Serializer](
  "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
logDebug(s"Using serializer: ${serializer.getClass}")
// Build a SerializerManager around it; it decides which serializer is used for what
// Since Spark 2.0, Kryo is used internally when shuffling simple types, arrays of simple types, and strings
val serializerManager = new SerializerManager(serializer, conf, ioEncryptionKey)

val closureSerializer = new JavaSerializer(conf)
The arguments can be read as a key/value pair: the key is the config property, and the value is the default class name used when that property is not set.

// Create an instance of the class named by the given SparkConf property, or defaultClassName
// if the property is not set, possibly initializing it with our conf
def instantiateClassFromConf[T](propertyName: String, defaultClassName: String): T = {
  instantiateClass[T](conf.get(propertyName, defaultClassName))
}
// Create an instance of the class with the given name, possibly initializing it with our conf
def instantiateClass[T](className: String): T = {
  // Load the class by name via reflection
  val cls = Utils.classForName(className)
  // Look for a constructor taking a SparkConf and a boolean isDriver, then one taking just
  // SparkConf, then one taking no arguments
  try {
    // Use the Java reflection API to look up the (SparkConf, Boolean) constructor
    cls.getConstructor(classOf[SparkConf], java.lang.Boolean.TYPE)
      // then instantiate it
      .newInstance(conf, new java.lang.Boolean(isDriver))
      // and cast the result to T
      .asInstanceOf[T]
  } catch {
    case _: NoSuchMethodException =>
      try {
        cls.getConstructor(classOf[SparkConf]).newInstance(conf).asInstanceOf[T]
      } catch {
        case _: NoSuchMethodException =>
          cls.getConstructor().newInstance().asInstanceOf[T]
      }
  }
}
/** Preferred alternative to Class.forName(className) */
def classForName(className: String): Class[_] = {
  // Ultimately implemented via Java reflection
  Class.forName(className, true, getContextOrSparkClassLoader)
  // scalastyle:on classforname
}
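
The same try-in-order constructor lookup can be reproduced outside Spark. Here is a self-contained sketch with a hypothetical Greeter class (nothing to do with Spark's own classes) that falls back from a two-argument constructor to a one-argument and then a no-argument one:

// Hypothetical class standing in for a pluggable component configured by class name
class Greeter(val prefix: String, val loud: Boolean) {
  def this(prefix: String) = this(prefix, false)
  def this() = this("hello")
}

def instantiateByName[T](className: String, prefix: String, loud: Boolean): T = {
  val cls = Class.forName(className)
  try {
    // Prefer the (String, boolean) constructor...
    cls.getConstructor(classOf[String], java.lang.Boolean.TYPE)
      .newInstance(prefix, java.lang.Boolean.valueOf(loud))
      .asInstanceOf[T]
  } catch {
    case _: NoSuchMethodException =>
      try {
        // ...then the single-argument one...
        cls.getConstructor(classOf[String]).newInstance(prefix).asInstanceOf[T]
      } catch {
        case _: NoSuchMethodException =>
          // ...and finally the no-argument one
          cls.getConstructor().newInstance().asInstanceOf[T]
      }
  }
}

val greeter = instantiateByName[Greeter](classOf[Greeter].getName, "hi", loud = true)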
The official documentation's description of Kryo:

Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

  • Java serialization: By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.
  • Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
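
Following the docs quoted above, switching to Kryo is a one-line configuration change, and registering application classes is optional but recommended. A minimal sketch (MyRecord is a hypothetical application class):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, tags: Array[String])

val kryoConf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional, but avoids writing full class names into every serialized record
  .registerKryoClasses(Array(classOf[MyRecord]))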

Next comes the broadcast machinery. Broadcast variables are a very practical component: they can be used to optimize performance when joining a large RDD against a small table, or when reading shared configuration or lookup tables.

// Construct the BroadcastManager
// The default factory is TorrentBroadcastFactory
val broadcastManager = new BroadcastManager(isDriver, conf, securityManager)
The default implementation now uses a torrent-like algorithm, which avoids the driver single-point bottleneck of the old HttpBroadcast.

The idea: the data to be broadcast is split into multiple blocks and distributed to some of the executors that need to read it; other nodes that need it can then fetch the blocks from the executors that already have them. This avoids the single point of the old HTTP mode, where every node fetched from the driver.

private def initialize() {
  synchronized {
    if (!initialized) {
      //  Earlier versions used HTTP broadcast, which created a single-point bottleneck on the driver,
      //  so it was replaced with the torrent-based implementation
      broadcastFactory = new TorrentBroadcastFactory
      broadcastFactory.initialize(isDriver, conf, securityManager)
      initialized = true
    }
  }
}
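
From the application's point of view, all of this sits behind SparkContext.broadcast. A minimal usage sketch (the lookup table is a made-up stand-in for a "small table" or shared config; sc is the SparkContext from the earlier sketch):

// The broadcast value is shipped to executors block-by-block via the torrent mechanism,
// instead of once per task and instead of every node fetching it from the driver.
val countryNames = Map("CN" -> "China", "US" -> "United States")
val bcCountries = sc.broadcast(countryNames)

val decoded = sc.parallelize(Seq("CN", "US", "CN"))
  .map(code => bcCountries.value.getOrElse(code, "unknown"))
  .collect()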
Next the MapOutputTracker is constructed; it either looks up the corresponding reference on the driver or registers itself. RpcEndpoint and RpcEndpointRef are a trait and an abstract class under org.apache.spark.rpc; the relationship between them is covered in detail in the RpcEnv chapter.

// Create the appropriate MapOutputTracker depending on whether this is the driver
val mapOutputTracker = if (isDriver) {
  new MapOutputTrackerMaster(conf, broadcastManager, isLocal)
} else {
  new MapOutputTrackerWorker(conf)
}

// Have to assign trackerEndpoint after initialization as MapOutputTrackerEndpoint
// requires the MapOutputTracker itself
// The tracker instance calls registerOrLookupEndpoint to either register itself (driver) or obtain the driver-side ref
mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
  new MapOutputTrackerMasterEndpoint(
    rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))
registerOrLookupEndpoint is the same method through which the BlockManagerMaster and the outputCommitCoordinatorRef later obtain their refs:

// Register the endpoint on the driver, or look up its driver-side ref from an executor
// The parameters are the endpoint name and a by-name RpcEndpoint
def registerOrLookupEndpoint(
    name: String, endpointCreator: => RpcEndpoint):
  RpcEndpointRef = {
  if (isDriver) {
    logInfo("Registering " + name)
    // On the driver, register the endpoint and return a reference to it
    rpcEnv.setupEndpoint(name, endpointCreator)
  } else {
    // On an executor, look up the driver-side reference
    RpcUtils.makeDriverRef(name, conf, rpcEnv)
  }
}
Next the corresponding ShuffleManager is constructed; the default is the sort-based shuffle (earlier Spark versions used hash-based shuffle).

// Let the user specify short names for shuffle managers
// Map of the user-facing short names "sort" and "tungsten-sort" to their ShuffleManager classes
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
// The default shuffle manager is the sort-based one
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
// Resolve the ShuffleManager class name
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
// Instantiate the corresponding ShuffleManager by reflection from the resolved name
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
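
In other words, the shuffle implementation is chosen purely by configuration, and in 2.0.x both short names resolve to the same class. A minimal sketch of the setting involved:

import org.apache.spark.SparkConf

// "sort" is already the default; "tungsten-sort" maps to the same SortShuffleManager in 2.0.x
val shuffleConf = new SparkConf().set("spark.shuffle.manager", "sort")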
By default the unified memory manager is initialized; the difference from the older static memory manager is that storage memory and execution memory can dynamically borrow each other's space according to its borrowing rules.

// Unified memory management was introduced in Spark 1.6, so this defaults to false
val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
val memoryManager: MemoryManager =
  if (useLegacyMemoryManager) {
    new StaticMemoryManager(conf, numUsableCores)
  } else {
    // The default is the unified memory manager
    UnifiedMemoryManager(conf, numUsableCores)
  }
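
For reference, these are the configuration knobs behind the unified memory manager (values shown are the 2.0.x defaults, listed only for illustration, not as tuning advice):

import org.apache.spark.SparkConf

val memConf = new SparkConf()
  .set("spark.memory.useLegacyMode", "false")   // keep UnifiedMemoryManager (the default)
  .set("spark.memory.fraction", "0.6")          // share of (heap - 300MB) used for execution + storage
  .set("spark.memory.storageFraction", "0.5")   // portion of that region protected from eviction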
Finally, the BlockManager, which manages data storage and transfer, is initialized along with its related components.

// Initialize NettyBlockTransferService, which extends BlockTransferService and is
// responsible for fetching and uploading blocks; it too is built on top of Netty
val blockTransferService =
  new NettyBlockTransferService(conf, securityManager, bindAddress, advertiseAddress,
    blockManagerPort, numUsableCores)
// Construct a BlockManagerMaster. A common misconception online is that only the driver has a master;
// in fact it is created not only on the driver but also on the workers: every node follows the
// master-slave structure and has its own slaves.
// The three parameters are the driver-side ref, the global SparkConf, and whether this is the driver
val blockManagerMaster = new BlockManagerMaster(registerOrLookupEndpoint(
  BlockManagerMaster.DRIVER_ENDPOINT_NAME,
  new BlockManagerMasterEndpoint(rpcEnv, isLocal, conf, listenerBus)),
  conf, isDriver)

// NB: blockManager is not valid until initialize() is called later.
// Construct the BlockManager, the main component for managing data. As its parameter list shows,
// it interacts with many other components and is especially important; a later chapter analyzes it in depth
val blockManager = new BlockManager(executorId, rpcEnv, blockManagerMaster,
  serializerManager, conf, memoryManager, mapOutputTracker, shuffleManager,
  blockTransferService, securityManager, numUsableCores)
// Use the mock coordinator if one was provided (for tests); otherwise create a real one
val outputCommitCoordinator = mockOutputCommitCoordinator.getOrElse {
  new OutputCommitCoordinator(conf, isDriver)
}
val outputCommitCoordinatorRef = registerOrLookupEndpoint("OutputCommitCoordinator",
  new OutputCommitCoordinatorEndpoint(rpcEnv, outputCommitCoordinator))
outputCommitCoordinator.coordinatorRef = Some(outputCommitCoordinatorRef)
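
Seen from user code, the BlockManager is what backs RDD caching: persisting an RDD stores its partitions as blocks in each executor's BlockManager, and the driver can query a storage-memory summary through the BlockManagerMaster. A minimal sketch (reusing sc from the earlier sketch):

import org.apache.spark.storage.StorageLevel

val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   // the first action materializes the partitions and stores them as blocks

// Driver-side view of each block manager's storage memory, reported via the BlockManagerMaster
sc.getExecutorMemoryStatus.foreach { case (blockManagerId, (maxMem, remainingMem)) =>
  println(s"$blockManagerId: $remainingMem of $maxMem bytes free")
}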
