Spark Execution Environment: the Broadcast Manager (BroadcastManager)

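BroadcastManager stores configuration information, together with serialized RDDs, Jobs, ShuffleDependency objects and similar data, on the local node. For fault tolerance, the data may also be replicated to other nodes. SparkEnv creates the BroadcastManager as follows: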

//org.apache.spark.SparkEnv
val broadcastManager = new BroadcastManager(isDriver, conf, securityManager)

Besides the three member fields defined by its constructor (isDriver, conf and securityManager), BroadcastManager has three more internal members:

// Whether BroadcastManager has finished initializing
private var initialized = false
// The broadcast factory instance
private var broadcastFactory: BroadcastFactory = null
// ID to assign to the next broadcast object (an AtomicLong)
private val nextBroadcastId = new AtomicLong(0)

BroadcastManager calls its own initialize method while it is being constructed, and it becomes usable once initialize has finished. The initialize method of BroadcastManager is implemented as follows:

//org.apache.spark.broadcast.BroadcastManager
private def initialize() {
  synchronized {
    if (!initialized) {
      broadcastFactory = new TorrentBroadcastFactory
      broadcastFactory.initialize(isDriver, conf, securityManager)
      initialized = true
    }
  }
}

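As the code shows, initialize first checks whether BroadcastManager has already been initialized, which guarantees that it is initialized only once. If it has not, the method creates a TorrentBroadcastFactory as the broadcast factory instance, calls the factory's initialize method to initialize it, and finally marks BroadcastManager itself as initialized.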
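
The initialize-once pattern described above, together with the AtomicLong ID counter listed earlier, can be sketched in isolation. This is a minimal illustration, not Spark code: DemoFactory, DemoManager, initCount and factoryInitCount are hypothetical names introduced only for this example.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for BroadcastFactory; it counts how often it is initialized.
class DemoFactory {
  var initCount = 0
  def initialize(): Unit = initCount += 1
  def newBroadcast[T](value: T, id: Long): (Long, T) = (id, value)
}

// Hypothetical stand-in for BroadcastManager.
class DemoManager {
  private var initialized = false
  private var factory: DemoFactory = null
  private val nextId = new AtomicLong(0)

  // Runs during construction, after the fields above have been set.
  initialize()

  // The guard inside the synchronized block makes repeated calls harmless.
  def initialize(): Unit = synchronized {
    if (!initialized) {
      factory = new DemoFactory
      factory.initialize()
      initialized = true
    }
  }

  def factoryInitCount: Int = factory.initCount

  // Delegates to the factory and hands out increasing IDs starting at 0.
  def newBroadcast[T](value: T): (Long, T) =
    factory.newBroadcast(value, nextId.getAndIncrement())
}
```

Calling initialize again on an existing DemoManager leaves the factory untouched, and each newBroadcast call receives the next ID from the counter.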

BroadcastManager provides three methods:

//org.apache.spark.broadcast.BroadcastManager
def stop() {
  broadcastFactory.stop()
}
private val nextBroadcastId = new AtomicLong(0)
def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean): Broadcast[T] = {
  broadcastFactory.newBroadcast[T](value_, isLocal, nextBroadcastId.getAndIncrement())
}
def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean) {
  broadcastFactory.unbroadcast(id, removeFromDriver, blocking)
}

As the code shows, each of the three BroadcastManager methods delegates to the corresponding method of TorrentBroadcastFactory. The three methods of TorrentBroadcastFactory are implemented as follows:

//org.apache.spark.broadcast.TorrentBroadcastFactory
override def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean, id: Long): Broadcast[T] = {
  new TorrentBroadcast[T](value_, id)
}
override def stop() { }
override def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean) {
  TorrentBroadcast.unpersist(id, removeFromDriver, blocking)
}

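The code shows that the newBroadcast method of TorrentBroadcastFactory creates a TorrentBroadcast instance, whose job is to broadcast the value it wraps. On the surface this is only a constructor call, but the constructor does far more than that. A TorrentBroadcast object has the following fields: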

//org.apache.spark.broadcast.TorrentBroadcast
@transient private lazy val _value: T = readBroadcastBlock()
@transient private var compressionCodec: Option[CompressionCodec] = _
@transient private var blockSize: Int = _
private val broadcastId = BroadcastBlockId(id)
private val numBlocks: Int = writeBlocks(obj)
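  • _value: the value of the broadcast block read from an Executor or the Driver. _value is the broadcast object obtained by calling readBroadcastBlock. Because _value is declared as a lazy val, readBroadcastBlock is not called when the TorrentBroadcast instance is constructed; it is called only when the value of _value is actually needed.
  • compressionCodec: the compression codec used for the broadcast object. It is enabled by setting the spark.broadcast.compress property to true, which is the default.
  • blockSize: the size of each block. It is a read-only property that can be configured through the spark.broadcast.blockSize property and defaults to 4MB.
  • broadcastId: the broadcast ID, which is an instance of the case class BroadcastBlockId, defined as: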
//org.apache.spark.storage.BlockId
case class BroadcastBlockId(broadcastId: Long, field: String = "") extends BlockId {
  override def name: String = "broadcast_" + broadcastId + (if (field == "") "" else "_" + field)
}
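  • numBlocks: the number of blocks that make up the broadcast variable. numBlocks is obtained by calling writeBlocks. Because numBlocks is an immutable val, writeBlocks is called while the TorrentBroadcast instance is being constructed, and it writes the broadcast object into the storage system.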
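
The difference between the lazy val _value and the plain val numBlocks decides when reading and writing happen, so it is worth seeing in a small standalone sketch. DemoBroadcast and its events buffer are hypothetical names used only here, and DemoBroadcastBlockId is a simplified copy of the naming rule quoted above that does not extend BlockId.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified copy of the naming rule of BroadcastBlockId (illustration only).
case class DemoBroadcastBlockId(broadcastId: Long, field: String = "") {
  def name: String = "broadcast_" + broadcastId + (if (field == "") "" else "_" + field)
}

// Hypothetical class that records the order in which its members are evaluated.
class DemoBroadcast {
  val events = ArrayBuffer.empty[String]

  // Evaluated on first access only, like _value.
  lazy val value: String = { events += "read"; "payload" }

  // Evaluated during construction, like numBlocks.
  val numBlocks: Int = { events += "write"; 1 }
}
```

Right after construction only the "write" event has been recorded. The "read" event appears on the first access to value and is not repeated on later accesses.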

1 Writing the broadcast object

As mentioned above, writeBlocks is called while the TorrentBroadcast instance is being constructed. It is implemented as follows:

private def writeBlocks(value: T): Int = {
  import StorageLevel._
  val blockManager = SparkEnv.get.blockManager
  if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false)) {
    throw new SparkException(s"Failed to store $broadcastId in BlockManager")
  }
  val blocks =
    TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
  blocks.zipWithIndex.foreach { case (block, i) =>
    val pieceId = BroadcastBlockId(id, "piece" + i)
    val bytes = new ChunkedByteBuffer(block.duplicate())
    if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true)) {
      throw new SparkException(s"Failed to store $pieceId of $broadcastId in local BlockManager")
    }
  }
  blocks.length
}

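writeBlocks performs the following steps:

  • 1) Get the BlockManager component of the current SparkEnv.
  • 2) Call the putSingle method of BlockManager to write the broadcast object into the local storage system. When Spark runs in local mode, the broadcast object is written to the Driver's local storage system so that tasks can also run on the Driver. Because the _replication property of the StorageLevel behind MEMORY_AND_DISK is fixed at 1, the broadcast object is written only to the local storage system of the Driver or Executor.
  • 3) Call the blockifyObject method of TorrentBroadcast to convert the object into a series of blocks. The size of each block is determined by blockSize, serialization uses the serializer of the current SparkEnv (JavaSerializer by default), and compression uses the compressionCodec of the TorrentBroadcast itself.
  • 4) For each block: generate a BroadcastBlockId for the piece, where pieces are distinguished by the field property of BroadcastBlockId (piece0, piece1, ...), then call the putBytes method of BlockManager to write the piece in serialized form into the local storage system. Because the _replication property of the StorageLevel behind MEMORY_AND_DISK_SER is also fixed at 1, the piece is written only to the local storage system of the Driver or Executor.
  • 5) Return the number of blocks.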
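
Steps 3 and 4 can be illustrated with a simplified sketch that splits already-serialized bytes into pieces and names them. This is an assumption-laden illustration rather than the real blockifyObject, which also serializes and compresses through streams. BlockifySketch, blockify and pieceNames are hypothetical names.

```scala
import java.nio.ByteBuffer

object BlockifySketch {
  // Split already-serialized bytes into pieces of at most blockSize bytes.
  def blockify(bytes: Array[Byte], blockSize: Int): Array[ByteBuffer] = {
    require(blockSize > 0, "blockSize must be positive")
    bytes.grouped(blockSize).map(chunk => ByteBuffer.wrap(chunk)).toArray
  }

  // Build piece names following the BroadcastBlockId naming rule: piece0, piece1, ...
  def pieceNames(broadcastId: Long, numBlocks: Int): Seq[String] =
    (0 until numBlocks).map(i => s"broadcast_${broadcastId}_piece$i")
}
```

With 10 bytes and a block size of 4, the sketch yields three pieces of 4, 4 and 2 bytes, so only the last piece can be shorter than blockSize.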

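Based on the analysis above, the write process of a broadcast object can be summarized in a figure:

[Figure: write process of a broadcast object]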

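2 Reading the broadcast object

As mentioned earlier, readBroadcastBlock is called to obtain the value only when the _value field of a TorrentBroadcast instance is actually needed. readBroadcastBlock is implemented as follows: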

//org.apache.spark.broadcast.TorrentBroadcast
private def readBroadcastBlock(): T = Utils.tryOrIOException {
  TorrentBroadcast.synchronized {
    setConf(SparkEnv.get.conf)
    val blockManager = SparkEnv.get.blockManager
    blockManager.getLocalValues(broadcastId).map(_.data.next()) match {
      case Some(x) =>
        releaseLock(broadcastId)
        x.asInstanceOf[T]
      case None =>
        logInfo("Started reading broadcast variable " + id)
        val startTimeMs = System.currentTimeMillis()
        val blocks = readBlocks().flatMap(_.getChunks())
        logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs))
        val obj = TorrentBroadcast.unBlockifyObject[T](
          blocks, SparkEnv.get.serializer, compressionCodec)
        val storageLevel = StorageLevel.MEMORY_AND_DISK
        if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) {
          throw new SparkException(s"Failed to store $broadcastId in BlockManager")
        }
        obj
    }
  }
}

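According to the code above, readBroadcastBlock performs the following steps:

  • 1) Get the BlockManager component of the current SparkEnv.
  • 2) Call the getLocalValues method of BlockManager to fetch the broadcast object from the local storage system, that is, the broadcast object that was written there by the putSingle method of BlockManager.
  • 3) If the broadcast object is found in the local storage system, call releaseLock to release the lock on the block and return the broadcast object. The lock protects a block while a running task is using it, and it should be released once the task no longer needs the block.
  • 4) If the broadcast object is not found in the local storage system, the data is available only as pieces written in serialized form by the putBytes method of BlockManager. In that case, first call readBlocks to fetch the broadcast blocks from the storage system of the Driver or an Executor, then call the unBlockifyObject method of TorrentBroadcast to turn the series of blocks back into the original broadcast object, and finally call putSingle on BlockManager again to write the broadcast object into the local storage system, so that other tasks on the same Executor do not have to fetch it again.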
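
The control flow of these steps (look locally, otherwise rebuild from pieces and cache the result) can be sketched without Spark. ReadPathSketch, fetchPieces and fetchCount are hypothetical names. A mutable Map stands in for the local storage system and UTF-8 decoding stands in for unBlockifyObject.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for the read path of a broadcast value.
class ReadPathSketch(fetchPieces: () => Seq[Array[Byte]]) {
  private val localValues = mutable.Map.empty[String, String]
  var fetchCount = 0

  def read(blockName: String): String = synchronized {
    localValues.get(blockName) match {
      case Some(v) =>
        v // found locally: return it directly
      case None =>
        fetchCount += 1
        val bytes = Array.concat(fetchPieces(): _*) // join the pieces in order
        val obj = new String(bytes, "UTF-8")        // stand-in for unBlockifyObject
        localValues(blockName) = obj                // cache for later readers
        obj
    }
  }
}
```

The second read of the same block name is served from the local map, so the pieces are fetched only once.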

The readBlocks method called in the code above fetches blocks from the storage system of the Driver or an Executor. It is implemented as follows:

//org.apache.spark.broadcast.TorrentBroadcast
private def readBlocks(): Array[ChunkedByteBuffer] = {
  val blocks = new Array[ChunkedByteBuffer](numBlocks)
  val bm = SparkEnv.get.blockManager
  for (pid <- Random.shuffle(Seq.range(0, numBlocks))) {
    val pieceId = BroadcastBlockId(id, "piece" + pid)
    logDebug(s"Reading piece $pieceId of $broadcastId")
    bm.getLocalBytes(pieceId) match {
      case Some(block) =>
        blocks(pid) = block
        releaseLock(pieceId)
      case None =>
        bm.getRemoteBytes(pieceId) match {
          case Some(b) =>
            if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)) {
              throw new SparkException(
                s"Failed to store $pieceId of $broadcastId in local BlockManager")
            }
            blocks(pid) = b
          case None =>
            throw new SparkException(s"Failed to get $pieceId of $broadcastId")
        }
    }
  }
  blocks
}

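readBlocks performs the following steps:

  • 1) Create the blocks array that will hold every broadcast piece, and get the BlockManager component of the current SparkEnv.
  • 2) Shuffle the piece indexes randomly to avoid "hot spots" when fetching broadcast blocks, which improves performance. Perform steps 3 and 4 for each piece in the shuffled order.
  • 3) Call the getLocalBytes method of BlockManager to fetch the serialized piece from the local storage system. If it is available locally, put the piece into blocks and call releaseLock to release the lock on it.
  • 4) If the piece is not available locally, call the getRemoteBytes method of BlockManager to fetch it from a remote storage system, then call the putBytes method of BlockManager to write the piece into the local storage system so that other tasks on the same Executor do not have to fetch it again, and finally put the piece into blocks.
  • 5) Return all the pieces held in blocks.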
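
The loop over shuffled piece indexes can be sketched as follows. ReadBlocksSketch and its local and remote parameters are hypothetical: two maps keyed by piece index stand in for the local and remote storage systems, and locking is left out.

```scala
import scala.collection.mutable
import scala.util.Random

object ReadBlocksSketch {
  def readBlocks(
      numBlocks: Int,
      local: mutable.Map[Int, Array[Byte]],
      remote: Map[Int, Array[Byte]]): Array[Array[Byte]] = {
    val blocks = new Array[Array[Byte]](numBlocks)
    // Visit the pieces in random order; each piece still lands at its own index.
    for (pid <- Random.shuffle(Seq.range(0, numBlocks))) {
      blocks(pid) = local.get(pid) match {
        case Some(b) => b
        case None =>
          val b = remote.getOrElse(
            pid, throw new IllegalStateException(s"Failed to get piece$pid"))
          local(pid) = b // keep a local copy for later readers
          b
      }
    }
    blocks
  }
}
```

Because each piece is stored at blocks(pid), the random fetch order does not affect the order of the returned array.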

3 Unpersisting the broadcast object

//org.apache.spark.broadcast.TorrentBroadcast
def unpersist(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit = {
  logDebug(s"Unpersisting TorrentBroadcast $id")
  SparkEnv.get.blockManager.master.removeBroadcast(id, removeFromDriver, blocking)
}

As the code above shows, the unpersist method of TorrentBroadcast calls the removeBroadcast method of BlockManagerMaster, a subcomponent of BlockManager, to unpersist the broadcast object.
