In the previous sections we analyzed Block's unique identifiers, the abstract representation of data, BlockInfo management, and how Blocks are written to disk and memory. In this section we look at how the Driver and Executors exchange Block messages and manage Block information across the cluster.
BlockManagerInfo
The Driver keeps a HashMap that records the metadata of every BlockManager. That metadata is modeled by BlockManagerInfo, which tracks the BlockStatus of every block managed by the corresponding BlockManager, along with the sizes and overall usage of its on-heap and off-heap memory. The Driver keeps this information up to date through its communication with Executors: whenever a block on an Executor changes, the Executor must send an UpdateBlockInfo event to refresh the block information on the Driver. BlockManagerInfo also provides interfaces for fetching and removing block information. Its basic fields are shown below:
private[spark] class BlockManagerInfo(
    val blockManagerId: BlockManagerId,
    timeMs: Long,
    val maxOnHeapMem: Long,
    val maxOffHeapMem: Long,
    val storageEndpoint: RpcEndpointRef, // the slave node's RpcEndpoint
    val externalShuffleServiceBlockStatus: Option[JHashMap[BlockId, BlockStatus]])
  extends Logging {

  val maxMem = maxOnHeapMem + maxOffHeapMem
  // Whether the external shuffle service is enabled
  val externalShuffleServiceEnabled = externalShuffleServiceBlockStatus.isDefined
  // Timestamp of the last access to this BlockManagerInfo
  private var _lastSeenMs: Long = timeMs
  // Remaining available memory of the BlockManager this BlockManagerInfo describes
  private var _remainingMem: Long = maxMem
  // BlockStatus of every block managed by the corresponding BlockManager
  private val _blocks = new JHashMap[BlockId, BlockStatus]
}
Updating block metadata
When BlockManagerMasterEndpoint receives an UpdateBlockInfo message, it updates the metadata recorded for the corresponding BlockManagerId, delegating most of the work to BlockManagerInfo. The update proceeds through the following steps:
- Check whether the block already exists. If it does and its original storage level includes memory, return the memory it was using to the pool.
- If the new storage level is valid, build a BlockStatus appropriate for that level and put it into the maintained _blocks map; if the external shuffle service is enabled, also update the BlockStatus held by the external shuffle service.
- If the new storage level is invalid and the block previously existed, remove the block.
def updateBlockInfo(
    blockId: BlockId,
    storageLevel: StorageLevel,
    memSize: Long,
    diskSize: Long): Unit = {
  updateLastSeenMs()
  val blockExists = _blocks.containsKey(blockId)
  var originalMemSize: Long = 0
  var originalDiskSize: Long = 0
  var originalLevel: StorageLevel = StorageLevel.NONE
  if (blockExists) {
    // Fetch the block's current BlockStatus
    val blockStatus: BlockStatus = _blocks.get(blockId)
    // Remember the old values
    originalLevel = blockStatus.storageLevel
    originalMemSize = blockStatus.memSize
    originalDiskSize = blockStatus.diskSize
    if (originalLevel.useMemory) {
      /**
       * The old storage level used memory and is about to change, so first add
       * the old memory size back to _remainingMem, i.e. return the memory the
       * block was using. Later the new memory size is subtracted from
       * _remainingMem again, i.e. that amount of memory is re-allocated to the block.
       */
      _remainingMem += originalMemSize
    }
  }
  // Check that the storage level is valid: it uses memory or disk storage,
  // and the replication count is at least 1
  if (storageLevel.isValid) {
    var blockStatus: BlockStatus = null
    if (storageLevel.useMemory) { // memory storage level
      // Build a new BlockStatus with disk size 0
      blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = 0)
      _blocks.put(blockId, blockStatus) // update the map
      _remainingMem -= memSize // the block now occupies memSize, so deduct it from the remaining memory
    }
    if (storageLevel.useDisk) { // disk storage level
      // Build a new BlockStatus with memory size 0
      blockStatus = BlockStatus(storageLevel, memSize = 0, diskSize = diskSize)
      _blocks.put(blockId, blockStatus) // update the _blocks map
    }
    // External shuffle service
    externalShuffleServiceBlockStatus.foreach { shuffleServiceBlocks =>
      if (!blockId.isBroadcast && blockStatus.diskSize > 0) {
        shuffleServiceBlocks.put(blockId, blockStatus)
      }
    }
  } else if (blockExists) { // invalid storage level: check whether this BlockManager holds the block
    _blocks.remove(blockId) // remove it from _blocks
    externalShuffleServiceBlockStatus.foreach { blockStatus =>
      blockStatus.remove(blockId)
    }
  }
}
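The "return the old memory, then charge the new memory" accounting above can be illustrated with a small self-contained model. This is a simplified sketch, not the actual Spark class: `Info`, `Status`, and `update` are hypothetical stand-ins that keep only the `_remainingMem` bookkeeping.

```scala
// Simplified model (NOT Spark's BlockManagerInfo) of the memory accounting:
// an update first returns the old memSize to remainingMem, then deducts the new one.
object MemAccountingSketch {
  final case class Status(useMemory: Boolean, memSize: Long)

  class Info(val maxMem: Long) {
    private val blocks = scala.collection.mutable.Map.empty[String, Status]
    var remainingMem: Long = maxMem

    def update(id: String, useMemory: Boolean, memSize: Long): Unit = {
      // Return the memory the old entry was charged for, if any
      blocks.get(id).filter(_.useMemory).foreach(old => remainingMem += old.memSize)
      if (useMemory) {
        blocks(id) = Status(useMemory = true, memSize)
        remainingMem -= memSize // charge the new size
      } else {
        blocks.remove(id) // invalid / non-memory level: drop the entry
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val info = new Info(maxMem = 100L)
    info.update("rdd_0_0", useMemory = true, memSize = 30L)
    assert(info.remainingMem == 70L)
    info.update("rdd_0_0", useMemory = true, memSize = 50L) // re-cache with a new size
    assert(info.remainingMem == 50L) // 70 + 30 (returned) - 50 (charged)
    info.update("rdd_0_0", useMemory = false, memSize = 0L) // dropped entirely
    assert(info.remainingMem == 100L)
  }
}
```

The two-step scheme makes a resize a single code path: an update that shrinks, grows, or drops a cached block all flow through the same return-then-charge sequence.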
Fetching block status
Fetching a block's status is simply a lookup in _blocks:
// Get the BlockStatus of the given block
def getStatus(blockId: BlockId): Option[BlockStatus] = Option(_blocks.get(blockId))
Removing blocks
When BlockManagerMasterEndpoint receives a RemoveRdd message, the files of the corresponding rddId are deleted, and removeBlock is called to do the removal. The logic is simple: update _blocks and the external shuffle service information, and if the block was stored in memory, return that memory.
def removeBlock(blockId: BlockId): Unit = {
  if (_blocks.containsKey(blockId)) {
    // Return the memory the block occupied to _remainingMem
    _remainingMem += _blocks.get(blockId).memSize
    _blocks.remove(blockId) // remove it from _blocks
    externalShuffleServiceBlockStatus.foreach { blockStatus =>
      blockStatus.remove(blockId)
    }
  }
}
BlockManagerMessages
Block information updates, removals, and location mappings between the Driver and Executors are all carried out through message passing. There are two message abstractions: master-to-slave messages extend ToBlockManagerMasterStorageEndpoint, and slave-to-master messages extend ToBlockManagerMaster.
The messages sent from the slave to the master fall into four categories: messages for BlockManager registration, shutdown, and heartbeats; messages for updating and fetching block status; messages for fetching storage, memory, and reference information; and messages such as checking whether an Executor is alive or removing one.
The messages sent from the master to the slave fall into three categories: block removal, block replication, and decommissioning a BlockManager.
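The two message families can be sketched as marker traits with the concrete messages as case classes. The names below follow Spark's BlockManagerMessages, but only a representative subset is shown and the payload types are simplified (plain strings instead of BlockId/BlockManagerId):

```scala
// Representative subset of BlockManagerMessages (simplified signatures)
object BlockManagerMessagesSketch {
  // Master -> slave (storage endpoint)
  sealed trait ToBlockManagerMasterStorageEndpoint
  case class RemoveBlock(blockId: String) extends ToBlockManagerMasterStorageEndpoint
  case class ReplicateBlock(blockId: String, maxReplicas: Int)
    extends ToBlockManagerMasterStorageEndpoint
  case object DecommissionBlockManager extends ToBlockManagerMasterStorageEndpoint

  // Slave -> master
  sealed trait ToBlockManagerMaster
  case class RegisterBlockManager(executorId: String) extends ToBlockManagerMaster
  case class UpdateBlockInfo(blockId: String, memSize: Long, diskSize: Long)
    extends ToBlockManagerMaster
  case class GetLocations(blockId: String) extends ToBlockManagerMaster
  case class RemoveExecutor(execId: String) extends ToBlockManagerMaster
}
```

Making each family a sealed trait lets the endpoints pattern-match exhaustively over the messages they are expected to handle.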
BlockManagerMasterEndpoint
BlockManagerMasterEndpoint is created by the SparkEnv on the Driver and registered with the Driver's RpcEnv; it exists only in the Driver's SparkEnv. The BlockManagerMaster on the Driver or on an Executor holds its RpcEndpointRef in the driverEndpoint property. Its main job is to receive the messages sent by each BlockManagerMaster and to manage all BlockManagers in one place: the BlockManagers on each node, the mapping between BlockManagers and Executors, and block locations (i.e. which BlockManager holds which block). Let us first look at its private member fields:
// Metadata of every BlockManager: its memory sizes and the blocks it manages
blockManagerInfo: mutable.Map[BlockManagerId, BlockManagerInfo]
// A block may be managed by several BlockManagers; map a BlockId to the set of BlockManagerIds holding it
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
// Block status tracked on behalf of each external shuffle service
private val blockStatusByShuffleService =
  new mutable.HashMap[BlockManagerId, JHashMap[BlockId, BlockStatus]]
// executor-id -> BlockManagerId, so a BlockManagerId can be looked up by executor id
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]
BlockManagerMasterEndpoint is primarily a message handler; below we walk through the main kinds of messages it processes.
BlockManager-related messages
Registration
Every BlockManager must register with the master when it is initialized. Registration goes through the following steps:
- The BlockManagerId built at BlockManager initialization carries no topology information, so the master fills it in;
- Record the executor-id to storage-directory mapping in the MasterEndpoint's executorIdToLocalDirs, which later serves location lookups for GetLocationsAndStatus messages;
- Record the executor-id to BlockManagerId mapping;
- If the external shuffle service is enabled, also record the mapping in blockStatusByShuffleService;
- Finally record the BlockManager's metadata in blockManagerInfo.
private def register(
    idWithoutTopologyInfo: BlockManagerId, // no topology info at registration; the master fills it in
    localDirs: Array[String],
    maxOnHeapMemSize: Long,
    maxOffHeapMemSize: Long,
    storageEndpoint: RpcEndpointRef): BlockManagerId = {
  // Fill in the topology info and add the id to blockManagerInfo
  val id = BlockManagerId(
    idWithoutTopologyInfo.executorId,
    idWithoutTopologyInfo.host,
    idWithoutTopologyInfo.port,
    topologyMapper.getTopologyForHost(idWithoutTopologyInfo.host)) // topologyMapper produces the topology info
  val time = System.currentTimeMillis()
  executorIdToLocalDirs.put(id.executorId, localDirs) // executorIdToLocalDirs maps executor id -> storage directories
  if (!blockManagerInfo.contains(id)) { // no BlockManagerInfo is managed for this BlockManagerId yet
    // Look up any old BlockManagerId registered under the same executor ID
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) => // an old one exists: remove it
        removeExecutor(id.executorId)
      case None =>
    }
    // Update the blockManagerIdByExecutor map
    blockManagerIdByExecutor(id.executorId) = id
    // External shuffle service: blockStatusByShuffleService records the shuffle-service state per BlockManager
    val externalShuffleServiceBlockStatus = if (externalShuffleServiceRddFetchEnabled) {
      val externalShuffleServiceBlocks = blockStatusByShuffleService
        .getOrElseUpdate(externalShuffleServiceIdOnHost(id), new JHashMap[BlockId, BlockStatus])
      Some(externalShuffleServiceBlocks)
    } else {
      None
    }
    // Update the blockManagerInfo map with a new BlockManagerInfo object
    blockManagerInfo(id) = new BlockManagerInfo(id, System.currentTimeMillis(),
      maxOnHeapMemSize, maxOffHeapMemSize, storageEndpoint, externalShuffleServiceBlockStatus)
  }
  // Post SparkListenerBlockManagerAdded to every SparkListener, for monitoring purposes
  listenerBus.post(SparkListenerBlockManagerAdded(time, id, maxOnHeapMemSize + maxOffHeapMemSize,
    Some(maxOnHeapMemSize), Some(maxOffHeapMemSize)))
  id
}
Decommissioning
The DecommissionBlockManagers message decommissions the BlockManagers of the given executor ids: it looks up the BlockManager metadata for each executor id and, through the corresponding endpoint ref, notifies the Executor to decommission its BlockManager.
def decommissionBlockManagers(blockManagerIds: Seq[BlockManagerId]): Future[Seq[Unit]] = {
  val newBlockManagersToDecommission = blockManagerIds.toSet.diff(decommissioningBlockManagerSet)
  val futures = newBlockManagersToDecommission.map { blockManagerId =>
    decommissioningBlockManagerSet.add(blockManagerId)
    val info = blockManagerInfo(blockManagerId)
    info.storageEndpoint.ask[Unit](DecommissionBlockManager)
  }
  Future.sequence(futures.toSeq)
}
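The fan-out pattern above (one ask per BlockManager, collapsed into a single future with Future.sequence, with a set-diff making repeated requests idempotent) can be sketched in isolation. `askDecommission` is a hypothetical stand-in for `info.storageEndpoint.ask[Unit](DecommissionBlockManager)`:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object DecommissionSketch {
  // Hypothetical stand-in for the RPC ask to a BlockManager's storage endpoint
  def askDecommission(id: String): Future[Unit] = Future {
    println(s"decommissioning $id")
  }

  def main(args: Array[String]): Unit = {
    val alreadyDecommissioning = Set("bm-1")
    val requested = Seq("bm-1", "bm-2", "bm-3")
    // Only notify BlockManagers not already being decommissioned
    val fresh = requested.toSet.diff(alreadyDecommissioning)
    // Fan out one ask per BlockManager, collapse the replies into one future
    val done: Future[Set[Unit]] = Future.sequence(fresh.map(askDecommission))
    Await.result(done, 5.seconds)
  }
}
```

Returning the aggregated future lets the caller decide whether to await all acknowledgements or proceed asynchronously.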
Block-related messages
Updating a Block
After a block is written, the Executor reports it to the master, which records it in the relevant BlockManagerInfo and in the location map so that the block's location can be queried later. The main steps are:
- If the block is a shuffle block, only the MapOutputTracker needs to be updated;
- Find the BlockManagerInfo that records metadata for the given blockManagerId and update it via its updateBlockInfo method;
- Record the location, covering both the executor's location and, if applicable, the external shuffle service's location.
private def updateBlockInfo(
    blockManagerId: BlockManagerId,
    blockId: BlockId, // block identifier
    storageLevel: StorageLevel, // storage level
    memSize: Long, // memory used
    diskSize: Long): Boolean = { // disk used
  if (blockId.isShuffle) { // shuffle files are tracked by MapOutputTracker, so little is done here
    blockId match {
      case ShuffleIndexBlockId(shuffleId, mapId, _) => // shuffle index file
        return true
      case ShuffleDataBlockId(shuffleId: Int, mapId: Long, reduceId: Int) => // shuffle data file
        mapOutputTracker.updateMapOutput(shuffleId, mapId, blockManagerId) // update mapOutputTracker
        return true
      case _ =>
        return false
    }
  }
  if (!blockManagerInfo.contains(blockManagerId)) { // check whether the BlockManager is known
    if (blockManagerId.isDriver && !isLocal) {
      return true
    } else {
      return false
    }
  }
  if (blockId == null) { // a null BlockId only refreshes the last-seen time in BlockManagerInfo
    blockManagerInfo(blockManagerId).updateLastSeenMs()
    return true
  }
  // Let the BlockManagerInfo update the block's status
  blockManagerInfo(blockManagerId).updateBlockInfo(blockId, storageLevel, memSize, diskSize)
  // A block may be managed by several BlockManagers: map the BlockId to a set of BlockManagerIds
  var locations: mutable.HashSet[BlockManagerId] = null
  if (blockLocations.containsKey(blockId)) {
    locations = blockLocations.get(blockId)
  } else {
    locations = new mutable.HashSet[BlockManagerId]
    blockLocations.put(blockId, locations)
  }
  // A valid storage level means this BlockManagerId is added to the block's locations
  if (storageLevel.isValid) {
    locations.add(blockManagerId)
  } else {
    locations.remove(blockManagerId)
  }
  // If the external shuffle service is enabled, its address also goes into locations
  if (blockId.isRDD && storageLevel.useDisk && externalShuffleServiceRddFetchEnabled) {
    val externalShuffleServiceId = externalShuffleServiceIdOnHost(blockManagerId)
    if (storageLevel.isValid) {
      locations.add(externalShuffleServiceId)
    } else {
      locations.remove(externalShuffleServiceId)
    }
  }
  // Remove the block from master tracking if it has been removed on all endpoints.
  if (locations.size == 0) {
    blockLocations.remove(blockId)
  }
  true
}
Querying Blocks
The GetLocations, GetLocationsAndStatus, GetLocationsMultipleBlockIds, GetBlockStatus, GetMatchingBlockIds, and GetReplicateInfoForRDDBlocks messages fetch the locations or status of one or more BlockIds. Location and status are represented as follows:
class BlockManagerId private (
    private var executorId_ : String,
    private var host_ : String,
    private var port_ : Int,
    private var topologyInfo_ : Option[String]) {
}

case class BlockStatus(storageLevel: StorageLevel, memSize: Long, diskSize: Long) {
  def isCached: Boolean = memSize + diskSize > 0
}
These handlers simply look up the storage information for the given BlockIds in the blockManagerInfo, blockLocations, and blockStatusByShuffleService structures maintained above and return it. The methods are all straightforward, so we do not analyze them one by one.
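As one example, handling GetLocations amounts to a lookup in blockLocations. The sketch below is a simplified, self-contained version of that lookup, with plain strings standing in for Spark's BlockId and BlockManagerId types:

```scala
import java.util.{HashMap => JHashMap}
import scala.collection.mutable

// Simplified GetLocations handling: look the block up in blockLocations
object GetLocationsSketch {
  type BlockId = String
  type BlockManagerId = String
  val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]

  // Return every BlockManager currently holding the block; empty if unknown
  def getLocations(blockId: BlockId): Seq[BlockManagerId] =
    Option(blockLocations.get(blockId)).map(_.toSeq).getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    blockLocations.put("rdd_1_0", mutable.HashSet("bm-a", "bm-b"))
    assert(getLocations("rdd_1_0").toSet == Set("bm-a", "bm-b"))
    assert(getLocations("rdd_9_9").isEmpty)
  }
}
```

Wrapping the Java map's nullable `get` in `Option` keeps the "block unknown" case explicit instead of propagating null.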
Removing Blocks
Removing block-related information mostly means forwarding a message to the Executors that perform the actual deletion. The main messages are:
/** Remove an RDD, a shuffle, a broadcast, a block, or an executor */
case RemoveRdd(rddId) =>
  context.reply(removeRdd(rddId))
case RemoveShuffle(shuffleId) =>
  context.reply(removeShuffle(shuffleId))
case RemoveBroadcast(broadcastId, removeFromDriver) =>
  context.reply(removeBroadcast(broadcastId, removeFromDriver))
case RemoveBlock(blockId) =>
  removeBlockFromWorkers(blockId)
  context.reply(true)
Removing a shuffle or a broadcast is just a matter of forwarding the message to the Executors. Removing an RDD is more involved:
- First find the locations of the rddId's blocks via blockLocations;
- Split the locations into external-shuffle-service locations and executor locations, then remove the metadata from blockManagerInfo and from the external shuffle service bookkeeping;
- Send messages to the Executors to delete the files of the given rddId;
- Delete the rddId's files held by the external shuffle service via the shuffle client's removeBlocks.
/** Delete the files of an rddId: remove the metadata first, then ask the executors or the external shuffle service to delete the RDD */
private def removeRdd(rddId: Int): Future[Seq[Int]] = {
  val removeMsg = RemoveRdd(rddId)
  // Find which BlockManagers hold the blocks; replication may give one block several locations
  val blocks = blockLocations.asScala.keys.flatMap(_.asRDDId).filter(_.rddId == rddId)
  val blocksToDeleteByShuffleService =
    new mutable.HashMap[BlockManagerId, mutable.HashSet[RDDBlockId]]
  // Remove the blockManagerInfo metadata first
  blocks.foreach { blockId =>
    val bms: mutable.HashSet[BlockManagerId] = blockLocations.remove(blockId)
    // Separate executors with the external shuffle service from those without
    val (bmIdsExtShuffle, bmIdsExecutor) = bms.partition(_.port == externalShuffleServicePort)
    val liveExecutorsForBlock = bmIdsExecutor.map(_.executorId).toSet
    // Executors with the external shuffle service: remove the entries in blockStatusByShuffleService
    bmIdsExtShuffle.foreach { bmIdForShuffleService =>
      // The shuffle service can still delete files even after the executor has been released
      if (!liveExecutorsForBlock.contains(bmIdForShuffleService.executorId)) {
        val blockIdsToDel = blocksToDeleteByShuffleService.getOrElseUpdate(bmIdForShuffleService,
          new mutable.HashSet[RDDBlockId]())
        blockIdsToDel += blockId
        blockStatusByShuffleService.get(bmIdForShuffleService).foreach { blockStatus =>
          blockStatus.remove(blockId)
        }
      }
    }
    // Executors without the external shuffle service: remove via blockManagerInfo
    bmIdsExecutor.foreach { bmId =>
      blockManagerInfo.get(bmId).foreach { bmInfo =>
        bmInfo.removeBlock(blockId)
      }
    }
  }
  // Send messages asking the executors to delete
  val removeRddFromExecutorsFutures = blockManagerInfo.values.map { bmInfo =>
    bmInfo.storageEndpoint.ask[Int](removeMsg).recover {
      handleBlockRemovalFailure("RDD", rddId.toString, bmInfo.blockManagerId, 0)
    }
  }.toSeq
  // Delete via the external shuffle service
  val removeRddBlockViaExtShuffleServiceFutures = externalBlockStoreClient.map { shuffleClient =>
    blocksToDeleteByShuffleService.map { case (bmId, blockIds) =>
      Future[Int] { // delete the block info through shuffleClient
        val numRemovedBlocks = shuffleClient.removeBlocks(
          bmId.host,
          bmId.port,
          bmId.executorId,
          blockIds.map(_.toString).toArray)
        numRemovedBlocks.get(defaultRpcTimeout.duration.toSeconds, TimeUnit.SECONDS)
      }
    }
  }.getOrElse(Seq.empty)
  Future.sequence(removeRddFromExecutorsFutures ++ removeRddBlockViaExtShuffleServiceFutures)
}
Other messages
The remaining messages mostly fetch aggregate information such as overall memory and storage usage. They are simple enough to read directly from the source.
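For instance, replying to a memory-status query is just a map over blockManagerInfo, pairing each BlockManager with its maximum and remaining memory. This is a simplified sketch with plain types in place of Spark's; `Info` and `memoryStatus` are hypothetical stand-ins:

```scala
// Simplified GetMemoryStatus-style reply: (max memory, remaining memory) per BlockManager
object MemoryStatusSketch {
  final case class Info(maxMem: Long, remainingMem: Long)

  def memoryStatus(infos: Map[String, Info]): Map[String, (Long, Long)] =
    infos.map { case (id, info) => (id, (info.maxMem, info.remainingMem)) }

  def main(args: Array[String]): Unit = {
    val res = memoryStatus(Map("bm-a" -> Info(100L, 40L), "bm-b" -> Info(200L, 200L)))
    assert(res("bm-a") == ((100L, 40L)))
    assert(res("bm-b") == ((200L, 200L)))
  }
}
```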
BlockManagerStorageEndpoint
Every Executor's and the Driver's SparkEnv has its own BlockManagerStorageEndpoint, created by its own BlockManager and registered with its own RpcEnv; each BlockManager holds the RpcEndpointRef of its BlockManagerStorageEndpoint in the storageEndpoint property. BlockManagerStorageEndpoint lives inside BlockManager, as the source shows:
// org.apache.spark.storage.BlockManager
/** The RpcEndpoint that manages storage on a slave node: sends, receives, and handles messages with the MasterEndpoint */
private val storageEndpoint = rpcEnv.setupEndpoint(
  "BlockManagerEndpoint" + BlockManager.ID_GENERATOR.next,
  new BlockManagerStorageEndpoint(rpcEnv, this, mapOutputTracker))
BlockManagerStorageEndpoint receives the commands issued by BlockManagerMasterEndpoint, e.g. removing a block, fetching block status, or fetching matching BlockIds. It delegates each operation to its BlockManager, obtains the result, and replies with it for messages that expect a response. The messages it handles are shown below:
case RemoveBlock(blockId) =>
  doAsync[Boolean]("removing block " + blockId, context) {
    blockManager.removeBlock(blockId)
    true
  }
case RemoveRdd(rddId) =>
  doAsync[Int]("removing RDD " + rddId, context) {
    blockManager.removeRdd(rddId)
  }
case RemoveShuffle(shuffleId) =>
  doAsync[Boolean]("removing shuffle " + shuffleId, context) {
    if (mapOutputTracker != null) {
      mapOutputTracker.unregisterShuffle(shuffleId) // update the tracker
    }
    SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId)
  }
case DecommissionBlockManager =>
  context.reply(blockManager.decommissionBlockManager())
case RemoveBroadcast(broadcastId, _) =>
  doAsync[Int]("removing broadcast " + broadcastId, context) {
    blockManager.removeBroadcast(broadcastId, tellMaster = true)
  }
case GetBlockStatus(blockId, _) =>
  context.reply(blockManager.getStatus(blockId))
case GetMatchingBlockIds(filter, _) =>
  context.reply(blockManager.getMatchingBlockIds(filter))
case TriggerThreadDump =>
  context.reply(Utils.getThreadDump())
case ReplicateBlock(blockId, replicas, maxReplicas) =>
  context.reply(blockManager.replicateBlock(blockId, replicas.toSet, maxReplicas))
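The doAsync helper used above runs the body off the RPC message loop and reports the outcome back to the sender when the future completes. A simplified sketch of that pattern, with a hypothetical `reply` callback standing in for RpcCallContext's reply/sendFailure:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Simplified doAsync pattern: run `body` asynchronously, then hand its
// outcome to `reply` (a stand-in for RpcCallContext)
object DoAsyncSketch {
  def doAsync[T](actionMessage: String)(body: => T)(reply: Either[Throwable, T] => Unit): Future[T] = {
    val future = Future(body)
    future.foreach(v => reply(Right(v)))       // success: reply with the result
    future.failed.foreach(e => reply(Left(e))) // failure: report the error
    future
  }

  def main(args: Array[String]): Unit = {
    val f = doAsync[Boolean]("removing block rdd_0_0") { true } {
      case Right(v) => println(s"replied: $v")
      case Left(e)  => println(s"failed: ${e.getMessage}")
    }
    assert(Await.result(f, 5.seconds))
  }
}
```

Keeping removal work off the message loop matters because a slow disk or memory eviction would otherwise stall every other message the endpoint has to handle.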