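The previous sections analyzed in detail how data is represented in Spark's storage system, how nodes communicate, and how data is stored. This section looks at BlockManager, the core class of the storage system. BlockManager is Spark's own storage system: RDD caching, shuffle output, and broadcast variables are all built on top of it. BlockManager is itself distributed. The driver and every executor each run a BlockManager node, and every node reports the blocks it stores to the BlockManagerMaster on the driver, which manages them centrally. BlockManager exposes get and put interfaces for data, which can be stored in memory, on disk, or off-heap.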
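The reporting relationship described above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: `ToyManagerId`, `ToyMaster`, `register`, `reportBlock`, and `getLocations` are hypothetical names invented for illustration, and everything lives in one process, whereas Spark sends these updates between nodes.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a block manager's identity (not Spark's class).
final case class ToyManagerId(executorId: String, host: String, port: Int)

// Toy model of the driver-side bookkeeping: which managers hold which blocks.
final class ToyMaster {
  private val registered = mutable.Set.empty[ToyManagerId]
  // blockId -> managers that reported holding that block
  private val locations = mutable.Map.empty[String, mutable.Set[ToyManagerId]]

  def register(id: ToyManagerId): Unit = {
    registered += id
  }

  // Called when a node reports that it now stores `blockId`.
  // Returns false if the reporting manager never registered.
  def reportBlock(id: ToyManagerId, blockId: String): Boolean =
    if (!registered.contains(id)) false
    else {
      locations.getOrElseUpdate(blockId, mutable.Set.empty[ToyManagerId]) += id
      true
    }

  def getLocations(blockId: String): Set[ToyManagerId] =
    locations.get(blockId) match {
      case Some(holders) => holders.toSet
      case None          => Set.empty[ToyManagerId]
    }
}

object ToyMasterDemo {
  def main(args: Array[String]): Unit = {
    val master = new ToyMaster
    val exec1 = ToyManagerId("1", "host-a", 7077)
    val exec2 = ToyManagerId("2", "host-b", 7078)
    master.register(exec1)
    master.register(exec2)
    master.reportBlock(exec1, "rdd_0_0")
    master.reportBlock(exec2, "rdd_0_0")
    println(master.getLocations("rdd_0_0").size) // 2
  }
}
```

The point of the model is only that block locations are known centrally, so any node can ask the master where a block lives instead of probing its peers.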
BlockManagerId
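BlockManagerId is the unique identifier of a BlockManager. It contains the executor ID, host, port, and topology information. We want each BlockManager to have only one BlockManagerId object, which is a typical Singleton scenario. Implementing this in Scala is somewhat obscure, and the code here is a typical example: all constructors are private, and instances are created only through the companion object, which caches them so that equal identifiers resolve to the same object.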
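class BlockManagerId private (
    private var executorId_ : String,
    private var host_ : String,
    private var port_ : Int,
    private var topologyInfo_ : Option[String])
  extends Externalizable {

  // No-argument constructor, used by apply(in: ObjectInput) before readExternal.
  private def this() = this(null, null, 0, None)

  // (accessors and the Externalizable methods are omitted in this excerpt)
}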
private[spark] object BlockManagerId {

  // Build an ID from its fields, then return the cached instance for it.
  def apply(
      execId: String,
      host: String,
      port: Int,
      topologyInfo: Option[String] = None): BlockManagerId =
    getCachedBlockManagerId(new BlockManagerId(execId, host, port, topologyInfo))

  // Build an ID by deserializing it from the input stream.
  def apply(in: ObjectInput): BlockManagerId = {
    val obj = new BlockManagerId()
    obj.readExternal(in)
    getCachedBlockManagerId(obj)
  }

  // Guava cache that maps each ID to itself, holding at most 10000 entries.
  val blockManagerIdCache = CacheBuilder.newBuilder()
    .maximumSize(10000)
    .build(new CacheLoader[BlockManagerId, BlockManagerId]() {
      override def load(id: BlockManagerId) = id
    })

  def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId = {
    blockManagerIdCache.get(id)
  }
}
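The private-constructor-plus-companion pattern above can be tried in isolation with a minimal sketch. Two assumptions are made here: `NodeId` is a hypothetical stand-in for BlockManagerId, and an unbounded `java.util.concurrent.ConcurrentHashMap` is swapped in for the size-bounded Guava cache so that the example needs only the standard library.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for BlockManagerId. The constructor is private,
// so instances can only be obtained through the companion object.
final class NodeId private (val executorId: String, val host: String, val port: Int) {
  // Value equality is required so that equal ids hit the same cache entry.
  override def equals(other: Any): Boolean = other match {
    case that: NodeId =>
      executorId == that.executorId && host == that.host && port == that.port
    case _ => false
  }
  override def hashCode: Int = (executorId, host, port).hashCode
  override def toString: String = s"NodeId($executorId, $host, $port)"
}

object NodeId {
  // Unbounded map used for brevity (assumption); the code above uses a
  // Guava cache limited to 10000 entries.
  private val cache = new ConcurrentHashMap[NodeId, NodeId]()

  def apply(executorId: String, host: String, port: Int): NodeId = {
    val candidate = new NodeId(executorId, host, port)
    // putIfAbsent returns the previously cached instance, or null if none existed.
    val existing = cache.putIfAbsent(candidate, candidate)
    if (existing eq null) candidate else existing
  }
}

object InternDemo {
  def main(args: Array[String]): Unit = {
    val a = NodeId("exec-1", "host-a", 7077)
    val b = NodeId("exec-1", "host-a", 7077)
    val c = NodeId("exec-2", "host-b", 7078)
    println(a eq b) // true: both calls return the same cached instance
    println(a eq c) // false: different identity, different instance
  }
}
```

Strictly speaking this yields one instance per distinct value rather than one instance per program, which is why the cache key needs `equals` and `hashCode` defined over the identifying fields.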
BlockManagerMaster
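BlockManager is shared by the master (driver) and the slaves (executors), but some logic is specific to the master, and the Spark designers wrapped that logic in BlockManagerMaster. When BlockManagerMaster receives a call, it wraps the request in a block manager message and sends it to the master endpoint (BlockManagerMasterEndpoint). Let's look at how registering a BlockManager is implemented.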
def registerBlockManager(
id: BlockManagerId,
localDirs: Array[String