Overall Architecture of the Storage Module
The Storage module is split into two main layers:
- Communication layer: the Storage module uses a master-slave structure for communication; control and status information flows between the master (Driver) and the slaves (Executors) through this layer.
- Storage layer: the Storage module stores data on disk or in memory, and may also replicate it to remote nodes; the storage layer implements all of this and exposes the corresponding interfaces.
Other modules interact with the Storage module through the unified facade class BlockManager: any external class works with the Storage module by calling the appropriate BlockManager methods.
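As an illustration only, here is a minimal sketch of going through that facade, assuming a Spark 1.x environment with an initialized SparkEnv; the block id and value are invented for the example:

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{StorageLevel, TestBlockId}

// Obtain the node-local BlockManager through SparkEnv
val blockManager = SparkEnv.get.blockManager
val blockId = TestBlockId("demo") // hypothetical id, for illustration only

// Store a value; the storage layer decides where it lives (memory here)
blockManager.putSingle(blockId, "hello storage", StorageLevel.MEMORY_ONLY)

// Read it back; returns None if no block manager holds the block
val value = blockManager.getSingle(blockId)
```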
Enough theory; the code below is pasted as a reference.
Initialization of the _env object in SparkContext:
```scala
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // Creates a SparkEnv that contains a BlockManagerMaster
  // (which starts the BlockManagerMasterEndpoint and obtains its Ref)
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
}
```
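For context, the BlockManagerMaster mentioned in the comment is constructed inside SparkEnv.create, roughly as follows in the Spark 1.x source (abridged; registerOrLookupEndpoint is a local helper there):

```scala
// On the driver this registers the BlockManagerMasterEndpoint with the RpcEnv;
// on an executor it only looks up the driver endpoint's Ref
val blockManagerMaster = new BlockManagerMaster(registerOrLookupEndpoint(
  BlockManagerMaster.DRIVER_ENDPOINT_NAME,
  new BlockManagerMasterEndpoint(rpcEnv, isLocal, conf, listenerBus)),
  conf, isDriver)
```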
BlockManagerMaster holds an actor-style messaging endpoint, BlockManagerMasterEndpoint, which tracks the state of all block managers. It has three important fields:
1. blockManagerInfo: a HashMap keyed by BlockManagerId whose values are BlockManagerInfo objects; each BlockManagerInfo records the BlockManagerId, the maximum memory, and the slave endpoint on that node;
2. blockManagerIdByExecutor: a HashMap keyed by executor ID whose values are the corresponding BlockManagerId objects;
3. blockLocations: a JHashMap keyed by BlockId; since several block managers may hold a given block, each value is a HashSet of all the block managers that hold it.
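For reference, the corresponding declarations in BlockManagerMasterEndpoint look like this in the Spark 1.x source (the source imports scala.collection.mutable and aliases java.util.HashMap as JHashMap):

```scala
// Mapping from block manager id to the block manager's information
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

// Mapping from executor ID to block manager ID
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

// Mapping from block id to the set of block managers that have the block
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
```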
Each Executor starts a BlockManagerSlaveEndpoint to communicate with the BlockManagerMasterEndpoint; it handles operations such as removing blocks.
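Its message handler, abridged here from the Spark 1.x BlockManagerSlaveEndpoint, shows the flavor of these operations:

```scala
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RemoveBlock(blockId) =>
    doAsync[Boolean]("removing block " + blockId, context) {
      blockManager.removeBlock(blockId)
      true
    }

  case RemoveRdd(rddId) =>
    doAsync[Int]("removing RDD " + rddId, context) {
      blockManager.removeRdd(rddId)
    }
  // ... (RemoveShuffle, RemoveBroadcast, GetBlockStatus, etc. elided)
}
```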
Let's look at the Executor side:
```scala
if (!isLocal) {
  env.metricsSystem.registerSource(executorSource)
  env.blockManager.initialize(conf.getAppId)
}
```
which in turn calls BlockManager.initialize:
```scala
def initialize(appId: String): Unit = {
  blockTransferService.init(this)
  shuffleClient.init(appId)

  blockManagerId = BlockManagerId(
    executorId, blockTransferService.hostName, blockTransferService.port)

  shuffleServerId = if (externalShuffleServiceEnabled) {
    logInfo(s"external shuffle service port = $externalShuffleServicePort")
    BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
  } else {
    blockManagerId
  }

  // Register this BlockManager with the master node
  master.registerBlockManager(blockManagerId, maxMemory, slaveEndpoint)

  // Register Executors' configuration with the local shuffle service, if one should exist.
  if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
    registerWithExternalShuffleServer()
  }
}
```
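On the BlockManagerMaster side, registerBlockManager just wraps its arguments in a RegisterBlockManager message and sends it to the driver endpoint (abridged from the Spark 1.x source):

```scala
def registerBlockManager(
    blockManagerId: BlockManagerId,
    maxMemSize: Long,
    slaveEndpoint: RpcEndpointRef): Unit = {
  logInfo("Trying to register BlockManager")
  // tell() performs a blocking ask on the driver endpoint and checks the Boolean reply
  tell(RegisterBlockManager(blockManagerId, maxMemSize, slaveEndpoint))
  logInfo("Registered BlockManager")
}
```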
Now look at how BlockManagerMasterEndpoint handles the registration:
```scala
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RegisterBlockManager(blockManagerId, maxMemSize, slaveEndpoint) =>
    register(blockManagerId, maxMemSize, slaveEndpoint)
    context.reply(true)
  // ... (other cases elided)
}

private def register(id: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef) {
  val time = System.currentTimeMillis()
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // A block manager of the same executor already exists, so remove it (assumed dead)
        logError("Got two different block manager registrations on same executor - "
          + s" will replace old one $oldId with new one $id")
        removeExecutor(id.executorId)
      case None =>
    }
    logInfo("Registering block manager %s with %s RAM, %s".format(
      id.hostPort, Utils.bytesToString(maxMemSize), id))

    blockManagerIdByExecutor(id.executorId) = id

    // Record the new block manager in blockManagerInfo
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveEndpoint)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(time, id, maxMemSize))
}
```
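The removeExecutor call in the Some(oldId) branch above discards the stale registration; abridged from the same class:

```scala
private def removeExecutor(execId: String) {
  logInfo("Trying to remove executor " + execId + " from BlockManagerMaster.")
  // Drops the executor's BlockManagerInfo, its entry in blockManagerIdByExecutor,
  // and its BlockManagerId from every HashSet in blockLocations
  blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
}
```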
Inside the Executor, TaskRunner.run calls Task.run, which in turn calls the runTask method of the concrete Task subclass; runTask drives the computation through the RDD's iterator method. For a ShuffleMapTask, runTask returns a MapStatus object, which records the BlockManagerId where the shuffle output is stored and the size of the data each reduce ID will read. The Executor then serializes the task result and, depending on its size, either sends it back to the driver directly or stores it in the BlockManager and sends back a reference.
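That size-based decision looks roughly like this, abridged from TaskRunner.run in the Spark 1.x Executor (the thresholds come from spark.driver.maxResultSize and the Akka frame size):

```scala
val serializedDirectResult = ser.serialize(directResult)
val resultSize = serializedDirectResult.limit

val serializedResult: ByteBuffer = {
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    // Larger than spark.driver.maxResultSize: send only a reference, drop the data
    ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
  } else if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    // Too big for a direct message: put the bytes into the BlockManager
    // and send back an IndirectTaskResult pointing at the block
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(
      blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
    ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
  } else {
    // Small enough: return the serialized result directly
    serializedDirectResult
  }
}
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
```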
To be continued...
Reference: http://jerryshao.me/architecture/2013/10/08/spark-storage-module-analysis/