Overall Architecture of the Storage Module
The Storage module is split into two main layers:
- Communication layer: the Storage module uses a master-slave structure for communication; control and status information flows between the master (Driver) and the slaves (Executors) through this layer.
- Storage layer: the Storage module stores data on disk or in memory, and may also replicate it to remote nodes; the storage layer implements all of this and exposes the corresponding interfaces.
Other modules interact with the Storage module through the unified facade class BlockManager: any external class works with the Storage module by calling the appropriate BlockManager methods.
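As an illustration only, here is a minimal sketch of going through that facade, assuming a Spark 1.x environment with an initialized SparkEnv; the block id and value are invented for the example:

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{StorageLevel, TestBlockId}

// Obtain the node-local BlockManager through SparkEnv
val blockManager = SparkEnv.get.blockManager
val blockId = TestBlockId("demo") // hypothetical id, for illustration only

// Store a value; the storage layer decides where it lives (memory here)
blockManager.putSingle(blockId, "hello storage", StorageLevel.MEMORY_ONLY)

// Read it back; returns None if no block manager holds the block
val value = blockManager.getSingle(blockId)
```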
Enough theory; the code below is pasted as a reference.
Initialization of the _env object in SparkContext:
```scala
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // Creates a SparkEnv that contains a BlockManagerMaster
  // (which starts the BlockManagerMasterEndpoint and obtains its Ref)
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
}
```
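For context, the BlockManagerMaster mentioned in the comment is constructed inside SparkEnv.create, roughly as follows in the Spark 1.x source (abridged; registerOrLookupEndpoint is a local helper there):

```scala
// On the driver this registers the BlockManagerMasterEndpoint with the RpcEnv;
// on an executor it only looks up the driver endpoint's Ref
val blockManagerMaster = new BlockManagerMaster(registerOrLookupEndpoint(
  BlockManagerMaster.DRIVER_ENDPOINT_NAME,
  new BlockManagerMasterEndpoint(rpcEnv, isLocal, conf, listenerBus)),
  conf, isDriver)
```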
BlockManagerMaster holds an actor-style messaging endpoint, BlockManagerMasterEndpoint, which tracks the state of all block managers. It has three important fields:
1. blockManagerInfo: a HashMap keyed by BlockManagerId whose values are BlockManagerInfo objects; each BlockManagerInfo records the BlockManagerId, the maximum memory, and the slave endpoint on that node;
2. blockManagerIdByExecutor: a HashMap keyed by executor ID whose values are the corresponding BlockManagerId objects;
3. blockLocations: a JHashMap keyed by BlockId; since several block managers may hold a given block, each value is a HashSet of all the block managers that hold it.
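For reference, the corresponding declarations in BlockManagerMasterEndpoint look like this in the Spark 1.x source (the source imports scala.collection.mutable and aliases java.util.HashMap as JHashMap):

```scala
// Mapping from block manager id to the block manager's information
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

// Mapping from executor ID to block manager ID
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

// Mapping from block id to the set of block managers that have the block
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
```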
Each Executor starts a BlockManagerSlaveEndpoint to communicate with the BlockManagerMasterEndpoint; it handles operations such as removing blocks.
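Its message handler, abridged here from the Spark 1.x BlockManagerSlaveEndpoint, shows the flavor of these operations:

```scala
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RemoveBlock(blockId) =>
    doAsync[Boolean]("removing block " + blockId, context) {
      blockManager.removeBlock(blockId)
      true
    }

  case RemoveRdd(rddId) =>
    doAsync[Int]("removing RDD " + rddId, context) {
      blockManager.removeRdd(rddId)
    }
  // ... (RemoveShuffle, RemoveBroadcast, GetBlockStatus, etc. elided)
}
```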
Let's look at the Executor side:
```scala
if (!isLocal) {
  env.metricsSystem.registerSource(executorSource)
  env.blockManager.initialize(conf.getAppId)
}
```
which in turn calls BlockManager.initialize:
```scala
def initialize(appId: String): Unit = {
  blockTransferService.init(this)
  shuffleClient.init(appId)

  blockManagerId = BlockManagerId(
    executorId, blockTransferService.hostName, blockTransferService.port)

  shuffleServerId = if (externalShuffleServiceEnabled) {
    logInfo(s"external shuffle service port = $externalShuffleServicePort")
    BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
  } else {
    blockManagerId
  }

  // Register this BlockManager with the master node
  master.registerBlockManager(blockManagerId, maxMemory, slaveEndpoint)

  // Register Executors' configuration with the local shuffle service, if one should exist.
  if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
    registerWithExternalShuffleServer()
  }
}
```
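On the BlockManagerMaster side, registerBlockManager just wraps its arguments in a RegisterBlockManager message and sends it to the driver endpoint (abridged from the Spark 1.x source):

```scala
def registerBlockManager(
    blockManagerId: BlockManagerId,
    maxMemSize: Long,
    slaveEndpoint: RpcEndpointRef): Unit = {
  logInfo("Trying to register BlockManager")
  // tell() performs a blocking ask on the driver endpoint and checks the Boolean reply
  tell(RegisterBlockManager(blockManagerId, maxMemSize, slaveEndpoint))
  logInfo("Registered BlockManager")
}
```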
Now look at how BlockManagerMasterEndpoint handles the registration:
```scala
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RegisterBlockManager(blockManagerId, maxMemSize, slaveEndpoint) =>
    register(blockManagerId, maxMemSize, slaveEndpoint)
    context.reply(true)
  // ... (other cases elided)
}

private def register(id: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef) {
  val time = System.currentTimeMillis()
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // A block manager of the same executor already exists, so remove it (assumed dead)
        logError("Got two different block manager registrations on same executor - "
          + s" will replace old one $oldId with new one $id")
        removeExecutor(id.executorId)
      case None =>
    }
    logInfo("Registering block manager %s with %s RAM, %s".format(
      id.hostPort, Utils.bytesToString(maxMemSize), id))

    blockManagerIdByExecutor(id.executorId) = id

    // Record the new block manager in blockManagerInfo
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveEndpoint)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(time, id, maxMemSize))
}
```
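The removeExecutor call in the Some(oldId) branch above discards the stale registration; abridged from the same class:

```scala
private def removeExecutor(execId: String) {
  logInfo("Trying to remove executor " + execId + " from BlockManagerMaster.")
  // Drops the executor's BlockManagerInfo, its entry in blockManagerIdByExecutor,
  // and its BlockManagerId from every HashSet in blockLocations
  blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
}
```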
Inside the Executor, TaskRunner.run calls Task.run, which in turn calls the runTask method of the concrete Task subclass; runTask drives the computation through the RDD's iterator method. For a ShuffleMapTask, runTask returns a MapStatus object, which records the BlockManagerId where the shuffle output is stored and the size of the data each reduce ID will read. The Executor then serializes the task result and, depending on its size, either sends it back to the driver directly or stores it in the BlockManager and sends back a reference.
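That size-based decision looks roughly like this, abridged from TaskRunner.run in the Spark 1.x Executor (the thresholds come from spark.driver.maxResultSize and the Akka frame size):

```scala
val serializedDirectResult = ser.serialize(directResult)
val resultSize = serializedDirectResult.limit

val serializedResult: ByteBuffer = {
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    // Larger than spark.driver.maxResultSize: send only a reference, drop the data
    ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
  } else if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    // Too big for a direct message: put the bytes into the BlockManager
    // and send back an IndirectTaskResult pointing at the block
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(
      blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
    ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
  } else {
    // Small enough: return the serialized result directly
    serializedDirectResult
  }
}
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
```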
To be continued...
Reference: http://jerryshao.me/architecture/2013/10/08/spark-storage-module-analysis/