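To keep disk I/O from becoming the performance bottleneck that it is in Hadoop, Spark prefers to keep data such as configuration information and intermediate results in memory, which greatly improves execution efficiency. The same data can also be placed on disk or in an external storage system.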
1. Constructing the BlockManager
BlockManager is the core component of Spark's storage system. Both the Driver and every Executor create a BlockManager. Its primary constructor is as follows:
/**
 * Manager running on every node (driver and executors) which provides interfaces for putting and
 * retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
 *
 * Note that #initialize() must be called before the BlockManager is usable.
 */
private[spark] class BlockManager(
    executorId: String,
    rpcEnv: RpcEnv,
    val master: BlockManagerMaster,
    defaultSerializer: Serializer,
    val conf: SparkConf,
    memoryManager: MemoryManager,
    mapOutputTracker: MapOutputTracker,
    shuffleManager: ShuffleManager,
    blockTransferService: BlockTransferService,
    securityManager: SecurityManager,
    numUsableCores: Int)
  extends BlockDataManager with Logging
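The main components of BlockManager are:
0. BlockManagerMaster: the BlockManagerMaster on the Driver centrally manages the BlockManagers that live on the Executors;
1. DiskBlockManager: the disk block manager;
2. blockInfo: caches the mapping from each BlockId to its BlockInfo;
3. ExecutionContext: an ExecutionContext backed by a ThreadPoolExecutor thread pool; thread names are prefixed with block-manager-future, and at most 128 threads can be created;
4. MemoryStore: in-memory storage; blocks are kept in memory either as arrays of Java objects or as serialized ByteBuffers;
5. DiskStore: on-disk storage;
6. ExternalBlockStore: stores BlockManager blocks in external storage; internally it delegates to TachyonBlockManager;
7. ShuffleClient: the shuffle client. BlockTransferService is used by default; setting spark.shuffle.service.enabled to true switches to the external shuffle service;
8. BlockManagerSlaveEndpoint: registers a BlockManagerSlaveEndpoint and returns a reference to it (a NettyRpcEndpointRef in the default Netty mode);
9. metadataCleaner: the cleaner for non-broadcast blocks;
10. broadcastCleaner: the cleaner for broadcast blocks;
11. CompressionCodec: the compression algorithm implementation.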
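Item 3 above can be illustrated with the standard library alone. The following is only a sketch of how a bounded daemon thread pool with a name prefix might be wrapped in an ExecutionContext; `FuturePoolSketch`, `namedDaemonFactory`, and `newBoundedDaemonPool` are hypothetical names introduced here, and Spark's own `ThreadUtils` may be implemented differently.

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadFactory, ThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FuturePoolSketch {
  // Hypothetical helper (not Spark's ThreadUtils): creates daemon threads named "<prefix>-<n>".
  def namedDaemonFactory(prefix: String): ThreadFactory = new ThreadFactory {
    private val counter = new AtomicInteger(0)
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"$prefix-${counter.getAndIncrement()}")
      t.setDaemon(true) // daemon threads do not keep the JVM alive
      t
    }
  }

  // Assumed shape of the pool: at most maxThreads threads, idle threads released after a timeout.
  def newBoundedDaemonPool(
      prefix: String,
      maxThreads: Int,
      keepAliveSeconds: Long = 60L): ThreadPoolExecutor = {
    val pool = new ThreadPoolExecutor(
      maxThreads, maxThreads, keepAliveSeconds, TimeUnit.SECONDS,
      new LinkedBlockingQueue[Runnable](), namedDaemonFactory(prefix))
    pool.allowCoreThreadTimeOut(true) // lets the pool shrink back to zero threads when idle
    pool
  }

  def main(args: Array[String]): Unit = {
    val pool = newBoundedDaemonPool("block-manager-future", 128)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    // Run one task on the pool and report which thread executed it.
    val threadName = Await.result(Future(Thread.currentThread().getName), 5.seconds)
    println(threadName)
    pool.shutdown()
  }
}
```

Because the threads are daemons, a forgotten pool does not prevent the JVM from exiting; the cap of 128 bounds how many asynchronous block operations can run at once in this sketch.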
val diskBlockManager = new DiskBlockManager(this, conf)

private val blockInfo = new TimeStampedHashMap[BlockId, BlockInfo]

private val futureExecutionContext = ExecutionContext.fromExecutorService(
  ThreadUtils.newDaemonCachedThreadPool("block-manager-future", 128))

// Actual storage of where blocks are kept
private var externalBlockStoreInitialized = false
private[spark] val memoryStore = new MemoryStore(this, memoryManager)
private[spark] val diskStore = new DiskStore(this, diskBlockManager)
private[spark] lazy val externalBlockStore: ExternalBlockStore = {
  externalBlockStoreInitialized = true
  new ExternalBlockStore(this, executorId)
}
memoryManager.setMemoryStore(memoryStore)

private[spark]
val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)

// Client to read other executors' shuffle files. This is either an external service, or just the
// standard BlockTransferService to directly connect to other Executors.
private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
  val transConf = SparkTransportConf.fromSparkConf(conf, "shuffle", numUsableCores)
  new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled(),
    securityManager.isSaslEncryptionEnabled())
} else {
  blockTransferService
}

// Register a [[RpcEndpoint]] with a name and return its [[RpcEndpointRef]].
private val slaveEndpoint = rpcEnv.setupEndpoint(
  "BlockManagerEndpoint" + BlockManager.ID_GENERATOR.next,
  new BlockManagerSlaveEndpoint(rpcEnv, this, mapOutputTracker))

private val metadataCleaner = new MetadataCleaner(
  MetadataCleanerType.BLOCK_MANAGER, this.dropOldNonBroadcastBlocks, conf)
private val broadcastCleaner = new MetadataCleaner(
  MetadataCleanerType.BROADCAST_VARS, this.dropOldBroadcastBlocks, conf)

/* The compression codec to use. Note that the "lazy" val is necessary because we want to delay
 * the initialization of the compression codec until it is first used. The reason is that a Spark
 * program could be using a user-defined codec in a third party jar, which is loaded in
 * Executor.updateDependencies. When the BlockManager is initialized, user level jars hasn't been
 * loaded yet. */
private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf)
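Both externalBlockStore and compressionCodec rely on Scala's `lazy val` to postpone construction until first use, and externalBlockStore additionally records whether that first use ever happened. A minimal sketch of the pattern follows; `LazyInitSketch`, `ExpensiveStore`, and `Manager` are hypothetical stand-ins, not Spark classes, and how Spark consumes its flag is only assumed here.

```scala
object LazyInitSketch {
  // Hypothetical stand-in for a resource that is costly or not yet loadable at construction time.
  final class ExpensiveStore(val owner: String)

  final class Manager(executorId: String) {
    // Set to true the first time the lazy value is forced.
    private var storeInitialized = false

    // The body runs at most once, on first access, not when Manager is constructed.
    lazy val store: ExpensiveStore = {
      storeInitialized = true
      new ExpensiveStore(executorId)
    }

    def isStoreInitialized: Boolean = storeInitialized

    // Assumed use of the flag: cleanup checks it so that it never forces the
    // lazy value merely to tear it down again.
    def stop(): Unit =
      if (storeInitialized) println(s"closing store owned by ${store.owner}")
  }

  def main(args: Array[String]): Unit = {
    val m = new Manager("executor-1")
    println(m.isStoreInitialized) // nothing has touched the store yet
    val s = m.store               // first access triggers construction
    println(s.owner)
    println(m.isStoreInitialized)
    m.stop()
  }
}
```

The flag matters because reading a `lazy val` is itself what initializes it, so code that wants to ask "was it ever created?" needs a separate marker.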
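2. Initializing the BlockManager
A BlockManager must be initialized before it can be used. Initialization cannot be done inside the constructor, because the application ID may not yet be known at that point.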
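The constraint described here is a two-phase initialization: the object is constructed first and receives the application ID later. The sketch below shows the general idea under stated assumptions; `TwoPhaseInitSketch`, `Manager`, and `blockLocation` are hypothetical names for illustration and do not reproduce BlockManager's actual initialization logic.

```scala
object TwoPhaseInitSketch {
  final class Manager(executorId: String) {
    // Not known when the constructor runs; supplied later through initialize.
    private var appId: Option[String] = None

    // Phase two: called once the application ID has been obtained.
    def initialize(id: String): Unit = {
      appId = Some(id)
    }

    // Any operation that needs the application ID fails fast if initialize was skipped.
    def blockLocation(blockName: String): String = {
      val id = appId.getOrElse(
        throw new IllegalStateException("initialize(appId) must be called before use"))
      s"$id/$executorId/$blockName"
    }
  }

  def main(args: Array[String]): Unit = {
    val m = new Manager("executor-1") // phase one: construction without an application ID
    m.initialize("app-001")           // phase two: the ID is now available
    println(m.blockLocation("rdd_0_0"))
  }
}
```

Failing loudly on use before initialization turns an ordering mistake into an immediate, descriptive error rather than a silent misbehavior.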