BlockManager runs on every node (both the Driver and the Executors) and manages blocks stored in memory, on disk, and in off-heap memory, whether local or on remote nodes. In a narrow sense the storage system is just the BlockManager; in a broad sense it covers the BlockManagers on all nodes of the Spark cluster, together with BlockInfoManager, DiskBlockManager, DiskStore, MemoryManager, MemoryStore, the BlockManagerMaster that manages all BlockManagers in the cluster, and the BlockTransferService on each node that serves block uploads and downloads.
一、Initializing the BlockManager
Every Driver and Executor creates a BlockManager when creating its own SparkEnv. A BlockManager only becomes functional once its initialize method has been called.
def initialize(appId: String): Unit = {
blockTransferService.init(this) // Initialize the BlockTransferService, i.e. the NettyBlockTransferService
externalBlockStoreClient.foreach { blockStoreClient =>
blockStoreClient.init(appId)
}
blockReplicationPolicy = { // Set the block replication policy
val priorityClass = conf.get(config.STORAGE_REPLICATION_POLICY)
val clazz = Utils.classForName(priorityClass)
val ret = clazz.getConstructor().newInstance().asInstanceOf[BlockReplicationPolicy]
logInfo(s"Using $priorityClass for block replication policy")
ret
}
val id =
BlockManagerId(executorId, blockTransferService.hostName, blockTransferService.port, None)
val idFromMaster = master.registerBlockManager(
id,
diskBlockManager.localDirsString,
maxOnHeapMemory,
maxOffHeapMemory,
slaveEndpoint)
// Determine the BlockManagerId of the current BlockManager
blockManagerId = if (idFromMaster != null) idFromMaster else id
// Determine the shuffleServerId
shuffleServerId = if (externalShuffleServiceEnabled) {
logInfo(s"external shuffle service port = $externalShuffleServicePort")
BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
} else {
blockManagerId
}
// Register Executors' configuration with the local shuffle service, if one should exist.
if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
registerWithExternalShuffleServer() // Register with the external shuffle service. When the external shuffle service is enabled, shuffleServerId points at the external shuffle service; otherwise it is the BlockManager's own BlockManagerId
}
hostLocalDirManager =
if (conf.get(config.SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED) &&
!conf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
externalBlockStoreClient.map { blockStoreClient =>
new HostLocalDirManager(
futureExecutionContext,
conf.get(config.STORAGE_LOCAL_DISK_BY_EXECUTORS_CACHE_SIZE),
blockStoreClient,
blockManagerId.host,
externalShuffleServicePort)
}
} else {
None
}
logInfo(s"Initialized BlockManager: $blockManagerId")
}
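Note that initialize is not invoked from the SparkEnv constructor itself: the driver calls it from SparkContext once the application is registered, and each executor calls it during construction. A minimal sketch of the ordering (the call sites are paraphrased and may vary by Spark version):

// Sketch: the BlockManager is created together with SparkEnv but stays inert
// until initialize(appId) is called with the application id.
val env = SparkEnv.get              // exists on both the driver and the executors
val appId = env.conf.getAppId       // only known after the application is registered
env.blockManager.initialize(appId)  // driver: in SparkContext; executor: in the Executor constructor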
二、Methods Provided by BlockManager
BlockManager provides a large number of methods that underpin Spark's storage system, so they are worth walking through one by one.
2.1 reregister
This method re-registers the BlockManager with the BlockManagerMaster and reports all block information to it.
def reregister(): Unit = {
// TODO: We might need to rate limit re-registering.
logInfo(s"BlockManager $blockManagerId re-registering with master")
master.registerBlockManager(blockManagerId, diskBlockManager.localDirsString, maxOnHeapMemory,
maxOffHeapMemory, slaveEndpoint) // Register the BlockManager with the BlockManagerMaster
reportAllBlocks() // Report all block information
}
The reportAllBlocks method that reports all block information is implemented as follows:
private def reportAllBlocks(): Unit = {
logInfo(s"Reporting ${blockInfoManager.size} blocks to the master.")
for ((blockId, info) <- blockInfoManager.entries) {
val status = getCurrentBlockStatus(blockId, info)
if (info.tellMaster && !tryToReportBlockStatus(blockId, status)) {
logError(s"Failed to report $blockId to master; giving up.")
return
}
}
}
As we can see, reportAllBlocks iterates over all BlockId-to-BlockInfo mappings managed by the BlockInfoManager and performs the following steps.
- 1) Call the getCurrentBlockStatus method (see the next code block) to obtain the block's status information, a BlockStatus.
- 2) If the block's BlockStatus should be reported to the BlockManagerMaster, call the tryToReportBlockStatus method (also shown below) to report this block's status to the BlockManagerMaster.
private def getCurrentBlockStatus(blockId: BlockId, info: BlockInfo): BlockStatus = {
info.synchronized {
info.level match {
case null =>
BlockStatus.empty
case level =>
val inMem = level.useMemory && memoryStore.contains(blockId)
val onDisk = level.useDisk && diskStore.contains(blockId)
val deserialized = if (inMem) level.deserialized else false
val replication = if (inMem || onDisk) level.replication else 1
val storageLevel = StorageLevel(
useDisk = onDisk,
useMemory = inMem,
useOffHeap = level.useOffHeap,
deserialized = deserialized,
replication = replication)
val memSize = if (inMem) memoryStore.getSize(blockId) else 0L
val diskSize = if (onDisk) diskStore.getSize(blockId) else 0L
BlockStatus(storageLevel, memSize, diskSize)
}
}
}
private def tryToReportBlockStatus(
blockId: BlockId,
status: BlockStatus,
droppedMemorySize: Long = 0L): Boolean = {
val storageLevel = status.storageLevel
val inMemSize = Math.max(status.memSize, droppedMemorySize)
val onDiskSize = status.diskSize
master.updateBlockInfo(blockManagerId, blockId, storageLevel, inMemSize, onDiskSize)
}
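Notice that tryToReportBlockStatus simply forwards the status through master.updateBlockInfo, whose Boolean result indicates whether the master recognizes this BlockManager. Its caller reportBlockStatus uses a false result to trigger re-registration; paraphrasing the surrounding code in the same class (details may differ across versions):

private[spark] def reportBlockStatus(
    blockId: BlockId,
    status: BlockStatus,
    droppedMemorySize: Long = 0L): Unit = {
  // If the master does not know this BlockManager, updateBlockInfo returns false
  // and we must re-register; re-registering reports all blocks anyway.
  val needReregister = !tryToReportBlockStatus(blockId, status, droppedMemorySize)
  if (needReregister) {
    logInfo(s"Got told to re-register updating block $blockId")
    asyncReregister() // re-register on a separate thread to avoid blocking
  }
}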
2.2 getLocalBytes
This method retrieves the data of the block identified by a BlockId from the storage system and returns it in wrapped form. (Older versions wrapped the result as a ChunkedByteBuffer; recent versions return BlockData instead.)
def getLocalBytes(blockId: BlockId): Option[BlockData] = {
logDebug(s"Getting local block $blockId as bytes")
assert(!blockId.isShuffle, s"Unexpected ShuffleBlockId $blockId")
blockInfoManager.lockForReading(blockId).map { info => doGetLocalBytes(blockId, info) }
}
private def doGetLocalBytes(blockId: BlockId, info: BlockInfo): BlockData = {
val level = info.level
logDebug(s"Level for block $blockId is $level")
// In order, try to read the serialized bytes from memory, then from disk, then fall back to
// serializing in-memory objects, and, finally, throw an exception if the block does not exist.
if (level.deserialized) { // If the block is stored deserialized, try to read it from the DiskStore first, then the MemoryStore
// Try to avoid expensive serialization by reading a pre-serialized copy from disk:
if (level.useDisk && diskStore.contains(blockId)) {
// Note: we purposely do not try to put the block back into memory here. Since this branch
// handles deserialized blocks, this block may only be cached in memory as objects, not
// serialized bytes. Because the caller only requested bytes, it doesn't make sense to
// cache the block's deserialized objects since that caching may not have a payoff.
diskStore.getBytes(blockId)
} else if (level.useMemory && memoryStore.contains(blockId)) {
// The block was not found on disk, so serialize an in-memory copy:
new ByteBufferBlockData(serializerManager.dataSerializeWithExplicitClassTag(
blockId, memoryStore.getValues(blockId).get, info.classTag), true)
} else {
handleLocalReadFailure(blockId)
}
} else { // storage level is serialized: read the block from the MemoryStore first, then the DiskStore
if (level.useMemory && memoryStore.contains(blockId)) {
new ByteBufferBlockData(memoryStore.getBytes(blockId).get, false)
} else if (level.useDisk && diskStore.contains(blockId)) {
val diskData = diskStore.getBytes(blockId)
maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
.map(new ByteBufferBlockData(_, false))
.getOrElse(diskData)
} else {
handleLocalReadFailure(blockId)
}
}
}
- As the assertion shows, getLocalBytes only handles blocks that are not shuffle blocks.
2.3 getLocalBlockData
This method retrieves the data of a local block.
override def getLocalBlockData(blockId: BlockId): ManagedBuffer = {
if (blockId.isShuffle) {
shuffleManager.shuffleBlockResolver.getBlockData(blockId) // If the block is a shuffle block, obtain its data through the ShuffleManager's shuffleBlockResolver.getBlockData
} else { // The block is not a shuffle block
getLocalBytes(blockId) match {
case Some(blockData) =>
new BlockManagerManagedBuffer(blockInfoManager, blockId, blockData, true)
case None =>
// If this block manager receives a request for a block that it doesn't have then it's
// likely that the master has outdated block statuses for this block. Therefore, we send
// an RPC so that this block is marked as being unavailable from this block manager.
reportBlockStatus(blockId, BlockStatus.empty)
throw new BlockNotFoundException(blockId.toString)
}
}
}
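getLocalBlockData is also the server side of remote fetches: when another executor requests a block, the RPC handler behind NettyBlockTransferService resolves the request through this method. A rough, hedged sketch of that handler logic (the helper name serveFetch is made up for illustration; the real class is NettyBlockRpcServer and its code differs across versions):

// Hypothetical sketch of serving a remote fetch request:
def serveFetch(blockManager: BlockDataManager, blockIdStr: String): ManagedBuffer = {
  val blockId = BlockId(blockIdStr)        // parse ids like "rdd_0_1" or "shuffle_0_0_0"
  blockManager.getLocalBlockData(blockId)  // shuffle blocks go to the ShuffleManager,
                                           // everything else through getLocalBytes
}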
2.4 getLocalValues
This method retrieves block data from the local BlockManager.
def getLocalValues(blockId: BlockId): Option[BlockResult] = {
logDebug(s"Getting local block $blockId")
blockInfoManager.lockForReading(blockId) match { // Acquire the read lock on the block corresponding to the BlockId
case None =>
logDebug(s"Block $blockId was not found")
None
case Some(info) =>
val level = info.level
logDebug(s"Level for block $blockId is $level")
val taskContext = Option(TaskContext.get())
if (level.useMemory && memoryStore.contains(blockId)) {
// Prefer reading the block data from the MemoryStore
val iter: Iterator[Any] = if (level.deserialized) {
memoryStore.getValues(blockId).get
} else {
serializerManager.dataDeserializeStream(
blockId, memoryStore.getBytes(blockId).get.toInputStream())(info.classTag)
}
// We need to capture the current taskId in case the iterator completion is triggered
// from a different thread which does not have TaskContext set; see SPARK-18406 for
// discussion.
val ci = CompletionIterator[Any, Iterator[Any]](iter, {
releaseLock(blockId, taskContext)
})
Some(new BlockResult(ci, DataReadMethod.Memory, info.size))
} else if (level.useDisk && diskStore.contains(blockId)) {
// Read the block data from disk
val diskData = diskStore.getBytes(blockId)
val iterToReturn: Iterator[Any] = {
if (level.deserialized) {
val diskValues = serializerManager.dataDeserializeStream(
blockId,
diskData.toInputStream())(info.classTag)
maybeCacheDiskValuesInMemory(info, blockId, level, diskValues)
} else {
val stream = maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
.map { _.toInputStream(dispose = false) }
.getOrElse { diskData.toInputStream() }
serializerManager.dataDeserializeStream(blockId, stream)(info.classTag)
}
}
val ci = CompletionIterator[Any, Iterator[Any]](iterToReturn, {
releaseLockAndDispose(blockId, diskData, taskContext)
})
Some(new BlockResult(ci, DataReadMethod.Disk, info.size))
} else {
handleLocalReadFailure(blockId)
}
}
}
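getLocalValues is one half of the read path: BlockManager.get tries it first and falls back to fetching from remote BlockManagers. Paraphrasing the get method (simplified):

def get[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
  val local = getLocalValues(blockId)       // try the local MemoryStore/DiskStore first
  if (local.isDefined) {
    logInfo(s"Found block $blockId locally")
    return local
  }
  val remote = getRemoteValues[T](blockId)  // otherwise fetch from a remote BlockManager
  if (remote.isDefined) {
    logInfo(s"Found block $blockId remotely")
    return remote
  }
  None
}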
2.5 putBytes
Before introducing putBytes we first need to look at doPut, which performs the actual block write: it acquires the block's write lock, runs the caller-supplied putBody function, and then releases or downgrades the lock depending on whether the write succeeded.
private def doPut[T](
blockId: BlockId,
level: StorageLevel,
classTag: ClassTag[_],
tellMaster: Boolean,
keepReadLock: Boolean)(putBody: BlockInfo => Option[T]): Option[T] = {
require(blockId != null, "BlockId is null")
require(level != null && level.isValid, "StorageLevel is null or invalid")
val putBlockInfo = {
val newInfo = new BlockInfo(level, classTag, tellMaster)
if (blockInfoManager.lockNewBlockForWriting(blockId, newInfo)) { // Acquire the write lock for the block
newInfo
} else {
logWarning(s"Block $blockId already exists on this machine; not re-adding it")
if (!keepReadLock) { // If the block already exists and the caller does not need to keep a read lock, the current thread must release the read lock it now holds
// lockNewBlockForWriting returned a read lock on the existing block, so we must free it:
releaseLock(blockId) // Release the read lock
}
return None
}
}
val startTimeNs = System.nanoTime()
var exceptionWasThrown: Boolean = true
val result: Option[T] = try {
val res = putBody(putBlockInfo) // Invoke putBody to perform the write
exceptionWasThrown = false
if (res.isEmpty) { // The write succeeded
// the block was successfully stored
if (keepReadLock) { // A read lock must be kept, so downgrade the write lock
blockInfoManager.downgradeLock(blockId)
} else { // Release the lock
blockInfoManager.unlock(blockId)
}
} else { // The write failed, so remove the block
removeBlockInternal(blockId, tellMaster = false)
logWarning(s"Putting block $blockId failed")
}
res
} catch { // An exception was thrown during the write
// Since removeBlockInternal may throw exception,
// we should print exception first to show root cause.
case NonFatal(e) =>
logWarning(s"Putting block $blockId failed due to exception $e.")
throw e
} finally {
// This cleanup is performed in a finally block rather than a `catch` to avoid having to
// catch and properly re-throw InterruptedException.
if (exceptionWasThrown) {
// If an exception was thrown then it's possible that the code in `putBody` has already
// notified the master about the availability of this block, so we need to send an update
// to remove this block location.
removeBlockInternal(blockId, tellMaster = tellMaster) // Remove the block
// The `putBody` code may have also added a new block status to TaskMetrics, so we need
// to cancel that out by overwriting it with an empty block status. We only do this if
// the finally block was entered via an exception because doing this unconditionally would
// cause us to send empty block statuses for every block that failed to be cached due to
// a memory shortage (which is an expected failure, unlike an uncaught exception).
addUpdatedBlockStatusToTaskMetrics(blockId, BlockStatus.empty) // Update the task metrics
}
}
val usedTimeMs = Utils.getUsedTimeNs(startTimeNs)
if (level.replication > 1) {
logDebug(s"Putting block ${blockId} with replication took $usedTimeMs")
} else {
logDebug(s"Putting block ${blockId} without replication took ${usedTimeMs}")
}
result
}
Now let's look at the implementation of putBytes:
def putBytes[T: ClassTag](
blockId: BlockId,
bytes: ChunkedByteBuffer,
level: StorageLevel,
tellMaster: Boolean = true): Boolean = {
require(bytes != null, "Bytes is null")
val blockStoreUpdater =
ByteBufferBlockStoreUpdater(blockId, level, implicitly[ClassTag[T]], bytes, tellMaster)
blockStoreUpdater.save() // the save method delegates the write to doPut
}
The save method, defined in BlockStoreUpdater (which ByteBufferBlockStoreUpdater extends), is implemented as follows:
def save(): Boolean = {
doPut(blockId, level, classTag, tellMaster, keepReadLock) { info =>
val startTimeNs = System.nanoTime()
// Since we're storing bytes, initiate the replication before storing them locally.
// This is faster as data is already serialized and ready to send.
val replicationFuture = if (level.replication > 1) {
Future { // Start an asynchronous task that calls replicate to copy the block data to the storage systems of other nodes
// This is a blocking action and should run in futureExecutionContext which is a cached
// thread pool.
replicate(blockId, blockData(), level, classTag)
}(futureExecutionContext)
} else {
null
}
if (level.useMemory) { // Prefer memory
// Put it in memory first, even if it also has useDisk set to true;
// We will drop it to disk later if the memory store can't hold it.
val putSucceeded = if (level.deserialized) {
saveDeserializedValuesToMemoryStore(blockData().toInputStream())
} else {
saveSerializedValuesToMemoryStore(readToByteBuffer())
}
if (!putSucceeded && level.useDisk) { // The memory store could not hold the block, so write it to disk
logWarning(s"Persisting block $blockId to disk instead.")
saveToDiskStore()
}
} else if (level.useDisk) { // The storage level is disk-only
saveToDiskStore()
}
val putBlockStatus = getCurrentBlockStatus(blockId, info)
val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
if (blockWasSuccessfullyStored) {
// Now that the block is in either the memory or disk store,
// tell the master about it.
info.size = blockSize
if (tellMaster && info.tellMaster) {
reportBlockStatus(blockId, putBlockStatus)
}
addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
}
logDebug(s"Put block ${blockId} locally took ${Utils.getUsedTimeNs(startTimeNs)}")
if (level.replication > 1) {
// Wait for asynchronous replication to finish
try {
ThreadUtils.awaitReady(replicationFuture, Duration.Inf)
} catch {
case NonFatal(t) =>
throw new Exception("Error occurred while waiting for replication to finish", t)
}
}
if (blockWasSuccessfullyStored) {
None
} else {
Some(blockSize)
}
}.isEmpty
}
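As a usage illustration, putBytes is the entry point for data that is already serialized; TorrentBroadcast, for example, stores each serialized broadcast piece through it. A minimal hedged sketch (the TestBlockId name and buffer contents are invented for the example, and blockManager is assumed to be in scope):

import java.nio.ByteBuffer
import org.apache.spark.storage.{StorageLevel, TestBlockId}
import org.apache.spark.util.io.ChunkedByteBuffer

// Store 16 (zeroed) bytes under a test block id, serialized in memory,
// spilling to disk if the MemoryStore cannot hold them.
val bytes = new ChunkedByteBuffer(ByteBuffer.allocate(16))
val stored: Boolean = blockManager.putBytes[Array[Byte]](
  TestBlockId("putBytes-demo"), bytes, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)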
2.6 getRemoteBlock
This method fetches block data in serialized byte form from remote BlockManagers. (The helper getLocationBlockIds, shown first, asks the master for the locations of a batch of blocks.)
private def getLocationBlockIds(blockIds: Array[BlockId]): Array[Seq[BlockManagerId]] = {
val startTimeNs = System.nanoTime()
val locations = master.getLocations(blockIds).toArray
logDebug(s"Got multiple block location in ${Utils.getUsedTimeNs(startTimeNs)}")
locations
}
private[spark] def getRemoteBlock[T](
blockId: BlockId,
bufferTransformer: ManagedBuffer => T): Option[T] = {
logDebug(s"Getting remote block $blockId")
require(blockId != null, "BlockId is null")
// Because all the remote blocks are registered in driver, it is not necessary to ask
// all the slave executors to get block status.
val locationsAndStatusOption = master.getLocationsAndStatus(blockId, blockManagerId.host)
if (locationsAndStatusOption.isEmpty) {
logDebug(s"Block $blockId is unknown by block manager master")
None
} else {
val locationsAndStatus = locationsAndStatusOption.get
val blockSize = locationsAndStatus.status.diskSize.max(locationsAndStatus.status.memSize)
locationsAndStatus.localDirs.flatMap { localDirs =>
val blockDataOption =
readDiskBlockFromSameHostExecutor(blockId, localDirs, locationsAndStatus.status.diskSize)
val res = blockDataOption.flatMap { blockData =>
try {
Some(bufferTransformer(blockData))
} catch {
case NonFatal(e) =>
logDebug("Block from the same host executor cannot be opened: ", e)
None
}
}
logInfo(s"Read $blockId from the disk of a same host executor is " +
(if (res.isDefined) "successful." else "failed."))
res
}.orElse {
fetchRemoteManagedBuffer(blockId, blockSize, locationsAndStatus).map(bufferTransformer)
}
}
}
The implementation of fetchRemoteManagedBuffer is as follows:
private def fetchRemoteManagedBuffer(
blockId: BlockId,
blockSize: Long,
locationsAndStatus: BlockManagerMessages.BlockLocationsAndStatus): Option[ManagedBuffer] = {
// If the block size is above the threshold, we should pass our FileManager to
// BlockTransferService, which will leverage it to spill the block; if not, then passed-in
// null value means the block will be persisted in memory.
val tempFileManager = if (blockSize > maxRemoteBlockToMem) {
remoteBlockTempFileManager
} else {
null
}
var runningFailureCount = 0
var totalFailureCount = 0
val locations = sortLocations(locationsAndStatus.locations)
val maxFetchFailures = locations.size
var locationIterator = locations.iterator
while (locationIterator.hasNext) {
val loc = locationIterator.next()
logDebug(s"Getting remote block $blockId from $loc")
val data = try {
val buf = blockTransferService.fetchBlockSync(loc.host, loc.port, loc.executorId,
blockId.toString, tempFileManager)
if (blockSize > 0 && buf.size() == 0) {
throw new IllegalStateException("Empty buffer received for non empty block")
}
buf
} catch {
case NonFatal(e) =>
runningFailureCount += 1
totalFailureCount += 1
if (totalFailureCount >= maxFetchFailures) {
// Give up trying anymore locations. Either we've tried all of the original locations,
// or we've refreshed the list of locations from the master, and have still
// hit failures after trying locations from the refreshed list.
logWarning(s"Failed to fetch block after $totalFailureCount fetch failures. " +
s"Most recent failure cause:", e)
return None
}
logWarning(s"Failed to fetch remote block $blockId " +
s"from $loc (failed attempt $runningFailureCount)", e)
// If there is a large number of executors then locations list can contain a
// large number of stale entries causing a large number of retries that may
// take a significant amount of time. To get rid of these stale entries
// we refresh the block locations after a certain number of fetch failures
if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
locationIterator = sortLocations(master.getLocations(blockId)).iterator
logDebug(s"Refreshed locations from the driver " +
s"after ${runningFailureCount} fetch failures.")
runningFailureCount = 0
}
// This location failed, so we retry fetch from a different one by returning null here
null
}
if (data != null) {
// If the ManagedBuffer is a BlockManagerManagedBuffer, the disposal of the
// byte buffers backing it may need to be handled after reading the bytes.
// In this case, since we just fetched the bytes remotely, we do not have
// a BlockManagerManagedBuffer. The assert here is to ensure that this holds
// true (or the disposal is handled).
assert(!data.isInstanceOf[BlockManagerManagedBuffer])
return Some(data)
}
logDebug(s"The value of block $blockId is null")
}
logDebug(s"Block $blockId not found")
None
}
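The bufferTransformer parameter is what lets byte-oriented and value-oriented reads share getRemoteBlock. The two thin wrappers in the same class look roughly like this (simplified; the real code has extra compatibility branches):

// Simplified sketches of the wrappers around getRemoteBlock:
def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] =
  getRemoteBlock(blockId, (data: ManagedBuffer) =>
    new ChunkedByteBuffer(data.nioByteBuffer()))

private def getRemoteValues[T: ClassTag](blockId: BlockId): Option[BlockResult] =
  getRemoteBlock(blockId, (data: ManagedBuffer) => {
    val values = serializerManager.dataDeserializeStream(
      blockId, data.createInputStream())(implicitly[ClassTag[T]])
    new BlockResult(values, DataReadMethod.Network, data.size)
  })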
2.7 getStatus
This method retrieves the status of a block.
def getStatus(blockId: BlockId): Option[BlockStatus] = {
blockInfoManager.get(blockId).map { info =>
val memSize = if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
val diskSize = if (diskStore.contains(blockId)) diskStore.getSize(blockId) else 0L
BlockStatus(info.level, memSize = memSize, diskSize = diskSize)
}
}
2.8 downgradeLock
This method downgrades the write lock held by the current thread on a block to a read lock.
def downgradeLock(blockId: BlockId): Unit = {
blockInfoManager.downgradeLock(blockId)
}
2.9 releaseLock
This method releases the lock on a block held by the current thread.
def releaseLock(blockId: BlockId, taskContext: Option[TaskContext] = None): Unit = {
val taskAttemptId = taskContext.map(_.taskAttemptId())
// SPARK-27666. When a task completes, Spark automatically releases all the blocks locked
// by this task. We should not release any locks for a task that is already completed.
if (taskContext.isDefined && taskContext.get.isCompleted) {
logWarning(s"Task ${taskAttemptId.get} already completed, not releasing lock for $blockId")
} else {
blockInfoManager.unlock(blockId, taskAttemptId)
}
}
2.10 registerTask
Registers a task attempt with the BlockInfoManager.
def registerTask(taskAttemptId: Long): Unit = {
blockInfoManager.registerTask(taskAttemptId) // Simply delegates to BlockInfoManager's registerTask method
}
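Taken together with downgradeLock and releaseLock, the per-task lock lifecycle looks roughly as follows (a conceptual sketch, not user-facing API; blockInfoManager is internal to the storage package):

// Conceptual sketch of the per-task block lock lifecycle:
blockManager.registerTask(taskAttemptId)      // before the task starts running
// ... inside the task, writes take the write lock via doPut/lockNewBlockForWriting,
// reads take the read lock via getLocalValues/lockForReading ...
blockManager.downgradeLock(blockId)           // keep reading after a successful put
blockManager.releaseLock(blockId)             // done with this particular block
// When the task completes, any locks it still holds are released in bulk:
blockManager.releaseAllLocksForTask(taskAttemptId)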
To be continued... This post got too long, so it is split into two parts.