The sequence diagram below shows how Spark caches partition data after RDD.persist is called (the diagram can be enlarged for viewing).
In this article, the StorageLevel passed to RDD.persist(StorageLevel) is MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2).
In other words, when a partition is cached it goes to memory if enough memory is available and may spill to disk otherwise; it is never cached in off-heap memory; the cached data is kept in serialized form; and the cached block is replicated to one other remote node. The remote node receives the replica through the default Netty service (NettyBlockRpcServer).
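For readers less familiar with the StorageLevel constructor flags, the sketch below spells out what each argument of this level means (a minimal illustration using the public StorageLevel factory rather than the private constructor):
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK_SER_2: StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
val level = StorageLevel(
  useDisk = true,       // spill to disk when there is not enough memory
  useMemory = true,     // prefer caching the partition in memory
  useOffHeap = false,   // do not cache in off-heap memory
  deserialized = false, // keep the cached data in serialized form
  replication = 2)      // keep one extra replica on a remote node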
On the driver, calling RDD.persist records how the RDD's partition data should be cached and adds the RDD to the SparkContext.persistentRdds HashMap. The relevant code is:
def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
throw new UnsupportedOperationException(
"Cannot change storage level of an RDD after it was already assigned a level")
}
sc.persistRDD(this) // SparkContext records which RDDs have been cached
// Register the RDD with the ContextCleaner for automatic GC-based cleanup
sc.cleaner.foreach(_.registerRDDForCleanup(this))
storageLevel = newLevel // set the RDD's storageLevel; the executor's BlockManager uses it to decide how to store partition data
this // return the current RDD
}
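As a usage sketch (the input path and RDD name are hypothetical), persist only records the storage level on the driver; the partitions are actually cached the first time an action computes them:
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://namenode:8020/input/data.txt") // hypothetical input
lines.persist(StorageLevel.MEMORY_AND_DISK_SER_2) // only sets storageLevel and registers the RDD on the driver
lines.count() // first action: partitions are computed and cached through CacheManager.getOrCompute
lines.count() // second action: partitions are served from the BlockManager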
A ShuffleMapTask or ResultTask computes a partition's data by calling RDD.iterator. If RDD.persist was called on the driver, the call goes through CacheManager.getOrCompute. If the partition has already been cached, its data is read from the BlockManager by block ID (built from the RDD id and the partition index); otherwise the block is computed and CacheManager.putInBlockManager is called to put the newly computed block into the BlockManager. The code is:
def getOrCompute[T](
rdd: RDD[T],
partition: Partition,
context: TaskContext,
storageLevel: StorageLevel): Iterator[T] = {
val key = RDDBlockId(rdd.id, partition.index)
logDebug(s"Looking for partition $key")
blockManager.get(key) match {
/*
 * If this block has already been cached, read it from the BlockManager.
 */
case Some(blockResult) =>
// Partition is already materialized, so just return its values
val existingMetrics = context.taskMetrics
.getInputMetricsForReadMethod(blockResult.readMethod)
existingMetrics.incBytesRead(blockResult.bytes)
val iter = blockResult.data.asInstanceOf[Iterator[T]]
new InterruptibleIterator[T](context, iter) {
override def next(): T = {
existingMetrics.incRecordsRead(1)
delegate.next()
}
}
case None =>
// Acquire a lock for loading this partition
// If another thread already holds the lock, wait for it to finish and return its results
/*
 * If this block has not been cached yet, compute the partition's data and store it in the BlockManager.
 */
val storedValues = acquireLockForPartition[T](key)
if (storedValues.isDefined) {
return new InterruptibleIterator[T](context, storedValues.get)
}
// Otherwise, we have to load the partition ourselves
try {
logInfo(s"Partition $key not found, computing it")
val computedValues = rdd.computeOrReadCheckpoint(partition, context) // compute the partition's data
// If the task is running locally, do not persist the result
if (context.isRunningLocally) {
return computedValues
}
// Otherwise, cache the values and keep track of any updates in block statuses
val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks) // cache the partition's data
val metrics = context.taskMetrics
val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
new InterruptibleIterator(context, cachedValues)
} finally {
loading.synchronized {
loading.remove(key)
loading.notifyAll()
}
}
}
}
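The block ID used above is an RDDBlockId, whose string name follows the pattern rdd_<rddId>_<splitIndex>; this name is the key under which the partition is stored in and read from the BlockManager. A small illustration:
import org.apache.spark.storage.RDDBlockId

val key = RDDBlockId(rddId = 5, splitIndex = 3) // same as RDDBlockId(rdd.id, partition.index) above
println(key.name) // prints "rdd_5_3"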
CacheManager.putInBlockManager stores a partition's data (an Iterator) to disk or memory. If RDD.storageLevel is disk-only, BlockManager.putIterator is called to serialize the Iterator and write it directly to a disk file. If the data should go to memory, MemoryStore.unrollSafely is first called to unroll the Iterator into an array, and BlockManager.putArray then stores that array in the BlockManager. Unrolling the Iterator into an array can consume a lot of memory; if the partition is too large to fit in the currently free memory, and there is still not enough room even after evicting other RDDs' blocks, then, provided the storage level allows disk, the partition is written to disk instead. The code of CacheManager.putInBlockManager is:
private def putInBlockManager[T](
key: BlockId,
values: Iterator[T],
level: StorageLevel,
updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {
// effectiveStorageLevel is the storage level actually used by this putInBlockManager call
val putLevel = effectiveStorageLevel.getOrElse(level)
if (!putLevel.useMemory) {
/*
* This RDD is not to be cached in memory, so we can just pass the computed values as an
* iterator directly to the BlockManager rather than first fully unrolling it in memory.
*/
updatedBlocks ++=
blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
blockManager.get(key) match {
case Some(v) => v.data.asInstanceOf[Iterator[T]]
case None =>
logInfo(s"Failure to store $key")
throw new BlockException(key, s"Block manager failed to return cached value for $key!")
}
} else {
/*
* This RDD is to be cached in memory. In this case we cannot pass the computed values
* to the BlockManager as an iterator and expect to read it back later. This is because
* we may end up dropping a partition from memory store before getting it back.
*
* In addition, we must be careful to not unroll the entire partition in memory at once.
* Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
* single partition. Instead, we unroll the values cautiously, potentially aborting and
* dropping the partition to disk if applicable.
* First unroll the partition data (an Iterator) into an array.
*/
blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
case Left(arr) =>
// We have successfully unrolled the entire partition, so cache it in memory
updatedBlocks ++=
blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
arr.iterator.asInstanceOf[Iterator[T]]
case Right(it) =>
// There is not enough space to cache this partition in memory
val returnValues = it.asInstanceOf[Iterator[T]]
if (putLevel.useDisk) {
logWarning(s"Persisting partition $key to disk instead.")
val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
useOffHeap = false, deserialized = false, putLevel.replication)
/*
 * The partition is too large: memory ran out while unrolling it, so store the whole partition on disk.
 */
putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
} else {
returnValues
}
}
}
}
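With the MEMORY_AND_DISK_SER_2 level used in this article, the diskOnlyLevel built for the recursive call above keeps the replication factor of 2 but drops the memory flag, i.e. roughly:
import org.apache.spark.storage.StorageLevel

// What diskOnlyLevel evaluates to when putLevel is MEMORY_AND_DISK_SER_2 (replication = 2)
val diskOnlyLevel = StorageLevel(
  useDisk = true, useMemory = false, useOffHeap = false,
  deserialized = false, replication = 2)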
MemoryStore.unrollSafely first unrolls the partition's data (an Iterator) into a SizeTrackingVector and then returns it as an array. If the partition is large, it repeatedly calls MemoryStore.reserveUnrollMemoryForThisThread to request more memory; when the currently free memory cannot satisfy the request, reserveUnrollMemoryForThisThread fails, and MemoryStore.ensureFreeSpace is then called to run the eviction algorithm, dropping blocks that qualify for eviction to free up space before retrying the allocation. The memory that unrollSafely reserves for the SizeTrackingVector is still needed later when the data is put into the BlockManager, so its finally block moves the reserved amount from unrollMemoryMap into the pendingUnrollMemoryMap HashMap; that memory is released by MemoryStore.releasePendingUnrollMemoryForThisThread only after tryToPut has successfully stored the unrolled array in the BlockManager. The code of MemoryStore.unrollSafely is:
// Unroll the values Iterator and put its contents into a vector. If an array is returned, the memory reserved here for the Iterator is released after the block has been stored.
def unrollSafely(
blockId: BlockId,
values: Iterator[Any],
droppedBlocks: ArrayBuffer[(BlockId, BlockStatus)])
: Either[Array[Any], Iterator[Any]] = {
// Number of elements unrolled so far
var elementsUnrolled = 0
// Whether there is still enough memory for us to continue unrolling this block
var keepUnrolling = true
// Initial per-thread memory to request for unrolling blocks (bytes). Exposed for testing.
val initialMemoryThreshold = unrollMemoryThreshold
// How often to check whether we need to request more memory
val memoryCheckPeriod = 16
// Memory currently reserved by this thread for this particular unrolling operation
var memoryThreshold = initialMemoryThreshold
// Memory to request as a multiple of current vector size
val memoryGrowthFactor = 1.5
// Previous unroll memory held by this thread, for releasing later (only at the very end)
val previousMemoryReserved = currentUnrollMemoryForThisThread
// Underlying vector for unrolling the block
var vector = new SizeTrackingVector[Any]
// Request enough memory to begin unrolling
keepUnrolling = reserveUnrollMemoryForThisThread(initialMemoryThreshold)
if (!keepUnrolling) {
logWarning(s"Failed to reserve initial memory threshold of " +
s"${Utils.bytesToString(initialMemoryThreshold)} for computing block $blockId in memory.")
}
// Unroll this block safely, checking whether we have exceeded our threshold periodically
try {
while (values.hasNext && keepUnrolling) {
vector += values.next()
if (elementsUnrolled % memoryCheckPeriod == 0) {
// If our vector's size has exceeded the threshold, request more memory
val currentSize = vector.estimateSize()
if (currentSize >= memoryThreshold) {
val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
// Hold the accounting lock, in case another thread concurrently puts a block that
// takes up the unrolling space we just ensured here
accountingLock.synchronized {
// Request amountToRequest bytes from the currently free memory
if (!reserveUnrollMemoryForThisThread(amountToRequest)) {
// If the first request is not granted, try again after ensuring free space
// If there is still not enough space, give up and drop the partition
val spaceToEnsure = maxUnrollMemory - currentUnrollMemory
if (spaceToEnsure > 0) {
// Run the eviction algorithm: drop blocks that qualify for eviction to free up memory
val result = ensureFreeSpace(blockId, spaceToEnsure)
droppedBlocks ++= result.droppedBlocks
}
keepUnrolling = reserveUnrollMemoryForThisThread(amountToRequest)
}
}
// New threshold is currentSize * memoryGrowthFactor
memoryThreshold += amountToRequest
}
}
elementsUnrolled += 1
}
if (keepUnrolling) {
// We successfully unrolled the entirety of this block
Left(vector.toArray)
} else {
// We ran out of space while unrolling the values for this block
logUnrollFailureMessage(blockId, vector.estimateSize())
Right(vector.iterator ++ values)
}
} finally {
// If we return an array, the values returned will later be cached in `tryToPut`.
// In this case, we should release the memory after we cache the block there.
// Otherwise, if we return an iterator, we release the memory reserved here
// later when the task finishes.
if (keepUnrolling) {
accountingLock.synchronized {
val amountToRelease = currentUnrollMemoryForThisThread - previousMemoryReserved
/*
 * Move the reserved amount from unrollMemoryMap into the pendingUnrollMemoryMap HashMap; this is the memory reserved by this successful unrollSafely call.
 * It is released only after tryToPut has successfully stored the unrolled array in the BlockManager.
 */
releaseUnrollMemoryForThisThread(amountToRelease)
reservePendingUnrollMemoryForThisThread(amountToRelease)
}
}
}
}
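To make the growth policy concrete, here is a small standalone sketch (with invented sizes) of how the reserved unroll memory grows: the size is checked every memoryCheckPeriod = 16 elements, and whenever the vector's estimated size reaches the current threshold, enough extra memory is requested to push the threshold to about 1.5x the current size:
// Hypothetical walk-through of the unroll-memory growth in unrollSafely
var memoryThreshold = 1L << 20 // initial threshold, 1 MB by default (spark.storage.unrollMemoryThreshold)
val memoryGrowthFactor = 1.5
for (currentSize <- Seq(2L << 20, 5L << 20, 12L << 20)) { // estimated vector sizes at successive check points
  if (currentSize >= memoryThreshold) {
    val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
    memoryThreshold += amountToRequest // new threshold == currentSize * 1.5
    println(s"currentSize=$currentSize requested=$amountToRequest newThreshold=$memoryThreshold")
  }
}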
MemoryStore.ensureFreeSpace implements the eviction algorithm: blocks cached in memory by other RDDs are the eviction candidates. It first computes the full set of blocks that must be dropped to satisfy this allocation and collects them in an array, then drops each of them by calling BlockManager.dropFromMemory. The code of MemoryStore.ensureFreeSpace is:
private def ensureFreeSpace(
blockIdToAdd: BlockId,
space: Long): ResultWithDroppedBlocks = {
logInfo(s"ensureFreeSpace($space) called with curMem=$currentMemory, maxMem=$maxMemory")
val droppedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
if (space > maxMemory) {
logInfo(s"Will not store $blockIdToAdd as it is larger than our memory limit")
return ResultWithDroppedBlocks(success = false, droppedBlocks)
}
// Take into account the amount of memory currently occupied by unrolling blocks
// and minus the pending unroll memory for that block on current thread.
val threadId = Thread.currentThread().getId
val actualFreeMemory = freeMemory - currentUnrollMemory +
pendingUnrollMemoryMap.getOrElse(threadId, 0L)
if (actualFreeMemory < space) {
val rddToAdd = getRddId(blockIdToAdd)
val selectedBlocks = new ArrayBuffer[BlockId]
var selectedMemory = 0L
// This is synchronized to ensure that the set of entries is not changed
// (because of getValue or getBytes) while traversing the iterator, as that
// can lead to exceptions.
/*
 * Collect, into the selectedBlocks array, all blocks that must be evicted to satisfy this allocation.
 * Only blocks belonging to other RDDs may be evicted.
 */
entries.synchronized {
val iterator = entries.entrySet().iterator()
while (actualFreeMemory + selectedMemory < space && iterator.hasNext) {
val pair = iterator.next()
val blockId = pair.getKey
/*
 * Only data cached by other RDDs is a candidate for dropping.
 */
if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
selectedBlocks += blockId
selectedMemory += pair.getValue.size
}
}
}
if (actualFreeMemory + selectedMemory >= space) {
logInfo(s"${selectedBlocks.size} blocks selected for dropping")
for (blockId <- selectedBlocks) {
val entry = entries.synchronized { entries.get(blockId) }
// This should never be null as only one thread should be dropping
// blocks and removing entries. However the check is still here for
// future safety.
if (entry != null) {
val data = if (entry.deserialized) {
Left(entry.value.asInstanceOf[Array[Any]])
} else {
Right(entry.value.asInstanceOf[ByteBuffer].duplicate())
}
/*
 * Drop the other RDD's cached data from MemoryStore.entries.
 */
val droppedBlockStatus = blockManager.dropFromMemory(blockId, data)
droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
}
}
return ResultWithDroppedBlocks(success = true, droppedBlocks)
} else {
logInfo(s"Will not store $blockIdToAdd as it would require dropping another block " +
"from the same RDD")
return ResultWithDroppedBlocks(success = false, droppedBlocks)
}
}
ResultWithDroppedBlocks(success = true, droppedBlocks)
}
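A hypothetical numeric sketch of the decision above (all sizes invented for illustration): the request succeeds only if the adjusted free memory plus the total size of the other-RDD blocks selected for eviction covers the requested space.
// Hypothetical numbers, only to illustrate ensureFreeSpace's arithmetic
val space = 300L * 1024 * 1024 // memory needed for the new block: 300 MB
val freeMemory = 120L * 1024 * 1024 // currently free storage memory
val currentUnrollMemory = 40L * 1024 * 1024 // memory reserved by all threads for unrolling
val pendingUnrollForThisThread = 20L * 1024 * 1024 // pending unroll memory of the current thread
val actualFreeMemory = freeMemory - currentUnrollMemory + pendingUnrollForThisThread // 100 MB
val droppableFromOtherRdds = 250L * 1024 * 1024 // total size of cached blocks belonging to other RDDs
val enough = actualFreeMemory + droppableFromOtherRdds >= space // true: 350 MB >= 300 MB, so evict and store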
BlockManager.dropFromMemory evicts a block from memory. If the block's storage level allows disk, it first calls DiskStore.putArray or DiskStore.putBytes to write the block to disk; it then removes the block from the MemoryStore.entries LinkedHashMap, and finally calls BlockManager.reportBlockStatus to send the block update to the driver. The source of BlockManager.dropFromMemory is:
def dropFromMemory(
blockId: BlockId,
data: () => Either[Array[Any], ByteBuffer]): Option[BlockStatus] = {
logInfo(s"Dropping block $blockId from memory")
val info = blockInfo.get(blockId).orNull
// If the block has not already been dropped
if (info != null) {
info.synchronized {
// required ? As of now, this will be invoked only for blocks which are ready
// But in case this changes in future, adding for consistency sake.
if (!info.waitForReady()) {
// If we get here, the block write failed.
logWarning(s"Block $blockId was marked as failure. Nothing to drop")
return None
} else if (blockInfo.get(blockId).isEmpty) {
logWarning(s"Block $blockId was already dropped.")
return None
}
var blockIsUpdated = false
val level = info.level // level is the same as the RDD's storageLevel
// Drop to disk, if storage level requires
if (level.useDisk && !diskStore.contains(blockId)) {
logInfo(s"Writing block $blockId to disk")
// If the level requires it, save the data to disk first
data() match {
case Left(elements) =>
diskStore.putArray(blockId, elements, level, returnValues = false)
case Right(bytes) =>
diskStore.putBytes(blockId, bytes, level)
}
blockIsUpdated = true
}
// Actually drop from memory store
val droppedMemorySize =
if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
/*
 * Remove the block from MemoryStore.entries.
 */
val blockIsRemoved = memoryStore.remove(blockId)
if (blockIsRemoved) {
blockIsUpdated = true
} else {
logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
}
val status = getCurrentBlockStatus(blockId, info)
if (info.tellMaster) {
// Send the block update to the driver
reportBlockStatus(blockId, info, status, droppedMemorySize)
}
if (!level.useDisk) {
// The block is completely gone from this node; forget it so we can put() it again later.
blockInfo.remove(blockId)
}
if (blockIsUpdated) {
return Some(status)
}
}
}
None
}
BlockManager.reportBlockStatus calls BlockManagerMaster.updateBlockInfo, which sends an UpdateBlockInfo message to the driver's BlockManagerMasterEndpoint; on receiving it, BlockManagerMasterEndpoint.receiveAndReply calls BlockManagerMasterEndpoint.updateBlockInfo to update the block's metadata.
CacheManager.putInBlockManager first unrolls the Iterator into an array via MemoryStore.unrollSafely and then stores that array in the BlockManager via BlockManager.putArray. BlockManager.putArray delegates to BlockManager.doPut, which does three things (sketched after this list):
1. Call MemoryStore.putArray to store the block to be cached in the MemoryStore.entries hash map.
2. Call BlockManager.replicate to replicate the cached block to a remote BlockManager.
3. Call BlockManager.reportBlockStatus to report the block update to the driver once the block has been stored in memory.
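A condensed, hypothetical sketch of that three-step flow (stub functions stand in for the real BlockManager/MemoryStore calls; this is not the actual doPut body):
// Stubs standing in for the real calls named above
def putLocally(blockId: String, values: Array[Any]): Long = 0L // MemoryStore.putArray -> tryToPut
def replicateToPeer(blockId: String, bytes: Array[Byte]): Unit = () // BlockManager.replicate
def reportToDriver(blockId: String, memSize: Long): Unit = () // BlockManager.reportBlockStatus

def doPutSketch(blockId: String, values: Array[Any], replication: Int): Unit = {
  val memSize = putLocally(blockId, values) // 1. store the block in MemoryStore.entries
  if (replication > 1) {
    replicateToPeer(blockId, Array.empty[Byte]) // 2. ship the serialized block to a remote BlockManager
  }
  reportToDriver(blockId, memSize) // 3. report the block update to the driver
}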
MemoryStore.putArray ultimately calls MemoryStore.tryToPut to store the data in the MemoryStore.entries map. Before inserting into MemoryStore.entries, tryToPut calls MemoryStore.ensureFreeSpace to run the eviction algorithm, dropping blocks that qualify for eviction to free up memory before allocating. If the allocation succeeds, the block is added to MemoryStore.entries; if the block is too large and the allocation fails, BlockManager.dropFromMemory is called to write the current block to disk. Finally, MemoryStore.releasePendingUnrollMemoryForThisThread removes the block's reserved unroll memory from pendingUnrollMemoryMap and releases it. The source of MemoryStore.tryToPut is:
private def tryToPut(
blockId: BlockId,
value: () => Any,
size: Long,
deserialized: Boolean): ResultWithDroppedBlocks = {
/* TODO: Its possible to optimize the locking by locking entries only when selecting blocks
* to be dropped. Once the to-be-dropped blocks have been selected, and lock on entries has
* been released, it must be ensured that those to-be-dropped blocks are not double counted
* for freeing up more space for another block that needs to be put. Only then the actually
* dropping of blocks (and writing to disk if necessary) can proceed in parallel. */
var putSuccess = false
val droppedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
accountingLock.synchronized {
// Call ensureFreeSpace to make sure there is enough memory for the data; this may evict other RDDs' blocks from memory
val freeSpaceResult = ensureFreeSpace(blockId, size)
val enoughFreeSpace = freeSpaceResult.success
droppedBlocks ++= freeSpaceResult.droppedBlocks
if (enoughFreeSpace) {
// Store the block data into entries
val entry = new MemoryEntry(value(), size, deserialized)
entries.synchronized {
entries.put(blockId, entry)
currentMemory += size
}
val valuesOrBytes = if (deserialized) "values" else "bytes"
logInfo("Block %s stored as %s in memory (estimated size %s, free %s)".format(
blockId, valuesOrBytes, Utils.bytesToString(size), Utils.bytesToString(freeMemory)))
putSuccess = true
} else {
// Tell the block manager that we couldn't put it in memory so that it can drop it to
// disk if the block allows disk storage.
lazy val data = if (deserialized) {
Left(value().asInstanceOf[Array[Any]])
} else {
Right(value().asInstanceOf[ByteBuffer].duplicate())
}
/*
 * The block to cache is too large; here BlockManager.dropFromMemory only writes the block to disk,
 * because the current block has not been added to MemoryStore.entries yet.
 */
val droppedBlockStatus = blockManager.dropFromMemory(blockId, () => data)
droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
}
// Release the unroll memory used because we no longer need the underlying Array
// Release the unroll memory reserved for the block that backs value
releasePendingUnrollMemoryForThisThread()
}
ResultWithDroppedBlocks(putSuccess, droppedBlocks)
}
BlockManager.replicate copies the block to other nodes. It first calls BlockManager.getPeers, which sends a GetPeers message to the driver's BlockManagerMasterEndpoint; on receiving it, the driver calls BlockManagerMasterEndpoint.getPeers and returns, as a sequence, all BlockManagerIds in the cluster except the driver's and that of the sender.
The BlockManager.getPeers method is:
private def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = {
peerFetchLock.synchronized {
val cachedPeersTtl = conf.getInt("spark.storage.cachedPeersTtl", 60 * 1000) // milliseconds
val timeout = System.currentTimeMillis - lastPeerFetchTime > cachedPeersTtl
if (cachedPeers == null || forceFetch || timeout) {
cachedPeers = master.getPeers(blockManagerId).sortBy(_.hashCode)
lastPeerFetchTime = System.currentTimeMillis
logDebug("Fetched peers from master: " + cachedPeers.mkString("[", ",", "]"))
}
cachedPeers // all BlockManagerIds in the cluster except the driver's and the requesting node's
  }
}
The BlockManagerMasterEndpoint.getPeers method is:
private def getPeers(blockManagerId: BlockManagerId): Seq[BlockManagerId] = {
val blockManagerIds = blockManagerInfo.keySet
if (blockManagerIds.contains(blockManagerId)) {
// Filter out the driver's BlockManagerId and the sender's own BlockManagerId
blockManagerIds.filterNot { _.isDriver }.filterNot { _ == blockManagerId }.toSeq
} else {
Seq.empty // if the sending node has not registered yet, return an empty sequence
}
}
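For illustration (hypothetical executor IDs), with BlockManagers registered for the driver plus executors 1-3, a GetPeers request from executor 1 returns the other two executors:
// Hypothetical stand-in for BlockManagerId, just to illustrate the filtering in getPeers
case class Id(executorId: String, host: String) { def isDriver: Boolean = executorId == "driver" }

val registered = Set(Id("driver", "node0"), Id("1", "node1"), Id("2", "node2"), Id("3", "node3"))
val requester = Id("1", "node1")
val peers = registered.filterNot(_.isDriver).filterNot(_ == requester).toSeq
// peers contains Id("2", "node2") and Id("3", "node3"), in an order determined by the Set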
To prevent the replica from being replicated again after it reaches the remote node (which would cascade indefinitely), BlockManager.replicate sends the block with the same storage level but the replication factor reset to 1 (here, effectively MEMORY_AND_DISK_SER). It calls getRandomPeer to pick one peer at random from the list returned by BlockManagerMasterEndpoint.getPeers as the replication target, and then calls NettyBlockTransferService.uploadBlockSync to send the block to that peer. The code of BlockManager.replicate is:
private def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel): Unit = {
val maxReplicationFailures = conf.getInt("spark.storage.maxReplicationFailures", 1)
val numPeersToReplicateTo = level.replication - 1
val peersForReplication = new ArrayBuffer[BlockManagerId]
val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
/*
 * The level sent to the peer is identical to the original except that the replication factor is set to 1.
 */
val tLevel = StorageLevel(
level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, 1)
val startTime = System.currentTimeMillis
val random = new Random(blockId.hashCode)
var replicationFailed = false
var failures = 0
var done = false
// Get cached list of peers
peersForReplication ++= getPeers(forceFetch = false)
// Get a random peer. Note that this selection of a peer is deterministic on the block id.
// So assuming the list of peers does not change and no replication failures,
// if there are multiple attempts in the same node to replicate the same block,
// the same set of peers will be selected.
def getRandomPeer(): Option[BlockManagerId] = {
// If replication had failed, then force update the cached list of peers and remove the peers
// that have been already used
if (replicationFailed) {
peersForReplication.clear()
peersForReplication ++= getPeers(forceFetch = true)
peersForReplication --= peersReplicatedTo
peersForReplication --= peersFailedToReplicateTo
}
if (!peersForReplication.isEmpty) {
Some(peersForReplication(random.nextInt(peersForReplication.size)))
} else {
None
}
}
// One by one choose a random peer and try uploading the block to it
// If replication fails (e.g., target peer is down), force the list of cached peers
// to be re-fetched from driver and then pick another random peer for replication. Also
// temporarily black list the peer for which replication failed.
//
// This selection of a peer and replication is continued in a loop until one of the
// following 3 conditions is fulfilled:
// (i) specified number of peers have been replicated to
// (ii) too many failures in replicating to peers
// (iii) no peer left to replicate to
//
while (!done) {
// Pick a peer at random as the replication target (for load balancing)
getRandomPeer() match {
case Some(peer) =>
try {
val onePeerStartTime = System.currentTimeMillis
data.rewind()
logTrace(s"Trying to replicate $blockId of ${data.limit()} bytes to $peer")
/*
 * Send the block to the peer's NettyBlockTransferService (or NioBlockTransferService)
 * through the local NettyBlockTransferService (or NioBlockTransferService).
 */
blockTransferService.uploadBlockSync(
peer.host, peer.port, peer.executorId, blockId, new NioManagedBuffer(data), tLevel)
logTrace(s"Replicated $blockId of ${data.limit()} bytes to $peer in %s ms"
.format(System.currentTimeMillis - onePeerStartTime))
peersReplicatedTo += peer
peersForReplication -= peer
replicationFailed = false
if (peersReplicatedTo.size == numPeersToReplicateTo) { // stop once the required number of replicas has been made
done = true // specified number of peers have been replicated to
}
} catch {
case e: Exception =>
logWarning(s"Failed to replicate $blockId to $peer, failure #$failures", e)
failures += 1
replicationFailed = true
peersFailedToReplicateTo += peer
if (failures > maxReplicationFailures) { // too many failures in replicating to peers
done = true
}
}
case None => // no peer left to replicate to
done = true
}
}
val timeTakeMs = (System.currentTimeMillis - startTime)
logDebug(s"Replicating $blockId of ${data.limit()} bytes to " +
s"${peersReplicatedTo.size} peer(s) took $timeTakeMs ms")
if (peersReplicatedTo.size < numPeersToReplicateTo) {
logWarning(s"Block $blockId replicated to only " +
s"${peersReplicatedTo.size} peer(s) instead of $numPeersToReplicateTo peers")
}
}
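Because the Random instance is seeded with blockId.hashCode, repeated replication attempts for the same block pick the same sequence of peers as long as the cached peer list does not change and no failures occur, exactly as the comment in the code notes. A small standalone illustration:
import scala.util.Random

val seed = "rdd_5_3".hashCode // stands in for blockId.hashCode
val peers = Vector("exec-1", "exec-2", "exec-3") // hypothetical peer list
val first = new Random(seed).nextInt(peers.size)
val second = new Random(seed).nextInt(peers.size)
assert(first == second) // the same block id and peer list always yield the same choice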
NettyBlockTransferService.uploadBlockSync calls NettyBlockTransferService.uploadBlock, which calls TransportClient.sendRpc to send an UploadBlock RPC message carrying the block data to the NettyBlockRpcServer on the remote node. The source of NettyBlockTransferService.uploadBlock is:
override def uploadBlock(
hostname: String,
port: Int,
execId: String,
blockId: BlockId,
blockData: ManagedBuffer,
level: StorageLevel): Future[Unit] = {
val result = Promise[Unit]()
val client = clientFactory.createClient(hostname, port)
// StorageLevel is serialized as bytes using our JavaSerializer. Everything else is encoded
// using our binary protocol.
val levelBytes = serializer.newInstance().serialize(level).array()
// Convert or copy nio buffer into array in order to serialize it.
val nioBuffer = blockData.nioByteBuffer()
val array = if (nioBuffer.hasArray) {
nioBuffer.array()
} else {
val data = new Array[Byte](nioBuffer.remaining())
nioBuffer.get(data)
data
}
// Send the UploadBlock RPC message; the array argument is the block data to send
client.sendRpc(new UploadBlock(appId, execId, blockId.toString, levelBytes, array).toByteArray,
new RpcResponseCallback {
override def onSuccess(response: Array[Byte]): Unit = {
logTrace(s"Successfully uploaded block $blockId")
result.success()
}
override def onFailure(e: Throwable): Unit = {
logError(s"Error while uploading block $blockId", e)
result.failure(e)
}
})
result.future
}
The NettyBlockRpcServer service that receives the replicated block on the remote node is the same service used by shuffle; for how that service is set up, see the article "Spark Shuffle系列-----3. spark shuffle reduce操作RDD partition的生成".
When NettyBlockRpcServer.receive gets the UploadBlock message, it calls BlockManager.putBlockData to enter the block-storage path, and from there the remote node's BlockManager takes over storing the block data.
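As a rough sketch of that receive path (modeled on the Spark 1.x NettyBlockRpcServer; exact details vary between versions, so treat this as an outline of the relevant pattern-match branch rather than runnable standalone code): the server deserializes the StorageLevel from the message metadata, wraps the uploaded bytes in a ManagedBuffer, and hands both to BlockManager.putBlockData:
case uploadBlock: UploadBlock =>
  // The StorageLevel was serialized with a JavaSerializer on the sending side
  val level: StorageLevel =
    serializer.newInstance().deserialize(ByteBuffer.wrap(uploadBlock.metadata))
  val data = new NioManagedBuffer(ByteBuffer.wrap(uploadBlock.blockData))
  // Store the replica locally; from here on the normal BlockManager put path applies
  blockManager.putBlockData(BlockId(uploadBlock.blockId), data, level)
  responseContext.onSuccess(new Array[Byte](0))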