Spark Storage Series ------ 1. How Spark RDD.persist Stores Data

The sequence diagram below illustrates how Spark caches partition data after RDD.persist is called.


In this article, the StorageLevel argument passed to RDD.persist(StorageLevel) is MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2).

In other words, when a partition is cached it goes to memory if enough memory is available; otherwise it may be spilled to disk. Partition data is never cached off-heap, the cached data is serialized, and each cached block is replicated to one other remote node. The remote node receives the replica through the default Netty-based service (NettyBlockRpcServer).
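As a quick check of what those five constructor arguments mean, here is a minimal sketch (assuming a Spark 1.x dependency on the classpath; the object name is made up for illustration) that builds the same level through the public StorageLevel factory:

import org.apache.spark.storage.StorageLevel

// Illustration only: construct MEMORY_AND_DISK_SER_2 from its five flags.
object StorageLevelFlagsDemo {
  def main(args: Array[String]): Unit = {
    val level = StorageLevel(useDisk = true, useMemory = true,
      useOffHeap = false, deserialized = false, replication = 2)
    // Both lines print the same flags: disk allowed, memory allowed,
    // no off-heap storage, serialized data, 2 replicas.
    println(level)
    println(StorageLevel.MEMORY_AND_DISK_SER_2)
  }
}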


On the driver, calling RDD.persist records how the RDD's partitions should be cached and registers the RDD in the SparkContext.persistentRdds HashMap. The relevant code is:

def persist(newLevel: StorageLevel): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    sc.persistRDD(this)  // record in the SparkContext which RDDs have been cached
    // Register the RDD with the ContextCleaner for automatic GC-based cleanup
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    storageLevel = newLevel  // set the RDD's storageLevel; the executor's BlockManager uses it to decide how to store partition data
    this  // return this RDD
  }
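For context, here is a minimal driver-side usage sketch (the app name, master URL and data are placeholders). Note that persist only records the storage level; the partitions are actually cached when the first action computes them:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistUsageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // Only sets rdd.storageLevel and registers the RDD in SparkContext.persistentRdds.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

    // The first action computes each partition through CacheManager.getOrCompute,
    // which is where the partitions actually get cached.
    println(rdd.count())
    sc.stop()
  }
}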

A ShuffleMapTask or ResultTask computes a partition by calling RDD.iterator. If RDD.persist was called on the driver, the call goes through CacheManager.getOrCompute. If the partition has already been cached, its data is read from the BlockManager using a block ID built from the RDD id and the partition index; otherwise the partition is computed and CacheManager.putInBlockManager stores the newly produced block in the BlockManager. The code is:

def getOrCompute[T](
      rdd: RDD[T],
      partition: Partition,
      context: TaskContext,
      storageLevel: StorageLevel): Iterator[T] = {

    val key = RDDBlockId(rdd.id, partition.index)
    logDebug(s"Looking for partition $key")
    blockManager.get(key) match {
      /*
       * If this block has already been cached, read it from the BlockManager.
       */
      case Some(blockResult) =>
        // Partition is already materialized, so just return its values
        val existingMetrics = context.taskMetrics
          .getInputMetricsForReadMethod(blockResult.readMethod)
        existingMetrics.incBytesRead(blockResult.bytes)

        val iter = blockResult.data.asInstanceOf[Iterator[T]]
        new InterruptibleIterator[T](context, iter) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      case None =>
        // Acquire a lock for loading this partition
        // If another thread already holds the lock, wait for it to finish and return its results
        /*
         * The block is not cached yet: compute this partition's data and store it in the BlockManager.
         */
        val storedValues = acquireLockForPartition[T](key)
        if (storedValues.isDefined) {
          return new InterruptibleIterator[T](context, storedValues.get)
        }

        // Otherwise, we have to load the partition ourselves
        try {
          logInfo(s"Partition $key not found, computing it")
          val computedValues = rdd.computeOrReadCheckpoint(partition, context)  // compute the partition's data

          // If the task is running locally, do not persist the result
          if (context.isRunningLocally) {
            return computedValues
          }

          // Otherwise, cache the values and keep track of any updates in block statuses
          val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
          val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)  // cache the partition's data
          val metrics = context.taskMetrics
          val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
          metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
          new InterruptibleIterator(context, cachedValues)

        } finally {
          loading.synchronized {
            loading.remove(key)
            loading.notifyAll()
          }
        }
    }
  }
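The cache key above follows a simple naming scheme. The sketch below only imitates Spark's real org.apache.spark.storage.RDDBlockId class, to show how the block name is formed:

// Illustration only: how the cache key for a partition is named.
case class RDDBlockIdSketch(rddId: Int, splitIndex: Int) {
  def name: String = s"rdd_${rddId}_${splitIndex}"
}

object RDDBlockIdDemo {
  def main(args: Array[String]): Unit = {
    // Partition 3 of the RDD whose id is 7 is cached under the block name "rdd_7_3".
    println(RDDBlockIdSketch(rddId = 7, splitIndex = 3).name)
  }
}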

CacheManager.putInBlockManager stores a partition's data (an Iterator) to disk or memory. If RDD.storageLevel is disk-only, BlockManager.putIterator serializes the Iterator and writes it straight to a disk file. If the data is to be placed in memory, MemoryStore.unrollSafely first unrolls the Iterator into an array, and BlockManager.putArray then stores that array in the BlockManager. Unrolling the Iterator into an array can require a lot of memory: if the partition is too large for the currently free memory, and evicting other RDDs' cached data still does not free enough space, then, provided the storage level allows disk, the partition is written to disk instead. The code of CacheManager.putInBlockManager is:

 private def putInBlockManager[T](
      key: BlockId,
      values: Iterator[T],
      level: StorageLevel,
      updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
      effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {

    // effectiveStorageLevel is the level actually used to store the partition in this call to putInBlockManager
    val putLevel = effectiveStorageLevel.getOrElse(level)
    if (!putLevel.useMemory) {
      /*
       * This RDD is not to be cached in memory, so we can just pass the computed values as an
       * iterator directly to the BlockManager rather than first fully unrolling it in memory.
       */
      updatedBlocks ++=
        blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
      blockManager.get(key) match {
        case Some(v) => v.data.asInstanceOf[Iterator[T]]
        case None =>
          logInfo(s"Failure to store $key")
          throw new BlockException(key, s"Block manager failed to return cached value for $key!")
      }
    } else {
      /*
       * This RDD is to be cached in memory. In this case we cannot pass the computed values
       * to the BlockManager as an iterator and expect to read it back later. This is because
       * we may end up dropping a partition from memory store before getting it back.
       *
       * In addition, we must be careful to not unroll the entire partition in memory at once.
       * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this
       * single partition. Instead, we unroll the values cautiously, potentially aborting and
       * dropping the partition to disk if applicable.
       * First unroll the partition's data (an Iterator) into an array.
       */
      blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
        case Left(arr) =>
          // We have successfully unrolled the entire partition, so cache it in memory
          updatedBlocks ++=
            blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
          arr.iterator.asInstanceOf[Iterator[T]]
        case Right(it) =>
          // There is not enough space to cache this partition in memory
          val returnValues = it.asInstanceOf[Iterator[T]]
          if (putLevel.useDisk) {
            logWarning(s"Persisting partition $key to disk instead.")
            val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
              useOffHeap = false, deserialized = false, putLevel.replication)
            /*
             * The partition is too large: memory ran out while unrolling it,
             * so store the whole partition on disk instead.
             */
            putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
          } else {
            returnValues
          }
      }
    }
  }
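To summarize the branches above, here is a small self-contained decision sketch (plain Scala; the names are illustrative and not Spark's API):

// Illustration only: the four possible outcomes of putInBlockManager.
case class PutLevel(useDisk: Boolean, useMemory: Boolean)

object PutPathSketch {
  def describePutPath(level: PutLevel, unrollFitsInMemory: Boolean): String =
    if (!level.useMemory) {
      "pass the iterator straight to putIterator (disk-only level)"
    } else if (unrollFitsInMemory) {
      "unroll to an array, then putArray into the memory store"
    } else if (level.useDisk) {
      "unroll ran out of memory: recurse with a disk-only effective level"
    } else {
      "unroll ran out of memory and disk is not allowed: return the iterator uncached"
    }

  def main(args: Array[String]): Unit = {
    // MEMORY_AND_DISK_SER_2 with a partition too large for memory:
    println(describePutPath(PutLevel(useDisk = true, useMemory = true), unrollFitsInMemory = false))
  }
}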

MemoryStore.unrollSafely unrolls the partition's data (an Iterator) into a SizeTrackingVector and returns it as an array. As the vector grows, it periodically calls MemoryStore.reserveUnrollMemoryForThisThread to reserve more unroll memory. If the currently free memory cannot satisfy the request, the reservation fails, MemoryStore.ensureFreeSpace runs the eviction algorithm to drop eligible cached blocks and free up space, and the reservation is retried. The memory reserved for the SizeTrackingVector is still needed later, when the array is put into the BlockManager, so the finally block moves the reserved amount from unrollMemoryMap into the pendingUnrollMemoryMap HashMap; that memory is released by MemoryStore.releasePendingUnrollMemoryForThisThread only after tryToPut has successfully stored the unrolled array in the BlockManager. The code of MemoryStore.unrollSafely is:

// Unroll the values Iterator into a vector. If an array is returned, the memory the Iterator occupied is released after the block has been stored.
  def unrollSafely(
      blockId: BlockId,
      values: Iterator[Any],
      droppedBlocks: ArrayBuffer[(BlockId, BlockStatus)])
    : Either[Array[Any], Iterator[Any]] = {

    // Number of elements unrolled so far
    var elementsUnrolled = 0
    // Whether there is still enough memory for us to continue unrolling this block
    var keepUnrolling = true
    // Initial per-thread memory to request for unrolling blocks (bytes). Exposed for testing.
    val initialMemoryThreshold = unrollMemoryThreshold
    // How often to check whether we need to request more memory
    val memoryCheckPeriod = 16
    // Memory currently reserved by this thread for this particular unrolling operation
    var memoryThreshold = initialMemoryThreshold
    // Memory to request as a multiple of current vector size
    val memoryGrowthFactor = 1.5
    // Previous unroll memory held by this thread, for releasing later (only at the very end)
    val previousMemoryReserved = currentUnrollMemoryForThisThread
    // Underlying vector for unrolling the block
    var vector = new SizeTrackingVector[Any]

    // Request enough memory to begin unrolling
    keepUnrolling = reserveUnrollMemoryForThisThread(initialMemoryThreshold)

    if (!keepUnrolling) {
      logWarning(s"Failed to reserve initial memory threshold of " +
        s"${Utils.bytesToString(initialMemoryThreshold)} for computing block $blockId in memory.")
    }

    // Unroll this block safely, checking whether we have exceeded our threshold periodically
    try {
      while (values.hasNext && keepUnrolling) {
        vector += values.next()
        if (elementsUnrolled % memoryCheckPeriod == 0) {
          // If our vector's size has exceeded the threshold, request more memory
          val currentSize = vector.estimateSize()
          if (currentSize >= memoryThreshold) {
            val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
            // Hold the accounting lock, in case another thread concurrently puts a block that
            // takes up the unrolling space we just ensured here
            accountingLock.synchronized {
              // reserve amountToRequest bytes from the currently free memory
              if (!reserveUnrollMemoryForThisThread(amountToRequest)) {
                // If the first request is not granted, try again after ensuring free space
                // If there is still not enough space, give up and drop the partition
                val spaceToEnsure = maxUnrollMemory - currentUnrollMemory
                if (spaceToEnsure > 0) {
                  // run the eviction algorithm: drop eligible cached blocks to free up space
                  val result = ensureFreeSpace(blockId, spaceToEnsure)
                  droppedBlocks ++= result.droppedBlocks
                }
                keepUnrolling = reserveUnrollMemoryForThisThread(amountToRequest)
              }
            }
            // New threshold is currentSize * memoryGrowthFactor
            memoryThreshold += amountToRequest
          }
        }
        elementsUnrolled += 1
      }

      if (keepUnrolling) {
        // We successfully unrolled the entirety of this block
        Left(vector.toArray)
      } else {
        // We ran out of space while unrolling the values for this block
        logUnrollFailureMessage(blockId, vector.estimateSize())
        Right(vector.iterator ++ values)
      }

    } finally {
      // If we return an array, the values returned will later be cached in `tryToPut`.
      // In this case, we should release the memory after we cache the block there.
      // Otherwise, if we return an iterator, we release the memory reserved here
      // later when the task finishes.
      if (keepUnrolling) {
        accountingLock.synchronized {
          val amountToRelease = currentUnrollMemoryForThisThread - previousMemoryReserved
          /*
           * Move the memory reserved by this unrollSafely call from unrollMemoryMap into the
           * pendingUnrollMemoryMap HashMap. It is released only after tryToPut has successfully
           * stored the unrolled array in the BlockManager.
           */
          releaseUnrollMemoryForThisThread(amountToRelease)
          reservePendingUnrollMemoryForThisThread(amountToRelease)
        }
      }
    }
  }
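The unroll bookkeeping above grows the memory reservation geometrically. Here is a small worked example of the arithmetic (pure Scala; the 1 MB initial threshold is an assumption matching the default of spark.storage.unrollMemoryThreshold):

// Worked example of the growth arithmetic in unrollSafely (no Spark dependency).
object UnrollGrowthDemo {
  def main(args: Array[String]): Unit = {
    val memoryGrowthFactor = 1.5
    var memoryThreshold = 1024L * 1024        // assumed initial unroll threshold (1 MB)
    val currentSize     = 3L * 1024 * 1024    // the vector has grown to about 3 MB

    if (currentSize >= memoryThreshold) {
      // Request enough so that the new threshold equals currentSize * 1.5.
      val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
      memoryThreshold += amountToRequest
      println(s"request $amountToRequest more bytes, new threshold $memoryThreshold bytes")
      // => request 3670016 more bytes, new threshold 4718592 bytes
    }
  }
}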

MemoryStore.ensureFreeSpace implements the eviction algorithm: blocks cached in the BlockManager's memory by other RDDs are the eviction candidates. It first collects, into an array, all the blocks that must be dropped to satisfy the current allocation, and then drops each of them with BlockManager.dropFromMemory. The code of MemoryStore.ensureFreeSpace is:

private def ensureFreeSpace(
      blockIdToAdd: BlockId,
      space: Long): ResultWithDroppedBlocks = {
    logInfo(s"ensureFreeSpace($space) called with curMem=$currentMemory, maxMem=$maxMemory")

    val droppedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]

    if (space > maxMemory) {
      logInfo(s"Will not store $blockIdToAdd as it is larger than our memory limit")
      return ResultWithDroppedBlocks(success = false, droppedBlocks)
    }

    // Take into account the amount of memory currently occupied by unrolling blocks
    // and minus the pending unroll memory for that block on current thread.
    val threadId = Thread.currentThread().getId
    val actualFreeMemory = freeMemory - currentUnrollMemory +
      pendingUnrollMemoryMap.getOrElse(threadId, 0L)

    if (actualFreeMemory < space) {
      val rddToAdd = getRddId(blockIdToAdd)
      val selectedBlocks = new ArrayBuffer[BlockId]
      var selectedMemory = 0L

      // This is synchronized to ensure that the set of entries is not changed
      // (because of getValue or getBytes) while traversing the iterator, as that
      // can lead to exceptions.
      /*
       * Collect into the selectedBlocks array all the blocks that must be evicted to satisfy
       * this allocation. Only blocks belonging to other RDDs may be evicted.
       */
      entries.synchronized {
        val iterator = entries.entrySet().iterator()
        while (actualFreeMemory + selectedMemory < space && iterator.hasNext) {
          val pair = iterator.next()
          val blockId = pair.getKey
          /*
           * Only data cached by other RDDs is a candidate for eviction.
           */
          if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
            selectedBlocks += blockId
            selectedMemory += pair.getValue.size
          }
        }
      }

      if (actualFreeMemory + selectedMemory >= space) {
        logInfo(s"${selectedBlocks.size} blocks selected for dropping")
        for (blockId <- selectedBlocks) {
          val entry = entries.synchronized { entries.get(blockId) }
          // This should never be null as only one thread should be dropping
          // blocks and removing entries. However the check is still here for
          // future safety.
          if (entry != null) {
            val data = if (entry.deserialized) {
              Left(entry.value.asInstanceOf[Array[Any]])
            } else {
              Right(entry.value.asInstanceOf[ByteBuffer].duplicate())
            }
            /*
             * Drop the other RDD's cached data from MemoryStore.entries.
             */
            val droppedBlockStatus = blockManager.dropFromMemory(blockId, data)
            droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
          }
        }
        return ResultWithDroppedBlocks(success = true, droppedBlocks)
      } else {
        logInfo(s"Will not store $blockIdToAdd as it would require dropping another block " +
          "from the same RDD")
        return ResultWithDroppedBlocks(success = false, droppedBlocks)
      }
    }
    ResultWithDroppedBlocks(success = true, droppedBlocks)
  }
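The selection phase of the eviction algorithm can be sketched on its own, with illustrative types standing in for Spark's internals:

import scala.collection.mutable.ArrayBuffer

// Illustration only: pick blocks from *other* RDDs until enough memory would be freed.
object EvictionSelectionSketch {
  case class CachedBlock(name: String, rddId: Option[Int], size: Long)

  def selectBlocksToDrop(entries: Seq[CachedBlock], actualFreeMemory: Long,
                         space: Long, rddToAdd: Option[Int]): Option[Seq[CachedBlock]] = {
    val selected = ArrayBuffer[CachedBlock]()
    var selectedMemory = 0L
    val it = entries.iterator
    while (actualFreeMemory + selectedMemory < space && it.hasNext) {
      val block = it.next()
      // Blocks of the RDD currently being cached are never eviction candidates.
      if (rddToAdd.isEmpty || rddToAdd != block.rddId) {
        selected += block
        selectedMemory += block.size
      }
    }
    if (actualFreeMemory + selectedMemory >= space) Some(selected) else None
  }

  def main(args: Array[String]): Unit = {
    val entries = Seq(
      CachedBlock("rdd_1_0", Some(1), size = 40),
      CachedBlock("rdd_2_0", Some(2), size = 60))
    // Caching a block of rdd 2: only rdd_1_0 is eligible, and 40 + 40 >= 70, so it is selected.
    println(selectBlocksToDrop(entries, actualFreeMemory = 40, space = 70, rddToAdd = Some(2)))
    // => Some(ArrayBuffer(CachedBlock(rdd_1_0,Some(1),40)))
  }
}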

BlockManager.dropFromMemory evicts a block from memory. If the evicted block may be stored on disk, it first calls DiskStore.putArray or DiskStore.putBytes to write the block's data to disk, then removes the block from the MemoryStore.entries LinkedHashMap, and finally calls BlockManager.reportBlockStatus to report the block update to the driver. The source of BlockManager.dropFromMemory is:

def dropFromMemory(
      blockId: BlockId,
      data: () => Either[Array[Any], ByteBuffer]): Option[BlockStatus] = {

    logInfo(s"Dropping block $blockId from memory")
    val info = blockInfo.get(blockId).orNull

    // If the block has not already been dropped
    if (info != null) {
      info.synchronized {
        // required ? As of now, this will be invoked only for blocks which are ready
        // But in case this changes in future, adding for consistency sake.
        if (!info.waitForReady()) {
          // If we get here, the block write failed.
          logWarning(s"Block $blockId was marked as failure. Nothing to drop")
          return None
        } else if (blockInfo.get(blockId).isEmpty) {
          logWarning(s"Block $blockId was already dropped.")
          return None
        }
        var blockIsUpdated = false
        val level = info.level  // this level is the same as the RDD's storageLevel

        // Drop to disk, if storage level requires
        if (level.useDisk && !diskStore.contains(blockId)) {
          logInfo(s"Writing block $blockId to disk")
          // write the data to disk if the storage level requires it
          data() match {
            case Left(elements) =>
              diskStore.putArray(blockId, elements, level, returnValues = false)
            case Right(bytes) =>
              diskStore.putBytes(blockId, bytes, level)
          }
          blockIsUpdated = true
        }

        // Actually drop from memory store
        val droppedMemorySize =
          if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
        /*
         * Remove the block from MemoryStore.entries.
         */
        val blockIsRemoved = memoryStore.remove(blockId)
        if (blockIsRemoved) {
          blockIsUpdated = true
        } else {
          logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
        }

        val status = getCurrentBlockStatus(blockId, info)
        if (info.tellMaster) {
          // report the block update to the driver
          reportBlockStatus(blockId, info, status, droppedMemorySize)
        }
        if (!level.useDisk) {
          // The block is completely gone from this node; forget it so we can put() it again later.
          blockInfo.remove(blockId)
        }
        if (blockIsUpdated) {
          return Some(status)
        }
      }
    }
    None
  }

BlockManager.reportBlockStatus calls BlockManagerMaster.updateBlockInfo, which sends an UpdateBlockInfo message to the driver's BlockManagerMasterEndpoint. When BlockManagerMasterEndpoint.receiveAndReply receives the UpdateBlockInfo message, it calls BlockManagerMasterEndpoint.updateBlockInfo to update the block's metadata.
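For reference, the kind of information such an update carries can be sketched as follows. This is only an illustrative mirror, not Spark's actual message class; the real UpdateBlockInfo lives in org.apache.spark.storage.BlockManagerMessages and carries a few more fields:

// Illustration only: roughly what a block-status update tells the driver.
case class BlockStatusUpdateSketch(
    reportingBlockManager: String,  // which executor's BlockManager the report comes from
    blockName: String,              // e.g. "rdd_7_3"
    useDisk: Boolean,               // is the block now on disk?
    useMemory: Boolean,             // is the block still in memory?
    memSize: Long,                  // bytes held in memory (0 after a drop to disk)
    diskSize: Long)                 // bytes written to disk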


As described above, CacheManager.putInBlockManager first unrolls the Iterator into an array via MemoryStore.unrollSafely and then calls BlockManager.putArray to store that array in the BlockManager. BlockManager.putArray delegates to BlockManager.doPut, which does three things:

1. Call MemoryStore.putArray to store the block being cached into the MemoryStore.entries map;

2. Call BlockManager.replicate to replicate the cached block to a remote BlockManager;

3. After the block has been stored in memory, call BlockManager.reportBlockStatus to report the block update to the driver.

MemoryStore.putArray eventually calls MemoryStore.tryToPut to store the data in the MemoryStore.entries LinkedHashMap. Before inserting into entries, it calls MemoryStore.ensureFreeSpace to run the eviction algorithm, dropping eligible cached blocks to free space, and then performs the allocation. If the allocation succeeds, the block is added to MemoryStore.entries. If the block is too large and the allocation fails, BlockManager.dropFromMemory writes the block to disk instead. Finally, MemoryStore.releasePendingUnrollMemoryForThisThread removes the block's unroll memory from pendingUnrollMemoryMap and releases it. The source of MemoryStore.tryToPut is:

 private def tryToPut(
      blockId: BlockId,
      value: () => Any,
      size: Long,
      deserialized: Boolean): ResultWithDroppedBlocks = {

    /* TODO: Its possible to optimize the locking by locking entries only when selecting blocks
     * to be dropped. Once the to-be-dropped blocks have been selected, and lock on entries has
     * been released, it must be ensured that those to-be-dropped blocks are not double counted
     * for freeing up more space for another block that needs to be put. Only then the actually
     * dropping of blocks (and writing to disk if necessary) can proceed in parallel. */

    var putSuccess = false
    val droppedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]

    accountingLock.synchronized {
      // Call ensureFreeSpace to make sure there is enough memory for the block; this may evict other RDDs' cached blocks
      val freeSpaceResult = ensureFreeSpace(blockId, size)
      val enoughFreeSpace = freeSpaceResult.success
      droppedBlocks ++= freeSpaceResult.droppedBlocks

      if (enoughFreeSpace) {
        // store the block data into entries
        val entry = new MemoryEntry(value(), size, deserialized)
        entries.synchronized {
          entries.put(blockId, entry)
          currentMemory += size
        }
        val valuesOrBytes = if (deserialized) "values" else "bytes"
        logInfo("Block %s stored as %s in memory (estimated size %s, free %s)".format(
          blockId, valuesOrBytes, Utils.bytesToString(size), Utils.bytesToString(freeMemory)))
        putSuccess = true
      } else {
        // Tell the block manager that we couldn't put it in memory so that it can drop it to
        // disk if the block allows disk storage.
        lazy val data = if (deserialized) {
          Left(value().asInstanceOf[Array[Any]])
        } else {
          Right(value().asInstanceOf[ByteBuffer].duplicate())
        }
        /*
         * The block to be cached is too large. Here BlockManager.dropFromMemory only writes the
         * block to disk, because the block has not been added to MemoryStore.entries yet.
         */
        val droppedBlockStatus = blockManager.dropFromMemory(blockId, () => data)
        droppedBlockStatus.foreach { status => droppedBlocks += ((blockId, status)) }
      }
      // Release the unroll memory used because we no longer need the underlying Array
      // release the unroll memory held for this block's values
      releasePendingUnrollMemoryForThisThread()
    }
    ResultWithDroppedBlocks(putSuccess, droppedBlocks)
  }
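The reason the space check and the insertion both sit inside accountingLock.synchronized can be shown with a minimal self-contained sketch (not Spark's MemoryStore; the counters are illustrative):

// Illustration only: check-then-insert under one lock, so a concurrent put cannot
// consume the space between the check and the insertion.
object TryToPutLockSketch {
  private val accountingLock = new Object
  private var currentMemory = 0L
  private val maxMemory = 100L

  def tryToPut(size: Long): Boolean = accountingLock.synchronized {
    val enoughFreeSpace = currentMemory + size <= maxMemory  // plays the role of ensureFreeSpace
    if (enoughFreeSpace) {
      currentMemory += size                                  // plays the role of entries.put + accounting
    }
    enoughFreeSpace
  }

  def main(args: Array[String]): Unit = {
    println(tryToPut(60))  // true
    println(tryToPut(60))  // false: only 40 units of memory are left
  }
}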

BlockManager.replicate replicates a block to other nodes. It first calls BlockManager.getPeers, which sends a GetPeers message to the driver's BlockManagerMasterEndpoint. On receiving GetPeers, the driver calls BlockManagerMasterEndpoint.getPeers, which returns a sequence of all BlockManagerIds in the cluster except the driver's and the sender's.

BlockManager.getPeers is as follows:

 private def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = {
    peerFetchLock.synchronized {
      val cachedPeersTtl = conf.getInt("spark.storage.cachedPeersTtl", 60 * 1000) // milliseconds
      val timeout = System.currentTimeMillis - lastPeerFetchTime > cachedPeersTtl
      if (cachedPeers == null || forceFetch || timeout) {
        cachedPeers = master.getPeers(blockManagerId).sortBy(_.hashCode)
        lastPeerFetchTime = System.currentTimeMillis
        logDebug("Fetched peers from master: " + cachedPeers.mkString("[", ",", "]"))
      }
      cachedPeers  // all BlockManagerIds in the cluster except the driver's and the requester's
    }
  }
BlockManagerMasterEndpoint.getPeers:

 private def getPeers(blockManagerId: BlockManagerId): Seq[BlockManagerId] = {
    val blockManagerIds = blockManagerInfo.keySet
    if (blockManagerIds.contains(blockManagerId)) {
      // filter out the driver's BlockManagerId and the requester's own BlockManagerId
      blockManagerIds.filterNot { _.isDriver }.filterNot { _ == blockManagerId }.toSeq
    } else {
      Seq.empty  // if the requesting node has not registered yet, return an empty sequence
    }
  }
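A tiny illustration of that filtering, using plain strings in place of BlockManagerId:

// Illustration only: every registered BlockManager except the driver and the requester itself.
object GetPeersFilterSketch {
  def main(args: Array[String]): Unit = {
    val registered = Set("driver", "exec-1", "exec-2", "exec-3")
    val requester = "exec-2"
    val peers = registered.filterNot(_ == "driver").filterNot(_ == requester).toSeq.sorted
    println(peers)  // List(exec-1, exec-3)
  }
}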

To keep the peer from replicating the block yet again (which would loop forever), BlockManager.replicate sends the block with its replication count reduced to 1, i.e. MEMORY_AND_DISK_SER in this article's case. BlockManager.replicate calls getRandomPeer to pick a random peer from the sequence returned by BlockManagerMasterEndpoint.getPeers as the replication target, and then calls NettyBlockTransferService.uploadBlockSync to send the block to that peer. The code of BlockManager.replicate is:

 private def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel): Unit = {
    val maxReplicationFailures = conf.getInt("spark.storage.maxReplicationFailures", 1)
    val numPeersToReplicateTo = level.replication - 1
    val peersForReplication = new ArrayBuffer[BlockManagerId]
    val peersReplicatedTo = new ArrayBuffer[BlockManagerId]
    val peersFailedToReplicateTo = new ArrayBuffer[BlockManagerId]
    /*
     * The level sent to the peer is identical except that its replication count is set to 1.
     */
    val tLevel = StorageLevel(
      level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, 1)
    val startTime = System.currentTimeMillis
    val random = new Random(blockId.hashCode)

    var replicationFailed = false
    var failures = 0
    var done = false

    // Get cached list of peers
    peersForReplication ++= getPeers(forceFetch = false)

    // Get a random peer. Note that this selection of a peer is deterministic on the block id.
    // So assuming the list of peers does not change and no replication failures,
    // if there are multiple attempts in the same node to replicate the same block,
    // the same set of peers will be selected.
    def getRandomPeer(): Option[BlockManagerId] = {
      // If replication had failed, then force update the cached list of peers and remove the peers
      // that have been already used
      if (replicationFailed) {
        peersForReplication.clear()
        peersForReplication ++= getPeers(forceFetch = true)
        peersForReplication --= peersReplicatedTo
        peersForReplication --= peersFailedToReplicateTo
      }
      if (!peersForReplication.isEmpty) {
        Some(peersForReplication(random.nextInt(peersForReplication.size)))
      } else {
        None
      }
    }

    // One by one choose a random peer and try uploading the block to it
    // If replication fails (e.g., target peer is down), force the list of cached peers
    // to be re-fetched from driver and then pick another random peer for replication. Also
    // temporarily black list the peer for which replication failed.
    //
    // This selection of a peer and replication is continued in a loop until one of the
    // following 3 conditions is fulfilled:
    // (i) specified number of peers have been replicated to
    // (ii) too many failures in replicating to peers
    // (iii) no peer left to replicate to
    //
    while (!done) {
      // pick a random peer as the replication target (for load balancing)
      getRandomPeer() match {
        case Some(peer) =>
          try {
            val onePeerStartTime = System.currentTimeMillis
            data.rewind()
            logTrace(s"Trying to replicate $blockId of ${data.limit()} bytes to $peer")
            /*
             * Send the block from the local NettyBlockTransferService (or NioBlockTransferService)
             * to the same service on the peer.
             */
            blockTransferService.uploadBlockSync(
              peer.host, peer.port, peer.executorId, blockId, new NioManagedBuffer(data), tLevel)
            logTrace(s"Replicated $blockId of ${data.limit()} bytes to $peer in %s ms"
              .format(System.currentTimeMillis - onePeerStartTime))
            peersReplicatedTo += peer
            peersForReplication -= peer
            replicationFailed = false
            if (peersReplicatedTo.size == numPeersToReplicateTo) { // stop once the required number of peers has been replicated to
              done = true  // specified number of peers have been replicated to
            }
          } catch {
            case e: Exception =>
              logWarning(s"Failed to replicate $blockId to $peer, failure #$failures", e)
              failures += 1
              replicationFailed = true
              peersFailedToReplicateTo += peer
              if (failures > maxReplicationFailures) { // too many failures in replicating to peers
                done = true
              }
          }
        case None => // no peer left to replicate to
          done = true
      }
    }
    val timeTakeMs = (System.currentTimeMillis - startTime)
    logDebug(s"Replicating $blockId of ${data.limit()} bytes to " +
      s"${peersReplicatedTo.size} peer(s) took $timeTakeMs ms")
    if (peersReplicatedTo.size < numPeersToReplicateTo) {
      logWarning(s"Block $blockId replicated to only " +
        s"${peersReplicatedTo.size} peer(s) instead of $numPeersToReplicateTo peers")
    }
  }
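One detail worth noting above: the random generator is seeded with blockId.hashCode, so peer selection is deterministic for a given block. A small sketch of that behaviour (the peer names are placeholders):

import scala.util.Random

// Illustration only: seeding the RNG with the block name makes peer selection
// repeatable for the same block, as the comment in replicate() explains.
object PeerSelectionSketch {
  private val peers = Vector("exec-1", "exec-2", "exec-3")

  def firstPeerFor(blockName: String): String = {
    val random = new Random(blockName.hashCode)
    peers(random.nextInt(peers.size))
  }

  def main(args: Array[String]): Unit = {
    println(firstPeerFor("rdd_7_3"))  // the same peer every time for this block name
    println(firstPeerFor("rdd_7_3"))
    println(firstPeerFor("rdd_7_4"))  // a different block may pick a different peer
  }
}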

NettyBlockTransferService.uploadBlockSync calls NettyBlockTransferService.uploadBlock, which calls TransportClient.sendRpc to send an UploadBlock RPC message carrying the block data to the peer's NettyBlockRpcServer. The source of NettyBlockTransferService.uploadBlock is:

override def uploadBlock(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel): Future[Unit] = {
    val result = Promise[Unit]()
    val client = clientFactory.createClient(hostname, port)

    // StorageLevel is serialized as bytes using our JavaSerializer. Everything else is encoded
    // using our binary protocol.
    val levelBytes = serializer.newInstance().serialize(level).array()

    // Convert or copy nio buffer into array in order to serialize it.
    val nioBuffer = blockData.nioByteBuffer()
    val array = if (nioBuffer.hasArray) {
      nioBuffer.array()
    } else {
      val data = new Array[Byte](nioBuffer.remaining())
      nioBuffer.get(data)
      data
    }
    // send the UploadBlock RPC message; array holds the block data being replicated
    client.sendRpc(new UploadBlock(appId, execId, blockId.toString, levelBytes, array).toByteArray,
      new RpcResponseCallback {
        override def onSuccess(response: Array[Byte]): Unit = {
          logTrace(s"Successfully uploaded block $blockId")
          result.success()
        }
        override def onFailure(e: Throwable): Unit = {
          logError(s"Error while uploading block $blockId", e)
          result.failure(e)
        }
      })

    result.future
  }

The NettyBlockRpcServer service that receives the replicated block on the peer is the same NettyBlockRpcServer service used by shuffle. For how this service is set up, see:

Spark Shuffle Series ----- 3. How the reduce-side RDD partitions are generated in Spark shuffle

When NettyBlockRpcServer.receive receives the UploadBlock message, it calls BlockManager.putBlockData to enter the block-storage path; the peer's BlockManager then takes over and stores the block data.
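As a small end-to-end check of the level handling described above (assuming a Spark 1.x dependency; the object name is made up), the StorageLevel is serialized with a JavaSerializer on the sending side and deserialized on the receiving side before the block is handed to putBlockData:

import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.storage.StorageLevel

// Illustration only: the StorageLevel round trip performed across the wire.
object LevelRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val ser = new JavaSerializer(new SparkConf()).newInstance()

    // The sender serializes the level (with replication already reduced to 1).
    val levelBytes: Array[Byte] = ser.serialize(StorageLevel.MEMORY_AND_DISK_SER).array()

    // The receiver deserializes it before handing the block to its BlockManager.
    val restored = ser.deserialize[StorageLevel](ByteBuffer.wrap(levelBytes))
    println(restored)  // same flags as MEMORY_AND_DISK_SER, replication 1
  }
}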














