[Note Migration][Spark][12] Spark Source Code: Core Architecture 5

11. Shuffle (by far the most important topic: the main source of problems and the primary tuning target)

(1) When it happens:
    A shuffle is triggered by key-based operations on (key, value) tuples, including reduceByKey / groupByKey / sortByKey / countByKey / join / cogroup. For example:
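
    A minimal sketch (the file path and variable names are illustrative, not from the original notes): the reduceByKey below introduces a shuffle boundary, so the stage before it runs ShuffleMapTasks and the stage after it runs ResultTasks.

  // Minimal word-count sketch: reduceByKey forces a shuffle between two stages.
  val lines = sc.textFile("hdfs:///tmp/words.txt")   // illustrative path; sc is an existing SparkContext
  val counts = lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)    // shuffle: ShuffleMapTasks write map-side output, ResultTasks fetch it
  counts.collect()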

(2) Characteristics:
    [1] In early Spark versions the bucket cache was critical: a ShuffleMapTask had to write all of its data into memory before anything was flushed to disk. That caused a problem: if the map side produced too much data, it could easily trigger an OOM. Later versions optimized this: the bucket buffer defaults to 100 KB, and once the data written to it reaches that threshold, it is flushed to disk incrementally. This avoids the easy OOM, but a buffer that is too small causes excessive I/O.
    [2] This is completely different from MapReduce. MapReduce must write all map output to local disk files before reducers can start pulling data, because MR sorts by key by default and can only sort once all the data is available. Spark does not sort by default, so a ResultTask can start pulling data as soon as a ShuffleMapTask has written some, and then run the user-defined aggregation functions and operators locally. The advantage is speed; the drawback is that it is not as convenient as the MapReduce computation model (no sorted output for free).
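
    In current Spark versions the map-side write buffer described above is exposed through configuration; a hedged tuning sketch (spark.shuffle.file.buffer is the Spark 2.x key for the per-writer buffer, and the 64k value is only an example):

  // Sketch: enlarge the map-side shuffle write buffer to trade a little memory for fewer disk flushes.
  val conf = new org.apache.spark.SparkConf()
    .setAppName("shuffle-buffer-tuning")           // illustrative app name
    .set("spark.shuffle.file.buffer", "64k")       // default is 32k in Spark 2.x
  val sc = new org.apache.spark.SparkContext(conf)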

(3) The default Shuffle
(Figure: DefaultShuffle)
(4) The optimized Shuffle
         Newer Spark versions added the consolidation mechanism, i.e. the notion of a ShuffleGroup: the first batch of ShuffleMapTasks writes its data into n local files (n = number of CPU cores × number of ResultTasks); later batches of ShuffleMapTasks do not create new files, but append their data to the corresponding files that already exist. In effect the output of multiple ShuffleMapTasks is merged into one shared group of files, which reduces the number of files on local disk.
(Figure: ShuffleMapTaskConsolidation)
    [Figure explanation] Assume the current node has 2 CPU cores and runs 4 ShuffleMapTasks; only 2 of them can run in parallel at a time (i.e. they are split into 2 batches of 2).
    <1> ShuffleMapTasks that run in parallel always write to different files. When one batch of parallel ShuffleMapTasks finishes and the next batch starts, the consolidation mechanism lets the new batch reuse the previous batch's in-memory buffers and files.
    <2> Each file in a ShuffleGroup stores data from several ShuffleMapTasks; each ShuffleMapTask's portion is called a segment. Indexes are kept as well, recording the offset and extent of each ShuffleMapTask's output inside the ShuffleBlockFile, so the data of different ShuffleMapTasks can be told apart. The arithmetic sketch below shows why this matters.
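
    A back-of-the-envelope sketch of the file-count saving, with purely illustrative numbers:

  // Illustrative arithmetic only: shuffle file counts on one node, before and after consolidation.
  val mapTasksOnNode = 100   // ShuffleMapTasks scheduled on this node (example value)
  val cores          = 2     // CPU cores on this node (example value)
  val resultTasks    = 100   // reduce-side partitions (example value)

  val filesWithoutConsolidation = mapTasksOnNode * resultTasks   // 100 * 100 = 10000 files
  val filesWithConsolidation    = cores * resultTasks            //   2 * 100 =   200 files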

(5) Related source code:
    <1> Shuffle write side: HashShuffleWriter / FileShuffleBlockManager [Note: the Spark 2.3 source is completely different from Spark 1.6]
    <2> Shuffle read entry point: ShuffledRDD.compute; the remaining logic lives in HashShuffleReader / BlockStoreShuffleFetcher (in Spark 2.3 the fetching is done by ShuffleBlockFetcherIterator) [Note: the Spark 2.3 source is completely different from Spark 1.6]
    <3> Tuning parameter: maxBytesInFlight, set through spark.reducer.maxSizeInFlight
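
    A hedged tuning sketch for <3> (in Spark 2.x the maxBytesInFlight limit used by the fetcher is configured via spark.reducer.maxSizeInFlight, default 48m; the 96m value is only an example):

  // Sketch: allow each reduce task to keep more fetched shuffle data in flight at once.
  val tunedConf = new org.apache.spark.SparkConf()
    .set("spark.reducer.maxSizeInFlight", "96m")   // backs maxBytesInFlight; 96m is illustrative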

  override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }
  /**
   * The key fetch component (ShuffleBlockFetcherIterator): start pulling the multiple chunks of
   * data that belong to this ResultTask
   */
  private[this] def initialize(): Unit = {
    // Add a task completion callback (called in both success case and failure case) to cleanup.
    context.addTaskCompletionListener(_ => cleanup())

    // Split local and remote blocks.
    val remoteRequests = splitLocalRemoteBlocks()
    // Add the remote requests into our queue in a random order
    fetchRequests ++= Utils.randomize(remoteRequests)
    assert((0 == reqsInFlight) == (0 == bytesInFlight),
      "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
      ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)

    // Send out initial requests for blocks, up to our maxBytesInFlight
    fetchUpToMaxBytes()

    val numFetches = remoteRequests.size - fetchRequests.size
    logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

    // Get local blocks, i.e. data locality
    fetchLocalBlocks()
    logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
  }

  /**
   * Loop: as long as there is still data that has not been fetched, keep sending requests to
   * remote nodes to pull it
   */
  private def fetchUpToMaxBytes(): Unit = {
    // Send fetch requests up to maxBytesInFlight (the spark.reducer.maxSizeInFlight tuning
    // parameter). If you cannot fetch from a remote host immediately, defer the request until
    // the next time it can be processed.

    // Process any outstanding deferred fetch requests if possible.
    if (deferredFetchRequests.nonEmpty) {
      for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
        while (isRemoteBlockFetchable(defReqQueue) &&
            !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
          val request = defReqQueue.dequeue()
          logDebug(s"Processing deferred fetch request for $remoteAddress with "
            + s"${request.blocks.length} blocks")
          send(remoteAddress, request)
          if (defReqQueue.isEmpty) {
            deferredFetchRequests -= remoteAddress
          }
        }
      }
    }

    // Process any regular fetch requests if possible.
    while (isRemoteBlockFetchable(fetchRequests)) {
      val request = fetchRequests.dequeue()
      val remoteAddress = request.address
      if (isRemoteAddressMaxedOut(remoteAddress, request)) {
        logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks")
        val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]())
        defReqQueue.enqueue(request)
        deferredFetchRequests(remoteAddress) = defReqQueue
      } else {
        send(remoteAddress, request)  // non-trivial implementation: fetch the data from the remote node
      }
    }
  }

12. BlockManager: the low-level data management component (again a master-slave structure)

(Figure: BlockManager) [Figure explanation] Every node has a BlockManager, which contains several key components:

DiskStore: reads and writes block data on local disk
MemoryStore: reads and writes block data in memory
BlockTransferService: reads and writes block data managed by the BlockManagers of other, remote nodes
ConnectionManager: establishes network connections from the local BlockManager to the BlockManagers of other, remote nodes

(1) The BlockManagerMaster manages the metadata of the data managed internally by the BlockManagers on all nodes; for example, every add/delete/update is recorded and maintained there.
(2) After each BlockManager is created and initialized, it first registers itself with the BlockManagerMaster, which then creates a corresponding BlockManagerInfo for it.
(3) On a "write" through the BlockManager, e.g. intermediate data produced while an RDD runs, or data explicitly cached via persist(), the data is written to memory first; when memory is insufficient, its eviction algorithm spills part of the in-memory data to disk. In addition, if persist() specified replication, the BlockTransferService replicates the data to the BlockManager of another node, as the sketch after this list illustrates.
(4) On a "read" through the BlockManager, e.g. shuffle read, the data is read locally via the DiskStore or MemoryStore if it is available locally. If not, the ConnectionManager establishes a connection to the BlockManager of the node that holds the data, and the BlockTransferService then reads the data from that remote BlockManager.
(5) Whenever the BlockManager performs an add/delete/update, it must report the block's BlockStatus to the BlockManagerMaster, which updates the BlockStatus metadata inside the corresponding BlockManagerInfo.
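
A small usage sketch of the replicated-persist path in (3); the storage level is real Spark API, while the RDD name is illustrative:

  import org.apache.spark.storage.StorageLevel

  // MEMORY_AND_DISK_2 asks the local BlockManager to store each partition (memory first,
  // spilling to disk if needed) and to replicate it to one other node's BlockManager
  // through the BlockTransferService.
  val cached = someRdd.persist(StorageLevel.MEMORY_AND_DISK_2)   // someRdd is illustrative
  cached.count()   // the first action materializes and replicates the blocks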

[Source code]
(1) BlockManagerMaster.scala only defines the entry points for these operations; BlockManagerMasterEndpoint is the component that does the real work.

/**
 * BlockManagerMasterEndpoint is an [[ThreadSafeRpcEndpoint]] on the master node to track statuses
 * of all slaves' block managers.
 */
private[spark]
class BlockManagerMasterEndpoint(
    override val rpcEnv: RpcEnv,
    val isLocal: Boolean,
    conf: SparkConf,
    listenerBus: LiveListenerBus)
  extends ThreadSafeRpcEndpoint with Logging {

  // Mapping from block manager id to the block manager's information.
  // The BlockManagerMaster maintains a BlockManagerInfo for every BlockManager
  private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

  // Mapping from executor ID to block manager ID.
  // Each executor is associated with exactly one BlockManager
  private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

  // Mapping from block id to the set of block managers that have the block.
  private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]

  private val askThreadPool = ThreadUtils.newDaemonCachedThreadPool("block-manager-ask-thread-pool")
  // ... ...

  /**
   * Register a BlockManager.
   * Returns the BlockManagerId with topology information populated, if available.
   */
  private def register(
      idWithoutTopologyInfo: BlockManagerId,
      maxOnHeapMemSize: Long,
      maxOffHeapMemSize: Long,
      slaveEndpoint: RpcEndpointRef): BlockManagerId = {
    // the dummy id is not expected to contain the topology information.
    // we get that info here and respond back with a more fleshed out block manager id
    val id = BlockManagerId(
      idWithoutTopologyInfo.executorId,
      idWithoutTopologyInfo.host,
      idWithoutTopologyInfo.port,
      topologyMapper.getTopologyForHost(idWithoutTopologyInfo.host))

    val time = System.currentTimeMillis()
    // Check the registered metadata; if this id is not registered yet, register it
    if (!blockManagerInfo.contains(id)) {
      // Look up the BlockManagerId registered for this BlockManager's executorId
      // Sanity check: if the blockManagerInfo map has no entry for this BlockManagerId,
      // then the blockManagerIdByExecutor map must not have one for the executor either
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(oldId) =>
          // A block manager of the same executor already exists, so remove it (assumed dead)
          logError("Got two different block manager registrations on same executor - "
              + s" will replace old one $oldId with new one $id")
          removeExecutor(id.executorId)  // blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
        case None =>
      }
      logInfo("Registering block manager %s with %s RAM, %s".format(
        id.hostPort, Utils.bytesToString(maxOnHeapMemSize + maxOffHeapMemSize), id))

      // Save the executorId -> blockManagerId mapping in the blockManagerIdByExecutor map
      blockManagerIdByExecutor(id.executorId) = id

      // Create a BlockManagerInfo for this blockManagerId and save the id -> info mapping in the blockManagerInfo map
      blockManagerInfo(id) = new BlockManagerInfo(
        id, System.currentTimeMillis(), maxOnHeapMemSize, maxOffHeapMemSize, slaveEndpoint)
    }
    listenerBus.post(SparkListenerBlockManagerAdded(time, id, maxOnHeapMemSize + maxOffHeapMemSize,
        Some(maxOnHeapMemSize), Some(maxOffHeapMemSize)))
    id
  }



  private def removeBlockManager(blockManagerId: BlockManagerId) {
    // Look up the BlockManagerInfo for this blockManagerId
    val info = blockManagerInfo(blockManagerId)

    // Remove the block manager from blockManagerIdByExecutor.
    blockManagerIdByExecutor -= blockManagerId.executorId

    // Remove it from blockManagerInfo and remove all the blocks.
    blockManagerInfo.remove(blockManagerId)

    val iterator = info.blocks.keySet.iterator
    // Iterate over the BlockIds of all blocks tracked by this BlockManagerInfo
    while (iterator.hasNext) {
      // Clear the BlockStatus tracked for each block inside the BlockManagerInfo
      val blockId = iterator.next
      val locations = blockLocations.get(blockId)
      locations -= blockManagerId
      // De-register the block if none of the block managers have it. Otherwise, if pro-active
      // replication is enabled, and a block is either an RDD or a test block (the latter is used
      // for unit testing), we send a message to a randomly chosen executor location to replicate
      // the given block. Note that we ignore other block types (such as broadcast/shuffle blocks
      // etc.) as replication doesn't make much sense in that context.
      if (locations.size == 0) {
        blockLocations.remove(blockId)
        logWarning(s"No more replicas available for $blockId !")
      } else if (proactivelyReplicate && (blockId.isRDD || blockId.isInstanceOf[TestBlockId])) {
        // As a heuristic, assume single executor failure to find out the number of replicas that
        // existed before failure
        val maxReplicas = locations.size + 1
        val i = (new Random(blockId.hashCode)).nextInt(locations.size)
        val blockLocations = locations.toSeq
        val candidateBMId = blockLocations(i)
        blockManagerInfo.get(candidateBMId).foreach { bm =>
          val remainingLocations = locations.toSeq.filter(bm => bm != candidateBMId)
          val replicateMsg = ReplicateBlock(blockId, remainingLocations, maxReplicas)
          bm.slaveEndpoint.ask[Boolean](replicateMsg)
        }
      }
    }

    listenerBus.post(SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId))
    logInfo(s"Removing block manager $blockManagerId")
  }

  /**
   * Update block info: whenever a block changes on any BlockManager, that BlockManager must send
   * an updateBlockInfo request to the BlockManagerMaster so the metadata kept here stays current
   */
  private def updateBlockInfo(
      blockManagerId: BlockManagerId,
      blockId: BlockId,
      storageLevel: StorageLevel,
      memSize: Long,
      diskSize: Long): Boolean = {

    if (!blockManagerInfo.contains(blockManagerId)) {
      if (blockManagerId.isDriver && !isLocal) {
        // We intentionally do not register the master (except in local mode),
        // so we should not indicate failure.
        return true
      } else {
        return false
      }
    }

    if (blockId == null) {
      blockManagerInfo(blockManagerId).updateLastSeenMs()
      return true
    }

    blockManagerInfo(blockManagerId).updateBlockInfo(blockId, storageLevel, memSize, diskSize)

    // A block may live on several BlockManagers,
    // because a StorageLevel of the _2 kind requires a copy of the block on another BlockManager.
    // The blockLocations map keeps, for each blockId, the set of BlockManagerIds holding it (a set, so duplicates are removed automatically).
    var locations: mutable.HashSet[BlockManagerId] = null
    if (blockLocations.containsKey(blockId)) {
      locations = blockLocations.get(blockId)
    } else {
      locations = new mutable.HashSet[BlockManagerId]
      blockLocations.put(blockId, locations)
    }

    if (storageLevel.isValid) {
      locations.add(blockManagerId)
    } else {
      locations.remove(blockManagerId)
    }

    // Remove the block from master tracking if it has been removed on all slaves.
    if (locations.size == 0) {
      blockLocations.remove(blockId)
    }
    true
  }


  // ... ...
}
/**
 * BlockManagerInfo: the metadata structure kept for each BlockManager
 */
private[spark] class BlockManagerInfo(
    val blockManagerId: BlockManagerId,
    timeMs: Long,
    val maxOnHeapMem: Long,
    val maxOffHeapMem: Long,
    val slaveEndpoint: RpcEndpointRef)
  extends Logging {

  val maxMem = maxOnHeapMem + maxOffHeapMem

  private var _lastSeenMs: Long = timeMs
  private var _remainingMem: Long = maxMem

  // Mapping from block id to its status.
  private val _blocks = new JHashMap[BlockId, BlockStatus]

  // Cached blocks held by this BlockManager. This does not include broadcast blocks.
  private val _cachedBlocks = new mutable.HashSet[BlockId]
}

@DeveloperApi
case class BlockStatus(storageLevel: StorageLevel, memSize: Long, diskSize: Long) {
  def isCached: Boolean = memSize + diskSize > 0
}

  // BlockManagerInfo.updateBlockInfo (a member of the BlockManagerInfo class above, excerpted here):
  def updateBlockInfo(
      blockId: BlockId,
      storageLevel: StorageLevel,
      memSize: Long,
      diskSize: Long) {

    updateLastSeenMs()

    val blockExists = _blocks.containsKey(blockId)
    var originalMemSize: Long = 0
    var originalDiskSize: Long = 0
    var originalLevel: StorageLevel = StorageLevel.NONE

    if (blockExists) {
      // The block exists on the slave already.
      val blockStatus: BlockStatus = _blocks.get(blockId)
      originalLevel = blockStatus.storageLevel
      originalMemSize = blockStatus.memSize
      originalDiskSize = blockStatus.diskSize

      // If the original storage level uses memory, add the block's memory size back to the remaining memory
      if (originalLevel.useMemory) {
        _remainingMem += originalMemSize
      }
    }

    if (storageLevel.isValid) {
      /* isValid means it is either stored in-memory or on-disk.
       * The memSize here indicates the data size in or dropped from memory,
       * externalBlockStoreSize here indicates the data size in or dropped from externalBlockStore,
       * and the diskSize here indicates the data size in or dropped to disk.
       * They can be both larger than 0, when a block is dropped from memory to disk.
       * Therefore, a safe way to set BlockStatus is to set its info in accurate modes. */
      var blockStatus: BlockStatus = null
      if (storageLevel.useMemory) {
        blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = 0)
        _blocks.put(blockId, blockStatus)
        _remainingMem -= memSize
        if (blockExists) {
          logInfo(s"Updated $blockId in memory on ${blockManagerId.hostPort}" +
            s" (current size: ${Utils.bytesToString(memSize)}," +
            s" original size: ${Utils.bytesToString(originalMemSize)}," +
            s" free: ${Utils.bytesToString(_remainingMem)})")
        } else {
          logInfo(s"Added $blockId in memory on ${blockManagerId.hostPort}" +
            s" (size: ${Utils.bytesToString(memSize)}," +
            s" free: ${Utils.bytesToString(_remainingMem)})")
        }
      }
      if (storageLevel.useDisk) {
        blockStatus = BlockStatus(storageLevel, memSize = 0, diskSize = diskSize)
        _blocks.put(blockId, blockStatus)
        if (blockExists) {
          logInfo(s"Updated $blockId on disk on ${blockManagerId.hostPort}" +
            s" (current size: ${Utils.bytesToString(diskSize)}," +
            s" original size: ${Utils.bytesToString(originalDiskSize)})")
        } else {
          logInfo(s"Added $blockId on disk on ${blockManagerId.hostPort}" +
            s" (size: ${Utils.bytesToString(diskSize)})")
        }
      }
      if (!blockId.isBroadcast && blockStatus.isCached) {
        _cachedBlocks += blockId
      }
    } else if (blockExists) {
      // If isValid is not true, drop the block.
      _blocks.remove(blockId)
      _cachedBlocks -= blockId
      if (originalLevel.useMemory) {
        logInfo(s"Removed $blockId on ${blockManagerId.hostPort} in memory" +
          s" (size: ${Utils.bytesToString(originalMemSize)}," +
          s" free: ${Utils.bytesToString(_remainingMem)})")
      }
      if (originalLevel.useDisk) {
        logInfo(s"Removed $blockId on ${blockManagerId.hostPort} on disk" +
          s" (size: ${Utils.bytesToString(originalDiskSize)})")
      }
    }
  }
/**
 * Manager running on every node (driver and executors) which provides interfaces for putting and
 * retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
 *
 * Note that [[initialize()]] must be called before the BlockManager is usable.
 */
private[spark] class BlockManager(
    executorId: String,
    rpcEnv: RpcEnv,
    val master: BlockManagerMaster,
    val serializerManager: SerializerManager,
    val conf: SparkConf,
    memoryManager: MemoryManager,
    mapOutputTracker: MapOutputTracker,
    shuffleManager: ShuffleManager,
    val blockTransferService: BlockTransferService,
    securityManager: SecurityManager,
    numUsableCores: Int)
  extends BlockDataManager with BlockEvictionHandler with Logging {

  private[spark] val externalShuffleServiceEnabled =
    conf.getBoolean("spark.shuffle.service.enabled", false)

  val diskBlockManager = {
    // Only perform cleanup if an external service is not serving our shuffle files.
    val deleteFilesOnStop =
      !externalShuffleServiceEnabled || executorId == SparkContext.DRIVER_IDENTIFIER
    new DiskBlockManager(conf, deleteFilesOnStop )
  }

  // Visible for testing
  private[storage] val blockInfoManager = new BlockInfoManager

  private val futureExecutionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonCachedThreadPool("block-manager-future", 128))

  // Actual storage of where blocks are kept
  private[spark] val memoryStore =
    new MemoryStore(conf, blockInfoManager, serializerManager, memoryManager, this)
  private[spark] val diskStore = new DiskStore(conf, diskBlockManager, securityManager)
  memoryManager.setMemoryStore(memoryStore)
  // ... ...
 /**
   * Initializes the BlockManager with the given appId. This is not performed in the constructor as
   * the appId may not be known at BlockManager instantiation time (in particular for the driver,
   * where it is only learned after registration with the TaskScheduler).
   *
   * This method initializes the BlockTransferService and ShuffleClient, registers with the
   * BlockManagerMaster, starts the BlockManagerWorker endpoint, and registers with a local shuffle
   * service if configured.
   */
  def initialize(appId: String): Unit = {
    // First initialize the BlockTransferService used for remote data transfer
    blockTransferService.init(this)
    shuffleClient.init(appId)

    blockReplicationPolicy = {
      val priorityClass = conf.get(
        "spark.storage.replication.policy", classOf[RandomBlockReplicationPolicy].getName)
      val clazz = Utils.classForName(priorityClass)
      val ret = clazz.newInstance.asInstanceOf[BlockReplicationPolicy]
      logInfo(s"Using $priorityClass for block replication policy")
      ret
    }

    // Create the single, unique BlockManagerId for this BlockManager
    // The way BlockManagerId is built shows that a BlockManager is uniquely identified by one executor on one node
    val id =
      BlockManagerId(executorId, blockTransferService.hostName, blockTransferService.port, None)

    // Send the BlockManager registration message
    val idFromMaster = master.registerBlockManager(
      id,
      maxOnHeapMemory,
      maxOffHeapMemory,
      slaveEndpoint)

    blockManagerId = if (idFromMaster != null) idFromMaster else id

    shuffleServerId = if (externalShuffleServiceEnabled) {
      logInfo(s"external shuffle service port = $externalShuffleServicePort")
      BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
    } else {
      blockManagerId
    }

    // Register Executors' configuration with the local shuffle service, if one should exist.
    if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
      registerWithExternalShuffleServer()
    }

    logInfo(s"Initialized BlockManager: $blockManagerId")
  }

  /**
   * Get block from the local block manager as serialized bytes.
   *
   * Must be called while holding a read lock on the block.
   * Releases the read lock upon exception; keeps the read lock upon successful return.
   */
  private def doGetLocalBytes(blockId: BlockId, info: BlockInfo): BlockData = {
    val level = info.level
    logDebug(s"Level for block $blockId is $level")
    // In order, try to read the serialized bytes from memory, then from disk, then fall back to
    // serializing in-memory objects, and, finally, throw an exception if the block does not exist.
    if (level.deserialized) {
      // Try to avoid expensive serialization by reading a pre-serialized copy from disk:
      if (level.useDisk && diskStore.contains(blockId)) {
        // Note: we purposely do not try to put the block back into memory here. Since this branch
        // handles deserialized blocks, this block may only be cached in memory as objects, not
        // serialized bytes. Because the caller only requested bytes, it doesn't make sense to
        // cache the block's deserialized objects since that caching may not have a payoff.
        // The DiskStore uses Java NIO underneath for its reads and writes
        diskStore.getBytes(blockId)
      } else if (level.useMemory && memoryStore.contains(blockId)) {
        // The block was not found on disk, so serialize an in-memory copy:
        // Key point: the MemoryStore keeps block data in memory in its entries map:
        //   private val entries = new LinkedHashMap[BlockId, MemoryEntry[_]](32, 0.75f, true)
        // getBytes / getValues synchronize concurrent access from multiple threads
        new ByteBufferBlockData(serializerManager.dataSerializeWithExplicitClassTag(
          blockId, memoryStore.getValues(blockId).get, info.classTag), true)
      } else {
        handleLocalReadFailure(blockId)
      }
    } else {  // storage level is serialized
      if (level.useMemory && memoryStore.contains(blockId)) {
        new ByteBufferBlockData(memoryStore.getBytes(blockId).get, false)
      } else if (level.useDisk && diskStore.contains(blockId)) {
        val diskData = diskStore.getBytes(blockId)
        maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
          .map(new ByteBufferBlockData(_, false))
          .getOrElse(diskData)
      } else {
        handleLocalReadFailure(blockId)
      }
    }
  }

  /**
   * Get block from remote block managers as serialized bytes.
   */
  def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
    logDebug(s"Getting remote block $blockId")
    require(blockId != null, "BlockId is null")
    var runningFailureCount = 0
    var totalFailureCount = 0

    // Because all the remote blocks are registered in driver, it is not necessary to ask
    // all the slave executors to get block status.
    val locationsAndStatus = master.getLocationsAndStatus(blockId)
    val blockSize = locationsAndStatus.map { b =>
      b.status.diskSize.max(b.status.memSize)
    }.getOrElse(0L)
    val blockLocations = locationsAndStatus.map(_.locations).getOrElse(Seq.empty)

    // If the block size is above the threshold, we should pass our FileManager to
    // BlockTransferService, which will leverage it to spill the block; if not, then passed-in
    // null value means the block will be persisted in memory.
    val tempFileManager = if (blockSize > maxRemoteBlockToMem) {
      remoteBlockTempFileManager
    } else {
      null
    }

    val locations = sortLocations(blockLocations)
    val maxFetchFailures = locations.size
    var locationIterator = locations.iterator
    while (locationIterator.hasNext) {
      val loc = locationIterator.next()
      logDebug(s"Getting remote block $blockId from $loc")
      val data = try {
        blockTransferService.fetchBlockSync(
          loc.host, loc.port, loc.executorId, blockId.toString, tempFileManager).nioByteBuffer()
      } catch {
        case NonFatal(e) =>
          runningFailureCount += 1
          totalFailureCount += 1

          if (totalFailureCount >= maxFetchFailures) {
            // Give up trying anymore locations. Either we've tried all of the original locations,
            // or we've refreshed the list of locations from the master, and have still
            // hit failures after trying locations from the refreshed list.
            logWarning(s"Failed to fetch block after $totalFailureCount fetch failures. " +
              s"Most recent failure cause:", e)
            return None
          }

          logWarning(s"Failed to fetch remote block $blockId " +
            s"from $loc (failed attempt $runningFailureCount)", e)

          // If there is a large number of executors then locations list can contain a
          // large number of stale entries causing a large number of retries that may
          // take a significant amount of time. To get rid of these stale entries
          // we refresh the block locations after a certain number of fetch failures
          if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
            locationIterator = sortLocations(master.getLocations(blockId)).iterator
            logDebug(s"Refreshed locations from the driver " +
              s"after ${runningFailureCount} fetch failures.")
            runningFailureCount = 0
          }

          // This location failed, so we retry fetch from a different one by returning null here
          null
      }

      if (data != null) {
        return Some(new ChunkedByteBuffer(data))
      }
      logDebug(s"The value of block $blockId is null")
    }
    logDebug(s"Block $blockId not found")
    None
  }

  /**
   * Put the given bytes according to the given level in one of the block stores, replicating
   * the values if necessary.
   *
   * If the block already exists, this method will not overwrite it.
   *
   * '''Important!''' Callers must not mutate or release the data buffer underlying `bytes`. Doing
   * so may corrupt or change the data stored by the `BlockManager`.
   *
   * @param keepReadLock if true, this method will hold the read lock when it returns (even if the
   *                     block already exists). If false, this method will hold no locks when it
   *                     returns.
   * @return true if the block was already present or if the put succeeded, false otherwise.
   */
  private def doPutBytes[T](
      blockId: BlockId,
      bytes: ChunkedByteBuffer,
      level: StorageLevel,
      classTag: ClassTag[T],
      tellMaster: Boolean = true,
      keepReadLock: Boolean = false): Boolean = {
    doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
      val startTimeMs = System.currentTimeMillis
      // Since we're storing bytes, initiate the replication before storing them locally.
      // This is faster as data is already serialized and ready to send.
      val replicationFuture = if (level.replication > 1) {
        Future {
          // This is a blocking action and should run in futureExecutionContext which is a cached
          // thread pool. The ByteBufferBlockData wrapper is not disposed of to avoid releasing
          // buffers that are owned by the caller.
          replicate(blockId, new ByteBufferBlockData(bytes, false), level, classTag)
        }(futureExecutionContext)
      } else {
        null
      }

      val size = bytes.size

      if (level.useMemory) {
        // Put it in memory first, even if it also has useDisk set to true;
        // We will drop it to disk later if the memory store can't hold it.
        val putSucceeded = if (level.deserialized) {
          val values =
            serializerManager.dataDeserializeStream(blockId, bytes.toInputStream())(classTag)
          memoryStore.putIteratorAsValues(blockId, values, classTag) match {
            case Right(_) => true
            case Left(iter) =>
              // If putting deserialized values in memory failed, we will put the bytes directly to
              // disk, so we don't need this iterator and can close it to free resources earlier.
              iter.close()
              false
          }
        } else {
          val memoryMode = level.memoryMode
          memoryStore.putBytes(blockId, size, memoryMode, () => {
            if (memoryMode == MemoryMode.OFF_HEAP &&
                bytes.chunks.exists(buffer => !buffer.isDirect)) {
              bytes.copy(Platform.allocateDirectBuffer)
            } else {
              bytes
            }
          })
        }
        if (!putSucceeded && level.useDisk) {
          logWarning(s "Persisting block $blockId to disk instead." )
          diskStore.putBytes(blockId, bytes)
        }
      } else if (level.useDisk) {
        diskStore.putBytes(blockId, bytes)
      }

      val putBlockStatus = getCurrentBlockStatus(blockId, info)
      val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
      if (blockWasSuccessfullyStored) {
        // Now that the block is in either the memory or disk store,
        // tell the master about it.
        info.size = size
        if (tellMaster && info.tellMaster) {
          reportBlockStatus(blockId, putBlockStatus)
        }
        addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
      }
      logDebug( "Put block %s locally took %s" .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
      if (level.replication > 1) {
        // Wait for asynchronous replication to finish
        try {
          ThreadUtils.awaitReady(replicationFuture, Duration.Inf)
        } catch {
          case NonFatal(t) =>
            throw new Exception("Error occurred while waiting for replication to finish", t)
        }
      }
      if (blockWasSuccessfullyStored) {
        None
      } else {
        Some(bytes)
      }
    }.isEmpty
  }


  /**
   * Put the given block according to the given level in one of the block stores, replicating
   * the values if necessary.
   *
   * If the block already exists, this method will not overwrite it.
   *
   * @param keepReadLock if true, this method will hold the read lock when it returns (even if the
   *                     block already exists). If false, this method will hold no locks when it
   *                     returns.
   * @return None if the block was already present or if the put succeeded, or Some(iterator)
   *         if the put failed.
   */
  private def doPutIterator[T](
      blockId: BlockId,
      iterator: () => Iterator[T],
      level: StorageLevel,
      classTag: ClassTag[T],
      tellMaster: Boolean = true,
      keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
    doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
      val startTimeMs = System.currentTimeMillis
      var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None
      // Size of the block in bytes
      var size = 0L
      if (level.useMemory) {
        // Put it in memory first, even if it also has useDisk set to true;
        // We will drop it to disk later if the memory store can't hold it.
        if (level.deserialized) {
          memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
            case Right(s) =>
              size = s
            case Left(iter) =>
              // Not enough space to unroll this block; drop to disk if applicable
              if (level.useDisk) {
                logWarning(s "Persisting block $blockId to disk instead." )
                diskStore.put(blockId) { channel =>
                  val out = Channels.newOutputStream(channel)
                  serializerManager.dataSerializeStream(blockId, out, iter)(classTag)
                }
                size = diskStore.getSize(blockId)
              } else {
                iteratorFromFailedMemoryStorePut = Some(iter)
              }
          }
        } else { // !level.deserialized
          memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
            case Right(s) =>
              size = s
            case Left(partiallySerializedValues) =>
              // Not enough space to unroll this block; drop to disk if applicable
              if (level.useDisk) {
                logWarning(s "Persisting block $blockId to disk instead." )
                diskStore.put(blockId) { channel =>
                  val out = Channels.newOutputStream(channel)
                  partiallySerializedValues.finishWritingToStream(out)
                }
                size = diskStore.getSize(blockId)
              } else {
                iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
              }
          }
        }

      } else if (level.useDisk) {
        diskStore.put(blockId) { channel =>
          val out = Channels.newOutputStream(channel)
          serializerManager.dataSerializeStream(blockId, out, iterator())(classTag)
        }
        size = diskStore.getSize(blockId)
      }

      val putBlockStatus = getCurrentBlockStatus(blockId, info)
      val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
      if (blockWasSuccessfullyStored) {
        // Now that the block is in either the memory or disk store, tell the master about it.
        info.size = size
        if (tellMaster && info.tellMaster) {
          reportBlockStatus(blockId, putBlockStatus)
        }
        addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
        logDebug( "Put block %s locally took %s" .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
        if (level.replication > 1 ) {
          val remoteStartTime = System.currentTimeMillis
          val bytesToReplicate = doGetLocalBytes(blockId, info)
          // [SPARK-16550] Erase the typed classTag when using default serialization, since
          // NettyBlockRpcServer crashes when deserializing repl-defined classes.
          // TODO( ekl) remove this once the classloader issue on the remote end is fixed.
          val remoteClassTag = if (!serializerManager.canUseKryo(classTag)) {
            scala.reflect.classTag[Any]
          } else {
            classTag
          }
          try {
            replicate(blockId, bytesToReplicate, level, remoteClassTag)
          } finally {
            bytesToReplicate.dispose()
          }
          logDebug( "Put block %s remotely took %s"
            .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
        }
      }
      assert(blockWasSuccessfullyStored == iteratorFromFailedMemoryStorePut.isEmpty)
      iteratorFromFailedMemoryStorePut
    }
  }
  // ... ...
}
/**
 * Tracks metadata for an individual block.
 *
 * Instances of this class are _not_ thread-safe and are protected by locks in the
 * [[BlockInfoManager]].
 *
 * @param level the block's storage level. This is the requested persistence level, not the
 *              effective storage level of the block (i.e. if this is MEMORY_AND_DISK, then this
 *              does not imply that the block is actually resident in memory).
 * @param classTag the block's [[ClassTag]], used to select the serializer
 * @param tellMaster whether state changes for this block should be reported to the master. This
 *                   is true for most blocks, but is false for broadcast blocks.
 */
private[storage] class BlockInfo(
    val level: StorageLevel,
    val classTag: ClassTag[_],
    val tellMaster: Boolean)

13. CacheManager (there is no separate CacheManager class in Spark 2.3)

(Figure: CacheManager)
Source code: start from RDD.iterator()
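
Before the source walkthrough, a minimal usage sketch of this path (the input path and names are illustrative): after persist(), the second action is served from the BlockManager instead of being recomputed.

  import org.apache.spark.storage.StorageLevel

  val words = sc.textFile("hdfs:///tmp/words.txt").flatMap(_.split(" "))   // illustrative input
  words.persist(StorageLevel.MEMORY_ONLY)
  words.count()   // computes each partition and caches it via BlockManager.getOrElseUpdate
  words.count()   // answered from the MemoryStore (the readCachedBlock = true branch below)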

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    // If storageLevel is not NONE, i.e. this RDD was persisted earlier, do not compute the new
    // partition directly from the parent RDD's operators;
    // first try to fetch the persisted data (what the CacheManager used to do)
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      // Otherwise, compute it, or read it from the checkpoint if one exists
      computeOrReadCheckpoint(split, context)
    }
  }


  /**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }

  /**
   * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
   */
  private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }
  /**
   * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
   * to compute the block, persist it, and return its values.
   *
   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
   *         could not be cached.
   */
  def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // Attempt to read the block from local or remote storage. If it's present, then we don't need
    // to go through the local-get-or-put path.
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
        // Need to compute the block.
    }
    // Initially we hold no locks on this block.
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      case None =>
        // doPut() didn't hand work back to us, so the block already existed or was successfully
        // stored. Therefore, we now hold a read lock on the block.
        val blockResult = getLocalValues(blockId).getOrElse {
          // Since we held a read lock between the doPut() and get() calls, the block should not
          // have been evicted, so get() not returning the block indicates some internal error.
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        // We already hold a read lock on the block from the doPut() call and getLocalValues()
        // acquires the lock again, so we need to call releaseLock() here so that the net number
        // of lock acquisitions is 1 (since the caller will only call release() once).
        releaseLock(blockId)
        Left(blockResult)
      case Some(iter) =>
        // The put failed, likely because the data was too large to fit in memory and could not be
        // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
        // that they can decide what to do with the values (e.g. process them without caching).
        Right(iter)
    }
  }