SparkCore: CacheManager Source Code Analysis

xiaoxin_ysj

Column: Spark Core Principles and Source Code Analysis

Article link: https://blog.csdn.net/zlx_code/article/details/100714340

CacheManager source code analysis

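CacheManager comes into play when a task runs operators over an RDD's data. For example, when the ShuffleWriter's write method writes data on the task side, the argument passed to it comes from the RDD's iterator() method, which either reads the RDD's data or computes it. Let's look at iterator():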

The RDD.iterator() method

  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
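    // If the storage level is not NONE, the RDD has been persisted before, so there is no need
    // to run the operators against the parent RDD right away to compute this partition.
    // Try CacheManager first to fetch the persisted data.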
    if (storageLevel != StorageLevel.NONE) {
      // CacheManager reads the data needed by this computation from the persisted RDD
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      // Compute the RDD partition (or read it from a checkpoint)
      computeOrReadCheckpoint(split, context)
    }
  }

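As the code above shows, reading data starts with a check of the RDD's storage level. If a storage level is set (it is not NONE), CacheManager is used to fetch the persisted data. Otherwise the code checks whether the RDD has been checkpointed, and if it has not been checkpointed either, the partition is recomputed. Next, let's see how CacheManager reads the data (checkpointing is covered in a later post).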
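The branch in iterator() can be modeled without Spark. The following is a minimal sketch, not Spark's code: `ToyRdd`, `NoLevel`, `MemoryOnly`, and the `cache` map are hypothetical names standing in for the RDD, `StorageLevel`, and the CacheManager lookup.

```scala
// Toy model of the decision in RDD.iterator(); all names here are hypothetical.
object IteratorDecisionSketch {
  sealed trait Level
  case object NoLevel extends Level     // stands in for StorageLevel.NONE
  case object MemoryOnly extends Level  // stands in for any non-NONE level

  final case class ToyRdd(
      level: Level,
      checkpoint: Option[Seq[Int]],     // Some(data) if the RDD was checkpointed
      compute: () => Seq[Int])          // recomputation from the parent RDD

  // Mirrors computeOrReadCheckpoint: prefer the checkpoint, else recompute.
  def computeOrReadCheckpoint(rdd: ToyRdd): (String, Seq[Int]) =
    rdd.checkpoint match {
      case Some(data) => ("checkpoint", data)
      case None       => ("computed", rdd.compute())
    }

  // Mirrors iterator(): a persisted RDD goes through the cache path first.
  def iterator(rdd: ToyRdd, cache: Map[String, Seq[Int]], key: String): (String, Seq[Int]) =
    if (rdd.level != NoLevel) {
      cache.get(key) match {
        case Some(data) => ("cache", data)
        case None       => computeOrReadCheckpoint(rdd)
      }
    } else {
      computeOrReadCheckpoint(rdd)
    }
}
```

The first element of the returned pair only labels which path produced the data, which makes the three outcomes (cache hit, checkpoint read, recomputation) easy to tell apart.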

CacheManager's getOrCompute() method

def getOrCompute[T](
      rdd: RDD[T],
      partition: Partition,
      context: TaskContext,
      storageLevel: StorageLevel): Iterator[T] = {
    // Build the BlockId of this RDD partition
    val key = RDDBlockId(rdd.id, partition.index)
    logDebug(s"Looking for partition $key")
    // Call BlockManager.get() directly; if the data is found, return it right away
    blockManager.get(key) match {
      case Some(blockResult) =>
        // Partition is already materialized, so just return its values
        val existingMetrics = context.taskMetrics
          .getInputMetricsForReadMethod(blockResult.readMethod)
        existingMetrics.incBytesRead(blockResult.bytes)

        val iter = blockResult.data.asInstanceOf[Iterator[T]]
        new InterruptibleIterator[T](context, iter) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      // BlockManager returned nothing: the RDD is marked as persisted, but for some reason
      // the data is neither in local memory or disk nor on a remote BlockManager, so fall through
      case None =>
        // Acquire a lock for loading this partition
        // If another thread already holds the lock, wait for it to finish return its results
        // This makes a second attempt to obtain the data through BlockManager.get();
        // if the data is found, return it directly. Otherwise continue below.
        val storedValues = acquireLockForPartition[T](key)
        if (storedValues.isDefined) {
          return new InterruptibleIterator[T](context, storedValues.get)
        }

        // Otherwise, we have to load the partition ourselves
        try {
          logInfo(s"Partition $key not found, computing it")
          // Call computeOrReadCheckpoint:
          // if the RDD was checkpointed, read its checkpoint;
          // otherwise recompute the partition from the parent RDD's data
          val computedValues = rdd.computeOrReadCheckpoint(partition, context)

          // If the task is running locally, do not persist the result
          // If the task runs locally (context.isRunningLocally), the RDD is not persisted
          if (context.isRunningLocally) {
            return computedValues
          }

          // Otherwise, cache the values and keep track of any updates in block statuses
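          // Going through CacheManager means the RDD has a storage level set, but the persisted
          // data was not found (for example, it may have been evicted when memory ran short).
          // So after reading the checkpoint or recomputing, putInBlockManager persists it again.
          val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
          // Persist the data again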
          val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
          val metrics = context.taskMetrics
          val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
          metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
          new InterruptibleIterator(context, cachedValues)

        } finally {
          loading.synchronized {
            loading.remove(key)
            loading.notifyAll()
          }
        }
    }
  }

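Let's walk through this method. It first builds the BlockId of the RDD partition and then calls BlockManager.get() to fetch the data. get() itself is simple: it first calls getLocal() to try the local store (doGetLocal was covered in the earlier BlockManager analysis), and if nothing is found locally it fetches remotely (getRemote, which calls doGetRemote). If data is found, it is wrapped and returned directly; otherwise execution continues.
This part is interesting. If the first BlockManager lookup finds nothing, acquireLockForPartition makes a second attempt: as the comments in the code say, if another thread already holds the lock for this partition, it waits for that thread to finish and returns its result. If the second attempt succeeds, the data is returned directly. If it still finds nothing, the RDD's computeOrReadCheckpoint() method is called. That method checks whether the data was checkpointed: if so it reads the checkpoint, otherwise it recomputes the partition.
Execution then continues. If the task is running locally (context.isRunningLocally), nothing is persisted. Otherwise the partition that could not be read is persisted again. Reaching this point already implies that the RDD has a storage level set, so putInBlockManager is called to store a copy of the data in the BlockManager. Persisting to memory needs particular care, so let's look at putInBlockManager next.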
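The lookup, wait, recompute, and re-persist sequence described above can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's implementation: `ToyCacheManager`, its `store` map (playing the role of the BlockManager), and `acquireLock` are hypothetical, and unlike the real code the sketch always re-checks the store after taking the lock.

```scala
import scala.collection.mutable

// Toy stand-in for CacheManager.getOrCompute; not Spark's implementation.
class ToyCacheManager[K, V] {
  private val store = mutable.Map.empty[K, V]   // plays the role of BlockManager
  private val loading = mutable.Set.empty[K]    // keys currently being computed

  // Returns Some(value) if the key is in the store once no other thread is loading it;
  // otherwise marks the key as "loading" for the caller and returns None.
  private def acquireLock(key: K): Option[V] = loading.synchronized {
    while (loading.contains(key)) loading.wait()
    val existing = store.synchronized(store.get(key))
    if (existing.isEmpty) loading.add(key)
    existing
  }

  def getOrCompute(key: K)(compute: => V): V =
    store.synchronized(store.get(key)) match {
      case Some(v) => v                          // already materialized
      case None =>
        acquireLock(key) match {
          case Some(v) => v                      // another thread loaded it meanwhile
          case None =>
            try {
              val v = compute                    // recompute (or read a checkpoint)
              store.synchronized(store.put(key, v)) // persist again for later readers
              v
            } finally {
              loading.synchronized {
                loading.remove(key)
                loading.notifyAll()              // wake up threads waiting on this key
              }
            }
        }
    }
}
```

Calling `getOrCompute` twice with the same key evaluates the `compute` block only once; the second call is served from the store, which is the behavior the persisted-RDD path is meant to provide.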

Persisting with putInBlockManager

  private def putInBlockManager[T](
      key: BlockId,
      values: Iterator[T],
      level: StorageLevel,
      updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
      effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {

    // Resolve the storage level actually used for this put
    val putLevel = effectiveStorageLevel.getOrElse(level)
    // The level does not use memory, so this is a disk-level put
    if (!putLevel.useMemory) {
      /*
       * This RDD is not to be cached in memory, so we can just pass the computed values as an
       * iterator directly to the BlockManager rather than first fully unrolling it in memory.
       */
      // Call blockManager.putIterator to write the data to disk; it stores data through doPut()
      updatedBlocks ++=
        blockManager.putIterator(key, values, level, tellMaster = true, effectiveStorageLevel)
      blockManager.get(key) match {
        case Some(v) => v.data.asInstanceOf[Iterator[T]]
        case None =>
          logInfo(s"Failure to store $key")
          throw new BlockException(key, s"Block manager failed to return cached value for $key!")
      }
    } else {
      // The level uses memory.
      // Call memoryStore.unrollSafely to try to unroll the data into memory.
      // If unrollSafely decides the data fits, it is cached in memory;
      // if memory is insufficient, the data may be written to disk instead.
      blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
        case Left(arr) =>
          // We have successfully unrolled the entire partition, so cache it in memory
          updatedBlocks ++=
            blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
          arr.iterator.asInstanceOf[Iterator[T]]
        case Right(it) =>
          // There is not enough space to cache this partition in memory
          val returnValues = it.asInstanceOf[Iterator[T]]
          // The data does not fit in memory: if the level also allows disk, write it to disk
          if (putLevel.useDisk) {
            logWarning(s"Persisting partition $key to disk instead.")
            val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
              useOffHeap = false, deserialized = false, putLevel.replication)
            putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
          } else {
            // Otherwise return the values without persisting them
            returnValues
          }
      }
    }
  }

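Here we focus on what happens when the storage level includes memory. The approach is cautious: it checks step by step whether the memory it can reserve is enough to hold the data, in order to avoid an OOM caused by requesting too much memory at once. The work is mainly done by MemoryStore's unrollSafely method, which we analyze next.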
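The branches of putInBlockManager reduce to a small decision table. The sketch below is a simplified model that only considers memory and disk; `Level`, `Outcome`, and the `fitsInMemory` flag (standing in for whether unrollSafely returned `Left` or `Right`) are assumptions made for illustration and are not part of Spark.

```scala
// Toy model of the fallback in putInBlockManager; all names are hypothetical.
object PutFallbackSketch {
  final case class Level(useMemory: Boolean, useDisk: Boolean)

  sealed trait Outcome
  case object StoredInMemory extends Outcome
  case object StoredOnDisk extends Outcome
  case object NotStored extends Outcome

  // fitsInMemory = true corresponds to unrollSafely returning Left(array),
  // fitsInMemory = false to it returning Right(iterator).
  def put(level: Level, fitsInMemory: Boolean): Outcome =
    if (!level.useMemory) {
      // Not a memory level: hand the iterator straight to the store
      if (level.useDisk) StoredOnDisk else NotStored
    } else if (fitsInMemory) {
      StoredInMemory
    } else if (level.useDisk) {
      // Retry with a disk-only effective level, like the recursive call in the source
      put(Level(useMemory = false, useDisk = true), fitsInMemory)
    } else {
      NotStored
    }
}
```

For instance, a memory-and-disk level with data that does not fit ends up on disk, while a memory-only level in the same situation returns the values without storing them.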

MemoryStore's unrollSafely() method

  def unrollSafely(
      blockId: BlockId,
      values: Iterator[Any],
      droppedBlocks: ArrayBuffer[(BlockId, BlockStatus)])
    : Either[Array[Any], Iterator[Any]] = {

    // Number of elements unrolled so far
    var elementsUnrolled = 0
    // Start by assuming there is enough memory to hold the data
    var keepUnrolling = true
    // The initial unroll memory threshold, 1 MB by default
    val initialMemoryThreshold = unrollMemoryThreshold
    // Check the size every 16 elements: if the data has outgrown the threshold, request more
    // memory, and keep unrolling the next 16 elements if the request succeeds, and so on.
    val memoryCheckPeriod = 16
    // Current memory threshold
    var memoryThreshold = initialMemoryThreshold
    // Memory is requested as a multiple (1.5x) of the current data size
    val memoryGrowthFactor = 1.5
    // How much memory this call has reserved so far
    var pendingMemoryReserved = 0L
    // Holds the RDD data unrolled so far
    var vector = new SizeTrackingVector[Any]

    // Try to reserve the initial amount first (1 MB by default). If that succeeds, it is
    // worth continuing; if not even 1 MB can be reserved, unrolling stops here.
    keepUnrolling = reserveUnrollMemoryForThisTask(blockId, initialMemoryThreshold, droppedBlocks)

    if (!keepUnrolling) {
      logWarning(s"Failed to reserve initial memory threshold of " +
        s"${Utils.bytesToString(initialMemoryThreshold)} for computing block $blockId in memory.")
    } else {
      pendingMemoryReserved += initialMemoryThreshold
    }

    // Unroll this block safely, checking whether we have exceeded our threshold periodically
    try {
      while (values.hasNext && keepUnrolling) {
        // Take the next element; every 16 elements, check whether more memory is needed
        vector += values.next()
        if (elementsUnrolled % memoryCheckPeriod == 0) {
          // If our vector's size has exceeded the threshold, request more memory
          // Estimate the current size of the data
          val currentSize = vector.estimateSize()
          // The data has reached the threshold
          if (currentSize >= memoryThreshold) {
            // Amount to request: 1.5x the current size minus the current threshold
            val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
            // Try to reserve the memory. Internally this ends up in evictBlocksToFreeSpace,
            // which drops older blocks; a block whose storage level has no disk is lost.
            keepUnrolling = reserveUnrollMemoryForThisTask(
              blockId, amountToRequest, droppedBlocks)
            if (keepUnrolling) {
              pendingMemoryReserved += amountToRequest
            }
            // Raise the threshold
            memoryThreshold += amountToRequest
          }
        }
        // Increment the element counter
        elementsUnrolled += 1
      }

      // Return Left(array) if enough memory was reserved for the whole partition,
      // otherwise Right(iterator) over the unrolled elements plus the remaining values
      if (keepUnrolling) {
        Left(vector.toArray)
      } else {
        logUnrollFailureMessage(blockId, vector.estimateSize())
        Right(vector.iterator ++ values)
      }

    } finally {
      // If unrolling succeeded, hand the reserved memory over for the upcoming cache step
      if (keepUnrolling) {
        val taskAttemptId = currentTaskAttemptId()
        memoryManager.synchronized {
          // Subtract the reserved amount from unrollMemoryMap
          unrollMemoryMap(taskAttemptId) -= pendingMemoryReserved
          // Record it in pendingUnrollMemoryMap; it is released once tryPut really caches the block
          pendingUnrollMemoryMap(taskAttemptId) =
            pendingUnrollMemoryMap.getOrElse(taskAttemptId, 0L) + pendingMemoryReserved
        }
      } else {
        // Otherwise, if we return an iterator, we can only release the unroll memory when
        // the task finishes since we don't know when the iterator will be consumed.
      }
    }
  }

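Two things are interesting here. First, the method tries to reserve memory once up front, only 1 MB by default. If even that fails, there is no point in continuing to request memory and the method returns right away; only otherwise does the incremental reservation start. Second, memory is reserved incrementally. To avoid requesting too much at once, which could cause an OOM, the method works in batches: every 16 RDD elements it estimates the size of the data unrolled so far, and when that size reaches the current threshold it requests enough extra memory to bring the reservation to 1.5 times the current size. This repeats until enough memory has been reserved or no more is available. (As an aside, reserving memory works as described in the earlier BlockManager source analysis: it tries to evict the oldest data. If the evicted RDD's storage level includes disk, the data is written to disk; otherwise it is simply dropped to free memory.)
Even after enough memory has been reserved, the data is not cached immediately. The amount of memory needed is recorded, and the actual caching happens when the BlockManager stores the data.
To summarize: CacheManager is used when an RDD is being computed and its data is read. If the RDD has been persisted, getOrCompute() tries to read the persisted data and returns immediately if it is found. If the data has been lost, it tries to read it from a checkpoint, and if that also fails it recomputes the partition. The recomputed data is then persisted again at the original storage level. The memory case is the interesting part of the design: it first tries to reserve a small amount of memory (1 MB by default), and if that succeeds it reserves memory for the RDD data in batches, checking every 16 RDD elements, so that it never requests too much at once and causes an OOM.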
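The incremental reservation can be simulated outside Spark. The sketch below is an illustration under stated assumptions, not Spark's code: `Pool` is a hypothetical fixed-size stand-in for the memory manager, and `sizeOf` is a hypothetical per-element size estimate used in place of `SizeTrackingVector.estimateSize()`. Eviction of other blocks is not modeled.

```scala
import scala.collection.mutable.ArrayBuffer

// Simulation of the reservation loop in unrollSafely; not Spark's code.
object UnrollSketch {
  // A fixed-size pool standing in for the memory manager (hypothetical).
  final class Pool(var free: Long) {
    def reserve(bytes: Long): Boolean =
      if (bytes <= free) { free -= bytes; true } else false
  }

  val CheckPeriod = 16     // re-check the size every 16 elements
  val GrowthFactor = 1.5   // grow the reservation to 1.5x the current size

  def unroll[A](
      values: Iterator[A],
      sizeOf: A => Long,
      pool: Pool,
      initialThreshold: Long): Either[Array[Any], Iterator[A]] = {
    val buffer = ArrayBuffer.empty[A]
    var currentSize = 0L
    var threshold = initialThreshold
    var elementsUnrolled = 0
    // Up-front reservation: if this fails, the loop below never runs
    var keepUnrolling = pool.reserve(initialThreshold)

    while (values.hasNext && keepUnrolling) {
      val v = values.next()
      buffer += v
      currentSize += sizeOf(v)
      if (elementsUnrolled % CheckPeriod == 0 && currentSize >= threshold) {
        val amountToRequest = (currentSize * GrowthFactor - threshold).toLong
        keepUnrolling = pool.reserve(amountToRequest)
        threshold += amountToRequest
      }
      elementsUnrolled += 1
    }

    // Fully unrolled: return an array. Otherwise return what was unrolled plus the rest.
    if (keepUnrolling) Left(buffer.toArray[Any])
    else Right(buffer.iterator ++ values)
  }
}
```

With 100-byte elements and a 1000-byte initial threshold, the first check that crosses the threshold happens at the 17th element (1700 bytes) and requests 1700 * 1.5 - 1000 = 1550 bytes. If the pool cannot supply that, the method returns a `Right` whose iterator still yields every element, so no data is lost when caching in memory fails.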