Spark Cache Management: A Deep Dive into the CacheManager Source Code

Latest recommended article published 2024-09-26 19:15:00

zhang_yuming

Categories: Big Data, Spark. Tags: Spark, CacheManager, Big Data

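Spark stands out because RDDs give it a unified, multi-paradigm computing core. Handling several computing paradigms therefore does not require deploying several frameworks: one team with one technology stack can cover them. That lowers the investment in software, hardware, and staffing while keeping output high.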

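From a business standpoint, lower cost and higher output are always the right direction. Five sub-frameworks are currently built on top of RDDs, yet in the author's estimate not even 5% of Spark's potential has been realized, so there is a great deal of room to grow.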

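Some people mistakenly believe that Spark can only compute in memory. Versions before 1.2 did have some memory-related problems, but the real core of its performance is the DAG: good scheduling and native support for multi-step iteration are where its strength lies.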

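Within this picture, CacheManager plays a pivotal role in multi-step iterative algorithms and in interactive data-warehouse workloads. It manages the data that has been cached.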

I. CacheManager analysis:

1. The cache managed by CacheManager can be memory-based or disk-based;

2. CacheManager operates on the data through BlockManager;

3. Whenever a Task runs, it calls the RDD's compute method, and compute in turn calls the iterator method of the parent RDD:

// compute as defined in MapPartitionsRDD: apply f to the parent RDD's iterator for this partition
override def compute(split: Partition, context: TaskContext): Iterator[U] =
  f(context, split.index, firstParent[T].iterator(split, context))

/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}

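The iterator method first checks whether the data has been cached. If it has, the cached data is used; only otherwise is the partition computed. Note that iterator is declared final, so a custom RDD subclass cannot override it. As the doc comment says, implementors of custom RDDs may call it, and they override compute instead.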
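The branch in iterator can be illustrated with a minimal, Spark-free sketch. The names ToyRdd and ToyCache are hypothetical stand-ins invented for this example; they are not Spark classes, and the cache here is just an in-memory map.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an RDD; not Spark's API.
final class ToyRdd[T](val id: Int, cached: Boolean, computeFn: Int => Seq[T]) {
  // Counts how often a partition was really computed.
  var computeCalls = 0

  def compute(split: Int): Iterator[T] = {
    computeCalls += 1
    computeFn(split).iterator
  }

  // Mirrors the branch in RDD.iterator: consult the cache only when caching is on.
  def iterator(split: Int, cache: ToyCache): Iterator[T] =
    if (cached) cache.getOrCompute(this, split) else compute(split)
}

// Hypothetical stand-in for CacheManager plus BlockManager, backed by a plain map.
final class ToyCache {
  private val blocks = mutable.Map.empty[(Int, Int), Vector[Any]]

  def getOrCompute[T](rdd: ToyRdd[T], split: Int): Iterator[T] = {
    val key = (rdd.id, split)
    blocks.get(key) match {
      case Some(values) =>
        // Cache hit: return the stored values without recomputing.
        values.iterator.asInstanceOf[Iterator[T]]
      case None =>
        // Cache miss: compute, store, then return.
        val values = rdd.compute(split).toVector
        blocks(key) = values
        values.iterator
    }
  }
}
```

Calling iterator twice on a cached ToyRdd computes the partition once; with cached = false every call recomputes it.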

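The check if (storageLevel != StorageLevel.NONE) means the RDD itself must have a storage level set. With cache() the level defaults to memory (MEMORY_ONLY), but the data can also be placed on disk or in Tachyon, as the source below shows: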

class StorageLevel private(
    private var _useDisk: Boolean,    // disk
    private var _useMemory: Boolean,  // memory
    private var _useOffHeap: Boolean, // Tachyon

Spark's cached data may therefore live in memory, on disk, or in Tachyon.

II. CacheManager source code in detail

1. While it works, the cache retains as much data as it can, but the data is not guaranteed to be complete.

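How should this be understood? If the current computation needs memory, data cached in memory has to make room for it. If the RDD was persisted with a level that also allows disk, part of the cached data can be dropped from memory to disk. Otherwise the data that was already computed and cached is lost and has to be recomputed.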

/** Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached. */
def getOrCompute[T](
    rdd: RDD[T],
    partition: Partition,
    context: TaskContext,
    storageLevel: StorageLevel): Iterator[T] = {

  val key = RDDBlockId(rdd.id, partition.index)
  logDebug(s"Looking for partition $key")
  blockManager.get(key) match {
    case Some(blockResult) =>
      // Partition is already materialized, so just return its values
      val existingMetrics = context.taskMetrics().registerInputMetrics(blockResult.readMethod)
      existingMetrics.incBytesRead(blockResult.bytes)

      val iter = blockResult.data.asInstanceOf[Iterator[T]]
      new InterruptibleIterator[T](context, iter) {
        override def next(): T = {
          existingMetrics.incRecordsRead(1)
          delegate.next()
        }
      }
    case None =>
      // Acquire a lock for loading this partition
      // If another thread already holds the lock, wait for it to finish return its results
      val storedValues = acquireLockForPartition[T](key)
      if (storedValues.isDefined) {
        return new InterruptibleIterator[T](context, storedValues.get)
      }

      // Otherwise, we have to load the partition ourselves
      try {
        logInfo(s"Partition $key not found, computing it")
        val computedValues = rdd.computeOrReadCheckpoint(partition, context)
        val cachedValues = putInBlockManager(key, computedValues, storageLevel)
        new InterruptibleIterator(context, cachedValues)
      } finally {
        loading.synchronized {
          loading.remove(key)
          loading.notifyAll()
        }
      }
  }
}

As this shows, the cache is not reliable.
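Why cached data can disappear is easier to see in a small model. The sketch below is an assumption-laden toy, not Spark's memory store: ToyLevel and ToyStore are hypothetical names, every block occupies exactly one slot, and eviction is simply oldest-first.

```scala
import scala.collection.mutable

// Hypothetical storage level with only the two flags that matter here.
final case class ToyLevel(useDisk: Boolean, useMemory: Boolean)

// Hypothetical block store with a fixed number of memory slots.
final class ToyStore(memorySlots: Int) {
  require(memorySlots > 0, "need at least one memory slot")

  // LinkedHashMap keeps insertion order, so the head is the oldest block.
  private val memory = mutable.LinkedHashMap.empty[String, (ToyLevel, Vector[Int])]
  private val disk = mutable.Map.empty[String, Vector[Int]]

  def put(key: String, level: ToyLevel, data: Vector[Int]): Unit = {
    if (level.useMemory) {
      // Make room first: cached blocks must yield to the new one.
      while (memory.size >= memorySlots) evictOldest()
      memory(key) = (level, data)
    } else if (level.useDisk) {
      disk(key) = data
    }
  }

  private def evictOldest(): Unit = {
    val (oldKey, (oldLevel, oldData)) = memory.head
    memory.remove(oldKey)
    // Spill only if the block's level allows disk; otherwise the data is gone.
    if (oldLevel.useDisk) disk(oldKey) = oldData
  }

  def get(key: String): Option[Vector[Int]] =
    memory.get(key).map(_._2).orElse(disk.get(key))
}
```

In this model a memory-only block that gets evicted can no longer be read and would have to be recomputed, while a block whose level also allows disk survives eviction.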

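2. When CacheManager fetches cached data, it gets the data through BlockManager. Once data has been cached, BlockManager manages it, and the block key (an RDDBlockId) is all that is needed to retrieve it.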

logInfo(s"Finished waiting for $id")
val values = blockManager.get(id)
if (!values.isDefined) {

/**
 * Get a block from the block manager (either local or remote).
 */
def get(blockId: BlockId): Option[BlockResult] = {
  val local = getLocal(blockId)
  if (local.isDefined) {
    logInfo(s"Found block $blockId locally")
    return local
  }
  val remote = getRemote(blockId)
  if (remote.isDefined) {
    logInfo(s"Found block $blockId remotely")
    return remote
  }
  None
}

Following the data-locality principle, the local store is checked first. Whether the block is local or remote, it is retrieved by its blockId.
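The local-first lookup order can be sketched with Option chaining. ToyBlockManager and its two maps are hypothetical stand-ins for the local and remote block stores; the lastSource field exists only to make the lookup path observable in this example.

```scala
// Hypothetical stand-in for BlockManager.get; not Spark's API.
final class ToyBlockManager(
    local: Map[String, Seq[Int]],
    remote: Map[String, Seq[Int]]) {

  // Records where the last successful lookup was served from.
  var lastSource: String = "none"

  def get(blockId: String): Option[Seq[Int]] = {
    val fromLocal = local.get(blockId).map { v => lastSource = "local"; v }
    // orElse takes its argument by name, so the remote store is only
    // consulted when the local lookup found nothing.
    fromLocal.orElse(remote.get(blockId).map { v => lastSource = "remote"; v })
  }
}
```

A block present in both stores is served locally, and a block missing from both yields None, which is the case where CacheManager has to compute the partition.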

val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1)

val key = RDDBlockId(rdd.id, partition.index)
logDebug(s"Looking for partition $key")
blockManager.get(key) match {
  case Some(blockResult) =>
    // Partition is already materialized, so just return its values
    .....
    }
  case None =>
    // Acquire a lock for loading this partition
    // If another thread already holds the lock, wait for it to finish return its results
    val storedValues = acquireLockForPartition[T](key)
    if (storedValues.isDefined) {
      return new InterruptibleIterator[T](context, storedValues.get)
    }

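If the result is None, the cached data is missing. Why call acquireLockForPartition in that case? Because another thread may be working on the same partition. And why would another thread be working on the same partition?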

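The reason is Spark's speculative execution for slow (straggler) tasks. With speculation enabled, two tasks can be launched for the same partition on two machines. So even though the block was found neither locally nor remotely at lookup time, the other task may well have finished computing it by the time the wait returns.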
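The wait-then-recheck idea behind acquireLockForPartition can be sketched as follows. This is a simplified illustration under stated assumptions, not the CacheManager code: ToyLoader is a hypothetical class, the block store is a plain map guarded by the same monitor as the loading set, and it only coordinates threads inside one JVM.

```scala
import scala.collection.mutable

// Hypothetical loader that lets one thread compute a key while others wait.
final class ToyLoader {
  private val loading = mutable.HashSet.empty[String]
  private val blocks = mutable.Map.empty[String, Vector[Int]]
  @volatile var computeCalls = 0

  def getOrCompute(key: String)(compute: => Vector[Int]): Vector[Int] = {
    val cached = loading.synchronized {
      // Wait while another thread is loading this key.
      while (loading.contains(key)) loading.wait()
      // Re-check after waiting: the other thread may have stored the result.
      val hit = blocks.get(key)
      if (hit.isEmpty) loading.add(key) // this thread becomes the loader
      hit
    }
    cached.getOrElse {
      try {
        val values = compute
        loading.synchronized {
          computeCalls += 1
          blocks(key) = values
        }
        values
      } finally {
        // Always release the key and wake up waiters, even if compute failed.
        loading.synchronized {
          loading.remove(key)
          loading.notifyAll()
        }
      }
    }
  }
}
```

When several threads ask for the same key at once, one of them computes it and the rest block in wait() until notifyAll() fires, then read the stored result instead of recomputing.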

3. If CacheManager cannot obtain the cached content through BlockManager, it obtains the data through the following RDD method: val computedValues = rdd.computeOrReadCheckpoint(partition, context)

See the code at line 67 of CacheManager.scala:

// Otherwise, we have to load the partition ourselves
try {
  logInfo(s"Partition $key not found, computing it")
  val computedValues = rdd.computeOrReadCheckpoint(partition, context)
  val cachedValues = putInBlockManager(key, computedValues, storageLevel)
  new InterruptibleIterator(context, cachedValues)
} finally {
  loading.synchronized {
    loading.remove(key)
    loading.notifyAll()
  }
}

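The computeOrReadCheckpoint method does not compute right away. It first checks whether the current RDD has been checkpointed. If so, it reads the checkpoint data directly (which is why checkpointing matters: for job-level iteration it avoids repeated computation). Otherwise the partition has to be computed.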

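After the computation, putInBlockManager caches the data again according to the StorageLevel, so the next read is more likely to hit the cache. It is only "more likely" because the data may be evicted again, which is why having enough memory matters.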

/**
 * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
 */
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
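The checkpoint-or-compute decision can be modelled in a few lines. This sketch is a simplification under stated assumptions: ToyCheckpointedRdd is a hypothetical class, and the checkpoint is an in-memory map standing in for reliable storage, whereas in the source above the checkpointed data is reached through firstParent.

```scala
// Hypothetical RDD-like class that can be checkpointed; not Spark's API.
final class ToyCheckpointedRdd(computeFn: Int => Vector[Int]) {
  // Partition index -> materialized data; None until a checkpoint exists.
  private var checkpoint: Option[Map[Int, Vector[Int]]] = None

  // Counts how often a partition was really computed.
  var computeCalls = 0

  private def computeOnce(split: Int): Vector[Int] = {
    computeCalls += 1
    computeFn(split)
  }

  // Materializes the given partitions into the stand-in checkpoint store.
  def doCheckpoint(partitions: Seq[Int]): Unit = {
    checkpoint = Some(partitions.map(p => p -> computeOnce(p)).toMap)
  }

  // Read the checkpoint when it exists; compute only as a fallback.
  def computeOrReadCheckpoint(split: Int): Iterator[Int] = checkpoint match {
    case Some(data) => data(split).iterator
    case None       => computeOnce(split).iterator
  }
}
```

Once doCheckpoint has run, further reads of a checkpointed partition no longer increase computeCalls, which is the saving the article attributes to checkpointing.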