As usual, let's start with a diagram of the whole flow.
Into the source code:
RDD#iterator
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
First it checks whether a storage level has been set for this RDD.
Let's follow the if branch first, i.e. the RDD has been persisted.
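As a quick reminder of what puts us in this branch, here is a minimal driver-side sketch (app name and master are illustrative): calling persist() is what makes storageLevel != StorageLevel.NONE, so iterator() goes through the CacheManager instead of recomputing the partition on every action.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-demo").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // sets storageLevel on this RDD
    rdd.count() // first action: partitions are computed, then cached
    rdd.count() // second action: partitions are served by the BlockManager
    sc.stop()
  }
}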
CacheManager#getOrCompute
blockManager.get(key)
Stepping into it:
def get(blockId: BlockId): Option[BlockResult] = {
  val local = getLocal(blockId)
  if (local.isDefined) {
    logInfo(s"Found block $blockId locally")
    return local
  }
  val remote = getRemote(blockId)
  if (remote.isDefined) {
    logInfo(s"Found block $blockId remotely")
    return remote
  }
  None
}
As you can see, it first tries to get the block locally, then remotely. We analyzed this in the previous section, so we won't go over it again.
If no data was found:
val storedValues = acquireLockForPartition[T](key)
This tries to fetch the data one more time. Stepping in, we find this line:
val values = blockManager.get(id)
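The reason for this second lookup: another task on the same executor may be computing the very same partition, so acquireLockForPartition waits for it and then re-checks the cache. A simplified, self-contained sketch of this "check, wait, re-check" pattern (the loading set here stands in for the CacheManager's internal bookkeeping):

import scala.collection.mutable

object DoubleCheckedCache {
  private val loading = mutable.HashSet[String]()

  def getOrCompute[T](key: String, lookup: String => Option[T], compute: => T): T = {
    lookup(key).getOrElse {
      loading.synchronized {
        // Someone else may be computing this key; wait until they finish.
        while (loading.contains(key)) loading.wait()
        lookup(key) match {
          case Some(v) => return v         // the other task cached it for us
          case None    => loading.add(key) // we take responsibility for computing it
        }
      }
      val value = compute
      loading.synchronized { loading.remove(key); loading.notifyAll() }
      value
    }
  }
}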
If the data still isn't there, we fall back to computing the partition (or reading it from a checkpoint):
val computedValues = rdd.computeOrReadCheckpoint(partition, context)
If the RDD has been checkpointed and materialized, the data is read through the checkpoint: after checkpointing, the CheckpointRDD has become this RDD's parent, so firstParent[T].iterator does the read. Otherwise the partition is recomputed:
if (isCheckpointedAndMaterialized) {
  firstParent[T].iterator(split, context)
} else {
  compute(split, context)
}
Back in CacheManager#getOrCompute, the computed values are then handed to putInBlockManager to be cached:
val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
if (!putLevel.useMemory) {
  // ...
} else {
What happens here depends on the storage level. The disk path is straightforward, so we skip it and look at the else branch (the memory path):
blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
  case Left(arr) =>
    // We have successfully unrolled the entire partition, so cache it in memory
    updatedBlocks ++=
      blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
    arr.iterator.asInstanceOf[Iterator[T]]
  case Right(it) =>
    // There is not enough space to cache this partition in memory
    val returnValues = it.asInstanceOf[Iterator[T]]
    if (putLevel.useDisk) {
      logWarning(s"Persisting partition $key to disk instead.")
      val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
        useOffHeap = false, deserialized = false, putLevel.replication)
      putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
    } else {
      returnValues
    }
}
First, unrollSafely is called to try to materialize the data in memory. Let's step into unrollSafely:
keepUnrolling = reserveUnrollMemoryForThisTask(blockId, initialMemoryThreshold, droppedBlocks)
This reserves an initial chunk of unroll memory for the task, potentially dropping existing cached blocks to make room. Then the unrolling loop begins:
while (values.hasNext && keepUnrolling) {
  vector += values.next()
  if (elementsUnrolled % memoryCheckPeriod == 0) {
    // If our vector's size has exceeded the threshold, request more memory
    val currentSize = vector.estimateSize()
    if (currentSize >= memoryThreshold) {
      val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
      keepUnrolling = reserveUnrollMemoryForThisTask(
        blockId, amountToRequest, droppedBlocks)
      // New threshold is currentSize * memoryGrowthFactor
      memoryThreshold += amountToRequest
    }
  }
  elementsUnrolled += 1
}
This loop keeps appending elements to the in-memory vector; every memoryCheckPeriod elements it estimates the current size, and once the threshold is exceeded it calls reserveUnrollMemoryForThisTask to reserve more memory (evicting cached blocks if necessary).
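The periodic check (rather than estimating the size on every element) keeps the cost of estimateSize low, and over-requesting by memoryGrowthFactor avoids asking for memory too often. A self-contained sketch of the same incremental-unroll idea, where a plain counter stands in for the MemoryStore's bookkeeping:

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

object UnrollSketch {
  val memoryCheckPeriod = 16      // estimate the size only every 16 elements
  val memoryGrowthFactor = 1.5    // over-request so we don't ask too often
  var freeMemory: Long = 1L << 20 // pretend we have 1 MB to hand out

  def reserve(bytes: Long): Boolean =
    if (bytes <= freeMemory) { freeMemory -= bytes; true } else false

  // Left(array) if the whole iterator fit in memory, Right(leftover) otherwise.
  def unrollSafely[T: ClassTag](values: Iterator[T], sizeOf: T => Long): Either[Array[T], Iterator[T]] = {
    val vector = ArrayBuffer[T]()
    var used = 0L
    var threshold = 4096L
    var keepUnrolling = reserve(threshold)
    var elementsUnrolled = 0
    while (values.hasNext && keepUnrolling) {
      val v = values.next(); vector += v; used += sizeOf(v)
      if (elementsUnrolled % memoryCheckPeriod == 0 && used >= threshold) {
        val amountToRequest = (used * memoryGrowthFactor - threshold).toLong
        keepUnrolling = reserve(amountToRequest)
        threshold += amountToRequest
      }
      elementsUnrolled += 1
    }
    if (keepUnrolling && !values.hasNext) Left(vector.toArray)
    else Right(vector.iterator ++ values)
  }
}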
After unrollSafely returns there are two branches: if the whole partition fit in memory (Left), it is cached there via putArray; if not (Right), there is a further branch:
if (putLevel.useDisk) {
  logWarning(s"Persisting partition $key to disk instead.")
  val diskOnlyLevel = StorageLevel(useDisk = true, useMemory = false,
    useOffHeap = false, deserialized = false, putLevel.replication)
  putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
} else {
  returnValues
}
If the storage level allows disk, the partition is written out under a disk-only level instead; if not, the iterator is returned as-is and the data is simply not cached (it will be recomputed the next time it is needed).
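From the user's point of view this is exactly why MEMORY_ONLY can silently skip partitions that don't fit, while MEMORY_AND_DISK spills them. An illustrative snippet (assuming an existing SparkContext sc):

import org.apache.spark.storage.StorageLevel

val big = sc.parallelize(1 to 10000000)
big.persist(StorageLevel.MEMORY_ONLY)
big.count() // partitions that did not fit in memory are NOT cached...
big.count() // ...so they are recomputed here

// With StorageLevel.MEMORY_AND_DISK, the Right(it) branch above would have
// written those partitions to disk, and the second count would read from disk.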
Back in CacheManager#getOrCompute:
val metrics = context.taskMetrics
val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
This appends the blocks updated by this task to the task's metrics, which feeds monitoring (e.g. the storage information shown in the web UI).
---------------------
Next, let's talk about checkpoint.
First, a diagram to illustrate the checkpoint mechanism:
Differences between checkpoint and persistence:
1. Checkpoint is safer than persistence: the data goes to reliable external storage (typically HDFS), whereas persisted data lives in executor memory or on local disk and is lost if the executor dies.
2. Checkpoint changes the RDD's lineage (the checkpoint RDD becomes the new parent, truncating everything before it), while persistence leaves the lineage untouched.
An RDD that will be checkpointed should usually also be persisted, because checkpointing runs only after the job finishes: without persistence, the RDD would be recomputed from scratch just to write the checkpoint files.
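A minimal sketch of the recommended persist-then-checkpoint pattern (the checkpoint directory is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))
    sc.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints") // reliable storage
    val rdd = sc.parallelize(1 to 100).map(_ + 1)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // avoid recomputing when the checkpoint is written
    rdd.checkpoint()                          // only marks the RDD; nothing is written yet
    rdd.count()                // runs the job, then doCheckpoint() writes the files
    println(rdd.toDebugString) // the lineage now starts at a ReliableCheckpointRDD
    sc.stop()
  }
}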
Source code:
Find RDD#doCheckpoint:
private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) {
        checkpointData.get.checkpoint()
      } else {
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}
The call checkpointData.get.checkpoint() brings us to RDDCheckpointData#checkpoint:
final def checkpoint(): Unit = {
  // Guard against multiple threads checkpointing the same RDD by
  // atomically flipping the state of this RDDCheckpointData
  RDDCheckpointData.synchronized {
    if (cpState == Initialized) {
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }
  // ... (the actual checkpoint work follows)
}
So doCheckpoint walks down the dependency chain, and for an RDD marked for checkpointing, checkpoint() atomically flips its state to CheckpointingInProgress so that it is checkpointed only once.
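The state machine is Initialized -> CheckpointingInProgress -> Checkpointed. A stripped-down sketch of the same guard (state names match the source, but the surrounding class is simplified, and the real code locks on the RDDCheckpointData companion object rather than on this):

sealed trait CheckpointState
case object Initialized extends CheckpointState
case object CheckpointingInProgress extends CheckpointState
case object Checkpointed extends CheckpointState

class CheckpointGuard {
  private var cpState: CheckpointState = Initialized

  def checkpointOnce(doWork: => Unit): Unit = {
    // Only the first caller flips the state; later callers return immediately.
    this.synchronized {
      if (cpState == Initialized) cpState = CheckpointingInProgress else return
    }
    doWork
    this.synchronized { cpState = Checkpointed }
  }
}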
For reliable checkpointing we end up in ReliableRDDCheckpointData#doCheckpoint:
protected override def doCheckpoint(): CheckpointRDD[T] = {
  val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)
  // Optionally clean our checkpoint files if the reference is out of scope
  if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
    rdd.context.cleaner.foreach { cleaner =>
      cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
    }
  }
  logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
  newRDD
}
Here the RDD's data is written out under the checkpoint directory (cpDir) and a new ReliableCheckpointRDD is returned; that new RDD becomes the parent of the original one, which is how the lineage gets truncated.
Reading the checkpointed data back goes through the compute method of ReliableCheckpointRDD:
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
  ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
}
As you can see, at this point we are already down to the Hadoop/HDFS API: each partition is read back from its own file under the checkpoint directory.
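To make that concrete, here is a hedged sketch of what opening one checkpoint part file boils down to. Assumptions: the "part-%05d" file-name pattern (matching checkpointFileName), and deserialization is elided; roughly speaking, the real readCheckpointFile wraps this stream with the SparkEnv serializer's deserializeStream.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def openPartition(checkpointDir: String, partitionIndex: Int): java.io.InputStream = {
  val file = new Path(checkpointDir, "part-%05d".format(partitionIndex))
  val fs: FileSystem = file.getFileSystem(new Configuration())
  fs.open(file) // from here on it is plain HDFS stream I/O
}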