Spark Checkpoint: Source Code and Mechanism

weiha666

Category: spark

Article link: https://blog.csdn.net/weiha666/article/details/103702870

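This article takes an in-depth look at Spark's Checkpoint mechanism. It analyzes the implementations of LocalRDDCheckpointData and ReliableRDDCheckpointData from the source code and explains how Checkpoint improves the fault tolerance of stream processing. It also shows how to try out Checkpoint in local mode and explains how a checkpoint is read back.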

1 Overview

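The notion of a checkpoint is not specific to Spark. The SQL Server documentation, for instance, describes it this way: a checkpoint creates a known good point from which the SQL Server Database Engine can start applying changes contained in the log during recovery after an unexpected shutdown or crash.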

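Stream processing needs a highly fault-tolerant mechanism to keep a program stable and robust. Let's read the source to see what Checkpoint actually does in Spark. A search through the source turns up a Checkpoint class in the Streaming package.

SparkContext is the entry point of a Spark program, so it is the first place to check how Checkpoint is handled. As we know, SparkContext holds references to many of Spark's internal objects, and it is also where the checkpoint directory path is defined.

Let's start from a short piece of code that uses checkpoint. The Spark version is 2.1.1.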

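import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("streaming").setMaster("local[*]")
val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val sc = sparkSession.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
// Set the checkpoint directory
ssc.checkpoint("./streaming_checkpoint")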

Now look at ssc.checkpoint:

def checkpoint(directory: String) {
  if (directory != null) {
    val path = new Path(directory)
    val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
    fs.mkdirs(path)
    val fullPath = fs.getFileStatus(path).getPath().toString
    sc.setCheckpointDir(fullPath)
    checkpointDir = fullPath
  } else {
    checkpointDir = null
  }
}

Next, sc.setCheckpointDir:

// Definition of checkpointDir
private[spark] var checkpointDir: Option[String] = None

/**
 * Set the directory under which RDDs are going to be checkpointed. The directory must
 * be a HDFS path if running on a cluster.
 */
def setCheckpointDir(directory: String) {

  // If we are running on a cluster, log a warning if the directory is local.
  // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
  // its own local file system, which is incorrect because the checkpoint files
  // are actually on the executor machines.
  // (Author's note) In cluster mode, setting a local directory only logs a warning.
  // The reason is simple: the directory that gets created is local to the executors.
  // It still works, but it is not sensible; checkpoint data is best kept on a
  // distributed file system.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }

  checkpointDir = Option(directory).map { dir =>
    // The subdirectory name is generated by UUID.randomUUID()
    val path = new Path(dir, UUID.randomUUID().toString)
    val fs = path.getFileSystem(hadoopConfiguration)
    fs.mkdirs(path)
    fs.getFileStatus(path).getPath.toString
  }
}
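
To make the UUID subdirectory visible, the following is a minimal sketch rather than anything from the Spark code base. It assumes a local-mode SparkContext and a temporary base directory; the object name `CheckpointDirSketch` and the helper `resolvedCheckpointDir` are hypothetical. It reads the stored value back through the public `getCheckpointDir` accessor.

```scala
import java.nio.file.Files
import java.util.UUID

import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDirSketch {

  /** Sets `base` as the checkpoint directory and returns the path Spark actually stored. */
  def resolvedCheckpointDir(base: String): String = {
    val conf = new SparkConf().setAppName("checkpoint-dir-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir(base)
      // getCheckpointDir exposes the Option[String] held in checkpointDir.
      sc.getCheckpointDir.get
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed base directory: a fresh temp directory on the local file system.
    val base = Files.createTempDirectory("ckpt-base").toString
    val resolved = resolvedCheckpointDir(base)
    println(s"base     = $base")
    println(s"resolved = $resolved")

    // The last path component should parse as a UUID (fromString throws otherwise).
    val childName = new Path(resolved).getName
    println(s"uuid     = ${UUID.fromString(childName)}")
  }
}
```

The resolved path is a child of the directory that was passed in, so each SparkContext writes its checkpoint files into its own subdirectory.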

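Which classes call setCheckpointDir? Apart from StreamingContext, where it is expected (fault tolerance is a basic guarantee of stream processing), the callers are workloads that iterate over an RDD many times, notably machine learning algorithms such as ALS and Decision Tree (the original post lists them in a screenshot). These algorithms reuse RDDs over and over, and on a large dataset Cache alone is of little use, so they generally rely on Checkpoint.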
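
The iterative pattern can be sketched as follows. This is an illustrative loop, not the code used by ALS or Decision Tree; the object name `IterativeCheckpointSketch`, the `iterate` helper, and the checkpoint interval are all assumptions. It marks the RDD for checkpointing every few rounds and then runs an action so that the checkpoint is actually written.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeCheckpointSketch {

  /** Adds 1 to every element `iterations` times, checkpointing every `interval` rounds. */
  def iterate(sc: SparkContext, iterations: Int, interval: Int): RDD[Int] = {
    var current: RDD[Int] = sc.parallelize(1 to 10)
    for (i <- 1 to iterations) {
      current = current.map(_ + 1)
      if (i % interval == 0) {
        current.checkpoint() // only marks the RDD; nothing is written yet
        current.count()      // an action runs the job, which then writes the checkpoint
      }
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterative-checkpoint").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed checkpoint location: a temp directory, acceptable only in local mode.
      sc.setCheckpointDir(Files.createTempDirectory("ckpt-iter").toString)
      val result = iterate(sc, iterations = 20, interval = 5)
      println(result.toDebugString) // lineage as seen after the last checkpoint
      println(result.reduce(_ + _))
    } finally {
      sc.stop()
    }
  }
}
```

In practice the RDD is often persisted before it is checkpointed, because otherwise writing the checkpoint recomputes the RDD a second time.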

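Here I only plan to dig into the spark core code. A tip for IDEA users: the search results window (bottom right in the original screenshot) can export the code matching your search keyword to an external file, which you can then open to see how the Checkpoint-related code in spark core is organized.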

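Looking further for Checkpoint-related code, we find that the last line of the runJob method is a call to rdd.doCheckpoint(). runJob, as we know, is the method that every action triggers, so let's step into doCheckpoint().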

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
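
Because runJob calls rdd.doCheckpoint() only as its last step, marking an RDD with checkpoint() writes nothing until an action has run. The sketch below illustrates that under stated assumptions: a local-mode SparkContext and a temporary checkpoint directory. The object name `LazyCheckpointSketch` and the `observe` helper are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object LazyCheckpointSketch {

  /** Returns isCheckpointed as observed before and after the first action. */
  def observe(sc: SparkContext): (Boolean, Boolean) = {
    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()                // mark only; requires a checkpoint directory to be set
    val before = rdd.isCheckpointed // no job has run yet
    rdd.count()                     // action -> runJob -> rdd.doCheckpoint()
    val after = rdd.isCheckpointed
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-checkpoint").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed checkpoint location: a temp directory, acceptable only in local mode.
      sc.setCheckpointDir(Files.createTempDirectory("ckpt-lazy").toString)
      val (before, after) = observe(sc)
      println(s"before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```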

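At this point we have essentially found the core method of Checkpoint. doCheckpoint() is a private[spark] method on RDD, which answers the question raised at the start: when we talk about Checkpoint, what exactly is being checkpointed? The answer is the RDD.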

private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    // Has doCheckpoint already been called on this RDD? If not, handle it now.
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      // Is RDDCheckpointData defined, i.e. was this RDD marked for checkpointing?
      if (checkpointData.isDefined) {
        // Should the marked ancestors in this RDD's lineage be checkpointed as well?
        if (checkpointAllMarkedAncestors) {
          // Call this method recursively on every parent RDD in the lineage
          dependencies.foreach(_.rdd.doCheckpoint())
        }
        // Call the checkpoint method of RDDCheckpointData
        checkpointData.get.checkpoint()
      } else {
        // Not marked itself: keep walking up the lineage
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}
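
The recursion in doCheckpoint() can be traced with a toy model. The sketch below is plain Scala and is not Spark code: `Node` is a hypothetical stand-in for an RDD, `marked` stands in for `checkpointData.isDefined`, and appending to a buffer stands in for `checkpointData.get.checkpoint()`.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for an RDD; all names here are hypothetical.
final class Node(val name: String, val parents: Seq[Node], val marked: Boolean) {
  private var doCheckpointCalled = false

  def doCheckpoint(allMarkedAncestors: Boolean, written: ArrayBuffer[String]): Unit = {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (marked) {
        if (allMarkedAncestors) {
          // Parents first, mirroring the order of calls in doCheckpoint()
          parents.foreach(_.doCheckpoint(allMarkedAncestors, written))
        }
        written += name // stands in for writing the checkpoint
      } else {
        // Not marked: only propagate the call up the lineage
        parents.foreach(_.doCheckpoint(allMarkedAncestors, written))
      }
    }
  }
}

object DoCheckpointModel {

  /** Lineage a <- b <- c with a and c marked; returns names in the order they are written. */
  def run(allMarkedAncestors: Boolean): List[String] = {
    val a = new Node("a", Nil, marked = true)
    val b = new Node("b", Seq(a), marked = false)
    val c = new Node("c", Seq(b), marked = true)
    val written = ArrayBuffer.empty[String]
    c.doCheckpoint(allMarkedAncestors, written)
    written.toList
  }

  def main(args: Array[String]): Unit = {
    println(run(allMarkedAncestors = false)) // List(c)
    println(run(allMarkedAncestors = true))  // List(a, c)
  }
}
```

With the flag off, the walk stops at the first marked node, so only `c` is written. With the flag on, the marked ancestor `a` is written before `c`.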

As doCheckpoint() shows, the method has to check whether the variable checkpointData is defined. That variable is declared as follows.

private[spark] var checkpointData: Option[RDDCheckpointData[T]] = None

Now let's see what kind of data structure RDDCheckpointData is.

/**
 * This class contains all the information related to RDD checkpointing. Each instance of this
 * class is associated with an RDD. It manages process of checkpointing of the associated RDD,
 * as well as, manages the post-checkpoint state by providing the updated partitions,
 * iterator and preferred locations of the checkpointed RDD.
 */
private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends