Spark Learning Notes 12: checkpoint

To checkpoint an RDD, you must first call SparkContext.setCheckpointDir to set where the checkpoint data will be stored. The checkpoint itself is triggered from SparkContext.runJob, so if you understand how a whole Job executes, RDD checkpointing is relatively easy to follow.

1. RDD.checkpoint

  def checkpoint() {
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new RDDCheckpointData(this))
      checkpointData.get.markForCheckpoint()
    }
  }
(1) context is a reference to the SparkContext. When checkpoint is called, it first checks whether checkpointDir has been set (see the sketch after this list);
(2) an RDDCheckpointData object is created, wrapping the RDD to be checkpointed;
(3) the RDDCheckpointData is put into the MarkedForCheckpoint state.
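If setCheckpointDir has not been called first, the check above fails immediately; an illustrative spark-shell session (output abbreviated):
scala> sc.parallelize(1 to 3).checkpoint
org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
......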

2. SparkContext.runJob

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit) {
    ......
    rdd.doCheckpoint()
  }
After the job has finished, runJob calls the RDD's doCheckpoint method, which checks whether this RDD, or an RDD it depends on, has been marked for checkpointing, and if so launches the checkpoint job.

2.1. RDD.doCheckpoint

  private[spark] def doCheckpoint() {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) {
        checkpointData.get.doCheckpoint()
      } else {
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
(1) doCheckpointCalled is a boolean flag. The first time doCheckpoint is called on an RDD it is set to true, so that an RDD belonging to several Jobs is not checkpointed more than once.
(2) If this RDD has a checkpoint set, RDDCheckpointData.doCheckpoint is called to perform it;
(3) otherwise the method recurses through the dependencies, so the first RDD found that needs checkpointing is the one checkpointed. As the code shows, if the current RDD is checkpointed, a checkpoint requested on one of its parent RDDs will not be executed.
(4) RDD.checkpoint must be called before the Job that uses the RDD runs; otherwise the checkpoint never executes. doCheckpoint runs after every job completes and sets doCheckpointCalled to true, so a checkpoint marked after the first job has no effect. The pitfall is sketched below.
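The ordering pitfall in (4) is easy to reproduce in the spark-shell (an illustrative sketch against the Spark 1.3 code quoted above; the checkpoint directory is the one used in section 3):
scala> sc.setCheckpointDir("hdfs://CentOS-01:8020/tmp")
scala> val rdd = sc.parallelize(1 to 10)
scala> rdd.count          // first job completes: doCheckpoint runs, doCheckpointCalled becomes true
scala> rdd.checkpoint     // too late: the RDD is only marked
scala> rdd.count          // doCheckpointCalled is already true, so no checkpoint job is launched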

2.2. RDDCheckpointData.doCheckpoint

  def doCheckpoint() {
    // If it is marked for checkpointing AND checkpointing is not already in progress,
    // then set it to be in progress, else return
    RDDCheckpointData.synchronized {
      if (cpState == MarkedForCheckpoint) {
        cpState = CheckpointingInProgress
      } else {
        return
      }
    }
    // Create the output path for the checkpoint
    val path = new Path(rdd.context.checkpointDir.get, "rdd-" + rdd.id)
    val fs = path.getFileSystem(rdd.context.hadoopConfiguration)
    if (!fs.mkdirs(path)) {
      throw new SparkException("Failed to create checkpoint path " + path)
    }
    // Save to file, and reload it as an RDD
    val broadcastedConf = rdd.context.broadcast(
      new SerializableWritable(rdd.context.hadoopConfiguration))
    rdd.context.runJob(rdd, CheckpointRDD.writeToFile[T](path.toString, broadcastedConf) _)
    val newRDD = new CheckpointRDD[T](rdd.context, path.toString)
    if (newRDD.partitions.size != rdd.partitions.size) {
      throw new SparkException(
        "Checkpoint RDD " + newRDD + "(" + newRDD.partitions.size + ") has different " +
          "number of partitions than original RDD " + rdd + "(" + rdd.partitions.size + ")")
    }
    // Change the dependencies and partitions of the RDD
    RDDCheckpointData.synchronized {
      cpFile = Some(path.toString)
      cpRDD = Some(newRDD)
      rdd.markCheckpointed(newRDD)   // Update the RDD's dependencies and partitions
      cpState = Checkpointed
    }
    logInfo("Done checkpointing RDD " + rdd.id + " to " + path + ", new parent is RDD " + newRDD.id)
  }
The inline comments make the code largely self-explanatory. The method:
(1) moves cpState from MarkedForCheckpoint to CheckpointingInProgress;
(2) creates the checkpoint output directory, checkpointDir/rdd-<id>;
(3) launches a Job that writes each partition of the RDD to a file via CheckpointRDD.writeToFile (see the caching note after this list);
(4) creates a CheckpointRDD for reading the checkpointed files back;
(5) updates the RDDCheckpointData state and calls markCheckpointed to replace the dependencies and partitions of the original RDD (the one that initiated the checkpoint).
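Note that rdd.context.runJob above launches a second, independent job over the same RDD, so the RDD's whole lineage is computed again unless its result is cached. A common pattern (a sketch, not part of the quoted source) is therefore to persist before checkpointing:

  val data = sc.parallelize(1 to 1000000).map(_ * 2)
  data.persist()      // cache, so the checkpoint job reads cached blocks instead of recomputing
  data.checkpoint()   // mark for checkpointing before any action runs
  data.count()        // first job computes and caches; the checkpoint job then writes from the cache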

2.3. RDD.markCheckpointed

  private[spark] def markCheckpointed(checkpointRDD: RDD[_]) {
    clearDependencies()
    partitions_ = null
    deps = null    // Forget the constructor argument for dependencies too
  }
  protected def clearDependencies() {
    dependencies_ = null
  }
Clears the RDD's dependencies and partitions. From this point on, dependencies and partitions are resolved through the CheckpointRDD, as shown in section 3.4 below.

3. Example

3.1. Creating the RDDs

scala> val rdd = sc.parallelize(List(1, 2, 3, 5, 6, 7, 8, 9, 0), 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val mappedRDD = rdd.map(_ * 2)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23
scala> val filterRDD = mappedRDD.filter(_ > 10)
filterRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:25

(1) parallelize creates a ParallelCollectionRDD;
(2) calling map on rdd returns a MapPartitionsRDD (Spark 1.3);
(3) calling filter on mappedRDD also returns a MapPartitionsRDD.

3.2. Setting the checkpoint

scala> sc.setCheckpointDir("hdfs://CentOS-01:8020/tmp")
scala> filterRDD.checkpoint
scala> filterRDD.toDebugString
res3: String = 
(2) MapPartitionsRDD[2] at filter at <console>:25 []
 |  MapPartitionsRDD[1] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
RDD.toDebugString shows the dependency chain between the RDDs.

3.3. Executing the checkpoint

The checkpoint is only executed once an action triggers a job:
scala> filterRDD.count
......
15/05/25 15:29:55 INFO RDDCheckpointData: Done checkpointing RDD 2 to hdfs://CentOS-01:8020/tmp/9e7ecd6d-c327-4938-8fd2-43f57006bb09/rdd-2, new parent is RDD 3
res4: Long = 4
scala> 
scala> 
scala> filterRDD.toDebugString
res5: String = 
(2) MapPartitionsRDD[2] at filter at <console>:25 []
 |  CheckpointRDD[3] at count at <console>:28 []
The log line from RDD.count shows that the checkpoint data has been written to HDFS, and toDebugString now shows that filterRDD depends on the newly created CheckpointRDD. Since it is filterRDD that was checkpointed, what was written to file is filterRDD's output:
scala> val p = filterRDD.dependencies(0).rdd
p: org.apache.spark.rdd.RDD[_] = CheckpointRDD[3] at count at <console>:28
scala> p.collect
......
res7: Array[_] = Array(12, 14, 16, 18)
Since filterRDD itself is now checkpointed, later actions on it read the saved partitions directly from the CheckpointRDD (RDD.computeOrReadCheckpoint short-circuits to the first parent when isCheckpointed is true), so the map and filter steps are not recomputed.
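The RDD's public accessors confirm the new state (an illustrative continuation of the session; the exact path differs per run):
scala> filterRDD.isCheckpointed
res8: Boolean = true
scala> filterRDD.getCheckpointFile
res9: Option[String] = Some(hdfs://CentOS-01:8020/tmp/9e7ecd6d-c327-4938-8fd2-43f57006bb09/rdd-2)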

3.4. RDD.dependencies and RDD.partitions

  private def checkpointRDD: Option[RDD[T]] = checkpointData.flatMap(_.checkpointRDD)
  final def dependencies: Seq[Dependency[_]] = {
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      if (dependencies_ == null) {
        dependencies_ = getDependencies
      }
      dependencies_
    }
  }
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
      }
      partitions_
    }
  }
Both dependencies and partitions first consult the checkpointRDD helper to see whether the RDD has been checkpointed. If it has, the CheckpointRDD held by the RDDCheckpointData object is used: dependencies becomes a single OneToOneDependency on it, and partitions are taken from it; only otherwise are getDependencies and getPartitions evaluated (and cached).
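This is why, after the checkpoint in section 3.3, filterRDD reports a single OneToOneDependency on the CheckpointRDD and takes its two partitions from it (an illustrative session; the object hash will differ):
scala> filterRDD.dependencies
res10: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@1b2c3d4e)
scala> filterRDD.partitions.size
res11: Int = 2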