When `RDD#checkpoint` is called, the method that runs looks like this:
    /**
     * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
     * directory set with `SparkContext#setCheckpointDir` and all references to its parent
     * RDDs will be removed. This function must be called before any job has been
     * executed on this RDD. It is strongly recommended that this RDD is persisted in
     * memory, otherwise saving it on a file will require recomputation.
     */
    def checkpoint(): Unit = RDDCheckpointData.synchronized {
      // NOTE: we use a global lock here due to complexities downstream with ensuring
      // children RDD partitions point to the correct parent partitions. In the future
      // we should revisit this consideration.
      if (context.checkpointDir.isEmpty) {
        throw new SparkException("Checkpoint directory has not been set in the SparkContext")
      } else if (checkpointData.isEmpty) {
        // Finally, create a new ReliableRDDCheckpointData. The checkpoint logic itself
        // lives mainly in ReliableRDDCheckpointData#doCheckpoint.
        checkpointData = Some(new ReliableRDDCheckpointData(this))
      }
    }
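Judging from the comment, this call only marks the RDD for checkpointing. The files are saved under the directory set with `SparkContext#setCheckpointDir`, and all references to the RDD's parent RDDs are removed.
This function must be called before any job has been executed on this RDD. Persisting the RDD is strongly recommended; otherwise its data has to be recomputed when it is written to the checkpoint file.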
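The call order described in that comment can be sketched as follows. This is a minimal illustration rather than code from Spark itself: the local master, the application name, the sample data and the checkpoint directory are all assumptions chosen for the example, and it needs Spark on the classpath to run.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointUsageSketch {
  /** Returns (element count, whether the RDD ended up checkpointed). */
  def run(checkpointDir: String): (Long, Boolean) = {
    // Assumption: local mode, for illustration only.
    val conf = new SparkConf().setAppName("checkpoint-usage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 1. The checkpoint directory must be set first, or checkpoint() throws.
      sc.setCheckpointDir(checkpointDir)

      val squares = sc.parallelize(1 to 100, 4).map(x => x * x)

      // 2. Persist, so the checkpoint job can reuse the computed partitions.
      squares.persist(StorageLevel.MEMORY_ONLY)

      // 3. Mark for checkpointing before any job has run on this RDD.
      squares.checkpoint()

      // 4. The first action runs the job; the checkpoint is written right after it.
      val n = squares.count()
      (n, squares.isCheckpointed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a scratch directory on the local file system.
    val (n, done) = run("/tmp/spark-checkpoint-sketch")
    println(s"count=$n, checkpointed=$done")
  }
}
```

On a real cluster the checkpoint directory would normally point at a reliable file system such as HDFS, since the point of checkpointing is that the data survives the loss of executors.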
`ReliableRDDCheckpointData#doCheckpoint` is defined as follows:

    /**
     * Materialize this RDD and write its content to a reliable DFS.
     * This is called immediately after the first action invoked on this RDD has completed.
     */
    protected override def doCheckpoint(): CheckpointRDD[T] = {
      // Core code: write the RDD's contents into the checkpoint directory
      val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

      // Optionally clean our checkpoint files if the reference is out of scope
      if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
        rdd.context.cleaner.foreach { cleaner =>
          cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
        }
      }

      logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
      newRDD
    }
It delegates to `ReliableCheckpointRDD.writeRDDToCheckpointDirectory`:

    /**
     * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
     */
    def writeRDDToCheckpointDirectory[T: ClassTag](
        originalRDD: RDD[T],
        checkpointDir: String,
        blockSize: Int = -1): ReliableCheckpointRDD[T] = {

      val sc = originalRDD.sparkContext

      // Create the output path for the checkpoint
      val checkpointDirPath = new Path(checkpointDir)
      val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
      if (!fs.mkdirs(checkpointDirPath)) {
        throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
      }

      // Save to file, and reload it as an RDD
      val broadcastedConf = sc.broadcast(
        new SerializableConfiguration(sc.hadoopConfiguration))
      // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
      // Core code
      sc.runJob(originalRDD,
        writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

      if (originalRDD.partitioner.nonEmpty) {
        writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
      }

      val newRDD = new ReliableCheckpointRDD[T](
        sc, checkpointDirPath.toString, originalRDD.partitioner)
      if (newRDD.partitions.length != originalRDD.partitions.length) {
        throw new SparkException(
          s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
            s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
      }
      newRDD
    }

The `sc.runJob` call uses a small currying trick. Rewriting it slightly makes this easier to see:
    // Original:
    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

    // Rewritten, with the partially applied function bound to a name:
    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    val func: (TaskContext, Iterator[T]) => Unit =
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf)
    sc.runJob(originalRDD, func)
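The same trick can be tried outside Spark. The sketch below is a simplified model, not Spark code: `describePartition` and `runOnPartitions` are hypothetical stand-ins for `writePartitionToCheckpointFile` and `sc.runJob`. Supplying only the first parameter list of a curried method yields a function over the second list, which is exactly the shape the job runner expects.

```scala
object CurryingSketch {
  // Hypothetical stand-in for writePartitionToCheckpointFile: the first parameter
  // list is fixed up front, the second is supplied once per partition by the runner.
  def describePartition(prefix: String)(partitionId: Int, items: Iterator[String]): String =
    s"$prefix:$partitionId:${items.mkString(",")}"

  // Hypothetical stand-in for sc.runJob: calls func once for each "partition".
  def runOnPartitions[U](partitions: Seq[Seq[String]],
                         func: (Int, Iterator[String]) => U): Seq[U] =
    partitions.zipWithIndex.map { case (items, index) => func(index, items.iterator) }

  def demo(): Seq[String] = {
    // Only the first parameter list is applied; the expected type turns the
    // remaining list into a function value.
    val func: (Int, Iterator[String]) => String = describePartition("part")
    runOnPartitions(Seq(Seq("a", "b"), Seq("c")), func)
  }

  def main(args: Array[String]): Unit =
    demo().foreach(println)
}
```

Binding the function to a name with an explicit type, as the rewritten Spark snippet does, makes it obvious which arguments are fixed when the job is submitted and which arrive per task.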
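This submits a new job, which computes the RDD all over again. If the original RDD's results were cached, that second job can read them from the cache and skip most of the recomputation. This is why caching is strongly recommended when checkpointing.
The file-writing logic: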
    /**
     * Write a RDD partition's data to a checkpoint file.
     */
    def writePartitionToCheckpointFile[T: ClassTag](
        path: String,
        broadcastedConf: Broadcast[SerializableConfiguration],
        blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
      val env = SparkEnv.get
      val outputDir = new Path(path)
      val fs = outputDir.getFileSystem(broadcastedConf.value.value)

      val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
      val finalOutputPath = new Path(outputDir, finalOutputName)
      val tempOutputPath =
        new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

      if (fs.exists(tempOutputPath)) {
        throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
      }
      val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

      val fileOutputStream = if (blockSize < 0) {
        fs.create(tempOutputPath, false, bufferSize)
      } else {
        // This is mainly for testing purpose
        fs.create(tempOutputPath, false, bufferSize, fs.getDefaultReplication, blockSize)
      }
      val serializer = env.serializer.newInstance()
      val serializeStream = serializer.serializeStream(fileOutputStream)
      Utils.tryWithSafeFinally {
        serializeStream.writeAll(iterator)
      } {
        serializeStream.close()
      }

      if (!fs.rename(tempOutputPath, finalOutputPath)) {
        if (!fs.exists(finalOutputPath)) {
          logInfo(s"Deleting tempOutputPath $tempOutputPath")
          fs.delete(tempOutputPath, false)
          throw new IOException("Checkpoint failed: failed to save output of task: " +
            s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
        } else {
          // Some other copy of this task must've finished before us and renamed it
          logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
          if (!fs.delete(tempOutputPath, false)) {
            logWarning(s"Error deleting ${tempOutputPath}")
          }
        }
      }
    }
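The function above writes to a per-attempt temporary file and then renames it to the final name, so a file under the final name is always complete. The sketch below shows the same pattern on a local file system with `java.nio`. It is an illustration under assumptions, not the Spark implementation: `writePartition` is a hypothetical helper, it writes plain text instead of a serialized stream, and `Files.move` signals an existing target by throwing rather than by returning `false` as the Hadoop `rename` does.

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Path}

object TempThenRenameSketch {
  /** Hypothetical helper: writes one partition's lines and returns the final path. */
  def writePartition(dir: Path, partitionId: Int, attempt: Int, lines: Seq[String]): Path = {
    Files.createDirectories(dir)
    val finalName = "part-%05d".format(partitionId)
    val finalPath = dir.resolve(finalName)
    val tempPath = dir.resolve(s".$finalName-attempt-$attempt")

    // Same guard as the Spark code: an existing temp file means something is wrong.
    if (Files.exists(tempPath)) {
      throw new IOException(s"temporary path $tempPath already exists")
    }

    // Write everything to the temporary file first.
    Files.write(tempPath, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

    // Publish by renaming. Without REPLACE_EXISTING, an existing target is an error.
    try {
      Files.move(tempPath, finalPath)
    } catch {
      case _: FileAlreadyExistsException =>
        // Another attempt already published this partition: keep its file
        // and discard our temporary copy.
        Files.deleteIfExists(tempPath)
    }
    finalPath
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("checkpoint-sketch")
    println(writePartition(dir, 3, 0, Seq("a", "b")))
  }
}
```

The attempt number in the temporary name keeps retried or speculative copies of the same task from writing into each other's files.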
The core of it:
    val serializer = env.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeAll(iterator)
    } {
      serializeStream.close()
    }

    // Writes everything the iterator returns into the target directory.
    // The file is named by:
    ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
Here is its definition:
    /**
     * Return the checkpoint file name for the given partition.
     */
    private def checkpointFileName(partitionIndex: Int): String = {
      "part-%05d".format(partitionIndex)
    }
This file name is used again when the RDD is recovered. I will cover recovery in the next post.
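To see what the format string produces, here is a small standalone sketch. The object name and the sample indices are made up for the example; only the `"part-%05d"` format comes from the Spark source. It also shows a side effect of the zero padding: sorting the names as strings gives the same order as sorting the partition indices.

```scala
object CheckpointFileNameSketch {
  // Same format string as ReliableCheckpointRDD.checkpointFileName.
  def checkpointFileName(partitionIndex: Int): String =
    "part-%05d".format(partitionIndex)

  def demo(): Seq[String] =
    // Sorted as plain strings; the padding keeps them in partition order.
    Seq(10, 2, 0).map(checkpointFileName).sorted

  def main(args: Array[String]): Unit = {
    println(checkpointFileName(3)) // part-00003
    demo().foreach(println)
  }
}
```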
That covers the checkpoint flow itself. So how does a checkpoint get triggered?
The answer is in the `SparkContext#runJob` method:
    /**
     * Run a function on a given set of partitions in an RDD and pass the results to the given
     * handler function. This is the main entry point for all actions in Spark.
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        resultHandler: (Int, U) => Unit): Unit = {
      if (stopped.get()) {
        throw new IllegalStateException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite.shortForm)
      if (conf.getBoolean("spark.logLineage", false)) {
        logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
      }
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
      progressBar.foreach(_.finishAll())
      rdd.doCheckpoint()
    }
Look at the last three lines: the job we actually asked for is submitted first, and only after it finishes is `rdd.doCheckpoint()` called.
    /**
     * Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
     * has completed (therefore the RDD has been materialized and potentially stored in memory).
     * doCheckpoint() is called recursively on the parent RDDs.
     */
    private[spark] def doCheckpoint(): Unit = {
      RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
        if (!doCheckpointCalled) {
          doCheckpointCalled = true
          if (checkpointData.isDefined) {
            checkpointData.get.checkpoint()
          } else {
            dependencies.foreach(_.rdd.doCheckpoint())
          }
        }
      }
    }
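According to the comment, this function runs after a job that uses this RDD has completed, so the results may already be stored in memory. That is, once again, why it is best to cache the RDD. Important things are worth saying three times.
The method that finally gets called is `checkpointData.get.checkpoint()`.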
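The recursion in `RDD#doCheckpoint` can be modelled without Spark. The sketch below is a toy model, not Spark code: `Node` is a hypothetical miniature of an RDD, `marked` stands in for `checkpointData.isDefined`, and appending the name to a buffer stands in for `checkpointData.get.checkpoint()`. It shows two properties of the walk: it stops descending once it reaches an RDD marked for checkpointing, and the `doCheckpointCalled` flag ensures each RDD is handled at most once even when it is reachable along several paths.

```scala
import scala.collection.mutable.ArrayBuffer

object DoCheckpointWalkSketch {
  // Hypothetical miniature of an RDD: a name, its parents, and a checkpoint mark.
  final class Node(val name: String, val parents: Seq[Node], val marked: Boolean) {
    private var doCheckpointCalled = false

    def doCheckpoint(written: ArrayBuffer[String]): Unit = {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (marked) {
          // Stands in for checkpointData.get.checkpoint().
          written += name
        } else {
          // Not marked: pass the call on to the parents.
          parents.foreach(_.doCheckpoint(written))
        }
      }
    }
  }

  def demo(): Seq[String] = {
    val source = new Node("source", Nil, marked = false)
    val cleaned = new Node("cleaned", Seq(source), marked = true)
    val left = new Node("left", Seq(cleaned), marked = false)
    val right = new Node("right", Seq(cleaned), marked = false)
    val joined = new Node("joined", Seq(left, right), marked = false)

    val written = ArrayBuffer.empty[String]
    joined.doCheckpoint(written) // what runJob does on the final RDD
    written.toSeq
  }

  def main(args: Array[String]): Unit =
    println(demo().mkString(", "))
}
```

In this model `cleaned` is written exactly once although both `left` and `right` depend on it, and `source` is never visited because the walk stops at the marked node.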
That completes the walkthrough of how an RDD is checkpointed.