The RDD is without doubt one of the core concepts in the Spark framework. What exactly is an RDD? The definition is abstract, so it is more instructive to look at what an RDD does. This post covers checkpoint, one of the RDD's fault-tolerance mechanisms: writing an RDD out to disk as a recovery point.
A quick skim of the paper shows that operations on an RDD fall into two kinds: transformations and actions.
(1) Transformation => derives a new RDD from one or more existing RDDs:
filter, map, sample, flatMap, join, reduceByKey, etc.
(2) Action => count, collect, reduce, save, etc.
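Before diving into checkpointing, a minimal sketch of the laziness difference may help (assuming a live SparkContext named sc): transformations only record lineage, and nothing actually runs until an action fires.

// Assumes an existing SparkContext `sc`.
val nums = sc.parallelize(1 to 100)   // source RDD
val evens = nums.filter(_ % 2 == 0)   // transformation: lazy, only extends the lineage
val doubled = evens.map(_ * 2)        // transformation: still nothing has executed
val total = doubled.count()           // action: submits a job through sc.runJob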
Take rdd.count as an example; the call stack looks like this:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
So count delegates to SparkContext's runJob method:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (dagScheduler == null) {
    throw new SparkException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite)
  val start = System.nanoTime
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  // dagScheduler is Spark's job scheduler; it covers stage splitting, the hand-off
  // to TaskScheduler, and more -- rich enough to deserve at least one post of its own.
  logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
  rdd.doCheckpoint()  // the doCheckpoint() call we care about
}
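Since every action bottoms out in runJob, one can mimic count directly against that API. A sketch (assuming a live SparkContext sc; this simply re-derives what count() does):

// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 10, numSlices = 4)
// Run a function over each partition and collect the per-partition results,
// exactly the shape count() uses with Utils.getIteratorSize.
val sizes: Array[Long] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)
val total = sizes.sum  // equals rdd.count()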
doCheckpoint() is defined as follows:
private[spark] def doCheckpoint() {
  if (!doCheckpointCalled) {  // ensure each RDD runs checkpointing at most once
    doCheckpointCalled = true
    if (checkpointData.isDefined) {
      // checkpointData is initialized in checkpoint(), i.e. only if checkpoint() was called on this RDD
      checkpointData.get.doCheckpoint()  // RDDCheckpointData.doCheckpoint()
    } else {
      dependencies.foreach(_.rdd.doCheckpoint())  // otherwise recurse into the dependencies
    }
  }
}
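Note what the recursion in the else branch buys us: an action on a downstream RDD still materializes a checkpoint that was marked further up the lineage. A small sketch (assuming sc with a checkpoint directory already set):

val base = sc.parallelize(1 to 100)
val mapped = base.map(_ * 2)
mapped.checkpoint()                   // mapped now has checkpointData defined
val filtered = mapped.filter(_ > 50)  // filtered does not
filtered.count()                      // filtered.doCheckpoint() recurses into mapped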
checkpoint() itself is where checkpointData gets set up:

/** This function must be called before any job has been executed on this RDD. */
def checkpoint() {
  if (context.checkpointDir.isEmpty) {
    throw new Exception("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new RDDCheckpointData(this))  // initialize checkpointData
    checkpointData.get.markForCheckpoint()              // mark this RDD for checkpointing
  }
}
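Combining the two preconditions above (the directory must be set on the SparkContext, and checkpoint() must come before the first job), minimal usage might look like this (the directory is a placeholder path):

// Assumes an existing SparkContext `sc`; the directory is a placeholder.
sc.setCheckpointDir("/tmp/spark-checkpoints")  // otherwise checkpoint() throws

val rdd = sc.parallelize(1 to 1000).map(_ + 1)
rdd.checkpoint()  // only marks the RDD; no I/O happens yet
rdd.count()       // first job: runJob -> doCheckpoint() writes the data to disk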
So to see the real work, we have to follow doCheckpoint into RDDCheckpointData.doCheckpoint():
// Do the checkpointing of the RDD. Called after the first job using that RDD is over.
def doCheckpoint() {
  // If it is marked for checkpointing AND checkpointing is not already in progress,
  // then set it to be in progress, else return
  RDDCheckpointData.synchronized {
    if (cpState == MarkedForCheckpoint) {  // marked earlier by rdd.checkpoint()
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }

  // Create the output path for the checkpoint
  val path = new Path(rdd.context.checkpointDir.get, "rdd-" + rdd.id)  // on-disk path: rdd-<rdd.id>
  val fs = path.getFileSystem(rdd.context.hadoopConfiguration)
  if (!fs.mkdirs(path)) {
    throw new SparkException("Failed to create checkpoint path " + path)
  }

  // Save to file, and reload it as an RDD
  val broadcastedConf = rdd.context.broadcast(
    new SerializableWritable(rdd.context.hadoopConfiguration))
  rdd.context.runJob(rdd, CheckpointRDD.writeToFile[T](path.toString, broadcastedConf) _)
  // write the RDD out to the checkpoint path
  val newRDD = new CheckpointRDD[T](rdd.context, path.toString)
  // any later use of this RDD reads it back from that path
  if (newRDD.partitions.size != rdd.partitions.size) {
    throw new SparkException(
      "Checkpoint RDD " + newRDD + "(" + newRDD.partitions.size + ") has different " +
        "number of partitions than original RDD " + rdd + "(" + rdd.partitions.size + ")")
  }

  // Change the dependencies and partitions of the RDD
  RDDCheckpointData.synchronized {
    cpFile = Some(path.toString)
    cpRDD = Some(newRDD)
    rdd.markCheckpointed(newRDD)  // update the RDD's dependencies and partitions
    cpState = Checkpointed
    RDDCheckpointData.clearTaskCaches()
  }
  logInfo("Done checkpointing RDD " + rdd.id + " to " + path + ", new parent is RDD " + newRDD.id)
}
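After this method finishes, the state change is visible through the public accessors on RDD; a quick way to confirm it:

// Continuing the earlier usage sketch, after the first action has run:
println(rdd.isCheckpointed)     // true: cpState has reached Checkpointed
println(rdd.getCheckpointFile)  // Some(".../rdd-<id>"), the path built above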
To sum up, the call chain is: rdd.checkpoint() -> rdd.count -> SparkContext.runJob -> rdd.doCheckpoint -> RDDCheckpointData.doCheckpoint.