Diving into Spark (1): rdd.checkpoint

The RDD is undoubtedly one of the core concepts of the Spark framework. What is an RDD? The concept is abstract, so instead of defining it, let's look at what an RDD can do. This post covers one of the RDD's fault-tolerance mechanisms, checkpoint: writing an RDD to disk as a checkpoint.

Skimming the paper, operations on an RDD fall into two kinds: transformations and actions.

(1) Transformation => produces a new RDD from one or more RDDs

filter, map, sample, flatMap, join, reduceByKey, etc.

(2) Action => count, collect, reduce, save, etc.
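The lazy/eager split above can be illustrated without a cluster. As an analogy only (the object and method names here are made up for this sketch, not Spark API): in plain Scala, an `Iterator`'s `map` is lazy like a transformation, and `sum` forces evaluation like an action.

```scala
object LazyVsEager {
  // Returns (evaluations before the "action", evaluations after, result).
  def demo(): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformation": builds a description of the computation; nothing runs yet.
    val mapped = (1 to 5).iterator.map { x => evaluated += 1; x * 2 }
    val before = evaluated // still 0 -- the map closure has not executed
    // "Action": forces the whole pipeline and returns a value to the caller.
    val total = mapped.sum
    (before, evaluated, total)
  }
}
```

Running `LazyVsEager.demo()` shows that no element is touched until the terminal operation fires, which is exactly the contract `count` relies on below.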

Taking rdd.count as an example, the call chain is as follows:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

We can see that count calls SparkContext's runJob method, shown below:

def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (dagScheduler == null) {
    throw new SparkException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite)
  val start = System.nanoTime
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal, resultHandler, localProperties.get)
  // dagScheduler is Spark's job scheduler: stage splitting, interaction with the
  // TaskScheduler, and more -- rich enough to deserve at least one more post.
  logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
  rdd.doCheckpoint() // calls the doCheckpoint() method
}

The doCheckpoint() method looks like this:

private[spark] def doCheckpoint() {
  if (!doCheckpointCalled) { // ensure each RDD performs checkpoint at most once
    doCheckpointCalled = true
    if (checkpointData.isDefined) { // checkpointData is initialized in checkpoint(), i.e. only if checkpoint() was called on this RDD
      checkpointData.get.doCheckpoint() // RDDCheckpointData.doCheckpoint()
    } else {
      dependencies.foreach(_.rdd.doCheckpoint()) // otherwise recursively doCheckpoint() the dependencies
    }
  }
}
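The recursion above can be sketched with a toy lineage graph (the `FakeRDD` class is hypothetical, not Spark API): a node marked for checkpointing is "written"; an unmarked node delegates to its dependencies, and the `doCheckpointCalled` flag keeps any node from being processed twice, even in a diamond-shaped lineage.

```scala
import scala.collection.mutable

// Hypothetical sketch of doCheckpoint's traversal over the dependency graph.
final class FakeRDD(val id: Int, deps: Seq[FakeRDD], marked: Boolean) {
  private var doCheckpointCalled = false
  def doCheckpoint(written: mutable.Buffer[Int]): Unit = {
    if (!doCheckpointCalled) {          // each node visited at most once
      doCheckpointCalled = true
      if (marked) written += id         // stands in for RDDCheckpointData.doCheckpoint()
      else deps.foreach(_.doCheckpoint(written)) // recurse into dependencies
    }
  }
}

object FakeRDD {
  def run(): Seq[Int] = {
    val a = new FakeRDD(1, Nil, marked = true)
    val b = new FakeRDD(2, Seq(a), marked = false)
    val c = new FakeRDD(3, Seq(a, b), marked = false) // diamond: a is reachable twice
    val written = mutable.Buffer[Int]()
    c.doCheckpoint(written)
    written.toSeq
  }
}
```

Even though `a` is reachable via two paths, it is checkpointed exactly once.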
/** This function must be called before any job has been
 *  executed on this RDD.
 */
def checkpoint() {
  if (context.checkpointDir.isEmpty) {
    throw new Exception("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new RDDCheckpointData(this)) // initializes checkpointData
    checkpointData.get.markForCheckpoint() // mark this RDD for checkpointing
  }
}
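The two guards in checkpoint() can be modeled in isolation (a minimal pure-Scala sketch with made-up names, not Spark code): fail fast when no checkpoint directory has been set, and initialize checkpointData at most once so repeated calls are harmless.

```scala
// Hypothetical sketch of checkpoint()'s guard logic.
final class CheckpointGuard(checkpointDir: Option[String]) {
  var checkpointData: Option[String] = None
  def checkpoint(): Unit = {
    if (checkpointDir.isEmpty) {
      // mirrors the SparkContext precondition in the snippet above
      throw new IllegalStateException("Checkpoint directory has not been set")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some("marked") // stands for new RDDCheckpointData + markForCheckpoint()
    }
  }
}
```

Calling checkpoint() twice leaves checkpointData unchanged; calling it without a directory throws before anything is marked.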

So to see what checkpointing actually does, we have to follow the call into RDDCheckpointData.doCheckpoint():

// Do the checkpointing of the RDD. Called after the first job using that RDD is over.
def doCheckpoint() {
  // If it is marked for checkpointing AND checkpointing is not already in progress,
  // then set it to be in progress, else return
  RDDCheckpointData.synchronized {
    if (cpState == MarkedForCheckpoint) { // marked earlier by rdd.checkpoint()
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }

  // Create the output path for the checkpoint
  val path = new Path(rdd.context.checkpointDir.get, "rdd-" + rdd.id) // on-disk path: <checkpointDir>/rdd-<id>
  val fs = path.getFileSystem(rdd.context.hadoopConfiguration)
  if (!fs.mkdirs(path)) {
    throw new SparkException("Failed to create checkpoint path " + path)
  }

  // Save to file, and reload it as an RDD
  val broadcastedConf = rdd.context.broadcast(
    new SerializableWritable(rdd.context.hadoopConfiguration))
  rdd.context.runJob(rdd, CheckpointRDD.writeToFile[T](path.toString, broadcastedConf) _) // write the RDD to the checkpoint path
  val newRDD = new CheckpointRDD[T](rdd.context, path.toString) // later uses of this RDD read it back from that path
  if (newRDD.partitions.size != rdd.partitions.size) {
    throw new SparkException(
      "Checkpoint RDD " + newRDD + "(" + newRDD.partitions.size + ") has different " +
        "number of partitions than original RDD " + rdd + "(" + rdd.partitions.size + ")")
  }

  // Change the dependencies and partitions of the RDD
  RDDCheckpointData.synchronized {
    cpFile = Some(path.toString)
    cpRDD = Some(newRDD)
    rdd.markCheckpointed(newRDD)   // Update the RDD's dependencies and partitions
    cpState = Checkpointed
    RDDCheckpointData.clearTaskCaches()
  }
  logInfo("Done checkpointing RDD " + rdd.id + " to " + path + ", new parent is RDD " + newRDD.id)
}
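The cpState handshake at the top and bottom of doCheckpoint() is what makes the disk write happen exactly once. A minimal sketch of that state machine (hypothetical class, not Spark code; the unsynchronized write stands in for the runJob that writes partitions to disk):

```scala
// States mirror those referenced in RDDCheckpointData.doCheckpoint().
object CpState extends Enumeration {
  val Initialized, MarkedForCheckpoint, CheckpointingInProgress, Checkpointed = Value
}

final class CheckpointStateMachine {
  private var cpState = CpState.Initialized
  private var writes = 0

  def markForCheckpoint(): Unit = synchronized {
    if (cpState == CpState.Initialized) cpState = CpState.MarkedForCheckpoint
  }

  def doCheckpoint(): Unit = {
    // Only a node marked for checkpointing may enter the in-progress state;
    // any other state makes this call a no-op.
    val proceed = synchronized {
      if (cpState == CpState.MarkedForCheckpoint) {
        cpState = CpState.CheckpointingInProgress; true
      } else false
    }
    if (proceed) {
      writes += 1 // stands for the runJob that writes the RDD to disk
      synchronized { cpState = CpState.Checkpointed }
    }
  }

  def state: CpState.Value = synchronized { cpState }
  def writeCount: Int = writes
}
```

Calling doCheckpoint() before marking does nothing, and calling it repeatedly after marking still writes only once.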

To summarize, the call chain is: rdd.checkpoint() -> rdd.count -> SparkContext.runJob -> rdd.doCheckpoint -> RDDCheckpointData.doCheckpoint
