This article explains how Spark Streaming checkpointing works, using the Kafka direct approach (DirectKafkaInputDStream) as the example.
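For context, a typical setup that exercises this code path looks roughly like the following. This is a minimal sketch against the Spark 1.x Kafka 0.8 direct API; the broker address, topic, and checkpoint directory are illustrative placeholders, not taken from this article:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectKafkaCheckpointDemo")
val ssc = new StreamingContext(conf, Seconds(10))
// Setting a checkpoint directory is what makes JobGenerator.doCheckpoint actually write anything.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // hypothetical directory

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("demo-topic"))  // hypothetical topic

stream.foreachRDD(rdd => println("batch size: " + rdd.count()))
ssc.start()
ssc.awaitTermination()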
JobGenerator.generateJobs is responsible for generating Streaming jobs. After a batch's jobs are generated and submitted for execution, it posts a DoCheckpoint event. Source:
private def generateJobs(time: Time) {
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time)
    graph.generateJobs(time)
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      val streamIdToNumRecords = streamIdToInputInfos.mapValues(_.numRecords)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToNumRecords))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
As the code above shows, every batch of Streaming jobs that is generated also triggers a checkpoint.
When JobGenerator.processEvent receives the DoCheckpoint event, it calls JobGenerator.doCheckpoint to carry out the checkpoint.
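For reference, the event dispatch in JobGenerator.processEvent looks roughly like this (paraphrased from the Spark 1.x source; the exact set of events may vary between versions):

private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}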
JobGenerator.doCheckpoint calls DStreamGraph.updateCheckpointData to checkpoint the output DStreams, and then hands a Checkpoint object to CheckpointWriter, which writes it to the checkpoint directory. Source:
private def doCheckpoint(time: Time, clearCheckpointDataLater: Boolean) {
  if (shouldCheckpoint && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
    logInfo("Checkpointing graph for time " + time)
    ssc.graph.updateCheckpointData(time)
    checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
  }
}
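Note that although a DoCheckpoint event is posted for every batch, the graph is written out only at multiples of ssc.checkpointDuration. As for where the data lands, to my understanding CheckpointWriter serializes the Checkpoint object into time-stamped files in the checkpoint directory, along these lines (paraphrased from the Checkpoint companion object in Spark 1.x; names from memory):

import org.apache.hadoop.fs.Path

val PREFIX = "checkpoint-"

// e.g. hdfs:///tmp/streaming-checkpoint/checkpoint-1446865200000
def checkpointFile(checkpointDir: String, checkpointTime: Time): Path =
  new Path(checkpointDir, PREFIX + checkpointTime.milliseconds)

// backup copy kept for the previous write of the same file
def checkpointBackupFile(checkpointDir: String, checkpointTime: Time): Path =
  new Path(checkpointDir, PREFIX + checkpointTime.milliseconds + ".bk")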
Now let's look at how the DStreams themselves are checkpointed:
def updateCheckpointData(time: Time) {
  logInfo("Updating checkpoint data for time " + time)
  this.synchronized {
    outputStreams.foreach(_.updateCheckpointData(time))
  }
  logInfo("Updated checkpoint data for time " + time)
}
So DStreamGraph.updateCheckpointData simply asks each output DStream to convert its state into the corresponding checkpoint data.
That conversion happens in DStream.updateCheckpointData, which updates the DStream's own checkpoint data and then does the same for every DStream it depends on. Source:
private[streaming] def updateCheckpointData(currentTime: Time) {
  logDebug("Updating checkpoint data for time " + currentTime)
  checkpointData.update(currentTime)
  dependencies.foreach(_.updateCheckpointData(currentTime))
  logDebug("Updated checkpoint data for time " + currentTime + ": " + checkpointData)
}
For the Kafka direct approach, the checkpoint data is updated by DirectKafkaInputDStreamCheckpointData as follows:
def batchForTime: mutable.HashMap[Time, Array[(String, Int, Long, Long)]] = {
  data.asInstanceOf[mutable.HashMap[Time, Array[OffsetRange.OffsetRangeTuple]]]
}

override def update(time: Time) {
  batchForTime.clear()
  generatedRDDs.foreach { kv =>
    val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
    batchForTime += kv._1 -> a
  }
}
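Each tuple stored per batch is (topic, partition, fromOffset, untilOffset), i.e. only Kafka offsets, never the messages themselves. An application can observe the same offset ranges through the public HasOffsetRanges interface; a minimal sketch, assuming stream is the direct stream from the first example:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}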
Spark Streaming keeps every job set that has not yet completed successfully in JobScheduler.jobSets; a job set is removed only after all of its jobs have finished. Source:
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
private def handleJobCompletion(job: Job) {
  job.result match {
    case Success(_) =>
      val jobSet = jobSets.get(job.time)
      jobSet.handleJobCompletion(job)
      logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
      if (jobSet.hasCompleted) {
        jobSets.remove(jobSet.time)
        jobGenerator.onBatchCompletion(jobSet.time)
        logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
          jobSet.totalDelay / 1000.0, jobSet.time.toString, jobSet.processingDelay / 1000.0))
        listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
      }
    case Failure(e) =>
      reportError("Error running job " + job, e)
  }
}
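The bridge between jobSets and the checkpoint is JobScheduler.getPendingTimes, which to my understanding simply exposes the keys of jobSets (paraphrased from the Spark 1.x source):

def getPendingTimes(): Seq[Time] = {
  jobSets.asScala.keys.toSeq
}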
Checkpoint.graph corresponds to the application's DStreamGraph, whose outputStreams contain the DStream information to be checkpointed. Checkpoint.pendingTimes corresponds to the jobs that have not yet completed successfully. Both are therefore saved when the checkpoint is written to HDFS.
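For reference, the Checkpoint object captures roughly the following state; this is a paraphrase of the Spark 1.x class, and the exact field list varies between versions:

private[streaming]
class Checkpoint(ssc: StreamingContext, val checkpointTime: Time) extends Serializable {
  val master = ssc.sc.master
  val framework = ssc.sc.appName
  val jars = ssc.sc.jars
  val graph = ssc.graph                                       // the DStreamGraph discussed above
  val checkpointDir = ssc.checkpointDir
  val checkpointDuration = ssc.checkpointDuration
  val pendingTimes = ssc.scheduler.getPendingTimes().toArray  // batches not yet completed
  val sparkConfPairs = ssc.conf.getAll
}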
For the checkpoint produced by the previous run of a Spark Streaming application to be of any use, it must be passed in when the StreamingContext is created. The previous run's checkpoint can be read back with CheckpointReader.read.
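In practice this is usually done through StreamingContext.getOrCreate, which calls CheckpointReader.read under the hood and falls back to the creating function when no checkpoint exists yet. A minimal sketch, reusing the hypothetical checkpoint directory from the first example:

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectKafkaCheckpointDemo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
  // ... set up the direct Kafka stream and output operations here ...
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///tmp/streaming-checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()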
If the StreamingContext is created with the checkpoint from the previous run, the DStreamGraph contained in that Checkpoint becomes this run's DStreamGraph. It carries the checkpointed DStream information, from which the previous run's DStreams are restored. Source:
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}
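restoreCheckpointData eventually reaches each DStream's checkpointData.restore(). For the Kafka direct approach, DirectKafkaInputDStreamCheckpointData.restore rebuilds a KafkaRDD for every saved batch from the stored offset ranges, roughly as follows (paraphrased from the Spark 1.x source):

override def restore() {
  val topics = fromOffsets.keySet
  val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))

  batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
    logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
    generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
  }
}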
JobGenerator.start begins Streaming job generation. If checkpoint data is present it calls JobGenerator.restart, which first recovers the jobs that the previous run had generated but not completed successfully, then generates the batches that were missed between the crash and the restart, and submits all of them before resuming normal job generation. Source:
def start(): Unit = synchronized {
  if (eventLoop != null) return

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}
private def restart() {
  if (clock.isInstanceOf[ManualClock]) {
    val lastTime = ssc.initialCheckpoint.checkpointTime.milliseconds
    val jumpTime = ssc.sc.conf.getLong("spark.streaming.manualClock.jump", 0)
    clock.asInstanceOf[ManualClock].setTime(lastTime + jumpTime)
  }

  val batchDuration = ssc.graph.batchDuration

  val checkpointTime = ssc.initialCheckpoint.checkpointTime
  val restartTime = new Time(timer.getRestartTime(graph.zeroTime.milliseconds))
  val downTimes = checkpointTime.until(restartTime, batchDuration)
  logInfo("Batches during down time (" + downTimes.size + " batches): " + downTimes.mkString(", "))

  val pendingTimes = ssc.initialCheckpoint.pendingTimes.sorted(Time.ordering)
  logInfo("Batches pending processing (" + pendingTimes.size + " batches): " + pendingTimes.mkString(", "))

  val timesToReschedule = (pendingTimes ++ downTimes).distinct.sorted(Time.ordering)
  logInfo("Batches to reschedule (" + timesToReschedule.size + " batches): " + timesToReschedule.mkString(", "))
  timesToReschedule.foreach { time =>
    jobScheduler.receiverTracker.allocateBlocksToBatch(time)
    jobScheduler.submitJobSet(JobSet(time, graph.generateJobs(time)))
  }

  timer.start(restartTime.milliseconds)
  logInfo("Restarted JobGenerator at " + restartTime)
}
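The restart time comes from the recurring timer, which to my understanding picks the first batch boundary strictly after the current wall-clock time (paraphrased from RecurringTimer in the Spark 1.x source):

def getRestartTime(originalStartTime: Long): Long = {
  val gap = clock.getTimeMillis() - originalStartTime
  (math.floor(gap.toDouble / period).toLong + 1) * period + originalStartTime
}

downTimes therefore covers every batch time from the checkpoint time up to, but not including, restartTime, so together with pendingTimes no batch is lost across the restart.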