Contents of this installment:
1. JobGenerator source code
2. JobGenerator diagrams
Lesson 6 already analyzed the main flow by which JobGenerator generates jobs. This installment adds some detail on top of that. Lesson 6 gave the following main flow chart of the classes involved in job generation:
The following diagram also shows more of JobGenerator's workflow, for reference:
JobGenerator produces jobs from the DStreams, drives checkpointing, and cleans up DStream metadata.
It acts, in effect, as a converter.
JobGenerator is created by JobScheduler.start, where the JobGenerator object is both constructed and started.
JobScheduler.start:
class JobScheduler(val ssc: StreamingContext) extends Logging {
  ...
  private val jobGenerator = new JobGenerator(this)
  ...
  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start(ssc.sparkContext)
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }
  ...
}
JobGenerator:
class JobGenerator(jobScheduler: JobScheduler) extends Logging {

  private val ssc = jobScheduler.ssc
  private val conf = ssc.conf
  private val graph = ssc.graph

  // Clock. Which clock class to instantiate is configurable (spark.streaming.clock).
  val clock = {
    val clockClass = ssc.sc.conf.get(
      "spark.streaming.clock", "org.apache.spark.util.SystemClock")
    try {
      Utils.classForName(clockClass).newInstance().asInstanceOf[Clock]
    } catch {
      case e: ClassNotFoundException if clockClass.startsWith("org.apache.spark.streaming") =>
        val newClockClass = clockClass.replace("org.apache.spark.streaming", "org.apache.spark")
        Utils.classForName(newClockClass).newInstance().asInstanceOf[Clock]
    }
  }

  // Timer. The first argument is the clock, the second is the Spark Streaming batch interval,
  // the third is the callback to run on each tick, and the fourth is the timer's name.
  // The anonymous callback posts a GenerateJobs message; the Time it carries is the time at
  // which the tick fired. In other words, the timer posts a GenerateJobs message once per
  // configured batch interval.
  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

  // This is marked lazy so that this is initialized after checkpoint duration has been set
  // in the context and the generator has been started.
  private lazy val shouldCheckpoint = ssc.checkpointDuration != null && ssc.checkpointDir != null

  private lazy val checkpointWriter = if (shouldCheckpoint) {
    new CheckpointWriter(this, ssc.conf, ssc.checkpointDir, ssc.sparkContext.hadoopConfiguration)
  } else {
    null
  }

  // eventLoop is created when generator starts.
  // This not being null means the scheduler has been started and not stopped
  private var eventLoop: EventLoop[JobGeneratorEvent] = null

  // last batch whose completion, checkpointing and metadata cleanup has been completed
  private var lastProcessedBatch: Time = null
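As the comment on the clock field notes, the clock implementation is pluggable through the spark.streaming.clock setting. Below is a minimal sketch of how a test might switch to Spark's ManualClock so that batches are triggered deterministically; the master and app name are made up for illustration, and the class is loaded reflectively by the clock initializer above, so it need not be referenced at compile time.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical test setup: drive batch generation from a manually advanced clock
// instead of wall-clock time, as Spark Streaming's own tests do.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ManualClockDemo")
  .set("spark.streaming.clock", "org.apache.spark.util.ManualClock")
val ssc = new StreamingContext(conf, Seconds(1))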
RecurringTimer is important; it is also used by BlockGenerator. It is a general-purpose timer that runs a callback on a fixed period, and the callback typically sends a message that triggers some kind of processing.
RecurringTimer is started with its start method and stopped with its stop method. Both are called from outside the timer and return a Long-valued time.
private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  // Daemon thread that runs the loop method.
  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

  // Volatile time variables initialized to -1; stopped initialized to false.
  @volatile private var prevTime = -1L
  @volatile private var nextTime = -1L
  @volatile private var stopped = false
RecurringTimer.loop:
  /**
   * Repeatedly call the callback every interval.
   */
  private def loop() {
    try {
      while (!stopped) {
        triggerActionForNextInterval()
      }
      triggerActionForNextInterval()
    } catch {
      case e: InterruptedException =>
    }
  }
RecurringTimer.triggerActionForNextInterval:
  private def triggerActionForNextInterval(): Unit = {
    clock.waitTillTime(nextTime)
    callback(nextTime)
    prevTime = nextTime
    nextTime += period
    logDebug("Callback for " + name + " called at time " + prevTime)
  }
callback here is the callback function passed to RecurringTimer's constructor.
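To make the start/stop contract concrete, here is a minimal usage sketch. It is hypothetical and not taken from the Spark sources: RecurringTimer and SystemClock are private to Spark, so the demo object is placed inside Spark's own package, and the start()/stop(interruptTimer) signatures are as I recall them from the 1.x code, so treat the details as approximate.

package org.apache.spark.streaming.util

import org.apache.spark.util.SystemClock

object RecurringTimerDemo {
  def main(args: Array[String]): Unit = {
    val periodMs = 1000L
    // Tick once per second; the callback just prints the tick time.
    val timer = new RecurringTimer(new SystemClock(), periodMs,
      tickTime => println(s"tick at $tickTime"), "RecurringTimerDemo")
    val firstTick = timer.start()                      // aligned time of the first callback
    Thread.sleep(5 * periodMs)
    val lastTick = timer.stop(interruptTimer = false)  // time of the last completed callback
    println(s"first tick: $firstTick, last tick: $lastTick")
  }
}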
The following flow chart summarizes RecurringTimer:
That concludes the look at RecurringTimer; back to JobGenerator. Besides initializing the message loop eventLoop, JobGenerator.start also calls startFirstTime.
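JobGenerator.start itself is not quoted above. Roughly, paraphrased from memory of the 1.x sources (so treat the details as approximate), it creates the eventLoop and then either restores from a checkpoint or starts fresh:

// Approximate paraphrase, not a verbatim quote of the Spark sources.
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit =
      jobScheduler.reportError("Error in job generator", e)
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()        // recover and restart from the existing checkpoint
  } else {
    startFirstTime() // the path analyzed below
  }
}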
JobGenerator.startFirstTime:
  /** Starts the generator for the first time */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }
With startFirstTime, JobGenerator starts both the DStreamGraph graph and the RecurringTimer timer. Only then does the periodic work of job generation really begin.
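Note that startFirstTime takes its startTime from timer.getStartTime(), which rounds the current clock time up to the next multiple of the batch interval, so the very first GenerateJobs tick lands exactly on a batch boundary. A standalone sketch of that alignment (nextBatchBoundary is a hypothetical name introduced here; the real logic lives in RecurringTimer.getStartTime):

// Round `nowMs` up to the next multiple of `periodMs`.
// Example: nowMs = 10350, periodMs = 1000 => 11000.
def nextBatchBoundary(nowMs: Long, periodMs: Long): Long =
  (math.floor(nowMs.toDouble / periodMs) + 1).toLong * periodMs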
DStreamGraph.start:
def start(time: Time) {
  this.synchronized {
    require(zeroTime == null, "DStream graph computation already started")
    zeroTime = time
    startTime = time
    outputStreams.foreach(_.initialize(zeroTime))
    outputStreams.foreach(_.remember(rememberDuration))
    outputStreams.foreach(_.validateAtStart)
    inputStreams.par.foreach(_.start())
  }
}
(TODO: analyze this in more detail.)
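Once the graph and the timer are running, the timer posts a GenerateJobs message every batch interval, and eventLoop hands each message to JobGenerator.processEvent. A paraphrased sketch of the event types and the dispatch (approximate, not a verbatim quote of the sources):

// The private events JobGenerator's eventLoop handles (paraphrased).
private[scheduler] sealed trait JobGeneratorEvent
private[scheduler] case class GenerateJobs(time: Time) extends JobGeneratorEvent
private[scheduler] case class ClearMetadata(time: Time) extends JobGeneratorEvent
private[scheduler] case class DoCheckpoint(time: Time, clearCheckpointDataLater: Boolean)
  extends JobGeneratorEvent
private[scheduler] case class ClearCheckpointData(time: Time) extends JobGeneratorEvent

// Dispatch (paraphrased): GenerateJobs drives job generation; the others drive
// metadata cleanup and checkpointing.
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}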
JobGenerator.generateJobs:
  /** Generate jobs and perform checkpoint for the given `time`. */
  private def generateJobs(time: Time) {
    // Set the SparkEnv in this thread, so that job generation code can access the environment
    // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
    // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
    SparkEnv.set(ssc.env)
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }
After generating the jobs, generateJobs also posts a DoCheckpoint message. From processEvent we can see that the corresponding handler is JobGenerator.doCheckpoint.
JobGenerator.doCheckpoint:
  /** Perform checkpoint for the given `time`. */
  private def doCheckpoint(time: Time, clearCheckpointDataLater: Boolean) {
    if (shouldCheckpoint && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
      logInfo("Checkpointing graph for time " + time)
      ssc.graph.updateCheckpointData(time)
      checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
    }
  }
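shouldCheckpoint (see the fields shown earlier) is only true when both a checkpoint directory and a checkpoint duration are set, so enabling checkpointing in the application is what makes this branch run. A minimal, self-contained sketch of an application that enables it; the host, port, path and durations are made up for illustration, and counts.checkpoint(Seconds(10)) sets that DStream's own data-checkpoint interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointDemo")
val ssc = new StreamingContext(conf, Seconds(1))    // batchDuration = 1 second
ssc.checkpoint("/tmp/streaming-checkpoints")        // sets ssc.checkpointDir (hypothetical path)

val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.map(word => (word, 1L))
  .updateStateByKey((values: Seq[Long], state: Option[Long]) =>
    Some(values.sum + state.getOrElse(0L)))         // stateful op, requires checkpointing
counts.checkpoint(Seconds(10))                      // checkpoint this DStream's data every 10 batches
counts.print()

ssc.start()
ssc.awaitTermination()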
DStreamGraph.updateCheckpointData:
def updateCheckpointData(time: Time) {
  logInfo("Updating checkpoint data for time " + time)
  this.synchronized {
    outputStreams.foreach(_.updateCheckpointData(time))
  }
  logInfo("Updated checkpoint data for time " + time)
}
DStream.updateCheckpointData:
private[streaming] def updateCheckpointData(currentTime: Time) {
  logDebug("Updating checkpoint data for time " + currentTime)
  checkpointData.update(currentTime)
  dependencies.foreach(_.updateCheckpointData(currentTime))
  logDebug("Updated checkpoint data for time " + currentTime + ": " + checkpointData)
}
updateCheckpointData recurses through the DStream's dependencies.
checkpointData is a DStreamCheckpointData subclass defined in the InputDStream subclass. Different Spark Streaming applications may read from different sources, so the InputDStream subclass may differ as well. Below we take DirectKafkaInputDStreamCheckpointData, defined inside DirectKafkaInputDStream, as the example.
DirectKafkaInputDStream.DirectKafkaInputDStreamCheckpointData.update:
override def update(time: Time) {
  batchForTime.clear()
  generatedRDDs.foreach { kv =>
    val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
    batchForTime += kv._1 -> a
  }
}
batchForTime is cleared and then repopulated with the offset-range data of the generated RDDs.
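Concretely, each value in batchForTime is an array of (topic, partition, fromOffset, untilOffset) tuples, i.e. what OffsetRange.toTuple produces. An illustrative sketch with made-up topic, partition and offset values:

import org.apache.spark.streaming.Time

// Illustrative only: hypothetical topic, partitions and offsets; one entry per batch time.
val example: Map[Time, Array[(String, Int, Long, Long)]] = Map(
  Time(1000L) -> Array(("pageviews", 0, 0L, 500L), ("pageviews", 1, 0L, 480L)),
  Time(2000L) -> Array(("pageviews", 0, 500L, 980L), ("pageviews", 1, 480L, 950L))
)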
Core diagram of JobGenerator: