The official example
Let's start from the official quick-start example. (To try it, first run a socket data source on the target host, e.g. nc -lk 9999, and type some words into it.)
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object SparkStreamingTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aaa").setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    // Way 1: let the StreamingContext create a brand-new SparkContext from the conf
    // (not usable here, because the SparkSession above has already created a SparkContext)
    // val ssc = new StreamingContext(conf, Seconds(1))
    // Way 2: reuse the existing SparkContext
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.40.179", 9999)
    // words is a FlatMappedDStream (a subclass of DStream): it keeps a dependency on its parent
    // lines, inherits the parent's slideDuration, and overrides compute
    val words: DStream[String] = lines.flatMap(_.split(" "))
    // pairs is a MappedDStream: it keeps a dependency on its parent words, inherits slideDuration,
    // and overrides compute
    val pairs: DStream[(String, Int)] = words.map(word => (word, 1))
    // wordCounts is a ShuffledDStream: it keeps a dependency on its parent pairs, inherits
    // slideDuration, and overrides compute
    val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
StreamingContext
How the context is created
package org.apache.spark.streaming
class StreamingContext private[streaming] (
_sc: SparkContext,
_cp: Checkpoint,
_batchDur: Duration
) extends Logging {
// ... intermediate code omitted
/**
* Way 1: create a new SparkContext for this StreamingContext.
* @param conf a org.apache.spark.SparkConf object specifying Spark parameters
* @param batchDuration the time interval at which streaming data will be divided into batches
*/
def this(conf: SparkConf, batchDuration: Duration) = {
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
/**
* Way 2: provide an existing SparkContext to this StreamingContext.
* @param sparkContext existing SparkContext
* @param batchDuration the time interval at which streaming data will be divided into batches
*/
def this(sparkContext: SparkContext, batchDuration: Duration) = {
this(sparkContext, null, batchDuration)
}
From the constructor we can see that a StreamingContext is built around at least three main members:
- _sc: the SparkContext, Spark's driver-side context object; it can be supplied via way 1 or way 2 above.
- _cp: the Checkpoint, which holds intermediate state and results. Since this article focuses on how Spark Streaming starts up, checkpointing is not discussed in detail here; roughly, to keep the stream running reliably and in order, intermediate RDD results are periodically persisted to storage so the application can recover later.
- _batchDur: the batch interval, i.e. the time interval at which the incoming data is divided into batches.
To understand the batch interval, look at the graph member of StreamingContext together with the scheduling section (the timer) later in this article. An analogy: the StreamingContext is an exhibition hall whose job is to show paintings to visitors, and it is made of many parts: the lighting, the switches that control the light colors, the music, the opening hours, and so on. Let's first look at the hall's star exhibit, the graph; the timer we will meet later is like the switch on the lighting console, firing on a fixed schedule.
package org.apache.spark.streaming
class StreamingContext private[streaming]{
....
private[streaming] val graph: DStreamGraph = {
if (isCheckpointPresent) {
_cp.graph.setContext(this)
_cp.graph.restoreCheckpointData()
_cp.graph
} else {
require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
val newGraph = new DStreamGraph()
newGraph.setBatchDuration(_batchDur)
newGraph
}
}
....
}
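The isCheckpointPresent branch above is what makes checkpoint-based recovery possible. On the user side it is usually reached through StreamingContext.getOrCreate; a minimal sketch follows (the checkpoint directory, host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Rebuild the context from the checkpoint directory if one exists there,
// otherwise call the creating function to build a fresh context.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/streaming-checkpoint")        // placeholder checkpoint directory
  ssc.socketTextStream("localhost", 9999).print()    // placeholder pipeline
  ssc
}
val ssc = StreamingContext.getOrCreate("/tmp/streaming-checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()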
- state
A StreamingContext can be in one of three possible states:
/**
* :: DeveloperApi ::
*
* Return the current state of the context. The context can be in three possible states -
*
* - StreamingContextState.INITIALIZED - The context has been created, but not started yet.
* Input DStreams, transformations and output operations can be created on the context,
* but nothing is executed yet.
* - StreamingContextState.ACTIVE - The context has been started, and not stopped.
* Input DStreams, transformations and output operations cannot be created on the context;
* the operations that were already defined keep executing, batch after batch.
* - StreamingContextState.STOPPED - The context has been stopped and cannot be used any more.
*/
@DeveloperApi
def getState(): StreamingContextState = synchronized {
state
}
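As a quick illustration of these three states, here is a minimal sketch (host and port are placeholders, and local[2] is used so a receiver can run):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext, StreamingContextState}

val conf = new SparkConf().setAppName("state-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
assert(ssc.getState() == StreamingContextState.INITIALIZED) // DStreams may still be defined
ssc.socketTextStream("localhost", 9999).print()             // must be defined before start()
ssc.start()
assert(ssc.getState() == StreamingContextState.ACTIVE)      // no new DStreams may be added now
ssc.stop(stopSparkContext = true, stopGracefully = false)
assert(ssc.getState() == StreamingContextState.STOPPED)     // the context cannot be reused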
- start
start() synchronizes on the current state of the application. It also records the call site (used by the streaming web UI), runs a number of robustness checks, and sets a few properties; those details can be skipped here.
/**
* Start the execution of the streams.
*
* @throws IllegalStateException if the StreamingContext is already stopped.
*/
def start(): Unit = synchronized {
state match {
case INITIALIZED =>
startSite.set(DStream.getCreationSite())
StreamingContext.ACTIVATION_LOCK.synchronized {
StreamingContext.assertNoOtherContextIsActive()
try {
validate()
// Start the streaming scheduler in a new thread, so that thread local properties
// like call sites and job groups can be reset without affecting those of the
// current thread.
ThreadUtils.runInNewThread("streaming-start") {
sparkContext.setCallSite(startSite.get)
sparkContext.clearJobGroup()
sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
scheduler.start()
}
state = StreamingContextState.ACTIVE
scheduler.listenerBus.post(
StreamingListenerStreamingStarted(System.currentTimeMillis()))
} catch {
case NonFatal(e) =>
logError("Error starting the context, marking it as stopped", e)
scheduler.stop(false)
state = StreamingContextState.STOPPED
throw e
}
StreamingContext.setActiveContext(this)
}
logDebug("Adding shutdown hook") // force eager creation of logger
shutdownHookRef = ShutdownHookManager.addShutdownHook(
StreamingContext.SHUTDOWN_HOOK_PRIORITY)(() => stopOnShutdown())
// Registering Streaming Metrics at the start of the StreamingContext
assert(env.metricsSystem != null)
env.metricsSystem.registerSource(streamingSource)
uiTab.foreach(_.attach())
logInfo("StreamingContext started")
case ACTIVE =>
logWarning("StreamingContext has already been started")
case STOPPED =>
throw new IllegalStateException("StreamingContext has already been stopped")
}
}
The core of it is the call to scheduler.start(); as the reader can see, the call is wrapped in a new thread:
ThreadUtils.runInNewThread("streaming-start") {
sparkContext.setCallSite(startSite.get)
sparkContext.clearJobGroup()
sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
scheduler.start()
}
The StreamingContext spawns a child thread to start the scheduler, then posts a StreamingListenerStreamingStarted event to the scheduler's listenerBus (the scheduler carries an internal listener bus for publishing such events).
It also records itself as the currently active context, which guarantees that at most one StreamingContext is active per application:
StreamingContext.setActiveContext(this)
object StreamingContext extends Logging {
private val activeContext = new AtomicReference[StreamingContext](null)
private def setActiveContext(ssc: StreamingContext): Unit = {
ACTIVATION_LOCK.synchronized {
activeContext.set(ssc)
}
}
}
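A practical consequence of this single-active-context rule: instead of constructing a second context, user code can go through StreamingContext.getActiveOrCreate, which returns the active context if there is one and otherwise calls the creating function. A small sketch (names and the pipeline are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def creatingFunc(): StreamingContext = {
  val conf = new SparkConf().setAppName("single-active-ssc").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.socketTextStream("localhost", 9999).print()   // placeholder pipeline
  ssc
}
// returns the currently ACTIVE context if one exists, otherwise builds a new one
val ssc = StreamingContext.getActiveOrCreate(creatingFunc _)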
DStreamGraph
To see how batchDuration is used, we need to look at the type of the graph member of StreamingContext: DStreamGraph. So let's analyze the DStreamGraph source first.
The class is declared final, which means it cannot be extended; it is a small, self-contained class. Let's see what members it has and how it works.
package org.apache.spark.streaming
...
final private[streaming] class DStreamGraph extends Serializable with Logging {
private val inputStreams = new ArrayBuffer[InputDStream[_]]()
private val outputStreams = new ArrayBuffer[DStream[_]]()
@volatile private var inputStreamNameAndID: Seq[(String, Int)] = Nil
var rememberDuration: Duration = null
var checkpointInProgress = false
var zeroTime: Time = null
var startTime: Time = null
var batchDuration: Duration = null
}
DStream
We can see that the graph composes and maintains a mutable array of input streams and a mutable array of output streams (scala.collection.mutable.ArrayBuffer). Both hold streams of abstract element type: DStream is parameterized with a ClassTag, so at run time it can carry RDDs of any element type. Look at the code below: a DStream keeps a map keyed by Time whose values are RDDs with the same element type as the DStream itself, which confirms that Spark Streaming ultimately executes its work as RDD computations. You can picture the graph as a reservoir at the center of the painting: the reservoir is fed by a waterfall upstream (the input streams), and a pump delivers its water to the village below on a steady schedule (the output streams), a modern countryside scene set against a traditional landscape.
package org.apache.spark.streaming.dstream
abstract class DStream[T: ClassTag] {
....
/** Method that generates an RDD for the given time */
def compute(validTime: Time): Option[RDD[T]]
// =======================================================================
// Methods and fields available on all DStreams
// =======================================================================
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
....
}
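To make the parent/child pattern mentioned in the comments of the opening example concrete, here is a tiny standalone model of it (purely illustrative, not the real Spark classes): each derived stream keeps a reference to its parent, reuses the parent's slide duration, and implements compute in terms of the parent's getOrCompute.

import scala.collection.mutable

// a toy stand-in for the DStream dependency pattern (illustrative only)
abstract class ToyStream[T] {
  def slideDuration: Long
  def dependencies: List[ToyStream[_]]
  def compute(time: Long): Option[Seq[T]]

  private val generated = mutable.Map[Long, Seq[T]]()   // analogue of generatedRDDs
  def getOrCompute(time: Long): Option[Seq[T]] = generated.get(time) match {
    case Some(batch) => Some(batch)
    case None =>
      val batch = compute(time)
      batch.foreach(b => generated.update(time, b))
      batch
  }
  def map[U](f: T => U): ToyStream[U] = new MappedToyStream(this, f)
}

// analogue of MappedDStream: remembers its parent, inherits slideDuration, overrides compute
class MappedToyStream[T, U](parent: ToyStream[T], f: T => U) extends ToyStream[U] {
  def slideDuration: Long = parent.slideDuration
  def dependencies: List[ToyStream[_]] = List(parent)
  def compute(time: Long): Option[Seq[U]] = parent.getOrCompute(time).map(_.map(f))
}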
InputDStream also extends DStream, so besides everything a DStream can do, it defines a few extra methods to control itself.
abstract class InputDStream[T: ClassTag](_ssc: StreamingContext)
extends DStream[T](_ssc) {
...
}
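Note that not every InputDStream uses a receiver. The simplest receiver-free one is the queue-backed stream returned by ssc.queueStream, which is handy for local experiments; a sketch (assuming ssc is the context from the opening example):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// a receiver-free InputDStream: every batch interval it dequeues one RDD from the queue
val rddQueue = new mutable.Queue[RDD[String]]()
val testStream = ssc.queueStream(rddQueue)
testStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
rddQueue += ssc.sparkContext.makeRDD(Seq("hello spark streaming", "hello dstream"))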
The official Spark Streaming guide spends a lot of its opening pages on DStreams.
Here we look at DStream from the source-code side: what members it is made of, and how it relates to batchDuration.
abstract class DStream[T: ClassTag] (
@transient private[streaming] var ssc: StreamingContext
) extends Serializable with Logging {
Looking at DStream we find that it depends strongly on its ssc (from here on, ssc is shorthand for the StreamingContext instance). Put differently, the two pipes inside ssc's reservoir picture (the graph) are colored by ssc itself: if the hall's lighting is rich and colorful, a rainbow hangs over the waterfall (the input stream); if the lighting is dim, dark clouds gather above it...
Scheduling
Registering and starting the input streams
What exactly gets scheduled here; is ssc itself a task? From the DStream section we know each DStream has a generateJob method that is used to generate a job for a given batch time, and since DStreamGraph maintains a mutable outputStreams buffer of DStreams, it naturally also has a generateJobs (plural) method that triggers job generation for every output stream.
class StreamingContext private[streaming] (
_sc: SparkContext,
_cp: Checkpoint,
_batchDur: Duration
) extends Logging {
....
private[streaming] val scheduler = new JobScheduler(this)
...
}
So we may reasonably guess that it is the scheduler inside ssc that drives the graph's jobs. In fact the scheduler's start() method (the usual entry point) starts its internal job generator, JobGenerator(this):
JobScheduler.scala
package org.apache.spark.streaming.scheduler
...
/**
* This class schedules jobs to be run on Spark. It uses the JobGenerator to generate
* the jobs and runs them using a thread pool.
*/
private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {
...
private val jobGenerator = new JobGenerator(this)
...
def start(): Unit = synchronized {
...
receiverTracker.start()
jobGenerator.start()
...
}
}
A scheduler owns exactly one job generator, and that generator is what eventually does the work. Its start() first creates an event loop that handles every event it receives, and then branches on whether the current ssc has a checkpoint: restart() resumes from the saved checkpoint data, while startFirstTime() runs the job pipeline for the first time. One thing we can already pin down: the input and output DStreams held by ssc are set in motion from these two places. Let's check both methods.
JobGenerator.scala
package org.apache.spark.streaming.scheduler
...
class JobGenerator(jobScheduler: JobScheduler) extends Logging {
...
/** Start generation of jobs */
def start(): Unit = synchronized {
if (eventLoop != null) return // generator has already been started
// Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
// See SPARK-10125
checkpointWriter
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = {
jobScheduler.reportError("Error in job generator", e)
}
}
eventLoop.start()
if (ssc.isCheckpointPresent) {
restart()
} else {
startFirstTime()
}
}
...
}
- restart
This is the path taken when the application restarts from a checkpoint.
/** Restarts the generator based on the information in checkpoint */
private def restart() {
// If manual clock is being used for testing, then
// either set the manual clock to the last checkpointed time,
// or if the property is defined set it to that time
if (clock.isInstanceOf[ManualClock]) {
val lastTime = ssc.initialCheckpoint.checkpointTime.milliseconds
val jumpTime = ssc.sc.conf.getLong("spark.streaming.manualClock.jump", 0)
clock.asInstanceOf[ManualClock].setTime(lastTime + jumpTime)
}
val batchDuration = ssc.graph.batchDuration
// Batches when the master was down, that is,
// between the checkpoint and current restart time
val checkpointTime = ssc.initialCheckpoint.checkpointTime
val restartTime = new Time(timer.getRestartTime(graph.zeroTime.milliseconds))
val downTimes = checkpointTime.until(restartTime, batchDuration)
logInfo("Batches during down time (" + downTimes.size + " batches): "
+ downTimes.mkString(", "))
// Batches that were unprocessed before failure
val pendingTimes = ssc.initialCheckpoint.pendingTimes.sorted(Time.ordering)
logInfo("Batches pending processing (" + pendingTimes.length + " batches): " +
pendingTimes.mkString(", "))
// Reschedule jobs for these times
val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < restartTime }
.distinct.sorted(Time.ordering)
logInfo("Batches to reschedule (" + timesToReschedule.length + " batches): " +
timesToReschedule.mkString(", "))
timesToReschedule.foreach { time =>
// Allocate the related blocks when recovering from failure, because some blocks that were
// added but not allocated, are dangling in the queue after recovering, we have to allocate
// those blocks to the next batch, which is the batch they were supposed to go.
jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
jobScheduler.submitJobSet(JobSet(time, graph.generateJobs(time)))
}
// Restart the timer
timer.start(restartTime.milliseconds)
logInfo("Restarted JobGenerator at " + restartTime)
}
jobScheduler.submitJobSet(JobSet(time, graph.generateJobs(time))) calls the graph's generateJobs and wraps the result into a JobSet (the set of jobs generated for one batch; jobs are grouped per batch, one JobSet per batchDuration; look closely at the timesToReschedule value above), then hands it back to the scheduler, which decides whether anything actually runs based on whether the job set is empty. You might wonder: since batches keep being generated, do jobs keep executing forever? submitJobSet answers that: if the job set for a batch is empty, only a log line is written and nothing is executed.
def submitJobSet(jobSet: JobSet) {
if (jobSet.jobs.isEmpty) {
logInfo("No jobs added for time " + jobSet.time)
} else {
listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
jobSets.put(jobSet.time, jobSet)
jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
logInfo("Added jobs for time " + jobSet.time)
}
}
So when is a job set empty? Going back to DStream's generateJob: if getOrCompute(time) produces an RDD, a Some(Job) is returned; if there is no RDD for that time, None is returned. (We are still on the restart path here.)
private[streaming] def generateJob(time: Time): Option[Job] = {
getOrCompute(time) match {
case Some(rdd) =>
val jobFunc = () => {
val emptyFunc = { (iterator: Iterator[T]) => {} }
context.sparkContext.runJob(rdd, emptyFunc)
}
Some(new Job(time, jobFunc))
case None => None
}
}
- startFirstTime
On the very first start, the following runs:
/** Starts the generator for the first time */
private def startFirstTime() {
val startTime = new Time(timer.getStartTime())
graph.start(startTime - graph.batchDuration)
timer.start(startTime.milliseconds)
logInfo("Started JobGenerator at " + startTime)
}
It calls the graph's start method and starts the timer. Inside graph.start, the output streams are only initialized with the zero time and told how long to remember, while start() is called on every input stream; and that is about all. So where does the data actually start flowing?
final private[streaming] class DStreamGraph extends Serializable with Logging {
...
def start(time: Time) {
this.synchronized {
require(zeroTime == null, "DStream graph computation already started")
zeroTime = time
startTime = time
outputStreams.foreach(_.initialize(zeroTime))
outputStreams.foreach(_.remember(rememberDuration))
outputStreams.foreach(_.validateAtStart())
numReceivers = inputStreams.count(_.isInstanceOf[ReceiverInputDStream[_]])
inputStreamNameAndID = inputStreams.map(is => (is.name, is.id))
inputStreams.par.foreach(_.start()) // start the input streams
}
}
...
}
Looking closer, inputStreams.par.foreach calls start() on every InputDStream. But if you check the source of the socketTextStream input stream used in the official example, its start() implementation is empty, so starting the input DStreams in the graph is not really the JobGenerator's job. Which component does it, then? The ReceiverTracker.
/**
* This class manages the execution of the receivers of ReceiverInputDStreams. Instance of
* this class must be created after all input streams have been added and StreamingContext.start()
* has been called because it needs the final set of input streams at the time of instantiation.
*
* @param skipReceiverLaunch Do not launch the receiver. This is useful for testing.
*/
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
...
def start(): Unit = synchronized {
if (isTrackerStarted) {
throw new SparkException("ReceiverTracker already started")
}
if (!receiverInputStreams.isEmpty) {
endpoint = ssc.env.rpcEnv.setupEndpoint(
"ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
if (!skipReceiverLaunch) launchReceivers()
logInfo("ReceiverTracker started")
trackerState = Started
}
}
...
/**
* Get the receivers from the ReceiverInputDStreams, distributes them to the
* worker nodes as a parallel collection, and runs them.
*/
private def launchReceivers(): Unit = {
val receivers = receiverInputStreams.map { nis =>
val rcvr = nis.getReceiver()
rcvr.setReceiverId(nis.id)
rcvr
}
runDummySparkJob()
logInfo("Starting " + receivers.length + " receivers")
endpoint.send(StartAllReceivers(receivers))
}
...
/** RpcEndpoint to receive messages from the receivers. */
private class ReceiverTrackerEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {
private val walBatchingThreadPool = ExecutionContext.fromExecutorService(
ThreadUtils.newDaemonCachedThreadPool("wal-batching-thread-pool"))
@volatile private var active: Boolean = true
override def receive: PartialFunction[Any, Unit] = {
// Local messages
case StartAllReceivers(receivers) =>
val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
for (receiver <- receivers) {
val executors = scheduledLocations(receiver.streamId)
updateReceiverScheduledExecutors(receiver.streamId, executors)
receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
startReceiver(receiver, executors)
}
...
}
...
}
}
The source explains it nicely: ReceiverTracker pulls the receiver input streams out of the graph (ssc.graph.getReceiverInputStreams()). Its start() method sets up a Spark RPC endpoint (Netty based) called ReceiverTrackerEndpoint and launches the receivers through it: once a StartAllReceivers message is sent to that endpoint, its receive method fires and starts every receiver. Note that a receiver input stream is just one kind of InputDStream, and the skipReceiverLaunch parameter only exists so that tests can avoid launching real receivers. Following ReceiverTrackerEndpoint.startReceiver further, each receiver is started on an executor by a dedicated ReceiverSupervisor, so it is natural to expect that a receiver only needs to implement onStart for the stream to come alive. Let's verify:
private[streaming] abstract class ReceiverSupervisor(
receiver: Receiver[_],
conf: SparkConf
) extends Logging {
...
/** Start receiver */
def startReceiver(): Unit = synchronized {
try {
if (onReceiverStart()) {
logInfo(s"Starting receiver $streamId")
receiverState = Started
receiver.onStart() // as long as the receiver implements onStart, the receiving input stream comes to life here
logInfo(s"Called receiver $streamId onStart")
} else {
// The driver refused us
stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
}
} catch {
case NonFatal(t) =>
stop("Error starting receiver " + streamId, Some(t))
}
}
...
}
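This confirms the guess: a custom source only needs to extend Receiver and implement onStart/onStop, and the supervisor takes care of the rest. A minimal sketch of such a receiver (the data it produces is made up):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// a toy receiver: onStart spawns the receiving thread, store() hands records to Spark Streaming
class ConstantReceiver(value: String, intervalMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    val t = new Thread("constant-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(value)            // push one record per interval
          Thread.sleep(intervalMs)
        }
      }
    }
    t.setDaemon(true)
    t.start()
  }

  override def onStop(): Unit = { /* the loop above exits once isStopped() returns true */ }
}

// usage: ssc.receiverStream(new ConstantReceiver("ping", 500)).print()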
So far we have a rough picture of how the receiver-based input path gets started. What about the output streams: how are they registered, and what is their execution model?
Registering the output streams
wordCounts.print()
This single call from the opening example is exactly what registers an output stream:
/**
* Print the first ten elements of each RDD generated in this DStream. This is an output
* operator, so this DStream will be registered as an output stream and there materialized.
*/
def print(): Unit = ssc.withScope {
print(10)
}
/**
* Print the first num elements of each RDD generated in this DStream. This is an output
* operator, so this DStream will be registered as an output stream and there materialized.
*/
def print(num: Int): Unit = ssc.withScope {
def foreachFunc: (RDD[T], Time) => Unit = {
(rdd: RDD[T], time: Time) => {
val firstNum = rdd.take(num + 1)
// scalastyle:off println
println("-------------------------------------------")
println(s"Time: $time")
println("-------------------------------------------")
firstNum.take(num).foreach(println)
if (firstNum.length > num) println("...")
println()
// scalastyle:on println
}
}
foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
From the source we can see that print() is essentially a call to DStream's foreachRDD, shown below: it creates a ForEachDStream and at the same time registers that stream (register()) into the graph inside the current ssc, and with that the output stream is fully created.
private def foreachRDD(
foreachFunc: (RDD[T], Time) => Unit,
displayInnerRDDOps: Boolean): Unit = {
new ForEachDStream(this,
context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
DStream.scala
/**
* Register this streaming as an output stream. This would ensure that RDDs of this
* DStream will be generated.
*/
private[streaming] def register(): DStream[T] = {
ssc.graph.addOutputStream(this)
this
}
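Every other output operator goes through the same register() path. For instance, a user-defined sink written with the public foreachRDD also ends up as a ForEachDStream in the graph; a sketch (the "sink" here is just println, and wordCounts is the stream from the opening example):

// a user-defined output operation: it, too, becomes a ForEachDStream registered in the graph
wordCounts.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { records =>
    // in a real sink you would open one connection per partition; here we just print
    records.foreach(record => println(s"[$time] $record"))
  }
}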
Where does the data flow?
Still, the two pipes inside the graph do not seem to be flowing yet: where is the part that, at run time, keeps creating RDDs for the DStreams and discarding old ones? We have not seen it so far. What is going on?
In fact, both JobScheduler.start and JobGenerator.start create the same kind of built-in object, an eventLoop. Literally, an event loop; but what exactly is it? Compare the two:
The event loop in JobScheduler.start:
eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
}
eventLoop.start()
The event loop in JobGenerator.start:
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = {
jobScheduler.reportError("Error in job generator", e)
}
}
eventLoop.start()
As we can see, the scheduler creates a never-ending event loop for JobSchedulerEvent: job-scheduling events are posted into the loop's internal eventQueue and handled one by one by onReceive, which delegates to processEvent. In the same way, the job generator runs its own never-ending event loop for JobGeneratorEvent and handles those events through its onReceive.
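If EventLoop itself looks mysterious: it is essentially a daemon thread that drains a blocking queue and hands every element to onReceive. A stripped-down stand-in looks like this (illustrative only, not the real org.apache.spark.util.EventLoop):

import java.util.concurrent.LinkedBlockingDeque

// a simplified stand-in for org.apache.spark.util.EventLoop (illustrative only)
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped) {
        val event = eventQueue.take()   // block until something is posted
        try onReceive(event) catch { case t: Throwable => onError(t) }
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)  // producers (e.g. the timer) call this

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}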
At this point the answer should be clear: why does a fixed DStream graph keep processing our user-defined work on schedule? Because the JobGenerator owns a timer that, once per batchDuration, posts a GenerateJobs event to the eventLoop. Every such event triggers processEvent, which calls graph.generateJobs(time) to generate the jobs of all output streams; if the result is Success(jobs), the generator calls submitJobSet on its enclosing scheduler, which submits the jobs to the jobExecutor thread pool for execution.
private[streaming]
class JobGenerator(jobScheduler: JobScheduler) extends Logging {
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
...
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
logDebug("Got event " + event)
event match {
case GenerateJobs(time) => generateJobs(time)
case ClearMetadata(time) => clearMetadata(time)
case DoCheckpoint(time, clearCheckpointDataLater) =>
doCheckpoint(time, clearCheckpointDataLater)
case ClearCheckpointData(time) => clearCheckpointData(time)
}
}
private def generateJobs(time: Time) {
// Checkpoint all RDDs marked for checkpointing to ensure their lineages are
// truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
Try {
jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
graph.generateJobs(time) // generate jobs using allocated block
} match {
case Success(jobs) =>
val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
case Failure(e) =>
jobScheduler.reportError("Error generating jobs for time " + time, e)
PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
}
eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
JobScheduler.scala
def submitJobSet(jobSet: JobSet) {
if (jobSet.jobs.isEmpty) {
logInfo("No jobs added for time " + jobSet.time)
} else {
listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
jobSets.put(jobSet.time, jobSet)
jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
logInfo("Added jobs for time " + jobSet.time)
}
}
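Finally, the timer that drives the whole loop is also small. RecurringTimer is essentially a thread that sleeps until the next multiple of the period and then invokes its callback, which here posts GenerateJobs(new Time(...)). A simplified equivalent (illustrative only, not the real org.apache.spark.util.RecurringTimer):

// a simplified stand-in for org.apache.spark.util.RecurringTimer (illustrative only)
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit, name: String) {
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      // align the first tick to a multiple of the period, like the real timer does
      var nextTime = (System.currentTimeMillis() / periodMs + 1) * periodMs
      while (!stopped) {
        val sleepMs = nextTime - System.currentTimeMillis()
        if (sleepMs > 0) Thread.sleep(sleepMs)
        callback(nextTime)      // e.g. eventLoop.post(GenerateJobs(new Time(nextTime)))
        nextTime += periodMs
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
}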
If anything above is wrong, corrections are welcome. Thanks.