Author: Xie Biao
Part One: Source Code Analysis of JobScheduler
1. JobScheduler is the core of the entire Spark Streaming scheduling, playing the same role as DAGScheduler does in Spark Core.
2. Why does Spark Streaming require at least two threads?
The two threads specified via setMaster mean the program needs at least two threads at runtime: one thread is dedicated to the receiver, which loops continuously to receive data, while the remaining threads are used to process the jobs.
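As a minimal illustration of this (the class name, socket source, and port are assumptions made for the sketch, not taken from the original post), a driver program that needs at least two local threads could look like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreamingApp {
  def main(args: Array[String]): Unit = {
    // local[2]: at least one thread receives data in a loop, the rest process the jobs
    val conf = new SparkConf().setMaster("local[2]").setAppName("MinimalStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(5))

    // a receiver-based source such as a socket stream occupies one thread permanently
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()            // JobScheduler is started inside this call, see below
    ssc.awaitTermination()
  }
}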
3. JobScheduler is started when the start method of StreamingContext is called.
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            // the JobScheduler is started here, on the "streaming-start" thread
            scheduler.start()
          }
          ...
4. JobScheduler is responsible for Jobs at the logical level:
private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {
5. The source code of JobScheduler's start method is as follows:
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
The source code of processEvent, which the event loop dispatches to, is as follows:
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}
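The EventLoop created in start() is, in essence, a daemon thread draining an event queue and forwarding every event to onReceive, i.e. to processEvent above. A simplified, self-contained sketch of that pattern (an illustration only, not the actual org.apache.spark.util.EventLoop) could look like this:

import java.util.concurrent.LinkedBlockingQueue

abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  // a single daemon thread takes events off the queue and dispatches them
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          val event = queue.take()
          try onReceive(event) catch { case e: Throwable => onError(e) }
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the blocking take(), just exit
      }
    }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; thread.interrupt() }

  protected def onReceive(event: E): Unit   // plays the role of processEvent
  protected def onError(e: Throwable): Unit
}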
6. What information does JobScheduler hold when it is initialized?
Why set a degree of parallelism?
1) If there are multiple output operations within one batch duration, raising the parallelism can greatly improve performance.
2) Jobs from different batches can also run concurrently on the threads of the pool; turning logical-level Jobs into physical-level jobs is done by the thread pool created with newDaemonFixedThreadPool. A configuration sketch follows the field listing below.
private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]

// the degree of parallelism can be set manually; numConcurrentJobs defaults to 1
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor =
  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

// initialize the JobGenerator
private val jobGenerator = new JobGenerator(this)
val clock = jobGenerator.clock
val listenerBus = new StreamingListenerBus()

// eventLoop not being null means the scheduler has been started and not stopped
var receiverTracker: ReceiverTracker = null
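With the default of 1, all output jobs run one after another on a single pool thread. spark.streaming.concurrentJobs can be raised through SparkConf before the StreamingContext is created; a minimal sketch (the value 4 and the app name are arbitrary choices for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// allow up to 4 jobs (e.g. from several output operations, or from different batches)
// to run at the same time in the streaming-job-executor pool
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("ConcurrentJobsDemo")
  .set("spark.streaming.concurrentJobs", "4")
val ssc = new StreamingContext(conf, Seconds(5))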
7. Source code of the print function
1. The source code of print in DStream is as follows:
def print(): Unit = ssc.withScope {
  print(10) // delegates to print(10), which performs the actual output operation
}
2. The actual call still operates on the underlying RDD.
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
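In a user program both overloads are just output operations on a DStream; note how take(num + 1) above fetches one extra element only to decide whether to append the trailing "...". A small usage sketch (wordCounts is a hypothetical DStream built from an earlier transformation):

// wordCounts: DStream[(String, Int)] built from some input stream
wordCounts.print()    // prints the first 10 elements of each batch's RDD
wordCounts.print(20)  // prints the first 20; "..." is shown only if there are more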
3. foreachFunc encapsulates the logical operation on the RDD.
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
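The public foreachRDD overloads end up in the same place; a typical hand-written output operation (the sink logic here is a made-up example) looks like this:

// any RDD action may be applied per batch; here we just count the records
wordCounts.foreachRDD { (rdd, time) =>
  val total = rdd.count()
  println(s"Batch at $time contains $total records")
}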
4. For every batch duration, a job is generated via generateJob.
private[streaming]
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[Unit]] = None

  // a Job is generated via generateJob for every batch duration
  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          // foreachFunc is wrapped together with the rdd and the time into jobFunc,
          // which is invoked later when job.run is called.
          // The RDD here is the one generated for this batch time, determined by
          // the last DStream in the DStreamGraph.
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}
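The Job returned here is little more than a named wrapper around the () => Unit closure built above; conceptually (a simplified stand-in, not the real org.apache.spark.streaming.scheduler.Job) it behaves like this:

import scala.util.Try

// remembers the batch time and defers the wrapped closure until run() is called
class SimpleJob(val time: Long, body: () => Unit) {
  private var result: Option[Try[Unit]] = None
  def run(): Unit = { result = Some(Try(body())) }  // jobFunc, i.e. foreachFunc(rdd, time), executes here
  def isCompleted: Boolean = result.isDefined
}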
5. Where does this foreachFunc come from?
private[streaming]
// it is passed in as a constructor parameter, so we need to find where ForEachDStream is created
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {
6. From this we can see that the real Job is generated by ForEachDStream through generateJob. At that point it is still a logical-level Job; the actual physical-level invocation happens when generateJobs is called by the JobGenerator.
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    // each outputStream here is a ForEachDStream
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
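Downstream of this, the JobScheduler hands every logical Job of the batch to the jobExecutor thread pool, and only then does the closure actually run. The essence of that logical-to-physical hand-off can be sketched with a plain fixed thread pool (an illustration, not the actual JobScheduler code):

import java.util.concurrent.Executors

object HandOffSketch {
  def main(args: Array[String]): Unit = {
    // plays the role of the streaming-job-executor pool (numConcurrentJobs = 1)
    val jobExecutor = Executors.newFixedThreadPool(1)

    // "logical jobs": nothing has executed yet, these are just closures
    val logicalJobs: Seq[() => Unit] =
      Seq(() => println("output operation 1"), () => println("output operation 2"))

    // physical execution: a pool thread finally runs each closure
    logicalJobs.foreach { job =>
      jobExecutor.execute(new Runnable { def run(): Unit = job() })
    }
    jobExecutor.shutdown()
  }
}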
Author: Xie Biao, big data developer
Source: DT_大数据梦工厂 (Spark release customization)
DT 大数据梦工厂 WeChat official account: DT_Spark
Sina Weibo: http://www.weibo.com/ilovepains
Wang Jialin gives free hands-on big data sessions every night at 20:00, YY live broadcast room: 68917580