Lesson 7: Spark Streaming Source Code Walkthrough: JobScheduler Internals and In-Depth Analysis

Original post, 2016-05-30 22:58:37
JobScheduler is the core of Spark Streaming's scheduling, playing a role analogous to the DAGScheduler at the heart of Spark Core.

The start method of StreamingContext:
/**
 * Start the execution of the streams.
 *
 * @throws IllegalStateException if the StreamingContext is already stopped.
 */
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Calls ThreadUtils.runInNewThread
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") { // *1
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start() // *2
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
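For context, a typical driver program exercises this start path as follows. This is a minimal, hypothetical sketch: the app name, master, host, port, and batch interval are all illustrative, not taken from the post's example application.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StartDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads are needed, one for the receiver
    // and one for processing the received data.
    val conf = new SparkConf().setMaster("local[2]").setAppName("JobSchedulerDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // assumes a socket source is available
    lines.print() // registers a ForEachDStream as an output stream

    ssc.start()            // takes the INITIALIZED => ACTIVE branch shown above
    ssc.awaitTermination() // blocks until stop() or an error
  }
}
```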

Next, let's trace the call chain starting from concrete application code that invokes print:
val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }
// ... (intermediate transformations producing validClicked are elided)
validClicked.map(validClick => validClick._2._1).print()
The print method of DStream:
/**
 * Print the first ten elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */
def print(): Unit = ssc.withScope {
  print(10)
}

/**
 * Print the first num elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
/**
 * Apply a function to each RDD in this DStream. This is an output operator, so
 * 'this' DStream will be registered as an output stream and therefore materialized.
 * @param foreachFunc foreachRDD function
 * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
 *                           in the `foreachFunc` to be displayed in the UI. If `false`, then
 *                           only the scopes and callsites of `foreachRDD` will override those
 *                           of the RDDs on the display.
 */
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
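print is thus just a thin wrapper over foreachRDD; any output operation can be expressed the same way. A hedged sketch of the equivalent of print(10) written directly against the public foreachRDD API (here `lines` stands for any existing DStream[String], as in a hypothetical driver):

```scala
// Equivalent of lines.print(), spelled out with foreachRDD.
// Each batch's RDD is handed to the closure together with its batch time.
lines.foreachRDD { (rdd, time) =>
  val firstNum = rdd.take(11)           // take one extra element to detect truncation
  println(s"Batch at $time:")
  firstNum.take(10).foreach(println)
  if (firstNum.length > 10) println("...")
}
```

Like print, this registers a ForEachDStream as an output stream, which is what later makes generateJob produce a Job for each batch.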
/**
 * An internal DStream used to represent output operations like DStream.foreachRDD.
 * @param parent Parent DStream
 * @param foreachFunc Function to apply on each RDD generated by the parent DStream
 * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
 *                           by `foreachFunc` will be displayed in the UI; only the scope and
 *                           callsite of `DStream.foreachRDD` will be displayed.
 */
private[streaming]
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[Unit]] = None

  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        // Returns an ordinary Job built from time and jobFunc. generateJob here is purely
        // at the logical level; the job is not invoked directly at this point.
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}
The actual invocation starts from JobGenerator, which triggers the generateJobs method of DStreamGraph:
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time) // here outputStream is our ForEachDStream
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
Now look at generateJobs in JobGenerator (around line 240 of JobGenerator.scala):
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // streamIdToInputInfos describes the input data the jobs need to process
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
def submitJobSet(jobSet: JobSet) { // a JobSet is the set of Jobs belonging to one batch
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet) // jobSets holds JobSets keyed by batch time; it is a ConcurrentHashMap
    // each job, wrapped in a JobHandler, is handed to the thread-pool executor
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
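The essence of submitJobSet, wrapping each job of a batch in a Runnable and handing it to a fixed-size thread pool, can be modeled outside Spark. The sketch below is a simplified illustration under stated assumptions: `SimpleJob`, `JobSetDemo`, and the pool size are all hypothetical stand-ins, not Spark classes.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object JobSetDemo extends App {
  // Hypothetical stand-in for a streaming Job: a batch time plus a body to run.
  case class SimpleJob(time: Long, body: () => Unit)

  // Mirrors spark.streaming.concurrentJobs (default 1): with one thread,
  // jobs of successive batches run strictly one after another.
  val numConcurrentJobs = 1
  val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

  // Analogous to submitJobSet: every job of the batch goes to the pool,
  // wrapped in a Runnable (the role JobHandler plays in Spark).
  def submitJobSet(jobs: Seq[SimpleJob]): Unit =
    jobs.foreach { job =>
      jobExecutor.execute(new Runnable { def run(): Unit = job.body() })
    }

  submitJobSet(Seq(SimpleJob(0L, () => println("output operation of batch 0"))))
  jobExecutor.shutdown()
  jobExecutor.awaitTermination(10, TimeUnit.SECONDS)
}
```

The key design point this mirrors: job *generation* happens on the JobGenerator timer thread, while job *execution* is decoupled onto this pool, so a slow output operation does not block generating the next batch's jobs.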
The JobHandler class:
private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to be null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis())) // record the job start
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run() // the job actually runs here
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis())) // record the job completion
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}
The func invoked in Job's run method:
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

  def run() {
    _result = Try(func())
  }
  // ... (rest of the class elided)
}
The func() inside Try is in fact the jobFunc of the ForEachDStream class shown above (line 49), and its foreachFunc is the one defined inside DStream.print (line 766 of DStream.scala):
def foreachFunc: (RDD[T], Time) => Unit
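Note that Job.run wraps func() in Try rather than letting an exception propagate: the outcome, success or failure, is captured in _result for later reporting. A minimal stand-alone model of this behavior (TryDemo and the printed strings are illustrative, not Spark code):

```scala
import scala.util.{Failure, Success, Try}

object TryDemo extends App {
  // Like Job.run: capture the outcome of the job body instead of throwing.
  val result: Try[Unit] = Try(println("running output operation"))

  result match {
    case Success(_) => println("job succeeded")      // reported as JobCompleted
    case Failure(e) => println(s"job failed: ${e.getMessage}") // surfaced as a batch error
  }
}
```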

A note on JobScheduler parameters:
jobExecutor is a thread pool; the number of threads is set by a configuration parameter, as follows:
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor =
  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
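To let jobs of different batches run concurrently, set this property before creating the StreamingContext. A sketch (the app name is illustrative); note that values greater than 1 relax the default guarantee that batches are processed strictly one at a time, so output operations of different batches may overlap:

```scala
import org.apache.spark.SparkConf

// Sketch: allow up to 2 streaming jobs to run concurrently.
// Caution: with concurrentJobs > 1, batch output operations may interleave.
val conf = new SparkConf()
  .setAppName("ConcurrentJobsDemo")
  .set("spark.streaming.concurrentJobs", "2")
```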


