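The previous post, "Spark Streaming example HdfsWordCount: how InputDStream and OutputDStream enter the DStreamGraph", analyzed how the InputDStream and OutputDStream are registered in the DStreamGraph. This part looks at how FileInputDStream generates its RDDs.
III. How does FileInputDStream generate RDDs?
1. The entry point is the StreamingContext.start() call in our example: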
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      try {
        validate()

        // Start the streaming scheduler in a new thread, so that thread local properties
        // like call sites and job groups can be reset without affecting those of the
        // current thread.
        // A child thread is started partly to do this local initialization and partly
        // so that the main thread is not blocked.
        // With receiver-based input, Spark Streaming needs at least two threads at runtime:
        // one to receive data and one to process it. The thread started here works at the
        // scheduling level and is unrelated to those two.
        ThreadUtils.runInNewThread("streaming-start") {
          sparkContext.setCallSite(startSite.get)
          sparkContext.clearJobGroup()
          sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
          scheduler.start()
        }
        state = StreamingContextState.ACTIVE
      } catch {
        ...
2. Step into JobScheduler.start():
The onReceive method of the EventLoop inside JobScheduler is triggered after JobGenerator calls generateJobs(). This is covered in detail in the post "Spark Streaming source analysis: how JobScheduler's JobStarted and JobCompleted get called".
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  // JobSchedulerEvent instances are put on a LinkedBlockingDeque
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  // Handles ReceiverInputDStream data sources such as SocketInputDStream,
  // FlumePollingInputDStream and FlumeInputDStream; see the subclasses of ReceiverInputDStream
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
# Because FileInputDStream does not need a Receiver, receiverTracker.start() is left for a later post.
3. Next, step into jobGenerator.start(). First, a brief look at how JobGenerator schedules work once per batch Duration:
private[streaming]
class JobGenerator(jobScheduler: JobScheduler) extends Logging {

  private val ssc = jobScheduler.ssc
  private val conf = ssc.conf
  private val graph = ssc.graph
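  // When JobGenerator is initialized it obtains a Clock, which the timer uses to sleep
  // from one Duration boundary to the next.
  // a. This block returns Spark's SystemClock by default. You can also implement your own
  //    Clock and put its fully qualified class name in SparkConf.
  val clock = {
    val clockClass = ssc.sc.conf.get(
      "spark.streaming.clock", "org.apache.spark.util.SystemClock")
    try {
      Utils.classForName(clockClass).newInstance().asInstanceOf[Clock]
    } catch {
      ...
    }
  }
b. After JobGenerator.start() runs, it calls startFirstTime(), which in turn calls timer.start(). From then on, once per Duration, the timer posts a GenerateJobs event so that the EventLoop runs its onReceive method:
  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
==> The eventLoop.post(GenerateJobs()) call above is what causes the EventLoop to run onReceive.
(The EventLoop source is quite simple. It uses a concurrent BlockingQueue; when the loop thread calls take() on an empty queue, it blocks there. So once RecurringTimer posts a GenerateJobs event, take() returns and the onReceive method below is called.)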
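The blocking-queue behaviour described above can be sketched in plain Scala. This is a minimal illustration, not Spark's EventLoop: `MiniEventLoop`, the simplified `GenerateJobs(timeMs)` case class, and `demo` are hypothetical names made up for this example, and the events are posted by hand rather than by a RecurringTimer.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object EventLoopSketch {
  // Hypothetical stand-in for Spark's EventLoop: one daemon thread draining a blocking queue.
  abstract class MiniEventLoop[E](name: String) {
    private val queue = new LinkedBlockingDeque[E]()
    @volatile private var stopped = false

    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped) {
            val event = queue.take() // blocks while the queue is empty
            onReceive(event)
          }
        } catch {
          case _: InterruptedException => // stop() interrupts a blocked take()
        }
      }
    }

    def start(): Unit = thread.start()
    def stop(): Unit = { stopped = true; thread.interrupt() }
    def post(event: E): Unit = queue.put(event)
    protected def onReceive(event: E): Unit
  }

  // Simplified event: the real GenerateJobs carries a Time, here just milliseconds.
  final case class GenerateJobs(timeMs: Long)

  // Posts three events and returns the batch times seen by onReceive, in arrival order.
  def demo(): List[Long] = {
    val seen = new AtomicReference(Vector.empty[Long])
    val done = new CountDownLatch(3)
    val loop = new MiniEventLoop[GenerateJobs]("mini-job-generator") {
      override protected def onReceive(event: GenerateJobs): Unit = {
        seen.set(seen.get :+ event.timeMs) // only the loop thread writes here
        done.countDown()
      }
    }
    loop.start()
    Seq(1000L, 2000L, 3000L).foreach(t => loop.post(GenerateJobs(t)))
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    seen.get.toList
  }
}
```

Because a single thread consumes a FIFO queue, events are handled in the order they were posted, which is why one GenerateJobs event per batch interval leads to one generateJobs call per batch.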
/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  checkpointWriter
  // JobGeneratorEvent has several implementations (GenerateJobs, ClearMetadata,
  // DoCheckpoint, ClearCheckpointData); GenerateJobs is the one that means "generate jobs"
  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
    ...
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}
4. The first event taken from the BlockingQueue is the GenerateJobs case class, so generateJobs(time) runs first:
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    ...
  }
}
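5. Inside this generateJobs method, DStreamGraph.generateJobs gathers the text data newly added during each Duration into a UnionRDD, wraps it in a Job, and hands the Job over for execution on the executors. The details are as follows:
# From Spark's own comment: this method generates jobs and performs a checkpoint for the given time.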
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    // This call is for ReceiverInputDStream; FileInputDStream is not of that kind
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // Call the graph's generateJobs method. The surrounding Try yields Success(jobs) or
    // Failure(e), where jobs is the collection of Job objects the method returned. If job
    // creation succeeded, JobScheduler.submitJobSet is called to submit the jobs to the cluster.
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // streamIdToInputInfos is the metadata of the data received in this batch
      // A JobSet represents the jobs of one batch duration. It is a plain object holding the
      // not-yet-submitted jobs, the submission time, the processing start and end times, etc.
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post the checkpoint event; it is posted once per streaming batch interval
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
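The Try / Success / Failure shape of generateJobs can be tried out in isolation. The following is a minimal sketch under stated assumptions: `Job`, `Submitted`, `Reported` and `generate` are hypothetical stand-ins invented for this example, and the outcome is returned as a value instead of being passed to submitJobSet or reportError.

```scala
import scala.util.{Failure, Success, Try}

object GenerateJobsSketch {
  // Hypothetical stand-ins; these are not Spark's Job or JobSet classes.
  final case class Job(timeMs: Long, name: String)

  sealed trait Outcome
  final case class Submitted(timeMs: Long, jobs: Seq[Job]) extends Outcome
  final case class Reported(message: String) extends Outcome

  // Same shape as JobGenerator.generateJobs: run job generation inside Try, then branch.
  def generate(timeMs: Long)(graphGenerateJobs: Long => Seq[Job]): Outcome =
    Try {
      graphGenerateJobs(timeMs)
    } match {
      case Success(jobs) =>
        Submitted(timeMs, jobs) // the real code calls jobScheduler.submitJobSet here
      case Failure(e) =>
        Reported("Error generating jobs for time " + timeMs + ": " + e.getMessage)
    }
}
```

The point of the pattern is that an exception thrown while building the jobs of one batch is turned into a reported error and does not escape into the event loop thread.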
6. Step into DStreamGraph.generateJobs() to trace the flow:
# In this example there is only one outputStream, a ForEachDStream (only ForEachDStream overrides DStream's generateJob method).
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      // jobOption is Some(new Job(time, jobFunc)) when an RDD exists for this batch, else None.
      // If print() was called, jobFunc wraps the function declared inside print
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
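A small sketch of why flatMap is used here: each output stream yields an Option of a job, and flatMap keeps only the ones that are defined. The `OutputStream` and `Job` types below are hypothetical simplifications, and the call site is set with `copy` rather than by mutating the job as Spark does.

```scala
object OutputStreamsSketch {
  final case class Job(timeMs: Long, callSite: String)

  // Hypothetical output stream: produces a job only when it has data for the batch.
  final case class OutputStream(creationSite: String, hasData: Long => Boolean) {
    def generateJob(timeMs: Long): Option[Job] =
      if (hasData(timeMs)) Some(Job(timeMs, callSite = "")) else None
  }

  // flatMap over Option drops the None results, so streams without data add no job.
  def generateJobs(outputStreams: Seq[OutputStream], timeMs: Long): Seq[Job] =
    outputStreams.flatMap { out =>
      out.generateJob(timeMs).map(_.copy(callSite = out.creationSite))
    }
}
```

With one output stream that always has data and another that only has data on every second batch, the returned sequence has one job for the odd batches and two for the even ones.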
7. Now look at how ForEachDStream's generateJob(time) turns the new files in the directory monitored by val lines = ssc.textFileStream(args(0)) into an RDD.
private[streaming]
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[Unit]] = None

  override def generateJob(time: Time): Option[Job] = {
    // parent traces back through all of the app's DStream operators; the RDD that comes
    // back holds the data of the current batch interval
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        // createRDDWithLocalProperties simply runs the body { foreachFunc(rdd, time) };
        // the leading (time, displayInnerRDDOps) arguments are for display in the UI
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time) // foreachFunc is the nested function declared in print()
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}
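One detail worth seeing in isolation is that generateJob only builds a closure; foreachFunc does not run until the Job is executed later. The sketch below assumes a hypothetical `Job` class and uses a plain `Seq[T]` in place of an RDD and a `Long` in place of Time.

```scala
object JobClosureSketch {
  // Hypothetical stand-in for Spark's Job: it only stores the function to run later.
  final class Job(val timeMs: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Same shape as ForEachDStream.generateJob: wrap foreachFunc(rdd, time) in a closure
  // when the parent produced data for this batch, otherwise produce no job.
  def generateJob[T](parentRdd: Option[Seq[T]], timeMs: Long)(
      foreachFunc: (Seq[T], Long) => Unit): Option[Job] =
    parentRdd match {
      case Some(rdd) =>
        val jobFunc = () => foreachFunc(rdd, timeMs) // nothing is executed yet
        Some(new Job(timeMs, jobFunc))
      case None => None
    }
}
```

Calling generateJob has no visible effect until run() is invoked on the returned Job, which matches the split between job generation in JobGenerator and job execution later on.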
===> In this example the parent of the ForEachDStream is a ShuffledDStream, produced by the reduceByKey operator below.
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
===> Note that ShuffledDStream itself does not override getOrCompute(time); it only implements compute:
private[streaming]
class ShuffledDStream[K: ClassTag, V: ClassTag, C: ClassTag](
    parent: DStream[(K, V)],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiner: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true
  ) extends DStream[(K, C)] (parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[(K, C)]] = {
    parent.getOrCompute(validTime) match {
      case Some(rdd) => Some(rdd.combineByKey[C](
          createCombiner, mergeValue, mergeCombiner, partitioner, mapSideCombine))
      case None => None
    }
  }
}
8. The getOrCompute method in DStream ultimately calls the compute method of the current DStream instance:
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      // createRDDWithLocalProperties simply runs the curried body that follows,
      // { PairRDDFunctions.disableOutputSpecValidation.withValue(true) { ... } };
      // the leading (time, displayInnerRDDOps) arguments are for display in the UI
      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time) // the concrete DStream creates the RDD, e.g. FileInputDStream.compute
        }
      }
      ...
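===> The calls trace back upstream in the order below until they reach FileInputDStream's compute method. (textFileStream itself adds a map(_._2.toString) on top of the file stream, so there is one more MappedDStream directly above FileInputDStream.)
ShuffledDStream
MappedDStream
FlatMappedDStream
MappedDStream
FileInputDStream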
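The recursion through getOrCompute and compute, together with the per-batch cache, can be modelled with ordinary collections. This is a minimal sketch, not Spark's DStream: `MiniDStream` and its subclasses are hypothetical, a `Seq[T]` stands in for an RDD, a `Long` for Time, there is no isTimeValid check, and `computeCalls` exists only so the caching can be observed.

```scala
import scala.collection.mutable

object GetOrComputeSketch {
  // Hypothetical miniature of DStream with a per-batch cache like generatedRDDs.
  abstract class MiniDStream[T] {
    private val generated = mutable.HashMap.empty[Long, Seq[T]]
    var computeCalls = 0 // instrumentation for this example only

    protected def compute(timeMs: Long): Option[Seq[T]]

    // Return the cached batch if present, otherwise compute it once and cache it.
    final def getOrCompute(timeMs: Long): Option[Seq[T]] =
      generated.get(timeMs).orElse {
        computeCalls += 1
        val result = compute(timeMs)
        result.foreach(batch => generated.put(timeMs, batch))
        result
      }
  }

  // Source stream playing the role of FileInputDStream: always Some, possibly empty.
  class FileSource(linesByBatch: Map[Long, Seq[String]]) extends MiniDStream[String] {
    protected def compute(timeMs: Long): Option[Seq[String]] =
      Some(linesByBatch.getOrElse(timeMs, Seq.empty))
  }

  class FlatMapped[T, U](parent: MiniDStream[T], f: T => Seq[U]) extends MiniDStream[U] {
    protected def compute(timeMs: Long): Option[Seq[U]] =
      parent.getOrCompute(timeMs).map(_.flatMap(f))
  }

  class Mapped[T, U](parent: MiniDStream[T], f: T => U) extends MiniDStream[U] {
    protected def compute(timeMs: Long): Option[Seq[U]] =
      parent.getOrCompute(timeMs).map(_.map(f))
  }

  // Plays the role of ShuffledDStream for a word count: sum the values per key.
  class Reduced[K](parent: MiniDStream[(K, Int)]) extends MiniDStream[(K, Int)] {
    protected def compute(timeMs: Long): Option[Seq[(K, Int)]] =
      parent.getOrCompute(timeMs).map { pairs =>
        pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq
      }
  }
}
```

Asking the last stream in the chain for a batch pulls the request all the way up to the source, and asking again for the same batch is answered from the cache without calling compute a second time.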
9. Step into FileInputDStream's compute method. After findNewFiles finds the new files, filesToRDD is called to turn the data in those files into an RDD:
override def compute(validTime: Time): Option[RDD[(K, V)]] = {
  // Find new files
  val newFiles = findNewFiles(validTime.milliseconds)
  logInfo("New files at time " + validTime + ":\n" + newFiles.mkString("\n"))
  batchTimeToSelectedFiles += ((validTime, newFiles))
  recentlySelectedFiles ++= newFiles
  val rdds = Some(filesToRDD(newFiles))
  // Copy newFiles to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "files" -> newFiles.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
  val inputInfo = StreamInputInfo(id, 0, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
  rdds
}
===> newAPIHadoopFile produces one NewHadoopRDD per file ===> and these are finally combined into a single UnionRDD
/** Generate one RDD from an array of files */
private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
  val fileRDDs = files.map { file =>
    val rdd = serializableConfOpt.map(_.value) match {
      case Some(config) => context.sparkContext.newAPIHadoopFile(
        file,
        fm.runtimeClass.asInstanceOf[Class[F]],
        km.runtimeClass.asInstanceOf[Class[K]],
        vm.runtimeClass.asInstanceOf[Class[V]],
        config)
      case None => context.sparkContext.newAPIHadoopFile[K, V, F](file)
    }
    if (rdd.partitions.size == 0) {
      logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
        "files that have been \"moved\" to the directory assigned to the file stream. " +
        "Refer to the streaming programming guide for more details.")
    }
    rdd
  }
  new UnionRDD(context.sparkContext, fileRDDs)
}
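As a loose analogy for "one RDD per file, then a union", the sketch below uses in-memory collections only. `filesToLines` and the `readFile` function are hypothetical, a `Seq[String]` stands in for each per-file RDD, and an empty file here is only a rough counterpart of the "no partitions" case that Spark logs an error for.

```scala
object FilesToRddSketch {
  // Reads every file separately, then concatenates the results in file order.
  // Returns the "union" plus the files that contributed no data.
  def filesToLines(
      files: Seq[String],
      readFile: String => Seq[String]): (Seq[String], Seq[String]) = {
    val perFile = files.map(f => f -> readFile(f))             // one dataset per file
    val emptyFiles = perFile.collect { case (f, lines) if lines.isEmpty => f }
    val union = perFile.flatMap(_._2)                          // union of all per-file datasets
    (union, emptyFiles)
  }
}
```

An empty list of new files gives an empty union, which corresponds to compute still returning Some(...) for a batch in which nothing arrived.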
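10. We now have the UnionRDD for the current Duration, so let's go back to ForEachDStream's generateJob() method. At this point it is clear that in every batch interval, all the data for that interval is first gathered into one RDD, which is then wrapped in the corresponding streaming Job.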
override def generateJob(time: Time): Option[Job] = {
  // parent traces back through all of the app's DStream operators; the RDD that comes
  // back holds the data of the current batch interval
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      // createRDDWithLocalProperties simply runs the body { foreachFunc(rdd, time) };
      // the leading (time, displayInnerRDDOps) arguments are for display in the UI
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time) // if print() was called, foreachFunc is the function declared inside print
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
==> The next post analyzes how the Job is executed.