This lesson tries to walk through the runtime source code of the Spark Streaming framework in a single session, driven by one case. It builds on the previous four lessons and has two parts: (1) a review and demonstration of the case that computes the hottest items per category online and dynamically; (2) a walkthrough of the Spark Streaming runtime source code based on that case.
The review and demonstration of the online hottest-items-per-category case is based on the content of the earlier lessons.
The business code in OnlineTheTop3ItemForEachCategory2DB.scala is as follows:
1. package com.dt.spark.sparkstreaming
2. 
3. import org.apache.spark.SparkConf
4. import org.apache.spark.sql.Row
5. import org.apache.spark.sql.hive.HiveContext
6. import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
7. import org.apache.spark.streaming.{Seconds, StreamingContext}
8. 
9. /**
10.  * Use Spark Streaming + Spark SQL to compute, online and dynamically, the hottest items in each
11.  * category of an e-commerce site, e.g. the three hottest phones in the "phone" category and the three hottest TVs in the "TV" category. This example is of great practical value in production.
12.  *
13.  * @author DT大数据梦工厂
14.  * Sina Weibo: http://weibo.com/ilovepains/
15.  *
16.  *
17.  * Implementation: Spark Streaming + Spark SQL. The reason Spark Streaming can use ML, SQL, GraphX and other functionality is that it exposes interfaces such as foreachRDD and transform,
18.  * which operate on RDDs. With RDDs as the foundation, all other Spark functionality can be used directly, as simply as calling an API.
19.  * The input data format is assumed to be: user item category, e.g. Rocky Samsung Android
20.  */
21. object OnlineTheTop3ItemForEachCategory2DB {
22.   def main(args: Array[String]) {
23.     /**
24.      * Step 1: create SparkConf, the Spark configuration object, and set the runtime configuration of the Spark program.
25.      * For example, setMaster sets the URL of the Master of the Spark cluster the program connects to; if it is set
26.      * to local, the program runs locally, which is especially suitable for beginners with very limited machines
27.      * (e.g. only 1 GB of memory).
28.      */
29.     val conf = new SparkConf() // create the SparkConf object
30.     conf.setAppName("OnlineTheTop3ItemForEachCategory2DB") // set the application name, visible in the monitoring UI while the program runs
31.     // conf.setMaster("spark://Master:7077") // in this case the program runs on the Spark cluster
32.     conf.setMaster("local[6]")
33.     // set batchDuration to control how often jobs are generated, and create the entry point of Spark Streaming execution
34.     val ssc = new StreamingContext(conf, Seconds(5))
35. 
36.     ssc.checkpoint("/root/Documents/SparkApps/checkpoint")
37. 
38. 
39.     val userClickLogsDStream = ssc.socketTextStream("Master", 9999)
40. 
41.     val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
42.       (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))
43. 
44.     // val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2,
45.     //   (v1: Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
46. 
47.     val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(_ + _,
48.       _ - _, Seconds(60), Seconds(20))
49.
50.     categoryUserClickLogsDStream.foreachRDD { rdd => {
51.       if (rdd.isEmpty()) {
52.         println("No data inputted!!!")
53.       } else {
54.         val categoryItemRow = rdd.map(reducedItem => {
55.           val category = reducedItem._1.split("_")(0)
56.           val item = reducedItem._1.split("_")(1)
57.           val click_count = reducedItem._2
58.           Row(category, item, click_count)
59.         })
60. 
61.         val structType = StructType(Array(
62.           StructField("category", StringType, true),
63.           StructField("item", StringType, true),
64.           StructField("click_count", IntegerType, true)
65.         ))
66. 
67.         val hiveContext = new HiveContext(rdd.context)
68.         val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
69. 
70.         categoryItemDF.registerTempTable("categoryItemTable")
71. 
72.         val reseltDataFram = hiveContext.sql("SELECT category,item,click_count FROM (SELECT category,item,click_count,row_number()" +
73.           " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable) subquery " +
74.           " WHERE rank <= 3")
75.         reseltDataFram.show()
76. 
77.         val resultRowRDD = reseltDataFram.rdd
78. 
79.         resultRowRDD.foreachPartition { partitionOfRecords => {
80. 
81.           if (partitionOfRecords.isEmpty) {
82.             println("This RDD is not null but partition is null")
83.           } else {
84.             // ConnectionPool is a static, lazily initialized pool of connections
85.             val connection = ConnectionPool.getConnection()
86.             partitionOfRecords.foreach(record => {
87.               val sql = "insert into categorytop3(category,item,client_count) values('" + record.getAs("category") + "','" +
88.                 record.getAs("item") + "'," + record.getAs("click_count") + ")"
89.               val stmt = connection.createStatement();
90.               stmt.executeUpdate(sql);
91. 
92.             })
93.             ConnectionPool.returnConnection(connection) // return to the pool for future reuse
94. 
95.           }
96.         }
97.         }
98.       }
99. 
100. 
101.     }
102.     }
103.
104.
105.
106.     /**
107.      * Inside StreamingContext.start(), JobScheduler's start method is invoked to begin the message loop. JobScheduler's
108.      * start constructs JobGenerator and ReceiverTracker and calls their start methods:
109.      * 1. Once started, JobGenerator keeps generating Jobs, one per batchDuration.
110.      * 2. Once started, ReceiverTracker first launches the Receivers in the Spark cluster (actually it first starts a
111.      *    ReceiverSupervisor on the Executor). After a Receiver receives data, the ReceiverSupervisor stores it on the
112.      *    Executor and sends the metadata to the ReceiverTracker on the Driver, which manages it via ReceivedBlockTracker.
113.      * Each BatchInterval produces a concrete Job, but this is not a Job in the Spark Core sense; it is only the DAG of RDDs
114.      * generated from the DStreamGraph. From a Java point of view it is like an instance of the Runnable interface. To run it, the
115.      * Job must be submitted to JobScheduler, which uses a thread pool to find a separate thread that submits the Job to the cluster
116.      * (the real job execution is triggered in that thread by an RDD action). Why a thread pool?
117.      * 1. Jobs are generated continuously, so a thread pool improves efficiency, just like executing Tasks through a thread pool on the Executor.
118.      * 2. The FAIR scheduling mode may be configured for Jobs, which also requires multi-threading support.
119.      */
120. ssc.start()
121. ssc.awaitTermination()
122.
123. }
124. }
reduceByKeyAndWindow is a fairly efficient approach: every 20 seconds the window slides by one step on top of the history, with a window length of 60 seconds. `_ + _` adds the newly arrived 20 seconds to the result of the previous window, and `_ - _` subtracts the oldest 20 seconds that have just slid out of the window.
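To make the window semantics concrete, here is a minimal, self-contained sketch using the same reduceByKeyAndWindow overload (the host, port, and checkpoint directory are placeholders, not values from the case above); note that the inverse-function variant requires checkpointing to be enabled:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlidingWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SlidingWindowSketch")
    val ssc = new StreamingContext(conf, Seconds(5))      // batch interval: 5 seconds
    ssc.checkpoint("/tmp/sliding-window-checkpoint")      // required by the inverse-reduce variant

    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Window length 60s, slide 20s: every 20s, add the counts of the newly arrived
    // 20 seconds (_ + _) and subtract the 20 seconds that just left the window (_ - _).
    val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(20))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}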
The online top-3-items-per-category case was run and its output demonstrated live in the lesson.
In the business code, at `val ssc = new StreamingContext(conf, Seconds(5))`, let's look at the source code of StreamingContext.
StreamingContext.scala
1. def this(
2.     master: String,
3.     appName: String,
4.     batchDuration: Duration,
5.     sparkHome: String = null,
6.     jars: Seq[String] = Nil,
7.     environment: Map[String, String] = Map()) = {
8.   this(StreamingContext.createNewSparkContext(master, appName, sparkHome, jars, environment),
9.        null, batchDuration)
10. }
When constructing a StreamingContext we pass in several parameters: the master URL, the application name appName, and the batch interval batchDuration. During the actual construction, a SparkContext is created via StreamingContext.createNewSparkContext.
StreamingContext.scala
1. class StreamingContext private[streaming] (
2. _sc: SparkContext,
3. _cp: Checkpoint,
4. _batchDur: Duration
5. ) extends Logging {
When a StreamingContext is built, whether from a SparkConf or from its own configuration, it constructs a SparkContext internally: the collected parameters are handed to the conf and a new SparkContext(conf) is created. This vividly demonstrates one thing: Spark Streaming is just an application on top of Spark Core (see the sketch after the listing below).
StreamingContext.scala
1. private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
2.   new SparkContext(conf)
3. }
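As a quick, hedged illustration of that point (object name and master URL are placeholders), a StreamingContext can be built either from a SparkConf, in which case the SparkContext is created internally as above, or from an already running SparkContext, which makes it explicit that Spark Streaming is just an application on Spark Core:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextConstruction {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingContextConstruction")

    // Path 1: let StreamingContext create the SparkContext internally from the conf.
    val ssc1 = new StreamingContext(conf, Seconds(5))
    ssc1.stop(stopSparkContext = true)

    // Path 2: reuse an existing SparkContext.
    val sc = new SparkContext(conf)
    val ssc2 = new StreamingContext(sc, Seconds(5))
    ssc2.stop(stopSparkContext = true)
  }
}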
After the SparkContext is constructed, look at the business code `val userClickLogsDStream = ssc.socketTextStream("Master", 9999)`. Click into socketTextStream to see its source:
StreamingContext.scala
1. def socketTextStream(
2.     hostname: String,
3.     port: Int,
4.     storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
5.   ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
6.   socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
7. }
8. ……
9. def socketStream[T: ClassTag](
10.     hostname: String,
11.     port: Int,
12.     converter: (InputStream) => Iterator[T],
13.     storageLevel: StorageLevel
14.   ): ReceiverInputDStream[T] = {
15.   new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
16. }
socketTextStream creates an input stream; inside socketStream a SocketInputDStream is instantiated.
SocketInputDStream.scala source code:
1. private[streaming]
2. class SocketInputDStream[T: ClassTag](
3.     _ssc: StreamingContext,
4.     host: String,
5.     port: Int,
6.     bytesToObjects: InputStream => Iterator[T],
7.     storageLevel: StorageLevel
8.   ) extends ReceiverInputDStream[T](_ssc) {
9.   def getReceiver(): Receiver[T] = {
10.     new SocketReceiver(host, port, bytesToObjects, storageLevel)
11.   }
12. }
13. 
14. private[streaming]
15. class SocketReceiver[T: ClassTag](
16.     host: String,
17.     port: Int,
18.     bytesToObjects: InputStream => Iterator[T],
19.     storageLevel: StorageLevel
20.   ) extends Receiver[T](storageLevel) with Logging {
21.   private var socket: Socket = _
22. 
23.   def onStart() {
24. 
25.     logInfo(s"Connecting to $host:$port")
26.     try {
27.       socket = new Socket(host, port)
28.     } catch {
29.       case e: ConnectException =>
30.         restart(s"Error connecting to $host:$port", e)
31.         return
32.     }
33.     logInfo(s"Connected to $host:$port")
34. 
35.     // Start the thread that receives data over a connection
36.     new Thread("Socket Receiver") {
37.       setDaemon(true)
38.       override def run() { receive() }
39.     }.start()
40.   }
SocketInputDStream receives data through SocketReceiver. In SocketReceiver's onStart method a thread is started that runs the receive method; receive connects to the socket using the host and port, obtains the data via socket.getInputStream(), and then loops continuously.
SocketReceiver.scala source code:
1. def receive() {
2.   try {
3.     val iterator = bytesToObjects(socket.getInputStream())
4.     while (!isStopped && iterator.hasNext) {
5.       store(iterator.next())
6.     }
7.     if (!isStopped()) {
8.       restart("Socket data stream had no more data")
9.     } else {
10.       logInfo("Stopped receiving")
11.     }
12.   } catch {
13.     case NonFatal(e) =>
14.       logWarning("Error receiving data", e)
15.       restart("Error receiving data", e)
16.   } finally {
17.     onStop()
18.   }
19. }
SocketInputDStream extends ReceiverInputDStream, which in turn extends InputDStream. InputDStream is the abstract base class of all input streams. It provides the start() and stop() methods, which the Spark Streaming system calls to start and stop receiving data. Input streams that can generate RDDs by running a service/thread on the Driver node (i.e. without running a receiver on Worker nodes) can be implemented by extending InputDStream directly. For example, FileInputDStream, a subclass of InputDStream, monitors an HDFS directory for new files, and the Driver node generates RDDs from those new files. To implement an input stream that runs a receiver on Worker nodes, use [[org.apache.spark.streaming.dstream.ReceiverInputDStream]] as the parent class (a minimal custom-receiver sketch follows the listing below).
As a concrete subclass of InputDStream, SocketInputDStream does its work through onStart() and onStop(), which are invoked as callbacks. ReceiverInputDStream's start() and stop() are empty implementations, because receiving is driven by the ReceiverTracker.
InputDStream.scala source code:
1. abstract class InputDStream[T: ClassTag](_ssc: StreamingContext)
2.   extends DStream[T](_ssc) {
3. ……
4.   /** Method called to start receiving data. Subclasses must implement this method. */
5.   def start(): Unit
6. 
7.   /** Method called to stop receiving data. Subclasses must implement this method. */
8.   def stop(): Unit
9. }
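For intuition, here is a minimal, hedged sketch of a receiver-based input stream built on the public API (the class below is made up for illustration): a custom Receiver plays the role SocketReceiver plays above, and ssc.receiverStream turns it into a ReceiverInputDStream.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A toy receiver that emits a counter once per second; a real receiver would pull
// from a socket, message queue, etc. onStart must not block, so the work happens
// in a daemon thread, exactly as in SocketReceiver above.
class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
  override def onStart(): Unit = {
    new Thread("Counter Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store(s"event-$i")   // hand the record to the ReceiverSupervisor for storage
          i += 1
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* the loop above exits once isStopped() is true */ }
}

// Usage: val stream = ssc.receiverStream(new CounterReceiver)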
Now look at ForEachDStream. ForEachDStream is an internal DStream used to represent output operations such as DStream.foreachRDD. At output time, its generateJob method news up a Job; this is a DStream-level action. Every DStream output operation ultimately produces a ForEachDStream, and ForEachDStream creates the Job directly. The Job is built around a function, i.e. the business logic.
ForEachDStream.scala source code:
1. private[streaming]
2. class ForEachDStream[T: ClassTag] (
3.     parent: DStream[T],
4.     foreachFunc: (RDD[T], Time) => Unit,
5.     displayInnerRDDOps: Boolean
6.   ) extends DStream[Unit](parent.ssc) {
7. ……
8.   override def generateJob(time: Time): Option[Job] = {
9.     parent.getOrCompute(time) match {
10.       case Some(rdd) =>
11.         val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
12.           foreachFunc(rdd, time)
13.         }
14.         Some(new Job(time, jobFunc))
15.       case None => None
16.     }
17.   }
In DStream.scala there is a key line: `private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()`. RDDs are generated per time interval, so DStream operations are translated into RDD operations with the corresponding dependencies. generatedRDDs is private to the [streaming] package, so to access it, the package name of your code must match [streaming]. This is a very useful trick: when extending Spark, if framework functionality is to be exposed to different application developers, the package name must match in order to access such members (see the sketch after the listing below). generatedRDDs itself is a HashMap whose key is the time and whose value is the RDD. From this we can see that the essence of a DStream is a series of data sets stored along the time axis, because each RDD holds the data of one batchDuration.
DStream.scala source code:
1. @transient
2. private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
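A hedged sketch of that packaging trick (the object name is made up): by compiling a helper into the org.apache.spark.streaming package, user code can read the otherwise private[streaming] generatedRDDs map.

// This file must be compiled into the package that private[streaming] refers to,
// otherwise generatedRDDs is not visible.
package org.apache.spark.streaming

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object DStreamInternals {
  // Returns the (time -> RDD) pairs the DStream has generated so far.
  def generatedSoFar[T](stream: DStream[T]): Map[Time, RDD[T]] =
    stream.generatedRDDs.toMap
}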
Next, look at DStream's getOrCompute. The getOrCompute method either retrieves the RDD corresponding to the given time from the cache, or computes and caches it. DStream operations produce RDDs: a DStream is a template for RDDs, and what actually runs are the RDDs it produces.
Spark 2.1.1 version of getOrCompute in DStream.scala:
1. private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
2.   // If an RDD was already generated for this time, retrieve it from the HashMap, or else compute it
3.   generatedRDDs.get(time).orElse {
4.     // Compute the RDD if the time is valid (e.g. a correct time in a sliding window
5.     // for RDD generation); otherwise do not generate the RDD.
6.     if (isTimeValid(time)) {
7. 
8.       val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
9.         // Disable checks for existing output directories in jobs launched by the streaming scheduler, since we may need to write output to an existing directory during checkpoint recovery; see SPARK-4835 for more details. We need this call here because compute() may cause Spark jobs to be launched.
10.        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
11.          compute(time)
12.        }
13.      }
14. 
15.      rddOption.foreach { case newRDD =>
16.        // Register the generated RDD for caching and checkpointing
17.        if (storageLevel != StorageLevel.NONE) {
18.          newRDD.persist(storageLevel)
19.          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
20.        }
21.        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
22.          newRDD.checkpoint()
23.          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
24.        }
25.        generatedRDDs.put(time, newRDD)
26.      }
27.      rddOption
28.    } else {
29.      None
30.    }
31.  }
32. }
In the Spark 2.2.0 version of DStream.scala's getOrCompute, compared with Spark 2.1.1, the PairRDDFunctions on line 10 of the snippet above is changed to SparkHadoopWriterUtils:
1. ……
2. SparkHadoopWriterUtils.disableOutputSpecValidation.withValue(true) {
3. ……
Through socketTextStream we obtain an input stream, SocketInputDStream. When it runs in the cluster it grabs data: SocketInputDStream fetches data from the network through SocketReceiver's getInputStream, and the discretized stream (DStream) is then stored as generatedRDDs, a HashMap, so that ultimately the data is stored as RDDs keyed by Time. Imagine ten thousand years sliced into one-second pieces: slice after slice, the slices are the discretized stream, and front to back they form the river of time.
Then comes the transform-style processing of the business logic; the code we write is the business logic, and the final line of framework code, ssc.start(), is crucial. When start() is called it first checks the state: if the context is already active it warns that it has already been started, and since there cannot be more than one active context, an already running StreamingContext cannot be started again; if the state is stopped, it reports that the context has already been stopped. A small user-side sketch of this state machine follows.
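The sketch below is hedged and self-contained (host, port and app name are placeholders; the second start() call is deliberately redundant to show the warning behavior):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StartStateDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StartStateDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    println(ssc.getState())          // INITIALIZED
    // At least one output operation is required before start().
    ssc.socketTextStream("localhost", 9999).print()
    ssc.start()
    println(ssc.getState())          // ACTIVE
    ssc.start()                      // a second start() on an ACTIVE context only logs a warning
    ssc.stop(stopSparkContext = true)
    println(ssc.getState())          // STOPPED; calling start() now would throw IllegalStateException
  }
}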
Now look at scheduler.start() inside StreamingContext.scala's start method.
JobScheduler.scala start() method source code:
1. def start(): Unit = synchronized {
2.   if (eventLoop != null) return // scheduler has already been started
3. 
4.   logDebug("Starting JobScheduler")
5.   eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
6.     override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
7. 
8.     override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
9.   }
10.   eventLoop.start()
11. 
12.   // attach rate controllers of input streams to receive batch completion updates
13.   for {
14.     inputDStream <- ssc.graph.getInputStreams
15.     rateController <- inputDStream.rateController
16.   } ssc.addStreamingListener(rateController)
17. 
18.   listenerBus.start()
19.   receiverTracker = new ReceiverTracker(ssc)
20.   inputInfoTracker = new InputInfoTracker(ssc)
21. 
22.   val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
23.     case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
24.     case _ => null
25.   }
26. 
27.   executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
28.     executorAllocClient,
29.     receiverTracker,
30.     ssc.conf,
31.     ssc.graph.batchDuration.milliseconds,
32.     clock)
33.   executorAllocationManager.foreach(ssc.addStreamingListener)
34.   receiverTracker.start()
35.   jobGenerator.start()
36.   executorAllocationManager.foreach(_.start())
37.   logInfo("Started JobScheduler")
38. }
The first thing in scheduler.start() is the eventLoop: if eventLoop is not null, start() returns immediately (the scheduler has already been started); otherwise the eventLoop is constructed.
EventLoop.scala source code:
1. private[spark] abstract class EventLoop[E](name: String) extends Logging {
2. 
3.   private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()
4. 
5.   private val stopped = new AtomicBoolean(false)
6. 
7.   private val eventThread = new Thread(name) {
8.     setDaemon(true)
9. 
10.     override def run(): Unit = {
11.       try {
12.         while (!stopped.get) {
13.           val event = eventQueue.take()
14.           try {
15.             onReceive(event)
16.           } catch {
17.             case NonFatal(e) =>
18.               try {
19.                 onError(e)
20.               } catch {
21.                 case NonFatal(e) => logError("Unexpected error in " + name, e)
22.               }
23.           }
24.         }
25.       } catch {
26.         case ie: InterruptedException => // exit even if eventQueue is not empty
27.         case NonFatal(e) => logError("Unexpected error in " + name, e)
28.       }
29.     }
30. 
31.   }
Inside EventLoop there is a thread set to run as a daemon (setDaemon) and a LinkedBlockingDeque event queue that guarantees first-in-first-out ordering. When the thread runs, it takes an event from the queue and then invokes onReceive. A standalone miniature of this pattern is sketched below.
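Since EventLoop itself is private[spark], here is a hedged, self-contained miniature of the same pattern (names are made up, not the real class), showing the daemon thread, the blocking queue, and the onReceive callback:

import java.util.concurrent.LinkedBlockingDeque
import scala.util.control.NonFatal

// Minimal event loop in the style of org.apache.spark.util.EventLoop.
abstract class MiniEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true)                       // do not keep the JVM alive
    override def run(): Unit = {
      try {
        while (!stopped) {
          val event = queue.take()        // blocks until an event is posted
          try onReceive(event) catch { case NonFatal(e) => onError(e) }
        }
      } catch {
        case _: InterruptedException =>   // stop() interrupts the blocking take()
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
  def post(event: E): Unit = queue.put(event)

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit = e.printStackTrace()
}

// Usage sketch:
// val loop = new MiniEventLoop[String]("demo") {
//   override protected def onReceive(event: String): Unit = println(s"got $event")
// }
// loop.start(); loop.post("hello")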
JobScheduler's onReceive calls processEvent(event), which pattern-matches on the event and handles each case separately: job start, job completion, and errors. Job start is handled by handleJobStart.
JobScheduler.scala handleJobStart source code:
1. private def handleJobStart(job: Job, startTime: Long) {
2.   val jobSet = jobSets.get(job.time)
3.   val isFirstJobOfJobSet = !jobSet.hasStarted
4.   jobSet.handleJobStart(job)
5.   if (isFirstJobOfJobSet) {
6.     // Post StreamingListenerBatchStarted only after calling handleJobStart, so that the correct jobSet.processingStartTime is available.
7.     listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
8.   }
9.   job.setStartTime(startTime)
10.   listenerBus.post(StreamingListenerOutputOperationStarted(job.toOutputOperationInfo))
11.   logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
12. }
JobScheduler is the scheduler of all Jobs. It uses a thread loop to listen for the different cases of job start, job completion, and job failure, handled by handleJobStart, handleJobCompletion, and handleError respectively. JobScheduler runs inside that thread and, when started, launches yet another thread; it is a simple message loop.
After eventLoop.start() the loop runs; the eventLoop is a separate thread. Then, in the for expression, the inputDStreams are obtained (there may be many of them), and the rateController can control the input rate. Then listenerBus.start() starts the listener bus, which is vital: it can listen to all events (a sketch of a custom listener follows the listing below).
The onOtherEvent method handles many events; it can also be extended and displayed in the Web UI.
StreamingListenerBus.scala source code:
1. private[streaming] class StreamingListenerBus(sparkListenerBus: LiveListenerBus)
2.   extends SparkListener with ListenerBus[StreamingListener, StreamingListenerEvent] {
3. 
4.   /**
5.    * Post a StreamingListenerEvent to the Spark listener bus asynchronously. This event will be
6.    * dispatched to all StreamingListeners in the thread of the Spark listener bus.
7.    */
8.   def post(event: StreamingListenerEvent) {
9.     sparkListenerBus.post(new WrappedStreamingListenerEvent(event))
10.   }
11. 
12.   override def onOtherEvent(event: SparkListenerEvent): Unit = {
13.     event match {
14.       case WrappedStreamingListenerEvent(e) =>
15.         postToAll(e)
16.       case _ =>
17.     }
18.   }
19. 
20.   protected override def doPostEvent(
21.       listener: StreamingListener,
22.       event: StreamingListenerEvent): Unit = {
23.     event match {
24.       case receiverStarted: StreamingListenerReceiverStarted =>
25.         listener.onReceiverStarted(receiverStarted)
26.       case receiverError: StreamingListenerReceiverError =>
27.         listener.onReceiverError(receiverError)
28.       case receiverStopped: StreamingListenerReceiverStopped =>
29.         listener.onReceiverStopped(receiverStopped)
30.       case batchSubmitted: StreamingListenerBatchSubmitted =>
31.         listener.onBatchSubmitted(batchSubmitted)
32.       case batchStarted: StreamingListenerBatchStarted =>
33.         listener.onBatchStarted(batchStarted)
34.       case batchCompleted: StreamingListenerBatchCompleted =>
35.         listener.onBatchCompleted(batchCompleted)
36.       case outputOperationStarted: StreamingListenerOutputOperationStarted =>
37.         listener.onOutputOperationStarted(outputOperationStarted)
38.       case outputOperationCompleted: StreamingListenerOutputOperationCompleted =>
39.         listener.onOutputOperationCompleted(outputOperationCompleted)
40.       case streamingStarted: StreamingListenerStreamingStarted =>
41.         listener.onStreamingStarted(streamingStarted)
42.       case _ =>
43.     }
44.   }
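As a hedged illustration from the application side (the class name is made up), a custom StreamingListener can be registered through ssc.addStreamingListener and will then receive exactly these events:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the scheduling and processing delay of every completed batch.
class BatchDelayListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms, " +
      s"records=${info.numRecords}")
  }
}

// Usage: ssc.addStreamingListener(new BatchDelayListener)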
Then receiverTracker.start() is called. ReceiverTracker itself is not an RPC endpoint; to communicate it needs a message endpoint, so internally it holds a ReceiverTrackerEndpoint, whose receive method we look at right after the start method below.
ReceiverTracker.scala start method source code:
1. def start(): Unit = synchronized {
2.   if (isTrackerStarted) {
3.     throw new SparkException("ReceiverTracker already started")
4.   }
5. 
6.   if (!receiverInputStreams.isEmpty) {
7.     endpoint = ssc.env.rpcEnv.setupEndpoint(
8.       "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
9.     if (!skipReceiverLaunch) launchReceivers()
10.     logInfo("ReceiverTracker started")
11.     trackerState = Started
12.   }
13. }
14. ……
15. private class ReceiverTrackerEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {
16. ……
17.   override def receive: PartialFunction[Any, Unit] = {
18.     // Local messages
19.     case StartAllReceivers(receivers) =>
20.       val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
21.       for (receiver <- receivers) {
22.         val executors = scheduledLocations(receiver.streamId)
23.         updateReceiverScheduledExecutors(receiver.streamId, executors)
24.         receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
25.         startReceiver(receiver, executors)
26.       }
27.     case RestartReceiver(receiver) =>
28.       // Old scheduled executors minus the ones that are not active any more
29.       val oldScheduledExecutors = getStoredScheduledExecutors(receiver.streamId)
30.       val scheduledLocations = if (oldScheduledExecutors.nonEmpty) {
31.         // Try global scheduling again
32.         oldScheduledExecutors
33.       } else {
34.         val oldReceiverInfo = receiverTrackingInfos(receiver.streamId)
35.         // Clear "scheduledLocations" to indicate we are going to do local scheduling
36.         val newReceiverInfo = oldReceiverInfo.copy(
37.           state = ReceiverState.INACTIVE, scheduledLocations = None)
38.         receiverTrackingInfos(receiver.streamId) = newReceiverInfo
39.         schedulingPolicy.rescheduleReceiver(
40.           receiver.streamId,
41.           receiver.preferredLocation,
42.           receiverTrackingInfos,
43.           getExecutors)
44.       }
45.       // Assume there is one receiver restarting at one time, so we don't need to update
46.       // receiverTrackingInfos
47.       startReceiver(receiver, scheduledLocations)
48.     case c: CleanupOldBlocks =>
49.       receiverTrackingInfos.values.flatMap(_.endpoint).foreach(_.send(c))
50.     case UpdateReceiverRateLimit(streamUID, newRate) =>
51.       for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
52.         eP.send(UpdateRateLimit(newRate))
53.       }
54.     // Remote messages
55.     case ReportError(streamId, message, error) =>
56.       reportError(streamId, message, error)
57.   }
In ReceiverTrackerEndpoint's receive method, the StartAllReceivers message starts all receivers: scheduleReceivers is called with two parameters, the receivers and the executors, to decide on which executors our receivers will be started.
ReceiverSchedulingPolicy.scala source code:
1. def scheduleReceivers(
2.     receivers: Seq[Receiver[_]],
3.     executors: Seq[ExecutorCacheTaskLocation]): Map[Int, Seq[TaskLocation]] = {
4.   if (receivers.isEmpty) {
5.     return Map.empty
6.   }
7. 
8.   if (executors.isEmpty) {
9.     return receivers.map(_.streamId -> Seq.empty).toMap
10.   }
11. 
12.   val hostToExecutors = executors.groupBy(_.host)
13.   val scheduledLocations = Array.fill(receivers.length)(new mutable.ArrayBuffer[TaskLocation])
14.   val numReceiversOnExecutor = mutable.HashMap[ExecutorCacheTaskLocation, Int]()
15.   // Set the initial value to 0
16.   executors.foreach(e => numReceiversOnExecutor(e) = 0)
17. 
18.   // Firstly, we need to respect data locality ("preferredLocation"). So if a receiver has a "preferredLocation", we need to make sure the "preferredLocation" is in the candidate scheduled executor list.
19.   for (i <- 0 until receivers.length) {
20.     // Note: preferredLocation is host but executors are host_executorId
21.     receivers(i).preferredLocation.foreach { host =>
22.       hostToExecutors.get(host) match {
23.         case Some(executorsOnHost) =>
24.           // preferredLocation is a known host. Select an executor that has the least receivers on this host
25.           val leastScheduledExecutor =
26.             executorsOnHost.minBy(executor => numReceiversOnExecutor(executor))
27.           scheduledLocations(i) += leastScheduledExecutor
28.           numReceiversOnExecutor(leastScheduledExecutor) =
29.             numReceiversOnExecutor(leastScheduledExecutor) + 1
30.         case None =>
31.           // preferredLocation is an unknown host. Note that there are two cases:
32.           // 1. This executor is not up, but may be up later. 2. This executor is dead, or it's not a host in the cluster. Currently, simply add the host to the scheduled executors. Note: the host could be an HDFSCacheTaskLocation; in that case, use TaskLocation.apply to handle it.
33.           scheduledLocations(i) += TaskLocation(host)
34.       }
35.     }
36.   }
37. 
38.   // For those receivers that don't have a preferredLocation, make sure we assign at least one executor to them.
39.   for (scheduledLocationsForOneReceiver <- scheduledLocations.filter(_.isEmpty)) {
40.     // Select the executor that has the least receivers
41.     val (leastScheduledExecutor, numReceivers) = numReceiversOnExecutor.minBy(_._2)
42.     scheduledLocationsForOneReceiver += leastScheduledExecutor
43.     numReceiversOnExecutor(leastScheduledExecutor) = numReceivers + 1
44.   }
45. 
46.   // Assign idle executors to receivers that have fewer executors
47.   val idleExecutors = numReceiversOnExecutor.filter(_._2 == 0).map(_._1)
48.   for (executor <- idleExecutors) {
49.     // Assign an idle executor to the receiver that has the fewest candidate executors.
50.     val leastScheduledExecutors = scheduledLocations.minBy(_.size)
51.     leastScheduledExecutors += executor
52.   }
53. 
54.   receivers.map(_.streamId).zip(scheduledLocations).toMap
55. }
We have just analyzed the source of receiverTracker.start(). ReceiverTracker does something very clever: it starts receivers by submitting jobs to the cluster, so each Receiver is launched on an Executor, and ReceiverSchedulingPolicy adds receiver scheduling (versions before Spark 1.6.0 did not have this); the algorithm in scheduleReceivers that distributes Receivers can also be extended by yourself. ReceiverTracker itself does not supervise the Receivers directly: it lives at the Driver level and works indirectly, supervising the Receivers on each Executor through ReceiverSupervisor.
ReceiverTracker's start() method calls launchReceivers.
ReceiverTracker.scala launchReceivers source code:
1. private def launchReceivers(): Unit = {
2.   val receivers = receiverInputStreams.map { nis =>
3.     val rcvr = nis.getReceiver()
4.     rcvr.setReceiverId(nis.id)
5.     rcvr
6.   }
7. 
8.   runDummySparkJob()
9. 
10.   logInfo("Starting " + receivers.length + " receivers")
11.   endpoint.send(StartAllReceivers(receivers))
12. }
launchReceivers calls runDummySparkJob(), which really does run a job: it runs a dummy Spark job to ensure that all slaves have registered, which avoids scheduling all receivers on the same node. The TODO in the source notes that it should instead poll the number of executors and wait according to spark.scheduler.minRegisteredResourcesRatio and spark.scheduler.maxRegisteredResourcesWaitingTime rather than running a dummy job. makeRDD uses a parallelism of 50 and the shuffle a parallelism of 20, so that as many machines as possible are used and the Receivers do not all end up on one machine.
ReceiverTracker.scala runDummySparkJob source code:
1. private def runDummySparkJob(): Unit = {
2.   if (!ssc.sparkContext.isLocal) {
3.     ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
4.   }
5.   assert(getExecutors.nonEmpty)
6. }
Once compute resources are confirmed, StartAllReceivers(receivers) in ReceiverTracker's launchReceivers starts exactly as many receivers as there are; they are submitted via endpoint.send(StartAllReceivers(receivers)), where the endpoint is the ReceiverTrackerEndpoint. All the receivers are then handled in ReceiverTracker's receive method: updateReceiverScheduledExecutors(receiver.streamId, executors) records on which Executors each receiver will start, receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation records the data locality (exactly like an RDD's), and then startReceiver(receiver, executors) starts it.
The startReceiver method starts a receiver on the scheduled executors.
ReceiverTracker.scala startReceiver source code:
1. private def startReceiver(
2.     receiver: Receiver[_],
3.     scheduledLocations: Seq[TaskLocation]): Unit = {
4.   def shouldStartReceiver: Boolean = {
5.     // It's okay to start when trackerState is Initialized or Started
6.     !(isTrackerStopping || isTrackerStopped)
7.   }
8. 
9.   val receiverId = receiver.streamId
10.   if (!shouldStartReceiver) {
11.     onReceiverJobFinish(receiverId)
12.     return
13.   }
14. 
15.   val checkpointDirOption = Option(ssc.checkpointDir)
16.   val serializableHadoopConf =
17.     new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
18. 
19.   // Function to start the receiver on the worker node
20.   val startReceiverFunc: Iterator[Receiver[_]] => Unit =
21.     (iterator: Iterator[Receiver[_]]) => {
22.       if (!iterator.hasNext) {
23.         throw new SparkException(
24.           "Could not start receiver as object not found.")
25.       }
26.       if (TaskContext.get().attemptNumber() == 0) {
27.         val receiver = iterator.next()
28.         assert(iterator.hasNext == false)
29.         val supervisor = new ReceiverSupervisorImpl(
30.           receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
31.         supervisor.start()
32.         supervisor.awaitTermination()
33.       } else {
34.         // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
35.       }
36.     }
37. 
38.   // Create the RDD using the scheduledLocations to run the receiver in a Spark job
39.   val receiverRDD: RDD[Receiver[_]] =
40.     if (scheduledLocations.isEmpty) {
41.       ssc.sc.makeRDD(Seq(receiver), 1)
42.     } else {
43.       val preferredLocations = scheduledLocations.map(_.toString).distinct
44.       ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
45.     }
46.   receiverRDD.setName(s"Receiver $receiverId")
47.   ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
48.   ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
49. 
50.   val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
51.     receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
52.   // We will keep restarting the receiver job until ReceiverTracker is stopped
53.   future.onComplete {
54.     case Success(_) =>
55.       if (!shouldStartReceiver) {
56.         onReceiverJobFinish(receiverId)
57.       } else {
58.         logInfo(s"Restarting Receiver $receiverId")
59.         self.send(RestartReceiver(receiver))
60.       }
61.     case Failure(e) =>
62.       if (!shouldStartReceiver) {
63.         onReceiverJobFinish(receiverId)
64.       } else {
65.         logError("Receiver has been stopped. Try to restart it.", e)
66.         logInfo(s"Restarting Receiver $receiverId")
67.         self.send(RestartReceiver(receiver))
68.       }
69.   }(ThreadUtils.sameThread)
70.   logInfo(s"Receiver ${receiver.streamId} started")
71. }
startReceiverFunc is the "business logic" here: it operates on an RDD, and the job is then submitted with submitJob. This is very ingenious. The success and failure cases are then handled in the future's onComplete callback, which executes on ThreadUtils.sameThread.
ThreadUtils.scala source code:
1. def sameThread: ExecutionContextExecutor = sameThreadExecutionContext
2. ……
3. private val sameThreadExecutionContext =
4.   ExecutionContext.fromExecutorService(MoreExecutors.sameThreadExecutor())
The Executor receives this and goes through the corresponding startup process. The startup logic is wrapped in the function startReceiverFunc, so nothing in the framework needs to change for the Executor to start a receiver: the function is simply executed, and when it runs it starts ReceiverSupervisorImpl. It is a strikingly elegant implementation. Inside that function, supervisor.start() then invokes onStart().
ReceiverSupervisor.scala source code:
1. def start() {
2. onStart()
3. startReceiver()
4. }
5. ……
6. protected def onStart() { }
ReceiverSupervisor's own onStart() has no concrete implementation; its subclass ReceiverSupervisorImpl implements onStart() as follows.
ReceiverSupervisorImpl.scala source code:
1. override protected def onStart() {
2. registeredBlockGenerators.asScala.foreach {_.start() }
3. }
The startReceiver method (defined in the base class ReceiverSupervisor.scala):
1. def startReceiver(): Unit = synchronized {
2.   try {
3.     if (onReceiverStart()) {
4.       logInfo(s"Starting receiver $streamId")
5.       receiverState = Started
6.       receiver.onStart()
7.       logInfo(s"Called receiver $streamId onStart")
8.     } else {
9.       // The driver refused us
10.       stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
11.     }
12.   } catch {
13.     case NonFatal(t) =>
14.       stop("Error starting receiver " + streamId, Some(t))
15.   }
16. }
startReceiver calls onReceiverStart. In ReceiverSupervisor, onReceiverStart has no concrete implementation; its subclass ReceiverSupervisorImpl implements it by sending a message to trackerEndpoint, which is an RPC reference to the ReceiverTracker. Everything here is message passing.
ReceiverSupervisor.scala
1. protected def onReceiverStart(): Boolean
2. ……
3. ReceiverSupervisorImpl.scala
4. override protected def onReceiverStart(): Boolean = {
5.   val msg = RegisterReceiver(
6.     streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
7.   trackerEndpoint.askSync[Boolean](msg)
8. }
9. ……
10. private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)
Back in JobScheduler, the next step is jobGenerator.start(). While generating jobs, the JobGenerator keeps checkpointing, and it constructs an EventLoop for its own message processing.
JobGenerator.scala source code:
1. def start(): Unit = synchronized {
2.   if (eventLoop != null) return // generator has already been started
3. 
4.   // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
5.   // See SPARK-10125
6.   checkpointWriter
7. 
8.   eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
9.     override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
10. 
11.     override protected def onError(e: Throwable): Unit = {
12.       jobScheduler.reportError("Error in job generator", e)
13.     }
14.   }
15.   eventLoop.start()
16. 
17.   if (ssc.isCheckpointPresent) {
18.     restart()
19.   } else {
20.     startFirstTime()
21.   }
22. }
The EventLoop handles events through processEvent(event); messages are posted to it continuously according to the batch interval (a sketch of where those periodic GenerateJobs messages come from follows the listing below):
JobGenerator.scala processEvent source code:
1. private def processEvent(event: JobGeneratorEvent) {
2.   logDebug("Got event " + event)
3.   event match {
4.     case GenerateJobs(time) => generateJobs(time)
5.     case ClearMetadata(time) => clearMetadata(time)
6.     case DoCheckpoint(time, clearCheckpointDataLater) =>
7.       doCheckpoint(time, clearCheckpointDataLater)
8.     case ClearCheckpointData(time) => clearCheckpointData(time)
9.   }
10. }
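A hedged sketch of where the GenerateJobs events come from: JobGenerator holds a recurring timer that fires once per batchDuration and posts a GenerateJobs(time) event onto the event loop. The miniature below captures the idea only; the class, field, and event names are made up and are not the real RecurringTimer API:

import java.util.concurrent.{Executors, TimeUnit}

// Toy stand-in for JobGenerator's recurring timer: every batch interval it posts
// a GenerateJobs-style event onto the generator's event loop.
case class GenerateJobsEvent(timeMs: Long)

class MiniJobTrigger(batchDurationMs: Long, post: GenerateJobsEvent => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = post(GenerateJobsEvent(System.currentTimeMillis()))
    }, batchDurationMs, batchDurationMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdownNow()
}

// Usage sketch: new MiniJobTrigger(5000, e => miniEventLoop.post(e)).start()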
The generateJobs method generates jobs and performs checkpointing for the given time. allocateBlocksToBatch allocates the received data to the batch for that time, graph.generateJobs(time) then generates the Jobs, and a checkpoint follows. If generation succeeds, the jobs are submitted to the JobScheduler.
JobGenerator.scala generateJobs source code:
1. private def generateJobs(time: Time) {
2.   // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
3.   // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
4.   ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
5.   Try {
6.     jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
7.     graph.generateJobs(time) // generate jobs using allocated block
8.   } match {
9.     case Success(jobs) =>
10.       val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
11.       jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
12.     case Failure(e) =>
13.       jobScheduler.reportError("Error generating jobs for time " + time, e)
14.       PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
15.   }
16.   eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
17. }
JobScheduler.scala source code:
1. def submitJobSet(jobSet: JobSet) {
2.   if (jobSet.jobs.isEmpty) {
3.     logInfo("No jobs added for time " + jobSet.time)
4.   } else {
5.     listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
6.     jobSets.put(jobSet.time, jobSet)
7.     jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
8.     logInfo("Added jobs for time " + jobSet.time)
9.   }
10. }
In submitJobSet, jobSet.jobs.foreach hands each job, wrapped in a JobHandler, to jobExecutor.execute; jobExecutor is a thread pool, and execute is implemented in the JDK as follows (a note on how this pool is sized follows the JDK listing below).
ThreadPoolExecutor.java (JDK) source code:
1. public void execute(Runnable command) {
2.     if (command == null)
3.         throw new NullPointerException();
4.     /*
5.      * Proceed in 3 steps:
6.      *
7.      * 1. If fewer than corePoolSize threads are running, try to
8.      * start a new thread with the given command as its first
9.      * task. The call to addWorker atomically checks runState and
10.      * workerCount, and so prevents false alarms that would add
11.      * threads when it shouldn't, by returning false.
12.      *
13.      * 2. If a task can be successfully queued, then we still need
14.      * to double-check whether we should have added a thread
15.      * (because existing ones died since last checking) or that
16.      * the pool shut down since entry into this method. So we
17.      * recheck state and if necessary roll back the enqueuing if
18.      * stopped, or start a new thread if there are none.
19.      *
20.      * 3. If we cannot queue task, then we try to add a new
21.      * thread. If it fails, we know we are shut down or saturated
22.      * and so reject the task.
23.      */
24.     int c = ctl.get();
25.     if (workerCountOf(c) < corePoolSize) {
26.         if (addWorker(command, true))
27.             return;
28.         c = ctl.get();
29.     }
30.     if (isRunning(c) && workQueue.offer(command)) {
31.         int recheck = ctl.get();
32.         if (! isRunning(recheck) && remove(command))
33.             reject(command);
34.         else if (workerCountOf(recheck) == 0)
35.             addWorker(null, false);
36.     }
37.     else if (!addWorker(command, false))
38.         reject(command);
39. }
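For context (a hedged note, not a quote of JobScheduler): the jobExecutor used above is a fixed-size daemon thread pool whose size comes from the configuration key spark.streaming.concurrentJobs, which defaults to 1, so by default the jobs of each batch are executed one at a time. A minimal sketch of creating such a pool with plain JDK facilities (the object name is made up):

import java.util.concurrent.{Executors, ThreadFactory}

// Hedged sketch: a fixed daemon thread pool sized by spark.streaming.concurrentJobs
// (default 1), in the spirit of JobScheduler's jobExecutor.
object JobExecutorSketch {
  def create(conf: org.apache.spark.SparkConf) = {
    val numConcurrentJobs = conf.getInt("spark.streaming.concurrentJobs", 1)
    Executors.newFixedThreadPool(numConcurrentJobs, new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "streaming-job-executor")
        t.setDaemon(true)   // do not keep the JVM alive once user threads finish
        t
      }
    })
  }
}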
JobHandler wraps a Job and is submitted to the jobExecutor thread pool for execution; the current job is a Runnable object.
JobScheduler.scala source code:
1. private class JobHandler(job: Job) extends Runnable with Logging {
2.   import JobScheduler._
3. 
4.   def run() {
5.     val oldProps = ssc.sparkContext.getLocalProperties
6.     try {
7.       ssc.sparkContext.setLocalProperties(SerializationUtils.clone(ssc.savedProperties.get()))
8.       val formattedTime = UIUtils.formatBatchTime(
9.         job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
10.       val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
11.       val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"
12. 
13.       ssc.sc.setJobDescription(
14.         s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
15.       ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
16.       ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
17.       // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
18.       // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
19.       ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
20. 
21.       // We need to assign `eventLoop` to a temp variable. Otherwise, because
22.       // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
23.       // it's possible that when `post` is called, `eventLoop` happens to null.
24.       var _eventLoop = eventLoop
25.       if (_eventLoop != null) {
26.         _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
27.         // Disable checks for existing output directories in jobs launched by the streaming
28.         // scheduler, since we may need to write output to an existing directory during checkpoint
29.         // recovery; see SPARK-4835 for more details.
30.         PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
31.           job.run()
32.         }
33.         _eventLoop = eventLoop
34.         if (_eventLoop != null) {
35.           _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
36.         }
37.       } else {
38.         // JobScheduler has been stopped.
39.       }
40.     } finally {
41.       ssc.sparkContext.setLocalProperties(oldProps)
42.     }
43.   }
44. }
JobHandler's run method calls _eventLoop.post(JobStarted(job, clock.getTimeMillis())); let's look at the eventLoop.
JobScheduler.scala source code:
1. private var eventLoop: EventLoop[JobSchedulerEvent] = null
2. ……
3. def start(): Unit = synchronized {
4.   if (eventLoop != null) return // scheduler has already been started
5. 
6.   logDebug("Starting JobScheduler")
7.   eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
8.     override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
9. 
10.     override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
11.   }
12.   eventLoop.start()
13. ……
When processEvent pattern-matches JobStarted(job, startTime), it calls handleJobStart(job, startTime), which in turn calls jobSet.handleJobStart, keyed by time.
JobHandler's run method then calls job.run() to run the Job.