1. As usual, let's start from an example and trace through the source code.
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 40 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2. ssc.start() calls JobScheduler.start().
private[streaming] class JobScheduler(val ssc: StreamingContext) extends Logging {
  ...
  val listenerBus = new StreamingListenerBus()
  ...
  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started
    // For how EventLoop is handled, see "Spark Streaming source analysis: how JobScheduler's
    // JobStarted/JobCompleted get called"
    logDebug("Starting JobScheduler")
    // JobSchedulerEvents are put on a LinkedBlockingDeque
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    for (
      inputDStream <- ssc.graph.getInputStreams;
      rateController <- inputDStream.rateController
    ) {
      ssc.addStreamingListener(rateController)
    }

    listenerBus.start(ssc.sparkContext)
    // ReceiverTracker handles ReceiverInputDStream data sources such as SocketInputDStream,
    // FlumePollingInputDStream, FlumeInputDStream, etc. -- see the subclasses of ReceiverInputDStream
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }
3. A few words about RateController: it is a subclass of StreamingListener, i.e. one of the listeners managed by the ListenerBus.
a. Its role: based on the statistics of the previous batch's job, it estimates the rate at which the next batch should receive data, which makes it well suited to streams with large and fluctuating data volumes. The rateController is then added to the listenerBus.
b. By default Streaming does not use a rateController to dynamically throttle incoming data; rateController is None, so the for-comprehension above never enters its body. It can be enabled by setting spark.streaming.backpressure.enabled to true.
abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext) extends InputDStream[T](ssc_) {
  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    // By default backpressure is disabled (the flag is false), so Streaming receives data at the
    // maximum rate within each batch interval. If the incoming volume stays very large,
    // the executors may run out of memory (OOM).
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new ReceiverRateController(id, RateEstimator.create(ssc.conf, ssc.graph.batchDuration)))
    } else {
      None
    }
  }
c. Step into RateController.isBackPressureEnabled():
object RateController {
  // By default the receiving rate is bounded by spark.streaming.receiver.maxRate
  // (or spark.streaming.kafka.maxRatePerPartition); if maxRate is 0 or negative, the rate is unlimited.
  // spark.streaming.backpressure.enabled defaults to false, i.e. the receiving rate
  // is not estimated dynamically.
  def isBackPressureEnabled(conf: SparkConf): Boolean =
    conf.getBoolean("spark.streaming.backpressure.enabled", false)
}
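The guarded construction above is a small pattern of its own: the controller is built only when a boolean flag (default false) is enabled, otherwise the Option stays empty and JobScheduler.start registers nothing. A minimal Java sketch of that pattern, using a plain Map as a stand-in for SparkConf (RateControllerFactory and the String stand-in for the controller are illustrative, not Spark's API):

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the rateController pattern: construct the controller only when the
// backpressure flag (default false) is set. The conf is a plain map here.
class RateControllerFactory {
    static boolean isBackPressureEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(
            conf.getOrDefault("spark.streaming.backpressure.enabled", "false"));
    }

    static Optional<String> rateController(Map<String, String> conf) {
        // Stand-in: a real RateController would be constructed here.
        return isBackPressureEnabled(conf)
            ? Optional.of("rateController")
            : Optional.empty();
    }
}
```

With an empty conf the Optional is empty; only an explicit "true" produces a controller.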
4. How does Spark Streaming use the ListenerBus trait from Spark Core to manage StreamingListeners and StreamingListenerEvents, thereby achieving the listening behavior?
a. All StreamingListeners are first put into a CopyOnWriteArrayList, a collection well suited to highly concurrent add/set operations.
b. When a subclass calls postToAll(event: StreamingListenerEvent), the source shows that onPostEvent(listener, event) is invoked for every StreamingListener stored in the CopyOnWriteArrayList; onPostEvent itself is implemented by the ListenerBus subclass StreamingListenerBus.
/**
 * An event bus which posts events to its listeners.
 */
private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging {

  // Marked `private[spark]` for access in tests.
  private[spark] val listeners = new CopyOnWriteArrayList[L]

  /**
   * Add a listener to listen events. This method is thread-safe and can be called in any thread.
   */
  final def addListener(listener: L) {
    listeners.add(listener)
  }

  /**
   * Post the event to all registered listeners. The `postToAll` caller should guarantee calling
   * `postToAll` in the same thread for all events.
   */
  final def postToAll(event: E): Unit = {
    // JavaConverters can create a JIterableWrapper if we use asScala.
    // However, this method will be called frequently. To avoid the wrapper cost, here we use
    // Java Iterator directly.
    val iter = listeners.iterator
    while (iter.hasNext) {
      val listener = iter.next()
      try {
        // By default JobGenerator registers no RateController, but there are always
        // SparkUI-related listeners such as StreamingJobProgressListener, plus JobProgressListener,
        // EnvironmentListener, StorageStatusListener, ExecutorsListener, StorageListener,
        // RDDOperationGraphListener, HeartbeatReceiver.
        // event can be one of many types, e.g. StreamingListenerBatchStarted.
        onPostEvent(listener, event)
      } catch {
        case NonFatal(e) =>
          logError(s"Listener ${Utils.getFormattedClassName(listener)} threw an exception", e)
      }
    }
  }

  /**
   * Post an event to the specified listener. `onPostEvent` is guaranteed to be called in the same thread.
   */
  def onPostEvent(listener: L, event: E): Unit
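The core of this trait is plain JDK concurrency, so the same pattern can be sketched in a few lines of Java (MiniListenerBus and its names are illustrative, not Spark's): a CopyOnWriteArrayList holds the listeners, addListener is lock-free for readers, and postToAll iterates with a raw Iterator while isolating the bus from any single listener's failure.

```java
import java.util.Iterator;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Minimal sketch of the ListenerBus pattern: thread-safe listener
// registration plus a postToAll that shields the bus from listener errors.
class MiniListenerBus<L, E> {
    private final CopyOnWriteArrayList<L> listeners = new CopyOnWriteArrayList<>();
    private final BiConsumer<L, E> onPostEvent; // dispatch logic, supplied by the "subclass"

    MiniListenerBus(BiConsumer<L, E> onPostEvent) { this.onPostEvent = onPostEvent; }

    void addListener(L listener) { listeners.add(listener); }

    void postToAll(E event) {
        // Plain Iterator, as in Spark, to avoid wrapper overhead on a hot path.
        Iterator<L> iter = listeners.iterator();
        while (iter.hasNext()) {
            L listener = iter.next();
            try {
                onPostEvent.accept(listener, event);
            } catch (RuntimeException e) {
                // One misbehaving listener must not stop delivery to the others.
                System.err.println("Listener " + listener.getClass().getName() + " threw: " + e);
            }
        }
    }
}
```

Because CopyOnWriteArrayList iterators work on a snapshot, a listener added concurrently with a postToAll simply misses that one event instead of causing a ConcurrentModificationException.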
5. Tracing how the SparkUI listener StreamingJobProgressListener gets added to the ListenerBus's CopyOnWriteArrayList:
a. Instantiating a StreamingContext initializes the StreamingJobProgressListener shown below; StreamingTab then adds it to the ListenerBus's CopyOnWriteArrayList.
private[streaming] val progressListener = new StreamingJobProgressListener(this)

// By default StreamingTab is initialized; it backs the Spark UI and takes the
// StreamingJobProgressListener as one of its members.
private[streaming] val uiTab: Option[StreamingTab] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(new StreamingTab(this))
  } else {
    None
  }
b. When StreamingTab is initialized, it calls ssc.addStreamingListener():
private[spark] class StreamingTab(val ssc: StreamingContext)
  extends SparkUITab(getSparkUI(ssc), "streaming") with Logging {

  private val STATIC_RESOURCE_DIR = "org/apache/spark/streaming/ui/static"

  val parent = getSparkUI(ssc)
  // The UI-related listener: StreamingJobProgressListener
  val listener = ssc.progressListener

  ssc.addStreamingListener(listener)
  ssc.sc.addSparkListener(listener)
  attachPage(new StreamingPage(this))
  attachPage(new BatchPage(this))
c. The method below simply delegates to ListenerBus.addListener:
/** Add a [[org.apache.spark.streaming.scheduler.StreamingListener]] object for
 * receiving system events related to streaming.
 */
def addStreamingListener(streamingListener: StreamingListener) {
  // listenerBus is the StreamingListenerBus
  scheduler.listenerBus.addListener(streamingListener)
}
==> The corresponding implementation in the source:
private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging {

  // Marked `private[spark]` for access in tests.
  private[spark] val listeners = new CopyOnWriteArrayList[L]

  /**
   * Add a listener to listen events. This method is thread-safe and can be called in any thread.
   */
  final def addListener(listener: L) {
    listeners.add(listener)
  }
6. Next comes starting the listening itself. In JobScheduler's start method you can find StreamingListenerBus.start():
private[streaming] class JobScheduler(val ssc: StreamingContext) extends Logging {
  ...
  val listenerBus = new StreamingListenerBus()
  ...
  def start(): Unit = synchronized {
    listenerBus.start(ssc.sparkContext)
    ...
  }
a. Note that the start method is implemented by StreamingListenerBus's parent class, AsynchronousListenerBus:
def start(sc: SparkContext) {
  // started is an AtomicBoolean. compareAndSet(false, true) atomically sets it to true
  // only if its current value is false, so exactly one caller enters the if branch;
  // any subsequent call throws IllegalStateException.
  if (started.compareAndSet(false, true)) {
    sparkContext = sc
    listenerThread.start()
  } else {
    throw new IllegalStateException(s"$name already started!")
  }
}
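This start-once guard is a standard AtomicBoolean idiom and can be isolated in a few lines of Java (StartOnce is an illustrative name, not Spark's):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the start-once guard: compareAndSet(false, true) succeeds for
// exactly one caller, so the worker thread is started at most once.
class StartOnce {
    private final AtomicBoolean started = new AtomicBoolean(false);

    void start() {
        if (started.compareAndSet(false, true)) {
            // first (and only) successful call: the listener thread would be started here
        } else {
            throw new IllegalStateException("already started!");
        }
    }
}
```

Even if two threads race into start() simultaneously, only one compareAndSet can win; the loser sees false and throws.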
b. Look at the definition of listenerThread. This thread waits until a permit is available on the semaphore (a permit is released only when a StreamingListenerEvent is posted),
c. then takes a StreamingListenerEvent from the eventQueue and calls ListenerBus's postToAll method.
// A counter that represents the number of events produced and consumed in the queue
private val eventLock = new Semaphore(0)
private val listenerThread = new Thread(name) {
  setDaemon(true)
  // Runs the curried {} block; if a non-ControlThrowable is caught, the SparkContext is stopped,
  // which also terminates this while(true) loop.
  override def run(): Unit = Utils.tryOrStopSparkContext(sparkContext) {
    AsynchronousListenerBus.withinListenerThread.withValue(true) {
      while (true) {
        // The semaphore starts with 0 permits, so the thread blocks here until someone calls
        // Semaphore.release() -- which is exactly what post() below does.
        eventLock.acquire()
        self.synchronized {
          processingEvent = true
        }
        try {
          // poll retrieves and removes the head of the queue, or returns null if the queue is empty
          val event = eventQueue.poll
          if (event == null) {
            // Get out of the while loop and shutdown the daemon thread.
            // Polling null with a permit available only happens when the bus has been stopped,
            // so stopped must be true here.
            if (!stopped.get) {
              throw new IllegalStateException("Polling `null` from eventQueue means" +
                " the listener bus has been stopped. So `stopped` must be true")
            }
            return
          }
          // ListenerBus.postToAll calls onPostEvent, which is implemented by the concrete
          // subclass StreamingListenerBus; event is a concrete subclass of StreamingListenerEvent
          postToAll(event)
        } finally {
          self.synchronized {
            processingEvent = false
          }
        }
      }
    }
  }
}
d. In JobScheduler, handlers such as handleJobStart and handleJobCompletion repeatedly call this post method through the StreamingListenerBus:
def post(event: E) {
  if (stopped.get) {
    // Drop further events to make `listenerThread` exit ASAP
    logError(s"$name has already stopped! Dropping event $event")
    return
  }
  // offer inserts the element at the tail of the queue: true on success,
  // false if the queue is full and the element cannot be inserted
  val eventAdded = eventQueue.offer(event)
  if (eventAdded) {
    // After a successful insert, release one permit to wake a thread blocked on eventLock
    eventLock.release()
  } else {
    onDropEvent(event)
  }
}
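Points b through d describe a semaphore-gated producer/consumer loop: post() enqueues and releases a permit, the daemon thread acquires a permit and polls. That core can be reproduced as a self-contained Java sketch (AsyncEventLoop and its names are illustrative, not Spark's; error handling and onDropEvent are omitted for brevity):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Sketch of the AsynchronousListenerBus core: post() enqueues an event and
// releases one permit; a daemon thread acquires a permit, polls the queue,
// and dispatches. A null poll after stop() tells the thread to exit.
class AsyncEventLoop<E> {
    private final LinkedBlockingQueue<E> eventQueue = new LinkedBlockingQueue<>();
    private final Semaphore eventLock = new Semaphore(0); // permits == queued events
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Thread listenerThread;

    AsyncEventLoop(Consumer<E> postToAll) {
        listenerThread = new Thread(() -> {
            while (true) {
                try {
                    eventLock.acquire();       // block until post() or stop() releases a permit
                } catch (InterruptedException ie) {
                    return;
                }
                E event = eventQueue.poll();
                if (event == null) return;     // stop() released a permit with no event queued
                postToAll.accept(event);
            }
        }, "listener-thread");
        listenerThread.setDaemon(true);
        listenerThread.start();
    }

    void post(E event) {
        if (stopped.get()) return;             // drop events after stop
        if (eventQueue.offer(event)) eventLock.release();
    }

    void stop() {
        stopped.set(true);
        eventLock.release();                   // extra permit so the thread polls null and exits
        try {
            listenerThread.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the permit count always matches the number of queued events, the consumer never busy-waits, and stop()'s single extra permit is what makes the thread observe the null poll and exit cleanly.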
e. JobScheduler's handleJobStart method posts a StreamingListenerBatchStarted event object through the StreamingListenerBus, which in turn triggers the postToAll call in the daemon listenerThread above:
// Handles the job-start event
private def handleJobStart(job: Job, startTime: Long) {
  val jobSet = jobSets.get(job.time)
  val isFirstJobOfJobSet = !jobSet.hasStarted
  jobSet.handleJobStart(job)
  if (isFirstJobOfJobSet) {
    // "StreamingListenerBatchStarted" should be posted after calling "handleJobStart"
    // to get the correct "jobSet.processingStartTime".
    listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
  }
  job.setStartTime(startTime)
  listenerBus.post(StreamingListenerOutputOperationStarted(job.toOutputOperationInfo))
  logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
}
f. postToAll(event: StreamingListenerEvent) invokes onPostEvent(listener, event) on every StreamingListener stored in the CopyOnWriteArrayList; onPostEvent is implemented by the ListenerBus subclass StreamingListenerBus.
private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging {

  // Marked `private[spark]` for access in tests.
  private[spark] val listeners = new CopyOnWriteArrayList[L]
  ...
  final def postToAll(event: E): Unit = {
    // JavaConverters can create a JIterableWrapper if we use asScala.
    // However, this method will be called frequently. To avoid the wrapper cost, here we use
    // Java Iterator directly.
    val iter = listeners.iterator
    while (iter.hasNext) {
      val listener = iter.next()
      try {
        onPostEvent(listener, event)
      } catch {
        case NonFatal(e) =>
          logError(s"Listener ${Utils.getFormattedClassName(listener)} threw an exception", e)
      }
    }
  }
  ...
7. onPostEvent is implemented by the ListenerBus subclass StreamingListenerBus. Looking at the source: when the posted event is a StreamingListenerBatchStarted, it calls listener.onBatchStarted:
private[spark] class StreamingListenerBus
  extends AsynchronousListenerBus[StreamingListener, StreamingListenerEvent]("StreamingListenerBus")
  with Logging {

  private val logDroppedEvent = new AtomicBoolean(false)

  override def onPostEvent(listener: StreamingListener, event: StreamingListenerEvent): Unit = {
    event match {
      case receiverStarted: StreamingListenerReceiverStarted =>
        listener.onReceiverStarted(receiverStarted)
      case receiverError: StreamingListenerReceiverError =>
        listener.onReceiverError(receiverError)
      case receiverStopped: StreamingListenerReceiverStopped =>
        listener.onReceiverStopped(receiverStopped)
      case batchSubmitted: StreamingListenerBatchSubmitted =>
        listener.onBatchSubmitted(batchSubmitted)
      case batchStarted: StreamingListenerBatchStarted =>
        listener.onBatchStarted(batchStarted)
      case batchCompleted: StreamingListenerBatchCompleted =>
        listener.onBatchCompleted(batchCompleted)
      case outputOperationStarted: StreamingListenerOutputOperationStarted =>
        listener.onOutputOperationStarted(outputOperationStarted)
      case outputOperationCompleted: StreamingListenerOutputOperationCompleted =>
        listener.onOutputOperationCompleted(outputOperationCompleted)
      case _ =>
    }
  }
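The dispatch above is type-based routing: match the concrete event class, call the matching callback, and silently ignore unknown events (`case _ =>`). The same shape in Java uses instanceof checks (all names below are illustrative, not Spark's):

```java
// Sketch of StreamingListenerBus-style dispatch: onPostEvent matches on the
// concrete event type and routes it to the corresponding listener callback.
interface BatchListener {
    default void onBatchStarted(String batchInfo) {}
    default void onBatchCompleted(String batchInfo) {}
}

abstract class Event {}
class BatchStarted extends Event { final String info; BatchStarted(String i) { info = i; } }
class BatchCompleted extends Event { final String info; BatchCompleted(String i) { info = i; } }

class Dispatcher {
    static void onPostEvent(BatchListener listener, Event event) {
        if (event instanceof BatchStarted) {
            listener.onBatchStarted(((BatchStarted) event).info);
        } else if (event instanceof BatchCompleted) {
            listener.onBatchCompleted(((BatchCompleted) event).info);
        } // unknown event types are ignored, like Spark's `case _ =>`
    }
}
```

The default methods on the listener interface mirror the StreamingListener trait's no-op defaults: a concrete listener overrides only the callbacks it cares about.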
a. Look at the onBatchStarted method of the SparkUI listener StreamingJobProgressListener:
override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = synchronized {
  val batchUIData = BatchUIData(batchStarted.batchInfo)
  runningBatchUIData(batchStarted.batchInfo.batchTime) = BatchUIData(batchStarted.batchInfo)
  waitingBatchUIData.remove(batchStarted.batchInfo.batchTime)
  totalReceivedRecords += batchUIData.numRecords
}
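In essence this callback performs a state transition for the UI: move the batch from "waiting" to "running" and add its record count to a running total. A minimal Java sketch of that bookkeeping (ProgressTracker and its fields are illustrative names, not Spark's):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of what StreamingJobProgressListener.onBatchStarted does: move the
// batch from the waiting map to the running map and accumulate its records.
class ProgressTracker {
    final Map<Long, Long> waiting = new HashMap<>(); // batchTime -> numRecords
    final Map<Long, Long> running = new HashMap<>();
    long totalReceivedRecords = 0;

    // synchronized, mirroring the listener: callbacks may race with UI page reads
    synchronized void onBatchStarted(long batchTime, long numRecords) {
        running.put(batchTime, numRecords);
        waiting.remove(batchTime);
        totalReceivedRecords += numRecords;
    }
}
```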
That completes the analysis of Spark's listening mechanism. The techniques behind it are:
a. A semaphore shared between threads watches for incoming StreamingListenerEvents; when an event is posted, postToAll makes every registered StreamingListener trigger its onPostEvent method.
b. StreamingListenerBus uses pattern matching to cover every StreamingListenerEvent subclass, with a dedicated callback handling each kind of event.
c. The StreamingListener subclasses implement the callbacks that handle the different StreamingListenerEvents.
d. Finally, the business logic posts the event matching its current state, and the daemon thread dispatches it.
=================== Well worth studying ==========================