Spark Streaming case study: NetworkWordCount -- how Spark uses ListenerBus to achieve a JavaScript-style event-listener effect

1. As usual, we start from the example and trace our way through the source

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    // Create the context with a 40 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in the input stream of \n delimited text (e.g. generated by 'nc')
    // Note that skipping replication in the storage level is fine only when running locally.
    // Replication is necessary in a distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
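
To feed the socket stream, run a netcat server on the target host (e.g. `nc -lk 9999`) and type lines into it; each 40-second batch then prints its word counts.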

2. ssc.start() ==> calls JobScheduler.start()

private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {
  ...
  val listenerBus = new StreamingListenerBus()
  ...
  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started
    // For how the EventLoop processes events, see "Spark Streaming source analysis:
    // how JobScheduler's JobStarted and JobCompleted get called"
    logDebug("Starting JobScheduler")
    // JobSchedulerEvents are put onto a LinkedBlockingDeque
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()
    for (
      inputDStream <- ssc.graph.getInputStreams;
      rateController <- inputDStream.rateController
    ) {
      ssc.addStreamingListener(rateController)
    }
    listenerBus.start(ssc.sparkContext)
    // Handles ReceiverInputDStream sources such as SocketInputDStream, FlumePollingInputDStream,
    // FlumeInputDStream, etc. -- see the subclasses of ReceiverInputDStream
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }
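
Note that the for-comprehension above only registers a listener when inputDStream.rateController is Some(...); a None contributes nothing. A minimal standalone sketch (not Spark code, for the Scala REPL) of that Option-filtering behavior:

val rateControllers = Seq(Some("rc-for-stream-0"), None, Some("rc-for-stream-2"))
for (rcOpt <- rateControllers; rc <- rcOpt) {
  println(s"registering $rc") // prints rc-for-stream-0 and rc-for-stream-2 only; the None is skipped
}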
 

 

3. A quick look at RateController: it is a subclass of StreamingListener, one of the kinds of listeners managed by the ListenerBus.

a. Its role: using the data statistics from the previous batch's jobs, it estimates the rate at which the next batch should receive data, which makes it ideal for streams with heavy and highly fluctuating volume. The rateController is then added to the listenerBus.

b. By default, streaming does not use a rateController to dynamically throttle the data it receives: rateController is None, so the body of the for-comprehension in JobScheduler.start is never entered. Set spark.streaming.backpressure.enabled to true to turn it on.

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {

  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    // By default (false) back pressure is not enabled, so streaming receives data within the
    // current batch at the maximum rate; if the incoming volume stays very large, an OOM may occur.
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new ReceiverRateController(id, RateEstimator.create(ssc.conf, ssc.graph.batchDuration)))
    } else {
      None
    }
  }

c. Step into RateController.isBackPressureEnabled():

object RateController {
  // By default the receiving rate is bounded by spark.streaming.receiver.maxRate
  // (or spark.streaming.kafka.maxRatePerPartition); if maxRate is 0 or negative, the rate is unlimited.
  // spark.streaming.backpressure.enabled defaults to false, i.e. the receiving rate is not estimated dynamically.
  def isBackPressureEnabled(conf: SparkConf): Boolean =
    conf.getBoolean("spark.streaming.backpressure.enabled", false)
}
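
For reference, a minimal sketch of turning back pressure on; this is a fragment to merge into the driver's SparkConf setup, and the maxRate value is a made-up illustration:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("NetworkWordCount")
  .set("spark.streaming.backpressure.enabled", "true") // let the RateController estimate the ingest rate
  .set("spark.streaming.receiver.maxRate", "10000")    // hypothetical hard cap in records/sec; <= 0 means unlimited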

 

4. How does Spark Streaming use Spark Core's ListenerBus trait to manage StreamingListeners and StreamingListenerEvents, and thereby achieve the listening effect?

 

a. First, every StreamingListener is put into a CopyOnWriteArrayList. Since each mutation copies the underlying array, this collection suits workloads that are traversed far more often than they are modified, which is exactly the pattern here: listeners are added a few times but iterated on every event.

b. When a subclass calls postToAll(event: StreamingListenerEvent), the source shows that onPostEvent(listener, event) is invoked for every StreamingListener held in the CopyOnWriteArrayList; onPostEvent itself is implemented by the ListenerBus subclass StreamingListenerBus.

/**
 * An event bus which posts events to its listeners.
 */
private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging {
  // Marked `private[spark]` for access in tests.
  private[spark] val listeners = new CopyOnWriteArrayList[L]

  /**
   * Add a listener to listen to events. This method is thread-safe and can be called in any thread.
   */
  final def addListener(listener: L) {
    listeners.add(listener)
  }

  /**
   * Post the event to all registered listeners. The `postToAll` caller should guarantee calling
   * `postToAll` in the same thread for all events.
   */
  final def postToAll(event: E): Unit = {
    // JavaConverters can create a JIterableWrapper if we use asScala.
    // However, this method will be called frequently. To avoid the wrapper cost, here we use
    // the Java iterator directly.
    val iter = listeners.iterator
    while (iter.hasNext) {
      val listener = iter.next()
      try {
        // listeners: by default there is no RateController, but there are always UI-related
        // listeners such as StreamingJobProgressListener, plus JobProgressListener,
        // EnvironmentListener, StorageStatusListener, ExecutorsListener, StorageListener,
        // RDDOperationGraphListener and HeartbeatReceiver.
        // event: one of many types, e.g. StreamingListenerBatchStarted.
        onPostEvent(listener, event)
      } catch {
        case NonFatal(e) =>
          logError(s"Listener ${Utils.getFormattedClassName(listener)} threw an exception", e)
      }
    }
  }

  /**
   * Post an event to the specified listener. `onPostEvent` is guaranteed to be called in the same thread.
   */
  def onPostEvent(listener: L, event: E): Unit

5. Tracing how the Spark UI listener StreamingJobProgressListener gets added to the ListenerBus's CopyOnWriteArrayList

a. Instantiating a StreamingContext initializes the StreamingJobProgressListener below; StreamingTab then adds it to the ListenerBus's CopyOnWriteArrayList:

private[streaming] val progressListener = new StreamingJobProgressListener(this)
// By default a StreamingTab is initialized; it backs the Spark UI and takes the
// StreamingJobProgressListener as one of its members.
private[streaming] val uiTab: Option[StreamingTab] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(new StreamingTab(this))
  } else {
    None
  }
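
As an aside, setting spark.ui.enabled to false makes uiTab None, so no StreamingJobProgressListener ever reaches the bus. A one-line fragment showing that:

import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.ui.enabled", "false") // uiTab becomes None: no streaming tab, no UI listener on the bus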

b. When StreamingTab is initialized, it calls ssc.addStreamingListener():

private[spark] class StreamingTab(val ssc: StreamingContext)
  extends SparkUITab(getSparkUI(ssc), "streaming") with Logging {

  private val STATIC_RESOURCE_DIR = "org/apache/spark/streaming/ui/static"

  val parent = getSparkUI(ssc)
  // Gets the UI-related listener: StreamingJobProgressListener
  val listener = ssc.progressListener

  ssc.addStreamingListener(listener)
  ssc.sc.addSparkListener(listener)
  attachPage(new StreamingPage(this))
  attachPage(new BatchPage(this))

c. The method below simply delegates to addListener on the ListenerBus:

/** Add a [[org.apache.spark.streaming.scheduler.StreamingListener]] object for
  * receiving system events related to streaming.
  */
def addStreamingListener(streamingListener: StreamingListener) {
  // listenerBus here is the StreamingListenerBus
  scheduler.listenerBus.addListener(streamingListener)
}

==> The corresponding implementation in the source:

private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging {

  // Marked `private[spark]` for access in tests. 
  private[spark] val listeners = new CopyOnWriteArrayList[L]
  /**
   * Add a listener to listen to events. This method is thread-safe and can be called in any thread.
   */
  final def addListener(listener: L) {
    listeners.add(listener)
  }

 

6. Next comes starting the listening: in JobScheduler's start method we find listenerBus.start(ssc.sparkContext)

private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {
  ...
  val listenerBus = new StreamingListenerBus()
  ...
  def start(): Unit = synchronized {
    listenerBus.start(ssc.sparkContext)
    ...
  }

a. The start method turns out to be implemented by StreamingListenerBus's parent class AsynchronousListenerBus:

def start(sc: SparkContext) {
  // started is an AtomicBoolean. compareAndSet(false, true) atomically sets it to true
  // only if it is still false, so the bus can be started at most once; a second call
  // falls through to the IllegalStateException below.
  if (started.compareAndSet(false, true)) {
    sparkContext = sc
    listenerThread.start()
  } else {
    throw new IllegalStateException(s"$name already started!")
  }
}
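
A tiny standalone sketch (not Spark code, for the Scala REPL) of those compareAndSet semantics:

import java.util.concurrent.atomic.AtomicBoolean

val started = new AtomicBoolean(false)
println(started.compareAndSet(false, true)) // true: it was false, now flipped to true
println(started.compareAndSet(false, true)) // false: already true, no update -- "already started"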

b. Look at the definition of the listenerThread: this thread waits until a permit is available on the semaphore (a permit is released only when a StreamingListenerEvent is posted),

c. then takes the StreamingListenerEvent from the eventQueue and calls ListenerBus's postToAll method:

 

// A counter that represents the number of events produced and consumed in the queue
private val eventLock = new Semaphore(0)

private val listenerThread = new Thread(name) {
  setDaemon(true)
  // Utils.tryOrStopSparkContext runs the block; if a non-ControlThrowable escapes it,
  // the SparkContext is stopped, which also ends this while(true) loop.
  override def run(): Unit = Utils.tryOrStopSparkContext(sparkContext) {
    AsynchronousListenerBus.withinListenerThread.withValue(true) {
      while (true) {
        // The semaphore starts with 0 permits, so the thread blocks here until someone
        // calls release() -- which is what post() below does for every event.
        eventLock.acquire()
        self.synchronized {
          processingEvent = true
        }
        try {
          // poll retrieves and removes the head of the queue, returning null if it is empty
          val event = eventQueue.poll
          if (event == null) {
            // Get out of the while loop and shutdown the daemon thread.
            // A null event can only mean stop() released a final permit on an empty queue,
            // so `stopped` must already be true; leave the loop and let the daemon thread exit.
            if (!stopped.get) {
              throw new IllegalStateException("Polling `null` from eventQueue means" +
                " the listener bus has been stopped. So `stopped` must be true")
            }
            return
          }
          // postToAll in the ListenerBus trait calls onPostEvent, which is implemented by the
          // concrete subclass StreamingListenerBus; event is a concrete StreamingListenerEvent subclass.
          postToAll(event)
        } finally {
          self.synchronized {
            processingEvent = false
          }
        }
      }
    }
  }
}
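
To see the handshake in isolation, here is a minimal standalone sketch (a hypothetical MiniBus object, not Spark code) of the same Semaphore-plus-queue pattern:

import java.util.concurrent.{ConcurrentLinkedQueue, Semaphore}

object MiniBus {
  private val queue = new ConcurrentLinkedQueue[String]()
  private val permits = new Semaphore(0) // 0 permits: the consumer blocks until a post

  def post(event: String): Unit = {
    queue.offer(event)   // enqueue first...
    permits.release()    // ...then release one permit per event
  }

  def main(args: Array[String]): Unit = {
    val consumer = new Thread {
      override def run(): Unit = while (true) {
        permits.acquire()                 // blocks until post() releases a permit
        println(s"handling ${queue.poll()}")
      }
    }
    consumer.setDaemon(true)
    consumer.start()
    post("StreamingListenerBatchStarted")
    post("StreamingListenerBatchCompleted")
    Thread.sleep(100) // give the daemon time to drain before main exits
  }
}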
 
 

d. In JobScheduler, the handlers for events such as JobStarted and JobCompleted repeatedly call post on the StreamingListenerBus:

 
def post(event: E) {
  if (stopped.get) {
    // Drop further events to make `listenerThread` exit ASAP
    logError(s"$name has already stopped! Dropping event $event")
    return
  }
  // offer inserts the event at the tail of the queue, returning true on success;
  // if the queue is full the insert fails and offer returns false.
  val eventAdded = eventQueue.offer(event)
  if (eventAdded) {
    // After a successful insert, release one permit so a thread blocked on eventLock can proceed
    eventLock.release()
  } else {
    onDropEvent(event)
  }
}
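
In this generation of the source, eventQueue is a bounded LinkedBlockingQueue (10,000 events), so offer really can fail when listeners cannot keep up; such events are dropped and reported via onDropEvent rather than blocking the caller.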

e. JobScheduler's handleJobStart method posts a StreamingListenerBatchStarted event object via the StreamingListenerBus, prompting the listenerThread daemon above to trigger postToAll:

// Handles the job-start event
private def handleJobStart(job: Job, startTime: Long) {
  val jobSet = jobSets.get(job.time)
  val isFirstJobOfJobSet = !jobSet.hasStarted
  jobSet.handleJobStart(job)
  if (isFirstJobOfJobSet) {
    // "StreamingListenerBatchStarted" should be posted after calling "handleJobStart"
    // to get the correct "jobSet.processingStartTime".
    listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
  }
  job.setStartTime(startTime)
  listenerBus.post(StreamingListenerOutputOperationStarted(job.toOutputOperationInfo))
  logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
}

 

f. postToAll(event: StreamingListenerEvent) invokes onPostEvent(listener, event) on every StreamingListener held in the CopyOnWriteArrayList; onPostEvent is implemented by the ListenerBus subclass StreamingListenerBus.


private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging {
  // Marked `private[spark]` for access in tests.
  private[spark] val listeners = new CopyOnWriteArrayList[L]

  final def postToAll(event: E): Unit = {
    // JavaConverters can create a JIterableWrapper if we use asScala.
    // However, this method will be called frequently. To avoid the wrapper cost, here we use
    // the Java iterator directly.
    val iter = listeners.iterator
    while (iter.hasNext) {
      val listener = iter.next()
      try {
        onPostEvent(listener, event)
      } catch {
        case NonFatal(e) =>
          logError(s"Listener ${Utils.getFormattedClassName(listener)} threw an exception", e)
      }
    }
  }

 

7. onPostEvent is implemented by the ListenerBus subclass StreamingListenerBus. Looking at the source: when the posted event is a StreamingListenerBatchStarted, listener.onBatchStarted is called:

private[spark] class StreamingListenerBus
  extends AsynchronousListenerBus[StreamingListener, StreamingListenerEvent]("StreamingListenerBus")
  with Logging {

  private val logDroppedEvent = new AtomicBoolean(false)
  override def onPostEvent(listener: StreamingListener, event: StreamingListenerEvent): Unit = {
    event match {
      case receiverStarted: StreamingListenerReceiverStarted =>
        listener.onReceiverStarted(receiverStarted)
      case receiverError: StreamingListenerReceiverError =>
        listener.onReceiverError(receiverError)
      case receiverStopped: StreamingListenerReceiverStopped =>
        listener.onReceiverStopped(receiverStopped)
      case batchSubmitted: StreamingListenerBatchSubmitted =>
        listener.onBatchSubmitted(batchSubmitted)
      case batchStarted: StreamingListenerBatchStarted =>
        listener.onBatchStarted(batchStarted)
      case batchCompleted: StreamingListenerBatchCompleted =>
        listener.onBatchCompleted(batchCompleted)
      case outputOperationStarted: StreamingListenerOutputOperationStarted =>
        listener.onOutputOperationStarted(outputOperationStarted)
      case outputOperationCompleted: StreamingListenerOutputOperationCompleted =>
        listener.onOutputOperationCompleted(outputOperationCompleted)
      case _ =>
    }
  }

 

a. Look at how the Spark UI's StreamingJobProgressListener implements onBatchStarted:

override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = synchronized {
  val batchUIData = BatchUIData(batchStarted.batchInfo)
  runningBatchUIData(batchStarted.batchInfo.batchTime) = BatchUIData(batchStarted.batchInfo)
  waitingBatchUIData.remove(batchStarted.batchInfo.batchTime)

  totalReceivedRecords += batchUIData.numRecords
}
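
Since registration is just ssc.addStreamingListener, you can hook your own listener onto the bus exactly the way the UI does. A minimal sketch (a hypothetical BatchTimingListener, not from the Spark source), assuming the same-era API shown above:

import org.apache.spark.streaming.scheduler.{StreamingListener,
  StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

// Hypothetical listener that logs batch timings, registered like StreamingJobProgressListener
class BatchTimingListener extends StreamingListener {
  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit =
    println(s"batch ${batchStarted.batchInfo.batchTime} started")

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit =
    println(s"batch ${batchCompleted.batchInfo.batchTime} processed in " +
      s"${batchCompleted.batchInfo.processingDelay.getOrElse(-1L)} ms")
}

// Registration, e.g. right before ssc.start():
// ssc.addStreamingListener(new BatchTimingListener())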

 

At this point, the analysis of Spark's listener mechanism is complete. The implementation boils down to:

a. A semaphore coordinates the threads: it watches for incoming StreamingListenerEvents, and when an event is posted, postToAll makes every StreamingListener trigger its onPostEvent method.

b. StreamingListenerBus pattern-matches over all the StreamingListenerEvent subclasses, dispatching each kind of event to its corresponding handler method.

c. Subclasses of StreamingListener implement the methods that handle the different StreamingListenerEvents.

d. Finally, the actual business logic posts the appropriate event for the current state, to be processed by the daemon thread.

=================== Well worth learning from ==========================

Next, let's look at how the Receiver of a ReceiverInputDStream gets placed onto an Executor for execution.
