Interview Source-Code Series: Spark Streaming Source Code, Illustrated

The entry point for everything is StreamingContext.start().
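Before diving into the source, here is a minimal sketch of where that call sits in a user program, just for orientation; the app name, master and socket source are made up for illustration, and everything the rest of this article traces happens inside ssc.start().

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StartDemo {
  def main(args: Array[String]): Unit = {
    // A StreamingContext with a 5-second batch interval (values are illustrative).
    val conf = new SparkConf().setAppName("start-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // The DStream lineage must be defined before start(); a socket source is assumed here.
    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()            // <- the entry point this article walks through
    ssc.awaitTermination()
  }
}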

It felt a bit tangled written out in prose, so I drew a simple diagram of the flow.

The key point here: scheduling is kicked off by calling JobScheduler.start().

def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))


              // the key call: start the scheduler
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
            scheduler.listenerBus.post(
              StreamingListenerStreamingStarted(System.currentTimeMillis()))
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }

      // (the ACTIVE and STOPPED cases are omitted from this excerpt)
    }
  }

Let's look at that start() (i.e. JobScheduler.start()). It creates a new EventLoop; look familiar? We met the same pattern in Spark Core's DAGScheduler. onReceive is overridden to keep receiving events, and it delegates to processEvent to handle each one.

def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")

    // A new EventLoop: the same pattern we met in Spark Core's DAGScheduler.
    // onReceive is overridden to keep receiving events and delegates to processEvent.

    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start()
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)

    val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
      case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
      case _ => null
    }

    executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
      executorAllocClient,
      receiverTracker,
      ssc.conf,
      ssc.graph.batchDuration.milliseconds,
      clock)
    executorAllocationManager.foreach(ssc.addStreamingListener)
    receiverTracker.start()
    jobGenerator.start()
    executorAllocationManager.foreach(_.start())
    logInfo("Started JobScheduler")
  }
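A quick refresher on the EventLoop pattern created at the top of this method. The real org.apache.spark.util.EventLoop adds error handling and lifecycle checks, but the core idea is just a daemon thread draining a blocking queue and handing every event to onReceive; a stripped-down, hedged sketch:

import java.util.concurrent.LinkedBlockingQueue

// Minimal sketch of the EventLoop pattern used by JobScheduler and JobGenerator.
abstract class MiniEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (true) onReceive(queue.take())   // block until an event arrives, then process it
      } catch {
        case _: InterruptedException =>        // stop() interrupts the thread; exit quietly
      }
    }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = thread.interrupt()

  protected def onReceive(event: E): Unit
}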

Three components have their start() method called in this method:

  • receiverTracker.start()

  • jobGenerator.start()

  • executorAllocationManager.foreach(_.start())

var executorAllocationManager: Option[ExecutorAllocationManager] is the ExecutorAllocationManager, which is responsible for requesting the executors our streaming computation needs.

Let's look at its start() method: it starts a timer.

def start(): Unit = {
    // start the recurring timer
    timer.start()
    logInfo(s"ExecutorAllocationManager started with " +
      s"ratios = [$scalingUpRatio, $scalingDownRatio] and interval = $scalingIntervalSecs sec")
  }

Let's look at this timer. It is a RecurringTimer that fires every scalingIntervalSecs seconds and invokes manageAllocation(), the method that manages executor allocation:

private val timer = new RecurringTimer(clock, scalingIntervalSecs * 1000,
    // manageAllocation() decides whether to request or kill executors
    _ => manageAllocation(), "streaming-executor-allocation-manager")

scalingIntervalSecs is read from val SCALING_INTERVAL_KEY = "spark.streaming.dynamicAllocation.scalingInterval". I had never used this parameter, so I quickly looked it up: https://www.jianshu.com/p/e1d9456a4880
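To make the knobs concrete, here is a hedged configuration sketch. The key names follow the spark.streaming.dynamicAllocation.* family that this class reads (scalingInterval is the one quoted above); treat the exact set as an assumption and verify it against the *_KEY constants in the source. The numeric values are only examples, not recommendations.

import org.apache.spark.SparkConf

// Hedged sketch: enabling streaming dynamic allocation via SparkConf.
val conf = new SparkConf()
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.scalingInterval", "60")   // seconds
  .set("spark.streaming.dynamicAllocation.scalingUpRatio", "0.9")
  .set("spark.streaming.dynamicAllocation.scalingDownRatio", "0.3")
  .set("spark.streaming.dynamicAllocation.minExecutors", "2")
  .set("spark.streaming.dynamicAllocation.maxExecutors", "10")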

Then I read the class comment on private[streaming] class ExecutorAllocationManager:

The takeaway: this is different from the dynamic allocation policy in Spark Core. The core policy relies on executors having been idle for a while, but the micro-batch model never lets an executor stay idle for long, so "idleness" has to be measured by the time taken to process each batch instead.

 * Class that manages executor allocated to a StreamingContext, and dynamically request or kill
 * executors based on the statistics of the streaming computation. This is different from the core
 * dynamic allocation policy; the core policy relies on executors being idle for a while, but the
 * micro-batch model of streaming prevents any particular executors from being idle for a long
 * time. Instead, the measure of "idle-ness" needs to be based on the time taken to process
 * each batch.
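In other words, the scaling signal is the ratio of average batch processing time to batch duration. Below is a simplified, non-authoritative sketch of the decision this class makes each scaling interval; the real manageAllocation() additionally respects min/max executor bounds, receiver placement and the choice of which executor to kill.

// Simplified sketch of the scaling decision, not the real manageAllocation().
def decide(avgProcTimeMs: Double, batchDurationMs: Long,
           scalingUpRatio: Double, scalingDownRatio: Double): String = {
  val ratio = avgProcTimeMs / batchDurationMs   // the batch-time based measure of "idleness"
  if (ratio >= scalingUpRatio) "request one more executor"
  else if (ratio <= scalingDownRatio) "release one executor"
  else "do nothing"
}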


Back to the EventLoop: the anonymous subclass created in JobScheduler.start() is its implementation here, and processEvent is where every event ends up.

Doesn't this feel more and more like Spark Core?

 

// process the received events
private def processEvent(event: JobSchedulerEvent) {
    try {
      event match {
        case JobStarted(job, startTime) => handleJobStart(job, startTime)
        case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
        case ErrorReported(m, e) => handleError(m, e)
      }
    } catch {
      case e: Throwable =>
        reportError("Error in job scheduler", e)
    }
  }


// let's see how a job start is handled
  private def handleJobStart(job: Job, startTime: Long) {
    val jobSet = jobSets.get(job.time)
    val isFirstJobOfJobSet = !jobSet.hasStarted
    jobSet.handleJobStart(job)
    if (isFirstJobOfJobSet) {
      // "StreamingListenerBatchStarted" should be posted after calling "handleJobStart" to get the
      // correct "jobSet.processingStartTime".
      listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
    }
    job.setStartTime(startTime)
    listenerBus.post(StreamingListenerOutputOperationStarted(job.toOutputOperationInfo))
    logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
  }

Here we meet JobSet 【Class representing a set of Jobs belong to the same batch.】, i.e. the set of jobs that belong to one batch.
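To picture what a JobSet keeps track of, here is a simplified paraphrase (names abbreviated from memory, so treat it as a sketch rather than the actual source); it also shows why StreamingListenerBatchStarted must be posted only after handleJobStart.

import scala.collection.mutable

// Simplified paraphrase of a JobSet: the jobs of one batch plus timing bookkeeping.
class MiniJobSet(val time: Long, jobIds: Seq[String]) {
  private val incompleteJobs = mutable.Set(jobIds: _*)
  private var processingStartTime = -1L
  private var processingEndTime = -1L

  def hasStarted: Boolean = processingStartTime > 0
  def hasCompleted: Boolean = incompleteJobs.isEmpty

  // The first job of the batch stamps processingStartTime, which is why the
  // BatchStarted listener event is posted only after calling handleJobStart.
  def handleJobStart(jobId: String): Unit =
    if (processingStartTime < 0) processingStartTime = System.currentTimeMillis()

  def handleJobCompletion(jobId: String): Unit = {
    incompleteJobs -= jobId
    if (incompleteJobs.isEmpty) processingEndTime = System.currentTimeMillis()
  }
}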

Inside JobScheduler.start() there is one key call:

jobGenerator.start()

(The full method was quoted above; the relevant tail is:)

    receiverTracker.start()

    // this is the key call
    jobGenerator.start()
    executorAllocationManager.foreach(_.start())
    logInfo("Started JobScheduler")

Let's look at the key player here:

JobGenerator

【This class generates jobs from DStreams as well as drives checkpointing and cleaning up DStream metadata.】

1: it generates jobs;

2: it drives checkpointing and the cleanup of DStream metadata.

For a class like this, the habit is to look for a start()-style method, and of course there is one.

It first checks whether we are recovering from a checkpoint; if not, startFirstTime() is executed.

/** Start generation of jobs */
  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    // check whether we are recovering from a checkpoint; if not, run startFirstTime()
    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }
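That isCheckpointPresent branch corresponds to how an application opts into recovery on the user side, typically through StreamingContext.getOrCreate. A hedged usage sketch (the checkpoint path and app name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hedged sketch of the user side of the restart()/startFirstTime() branch.
object CheckpointDemo {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-demo")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    // ... define the DStream graph here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Fresh start: createContext() runs and JobGenerator takes the startFirstTime() branch.
    // Restart with an existing checkpoint: the context is rebuilt from it and
    // JobGenerator.start() takes the restart() branch instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}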

Let's look at this method, which starts the generator for the first time:

/** Starts the generator for the first time */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

Two start() calls: one on graph (a DStreamGraph), the other on timer:

graph.start(startTime - graph.batchDuration)
timer.start(startTime.milliseconds)
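The timer in JobGenerator is a RecurringTimer whose callback posts a GenerateJobs event back onto the JobGenerator's EventLoop once per batch interval. A simplified sketch of that periodic loop follows (not the real RecurringTimer, which works against a pluggable clock and aligns start times):

import java.util.concurrent.{Executors, TimeUnit}

// Simplified stand-in for the recurring timer: every batchDuration it posts
// "generate the jobs for batch time t" to some event handler.
def startMiniTimer(batchDurationMs: Long, startTimeMs: Long)(post: Long => Unit): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  var nextBatchTime = startTimeMs
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      post(nextBatchTime)          // e.g. eventLoop.post(GenerateJobs(new Time(nextBatchTime)))
      nextBatchTime += batchDurationMs
    }
  }, 0L, batchDurationMs, TimeUnit.MILLISECONDS)
}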

Time for the heavyweight to enter: DStreamGraph.

Before getting to it, a few words about DStream (an abstract class).

Its full name is Discretized Stream, and the official comment is worth quoting.

It is the most basic abstraction of Spark Streaming. Why "discretized"? Think of a tap that is opened periodically rather than a stream that flows all the time: every so often another burst of water comes out. A DStream is a continuous sequence of RDDs, so underneath it is still RDDs, one produced per batch interval you configure. A DStream can also be derived by transforming another DStream, which means the map, filter and similar operations from Spark Core are largely available here too (check the API for details). Because the foundation is RDDs, the programming model is the same, so if you know Spark Core this is easy to pick up.
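To make "the programming model is the same" concrete, here is a hedged word-count fragment; every DStream in the chain yields one RDD per batch interval, and the source and sink are assumptions for illustration only.

import org.apache.spark.streaming.StreamingContext

// Hedged illustration: each DStream below produces one RDD per batch interval.
def wordCount(ssc: StreamingContext): Unit = {
  val lines  = ssc.socketTextStream("localhost", 9999)   // DStream[String]; source is an assumption
  val words  = lines.flatMap(_.split(" "))                // same shape as RDD.flatMap
  val pairs  = words.map(word => (word, 1))
  val counts = pairs.reduceByKey(_ + _)                   // from PairDStreamFunctions via implicits
  counts.print()                                          // output operation: one job per batch
}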

A summary of its characteristics:
- A list of other DStreams that the DStream depends on            -- the dependencies between DStreams
- A time interval at which the DStream generates an RDD           -- the batch interval
- A function that is used to generate an RDD after each time interval    -- the function that produces those RDDs

Where these statements come from:

 /** Time interval after which the DStream generates an RDD */
  def slideDuration: Duration

  /** List of parent DStreams on which this DStream depends on */
  def dependencies: List[DStream[_]]

  /** Method that generates an RDD for the given time */
  def compute(validTime: Time): Option[RDD[T]]

And the class-level scaladoc:

/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates a RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)] through implicit conversions.
 *
 * A DStream internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */
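The three properties map one-to-one onto the abstract members quoted above. As a purely illustrative, hedged skeleton (Spark already ships ConstantInputDStream for this kind of thing, a real source would extend InputDStream and be registered with the graph, and whether user code can extend DStream directly outside Spark's own packages depends on the version), a trivial DStream that hands back the same RDD every batch could look roughly like this:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream

// Illustrative only: the three abstract members made concrete.
class EveryBatchDStream[T: ClassTag](context: StreamingContext,
                                     batchInterval: Duration,
                                     rdd: RDD[T]) extends DStream[T](context) {

  // "A time interval at which the DStream generates an RDD"
  override def slideDuration: Duration = batchInterval

  // "A list of other DStreams that the DStream depends on" (none: this is a source)
  override def dependencies: List[DStream[_]] = List()

  // "A function that is used to generate an RDD after each time interval"
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}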

It has a good number of implementation classes.

Among them an old acquaintance shows up; I will analyze the Kafka integration in a dedicated post later.

Back to our DStreamGraph.start() method. The _.start() at the end is invoked on each input DStream; start() is documented as 【Method called to start receiving data. Subclasses must implement this method.】, i.e. this is where data reception begins.

def start(time: Time) {
    this.synchronized {
      require(zeroTime == null, "DStream graph computation already started")
      zeroTime = time
      startTime = time
      outputStreams.foreach(_.initialize(zeroTime))
      outputStreams.foreach(_.remember(rememberDuration))
      outputStreams.foreach(_.validateAtStart())
      numReceivers = inputStreams.count(_.isInstanceOf[ReceiverInputDStream[_]])
      inputStreamNameAndID = inputStreams.map(is => (is.name, is.id))

      // _.start() here: _ is each input DStream
      inputStreams.par.foreach(_.start())
    }
  }

Let's look at the concrete implementations of this start().
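The screenshots that originally followed are not reproduced here. As a stand-in, here is a hedged, hypothetical sketch of where an input stream's start() fits: a driver-side polling source would bring up its consumer in start() and tear it down in stop(), whereas receiver-based input streams leave start() essentially empty because the ReceiverTracker is what launches their receivers. The class name and the fetch callback below are made up for illustration.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Hypothetical polling source, illustration only: start() is the hook named in the
// comment quoted above ("Method called to start receiving data").
class PollingInputDStream(context: StreamingContext, fetch: () => Seq[String])
  extends InputDStream[String](context) {

  @volatile private var running = false

  override def start(): Unit = { running = true }   // bring up the driver-side consumer here
  override def stop(): Unit  = { running = false }  // and tear it down here

  override def compute(validTime: Time): Option[RDD[String]] = {
    if (!running) None
    else Some(context.sparkContext.parallelize(fetch()))   // one RDD per batch from the latest poll
  }
}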

 
