Lesson 6: Spark Streaming Source Code Analysis: Dynamic Job Generation and Deeper Thinking

Topics in this lesson:

1. In-depth thinking about Spark Streaming Job generation

2. Source code analysis of Spark Streaming Job generation

 

Let's start with the JobGenerator class. Its constructor takes a JobScheduler object, and JobScheduler is the core class in Spark Streaming for generating jobs and submitting them to the cluster. JobGenerator generates Jobs based on the DStreamGraph. To emphasize again: a Job here is, like Java's Runnable interface, just a wrapper around the business logic. It is not the same concept as a Job in Spark Core: a Spark Core Job is an actual running job, whereas a Spark Streaming Job is a higher-level abstraction.

/**
 * This class generates jobs from DStreams as well as drives checkpointing and cleaning
 * up DStream metadata.
 */
private[streaming]
class JobGenerator(jobScheduler: JobScheduler) extends Logging {

  private val ssc = jobScheduler.ssc
  private val conf = ssc.conf
  private val graph = ssc.graph

 

In Spark Streaming, a Job is just a plain Java bean; the business logic lives in the func function.

/**
 * Class representing a Spark computation. It may contain multiple Spark jobs.
 */
private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

 

DStreams come in three kinds. The first kind is input streams built from different data sources, such as Socket, Kafka, or Flume. The second kind is output streams: outputStreams are logical-level actions; since they are still at the Spark Streaming framework level, they must eventually be turned into physical-level actions. The third kind is DStreams produced by transformations, which turn one DStream into another, i.e. DStreams derived from other DStreams. The DStreamGraph class records the source (input) DStreams and the output DStreams; a small user-level sketch after the snippet below illustrates all three kinds.

// DStreamGraph is the static template of the RDDs; it describes the concrete processing steps formed by the RDD dependencies
final private[streaming] class DStreamGraph extends Serializable with Logging {

  // a dynamic array of InputDStreams
  // input streams: the data sources
  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  // output streams: the output operations (actions)
  private val outputStreams = new ArrayBuffer[DStream[_]]()
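To make the three kinds concrete, here is a minimal user-level sketch (a hypothetical word-count program, not part of Spark's sources): socketTextStream creates an input DStream, flatMap/map/reduceByKey create transformed DStreams, and print() is the output operation that ultimately registers an output DStream in the DStreamGraph.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamKindsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamKindsDemo")
    // the batch interval (BatchDuration) that drives JobGenerator's timer
    val ssc = new StreamingContext(conf, Seconds(5))

    // 1) input DStream: built from a data source (a socket here; Kafka/Flume are similar)
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2) transformed DStreams: each operation derives a new DStream from an existing one
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // 3) output operation: the logical-level "action" that registers an output DStream
    wordCounts.print()

    ssc.start()             // only now does the DStreamGraph start producing jobs per batch
    ssc.awaitTermination()
  }
}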

 

As time goes by, JobGenerator keeps producing jobs at every BatchDuration interval, and it also drives checkpointing and the cleanup of old DStream metadata.

A thought on stream processing versus batch processing: if the batch interval is short enough, batch processing effectively becomes stream processing. Spark Streaming's stream processing is triggered by time, while Storm's is triggered by events. The same triggering idea shows up in scheduled tasks, stream processing, and jobs triggered from J2EE applications.

Consider a question: when the logical-level DStreamGraph is translated into the physical-level RDD graph, the last operation is an RDD action. Does that translation immediately trigger a Job?

It does not. The Job produced by JobGenerator is a Runnable-style wrapper: the DStream dependencies are translated into RDD dependencies, and the final operation is an action, but all of this sits inside a function that has not been invoked yet, so the translation itself does not trigger a Job. If translation triggered the Job immediately, Spark Streaming's job submission would be completely unmanaged.

Only when JobScheduler schedules the Job does it take a thread from the thread pool to execute the function that wraps the DStream-to-RDD translation, as the small sketch below illustrates.
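A minimal sketch of this "wrap now, run later" idea (plain Scala, not taken from Spark's sources): constructing the closure does nothing; only invoking it later, the way JobHandler eventually calls job.run(), executes the action.

// Plain Scala, illustrative only: constructing the closure runs nothing
val jobFunc: () => Unit = () => {
  // in the real code this body is: context.sparkContext.runJob(rdd, emptyFunc)
  println("the action actually executes only now")
}
// ...DStream-to-RDD translation is finished here, yet nothing has run...
jobFunc()   // only this call (made later on a jobExecutor thread via Job.run) triggers the work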

 

 

Next, let's look at Job generation from three angles: JobGenerator, JobScheduler, and ReceiverTracker. JobGenerator is responsible for generating jobs, JobScheduler is responsible for scheduling them, and ReceiverTracker keeps track of where the data comes from. Both JobGenerator and ReceiverTracker are members of JobScheduler.

/**
 * This class schedules jobs to be run on Spark. It uses the JobGenerator to generate
 * the jobs and runs them using a thread pool.
 * (Blogger's note: as the comment says, jobs are generated by JobGenerator and run on a thread pool.)
 */
private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {

  // Use of ConcurrentHashMap.keySet later causes an odd runtime problem due to Java 7/8 diff
  // https://gist.github.com/AlainODea/1375759b8720a3f9f094
  private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
  // the number of concurrent jobs defaults to 1
  private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
  // jobs are executed on a thread pool
  private val jobExecutor =
    ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
  // create the JobGenerator, analyzed in detail below
  private val jobGenerator = new JobGenerator(this)
  val clock = jobGenerator.clock
  val listenerBus = new StreamingListenerBus()

  // These two are created only when scheduler starts.
  // eventLoop not being null means the scheduler has been started and not stopped
  var receiverTracker: ReceiverTracker = null
  // A tracker to track all the input stream information as well as processed record number
  var inputInfoTracker: InputInfoTracker = null

 

In JobScheduler's start method, the start methods of ReceiverTracker and JobGenerator are called in turn.

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  // the message-driven (event-driven) system
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  // start the message-loop processing thread
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  // start the receiverTracker
  receiverTracker.start()
  // start the Job generator
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

 

First look at JobGenerator's start method: it initializes the checkpoint writer, instantiates and starts the EventLoop message loop, and then starts the timer that periodically generates jobs.

/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  // start the message-loop processing thread
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    // start the timer that periodically generates jobs
    startFirstTime()
  }
}

The EventLoop class holds a LinkedBlockingDeque that stores the messages and a background thread. The background thread takes messages off the queue and handles each one by calling onReceive; here onReceive is the override in the anonymous subclass, which simply delegates to processEvent.

private[spark] abstract class EventLoop[E](name: String) extends Logging {

  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }

  }

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }
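A minimal usage sketch (assumed, not taken from Spark's sources) that mirrors how JobGenerator and JobScheduler use EventLoop: create an anonymous subclass whose onReceive pattern-matches on the event and routes it, start it, and post events to it. EventLoop is private[spark], so this is only schematic.

// Hypothetical demo event type
sealed trait DemoEvent
case class Tick(time: Long) extends DemoEvent

val demoLoop = new EventLoop[DemoEvent]("demo-loop") {
  override protected def onReceive(event: DemoEvent): Unit = event match {
    case Tick(t) => println(s"handling Tick($t)")   // route by pattern matching
  }
  override protected def onError(e: Throwable): Unit = logError("demo-loop error", e)
}

demoLoop.start()          // starts the daemon event thread that drains the queue
demoLoop.post(Tick(1000)) // enqueue a message; the event thread will invoke onReceive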

The processEvent method pattern-matches on the message type and routes each message to the method that handles it. The actual handling is generally handed off to another thread; the message loop itself never performs time-consuming business logic.

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}

Take the handler of the GenerateJobs message, generateJobs, as an example: after the received data has been allocated to the batch, it calls DStreamGraph's generateJobs method to generate the jobs.

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    // get the data for this specific batch time
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // call DStreamGraph.generateJobs to generate the jobs
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}

In DStreamGraph.generateJobs, outputStreams holds the last DStream(s) of the whole DStream lineage. Calling outputStream.generateJob(time) here works backwards from the end, just as an RDD lineage is traced from the final RDD back to its parents. (A sketch after the snippet below shows how an output DStream gets registered in the first place.)

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
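As a rough sketch of why outputStreams holds exactly the "last" DStreams (names below follow Spark 1.x and are shown schematically, so treat them as an assumption rather than a verbatim quote): a user-level output operation such as print() or foreachRDD wraps the user function in a ForEachDStream and registers it with the DStreamGraph.

// User code: defining an output operation on the wordCounts DStream from the earlier sketch
wordCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}
// Schematically, inside DStream.foreachRDD:
//   new ForEachDStream(this, wrappedForeachFunc, ...).register()
// and DStream.register() adds the new DStream to ssc.graph's outputStreams,
// which is precisely the collection that DStreamGraph.generateJobs iterates over.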

In the generateJob method, jobFunc wraps context.sparkContext.runJob(rdd, emptyFunc).

private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {

      // the Job body is wrapped in a function; nothing is executed here
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}

In the Job class, calling the run method causes the func that was passed in to be invoked.

/**
 * Class representing a Spark computation. It may contain multiple Spark jobs.
 */
private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

  def run() {
    _result = Try(func())
  }

 

The getOrCompute method first looks up the HashMap by the given time to see whether the RDD already exists. If not, it calls compute to produce the RDD, persists it if the storageLevel requires it, checkpoints it if this batch time falls on a checkpoint boundary, and finally puts the RDD into the HashMap.

/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {

      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}

Back in the JobGenerator class: after the message loop has been started in start, it checks whether a checkpoint already exists. If so, the state is read from the checkpoint directory and restart is called to restart the JobGenerator; if this is the first start, startFirstTime is called.

/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  // start the message-loop processing thread
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    // start the timer that periodically generates jobs
    startFirstTime()
  }
}

The startFirstTime method of JobGenerator starts the timer that periodically generates jobs.

/** Starts the generator for the first time */
private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}
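A small sketch of the start-time alignment that timer.getStartTime() performs (the formula below is my assumption of the intent, not a verbatim copy of RecurringTimer): the first batch time is the next multiple of the batch interval after the current clock time, so batches land on clean interval boundaries.

// Hypothetical helper: align the first batch time to the next multiple of the interval
def alignedStartTime(nowMs: Long, periodMs: Long): Long =
  (math.floor(nowMs.toDouble / periodMs).toLong + 1) * periodMs

// e.g. with a 5-second batch interval:
alignedStartTime(nowMs = 1461756523123L, periodMs = 5000)  // => 1461756525000, a clean 5s boundary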

The timer object is a RecurringTimer. Its start method starts a thread internally, and that thread repeatedly calls triggerActionForNextInterval.

// a recurring timer that periodically calls back eventLoop.post(GenerateJobs(new Time(longTime)))
// the periodically triggered function simply posts a GenerateJobs event
// note that only the callback is defined here; nothing is triggered yet
// GenerateJobs messages are sent at the batch interval passed in when the StreamingContext was created
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

 

/**
 * Start at the given start time.
 */
def start(startTime: Long): Long = synchronized {
  nextTime = startTime
  thread.start()
  logInfo("Started timer for " + name + " at time " + nextTime)
  nextTime
}

 

// a daemon thread is created here
private val thread = new Thread("RecurringTimer - " + name) {
  setDaemon(true)
  override def run() { loop }
}

 

/**
 * Repeatedly call the callback every interval.
 */
private def loop() {
  try {
    while (!stopped) {
      triggerActionForNextInterval()
    }
    triggerActionForNextInterval()
  } catch {
    case e: InterruptedException =>
  }
}

The triggerActionForNextInterval method waits until the next batch boundary and then invokes callback. The callback here is the function passed in when the RecurringTimer was constructed, i.e. longTime => eventLoop.post(GenerateJobs(new Time(longTime))), so GenerateJobs messages keep being posted to the message loop.

private def triggerActionForNextInterval(): Unit = {
  clock.waitTillTime(nextTime)
  callback(nextTime)
  prevTime = nextTime
  nextTime += period
  logDebug("Callback for " + name + " called at time " + prevTime)
}

 

private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {
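Putting the timer pieces together, here is a minimal self-contained sketch (my own simplification, not Spark's RecurringTimer) of the same pattern: a daemon thread waits until the next interval boundary, invokes the callback, and advances nextTime by one period.

// A simplified stand-in for RecurringTimer (assumed names; uses the system clock directly)
class SimpleRecurringTimer(period: Long, callback: Long => Unit, name: String) {
  @volatile private var stopped = false
  private var nextTime = 0L

  private val thread = new Thread("SimpleRecurringTimer - " + name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          val sleepMs = nextTime - System.currentTimeMillis()
          if (sleepMs > 0) Thread.sleep(sleepMs)  // wait for the next interval boundary
          callback(nextTime)                      // e.g. post GenerateJobs(new Time(nextTime))
          nextTime += period                      // move on to the next interval
        }
      } catch {
        case _: InterruptedException => // exit quietly when stopped
      }
    }
  }

  def start(startTime: Long): Long = { nextTime = startTime; thread.start(); nextTime }
  def stop(): Unit = { stopped = true; thread.interrupt() }
}

// Usage: invoke a "generate jobs" callback every 5 seconds, starting on a 5s boundary
// val timer = new SimpleRecurringTimer(5000, t => println(s"generate jobs for $t"), "demo")
// timer.start(alignedStartTime(System.currentTimeMillis(), 5000))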

 

Let's focus once more on the steps by which the generateJobs method produces jobs:

Step 1: get the data for the current batch interval.

Step 2: generate the jobs and the dependencies between RDDs.

Step 3: get the input information for the stream IDs corresponding to the generated jobs.

Step 4: wrap them into a JobSet and hand it to JobScheduler.

Step 5: perform the checkpoint operation.

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    // Step 1: get the data for the current batch interval
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // Step 2: generate the jobs and the dependencies between RDDs
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // Step 3: get the input information for the corresponding stream IDs
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // Step 4: wrap them into a JobSet and hand it to JobScheduler
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Step 5: perform the checkpoint operation
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}

The submitJobSet method simply puts the JobSet into the ConcurrentHashMap and wraps each Job in a JobHandler that is submitted to the jobExecutor thread pool.

def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}

 

private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]

The JobHandler class implements the Runnable interface; the job's run method causes func to be invoked, i.e. the business logic built on the DStreams.

private class JobHandler(job: Job) extends Runnable with Logging {
    import JobScheduler._

    def run() {
      try {
        val formattedTime = UIUtils.formatBatchTime(
          job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
        val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
        val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

        ssc.sc.setJobDescription(
          s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
        ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
        ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

        // We need to assign `eventLoop` to a temp variable. Otherwise, because
        // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
        // it's possible that when `post` is called, `eventLoop` happens to null.
        var _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            job.run()
          }
          _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
          }
        } else {
          // JobScheduler has been stopped.
        }
      } finally {
        ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
        ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
      }
    }
  }
}

Notes:

1. DT Big Data Dream Factory WeChat official account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains

 

 

 

 

Reprinted from: https://my.oschina.net/u/928448/blog/675083
