Spark Streaming: How the Jobs for Each Batch Are Generated

KLordy

Article link: https://blog.csdn.net/klordy_123/article/details/84201055

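This article continues from the previous one on the startup process. That article covered what JobScheduler and DStreamGraph do at startup, and showed that once the application is running, the chain JobScheduler->JobGenerator->Timer keeps it going: a timer fires once per batch interval and triggers the processing of that batch. So what happens between the timer firing at the start of a batch and that batch being fully processed? This article picks up there and walks through the batch-processing flow.
Several steps in this flow publish their events through EventLoop, so it is worth looking at that class first. Its source is: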

private[spark] abstract class EventLoop[E](name: String) extends Logging {

  // Pending events; take() blocks while the queue is empty
  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  // Event-processing thread: once started, it keeps polling the queue until it is stopped or interrupted
  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) =>
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    // onStart() is a hook that lets subclasses do extra work before the event thread starts
    onStart()
    // Start the event-processing thread
    eventThread.start()
  }

  def stop(): Unit = {
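    // CAS: compareAndSet guarantees that only the first call to stop() gets past this point
    if (stopped.compareAndSet(false, true)) {
      // stopped is now true, but the event thread may be blocked in eventQueue.take() inside the
      // while loop and never see the flag, so interrupt it as well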
      eventThread.interrupt()
      var onStopCalled = false
      try {
        eventThread.join() // block until eventThread has exited
        // Call onStop after the event thread exits to make sure onReceive happens before onStop
        onStopCalled = true
        onStop() // the thread has exited; subclasses do their cleanup here
      } catch {
        case ie: InterruptedException =>
          Thread.currentThread().interrupt()
          if (!onStopCalled) {
            // ie is thrown from `eventThread.join()`. Otherwise, we should not call `onStop` since
            // it's already called.
            onStop()
          }
      }
    } else {
      // Keep quiet to allow calling `stop` multiple times.
    }
  }

  def post(event: E): Unit = {
    eventQueue.put(event)
  }

  def isActive: Boolean = eventThread.isAlive
  protected def onStart(): Unit = {}
  protected def onStop(): Unit = {}

  /**
   * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
   * and cannot process events in time. If you want to call some blocking actions, run them in
   * another thread.
   * (Do not block in here, otherwise the handling of later events is delayed.)
   */
  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit

}

The abstract class is structurally simple: subclasses only need to implement onReceive for event handling and onError for error handling.
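To make the pattern concrete, here is a minimal sketch of the same idea outside Spark. MiniEventLoop, MiniEventLoopDemo and the event types are hypothetical names invented for this illustration; the real EventLoop is private[spark] and also has the onStart/onStop hooks and logging, which are left out here.

```scala
import java.util.concurrent.{BlockingQueue, CopyOnWriteArrayList, CountDownLatch, LinkedBlockingDeque}
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

// Simplified stand-in for Spark's EventLoop (hypothetical class, not Spark API)
abstract class MiniEventLoop[E](name: String) {
  private val queue: BlockingQueue[E] = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = queue.take() // blocks while the queue is empty
          try onReceive(event)
          catch { case NonFatal(e) => onError(e) }
        }
      } catch {
        case _: InterruptedException => // stop() interrupted a blocked take(); just exit
      }
    }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      thread.interrupt()
      thread.join()
    }
  }

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}

object MiniEventLoopDemo {
  sealed trait Event
  final case class GenerateJobs(time: Long) extends Event
  case object Boom extends Event

  // Posts three events and returns what the handler thread recorded, in order
  def run(): List[String] = {
    val seen = new CopyOnWriteArrayList[String]()
    val done = new CountDownLatch(3)
    val loop = new MiniEventLoop[Event]("demo-loop") {
      override protected def onReceive(event: Event): Unit = event match {
        case GenerateJobs(t) => seen.add(s"generate@$t"); done.countDown()
        case Boom => throw new IllegalStateException("boom")
      }
      override protected def onError(e: Throwable): Unit = {
        seen.add(s"error:${e.getMessage}")
        done.countDown()
      }
    }
    loop.start()
    loop.post(GenerateJobs(1000L))
    loop.post(Boom)
    loop.post(GenerateJobs(2000L))
    done.await() // wait until all three events have been handled
    loop.stop()
    seen.toArray(new Array[String](0)).toList
  }
}
```

Because a single thread drains the queue, events are handled strictly in posting order, and an exception thrown by onReceive is routed to onError without killing the loop.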
With EventLoop covered, back to the main flow. The entry point that kicks off the processing of each batch is the timer in JobGenerator:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

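The implementation of RecurringTimer is also simple. It owns a thread, which is started when start() is called. start() also fixes the first trigger time, and the second constructor argument is the interval between triggers, i.e. the duration of one batch. A loop method then calls triggerActionForNextInterval over and over. Inside that method the clock blocks until the next batch time arrives, then the callback passed to the constructor is invoked, and finally the next trigger time is set. This cycle repeats until the timer is stopped. The code of triggerActionForNextInterval is: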

private def triggerActionForNextInterval(): Unit = {
    clock.waitTillTime(nextTime)
    callback(nextTime)
    prevTime = nextTime
    nextTime += period
    logDebug("Callback for " + name + " called at time " + prevTime)
  }
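The loop can be modelled in a few lines without any real thread or sleeping. The sketch below is only an illustration: FakeClock, MiniRecurringTimer and MiniRecurringTimerDemo are hypothetical names, and aligning the first trigger to the next multiple of the period is an assumption of this sketch rather than something shown in the listing above.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical manual clock: "waiting" simply jumps to the target time
final class FakeClock(var now: Long) {
  def waitTillTime(target: Long): Unit = if (target > now) now = target
}

// Single-threaded model of the timer loop (not Spark's RecurringTimer)
final class MiniRecurringTimer(clock: FakeClock, period: Long, callback: Long => Unit) {
  private var nextTime: Long = 0L

  // Assumed alignment rule: first trigger at the next multiple of `period`
  def start(): Long = {
    nextTime = (clock.now / period + 1) * period
    nextTime
  }

  def triggerActionForNextInterval(): Unit = {
    clock.waitTillTime(nextTime) // block until the batch time arrives
    callback(nextTime)           // e.g. post GenerateJobs(nextTime)
    nextTime += period           // schedule the following batch
  }
}

object MiniRecurringTimerDemo {
  // Starts at t=2500 with a 1000 ms period and records three trigger times
  def run(): List[Long] = {
    val fired = ListBuffer.empty[Long]
    val timer = new MiniRecurringTimer(new FakeClock(2500L), 1000L, (t: Long) => { fired += t; () })
    timer.start()
    (1 to 3).foreach(_ => timer.triggerActionForNextInterval())
    fired.toList
  }
}
```

Note that the callback always receives the scheduled time, not the time at which it actually ran, so batch times stay exact multiples of the period even if a trigger fires late.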

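With the timer mechanism covered, back to the main flow. What matters most about the timer started by JobGenerator is what its callback does: longTime => eventLoop.post(GenerateJobs(new Time(longTime))). Each time the timer fires, the callback posts a GenerateJobs event to eventLoop. Next, look at how JobGenerator implements event handling and error handling for its eventLoop; the implementation is created as an anonymous subclass: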

eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }

Now the processEvent code:

private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

The picture is now clear: when the timer fires it posts a GenerateJobs event to eventLoop, and eventLoop handles that event by calling generateJobs(time):

private def generateJobs(time: Time) {
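    // Make Spark checkpoint every ancestor RDD that is marked for checkpointing, so that lineages
    // are truncated periodically; otherwise the lineage keeps growing across batches and can
    // eventually cause a stack overflow.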
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      // Allocate the blocks received so far to the current batch
      jobScheduler.receiverTracker.allocateBlocksToBatch(time)
      // Generate the jobs
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

The method is straightforward. Next, see how the corresponding generateJobs in DStreamGraph produces the jobs:

def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

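This makes it clear that each pipeline from an outputStream back to its inputStream corresponds to one job, so the number of jobs per batch is determined by the number of outputStreams. To verify this, start a Spark Streaming application and open the Jobs tab in the UI: the description of every job begins with the output operation that produced it.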
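The flatMap over Option values is what makes the job count follow the output streams. A minimal sketch with hypothetical stand-in types (SketchJob, SketchOutputStream and GenerateJobsSketch are not Spark classes, and the firstValidTime rule is invented just to produce a None):

```scala
// Hypothetical stand-ins, not Spark classes
final case class SketchJob(time: Long, output: String)

final class SketchOutputStream(val name: String, firstValidTime: Long) {
  // Like generateJob on an output DStream, this may produce no job for a batch
  def generateJob(time: Long): Option[SketchJob] =
    if (time >= firstValidTime) Some(SketchJob(time, name)) else None
}

object GenerateJobsSketch {
  // flatMap over Options keeps the Some values and drops the Nones
  def generateJobs(outputStreams: Seq[SketchOutputStream], time: Long): Seq[SketchJob] =
    outputStreams.flatMap(_.generateJob(time))

  val streams: Seq[SketchOutputStream] = Seq(
    new SketchOutputStream("print", firstValidTime = 1000L),
    new SketchOutputStream("saveAsTextFiles", firstValidTime = 3000L))
}
```

With these two streams, a batch at time 1000 yields one job and a batch at time 3000 yields two: at most one job per output stream, and none for a stream whose generateJob returns None.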
Since job generation for each batch starts from the outputStreams in DStreamGraph, the next step is to trace generateJob on those outputStreams. The code (from ForEachDStream) is:

override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

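This is a chained call. As the first article described, every DStream records its relationship to its parent, and that relationship is used here: generateJob calls getOrCompute on the DStream it depends on. From the earlier analysis we know that the DStreams in the DAG held by DStreamGraph are the results of transformation operators. Each of these implements the compute method that getOrCompute invokes, and the implementations all follow the same pattern: call getOrCompute on the parent, then apply the operator to the result. For example, the flatMap version (FlatMappedDStream):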

override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }

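The chain keeps calling upward in this way until it reaches an input DStream. Input DStreams are the DStreams that extend InputDStream, and their compute method is what actually fetches the data. For example, here is the compute method of FileInputDStream: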

override def compute(validTime: Time): Option[RDD[(K, V)]] = {
    // Find new files
    val newFiles = findNewFiles(validTime.milliseconds)
    logInfo("New files at time " + validTime + ":\n" + newFiles.mkString("\n"))
    batchTimeToSelectedFiles.synchronized {
      batchTimeToSelectedFiles += ((validTime, newFiles))
    }
    recentlySelectedFiles ++= newFiles
    val rdds = Some(filesToRDD(newFiles))
    // Copy newFiles to immutable.List to prevent from being modified by the user
    val metadata = Map(
      "files" -> newFiles.toList,
      StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
    val inputInfo = StreamInputInfo(id, 0, metadata)
    ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
    rdds
  }

The call order of these methods is generateJob --> getOrCompute --> compute. The one worth a closer look is getOrCompute; its source is:

private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    generatedRDDs.get(time).orElse {
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

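As the code shows, getOrCompute first looks for the required RDD in the generatedRDDs cache and only calls compute on a miss. Once the RDD has been produced, generatedRDDs.put(time, newRDD) stores it in the cache. So if a batch has several jobs and some DStream appears in the pipeline of more than one of them, that DStream's RDD is generated only once: after the first job has gone through getOrCompute, the other jobs get the previously generated RDD straight from the cache.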
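The lookup-then-compute behaviour can be reproduced with a toy model. In the sketch below a "batch" is a plain Seq[Int] instead of an RDD, and ToyStream, ToyInput, ToyMapped and GetOrComputeSketch are hypothetical names. The time-validity check, persistence and checkpointing of the real method are deliberately left out.

```scala
import scala.collection.mutable

// Toy model of a DStream (hypothetical, not Spark API)
abstract class ToyStream {
  private val generated = mutable.HashMap.empty[Long, Seq[Int]]
  var computeCalls = 0 // counts cache misses, for the demonstration only

  protected def compute(time: Long): Option[Seq[Int]]

  final def getOrCompute(time: Long): Option[Seq[Int]] =
    generated.get(time).orElse {
      computeCalls += 1
      val result = compute(time)
      result.foreach(batch => generated.put(time, batch)) // remember it for later callers
      result
    }
}

// Plays the role of an input DStream: the end of the chain, where data comes from
final class ToyInput extends ToyStream {
  protected def compute(time: Long): Option[Seq[Int]] = Some(Seq(1, 2, 3))
}

// Plays the role of a transformation: ask the parent, then apply the operator
final class ToyMapped(parent: ToyStream, f: Int => Int) extends ToyStream {
  protected def compute(time: Long): Option[Seq[Int]] =
    parent.getOrCompute(time).map(_.map(f))
}

object GetOrComputeSketch {
  // Two "jobs" in the same batch share one input stream
  def run(): (Option[Seq[Int]], Option[Seq[Int]], Int) = {
    val input = new ToyInput
    val doubled = new ToyMapped(input, _ * 2)
    val plusOne = new ToyMapped(input, _ + 1)
    val a = doubled.getOrCompute(1000L)
    val b = plusOne.getOrCompute(1000L)
    (a, b, input.computeCalls)
  }
}
```

Both pipelines reach the shared input for batch time 1000, but its compute runs only for the first one; the second call is served from the map.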
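In short, the generateJobs call in DStreamGraph performs a chained lookup that links every step of the pipeline together. When the chain reaches the inputStream, the data to be processed in the current batch is obtained as an RDD, and all subsequent operations are applied on top of that RDD to produce the final result. The actual computation has to run distributed across the nodes, so the computation logic is first wrapped in a job. The jobs are then handed to JobScheduler, which records them in its jobSets and takes care of dispatching them for execution.
Job submission starts in generateJobs of JobGenerator. Here is the relevant code again: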

Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }

After graph.generateJobs(time) has produced the jobs, the result is pattern matched, and on success jobScheduler.submitJobSet(xxx) is called:

def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }
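A rough sketch of this submit step, using a plain java.util.concurrent thread pool. ToyJob, ToyJobSet, ToyJobScheduler and SubmitJobSetSketch are hypothetical names, the listener-bus notification is omitted, and the pool size of 1 is simply a choice made for this example.

```scala
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue, Executors, TimeUnit}

// Hypothetical stand-ins, not Spark classes
final case class ToyJob(time: Long, func: () => Unit)
final case class ToyJobSet(time: Long, jobs: Seq[ToyJob])

final class ToyJobScheduler(numConcurrentJobs: Int) {
  private val jobSets = new ConcurrentHashMap[Long, ToyJobSet]()
  private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

  def submitJobSet(jobSet: ToyJobSet): Unit = {
    if (jobSet.jobs.nonEmpty) {
      jobSets.put(jobSet.time, jobSet) // remember the batch until it completes
      jobSet.jobs.foreach { job =>
        // Each job is wrapped in a Runnable, like JobHandler in the listing above
        jobExecutor.execute(new Runnable { override def run(): Unit = job.func() })
      }
    }
  }

  def trackedBatches: Int = jobSets.size()

  // Stops accepting work and waits for the submitted jobs to finish
  def shutdownAndWait(): Boolean = {
    jobExecutor.shutdown()
    jobExecutor.awaitTermination(10, TimeUnit.SECONDS)
  }
}

object SubmitJobSetSketch {
  def run(): (List[String], Int) = {
    val ran = new ConcurrentLinkedQueue[String]()
    val scheduler = new ToyJobScheduler(numConcurrentJobs = 1)
    val jobs = Seq(1, 2, 3).map(i => ToyJob(1000L, () => { ran.add(s"job-$i"); () }))
    scheduler.submitJobSet(ToyJobSet(1000L, jobs))
    scheduler.submitJobSet(ToyJobSet(2000L, Seq.empty)) // no jobs: nothing is recorded or run
    scheduler.shutdownAndWait()
    (ran.toArray(new Array[String](0)).toList, scheduler.trackedBatches)
  }
}
```

With a single worker thread the jobs of a batch run one after another in submission order; a larger pool would let them run concurrently.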

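That completes the job generation flow. After the jobs finish, further events follow; later articles will cover them. I have also set up a source-code discussion group: 936037639. If you are reading the source of Spark or other big data frameworks, you are welcome to join and exchange notes. Some details are hard to work out when reading source code alone, and many hands make light work!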