Spark Structured Streaming Source Code Analysis (2): StreamExecution, the Continuous Query Engine

LS_ice

Article link: https://blog.csdn.net/LS_ice/article/details/81981762

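This article analyzes StreamExecution in Spark Structured Streaming, with a focus on the initialization and stream-processing logic of MicroBatchExecution. It follows the path from StreamingQueryManager starting the stream to the MicroBatchExecution implementation of StreamExecution's runActivatedStream() method, and discusses the contents of the offsets and commits directories and how stream-processing progress is recovered from them.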

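I. StreamingQueryManager creates and starts the stream
II. Initialization of StreamExecution
- 1. StreamExecution source code analysis
- 2. Subclass implementations of StreamExecution: override runActivatedStream()
III. Analysis of MicroBatchExecution micro-batch stream processing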

I. StreamingQueryManager creates and starts the stream

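This continues from the previous article, which covered creating the stream's Source and Sink.
Once the Sink has been created, sessionState.streamingQueryManager.startQuery() is called to create and start the stream.
The corresponding StreamingQueryManager startup flow is:

[Figure: StreamingQueryManager startup flow chart (image not available)]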

The main code of startQuery() and createQuery():

class StreamingQueryManager private[sql] (sparkSession: SparkSession) extends Logging {
   
  private[sql] def startQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean = false,
      recoverFromCheckpointLocation: Boolean = true,
      trigger: Trigger = ProcessingTime(0),
      triggerClock: Clock = new SystemClock()): StreamingQuery = {
    val query = createQuery(
      userSpecifiedName,
      userSpecifiedCheckpointLocation,
      df,
      extraOptions,
      sink,
      outputMode,
      useTempCheckpointLocation,
      recoverFromCheckpointLocation,
      trigger,
      triggerClock)

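    activeQueriesLock.synchronized {
      // (checks for an already-active query with the same name or id are omitted in this excerpt)
      activeQueries.put(query.id, query)
    }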
    try {
      query.streamingQuery.start()
    } catch {
      case e: Throwable =>
        activeQueriesLock.synchronized {
          activeQueries -= query.id
        }
        throw e
    }
    query
  }

  private def createQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean,
      recoverFromCheckpointLocation: Boolean,
      trigger: Trigger,
      triggerClock: Clock): StreamingQueryWrapper = {
    var streamExecutionCls = extraOptions.getOrElse("stream.execution.class", "")
    var deleteCheckpointOnStop = false
    val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified =>
      new Path(userSpecified).toUri.toString
    }.orElse {
      // xxxx (fallback checkpoint-location logic elided in the original post)
    }

    val analyzedPlan = df.queryExecution.analyzed
    df.queryExecution.assertAnalyzed()

    if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
      UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
    }

    if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
      logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
          "is not supported in streaming DataFrames/Datasets and will be disabled.")
    }

    var streamingQueryWrapper: StreamingQueryWrapper = null

    if (streamExecutionCls.length > 0) {
      val cls = Utils.classForName(streamExecutionCls)
      val constructor = cls.getConstructor(
        classOf[SparkSession],
        classOf[String],
        classOf[String],
        classOf[LogicalPlan],
        classOf[BaseStreamingSink],
        classOf[Trigger],
        classOf[Clock],
        classOf[OutputMode],
        classOf[Map[String, String]],
        classOf[Boolean])
      val streamExecution = constructor.newInstance(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        sink,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        new java.lang.Boolean(deleteCheckpointOnStop)).asInstanceOf[StreamExecution]
      streamingQueryWrapper = new StreamingQueryWrapper(streamExecution)
    } else {
      streamingQueryWrapper = (sink, trigger) match {
        case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
          UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
          new StreamingQueryWrapper(new ContinuousExecution(
            sparkSession,
            userSpecifiedName.orNull,
            checkpointLocation,
            analyzedPlan,
            v2Sink,
            trigger,
            triggerClock,
            outputMode,
            extraOptions,
            deleteCheckpointOnStop))
        case _ =>
          new StreamingQueryWrapper(new MicroBatchExecution(
            sparkSession,
            userSpecifiedName.orNull,
            checkpointLocation,
            analyzedPlan,
            sink,
            trigger,
            triggerClock,
            outputMode,
            extraOptions,
            deleteCheckpointOnStop))
      }
    }

    streamingQueryWrapper
  }
}
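
The two ideas in the listing above (choosing an execution engine from the sink and trigger, then registering the query and rolling the registration back if start() fails) can be sketched without Spark. This is only a minimal model under stated assumptions: every type here (SketchTrigger, SketchSink, ContinuousCapableSink, MicroBatchSketch, ContinuousExecSketch, QueryManagerSketch) is a hypothetical stand-in invented for illustration, not a Spark class.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for Spark's trigger, sink and execution types.
sealed trait SketchTrigger
case object ProcessingTimeSketch extends SketchTrigger
case object ContinuousSketch extends SketchTrigger

trait SketchSink
trait ContinuousCapableSink extends SketchSink // plays the role of a v2 sink

sealed trait SketchExecution {
  def id: String
  def start(): Unit
}

final case class MicroBatchSketch(id: String, failOnStart: Boolean = false)
    extends SketchExecution {
  // failOnStart lets us simulate a query whose start() throws.
  def start(): Unit =
    if (failOnStart) throw new IllegalStateException(s"query $id failed to start")
}

final case class ContinuousExecSketch(id: String) extends SketchExecution {
  def start(): Unit = ()
}

class QueryManagerSketch {
  private val lock = new Object
  private val active = mutable.Map.empty[String, SketchExecution]

  // Mirrors the (sink, trigger) match: only a continuous-capable sink combined
  // with a continuous trigger selects the continuous engine; everything else
  // falls through to micro-batch.
  def createQuery(
      id: String,
      sink: SketchSink,
      trigger: SketchTrigger,
      failOnStart: Boolean = false): SketchExecution =
    (sink, trigger) match {
      case (_: ContinuousCapableSink, ContinuousSketch) => ContinuousExecSketch(id)
      case _ => MicroBatchSketch(id, failOnStart)
    }

  // Register under the lock, start outside it, and undo the registration
  // if start() throws, then rethrow to the caller.
  def startQuery(query: SketchExecution): SketchExecution = {
    lock.synchronized {
      if (active.contains(query.id)) {
        throw new IllegalStateException(s"query ${query.id} is already active")
      }
      active.put(query.id, query)
    }
    try {
      query.start()
    } catch {
      case e: Throwable =>
        lock.synchronized { active -= query.id }
        throw e
    }
    query
  }

  def activeIds: Set[String] = lock.synchronized { active.keySet.toSet }
}
```

Starting the query outside the lock keeps a slow start() from blocking other callers, at the cost of needing the explicit rollback in the catch branch.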

II. Initialization of StreamExecution

1. StreamExecution source code analysis

The last step of StreamingQueryManager.startQuery(), query.streamingQuery.start(), is what actually launches StreamExecution's stream-processing thread:

abstract class StreamExecution(xxxxx)
    extends StreamingQuery with ProgressReporter with Logging {
   
      def start(): Unit = {
        logInfo(s"Starting $prettyIdString. Use $resolvedCheckpointRoot to store the query checkpoint.")
        queryExecutionThread.setDaemon(true)
        queryExecutionThread.start()
        startLatch.await()  // Wait until thread started and QueryStart event has been posted
      }
}

In effect, this executes the run() method of queryExecutionThread:

val queryExecutionThread: QueryExecutionThread =
    new QueryExecutionThread(s"stream execution thread for $prettyIdString") {
      override def run(): Unit = {
        // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
        // thread to this micro batch thread
        sparkSession.sparkContext.setCallSite(callSite)
        runStream()
      }
  }
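
The handshake between start() and the execution thread can be illustrated with a plain CountDownLatch. The sketch below is a simplified model, not the Spark code: LatchedRunnerSketch and its fields are hypothetical names, and it assumes the worker thread counts the latch down once its startup work is done, which is what the comment on startLatch.await() suggests.

```scala
import java.util.concurrent.CountDownLatch

// Hypothetical model of "start a daemon thread and block until it reports it is up".
class LatchedRunnerSketch(queryName: String) {
  private val startLatch = new CountDownLatch(1)
  @volatile private var startedFlag = false
  @volatile private var threadNameSeen = ""

  private val worker = new Thread(s"stream execution thread for $queryName") {
    override def run(): Unit = {
      // Startup work happens on the worker thread itself.
      threadNameSeen = Thread.currentThread().getName
      startedFlag = true
      // Release the caller blocked in start(); a real engine would go on
      // to run its processing loop after this point.
      startLatch.countDown()
    }
  }

  def start(): Unit = {
    worker.setDaemon(true) // do not keep the JVM alive just for this thread
    worker.start()
    startLatch.await()     // returns only after the worker has counted down
  }

  def isStarted: Boolean = startedFlag
  def ranOnThread: String = threadNameSeen
}
```

Because await() returns only after countDown(), the caller of start() is guaranteed to observe everything the worker wrote before releasing the latch.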

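runStream() consists of environment initialization plus a try{…}catch{…} structure that handles exceptions raised during startup and execution. Its core method is runActivatedStream(sparkSessionForStream), which is implemented separately by the two subclasses MicroBatchExecution (micro-batch processing) and ContinuousExecution (continuous processing).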

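The runStream() flow:
· Create the metrics
· postEvent(id, runId, name) sends the query-started Event to the listener bus
· Set the other conf variables
· runActivatedStream(sparkSessionForStream) runs the runActivatedStream() of MicroBatchExecution or ContinuousExecution, which keeps querying the stream
· A try{…}catch{…} captures exceptions raised during startup and execution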
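
The shape of this flow is a template method: the base class owns setup and error handling, and each subclass supplies only runActivatedStream(). The following is a rough sketch under that reading. All names (StreamExecutionSketch, MicroBatchRunSketch, FailingRunSketch, the event strings) are hypothetical, and the setup steps are reduced to log entries rather than real metrics or listener-bus calls.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical base class: fixed skeleton in runStream(), variable part in the subclass.
abstract class StreamExecutionSketch {
  // Records what happened, so the order of steps can be inspected afterwards.
  val events: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Each engine provides its own processing loop.
  protected def runActivatedStream(): Unit

  final def runStream(): Unit = {
    try {
      events += "init-metrics"        // stands in for creating the metrics
      events += "post-query-started"  // stands in for posting the start event
      events += "set-conf"            // stands in for setting other conf values
      runActivatedStream()
      events += "finished"
    } catch {
      case e: Exception =>
        // Startup and runtime failures both end up here.
        events += s"failed: ${e.getMessage}"
    }
  }
}

// A stand-in for a micro-batch engine: processes a fixed number of batches.
class MicroBatchRunSketch(batches: Int) extends StreamExecutionSketch {
  protected def runActivatedStream(): Unit =
    (0 until batches).foreach(i => events += s"batch-$i")
}

// An engine whose loop fails immediately, to exercise the catch branch.
class FailingRunSketch extends StreamExecutionSketch {
  protected def runActivatedStream(): Unit =
    throw new RuntimeException("source unavailable")
}
```

Marking runStream() as final keeps the setup and error handling identical for every engine, so a subclass can only change what happens inside the loop.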