Spark Structured Streaming Source Code Analysis (2): The StreamExecution Continuous Query Engine

This article takes a close look at Spark Structured Streaming's StreamExecution, in particular the initialization and stream-processing logic of MicroBatchExecution. Starting from StreamingQueryManager launching the stream, it follows the runActivatedStream() method of StreamExecution as implemented in MicroBatchExecution, and discusses the contents of the offsets and commits directories and how stream-processing progress is recovered from them.

1. StreamingQueryManager creates and starts the stream

Continuing from the previous article, which covered creating the stream's Source and Sink:
after the Sink is created, sessionState.streamingQueryManager.startQuery() is called to create and start the stream.
The corresponding StreamingQueryManager startup flow is shown below:

[Figure: StreamingQueryManager startQuery flow]

The key code of startQuery() and createQuery():

class StreamingQueryManager private[sql] (sparkSession: SparkSession) extends Logging {
   
  private[sql] def startQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean = false,
      recoverFromCheckpointLocation: Boolean = true,
      trigger: Trigger = ProcessingTime(0),
      triggerClock: Clock = new SystemClock()): StreamingQuery = {
    val query = createQuery(
      userSpecifiedName,
      userSpecifiedCheckpointLocation,
      df,
      extraOptions,
      sink,
      outputMode,
      useTempCheckpointLocation,
      recoverFromCheckpointLocation,
      trigger,
      triggerClock)

    activeQueriesLock.synchronized {
      // elided: checks for conflicting active query names/ids
      activeQueries.put(query.id, query)
    }
    try {
      query.streamingQuery.start()
    } catch {
      case e: Throwable =>
        activeQueriesLock.synchronized {
          activeQueries -= query.id
        }
        throw e
    }
    query
  }

  private def createQuery(
      userSpecifiedName: Option[String],
      userSpecifiedCheckpointLocation: Option[String],
      df: DataFrame,
      extraOptions: Map[String, String],
      sink: BaseStreamingSink,
      outputMode: OutputMode,
      useTempCheckpointLocation: Boolean,
      recoverFromCheckpointLocation: Boolean,
      trigger: Trigger,
      triggerClock: Clock): StreamingQueryWrapper = {
    val streamExecutionCls = extraOptions.getOrElse("stream.execution.class", "")
    var deleteCheckpointOnStop = false
    val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified =>
      new Path(userSpecified).toUri.toString
    }.orElse {
      // elided: falls back to the configured or temporary checkpoint location
      xxxx
    }

    val analyzedPlan = df.queryExecution.analyzed
    df.queryExecution.assertAnalyzed()

    if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
      UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
    }

    if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
      logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
          "is not supported in streaming DataFrames/Datasets and will be disabled.")
    }

    var streamingQueryWrapper: StreamingQueryWrapper = null

    if (streamExecutionCls.length > 0) {
      val cls = Utils.classForName(streamExecutionCls)
      val constructor = cls.getConstructor(
        classOf[SparkSession],
        classOf[String],
        classOf[String],
        classOf[LogicalPlan],
        classOf[BaseStreamingSink],
        classOf[Trigger],
        classOf[Clock],
        classOf[OutputMode],
        classOf[Map[String, String]],
        classOf[Boolean])
      val streamExecution = constructor.newInstance(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        sink,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        new java.lang.Boolean(deleteCheckpointOnStop)).asInstanceOf[StreamExecution]
      streamingQueryWrapper = new StreamingQueryWrapper(streamExecution)
    } else {
      streamingQueryWrapper = (sink, trigger) match {
        case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
          UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
          new StreamingQueryWrapper(new ContinuousExecution(
            sparkSession,
            userSpecifiedName.orNull,
            checkpointLocation,
            analyzedPlan,
            v2Sink,
            trigger,
            triggerClock,
            outputMode,
            extraOptions,
            deleteCheckpointOnStop))
        case _ =>
          new StreamingQueryWrapper(new MicroBatchExecution(
            sparkSession,
            userSpecifiedName.orNull,
            checkpointLocation,
            analyzedPlan,
            sink,
            trigger,
            triggerClock,
            outputMode,
            extraOptions,
            deleteCheckpointOnStop))
      }
    }

    streamingQueryWrapper
  }
}
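For orientation, here is a hedged sketch of how a user-level query reaches this code path. The socket source, console sink, checkpoint path, and object name are illustrative assumptions; the point is that DataStreamWriter.start() ends up invoking sessionState.streamingQueryManager.startQuery(), and the chosen Trigger decides in createQuery() between MicroBatchExecution and ContinuousExecution.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StartQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("start-query-example")
      .getOrCreate()

    // Illustrative source; any streaming source follows the same path.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // DataStreamWriter.start() internally calls
    // sessionState.streamingQueryManager.startQuery(); with a
    // ProcessingTime trigger, createQuery() builds a MicroBatchExecution
    // (a ContinuousTrigger would select ContinuousExecution instead).
    val query = lines.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/example-checkpoint") // assumed path
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}
```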

2. Initialization of StreamExecution

2.1 StreamExecution source code analysis

The query.streamingQuery.start() call described in the last step of StreamingQueryManager.startQuery() is what actually creates StreamExecution's stream-processing thread:

abstract class StreamExecution(xxxxx)
    extends StreamingQuery with ProgressReporter with Logging {

  def start(): Unit = {
    logInfo(s"Starting $prettyIdString. Use $resolvedCheckpointRoot to store the query checkpoint.")
    queryExecutionThread.setDaemon(true)
    queryExecutionThread.start()
    startLatch.await()  // Wait until thread started and QueryStart event has been posted
  }
}
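The startLatch.await() call is a standard CountDownLatch handshake: start() only returns after the query thread is actually running and has posted its start event. A minimal standalone sketch of the same pattern (names and the print output are illustrative, not Spark's):

```scala
import java.util.concurrent.CountDownLatch

object LatchHandshakeSketch {
  def main(args: Array[String]): Unit = {
    val startLatch = new CountDownLatch(1)

    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        // ... initialization work, e.g. posting a "query started" event ...
        startLatch.countDown() // signal the caller that startup has completed
        // ... the long-lived stream loop would run here ...
      }
    })
    worker.setDaemon(true)
    worker.start()

    startLatch.await() // block, like StreamExecution.start(), until the worker signals
    println("worker thread is up; start() can now return")
  }
}
```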

What actually runs is queryExecutionThread's run() method:

val queryExecutionThread: QueryExecutionThread =
  new QueryExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
      // thread to this micro batch thread
      sparkSession.sparkContext.setCallSite(callSite)
      runStream()
    }
  }

runStream() consists of environment initialization plus a try{…}catch{…} structure that handles exceptions during startup and execution. Its core call is runActivatedStream(sparkSessionForStream), for which the two subclasses MicroBatchExecution (micro-batch processing) and ContinuousExecution (continuous processing) each provide their own concrete implementation.

The runStream() flow (a simplified sketch follows the list):
· Create metrics
· Post a QueryStartedEvent(id, runId, name) to the listener bus via postEvent()
· Set other conf variables
· Call runActivatedStream(sparkSessionForStream), i.e. MicroBatchExecution's or ContinuousExecution's runActivatedStream(), to continuously query the stream
· Wrap startup and execution in try{…}catch{…} to capture exceptions
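Putting those steps together, the shape of runStream() can be mirrored in a heavily simplified, standalone sketch; postEvent and runActivatedStream below are stand-ins for the real Spark members, and the string events are placeholders for the actual listener-bus event objects:

```scala
// Simplified sketch of the runStream() control flow; not the full source.
object RunStreamSketch {
  def postEvent(event: String): Unit = println(s"posted: $event")

  // In Spark this is abstract; MicroBatchExecution and ContinuousExecution
  // each supply their own trigger loop.
  def runActivatedStream(): Unit = println("running activated stream ...")

  def runStream(): Unit = {
    try {
      // metrics registration and per-query conf setup happen here in the real code
      postEvent("QueryStartedEvent(id, runId, name)")
      runActivatedStream()
    } catch {
      case e: Throwable =>
        // the real code records the cause and wraps it in a StreamingQueryException
        println(s"stream failed: ${e.getMessage}")
    } finally {
      postEvent("QueryTerminatedEvent") // emitted on both normal and error termination
    }
  }

  def main(args: Array[String]): Unit = runStream()
}
```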
