To explore the complete execution flow of Spark Streaming, let's first look at the Spark Streaming example provided in the examples module of the Spark source project:
org.apache.spark.examples.streaming.DirectKafkaWordCount
package org.apache.spark.examples.streaming
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
/**
* Consumes messages from one or more topics in Kafka and does wordcount.
* Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
* <brokers> is a list of one or more Kafka brokers
* <groupId> is a consumer group name to consume from topics
* <topics> is a list of one or more kafka topics to consume from
*
* Example:
* $ bin/run-example streaming.DirectKafkaWordCount broker1-host:port,broker2-host:port \
* consumer-group topic1,topic2
*/
object DirectKafkaWordCount {
def main(args: Array[String]) {
if (args.length < 3) {
System.err.println(s"""
|Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
| <brokers> is a list of one or more Kafka brokers
| <groupId> is a consumer group name to consume from topics
| <topics> is a list of one or more kafka topics to consume from
|
""".stripMargin)
System.exit(1)
}
StreamingExamples.setStreamingLogLevels()
val Array(brokers, groupId, topics) = args
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
}
}
Analyzing this example, we can see that a Spark Streaming application is generally built in the following five steps:
- Initialize the streaming context, i.e. create a StreamingContext as the entry point of the streaming program; during this step the SparkContext (Spark's execution context) is created as well;
- Create the input streams (Input DStreams);
- Apply transformations to the DStreams, building up the DStream DAG;
- Apply output operations to emit the results;
- Start the StreamingContext and await termination.
1. Initializing the StreamingContext
The most important pieces of the initialization flow are:
- SparkContext initialization, which mainly prepares everything a job needs to run (network communication, serialization/deserialization, storage management, etc.; this deserves its own source-code walkthrough and is not covered here);
- Construction of the JobScheduler, which uses a JobGenerator to generate jobs and then schedules them for execution;
- Construction of the JobGenerator, which generates jobs from the DStreams (or from checkpoint data) and also cleans up DStream metadata, as sketched below.
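A simplified sketch of how these objects are wired together, paraphrased from the StreamingContext and JobScheduler constructors (field names may differ slightly across Spark versions):
// Inside StreamingContext (simplified): the DStreamGraph and the JobScheduler are created eagerly
private[streaming] val graph: DStreamGraph = new DStreamGraph()
private[streaming] val scheduler = new JobScheduler(this)

// Inside JobScheduler (simplified): it owns the JobGenerator that produces the jobs for every batch
private val jobGenerator = new JobGenerator(this)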
2. Creating the Input DStreams
StreamingContext provides a variety of methods for creating input streams, such as methods that create a ReceiverInputDStream, a FileInputDStream, and so on.
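For example (the hostname, port and directory below are placeholders):
// A socket text source, backed by a ReceiverInputDStream[String]
val socketLines = ssc.socketTextStream("localhost", 9999)
// A file-based source, backed internally by a FileInputDStream
val fileLines = ssc.textFileStream("hdfs:///path/to/dir")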
The important methods of InputDStream, several of which are inherited from DStream, are described below.
1) `compute(validTime: Time): Option[RDD[T]]`
Generates the RDD for the given batch time.
2) `dependencies: List[DStream[_]]`
The DStream's dependencies. An InputDStream sits at the head of the pipeline and has no upstream DStreams, so it overrides this method to return an empty list:
`override def dependencies: List[DStream[_]] = List()`
3) `generateJob(time: Time): Option[Job]`
Generates a job: the method mainly calls getOrCompute to produce the batch's RDD and then wraps it in a Job object.
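A simplified paraphrase of the implementation (details vary slightly across Spark versions):
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) =>
      // The job body just runs an "empty" action over the RDD; the real work is
      // whatever computation the DStream lineage has wired into that RDD
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}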
4) `getOrCompute(time: Time): Option[RDD[T]]`
This method calls compute to generate the RDD for the batch and, if configured, persists and checkpoints it.
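Roughly, as a simplified paraphrase (the real method additionally handles call sites and thread-local properties):
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // Reuse the RDD if this batch time has already been computed
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      val rddOption = compute(time)
      rddOption.foreach { newRDD =>
        // Persist and checkpoint the new RDD if the DStream is configured to do so
        if (storageLevel != StorageLevel.NONE) newRDD.persist(storageLevel)
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}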
5) `register()`
/**
* Register this streaming as an output stream. This would ensure that RDDs of this
* DStream will be generated.
*/
private[streaming] def register(): DStream[T] = {
ssc.graph.addOutputStream(this)
this
}
Registers the stream with the DStreamGraph. It is what output operators use to register the output DStream they produce (for example a ForEachDStream); in effect the stream is simply added to the DStreamGraph's outputStreams collection.
Correspondingly, when an InputDStream is constructed it calls DStreamGraph.addInputStream to register itself in the DStreamGraph's inputStreams collection.
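For example, print() ends up going through foreachRDD, which builds a ForEachDStream and registers it (simplified paraphrase of DStream.foreachRDD):
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  // Wrap this DStream in a ForEachDStream and register it as an output stream
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}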
6) `start()`
Starts receiving data. This is an abstract method introduced by InputDStream and implemented by each subclass:
/** Method called to start receiving data. Subclasses must implement this method. */
def start(): Unit
7) `stop()`
Stops receiving data:
/** Method called to stop receiving data. Subclasses must implement this method. */
def stop(): Unit
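Putting these pieces together, a minimal InputDStream subclass only needs compute, start and stop. The sketch below is modeled on Spark's ConstantInputDStream and simply replays a fixed RDD every batch; the class name is made up for illustration:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class FixedRDDInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
  extends InputDStream[T](_ssc) {

  // Nothing to start or stop: the data is already materialized as an RDD
  override def start(): Unit = {}
  override def stop(): Unit = {}

  // Every batch interval, hand the same RDD back to the framework
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}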
3. DStream Transformations
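Transformations are lazy: they only extend the DStream lineage, and each batch's RDD is derived from its parents when a job runs. The word-count example above already uses map, flatMap and reduceByKey; building on it, a windowed variant looks like this (the window and slide durations are illustrative):
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
// Stateless, per-batch counting
val counts = words.map(word => (word, 1L)).reduceByKey(_ + _)
// Windowed counting over the last 10 seconds, recomputed every 2 seconds
val windowedCounts = words.map(word => (word, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2))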
4. Output Operations on DStreams
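Output operations are the operators that call register() (see above) and are therefore what drive job generation for each batch. Besides print(), common choices include saveAsTextFiles and foreachRDD (the output path below is a placeholder):
wordCounts.print()
// Write each batch as a set of text files under the given prefix
wordCounts.saveAsTextFiles("hdfs:///output/wordcounts")
// foreachRDD: the closure runs on the driver once per batch; RDD actions inside it run on the executors
wordCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}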
5. Starting the StreamingContext and Awaiting Termination
The previous four steps mainly initialize the execution environment and build the DStreamGraph; no stream processing actually starts, even though output operations have been called.
A Spark Streaming job only really starts executing when StreamingContext.start() is called, so this method's source code deserves a close look:
/**
* Start the execution of the streams.
*
* @throws IllegalStateException if the StreamingContext is already stopped.
*/
def start(): Unit = synchronized {
state match {
case INITIALIZED =>
startSite.set(DStream.getCreationSite())
StreamingContext.ACTIVATION_LOCK.synchronized {
StreamingContext.assertNoOtherContextIsActive()
try {
validate()
// Start the streaming scheduler in a new thread, so that thread local properties
// like call sites and job groups can be reset without affecting those of the
// current thread.
ThreadUtils.runInNewThread("streaming-start") {
sparkContext.setCallSite(startSite.get)
sparkContext.clearJobGroup()
sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
scheduler.start()
}
state = StreamingContextState.ACTIVE
scheduler.listenerBus.post(
StreamingListenerStreamingStarted(System.currentTimeMillis()))
} catch {
case NonFatal(e) =>
logError("Error starting the context, marking it as stopped", e)
scheduler.stop(false)
state = StreamingContextState.STOPPED
throw e
}
StreamingContext.setActiveContext(this)
}
logDebug("Adding shutdown hook") // force eager creation of logger
shutdownHookRef = ShutdownHookManager.addShutdownHook(
StreamingContext.SHUTDOWN_HOOK_PRIORITY)(() => stopOnShutdown())
// Registering Streaming Metrics at the start of the StreamingContext
assert(env.metricsSystem != null)
env.metricsSystem.registerSource(streamingSource)
uiTab.foreach(_.attach())
logInfo("StreamingContext started")
case ACTIVE =>
logWarning("StreamingContext has already been started")
case STOPPED =>
throw new IllegalStateException("StreamingContext has already been stopped")
}
}
The overall flow is:
StreamingContext --> JobScheduler --> JobGenerator --> DStreamGraph --> SparkContext.runJob
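The key link in that chain is DStreamGraph.generateJobs, which asks every registered output stream for a job at the given batch time. A simplified paraphrase (the real method also sets call sites):
def generateJobs(time: Time): Seq[Job] = this.synchronized {
  outputStreams.flatMap { outputStream =>
    // Each output stream turns its batch RDD into a Job via generateJob/getOrCompute
    outputStream.generateJob(time)
  }
}
JobGenerator then wraps the returned jobs in a JobSet and hands them to JobScheduler.submitJobSet; each job is executed by a JobHandler thread, and running the job function ultimately invokes SparkContext.runJob on that batch's RDD.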