To explore the complete execution flow of Spark Streaming, let's first look at the Spark Streaming example provided in the examples module of the Spark source project:
org.apache.spark.examples.streaming.DirectKafkaWordCount
package org.apache.spark.examples.streaming
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
/**
* Consumes messages from one or more topics in Kafka and does wordcount.
* Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
* <brokers> is a list of one or more Kafka brokers
* <groupId> is a consumer group name to consume from topics
* <topics> is a list of one or more kafka topics to consume from
*
* Example:
* $ bin/run-example streaming.DirectKafkaWordCount broker1-host:port,broker2-host:port \
* consumer-group topic1,topic2
*/
object DirectKafkaWordCount {
def main(args: Array[String]) {
if (args.length < 3) {
System.err.println(s"""
|Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
| <brokers> is a list of one or more Kafka brokers
| <groupId> is a consumer group name to consume from topics
| <topics> is a list of one or more kafka topics to consume from
|
""".stripMargin)
System.exit(1)
}
StreamingExamples.setStreamingLogLevels()
val Array(brokers, groupId, topics) = args
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
}
}
Analyzing this example, we can see that a Spark Streaming application is generally built in the following five steps:
- Initialize the streaming context, i.e. create a StreamingContext as the entry point of the streaming program; during this step the SparkContext (Spark's execution context) is created as well;
- Create the input streams (Input DStreams);
- Apply transformations to the DStreams, building up the DStream DAG;
- Apply output operations to emit the results;
- Start the StreamingContext and await termination.
1. Initializing the StreamingContext
The most important pieces of the initialization flow are:
- SparkContext initialization, which mainly prepares everything a job needs to run (network communication, serialization/deserialization, storage management, etc.; this deserves its own source-code walkthrough and is not covered here);
- Construction of the JobScheduler, which uses a JobGenerator to generate jobs and then schedules them for execution;
- Construction of the JobGenerator, which generates jobs from the DStreams (or from checkpoint data) and also cleans up DStream metadata, as sketched below.
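A simplified sketch of how these objects are wired together, paraphrased from the StreamingContext and JobScheduler constructors (field names may differ slightly across Spark versions):
// Inside StreamingContext (simplified): the DStreamGraph and the JobScheduler are created eagerly
private[streaming] val graph: DStreamGraph = new DStreamGraph()
private[streaming] val scheduler = new JobScheduler(this)

// Inside JobScheduler (simplified): it owns the JobGenerator that produces the jobs for every batch
private val jobGenerator = new JobGenerator(this)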
2. Creating the Input DStreams
StreamingContext provides a variety of methods for creating input streams, such as methods that create a ReceiverInputDStream, a FileInputDStream, and so on.
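For example (the hostname, port and directory below are placeholders):
// A socket text source, backed by a ReceiverInputDStream[String]
val socketLines = ssc.socketTextStream("localhost", 9999)
// A file-based source, backed internally by a FileInputDStream
val fileLines = ssc.textFileStream("hdfs:///path/to/dir")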
The important methods of InputDStream, several of which are inherited from DStream, are described below.
1) `compute(validTime: Time): Option[RDD[T]]`
Generates the RDD for the given batch time.
2) `dependencies: List[DStream[_]]`
The DStream's dependencies. An InputDStream sits at the head of the pipeline and has no upstream DStreams, so it overrides this method to return an empty list:
`override def dependencies: List[DStream[_]] = List()`
3) `generateJob(time: Time): Option[Job]`
Generates a job: the method mainly calls getOrCompute to produce the batch's RDD and then wraps it in a Job object.
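A simplified paraphrase of the implementation (details vary slightly across Spark versions):
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) =>
      // The job body just runs an "empty" action over the RDD; the real work is
      // whatever computation the DStream lineage has wired into that RDD
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}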
4) `getOrCompute(time: Time): Option[RDD[T]]`
This method calls compute to generate the RDD for the batch and, if configured, persists and checkpoints it.
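Roughly, as a simplified paraphrase (the real method additionally handles call sites and thread-local properties):
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // Reuse the RDD if this batch time has already been computed
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      val rddOption = compute(time)
      rddOption.foreach { newRDD =>
        // Persist and checkpoint the new RDD if the DStream is configured to do so
        if (storageLevel != StorageLevel.NONE) newRDD.persist(storageLevel)
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}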
5) `register()`
/**
* Register this streaming as an output stream. This would ensure that RDDs of this
* DStream will be generated.
*/
private[streaming] def register(): DStream[T] = {
ssc.graph.addOutputStream(this)
this
}
Registers the stream with the DStreamGraph. It is what output operators use to register the output DStream they produce (for example a ForEachDStream); in effect the stream is simply added to the DStreamGraph's outputStreams collection.
Correspondingly, when an InputDStream is constructed it calls DStreamGraph.addInputStream to register itself in the DStreamGraph's inputStreams collection.
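For example, print() ends up going through foreachRDD, which builds a ForEachDStream and registers it (simplified paraphrase of DStream.foreachRDD):
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  // Wrap this DStream in a ForEachDStream and register it as an output stream
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}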
6) `start()`
Starts receiving data. This is an abstract method introduced by InputDStream and implemented by each subclass:
/** Method called to start receiving data. Subclasses must implement this method. */
def start(): Unit
7) `stop()`
Stops receiving data:
/** Method called to stop receiving data. Subclasses must implement this method. */
def stop(): Unit
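Putting these pieces together, a minimal InputDStream subclass only needs compute, start and stop. The sketch below is modeled on Spark's ConstantInputDStream and simply replays a fixed RDD every batch; the class name is made up for illustration:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class FixedRDDInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
  extends InputDStream[T](_ssc) {

  // Nothing to start or stop: the data is already materialized as an RDD
  override def start(): Unit = {}
  override def stop(): Unit = {}

  // Every batch interval, hand the same RDD back to the framework
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}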
3. DStream Transformations
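Transformations are lazy: they only extend the DStream lineage, and each batch's RDD is derived from its parents when a job runs. The word-count example above already uses map, flatMap and reduceByKey; building on it, a windowed variant looks like this (the window and slide durations are illustrative):
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
// Stateless, per-batch counting
val counts = words.map(word => (word, 1L)).reduceByKey(_ + _)
// Windowed counting over the last 10 seconds, recomputed every 2 seconds
val windowedCounts = words.map(word => (word, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2))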
4. Output Operations on DStreams
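Output operations are the operators that call register() (see above) and are therefore what drive job generation for each batch. Besides print(), common choices include saveAsTextFiles and foreachRDD (the output path below is a placeholder):
wordCounts.print()
// Write each batch as a set of text files under the given prefix
wordCounts.saveAsTextFiles("hdfs:///output/wordcounts")
// foreachRDD: the closure runs on the driver once per batch; RDD actions inside it run on the executors
wordCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}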
5. Starting the StreamingContext and Awaiting Termination
The previous four steps mainly initialize the execution environment and build the DStreamGraph; no stream processing actually starts, even though output operations have been called.
A Spark Streaming job only really starts executing when StreamingContext.start() is called, so this method's source code deserves a close look:
/**
* Start the execution of the streams.
*
* @throws IllegalStateException if the StreamingContext is already stopped.
*/
def start(): Unit = synchronized {
state match {
case INITIALIZED =>
startSite.set(DStream.getCreationSite())
StreamingContext.ACTIVATION_LOCK.synchronized {
StreamingContext.assertNoOtherContextIsActive()
try {
validate()
// Start the streaming scheduler in a new thread, so that thread local properties
// like call sites and job groups can be reset without affecting those of the
// current thread.
ThreadUtils.runInNewThread("streaming-start") {
sparkContext.setCallSite(startSite.get)
sparkContext.clearJobGroup()
sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
scheduler.start()
}
state = StreamingContextState.ACTIVE
scheduler.listenerBus.post(
StreamingListenerStreamingStarted(System.currentTimeMillis()))
} catch {
case NonFatal(e) =>
logError("Error starting the context, marking it as stopped", e)
scheduler.stop(false)
state = StreamingContextState.STOPPED
throw e
}
StreamingContext.setActiveContext(this)
}
logDebug("Adding shutdown hook") // force eager creation of logger
shutdownHookRef = ShutdownHookManager.addShutdownHook(
StreamingContext.SHUTDOWN_HOOK_PRIORITY)(() => stopOnShutdown())
// Registering Streaming Metrics at the start of the StreamingContext
assert(env.metricsSystem != null)
env.metricsSystem.registerSource(streamingSource)
uiTab.foreach(_.attach())
logInfo("StreamingContext started")
case ACTIVE =>
logWarning("StreamingContext has already been started")
case STOPPED =>
throw new IllegalStateException("StreamingContext has already been stopped")
}
}
The overall flow is:
StreamingContext --> JobScheduler --> JobGenerator --> DStreamGraph --> SparkContext.runJob
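The key link in that chain is DStreamGraph.generateJobs, which asks every registered output stream for a job at the given batch time. A simplified paraphrase (the real method also sets call sites):
def generateJobs(time: Time): Seq[Job] = this.synchronized {
  outputStreams.flatMap { outputStream =>
    // Each output stream turns its batch RDD into a Job via generateJob/getOrCompute
    outputStream.generateJob(time)
  }
}
JobGenerator then wraps the returned jobs in a JobSet and hands them to JobScheduler.submitJobSet; each job is executed by a JobHandler thread, and running the job function ultimately invokes SparkContext.runJob on that batch's RDD.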