import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object spark_kafka extends App {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "192.168.1.201:6667,192.168.1.202:6667",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "qwerty",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val conf = new SparkConf().setAppName("demo").setMaster("local[*]") // setMaster must not be omitted here, otherwise SparkContext initialization fails
val ssc = new StreamingContext(conf, Seconds(5))
/*
Notes:
LocationStrategies.PreferConsistent: in most cases you should use PreferConsistent, as above. It distributes partitions evenly across the available executors.
LocationStrategies.PreferBrokers: if your executors are on the same hosts as the Kafka brokers, use PreferBrokers, which prefers to schedule a partition on the Kafka leader for that partition.
LocationStrategies.PreferFixed: if there is significant load skew across partitions, use PreferFixed. It lets you specify an explicit mapping of partitions to hosts (any partition not specified falls back to the consistent placement). See the sketch right after this example.
*/
val topics = Array("recharge_topic_li")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent, // location strategy; see the notes above on PreferBrokers and PreferFixed
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => record.toString).print()
ssc.start()
ssc.awaitTermination()
}
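As a rough sketch of PreferFixed (the partition-to-host mapping and hosts below are hypothetical; it reuses the ssc, topics and kafkaParams from the example above and would simply replace the PreferConsistent argument):

import org.apache.kafka.common.TopicPartition

// Hypothetical skew scenario: pin the two heaviest partitions to dedicated hosts;
// any partition not listed here falls back to the consistent (evenly spread) placement.
val hostMap = Map(
  new TopicPartition("recharge_topic_li", 0) -> "192.168.1.201",
  new TopicPartition("recharge_topic_li", 1) -> "192.168.1.202"
)

val fixedStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferFixed(hostMap),
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)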
Let's start from ssc.start(), which is where job generation begins. Stepping into that method, there is a scheduler.start() call; let's take a look inside.
JobScheduler holds two important components: a JobGenerator and a ReceiverTracker.
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
            scheduler.start() // this is the JobScheduler; it holds the JobGenerator and the ReceiverTracker
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      logDebug("Adding shutdown hook") // force eager creation of logger
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
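Before moving on, a condensed sketch of what scheduler.start() does inside JobScheduler may help (paraphrased from memory of the Spark 2.x source; member names and details may differ slightly between versions):

def start(): Unit = synchronized {
  if (eventLoop != null) return // already started

  // Event loop that processes JobStarted / JobCompleted / ErrorReported events
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  listenerBus.start()
  receiverTracker = new ReceiverTracker(ssc)   // tracks receivers and received-block meta information
  inputInfoTracker = new InputInfoTracker(ssc) // tracks per-batch input record counts
  receiverTracker.start()
  jobGenerator.start()                         // starts the batch-interval timer discussed next
  logInfo("Started JobScheduler")
}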
The JobGenerator first asks the ReceiverTracker to allocate to the current batch the data accumulated since the previous batch, i.e. the data cut off at the last batch boundary is assigned to this batch and its meta information is obtained. It then asks the DStreamGraph to instantiate a fresh RDD DAG from its template: starting from the foreachRDD output streams at the tail, the whole dependency chain is traversed back to the head. Finally, the batch metadata from step one and the DAG from step two are submitted together to the JobScheduler for asynchronous execution.
Let's open the JobGenerator and see how it ties the DStreamGraph's job generation together with the per-batch meta information from the ReceiverTracker.
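The per-batch trigger for generateJobs is a recurring timer inside JobGenerator; a paraphrased snippet from the Spark 2.x source (exact field names may vary by version):

// Fires once per batch interval and posts a GenerateJobs(time) event to the JobGenerator's
// event loop, which in turn calls the generateJobs(time) method shown below.
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")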
/** Generate jobs and perform checkpointing for the given `time`. */
private def generateJobs(time: Time) { // driven by the JobGenerator timer; this is where DStreamGraph and ReceiverTracker meet for each batch
  // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
  // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
  ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate the received blocks to this batch and obtain the batch's meta information
    graph.generateJobs(time) // generate this batch's jobs from the DStreamGraph dependency graph
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
      PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
First, let's step into graph.generateJobs and see how the jobs are produced from the DStreamGraph dependency graph.
=========================================== DStreamGraph class
private val inputStreams = new ArrayBuffer[InputDStream[_]]()
private val outputStreams = new ArrayBuffer[DStream[_]]() // holds every registered output DStream; their dependency chains make up the DAG
=========================================== DStreamGraph class
def addOutputStream(outputStream: DStream[_]) {
  this.synchronized {
    outputStream.setGraph(this)
    outputStreams += outputStream // collect the registered output DStream into the graph
  }
}
=========================================== DStreamGraph class
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time) // outputStreams is an array of DStreams; each output DStream generates one job
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
=========================================== DStream class — this is where the dependency chain is strung together from the tail
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
// Every output operator ultimately calls foreachRDD. The ForEachDStream created here extends DStream,
// so it has a register() method; stepping into register() we see that it calls DStreamGraph.addOutputStream.
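As an illustration that every output operator goes through foreachRDD, here is a condensed sketch of DStream.print() (paraphrased from the Spark source; the printed header formatting is abbreviated):

def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      println(s"Time: $time")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
    }
  }
  // even print() is just a foreachRDD with a driver-side println closure
  foreachRDD(foreachFunc, displayInnerRDDOps = false)
}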
========================================== DStream class — this is where every output DStream is put into the DStreamGraph
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this) // calls the DStreamGraph.addOutputStream method shown above, via the graph held by the StreamingContext
  this
}
========================================== ForEachDStream class — a closer look
private[streaming]
class ForEachDStream[T: ClassTag] (
    // parent expresses the logical relationship between DStreams, while dependencies enumerates the
    // upstream DStreams that make up the DAG, so computation can be traced back through it
    parent: DStream[T], // parent points at this DStream's upstream DStream, forming one link of the DAG; dependencies is a collection of DStreams
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent) // the direct upstream DStream(s) this DStream depends on
  override def slideDuration: Duration = parent.slideDuration
  override def compute(validTime: Time): Option[RDD[Unit]] = None

  override def generateJob(time: Time): Option[Job] = { // wraps the user-defined output function into a Job
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}
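To connect this back to user code: the closure passed to the public foreachRDD becomes foreachFunc above, and generateJob wraps it into a Job every batch interval. A minimal, hypothetical usage on the stream from the opening example:

// Each batch, ForEachDStream.generateJob pairs this closure with the batch's RDD and
// wraps them into a Job that the JobScheduler later executes.
stream.foreachRDD { (rdd, time) =>
  println(s"batch $time contains ${rdd.count()} records")
}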
Now let's look at how received data flows from the Receiver to the ReceiverSupervisor and on to the ReceiverTracker.
We start from data reception, with the user-defined receiver side; here KafkaUtils.createDirectStream from the example above is used as the entry point.
========================================== KafkaUtils class
def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V]
  ): InputDStream[ConsumerRecord[K, V]] = { // the return type is InputDStream; let's step into it
  val ppc = new DefaultPerPartitionConfig(ssc.sparkContext.getConf)
  createDirectStream[K, V](ssc, locationStrategy, consumerStrategy, ppc)
}
====================================== InputDStream class
abstract class InputDStream[T: ClassTag](_ssc: StreamingContext) // an abstract class; among its implementations, ReceiverInputDStream is the one to look at next
  extends DStream[T](_ssc) {

  private[streaming] var lastValidTime: Time = null

  ssc.graph.addInputStream(this)

  /** This is a unique identifier for the input stream. */
  val id = ssc.getNewInputStreamId()

  // Keep track of the freshest rate for this stream using the rateEstimator
  protected[streaming] val rateController: Option[RateController] = None
====================================== ReceiverInputDStream class
def getReceiver(): Receiver[T] // this method returns the receiver object that actually receives the data; let's look at the Receiver class it returns
====================================== Receiver class
// Receiver exposes a family of store(...) methods that continuously hand received data over to the
// ReceiverSupervisor; they delegate to pushSingle / pushArrayBuffer / pushIterator / pushBytes.
def store(dataItem: T) {
  supervisor.pushSingle(dataItem) // pushSingle: a single small record
}

/** Store an ArrayBuffer of received data as a data block into Spark's memory. */
def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None) // pushArrayBuffer: data in array form
}

/**
 * Store an ArrayBuffer of received data as a data block into Spark's memory.
 * The metadata will be associated with this block of data, to be used in the corresponding InputDStream.
 */
def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
  supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
}

/** Store an iterator of received data as a data block into Spark's memory. */
def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None) // pushIterator: data in iterator form
}

/** Store the bytes of received data as a data block into Spark's memory. */
def store(bytes: ByteBuffer) {
  supervisor.pushBytes(bytes, None, None) // pushBytes: block data in ByteBuffer form
}
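To see where these store(...) calls typically come from, here is a minimal custom receiver sketch (hypothetical class, host and port; it only assumes the public org.apache.spark.streaming.receiver.Receiver API):

import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that reads lines from a socket and hands them to the
// ReceiverSupervisor one record at a time via store() (i.e. pushSingle).
class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("socket-line-receiver") {
      setDaemon(true)
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {} // the reading thread exits once the socket closes

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
      while (!isStopped() && lines.hasNext) {
        store(lines.next()) // forwarded to ReceiverSupervisor.pushSingle
      }
      socket.close()
      restart("Trying to reconnect")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
  }
}

Such a receiver would be plugged in with ssc.receiverStream(new SocketLineReceiver("localhost", 9999)) (hypothetical host/port), which yields a ReceiverInputDStream.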
====================================== ReceiverSupervisor class
def start() {
  onStart()
  startReceiver() // start the receiver, i.e. the user-implemented Receiver
}

def createBlockGenerator(blockGeneratorListener: BlockGeneratorListener): BlockGenerator // creates the block generator for this receiver; let's step into BlockGenerator
====================================== BlockGenerator class
/** Start block generating and pushing threads. */
def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    blockIntervalTimer.start()
    blockPushingThread.start()
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}
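Roughly, blockIntervalTimer seals whatever has accumulated in the current buffer into a block every spark.streaming.blockInterval (200ms by default), and blockPushingThread keeps handing sealed blocks over for storage. A paraphrased sketch of how they are declared (from memory of the Spark 2.x BlockGenerator; names may differ slightly by version):

private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
// Every blockIntervalMs, updateCurrentBuffer turns the records buffered so far into one block
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
// Dedicated thread that drains sealed blocks from the queue and pushes them to the block store
private val blockPushingThread = new Thread() {
  override def run(): Unit = keepPushingBlocks()
}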
From here on, the ReceiverSupervisor keeps receiving the data handed over by the Receiver:
If the records are small, the BlockGenerator first accumulates multiple records into one block (4a) and then stores it as a block (4b or 4c);
otherwise no accumulation is needed and the data is stored as a block directly (4b or 4c).
Spark Streaming currently supports two ways of storing blocks: BlockManagerBasedBlockHandler stores them directly in the executor's memory or disk, while WriteAheadLogBasedBlockHandler additionally writes a WAL (4c) alongside the executor's memory or disk.
Each time a block has been stored on the executor, the ReceiverSupervisor promptly reports the block's meta information to the ReceiverTracker on the driver; this meta information includes the block's id, its location, the number of records, its size, and so on.
The ReceiverTracker then hands the received block meta information straight to its member ReceivedBlockTracker, which is dedicated to managing the meta information of received blocks.
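If the WriteAheadLogBasedBlockHandler path is wanted, so that received blocks also survive failures via the WAL, it is switched on by configuration; a minimal sketch (the checkpoint directory below is a hypothetical HDFS path):

val walConf = new SparkConf()
  .setAppName("demo-with-wal")
  .setMaster("local[*]")
  // Makes ReceiverSupervisor store received blocks through WriteAheadLogBasedBlockHandler
  // instead of BlockManagerBasedBlockHandler
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val walSsc = new StreamingContext(walConf, Seconds(5))
// WAL files are written under the checkpoint directory (hypothetical path)
walSsc.checkpoint("hdfs:///tmp/streaming-checkpoint")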