Spark Customization Class, Lesson 11: Spark Streaming Source Code Walkthrough, the ReceiverTracker Architecture in the Driver

andyshar

Article link: https://blog.csdn.net/andyshar/article/details/51483184

In this lesson:

1. The architectural design of ReceiverTracker

2. The message loop system

3. The implementation of ReceiverTracker

The ReceiverTracker class is discussed below; the comments in its source make its role clear.

It manages the execution of the ReceiverInputDStreams and records the metadata sent by the receivers. A ReceiverTracker must be constructed with a StreamingContext object.

ReceiverTracker:

private val receivedBlockTracker = new ReceivedBlockTracker(
  ssc.sparkContext.conf,
  ssc.sparkContext.hadoopConfiguration,
  receiverInputStreamIds,
  ssc.scheduler.clock,
  ssc.isCheckpointPresent,
  Option(ssc.checkpointDir)
)

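receivedBlockTracker records and tracks the metadata of the blocks sent by the receivers.

Inside the ReceiverTracker class is the message endpoint ReceiverTrackerEndpoint, which communicates with the ReceiverSupervisorImpl running on the executors. The messages exchanged cover registering a receiver, restarting a receiver, cleaning up old block data, updating the rate limit, and adding block metadata.

ReceiverTracker:

/** RpcEndpoint to receive messages from the receivers. */
private class ReceiverTrackerEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {
  ...
}

This is the RPC message endpoint.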
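The endpoint idea can be illustrated with a small standalone sketch. It does not use Spark's RPC classes: the message types, the `TrackerEndpoint` class, and the reply callback below are simplified, hypothetical stand-ins, and the rule that an unregistered stream's block is rejected is invented for the sketch only.

```scala
import scala.collection.mutable

object EndpointSketch {
  // Hypothetical, simplified stand-ins for the real Spark messages.
  sealed trait TrackerMessage
  final case class RegisterReceiver(streamId: Int, host: String) extends TrackerMessage
  final case class AddBlock(streamId: Int, blockId: String) extends TrackerMessage
  final case class DeregisterReceiver(streamId: Int, reason: String) extends TrackerMessage

  final class TrackerEndpoint {
    private val registered = mutable.Set.empty[Int]
    private val blocks = mutable.ListBuffer.empty[AddBlock]

    // Same shape as receiveAndReply: match on the message, then hand the
    // outcome to a reply callback (a stand-in for context.reply).
    def receiveAndReply(reply: Boolean => Unit): PartialFunction[Any, Unit] = {
      case RegisterReceiver(streamId, _) =>
        reply(registered.add(streamId)) // false if the stream was already registered
      case msg @ AddBlock(streamId, _) =>
        // Sketch-only rule: blocks from unregistered streams are rejected.
        val known = registered.contains(streamId)
        if (known) blocks += msg
        reply(known)
      case DeregisterReceiver(streamId, _) =>
        registered.remove(streamId)
        reply(true)
    }

    def blockCount: Int = blocks.size
  }
}
```

Because the handler is a PartialFunction, a message type it does not list is simply not defined for it, which is how an endpoint can ignore traffic it does not understand.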

Next, we trace how the AddBlock message is processed. ReceiverSupervisorImpl on the executor side sends this message to add block metadata.

ReceiverTracker:

...
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful =
      registerReceiver(streamId, typ, host, executorId, receiverEndpoint, context.senderAddress)
    context.reply(successful)
  // If WAL batching is enabled, addBlock runs in a thread pool; otherwise it is called directly.
  // Either way, the reply tells ReceiverSupervisorImpl whether the metadata was added successfully.
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    deregisterReceiver(streamId, message, error)
    context.reply(true)
  // Local messages
  case AllReceiverIds =>
    context.reply(receiverTrackingInfos.filter(_._2.state != ReceiverState.INACTIVE).keys.toSeq)
  case StopAllReceivers =>
    assert(isTrackerStopping || isTrackerStopped)
    stopReceivers()
    context.reply(true)
}
...
/** Add new blocks for the given stream */
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}
...
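The two reply paths in the AddBlock case can be sketched without Spark. In this minimal example, `CallContext` and `AddBlockHandler` are hypothetical names standing in for RpcCallContext and the endpoint, and the pool size of 2 is an arbitrary assumption. It shows only the control flow: reply inline, or reply later from a pool thread so that a slow write does not hold up the thread that dispatches messages.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object ReplySketch {
  // Hypothetical stand-in for RpcCallContext: records one reply and lets a caller wait for it.
  final class CallContext {
    private val done = new CountDownLatch(1)
    private val result = new AtomicReference[Option[Boolean]](None)

    def reply(value: Boolean): Unit = {
      result.set(Some(value))
      done.countDown()
    }

    // Waits up to five seconds; returns None if no reply arrived in time.
    def await(): Option[Boolean] = {
      done.await(5, TimeUnit.SECONDS)
      result.get()
    }
  }

  final class AddBlockHandler(batchingEnabled: Boolean) {
    // Threads are created lazily, so an unused pool costs nothing.
    private val pool = Executors.newFixedThreadPool(2)

    def handle(context: CallContext)(addBlock: () => Boolean): Unit = {
      if (batchingEnabled) {
        // Reply asynchronously from a pool thread.
        pool.execute(new Runnable {
          override def run(): Unit = context.reply(addBlock())
        })
      } else {
        // Reply on the calling thread.
        context.reply(addBlock())
      }
    }

    def stop(): Unit = pool.shutdown()
  }
}
```

In both paths the caller sees the same Boolean reply; only the thread that produces it differs.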

The ReceivedBlockInfo class carries the stream ID, the number of records in the block, optional metadata, and the result of storing the received block (its block ID and record count).

ReceivedBlockInfo:

...
/** Information about blocks received by the receiver */
private[streaming] case class ReceivedBlockInfo(
    streamId: Int,
    numRecords: Option[Long],
    metadataOption: Option[Any],
    blockStoreResult: ReceivedBlockStoreResult
  ) {
  ...
}

The ReceivedBlockTracker class holds the actual implementation of addBlock.

...
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    // writeToLog writes the event to the WAL when the WAL is enabled and reports whether that succeeded
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        // Append receivedBlockInfo to the queue of its stream
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
...

addBlock calls ReceivedBlockTracker's writeToLog method:

/** Write an update to the tracker to the write ahead log */
private def writeToLog(record: ReceivedBlockTrackerLogEvent): Boolean = {
  if (isWriteAheadLogEnabled) {
    logTrace(s"Writing record: $record")
    try {
      writeAheadLogOption.get.write(ByteBuffer.wrap(Utils.serialize(record)),
        clock.getTimeMillis())
      true
    } catch {
      case NonFatal(e) =>
        logWarning(s"Exception thrown while writing record: $record to the WriteAheadLog.", e)
        false
    }
  } else {
    true
  }
}
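The ordering that addBlock and writeToLog enforce (log first, update in-memory state only if the log write succeeded) can be shown in isolation. This is a minimal sketch under stated assumptions: the `Log` trait and `Tracker` class are hypothetical and far simpler than Spark's WriteAheadLog API, and records are plain strings.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object WriteAheadSketch {
  // Hypothetical log interface; the real WriteAheadLog API is different.
  trait Log {
    def write(record: String): Unit
  }

  final class Tracker(log: Option[Log]) {
    private val state = mutable.ArrayBuffer.empty[String]

    // True when no log is configured or when the write succeeded.
    private def writeToLog(record: String): Boolean = log match {
      case None => true
      case Some(l) =>
        try {
          l.write(record)
          true
        } catch {
          case NonFatal(_) => false
        }
    }

    // In-memory state changes only after the record has been logged.
    def add(record: String): Boolean = {
      val ok = writeToLog(record)
      if (ok) synchronized { state += record }
      ok
    }

    def snapshot: List[String] = synchronized(state.toList)
  }
}
```

With this ordering, anything visible in memory is also in the log, so replaying the log after a crash cannot miss state that the program had already acted on.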

addBlock also calls ReceivedBlockTracker's getReceivedBlockQueue method. Here streamIdToUnallocatedBlockQueues is a HashMap whose key is the stream ID and whose value is a ReceivedBlockQueue, and ReceivedBlockQueue is defined as private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo].

/** Get the queue of received blocks belonging to a particular stream */
private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
  // Look up the ReceivedBlockQueue for this stream ID, creating it on first use
  streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
}
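The per-stream lookup relies on `getOrElseUpdate` from the Scala standard library, which returns the existing value for a key or stores and returns a fresh one. A minimal sketch, with block IDs reduced to strings and the `Queues` class name chosen only for illustration:

```scala
import scala.collection.mutable

object QueueSketch {
  type BlockQueue = mutable.Queue[String]

  final class Queues {
    private val byStream = mutable.HashMap.empty[Int, BlockQueue]

    // The first call for a stream creates its queue; later calls return that same instance.
    def queueFor(streamId: Int): BlockQueue =
      byStream.getOrElseUpdate(streamId, mutable.Queue.empty[String])
  }
}
```

Because the same queue instance is returned every time, appending to the result of `queueFor` is enough to record a block for that stream.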

As its source shows, the ReceivedBlockTracker class records every received block and allocates blocks to batches as needed. If a checkpoint directory is set and the WAL is enabled, every operation is also saved to the write ahead log, so that after a driver failure the tracker's state can be recovered from the checkpoint and the WAL.

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  // Blocks that have been received but not yet allocated to a batch, keyed by stream ID
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  ...
}

An important method of the ReceivedBlockTracker class is allocateBlocksToBatch.

/**
 * Allocate all unallocated blocks to the given batch.
 * This event will get written to the write ahead log (if enabled).
 */
def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
  // Proceed only if this batch time is later than the last allocated batch time
  if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
    // Drain each stream's queue and build a map from streamId to the blocks dequeued for it
    val streamIdToBlocks = streamIds.map { streamId =>
      (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
    }.toMap
    val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
    // Write the allocation to the WAL if it is enabled
    if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
      // Store the allocation in timeToAllocatedBlocks
      timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
      lastAllocatedBatchTime = batchTime
    } else {
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  } else {
    // This situation occurs when:
    // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
    // possibly processed batch job or half-processed batch job need to be processed again,
    // so the batchTime will be equal to lastAllocatedBatchTime.
    // 2. Slow checkpointing makes recovered batch time older than WAL recovered
    // lastAllocatedBatchTime.
    // This situation will only occurs in recovery time.
    logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
  }
}
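The allocation step can be tried out in a standalone sketch. The assumptions are loud ones: `BlockTracker` is a hypothetical class, batch times are plain `Long` values, block metadata is reduced to a string ID, and the write ahead log is left out entirely. The sketch also includes a lookup method so that the result of an allocation can be inspected.

```scala
import scala.collection.mutable

object AllocationSketch {
  final class BlockTracker(streamIds: Seq[Int]) {
    private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[String]]
    private val allocated = mutable.HashMap.empty[Long, Map[Int, Seq[String]]]
    private var lastAllocatedBatchTime: Option[Long] = None

    private def queueFor(streamId: Int): mutable.Queue[String] =
      unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[String])

    def addBlock(streamId: Int, blockId: String): Unit = synchronized {
      queueFor(streamId) += blockId
    }

    // Moves every waiting block into the given batch, unless that batch time
    // is not later than the most recently allocated one.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      if (lastAllocatedBatchTime.forall(batchTime > _)) {
        val streamIdToBlocks = streamIds.map { id =>
          id -> queueFor(id).dequeueAll(_ => true).toList
        }.toMap
        allocated.put(batchTime, streamIdToBlocks)
        lastAllocatedBatchTime = Some(batchTime)
      }
    }

    def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[String]] = synchronized {
      allocated.getOrElse(batchTime, Map.empty[Int, Seq[String]])
    }
  }
}
```

Each block ends up in exactly one batch: `dequeueAll` empties the queue, so a block that arrives after an allocation waits for the next batch time.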

ReceivedBlockTracker's allocateBlocksToBatch is called from ReceiverTracker:

/** Allocate all unallocated blocks to the given batch. */
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}

ReceiverTracker's allocateBlocksToBatch method is in turn called from JobGenerator's generateJobs method.

/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
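The Try / Success / Failure structure of generateJobs can be reduced to a few lines. In this sketch the `generate` function, its two function parameters, and the `Either` result are all hypothetical simplifications: the real method submits a JobSet or reports an error instead of returning a value.

```scala
import scala.util.{Failure, Success, Try}

object JobGenerationSketch {
  // Allocate first, then build jobs; an exception in either step becomes a Left.
  def generate(time: Long)(
      allocate: Long => Unit,
      buildJobs: Long => Seq[String]): Either[String, Seq[String]] =
    Try {
      allocate(time)  // stand-in for receiverTracker.allocateBlocksToBatch(time)
      buildJobs(time) // stand-in for graph.generateJobs(time)
    } match {
      case Success(jobs) => Right(jobs)
      case Failure(e) => Left(s"Error generating jobs for time $time: ${e.getMessage}")
    }
}
```

Wrapping both steps in a single Try means a failed allocation never reaches job generation, and the error handling lives in one place.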

Another important method of the ReceivedBlockTracker class is getBlocksOfBatch.

/** Get the blocks allocated to the given batch. */
def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
  timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
}

ReceivedBlockTracker's getBlocksOfBatch is called by ReceiverTracker's getBlocksOfBatch:

/** Get the blocks for the given batch and all input streams. */
def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
  receivedBlockTracker.getBlocksOfBatch(batchTime)
}

ReceiverTracker's getBlocksOfBatch method is called from ReceiverInputDStream's compute method.

/**
 * Generates RDDs with blocks received by the receiver of this stream. */
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // If this is called for any time before the start time of the context,
      // then this returns an empty RDD. This may happen when recovering from a
      // driver failure without any write ahead log to recover pre-failure data.
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
      // for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

      // Register the input blocks information into InputInfoTracker
      val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
      ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

      // Create the BlockRDD
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}
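One detail in compute is easy to miss: numRecords is an Option[Long], so `blockInfos.flatMap(_.numRecords).sum` silently skips blocks whose record count is unknown. A minimal sketch, with a hypothetical `BlockInfo` case class standing in for ReceivedBlockInfo:

```scala
object InputInfoSketch {
  // Hypothetical stand-in for ReceivedBlockInfo, keeping only the fields needed here.
  final case class BlockInfo(blockId: String, numRecords: Option[Long])

  // flatMap drops the None entries, so only known counts are added up.
  def totalRecords(blocks: Seq[BlockInfo]): Long =
    blocks.flatMap(_.numRecords).sum
}
```

The reported total is therefore a lower bound whenever some blocks arrive without a record count.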

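Summary:

After a Receiver receives data, merges it into blocks, and stores them, ReceiverSupervisorImpl reports the block metadata to ReceiverTrackerEndpoint, the message endpoint inside ReceiverTracker.

Once ReceiverTracker has received the block metadata, ReceivedBlockTracker manages how that metadata is allocated. For each batch, JobGenerator fetches from ReceivedBlockTracker the metadata of the blocks belonging to that batch in order to generate the RDD.

In design pattern terms, ReceiverTrackerEndpoint and ReceivedBlockTracker form a facade: ReceivedBlockTracker does the actual work internally, while ReceiverTrackerEndpoint is the outward-facing communication endpoint, or representative.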

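Notes:

Source: DT_大数据梦工厂 (Spark customization class course)

For more content, follow the WeChat public account DT_Spark.

If you are interested in big data and Spark, you can attend the permanently free Spark open class that Wang Jialin (王家林) gives every evening at 20:00, in YY room 68917580.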
