1. Let's start from this example again:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// StreamingExamples comes from the Spark examples module (org.apache.spark.examples.streaming).
object NetworkWordCount {
  def main(args: Array[String]) {
    // The arguments are only checked here; host and port are hard-coded below.
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 40 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
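2. The article "SparkStream例子HdfsWordCount--从Dstream到RDD全过程解析" (the HdfsWordCount example, from DStream to RDD) explains in detail how DStreamGraph walks back through the DStream chain to generate RDDs. A short recap:
a. DStream.print() ==> the generateJob(time: Time) method of the corresponding ForEachDStream is called by DStreamGraph.generateJobs(time).
b. ForEachDStream's generateJob(time: Time) { parent.getOrCompute(time) … } follows the parent DStreams all the way back to the compute method of FileInputDStream, which generates the RDD.
===> In NetworkWordCount the input stream is a ReceiverInputDStream. The backtracking through DStreamGraph works the same way, but ReceiverInputDStream builds its RDD from data that SocketReceiver has already stored in Spark's BlockManager.
===> (How a Receiver stores its data is analyzed in the article "ReceiverSupervisorImpl中的startReceiver(),Receiver如何将数据store到RDD的", on startReceiver() in ReceiverSupervisorImpl.)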
3. Let's look directly at the compute method of ReceiverInputDStream, the parent class of SocketInputDStream.
abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  ...
  /**
   * Generates RDDs with blocks received by the receiver of this stream.
   */
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this
        // stream for this batch.
        val receiverTracker = ssc.scheduler.receiverTracker
        // getBlocksOfBatch(validTime) returns Map[receiverId, Seq[ReceivedBlockInfo]] for the
        // batch. The id of this input stream corresponds to the receiverId used as the key.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker.
        // StreamInputInfo holds the id of this ReceiverInputDStream and the total number of
        // records stored in the BlockManager for the batch.
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }
  ...
}
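The shape of compute can be tried out without Spark. The following is only a minimal sketch under stated assumptions: ComputeSketch, BlockInfo, blocksFor and recordCount are hypothetical names, batch times are plain Long values, and the block table is an ordinary immutable Map rather than the real ReceiverTracker.

```scala
object ComputeSketch {
  // Hypothetical stand-in for ReceivedBlockInfo: only the fields used here.
  final case class BlockInfo(streamId: Int, numRecords: Option[Long])

  // Same shape as ReceiverInputDStream.compute: nothing before the start time,
  // otherwise the blocks allocated to this stream id for the batch.
  def blocksFor(
      validTime: Long,
      startTime: Long,
      streamId: Int,
      allocated: Map[Long, Map[Int, Seq[BlockInfo]]]): Seq[BlockInfo] =
    if (validTime < startTime) Seq.empty
    else
      allocated
        .getOrElse(validTime, Map.empty[Int, Seq[BlockInfo]])
        .getOrElse(streamId, Seq.empty)

  // Same idea as blockInfos.flatMap(_.numRecords).sum:
  // blocks whose record count is unknown (None) contribute nothing.
  def recordCount(blocks: Seq[BlockInfo]): Long =
    blocks.flatMap(_.numRecords).sum
}
```

A batch time earlier than the start time yields no blocks, which corresponds to the empty BlockRDD branch in the excerpt above.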
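4. This piece of code is well worth studying. Its goal is to obtain the Seq[ReceivedBlockInfo] that belongs to this stream's receiverId in the current batch.
==> The ReceivedBlockInfo class holds metadata such as the streamId, the number of records passed to store(), and the BlockId.
val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
a. In ReceiverTracker, the getBlocksOfBatch method returns the blocks of all input streams for the given batch: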
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  ...
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  ...
  /** Get the blocks for the given batch and all input streams. */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
    receivedBlockTracker.getBlocksOfBatch(batchTime)
  }
  ...
}
b. The code below shows that the block information of every batch is kept in the timeToAllocatedBlocks map, whose values are AllocatedBlocks.
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  ...
  /**
   * Get the blocks allocated to the given batch.
   * The returned Map holds every receiverId of the batch together with
   * that receiver's Seq[ReceivedBlockInfo].
   */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
    timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
  }
  ...
}
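The lookup is a plain Option chain: a missing batch becomes an empty map instead of an exception. As a rough illustration only, with hypothetical names (LookupSketch, Allocated, blocksOfBatch), blocks reduced to their id strings and batch times reduced to Long:

```scala
object LookupSketch {
  // Hypothetical simplification of AllocatedBlocks: a block is just its id.
  final case class Allocated(streamIdToBlocks: Map[Int, Seq[String]])

  // get returns an Option, map unwraps the per-stream table when the batch
  // exists, and getOrElse supplies an empty table when it does not.
  def blocksOfBatch(
      timeToAllocated: Map[Long, Allocated],
      batchTime: Long): Map[Int, Seq[String]] =
    timeToAllocated
      .get(batchTime)
      .map(_.streamIdToBlocks)
      .getOrElse(Map.empty[Int, Seq[String]])
}
```

A caller can therefore always follow up with getOrElse(id, Seq.empty), as ReceiverInputDStream.compute does, without checking whether the batch was ever allocated.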
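c. Since the data is read from timeToAllocatedBlocks, who puts the current batch's data into it?
==> The JobGenerator only starts working once the receivers are running and storing data into Spark's BlockManager. The start method of JobScheduler confirms this order: receiverTracker.start() is called before jobGenerator.start().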
def start(): Unit = synchronized {
  ...
  listenerBus.start(ssc.sparkContext)
  // Handles the receiver-based input sources, i.e. the subclasses of ReceiverInputDStream
  // such as SocketInputDStream, FlumePollingInputDStream and FlumeInputDStream.
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
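d. Following on into JobGenerator.generateJobs, the key call is
ReceiverTracker.allocateBlocksToBatch(time). According to its comment, this method allocates the received blocks to the current batch.
===> allocateBlocksToBatch is called first, and the compute method of ReceiverInputDStream is called afterwards.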
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // graph.generateJobs runs inside scala.util.Try, so the result is either Success(jobs)
    // or Failure(e), where jobs is the collection of Job objects it returned. On success
    // the jobs are passed to JobScheduler.submitJobSet to be run on the cluster.
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // streamIdToInputInfos is the metadata of the data received through store()
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // A JobSet represents the jobs of one batch duration. It is a plain object holding the
      // not yet submitted jobs, the submission time, the processing start and end times, etc.
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post a DoCheckpoint event; this happens once per batch interval
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
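The allocate-then-generate order and the Try handling can be reproduced in isolation. This is a sketch under assumptions, not the Spark code: GenerateJobsSketch is a hypothetical name, the two function parameters stand in for allocateBlocksToBatch and graph.generateJobs, jobs are plain strings, and the error is returned as a Left instead of being reported to a scheduler.

```scala
import scala.util.{Failure, Success, Try}

object GenerateJobsSketch {
  // allocate plays the role of allocateBlocksToBatch,
  // generate plays the role of graph.generateJobs (both hypothetical here).
  def generateJobs(
      time: Long,
      allocate: Long => Unit,
      generate: Long => Seq[String]): Either[String, Seq[String]] =
    Try {
      allocate(time) // runs first, so the blocks are visible to generate
      generate(time)
    } match {
      case Success(jobs) => Right(jobs)
      case Failure(e) => Left(s"Error generating jobs for time $time: ${e.getMessage}")
    }
}
```

Because both calls sit inside the same Try block, an exception thrown by either one ends up in the Failure branch, and generate never runs if allocate throws.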
e. How is ReceiverTracker.allocateBlocksToBatch(time) implemented?
==> It allocates all unallocated blocks to the given batch.
/** Allocate all unallocated blocks to the given batch. */
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}
f. Back in ReceivedBlockTracker:
The allocateBlocksToBatch method fills timeToAllocatedBlocks, a HashMap[Time, AllocatedBlocks]. Each key is a batch time and each value is an AllocatedBlocks. AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) maps every receiverId of the batch to its Seq[ReceivedBlockInfo].
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  private var lastAllocatedBatchTime: Time = null
  ...
  /**
   * Allocate all unallocated blocks to the given batch.
   * This event will get written to the write ahead log (if enabled).
   */
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      // Collect every stream id (the streamId is also the receiver id) together with its
      // ReceivedBlockInfo into a Map[streamId, Seq[ReceivedBlockInfo]].
      val streamIdToBlocks = streamIds.map { streamId =>
        // 1. dequeueAll visits every element of the queue (each one a ReceivedBlockInfo) and
        //    passes it to the predicate. Elements for which the predicate returns true are
        //    returned and removed from the queue. Because of the removal, it does not matter
        //    if blocks meant for the next batch are already in the queue: they are treated
        //    as part of the current batch and then removed.
        // 2. The queue has data because the receiver first stores it into the BlockManager
        //    through store(), and the executor then notifies the driver's
        //    ReceiverTrackerEndpoint with AddBlock(ReceivedBlockInfo).
        //    ==> All ReceivedBlockInfo of the current batch thus sit as values of a
        //    HashMap[Int, ReceivedBlockQueue] whose key is the receiverId.
        (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      // Wrap streamIdToBlocks: Map[streamId, Seq[ReceivedBlockInfo]] in the per-batch
      // container class AllocatedBlocks.
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      // BatchAllocationEvent marks that the allocation for this batch is complete, i.e. the
      // data is already in the BlockManager. The event exists for the WAL.
      // writeToLog returns true when the WAL is disabled; when the WAL is enabled, the
      // result reflects whether the write succeeded.
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        // timeToAllocatedBlocks is the HashMap[Time, AllocatedBlocks]
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        // lastAllocatedBatchTime is of type Time
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  }
  ...
}
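The drain-and-record behaviour described above can be modelled with ordinary collections. What follows is a simplified sketch, assuming no write ahead log at all: AllocationSketch and Tracker are hypothetical names, block infos are plain strings, and batch times are Long values.

```scala
import scala.collection.mutable

object AllocationSketch {
  // Hypothetical, WAL-free model of the allocation logic.
  final class Tracker(streamIds: Seq[Int]) {
    private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[String]]
    private val allocated = mutable.HashMap.empty[Long, Map[Int, Seq[String]]]
    private var lastAllocated: Option[Long] = None

    private def queueFor(streamId: Int): mutable.Queue[String] =
      unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[String])

    // A receiver reported a stored block: park it until the next allocation.
    def addBlock(streamId: Int, blockId: String): Unit = synchronized {
      queueFor(streamId).enqueue(blockId)
    }

    // Drain every queue into the batch. Repeated or older batch times are ignored.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      if (lastAllocated.forall(batchTime > _)) {
        val drained = streamIds.map { id =>
          id -> queueFor(id).dequeueAll(_ => true).toList
        }.toMap
        allocated.put(batchTime, drained)
        lastAllocated = Some(batchTime)
      }
    }

    def blocksOfBatch(batchTime: Long): Map[Int, Seq[String]] = synchronized {
      allocated.getOrElse(batchTime, Map.empty[Int, Seq[String]])
    }
  }
}
```

Since dequeueAll empties the queue, a block reported after one allocation is only seen by the next one, so each block belongs to exactly one batch in this model.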
5. So, back in ReceiverInputDStream:
receiverTracker.getBlocksOfBatch(time).getOrElse(id, Seq.empty) is the Seq[ReceivedBlockInfo] of this receiverId, that is, the metadata of the data that SocketReceiver stored into Spark.
abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  ...
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        ...
      } else {
        // Otherwise, get all ReceivedBlockInfo of the current batch from the ReceiverTracker.
        val receiverTracker = ssc.scheduler.receiverTracker
        // getBlocksOfBatch(validTime) returns Map[receiverId, Seq[ReceivedBlockInfo]] for the
        // batch. The id of this input stream corresponds to the receiverId used as the key.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker.
        // StreamInputInfo holds the id of this ReceiverInputDStream and the total number of
        // records stored in the BlockManager for the batch.
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }
  ...
}
6. Next, createBlockRDD is called. It takes the blocks of the current batch, whose data sits in Spark's BlockManager, and creates a BlockRDD from them.
// Takes the current batch time and the Seq[ReceivedBlockInfo] describing the blocks stored in Spark
private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {

  if (blockInfos.nonEmpty) {
    // In this example this yields Array[StreamBlockId(streamId: Int, uniqueId: Long)]
    val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray

    // Are WAL record handles present with all the blocks
    val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }

    if (areWALRecordHandlesPresent) {
      // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
      val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
      val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
    } else {
      // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
      // others then that is unexpected and log a warning accordingly.
      if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
        if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
          logError("Some blocks do not have Write Ahead Log information; " +
            "this is unexpected and data may not be recoverable after driver failures")
        } else {
          logWarning("Some blocks have Write Ahead Log information; this is unexpected")
        }
      }
      // Ask the BlockManagerMaster whether each StreamBlockId exists in the cluster
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      // Log a warning if the recorded StreamBlockIds do not all exist in the cluster
      if (validBlockIds.size != blockIds.size) {
        logWarning("Some blocks could not be recovered as they were not found in memory. " +
          "To prevent such data loss, enabled Write Ahead Log (see programming guide " +
          "for more details.")
      }
      // Create the BlockRDD from the StreamBlockIds that the cluster actually holds
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
    // according to the configuration
    if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, Array.empty, Array.empty, Array.empty)
    } else {
      new BlockRDD[T](ssc.sc, Array.empty)
    }
  }
}
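The branching in createBlockRDD comes down to three cases: no blocks, all blocks with a WAL handle, or anything else. The sketch below only models that decision and is not the Spark implementation: BlockRddChoiceSketch, BlockInfo, Plan, WalBacked, Plain and plan are hypothetical names, and the isStored parameter stands in for the BlockManagerMaster lookup.

```scala
object BlockRddChoiceSketch {
  // Hypothetical stand-in for ReceivedBlockInfo.
  final case class BlockInfo(blockId: String, walHandle: Option[String])

  // Which kind of RDD would be built, and from which block ids.
  sealed trait Plan
  final case class WalBacked(blockIds: Seq[String], handles: Seq[String]) extends Plan
  final case class Plain(blockIds: Seq[String]) extends Plan

  def plan(blocks: Seq[BlockInfo], walEnabled: Boolean, isStored: String => Boolean): Plan =
    if (blocks.isEmpty) {
      // No block is ready: the configuration alone decides the RDD type.
      if (walEnabled) WalBacked(Seq.empty, Seq.empty) else Plain(Seq.empty)
    } else if (blocks.forall(_.walHandle.nonEmpty)) {
      // Every block has a WAL handle: keep all ids, recovery can use the log.
      WalBacked(blocks.map(_.blockId), blocks.flatMap(_.walHandle))
    } else {
      // Otherwise keep only the ids that are still stored somewhere.
      Plain(blocks.map(_.blockId).filter(isStored))
    }
}
```

In the last branch, a block that is no longer stored is silently dropped from the plan, which mirrors the "could not be recovered" warning path in the excerpt above.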
This completes the flow in which ReceiverInputDStream periodically picks up the data that SocketReceiver has already stored in Spark's BlockManager.