The previous article discussed how the Receiver of a ReceiverInputDStream is shipped to an Executor and run there. The key piece of code is startReceiverFunc, which creates a ReceiverSupervisorImpl and calls its start() method:
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    // Get the currently active TaskContext. attemptNumber is how many times this task
    // has been attempted: the first attempt gets attemptNumber = 0, and subsequent
    // attempts get increasing attempt numbers.
    if (TaskContext.get().attemptNumber() == 0) {
      val receiver = iterator.next()
      // On the first attempt the iterator holds exactly one element, so after the
      // next() above it must be empty; otherwise this assertion fails
      assert(iterator.hasNext == false)
      // ReceiverSupervisorImpl supervises the Receiver and handles writing its data:
      // each block interval, the data the receiver has collected is written into a
      // BlockGenerator, and the compute method of the ReceiverInputDStream later reads
      // each batch's blocks back out and exposes them as an RDD
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      // start() is implemented in the parent class ReceiverSupervisor; it calls
      // onStart(), which ReceiverSupervisorImpl overrides, and then startReceiver():
      /* def start() {
           onStart()
           startReceiver()
         } */
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // attemptNumber > 0 means the task was restarted by the TaskScheduler; the
      // receiver will be rescheduled as a new task, so just exit here.
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }
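For context, the ReceiverTracker wraps each receiver in a one-partition RDD and submits startReceiverFunc with SparkContext.submitJob, which is what physically lands the receiver on an executor as a long-running task. Below is a minimal, runnable sketch of that same pattern; the object name and the string payload standing in for the receiver are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("submit-job-sketch").setMaster("local[2]"))
    // A one-partition RDD whose single element plays the role of the receiver
    val payloadRDD = sc.makeRDD(Seq("pretend-receiver"), 1)
    // Runs once on the executor, mirroring startReceiverFunc's attemptNumber guard
    val runOnExecutor: Iterator[String] => Unit = { iter =>
      if (TaskContext.get().attemptNumber() == 0 && iter.hasNext) {
        println(s"started long-running work for ${iter.next()}")
      }
    }
    // Like ReceiverTracker, submit the function as a job over partition 0;
    // the returned future completes only when the "receiver" task exits
    val future = sc.submitJob[String, Unit, Unit](
      payloadRDD, runOnExecutor, Seq(0), (_, _) => (), ())
    Await.ready(future, Duration.Inf)
    sc.stop()
  }
}

In the real ReceiverTracker, completion of this future is what tells the driver the receiver has exited, so it can decide whether to reschedule it.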
1. supervisor.start() is the entry point that starts the Receiver and begins receiving data on the Executor.
start() is implemented in ReceiverSupervisor, the parent class of ReceiverSupervisorImpl:
/** Start the supervisor */
def start() {
  onStart()
  // startReceiver() ultimately calls onStart() of the concrete Receiver implementation,
  // e.g. the onStart() of the SocketReceiver created by SocketInputDStream
  startReceiver()
}
2. Now step into onStart(). Its scaladoc is explicit: this method must be called before receiver.onStart(), so that the BlockGenerator is up and ready before the receiver starts sending data into it.
/**
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
a. registeredBlockGenerators is an ArrayBuffer[BlockGenerator]. When the ReceiverSupervisorImpl is constructed, it creates a BlockGenerator and appends it to this buffer, so the _.start() above is BlockGenerator.start().
/**
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]
 * which provides all the necessary functionality for handling the data received by
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]
 * object that is used to divide the received data stream into blocks of data.
 *
 * a. This ReceiverSupervisor implementation provides everything needed to handle the
 *    data the receiver brings in.
 * b. During construction it calls createBlockGenerator to create a BlockGenerator.
 * c. That BlockGenerator is what chops the receiver's data stream into blocks.
 *
 * ReceiverSupervisorImpl supervises the Receiver and writes its data: each block
 * interval the receiver's data goes into the BlockGenerator, and the compute method of
 * the ReceiverInputDStream later reads each batch's blocks back out as an RDD.
 */
private[streaming] class ReceiverSupervisorImpl(
    receiver: Receiver[_],
    env: SparkEnv,
    hadoopConf: Configuration,
    checkpointDirOption: Option[String]
  ) extends ReceiverSupervisor(receiver, env.conf) with Logging {
  ....
/** Unique block ids if one wants to add blocks directly */
private val newBlockId = new AtomicLong(System.currentTimeMillis())

// Filled by the createBlockGenerator method below: every BlockGenerator created for
// this supervisor is appended to this ArrayBuffer[BlockGenerator]
private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]

/** Divides received data records into data blocks for pushing in BlockManager.
 * A BlockGeneratorListener that hands finished blocks over to the BlockManager.
 */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = {
    print("onGenerateBlock no impl.....")
  }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    // Push the ArrayBuffer the receiver filled via store() into memory
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}

// Created once per supervisor via createBlockGenerator, which both registers the new
// BlockGenerator in registeredBlockGenerators and returns it; it is what chops the
// Receiver's incoming data into blocks
private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
b. Now look at createBlockGenerator(). It creates a BlockGenerator and injects the given BlockGeneratorListener into it:
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  // (registeredBlockGenerators is the ArrayBuffer[BlockGenerator] shown above)
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }
  // Create a new BlockGenerator instance, register it, and return it
  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
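createBlockGenerator is also the hook a receiver can use to register its own listener, for example to track source offsets per block, as the reliable Kafka receiver in Spark 1.x did. A hedged sketch of such a custom listener follows; the bookkeeping in the comments is hypothetical, and note that BlockGeneratorListener is private[streaming], so code like this only compiles inside that package:

import org.apache.spark.storage.StreamBlockId
import org.apache.spark.streaming.receiver.BlockGeneratorListener
import scala.collection.mutable.ArrayBuffer

// Hypothetical listener a custom receiver might pass to supervisor.createBlockGenerator
val offsetTrackingListener = new BlockGeneratorListener {
  override def onAddData(data: Any, metadata: Any): Unit = {
    // called for every addData(record, metadata); could remember (record, offset) pairs
  }
  override def onGenerateBlock(blockId: StreamBlockId): Unit = {
    // called when a block is cut; could snapshot the offsets belonging to blockId
  }
  override def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]): Unit = {
    // called from the pushing thread; store the block, then commit the offset snapshot
  }
  override def onError(message: String, throwable: Throwable): Unit = { }
}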
3. Now look at BlockGenerator.start():
/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch of as a block,
 * the other to push the blocks into the block manager.
 *
 * In short: it batches the data a Receiver brings in and turns each batch into a named
 * block. Two threads do the work:
 * a. a timer thread periodically starts a new batch and turns the previous batch into
 *    a block -- the updateCurrentBuffer method driven by the RecurringTimer below;
 * b. a pushing thread moves finished blocks into the BlockManager -- the
 *    blockPushingThread below.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
private[streaming] class BlockGenerator(
    listener: BlockGeneratorListener,
    receiverId: Int,
    conf: SparkConf,
    clock: Clock = new SystemClock()
  ) extends RateLimiter(conf) with Logging {

  // id is StreamBlockId(receiverId, next-trigger-time - blockIntervalMs);
  // buffer holds the records the receiver handed over from the source via store()
  private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])

  /**
   * The BlockGenerator can be in 5 possible states, in the order as follows.
   * - Initialized: Nothing has been started.
   * - Active: start() has been called, and it is generating blocks on added data.
   * - StoppedAddingData: stop() has been called, the adding of data has been stopped,
   *   but blocks are still being generated and pushed.
   * - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
   *   they are still being pushed.
   * - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
   */
  private object GeneratorState extends Enumeration {
    type GeneratorState = Value
    val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
  }
  import GeneratorState._

  // Interval at which received data is chopped into blocks before being stored in
  // Spark; going below 50ms is not recommended
  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  // Calls updateCurrentBuffer once every blockIntervalMs (200ms by default)
  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var state = Initialized

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
  }
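To make the two-thread design concrete, here is a self-contained sketch of the same pattern. It is not Spark code, and the class name is made up: a timer swaps out the current buffer on a fixed interval (the analogue of updateCurrentBuffer), while a pusher thread drains the bounded queue (the analogue of keepPushingBlocks):

import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

class MiniBlockGenerator(intervalMs: Long, push: ArrayBuffer[Any] => Unit) {
  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var generating = true
  private val queue = new ArrayBlockingQueue[ArrayBuffer[Any]](10)
  private val timer = Executors.newSingleThreadScheduledExecutor()

  // Pusher thread: drain the queue and hand each block to `push`
  private val pusher = new Thread(new Runnable {
    override def run(): Unit = {
      while (generating || !queue.isEmpty) {
        Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(push)
      }
    }
  })

  def addData(record: Any): Unit = synchronized { currentBuffer += record }

  def start(): Unit = {
    // Timer thread: every intervalMs, swap the buffer and enqueue the old one
    timer.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val block = MiniBlockGenerator.this.synchronized {
          val old = currentBuffer
          currentBuffer = new ArrayBuffer[Any]
          old
        }
        if (block.nonEmpty) queue.put(block) // blocks while the queue is full
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    pusher.start()
  }

  def stop(): Unit = {
    timer.shutdown()   // stop cutting new blocks
    generating = false // let the pusher drain whatever is left, then exit
    pusher.join()
  }
}

A new MiniBlockGenerator(200, buf => println(s"block of ${buf.size} records")) would mirror the default 200ms spark.streaming.blockInterval.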
a. First, the blockIntervalTimer: a recurring timer thread that keeps cutting new blocks.
// Interval at which received data is chopped into blocks before being stored in Spark;
// going below 50ms is not recommended
private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

// Calls updateCurrentBuffer once every blockIntervalMs (200ms by default)
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
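Both settings read here can be tuned on the application's SparkConf; the values below are only illustrative:

import org.apache.spark.SparkConf

// Illustrative tuning of the two settings BlockGenerator reads
val conf = new SparkConf()
  .set("spark.streaming.blockInterval", "100ms") // cut a block every 100ms
  .set("spark.streaming.blockQueueSize", "20")   // capacity of blocksForPushing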
b. updateCurrentBuffer(): if the receiver has stored any data, a Block is generated every 200ms and put into the ArrayBlockingQueue; whenever that queue has elements, they are continuously drained and stored into the BlockManager.
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    // currentBuffer is the @volatile ArrayBuffer[Any] initialized above. It only gets
    // elements when the concrete Receiver calls store(), which ends up calling this
    // class's addData method
    synchronized {
      if (currentBuffer.nonEmpty) {
        // Hand the filled buffer off to a local val and install a fresh ArrayBuffer
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        // The streaming batch interval should normally not go below 500ms, and
        // blockIntervalMs (default 200ms) should not go below 50ms
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        // For the default BlockGeneratorListener passed in by ReceiverSupervisorImpl,
        // onGenerateBlock is effectively a no-op
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      // Enqueue the Block into blocksForPushing, the ArrayBlockingQueue[Block]
      blocksForPushing.put(newBlock) // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
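As the comment above notes, currentBuffer only fills up when the receiver calls store(). A minimal custom receiver sketch (the class name is hypothetical) shows where that call lives; this uses the public Receiver API:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Each store(record) below lands in BlockGenerator.addData, which appends the record
// to currentBuffer until the next block is cut
class CountingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
  override def onStart(): Unit = {
    new Thread(new Runnable {
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store(s"record-$i") // appended to the BlockGenerator's currentBuffer
          i += 1
          Thread.sleep(10)
        }
      }
    }).start()
  }
  override def onStop(): Unit = { } // the isStopped() flag ends the thread above
}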
c. Next, the blockPushingThread: it keeps watching the ArrayBlockingQueue[Block] for incoming blocks and pushes them into the BlockManager for storage.
private val blockPushingThread = new Thread() {
  override def run() { keepPushingBlocks() }
}
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")

  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }

  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    // As long as block generation has not been stopped, keep taking blocks out of the
    // blocksForPushing queue. The queue only yields elements once updateCurrentBuffer
    // has found data in currentBuffer; only then does pushBlock() actually run.
    while (areBlocksBeingGenerated) {
      // Poll the head of the queue with a 10ms timeout; returns null on timeout
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        // pushBlock hands the data the receiver store()d over to the BlockManager's
        // DiskStore or MemoryStore
        case Some(block) => pushBlock(block)
        case None =>
      }
    }

    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      // Each remaining block is handed to the defaultBlockGeneratorListener that was
      // created at supervisor initialization
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
4. Now look at pushBlock(block). It hands the generated block to the defaultBlockGeneratorListener created by ReceiverSupervisorImpl:
// pushBlock hands the data the receiver store()d over to the BlockManager's
// DiskStore or MemoryStore
private def pushBlock(block: Block) {
  // block.id is StreamBlockId(receiverId, time - blockIntervalMs); block.buffer holds
  // the records the receiver store()d. The listener here is the
  // defaultBlockGeneratorListener created when ReceiverSupervisorImpl was initialized.
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
a. Let's see how the BlockGeneratorListener's onPushBlock() gets the Block data into the BlockManager:
/** Divides received data records into data blocks for pushing in BlockManager. */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = {
    print("onGenerateBlock no impl.....")
  }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    // Push the ArrayBuffer the receiver filled via store() into memory
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
b. Step into pushArrayBuffer to see the concrete implementation:
/** Store an ArrayBuffer of received data as a data block into Spark's memory.
 * Called here as:
 * pushArrayBuffer(arrayBuffer[Any], None, Some(StreamBlockId(receiverId, time - blockIntervalMs)))
 */
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // Report the block to the driver. ArrayBufferBlock is the ReceivedBlock case class
  // that represents a data block held in an ArrayBuffer.
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
==> Let's see how pushAndReportBlock stores the block and reports it to the driver:
/** Store block and report it to driver.
 * Called here as:
 * pushAndReportBlock(ArrayBufferBlock(arrayBuffer[Any]), None, Some(StreamBlockId(receiverId, time - blockIntervalMs)))
 */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // If no StreamBlockId was handed in, generate a fresh one
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block in the BlockManager at the configured storageLevel. For the
  // BlockManager-based handler this returns BlockManagerBasedStoreResult(blockId, numRecords),
  // where numRecords is the number of records that were store()d into the block.
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send the ReceivedBlockInfo to the ReceiverTrackerEndpoint's receiveAndReply on the
  // driver, wrapped in an AddBlock message carrying the block's metadata
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
===> From the source we can see that storing the received data into the BlockManager happens in receivedBlockHandler.storeBlock().
===> First, look at how receivedBlockHandler is constructed. From the source we can tell that in the current case we get a BlockManagerBasedBlockHandler instance:
// ReceivedBlockHandler is the class that stores the blocks received by the receiver
private val receivedBlockHandler: ReceivedBlockHandler = {
  // If spark.streaming.receiver.writeAheadLog.enable is not set in the conf it defaults
  // to false, i.e. no WriteAheadLogBasedBlockHandler is created
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    // Use the Executor's BlockManager from SparkEnv to store the blocks. For a stream
    // created via socketTextStream, the receiver here is a SocketReceiver.
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
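For reference, this is what opting into the WAL-backed handler looks like from the application side; the checkpoint path below is just a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Opting in to WriteAheadLogBasedBlockHandler: enable the receiver WAL and set a
// checkpoint directory, otherwise the SparkException above is thrown
val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path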
===> From the source we can see that storeBlock stores the receiver's data in the Spark cluster via BlockManager.putIterator at the configured storageLevel, then returns the StreamBlockId and the number of stored records wrapped in a BlockManagerBasedStoreResult.
(The code behind BlockManager.putIterator() will be analyzed separately later.)
/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  // The ReceivedBlock is the store()d data wrapped as ArrayBufferBlock(arrayBuffer[Any])
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
    var numRecords = None: Option[Long]

    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        // Number of records that ended up in the arrayBuffer[Any]
        numRecords = Some(arrayBuffer.size.toLong)
        // Store the block in the BlockManager at the given storageLevel
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        ....

    BlockManagerBasedStoreResult(blockId, numRecords)
  }
===> Finally, the ReceiverTrackerEndpoint is used to notify the driver:
// Send the ReceivedBlockInfo to the ReceiverTrackerEndpoint's receiveAndReply on the
// driver; the AddBlock message carries the block's metadata
trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
===> On the driver, the ReceiverTrackerEndpoint's receiveAndReply handles the message and replies with the boolean returned by addBlock:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful =
      registerReceiver(streamId, typ, host, executorId, receiverEndpoint, context.senderAddress)
    context.reply(successful)
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    ....
}
===> addBlock's job is to append this ReceivedBlockInfo metadata to a ReceivedBlockQueue, whose elements are exactly these ReceivedBlockInfo entries. If an exception occurs, the method returns false.
==> The method is implemented in ReceivedBlockTracker:
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
That concludes how ReceiverSupervisorImpl's onStart() gets the Receiver's data written into Spark's BlockManager.
Next we will analyze startReceiver() in ReceiverSupervisorImpl: how does the data the Receiver store()s end up in an RDD?