Spark Streaming Source Code Walkthrough: A Thorough Study of the Full Lifecycle of Continuous Stream Data Reception
Continuing from Lecture 9.
ReceiverSupervisor.scala (lines 127-131):
def start() {
  onStart()
  startReceiver()
}
onStart, ReceiverSupervisor.scala (lines 107-112):
/**
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
protected def onStart() { }
It must be invoked before receiver.onStart() is called.
The implementation is in ReceiverSupervisorImpl.scala (lines 172-174):
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
The definition of registeredBlockGenerators, ReceiverSupervisorImpl.scala (lines 95-96):
private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]
Operations on registeredBlockGenerators, ReceiverSupervisorImpl.scala (lines 194-202):
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }
  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
This function is called when ReceiverSupervisorImpl is instantiated. ReceiverSupervisorImpl.scala (line 112):
private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
Note the defaultBlockGeneratorListener here; it will come up again later.
ReceiverSupervisorImpl.scala (lines 99-111):
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }
  def onGenerateBlock(blockId: StreamBlockId): Unit = { }
  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }
  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
Pay particular attention to the onPushBlock function.
Back in ReceiverSupervisorImpl.scala (lines 172-174), BlockGenerator's start is invoked.
BlockGenerator.scala (lines 115-125):
def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    blockIntervalTimer.start()
    blockPushingThread.start()
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}
This starts blockIntervalTimer and blockPushingThread. blockIntervalTimer is a timer that by default fires every 200 ms and calls back into updateCurrentBuffer. The interval is set by the spark.streaming.blockInterval parameter and is a performance-tuning knob: too short an interval produces too many small block fragments, while too long an interval can produce oversized blocks, so the right value depends on the actual workload. updateCurrentBuffer packages the received data into a block for storage. blockPushingThread periodically takes blocks from the blocksForPushing queue, stores them, and reports to ReceiverTrackerEndpoint.
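As a reference point, here is a minimal sketch of setting this interval when building the application (the app name and batch duration are hypothetical; spark.streaming.blockInterval is the documented parameter):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalTuning")  // hypothetical app name
  // Default is 200ms: a larger value yields fewer, bigger blocks (fewer tasks per batch);
  // a smaller value yields more, smaller blocks (more parallelism, more overhead).
  .set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(2))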
blockIntervalTimer is a timer; its definition is as follows. BlockGenerator.scala (lines 105-106):
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
The timer invokes updateCurrentBuffer on every tick, passing in the scheduled tick time.
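To make the timing concrete, here is a simplified, self-contained sketch of what a recurring timer of this kind does (SimpleRecurringTimer is a hypothetical stand-in, not Spark's actual RecurringTimer):

class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  @volatile private var running = true
  private val thread = new Thread("simple-recurring-timer") {
    override def run(): Unit = {
      // Align ticks to period boundaries, so each tick time identifies an interval.
      var nextTime = (System.currentTimeMillis / periodMs + 1) * periodMs
      try {
        while (running) {
          val sleepMs = nextTime - System.currentTimeMillis
          if (sleepMs > 0) Thread.sleep(sleepMs)
          callback(nextTime)  // the scheduled time, not the actual wake-up time
          nextTime += periodMs
        }
      } catch { case _: InterruptedException => }
    }
  }
  def start(): Unit = { thread.setDaemon(true); thread.start() }
  def stop(): Unit = { running = false; thread.interrupt() }
}

Passing the scheduled time to the callback is why updateCurrentBuffer below can name the block StreamBlockId(receiverId, time - blockIntervalMs): the block is labeled with the interval during which its data arrived.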
BlockGenerator.scala (lines 232-254):
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
updateCurrentBuffer swaps a fresh, empty ArrayBuffer in as currentBuffer, wraps the old buffer (newBlockBuffer) into a new Block, and finally puts newBlock into the blocksForPushing queue.
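The swap-under-lock-then-enqueue pattern is worth isolating. A self-contained sketch (MiniGenerator is a hypothetical illustration, not Spark code):

import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

class MiniGenerator(queueSize: Int = 10) {
  private var currentBuffer = new ArrayBuffer[Any]
  private val blocksForPushing = new ArrayBlockingQueue[ArrayBuffer[Any]](queueSize)

  // Called by the receiving thread.
  def add(data: Any): Unit = synchronized { currentBuffer += data }

  // Called by a timer: atomically swap the buffer, then enqueue the batch.
  def cut(): Unit = {
    val batch = synchronized {
      val old = currentBuffer
      currentBuffer = new ArrayBuffer[Any]
      old
    }
    if (batch.nonEmpty) blocksForPushing.put(batch)  // blocks when the queue is full
  }
}

Only the pointer swap happens inside the lock, so the receiving thread is blocked only briefly; the potentially blocking put into the queue runs outside the critical section, exactly as in updateCurrentBuffer above.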
blockPushingThread (BlockGenerator.scala, line 109):
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
keepPushingBlocks() is implemented as follows (BlockGenerator.scala, lines 256-289):
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")
  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }
  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    while (areBlocksBeingGenerated) {
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        case Some(block) => pushBlock(block)
        case None =>
      }
    }
    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
The core of it is this polling loop:
while (areBlocksBeingGenerated) {
  Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
    case Some(block) => pushBlock(block)
    case None =>
  }
}
It periodically polls a block from the blocksForPushing queue and then calls pushBlock.
BlockGenerator.scala (lines 295-298):
private def pushBlock(block: Block) {
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
The listener here was instantiated back when the BlockGenerator was created (along with ReceiverSupervisorImpl): it is the defaultBlockGeneratorListener. See the createBlockGenerator function quoted above.
So the implementation of onPushBlock is as follows (ReceiverSupervisorImpl.scala, lines 108-110):
def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
  pushArrayBuffer(arrayBuffer, None, Some(blockId))
}
(ReceiverSupervisorImpl.scala, lines 123-129)
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
And pushAndReportBlock is the heavyweight! (ReceiverSupervisorImpl.scala, lines 149-163)
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
It does two things: first, it calls receivedBlockHandler to store the block; second, it reports the block's storage result, blockInfo, to trackerEndpoint.
receivedBlockHandler is implemented as follows (ReceiverSupervisorImpl.scala, lines 53-66):
private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
        "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
        "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
There are two implementations: WriteAheadLogBasedBlockHandler and BlockManagerBasedBlockHandler.
BlockManagerBasedBlockHandler stores blocks through BlockManager and returns the metadata of the stored block; WriteAheadLogBasedBlockHandler additionally writes each block to a write-ahead log so the data can be recovered after a failure.
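To exercise the write-ahead-log branch, an application must both enable the flag and set a checkpoint directory, as the guard above enforces. A minimal sketch (the app name and path are hypothetical; the two settings are the documented ones):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALReceiverDemo")  // hypothetical app name
  // Flips receivedBlockHandler to WriteAheadLogBasedBlockHandler.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(2))
// Required for the WAL: the log files live under the checkpoint directory.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // hypothetical path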
With BlockGenerator started, we now look at the startReceiver() method invoked from supervisor.start().
(ReceiverSupervisor.scala, lines 143-158)
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
The implementation of onReceiverStart (ReceiverSupervisorImpl.scala, lines 181-185):
override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}
Its main job is to send a RegisterReceiver message to trackerEndpoint, registering the receiver with the driver.
receiver.onStart()
Here we take SocketReceiver as an example.
SocketInputDStream.scala (lines 55-61):
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}
The definition of receive (SocketInputDStream.scala, lines 69-96):
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
It opens a socket connection and stores each received record.
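The same pattern applies to any custom receiver. A minimal, self-contained sketch that feeds records into this exact store() pipeline (ConstantReceiver is a hypothetical example, not Spark code):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ConstantReceiver(value: String)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    // Like SocketReceiver, do the blocking work on a separate daemon thread.
    new Thread("Constant Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        while (!isStopped()) {
          store(value)       // ends up in currentBuffer via pushSingle/addData
          Thread.sleep(100)
        }
      }
    }.start()
  }

  def onStop(): Unit = { }   // the receiving thread checks isStopped() itself
}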
The implementation of store, Receiver.scala (lines 118-120):
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}
ReceiverSupervisorImpl.scala (lines 118-120):
def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}
BlockGenerator.scala (lines 160-175):
def addData(data: Any): Unit = {
  if (state == Active) {
    waitToPush()
    synchronized {
      if (state == Active) {
        currentBuffer += data
      } else {
        throw new SparkException(
          "Cannot add data as BlockGenerator has not been started or has been stopped")
      }
    }
  } else {
    throw new SparkException(
      "Cannot add data as BlockGenerator has not been started or has been stopped")
  }
}
currentBuffer += data keeps accumulating records in currentBuffer until the next tick of blockIntervalTimer cuts the buffer into a block, closing the loop back at updateCurrentBuffer. Also note the waitToPush() call before the append: BlockGenerator extends a rate limiter, so the receiving thread can be throttled.
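A hedged sketch of capping that rate (spark.streaming.receiver.maxRate is the documented knob; leaving it unset means unbounded):

import org.apache.spark.SparkConf

// With this cap, waitToPush() blocks the receiving thread once more than
// 10000 records arrive within one second, providing simple backpressure.
val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "10000")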