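As the earlier posts showed, Spark Streaming uses a timer to generate an RDD at regular intervals, and each RDD computes over the data that arrived in the corresponding time window.
We continue with the example from that post.
In some scenarios, however, an RDD may contain no data. Let's again take SocketInputDStream and look at how its RDD is actually created:
The RDD is created once the DStream dependency chain has been traced back to the source DStream.
As we can see, an RDD is always created. Judging from the runtime flow, at the moment the RDD is created only the metadata of the data is known; whether any data actually exists is not.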
// ReceiverInputDStream.scala line 41
private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {
  if (blockInfos.nonEmpty) {
    val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray
    // Are WAL record handles present with all the blocks
    val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }
    if (areWALRecordHandlesPresent) {
      // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
      val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
      val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
      new WriteAheadLogBackedBlockRDD[T]( // recover from the WAL
        ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
    } else {
      // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
      // others then that is unexpected and log a warning accordingly.
      if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
        if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
          logError("Some blocks do not have Write Ahead Log information; " +
            "this is unexpected and data may not be recoverable after driver failures")
        } else {
          logWarning("Some blocks have Write Ahead Log information; this is unexpected")
        }
      }
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      if (validBlockIds.size != blockIds.size) {
        logWarning("Some blocks could not be recovered as they were not found in memory. " +
          "To prevent such data loss, enabled Write Ahead Log (see programming guide " +
          "for more details.")
      }
      new BlockRDD[T](ssc.sc, validBlockIds) // create a BlockRDD
    }
  } else {
    // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
    // according to the configuration
    if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, Array.empty, Array.empty, Array.empty) // create an empty WAL-backed RDD
    } else {
      new BlockRDD[T](ssc.sc, Array.empty) // create a BlockRDD with no data
    }
  }
}
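The branch structure of createBlockRDD can be condensed into a small, Spark-free model. This is only a sketch for reasoning about the branches: BlockMeta, PlannedRDD, WalBackedPlan, BlockPlan and CreateBlockRDDModel are hypothetical names that do not exist in Spark, and the inMemory set stands in for the lookup against the block manager master.

```scala
// Hypothetical stand-ins for Spark's internal types; not part of the Spark API.
final case class BlockMeta(blockId: String, walHandle: Option[String])

sealed trait PlannedRDD { def blockIds: Seq[String] }
final case class WalBackedPlan(blockIds: Seq[String]) extends PlannedRDD
final case class BlockPlan(blockIds: Seq[String]) extends PlannedRDD

object CreateBlockRDDModel {
  // Mirrors the branches of createBlockRDD: some RDD is always planned,
  // even when the batch has no blocks at all.
  def plan(blocks: Seq[BlockMeta], walEnabled: Boolean, inMemory: Set[String]): PlannedRDD =
    if (blocks.nonEmpty) {
      if (blocks.forall(_.walHandle.nonEmpty)) {
        // Every block has a WAL handle: WAL-backed RDD over all block ids.
        WalBackedPlan(blocks.map(_.blockId))
      } else {
        // Otherwise keep only the blocks that are still known to be in memory.
        BlockPlan(blocks.map(_.blockId).filter(inMemory.contains))
      }
    } else {
      // No blocks: an empty RDD whose type depends on the WAL setting.
      if (walEnabled) WalBackedPlan(Seq.empty) else BlockPlan(Seq.empty)
    }
}
```

Calling CreateBlockRDDModel.plan(Seq.empty, walEnabled = false, Set.empty) yields BlockPlan(Seq.empty), the counterpart of the BlockRDD with no blocks in the listing above.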
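So an RDD is created whether or not there is any data. But if there is no data, will there be any Block?
Let's look at how the Receiver side handles this.
A timer calls the following function every 200 ms by default: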
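As a rough model of such a timer callback (a sketch, not the Spark source): records accumulate in a buffer, and on every tick the buffer is swapped out and turned into a block only if it is non-empty. The class BlockBufferModel and its methods are hypothetical names. In Spark this role is played by BlockGenerator, and to my understanding the interval is controlled by spark.streaming.blockInterval.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified model of a receiver-side block generator.
final class BlockBufferModel {
  private var currentBuffer = ArrayBuffer.empty[String]
  private val generated = ArrayBuffer.empty[List[String]]

  // Called for every record the receiver reads.
  def addData(record: String): Unit = synchronized { currentBuffer += record }

  // Called by the timer once per block interval: emit a block and start a
  // fresh buffer only if something arrived during the interval.
  def onTimerTick(): Unit = synchronized {
    if (currentBuffer.nonEmpty) {
      generated += currentBuffer.toList
      currentBuffer = ArrayBuffer.empty[String]
    }
  }

  // Blocks produced so far, oldest first.
  def blocks: List[List[String]] = synchronized { generated.toList }
}
```

Under this model, an interval in which nothing arrives produces no block, so a batch made up only of such intervals would carry an empty list of block infos into RDD creation.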
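At this point, the metadata of the data received by the Receiver has been saved to the Driver.
The steps above, however, do not show the time dimension. Where does it come in?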