A DStream is a template for RDDs: every batchInterval, a new RDD is generated from the DStream template. In effect, a DStream is a collection of RDDs with a time dimension added. This is visible in the generatedRDDs data structure inside DStream:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
Let's look at how RDDs are generated, starting from a simple Spark Streaming word-count example (written here against the user-facing API, with the underlying DStream class noted for each step):
// `ssc` is an existing StreamingContext
val lines = ssc.socketTextStream("localhost", 9999)
// lines is a SocketInputDStream
val words = lines.flatMap(_.split(" "))
// SocketInputDStream --> FlatMappedDStream
val pairs = words.map(word => (word, 1))
// FlatMappedDStream --> MappedDStream
val wordCounts = pairs.reduceByKey(_ + _)
// MappedDStream --> ShuffledDStream
wordCounts.print()
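Each of these user-facing operators simply constructs the corresponding DStream subclass. For example, DStream.flatMap in Spark's source is roughly the following (lightly abridged; details vary by version):

def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] = ssc.withScope {
  // Wrap this DStream as the parent of a new FlatMappedDStream
  new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
}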
This simple example shows DStreams being transformed into one another. Now let's trace the source code behind wordCounts.print() to see how the RDDs are generated. Since this code was analyzed in an earlier section, we will focus only on the key points here:
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
Let's briefly trace the call chain:

wordCounts.print()
--> print(10)
--> foreachRDD
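For reference, print(num) builds a function that takes the first num records of each batch's RDD and prints them, then hands that function to foreachRDD. Lightly condensed from DStream.scala (details vary across Spark versions):

def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      // Take one extra record so we know whether to print "..."
      val firstNum = rdd.take(num + 1)
      println("-------------------------------------------")
      println(s"Time: $time")
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}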
That is the call path from print() to foreachRDD. Next, let's look at the register() method:
/**
 * Register this streaming as an output stream. This would ensure that RDDs of this
 * DStream will be generated.
 */
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
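Registration matters because, at every batch interval, the JobGenerator asks the DStreamGraph to produce jobs for each registered output stream. Roughly, from DStreamGraph.generateJobs (paraphrased; logging omitted, details vary by version):

def generateJobs(time: Time): Seq[Job] = {
  this.synchronized {
    // Ask every registered output stream (e.g. our ForEachDStream)
    // to generate a Job for this batch time
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
}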
As explained earlier, when generateJob is called on an output DStream (here, ForEachDStream), it may produce not only a Job but also the RDDs behind it; the first thing generateJob does is call getOrCompute, which is defined in the parent class DStream.
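For reference, ForEachDStream.generateJob looks roughly like this (paraphrased from ForEachDStream.scala; details vary by version). Note how the batch's RDD is materialized via parent.getOrCompute before the Job is even run:

override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      // Wrap the user's foreachFunc and the batch's RDD into a Job;
      // the RDD itself was just created (or fetched) by getOrCompute
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}

getOrCompute itself looks like this: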
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
Here we finally see an RDD being produced. This happens at the last DStream in the program, because print() is an output (action-like) operation. Just as calling an action on an RDD does not produce a new RDD, an output operation on a DStream does not produce a new DStream. Since a DStream is the template for its RDDs, the RDD it records for each batchDuration is the last RDD in that batch's lineage; by walking back through the dependencies between DStreams we recover the whole compute chain. Every DStream-to-DStream transformation goes through the compute method, so what does compute actually do?
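Before looking at a concrete implementation, it helps to see how the recursion unwinds for one batch time t of the word-count example above. This is a conceptual trace, not real code:

ForEachDStream.generateJob(t)
  --> ShuffledDStream.getOrCompute(t)          --> compute(t)
    --> MappedDStream.getOrCompute(t)          --> compute(t)
      --> FlatMappedDStream.getOrCompute(t)    --> compute(t)
        --> SocketInputDStream.getOrCompute(t) --> compute(t)  // builds the BlockRDD

Each compute call wraps the parent's RDD in a new transformed RDD, so the RDD lineage of the batch mirrors the DStream graph.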
Take FlatMappedDStream's compute method as an example:
override def compute(validTime: Time): Option[RDD[U]] = {
  parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
}
Again, the first call is to getOrCompute on the parent DStream; the new RDD for this batch is produced by transforming the parent's RDD, which is exactly where the dependency between DStreams shows up.
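For context, here is the whole FlatMappedDStream class (as it appears in Spark's source, modulo version differences). Note the dependencies method, which is what makes the backtracking described above possible:

private[streaming]
class FlatMappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    flatMapFunc: T => TraversableOnce[U]
  ) extends DStream[U](parent.ssc) {

  // The parent DStream is the only dependency of this DStream
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }
}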
Now let's look at SocketInputDStream. As the source of the data, it has no parent DStream to depend on; its compute method (inherited from ReceiverInputDStream) is:
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // If this is called for any time before the start time of the context,
      // then this returns an empty RDD. This may happen when recovering from a
      // driver failure without any write ahead log to recover pre-failure data.
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
      // for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

      // Register the input blocks information into InputInfoTracker
      val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
      ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

      // Create the BlockRDD
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}
As the first DStream in a Spark Streaming program, it has no dependencies; its compute does not delegate to a parent but runs its own logic, building the concrete RDD from the block metadata the ReceiverTracker has collected for this batch.
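The final step is createBlockRDD, which decides what kind of RDD to build from that block metadata. A condensed sketch of ReceiverInputDStream.createBlockRDD (paraphrased; error handling and logging omitted, details vary by version):

private[streaming] def createBlockRDD(
    time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {
  if (blockInfos.nonEmpty) {
    val blockIds = blockInfos.map(_.blockId.asInstanceOf[BlockId]).toArray
    if (blockInfos.forall(_.walRecordHandleOption.nonEmpty)) {
      // Every block is backed by the write ahead log, so build a
      // WriteAheadLogBackedBlockRDD that can re-read the data after a driver failure
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext,
        blockIds,
        blockInfos.map(_.walRecordHandleOption.get).toArray,
        blockInfos.map(_.isBlockIdValid()).toArray)
    } else {
      // Otherwise build a plain BlockRDD over the blocks that are still
      // present in the executors' block managers
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // No blocks were reported for this batch: return an empty RDD
    new BlockRDD[T](ssc.sc, Array.empty)
  }
}

Either way, a batch's RDD now exists and is recorded in generatedRDDs for this time, and the transformed DStreams up the chain wrap it step by step into the final RDD that the output Job acts on.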