A DStream is a template for RDDs: every batchInterval, a corresponding RDD is generated from the DStream template and then stored in the DStream's generatedRDDs data structure:

```scala
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
```

The following Spark Streaming application (based on 丁立清's application analysis) illustrates the full life cycle of RDD generation in a DStream:
```scala
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
```
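For completeness, this snippet runs inside a StreamingContext; a minimal self-contained version, assuming local mode, a 1-second batch interval, and something like `nc -lk 9999` feeding the socket, might look like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // batchInterval = 1s: each DStream generates one new RDD per second.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // start the receiver and the JobGenerator timer
    ssc.awaitTermination()  // block until the context is stopped
  }
}
```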
Behind the scenes, this chain of transformations builds the following DStream instances:

```scala
val lines = new SocketInputDStream("localhost", 9999)   // type: SocketInputDStream
val words = new FlatMappedDStream(lines, _.split(" "))  // type: FlatMappedDStream
val pairs = new MappedDStream(words, word => (word, 1)) // type: MappedDStream
val wordCounts = new ShuffledDStream(pairs, _ + _)      // type: ShuffledDStream
new ForeachDStream(wordCounts, cnt => cnt.print())      // type: ForeachDStream
```
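Each of these derived DStreams merely records its parent and the user function; the per-batch work happens in its compute method, which asks the parent for that batch's RDD and applies the function to it. Paraphrased from the Spark source, MappedDStream is essentially:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Time}
import org.apache.spark.streaming.dstream.DStream

class MappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    mapFunc: T => U
  ) extends DStream[U](parent.ssc) {

  // The DStream dependency chain mirrors the RDD lineage produced per batch.
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  // Ask the parent for this batch's RDD and apply mapFunc to it.
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
  }
}
```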
Let us first look at DStream's print method:
```scala
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
```
The method first defines a function that takes the first few records of the RDD and prints them together with the batch time; it then hands that function to foreachRDD.
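As an aside, user code can approximate print through the public foreachRDD overload that also receives the batch Time; a minimal sketch, with num fixed at 10:

```scala
// A user-level approximation of print(10); not the internal implementation.
wordCounts.foreachRDD { (rdd, time) =>
  val firstNum = rdd.take(11)  // take one extra record to detect truncation
  println("Time: " + time)
  firstNum.take(10).foreach(println)
  if (firstNum.length > 10) println("...")
}
```

Internally, print delegates to the private foreachRDD, which wraps the function in a ForEachDStream and registers it: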
```scala
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
```
register then adds this output stream to the DStreamGraph:

```scala
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
```
addOutputStream records it in the graph's outputStreams buffer:

```scala
def addOutputStream(outputStream: DStream[_]) {
  this.synchronized {
    outputStream.setGraph(this)
    outputStreams += outputStream
  }
}
```
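One consequence of register is that every output operation contributes its own ForEachDStream, so a program with two output operations ends up with two entries in outputStreams and two jobs per batch. A small illustration (the save path prefix is hypothetical):

```scala
// Each output operation registers one ForEachDStream with the DStreamGraph,
// so every batch generates two jobs.
wordCounts.print()
wordCounts.saveAsTextFiles("hdfs:///tmp/wordCounts")  // hypothetical path prefix
```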
When a batchInterval boundary is reached, DStreamGraph's generateJobs method is invoked:
```scala
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
```
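No user code calls this directly: JobGenerator owns a RecurringTimer that fires once per batchInterval and drives the graph. A simplified paraphrase of JobGenerator's own generateJobs (checkpointing and other details omitted):

```scala
// Simplified paraphrase of JobGenerator.generateJobs; not verbatim source.
private def generateJobs(time: Time): Unit = {
  Try {
    // Let the ReceiverTracker assign the received blocks to this batch,
    // then walk the output streams to build this batch's jobs.
    jobScheduler.receiverTracker.allocateBlocksToBatch(time)
    graph.generateJobs(time)
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
}
```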
For each registered output stream, generateJobs calls generateJob; ForEachDStream overrides it as follows:

```scala
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
```
Starting from this method, the call chain traces back up the DStream dependencies via getOrCompute until it reaches the initial DStream, where a new RDD is generated and written into generatedRDDs.
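The pivot of that backtracking is DStream.getOrCompute, which is also the only place generatedRDDs is populated. A simplified paraphrase of the Spark source (persistence, checkpointing, and call-site handling omitted):

```scala
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // Reuse the RDD if this batch has already been materialized; otherwise
  // ask the concrete DStream's compute(time) for a new one.
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      val rddOption = compute(time)
      // Record the freshly generated RDD in the template's map.
      rddOption.foreach(newRDD => generatedRDDs.put(time, newRDD))
      rddOption
    } else {
      None
    }
  }
}
```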
In fact, as the previous articles established: a DStream is a template for RDDs, and its internal generatedRDDs holds the RDD instance generated for each BatchDuration. The dependencies between DStreams define the dependencies between the resulting RDDs, so when computing from back to front, only the last DStream needs to be computed. Every BatchDuration, JobGenerator calls DStreamGraph's generateJobs method, which calls ForEachDStream's generateJob method; that in turn first calls the parent DStream's getOrCompute method to obtain the RDD and then performs the computation. Tracing backward like this, the first DStream is a ReceiverInputDStream, whose compute method fetches the metadata for the corresponding time interval from the receiverTracker, generates a BlockRDD object, and puts it into generatedRDDs.
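For reference, a simplified paraphrase of ReceiverInputDStream.compute (the write-ahead-log recovery branch is omitted) shows exactly this BlockRDD creation:

```scala
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = if (validTime < graph.startTime) {
    // The batch time predates the graph's start (e.g. after a restart):
    // return an empty BlockRDD.
    new BlockRDD[T](ssc.sc, Array.empty)
  } else {
    // Fetch the metadata of the blocks allocated to this batch from the
    // ReceiverTracker and turn it into a BlockRDD.
    val receiverTracker = ssc.scheduler.receiverTracker
    val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
    createBlockRDD(validTime, blockInfos)
  }
  Some(blockRDD)
}
```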