First, let's analyze the subclasses of DStream:
A. From the class diagram above we can see that the subclasses of InputDStream are all data-source DStreams. InputDStreams fall into two kinds: those that extend ReceiverInputDStream, and those that do not need to, such as FileInputDStream.
B. In the diagram, ForEachDStream is the output DStream: all output operators ultimately end up calling into this class.
I. How does the FileInputDStream in the example get added to the DStreamGraph?
1. As usual, let's start from the example and trace our way through:
/** @author luyllyl@gmail.com */
object HdfsWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: HdfsWordCount <directory>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2. Tracing into the textFileStream() method, we find it ends up in the following code:
def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ] (directory: String): InputDStream[(K, V)] = {
  // ssc.textFileStream triggers the creation of a new FileInputDStream,
  // which extends InputDStream
  new FileInputDStream[K, V, F](this, directory)
}
3. So how does the FileInputDStream get added to the DStreamGraph?
Because FileInputDStream extends InputDStream, instantiating a FileInputDStream also runs the initializer of the parent class InputDStream, which contains the statement
ssc.graph.addInputStream(this). This is how the instance adds itself to the DStreamGraph:
abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
  extends DStream[T](ssc_) {

  private[streaming] var lastValidTime: Time = null

  ssc.graph.addInputStream(this)
Stepping into DStreamGraph and briefly analyzing the addInputStream method, we find that the FileInputDStream passed in is stored in inputStreams, which is an array.
==> Conclusion: application code can call textFileStream() multiple times to monitor different directories.
final private[streaming] class DStreamGraph extends Serializable with Logging {

  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()

  ...

  def addInputStream(inputStream: InputDStream[_]) {
    this.synchronized {
      inputStream.setGraph(this)
      inputStreams += inputStream
    }
  }
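Because inputStreams is an ArrayBuffer, each textFileStream() call simply appends another FileInputDStream, which is why one application can watch several directories. The registration pattern can be sketched as follows; note that MiniGraph and MiniInput here are simplified stand-ins for illustration, not the real Spark classes:

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for DStreamGraph (not the real Spark class)
class MiniGraph {
  private val inputStreams = new ArrayBuffer[MiniInput]()
  def addInputStream(in: MiniInput): Unit = this.synchronized {
    inputStreams += in
  }
  def inputCount: Int = inputStreams.size
}

// Mirrors InputDStream's pattern: register with the graph at construction time
class MiniInput(graph: MiniGraph, val directory: String) {
  graph.addInputStream(this)
}

object Demo {
  val graph = new MiniGraph
  val dir1 = new MiniInput(graph, "/data/dir1")  // first textFileStream-like call
  val dir2 = new MiniInput(graph, "/data/dir2")  // second call, second directory

  def main(args: Array[String]): Unit =
    println(graph.inputCount)  // 2
}
```

Each construction registers the instance as a side effect, exactly the behavior the InputDStream initializer above relies on.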
4. The subsequent DStream operators only build up the DStream DAG view, analogous to the RDD DAG view.
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
Stepping into the flatMap operator:
def flatMap[U: ClassTag](flatMapFunc: T => Traversable[U]): DStream[U] = ssc.withScope {
  new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
}
Stepping into FlatMappedDStream for a quick look:
private[streaming] class FlatMappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    flatMapFunc: T => Traversable[U]
  ) extends DStream[U](parent.ssc) {

  // Every DStream subclass implements `def dependencies: List[DStream[_]]`,
  // which returns the list of parent DStreams it depends on.
  // For example, an InputDStream has no parent, so its dependencies method
  // returns List(). FlatMappedDStream has exactly one dependency: the
  // `parent` object is the DStream instance that flatMap was called on.
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  /** Method that generates a RDD for the given time.
   *  The DStream docs explain that this method produces the RDD for a batch;
   *  how compute is invoked and how the RDD is generated is analyzed later. */
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }
}
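To make the dependency wiring concrete, here is a hedged, simplified model of how the transformation chain in the example links up. MiniStream and its subclasses are illustrative stand-ins, not Spark's DStream hierarchy:

```scala
// Simplified stand-in for DStream's dependency chain (not the real Spark classes)
abstract class MiniStream(val name: String) {
  def dependencies: List[MiniStream]
}

// Like InputDStream: no parents
class MiniInputStream(name: String) extends MiniStream(name) {
  override def dependencies: List[MiniStream] = List()
}

// Like FlatMappedDStream / MappedDStream: exactly one parent
class MiniTransformedStream(name: String, parent: MiniStream) extends MiniStream(name) {
  override def dependencies: List[MiniStream] = List(parent)
}

object ChainDemo {
  val lines  = new MiniInputStream("FileInput")
  val words  = new MiniTransformedStream("FlatMapped", lines)   // flatMap
  val counts = new MiniTransformedStream("Mapped", words)       // map

  // Walk from the tail of the chain back to the source, like lineage traversal
  def lineage(s: MiniStream): List[String] =
    s.name :: s.dependencies.flatMap(lineage)

  def main(args: Array[String]): Unit =
    println(lineage(counts).mkString(" -> "))  // Mapped -> FlatMapped -> FileInput
}
```

Walking dependencies from the last DStream back to the input recovers the whole chain, which is the same idea RDD lineage uses.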
II. How does the OutputDStream in the example get added to the DStreamGraph?
1. In the example above, the call to the print() operator ultimately lands on the ForEachDStream class.
(In fact, every output operator goes through ForEachDStream; a few steps of tracing the source confirms this.)
wordCounts.print()
2. So how does it get added to the DStreamGraph?
A. The print() operator prints the first 10 elements of each RDD by default:
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}

A quick note on the clean() method: every Scala function is an object on the JVM. The most important job of clean() is, when the function object is serialized, to strip out the unneeded members captured from the enclosing scope while keeping the members the function actually uses, ensuring efficient serialization. (This method matters; it shows up almost everywhere in the source.)
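The claim that "every Scala function is a JVM object" can be checked directly: a Scala 2 function literal is a serializable object, which is exactly why Spark can ship it to executors. This is an illustrative demo of plain Java serialization, not of Spark's ClosureCleaner itself:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializableFunctionDemo {
  // A Scala 2 function literal compiles to a serializable object
  val addOne: Int => Int = (x: Int) => x + 1

  // Serialize it with plain Java serialization...
  private val bytes = new ByteArrayOutputStream()
  private val out = new ObjectOutputStream(bytes)
  out.writeObject(addOne)
  out.close()

  // ...and deserialize it back into a working function
  private val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  val restored: Int => Int = in.readObject().asInstanceOf[Int => Int]

  def main(args: Array[String]): Unit =
    println(restored(41))  // 42
}
```

What ClosureCleaner adds on top of this round trip is the pruning of captured-but-unused outer references, so the serialized bytes stay small and the closure does not drag in non-serializable enclosing objects.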
B. Tracing into the foreachRDD method:
The output operations on DStream include print, saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles and foreachRDD. Every output operation creates a ForEachDStream instance and calls register() to add itself to the DStreamGraph.outputStreams member.

Unlike DStream transform operations, which return a new DStream, output operations return nothing; they only create a ForEachDStream that terminates the dependency chain.

private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
Tracing into register(), we find where the DStreamGraph object adds the current ForEachDStream to its member:
/**
 * Register this streaming as an output stream. This would ensure that RDDs of this
 * DStream will be generated.
 */
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
A quick look at DStreamGraph's addOutputStream method shows that outputStreams is also a Scala mutable array:
final private[streaming] class DStreamGraph extends Serializable with Logging {

  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()

  def addOutputStream(outputStream: DStream[_]) {
    this.synchronized {
      // The ForEachDStream sets the DStreamGraph object on itself and on its
      // parent DStreams, forming the DAG view of the DStreamGraph
      outputStream.setGraph(this)
      outputStreams += outputStream
    }
  }
3. Stepping into setGraph, defined in ForEachDStream's parent class DStream: it traverses all the parent DStreams that the ForEachDStream depends on and sets the current DStreamGraph on every one of those DStream instances.
private[streaming] def setGraph(g: DStreamGraph) {
  if (graph != null && graph != g) {
    throw new SparkException("Graph is already set in " + this + ", cannot set it again")
  }
  graph = g
  // addInputStream also calls setGraph; since FileInputDStream has no parent
  // DStream, it only needs to set the graph on itself
  dependencies.foreach(_.setGraph(graph))
}
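The recursive propagation of setGraph through dependencies can be sketched with a simplified model. Node, Source and Derived below are stand-ins for illustration, not Spark's classes:

```scala
// Simplified stand-in showing how setGraph recurses through dependencies
class MiniGraph

abstract class Node {
  var graph: MiniGraph = null
  def dependencies: List[Node]
  def setGraph(g: MiniGraph): Unit = {
    graph = g
    dependencies.foreach(_.setGraph(g))  // propagate to every parent
  }
}

class Source extends Node { def dependencies = List() }                  // like FileInputDStream
class Derived(parent: Node) extends Node { def dependencies = List(parent) }

object SetGraphDemo {
  val input   = new Source          // like FileInputDStream
  val flat    = new Derived(input)  // like FlatMappedDStream
  val forEach = new Derived(flat)   // like ForEachDStream

  val g = new MiniGraph
  forEach.setGraph(g)  // as addOutputStream would do

  // every node in the chain now holds the same graph
  def allSet: Boolean = Seq(input, flat, forEach).forall(_.graph eq g)

  def main(args: Array[String]): Unit =
    println(allSet)  // true
}
```

Calling setGraph on the last DStream of the chain is therefore enough to stamp the graph onto every ancestor, which is why addOutputStream only needs a single call.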
A. This assigns the graph member of every DStream instance. From the dependencies member present on every DStream we can draw a conclusion: the DStream DAG follows the same design as the RDD DAG.
private[streaming] var graph: DStreamGraph = null
B. In the current example, the parent DStreams of the ForEachDStream are:
ShuffledDStream
MappedDStream
FlatMappedDStream
FileInputDStream
Next, we will analyze the full path HdfsWordCount takes from DStream to RDD.