From the program above we can see that Spark Streaming treats the data arriving within one time interval as a whole, so the data in that interval can be computed as an RDD; this is the core idea of Spark Streaming. This section walks through the source code of several key classes in Spark Streaming.

1.StreamingContext: first, the official doc comment:

/**
* Main entry point for Spark Streaming functionality. It provides methods used to create
* [[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be either
* created by providing a Spark master URL and an appName, or from a org.apache.spark.SparkConf
* configuration (see core Spark documentation), or from an existing org.apache.spark.SparkContext.
* The associated SparkContext can be accessed using `context.sparkContext`. After
* creating and transforming DStreams, the streaming computation can be started and stopped
* using `context.start()` and `context.stop()`, respectively.
* `context.awaitTermination()` allows the current thread to wait for the termination
* of the context by `stop()` or by an exception.
*/
Roughly, this says that StreamingContext is the main entry point of Spark Streaming. It provides methods to create DStreams from various input sources. It can be created from a Spark master URL and an app name, from a SparkConf configuration, or from an existing SparkContext; the associated SparkContext can be accessed via `context.sparkContext`. After creating and transforming DStreams, the streaming computation is started with `context.start()` and stopped with `context.stop()`, while `context.awaitTermination()` lets the current thread wait until the context is terminated by `stop()` or by an exception.
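As a minimal sketch of this lifecycle (the master URL "local[2]", the app name, the 5-second batch interval, and the host/port below are illustrative choices, not required values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextDemo {
  def main(args: Array[String]): Unit = {
    // Build a SparkConf; a StreamingContext can also wrap an existing SparkContext.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingContextDemo")
    // The batch interval: every 5 seconds of input becomes one RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Create a DStream from a TCP source and print each batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()            // start the streaming computation
    ssc.awaitTermination() // block until stop() is called or an exception occurs
  }
}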
2.Receiver:
/**
* :: DeveloperApi ::
* Abstract class of a receiver that can be run on worker nodes to receive external data. A
* custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
* should define the setup steps necessary to start receiving data,
* and `onStop()` should define the cleanup steps necessary to stop receiving data.
* Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
* or by stopping it completely with `stop(...)`.
*
* A custom receiver in Scala would look like this.
*
* {{{
* class MyReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
* def onStart() {
* // Setup stuff (start threads, open sockets, etc.) to start receiving data.
* // Must start new thread to receive data, as onStart() must be non-blocking.
*
* // Call store(...) in those threads to store received data into Spark's memory.
*
* // Call stop(...), restart(...) or reportError(...) on any thread based on how
* // different errors need to be handled.
*
* // See corresponding method documentation for more details
* }
*
* def onStop() {
* // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
* }
* }
* }}}
*
* A custom receiver in Java would look like this.
*
* {{{
* class MyReceiver extends Receiver<String> {
* public MyReceiver(StorageLevel storageLevel) {
* super(storageLevel);
* }
*
* public void onStart() {
* // Setup stuff (start threads, open sockets, etc.) to start receiving data.
* // Must start new thread to receive data, as onStart() must be non-blocking.
*
* // Call store(...) in those threads to store received data into Spark's memory.
*
* // Call stop(...), restart(...) or reportError(...) on any thread based on how
* // different errors need to be handled.
*
* // See corresponding method documentation for more details
* }
*
* public void onStop() {
* // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
* }
* }
* }}}
*/
As the doc comment shows, Receiver is an abstract class that runs on worker nodes to receive external data.
We can define a custom receiver for our own data source: `onStart()` contains the setup steps needed to start receiving data, and `onStop()` contains the cleanup steps needed to stop receiving. The doc comment shows both the Scala and the Java way of defining a receiver.
`onStart()` is called when the receiver starts; it must be non-blocking, so the actual data reception has to run on a separate thread, which hands received data to Spark via `store(...)`.
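As a concrete illustration, here is a minimal Scala receiver sketch that reads lines from a TCP socket; the class name and host/port parameters are hypothetical, but the onStart()/onStop()/store()/restart() calls follow the contract described above:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // onStart() must not block, so receive on a dedicated thread.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do here: the receive thread checks isStopped()
    // and closes its socket in the finally block below.
  }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand each received line to Spark's memory
        line = reader.readLine()
      }
      // Ask Spark to restart the receiver when the connection ends.
      restart("Connection closed, trying to reconnect")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}

Such a receiver is plugged into a streaming program with ssc.receiverStream(new SocketLineReceiver("localhost", 9999)), which yields a DStream[String].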
3.DStream
/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates an RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)]) through implicit conversions.
 *
 * A DStream internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */
A DStream (Discretized Stream) is the most basic abstraction in Spark Streaming: a continuous sequence of RDDs representing a continuous stream of data. A DStream can be created from live data (TCP sockets, Kafka, Flume, etc.) or generated by transforming an existing DStream with operations such as `map`, `window`, and `reduceByKeyAndWindow`; while the program runs, each DStream periodically generates an RDD, either from live data or from the RDD produced by its parent DStream.
In short, a DStream is internally characterized by three basic properties:
1.a list of other DStreams that it depends on
2.a time interval at which it generates an RDD
3.a function that is used to generate an RDD after each time interval
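These three properties can be seen directly in a small word-count sketch (the host/port and interval values are illustrative): each transformed DStream depends on its parent, and each batch interval every DStream generates a new RDD from its parent's RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream created from live data: one RDD of lines per 5-second batch.
    val lines = ssc.socketTextStream("localhost", 9999)

    // DStreams generated by transforming a parent DStream; each depends on
    // the DStream above it and builds its RDD from the parent's RDD.
    val words = lines.flatMap(_.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    // A pair-DStream-only operation: word counts over the last 30 seconds,
    // recomputed every 10 seconds.
    val windowed = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}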