From the program above we can see that Spark Streaming treats the data arriving within one time interval as a whole, so the data in that interval can be computed as an RDD; this is the core idea of Spark Streaming. This section walks through the source code of several key classes in Spark Streaming.

1.StreamingContext: first, the official doc comment:

/**
* Main entry point for Spark Streaming functionality. It provides methods used to create
* [[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be either
* created by providing a Spark master URL and an appName, or from a org.apache.spark.SparkConf
* configuration (see core Spark documentation), or from an existing org.apache.spark.SparkContext.
* The associated SparkContext can be accessed using `context.sparkContext`. After
* creating and transforming DStreams, the streaming computation can be started and stopped
* using `context.start()` and `context.stop()`, respectively.
* `context.awaitTermination()` allows the current thread to wait for the termination
* of the context by `stop()` or by an exception.
*/
Roughly, this says that StreamingContext is the main entry point of Spark Streaming. It provides methods to create DStreams from various input sources. It can be created from a Spark master URL and an app name, from a SparkConf configuration, or from an existing SparkContext; the associated SparkContext can be accessed via `context.sparkContext`. After creating and transforming DStreams, the streaming computation is started with `context.start()` and stopped with `context.stop()`, while `context.awaitTermination()` lets the current thread wait until the context is terminated by `stop()` or by an exception.
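As a minimal sketch of this lifecycle (the master URL "local[2]", the app name, the 5-second batch interval, and the host/port below are illustrative choices, not required values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextDemo {
  def main(args: Array[String]): Unit = {
    // Build a SparkConf; a StreamingContext can also wrap an existing SparkContext.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingContextDemo")
    // The batch interval: every 5 seconds of input becomes one RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Create a DStream from a TCP source and print each batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()            // start the streaming computation
    ssc.awaitTermination() // block until stop() is called or an exception occurs
  }
}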
2.Receiver:
/**
* :: DeveloperApi ::
* Abstract class of a receiver that can be run on worker nodes to receive external data. A
* custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
* should define the setup steps necessary to start receiving data,
* and `onStop()` should define the cleanup steps necessary to stop receiving data.
* Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
* or by stopping it completely with `stop(...)`.
*
* A custom receiver in Scala would look like this.
*
* {{{
* class MyReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
* def onStart() {
* // Setup stuff (start threads, open sockets, etc.) to start receiving data.
* // Must start new thread to receive data, as onStart() must be non-blocking.
*
* // Call store(...) in those threads to store received data into Spark's memory.
*
* // Call stop(...), restart(...) or reportError(...) on any thread based on how
* // different errors need to be handled.
*
* // See corresponding method documentation for more details
* }
*
* def onStop() {
* // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
* }
* }
* }}}
*
* A custom receiver in Java would look like this.
*
* {{{
* class MyReceiver extends Receiver<String> {
* public MyReceiver(StorageLevel storageLevel) {
* super(storageLevel);
* }
*
* public void onStart() {
* // Setup stuff (start threads, open sockets, etc.) to start receiving data.
* // Must start new thread to receive data, as onStart() must be non-blocking.
*
* // Call store(...) in those threads to store received data into Spark's memory.
*
* // Call stop(...), restart(...) or reportError(...) on any thread based on how
* // different errors need to be handled.
*
* // See corresponding method documentation for more details
* }
*
* public void onStop() {
* // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
* }
* }
* }}}
*/
As the doc comment shows, Receiver is an abstract class that runs on worker nodes to receive external data.
We can define a custom receiver for our own data source: `onStart()` contains the setup steps needed to start receiving data, and `onStop()` contains the cleanup steps needed to stop receiving. The doc comment shows both the Scala and the Java way of defining a receiver.
`onStart()` is called when the receiver starts; it must be non-blocking, so the actual data reception has to run on a separate thread, which hands received data to Spark via `store(...)`.
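As a concrete illustration, here is a minimal Scala receiver sketch that reads lines from a TCP socket; the class name and host/port parameters are hypothetical, but the onStart()/onStop()/store()/restart() calls follow the contract described above:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // onStart() must not block, so receive on a dedicated thread.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do here: the receive thread checks isStopped()
    // and closes its socket in the finally block below.
  }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand each received line to Spark's memory
        line = reader.readLine()
      }
      // Ask Spark to restart the receiver when the connection ends.
      restart("Connection closed, trying to reconnect")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}

Such a receiver is plugged into a streaming program with ssc.receiverStream(new SocketLineReceiver("localhost", 9999)), which yields a DStream[String].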
3.DStream
/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates an RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)]) through implicit conversions.
 *
 * A DStream internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */
A DStream (Discretized Stream) is the most basic abstraction in Spark Streaming: a continuous sequence of RDDs representing a continuous stream of data. A DStream can be created from live data (TCP sockets, Kafka, Flume, etc.) or generated by transforming an existing DStream with operations such as `map`, `window`, and `reduceByKeyAndWindow`; while the program runs, each DStream periodically generates an RDD, either from live data or from the RDD produced by its parent DStream.
In short, a DStream is internally characterized by three basic properties:
1.a list of other DStreams that it depends on
2.a time interval at which it generates an RDD
3.a function that is used to generate an RDD after each time interval
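These three properties can be seen directly in a small word-count sketch (the host/port and interval values are illustrative): each transformed DStream depends on its parent, and each batch interval every DStream generates a new RDD from its parent's RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream created from live data: one RDD of lines per 5-second batch.
    val lines = ssc.socketTextStream("localhost", 9999)

    // DStreams generated by transforming a parent DStream; each depends on
    // the DStream above it and builds its RDD from the parent's RDD.
    val words = lines.flatMap(_.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    // A pair-DStream-only operation: word counts over the last 30 seconds,
    // recomputed every 10 seconds.
    val windowed = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}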