Spark Streaming: Core Concepts and Programming

Core concepts:
1) StreamingContext
2) DStream (a sequence of batched RDDs)
3) Input DStream and Receiver
4) Transformations and Output Operations (analogous to RDD transformations and actions)




StreamingContext
Commonly used constructors:
def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
}
def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
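As a sketch of how either constructor is used (the app name, master URL, and batch interval here are arbitrary choices, not prescribed by the API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")

// From a SparkConf (Spark creates the SparkContext internally):
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

// Or from an existing SparkContext:
// val sc = new SparkContext(conf)
// val ssc = new StreamingContext(sc, Seconds(5))
```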

batch interval
The batch interval should be set based on the latency requirements of your application and the resources available on the cluster.


Once a StreamingContext has been defined, you can:
  • Define the input sources by creating input DStreams.
  • Define the streaming computations by applying transformation and output operations to DStreams (similar to RDD operations).
  • Start receiving data and processing it using streamingContext.start().
  • Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
  • The processing can be manually stopped using streamingContext.stop().
Notes:
  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on StreamingContext also stops the SparkContext. 
    • To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
  • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
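The lifecycle above can be sketched as follows (assuming ssc is a StreamingContext whose input DStreams and computations are already defined):

```scala
ssc.start()             // start receiving data and processing it
ssc.awaitTermination()  // block until stop() is called or an error occurs

// Elsewhere: stop streaming but keep the SparkContext alive,
// so it can back a later StreamingContext.
ssc.stop(stopSparkContext = false)
```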



Discretized Streams (DStreams)
Internally, a DStream is represented by a continuous series of RDDs.
Each RDD in a DStream contains data from a certain interval


Operators applied to a DStream, such as map or flatMap, are translated under the hood into the same operation applied to every RDD in that DStream,
because a DStream is made up of the RDDs from successive batches.
Any operation applied on a DStream translates to operations on the underlying RDDs. 
For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.
These underlying RDD transformations are computed by the Spark engine
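The lines-to-words example reads roughly like this in Scala (the host and port are placeholders; ssc is an existing StreamingContext):

```scala
// Each batch interval, flatMap runs on that interval's RDD of lines,
// producing the corresponding RDD of the words DStream.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
words.print() // output operation: print the first elements of each batch
```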





Input DStreams and Receivers

Every input DStream (except file streams)
is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.

Points to remember:
  • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. 
    • Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. 
    • Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
  • Extending the logic to running on a cluster,
    •  the number of cores allocated to the Spark Streaming application must be more than the number of receivers. 
    • Otherwise the system will receive data, but not be able to process it.
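For instance, with a single socket receiver, a local run needs at least two threads, one for the receiver and one for processing. A minimal sketch (the app name is arbitrary):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")   // n = 2 > 1 receiver; "local[1]" would starve processing
  .setAppName("SocketDemo")
```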


Spark Streaming provides two categories of built-in streaming sources.
  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, etc. are available through extra utility classes. 
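The two basic sources are created directly from the StreamingContext (the host, port, and path below are placeholders):

```scala
// Basic source: text lines from a TCP socket
val socketStream = ssc.socketTextStream("localhost", 9999)

// Basic source: new text files appearing under an HDFS directory
val fileStream = ssc.textFileStream("hdfs://namenode:8040/logs/")
```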





For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created via StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass].
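For plain text files, the typed API might look like this (LongWritable, Text, and TextInputFormat are the standard Hadoop classes for text input; the path is a placeholder):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Monitor the directory and read new files as (byte offset, line) pairs
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://namenode:8040/logs/")
val lines = stream.map(_._2.toString) // keep only the line text
```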

Spark Streaming will monitor the given directory (dataDirectory) and process any files created in that directory.
  • A simple directory can be monitored, such as "hdfs://namenode:8040/logs/". All files directly under such a path will be processed as they are discovered.
  • A POSIX glob pattern can be supplied, such as "hdfs://namenode:8040/logs/2017/*". Here, the DStream will consist of all files in the directories matching the pattern. 
    • That is: it is a pattern of directories, not of files in directories.
  • All files must be in the same data format.
  • A file is considered part of a time period based on its modification time, not its creation time.
  • Once processed, changes to a file within the current window will not cause the file to be reread. 
    • That is: updates are ignored.
  • The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
  • If a wildcard is used to identify directories, such as "hdfs://namenode:8040/logs/2016-*", renaming an entire directory to match the path will add the directory to the list of monitored directories. Only the files in the directory whose modification time is within the current window will be included in the stream.
  • Calling FileSystem.setTimes() to fix the timestamp is a way to have the file picked up in a later window, even if its contents have not changed.
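A sketch of that timestamp fix using the Hadoop FileSystem API (the path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Set the modification time to now; -1 leaves the access time unchanged.
// The file will then be picked up in a later processing window.
fs.setTimes(new Path("hdfs://namenode:8040/logs/app.log"), System.currentTimeMillis(), -1L)
```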














