Spark Streaming: Core Concepts and Programming

Core concepts:
1) StreamingContext
2) DStream (a sequence of batched RDDs)
3) Input DStream and Receiver
4) Transformations and Output Operations (analogous to RDD transformations and actions)




StreamingContext
Commonly used constructors:
def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
}
def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
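As a sketch of how either constructor is used (the app name, master URL, and batch interval here are arbitrary choices, not prescribed by the API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")

// From a SparkConf (Spark creates the SparkContext internally):
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

// Or from an existing SparkContext:
// val sc = new SparkContext(conf)
// val ssc = new StreamingContext(sc, Seconds(5))
```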

batch interval
The batch interval should be set based on the latency requirements of your application and the resources available on the cluster.


Once a StreamingContext has been defined, you can:
  • Define the input sources by creating input DStreams.
  • Define the streaming computations by applying transformation and output operations to DStreams (similar to RDD operations).
  • Start receiving data and processing it using streamingContext.start().
  • Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
  • The processing can be manually stopped using streamingContext.stop().
Notes:
  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on StreamingContext also stops the SparkContext. 
    • To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
  • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
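The lifecycle above can be sketched as follows (assuming ssc is a StreamingContext whose input DStreams and computations are already defined):

```scala
ssc.start()             // start receiving data and processing it
ssc.awaitTermination()  // block until stop() is called or an error occurs

// Elsewhere: stop streaming but keep the SparkContext alive,
// so it can back a later StreamingContext.
ssc.stop(stopSparkContext = false)
```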



Discretized Streams (DStreams)
Internally, a DStream is represented by a continuous series of RDDs.
Each RDD in a DStream contains data from a certain interval


Operators applied to a DStream, such as map or flatMap, are translated under the hood into the same operation applied to every RDD in that DStream,
because a DStream is made up of the RDDs from successive batches.
Any operation applied on a DStream translates to operations on the underlying RDDs. 
For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.
These underlying RDD transformations are computed by the Spark engine
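The lines-to-words example reads roughly like this in Scala (the host and port are placeholders; ssc is an existing StreamingContext):

```scala
// Each batch interval, flatMap runs on that interval's RDD of lines,
// producing the corresponding RDD of the words DStream.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
words.print() // output operation: print the first elements of each batch
```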





Input DStreams and Receivers

Every input DStream (except file streams)
is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.

Points to remember:
  • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. 
    • Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. 
    • Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
  • Extending the logic to running on a cluster,
    •  the number of cores allocated to the Spark Streaming application must be more than the number of receivers. 
    • Otherwise the system will receive data, but not be able to process it.
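For instance, with a single socket receiver, a local run needs at least two threads, one for the receiver and one for processing. A minimal sketch (the app name is arbitrary):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")   // n = 2 > 1 receiver; "local[1]" would starve processing
  .setAppName("SocketDemo")
```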


Spark Streaming provides two categories of built-in streaming sources.
  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, etc. are available through extra utility classes. 
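The two basic sources are created directly from the StreamingContext (the host, port, and path below are placeholders):

```scala
// Basic source: text lines from a TCP socket
val socketStream = ssc.socketTextStream("localhost", 9999)

// Basic source: new text files appearing under an HDFS directory
val fileStream = ssc.textFileStream("hdfs://namenode:8040/logs/")
```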





For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created via StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass].
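For plain text files, the typed API might look like this (LongWritable, Text, and TextInputFormat are the standard Hadoop classes for text input; the path is a placeholder):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Monitor the directory and read new files as (byte offset, line) pairs
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://namenode:8040/logs/")
val lines = stream.map(_._2.toString) // keep only the line text
```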

Spark Streaming will monitor the given directory (dataDirectory) and process any files created in that directory.
  • A simple directory can be monitored, such as "hdfs://namenode:8040/logs/". All files directly under such a path will be processed as they are discovered.
  • A POSIX glob pattern can be supplied, such as "hdfs://namenode:8040/logs/2017/*". Here, the DStream will consist of all files in the directories matching the pattern. 
    • That is: it is a pattern of directories, not of files in directories.
  • All files must be in the same data format.
  • A file is considered part of a time period based on its modification time, not its creation time.
  • Once processed, changes to a file within the current window will not cause the file to be reread. 
    • That is: updates are ignored.
  • The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
  • If a wildcard is used to identify directories, such as "hdfs://namenode:8040/logs/2016-*", renaming an entire directory to match the path will add the directory to the list of monitored directories. Only the files in the directory whose modification time is within the current window will be included in the stream.
  • Calling FileSystem.setTimes() to fix the timestamp is a way to have the file picked up in a later window, even if its contents have not changed.
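A sketch of that timestamp fix using the Hadoop FileSystem API (the path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Set the modification time to now; -1 leaves the access time unchanged.
// The file will then be picked up in a later processing window.
fs.setTimes(new Path("hdfs://namenode:8040/logs/app.log"), System.currentTimeMillis(), -1L)
```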














