Spark Streaming Core Concepts

Core Concept: StreamingContext
In IDEA, search for StreamingContext.scala to see its constructors:
def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
The batch interval can be set according to your application's latency requirements and the resources available on the cluster.
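As a minimal sketch of the second constructor above (the app name, master, and 5-second batch interval are assumptions, not values from the notes):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed app name and master; adjust for your cluster.
val conf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")

// The second argument is the batch interval: here, one micro-batch every 5 seconds.
val ssc = new StreamingContext(conf, Seconds(5))
```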
After a context is defined, you have to do the following:
- Define the input sources by creating input DStreams.
- Define the streaming computations by applying transformation and output operations to DStreams.
- Start receiving data and processing it using streamingContext.start().
- Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
- The processing can be manually stopped using streamingContext.stop().
Discretized Streams (DStreams)
Internally, a DStream is represented by a continuous series of RDDs
Each RDD in a DStream contains data from a certain interval
Operators applied to a DStream, such as map or flatMap, are translated under the hood into the same operation on each RDD in the DStream, because a DStream is made up of the RDDs of its successive batches.
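A sketch of this per-RDD translation (the socket host and port are assumptions): each batch of lines becomes one RDD, and the flatMap/map below are applied to every one of those RDDs in turn.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamOps").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// One RDD[String] of lines per batch interval.
val lines = ssc.socketTextStream("localhost", 9999)

// These run as RDD.flatMap / RDD.map on every batch's RDD.
val words = lines.flatMap(_.split(" "))   // RDD[String] per batch
val pairs = words.map(word => (word, 1))  // RDD[(String, Int)] per batch
```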
Input DStreams and Receivers
Input DStreams are DStreams representing the stream of input data received from streaming sources.
Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark's memory for processing.
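Because each receiver runs as a long-lived task, it occupies a core; when running locally the master must be local[n] with n greater than the number of receivers. A sketch (host and port are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one core for the socket receiver, at least one for processing.
val conf = new SparkConf().setAppName("ReceiverDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// socketTextStream creates a receiver that stores incoming lines
// in Spark's memory (MEMORY_AND_DISK_SER_2 by default).
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
```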
Transformations
Output Operations
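The two headings above can be illustrated together: transformations (flatMap, map, reduceByKey) lazily define new DStreams, while output operations (print, foreachRDD) are what actually trigger execution for each batch. A sketch with assumed host, port, and names:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("TransformAndOutput").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

// Transformations: lazily build new DStreams, batch by batch.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Output operations: trigger the computation on each batch's RDD.
counts.print()                // prints the first elements of every batch
counts.foreachRDD { rdd =>    // arbitrary per-RDD side effects
  println(s"batch size: ${rdd.count()}")
}
```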
Hands-on Examples
Example 1: Processing socket data with Spark Streaming
Error encountered: java.lang.NoClassDefFoundError: net/jpountz/util/SafeUtils
Search the Maven repository for the missing class: net.jpountz.util.SafeUtils ships in the lz4 compression library (artifact net.jpountz.lz4:lz4), so adding that dependency to the project should resolve the error.
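A minimal sketch of Example 1 (the object name, host, port, and batch interval are assumptions; feed it input with `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver needs its own core.
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Receive lines from the socket, split into words, count per batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```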
Example 2: Processing HDFS file data with Spark Streaming
Notes on using a file system as the source (from the official documentation):
- All files must be in the same data format.
- Files must be created in the monitored directory by atomically moving or renaming them into it; once moved, they must not be changed.
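A minimal sketch of Example 2 (the object name and HDFS directory path are assumptions); textFileStream monitors the directory and processes files that appear in it:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileWordCount {
  def main(args: Array[String]): Unit = {
    // File streams use no receiver, so a single local core is enough.
    val conf = new SparkConf().setAppName("FileWordCount").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Picks up files newly moved into the directory; existing files are ignored.
    val lines = ssc.textFileStream("hdfs://namenode:8020/streaming/input/")
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```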