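1 Basic concepts of Spark Streaming
Spark Streaming is an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from sources such as Kafka, Flume, Twitter, ZeroMQ, and Kinesis, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.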
2 How Spark Streaming works
C: Each block is processed by its own Spark job, and the final results are likewise returned as multiple blocks.
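3 The Spark Streaming processing model
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. A DStream can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level functions to other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Spark Streaming programs can be written in Scala, Java, or Python.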
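To make "a DStream is a sequence of RDDs" concrete, here is a minimal plain-Scala model of discretization. It does not use Spark: the `Event` type, the `discretize` helper, and the sample data are hypothetical, and each inner `Seq` merely stands in for the RDD of one batch interval.

```scala
object DiscretizeSketch {
  // Hypothetical event: (arrival time in milliseconds, payload).
  type Event = (Long, String)

  // Cut a stream of timestamped events into consecutive batches of `batchMs`.
  // Batch i holds the events with i * batchMs <= time < (i + 1) * batchMs.
  // Empty batches are kept, mirroring a batch interval in which nothing arrived.
  // Assumes non-negative timestamps.
  def discretize(events: Seq[Event], batchMs: Long): Seq[Seq[String]] = {
    require(batchMs > 0, "batch interval must be positive")
    if (events.isEmpty) Seq.empty
    else {
      val lastBatch = events.map(_._1).max / batchMs
      (0L to lastBatch).map { i =>
        events.collect { case (t, payload) if t / batchMs == i => payload }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq((100L, "a"), (900L, "b"), (1200L, "c"), (3500L, "d"))
    discretize(events, 1000L).zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.mkString(", ")}")
    }
  }
}
```

Every transformation applied to a DStream is then applied batch by batch to these per-interval collections.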
4 A simple Spark Streaming example
A single SparkContext can be reused to create multiple StreamingContexts, provided that the previous StreamingContext is stopped (without stopping the SparkContext) before the next one is created.
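The listing for this section did not survive, so the following is only a sketch in the style of the usual socket word count, not the author's original code. The host `localhost`, port `9999`, the application name, and the 30-second run time are assumptions chosen for illustration. It also shows the point above: the StreamingContext is stopped while the SparkContext is kept alive for reuse.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("NetworkWordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // One batch (one RDD) every second.
    val ssc = new StreamingContext(sc, Seconds(1))

    // Assumed source: lines of text arriving on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    // Run for about 30 seconds (an arbitrary choice for this sketch).
    ssc.awaitTerminationOrTimeout(30000L)

    // Stop only the StreamingContext; sc stays usable for a new StreamingContext.
    ssc.stop(stopSparkContext = false, stopGracefully = true)
    sc.stop()
  }
}
```

To try it, something has to be listening on the assumed port first, for example a netcat session started with `nc -lk 9999`.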
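5 Input DStreams
Input DStreams are DStreams that represent the stream of raw input data received from a data source. Spark Streaming provides two categories of sources:
- Basic sources: sources directly available in the StreamingContext API, such as file systems and socket connections.
- Advanced sources: sources such as Kafka, Flume, Kinesis, and Twitter.
Note that if you want to receive multiple data streams in parallel in one streaming application, you can create multiple input DStreams (this is covered in the performance tuning section). Doing so creates multiple receivers that receive the streams simultaneously. However, a receiver runs as a long-running task inside a Spark worker or executor, so it occupies one of the cores allocated to the Spark Streaming application. It is therefore important to allocate enough cores (or threads, when running locally) to the application, both to run the receivers and to process the data they receive.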
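The core-allocation rule can be written down as a small check. This is a hypothetical helper, not part of Spark: `leavesProcessingThread` and the regular expression are assumptions for illustration, and only master URLs of the form `local[N]` are handled.

```scala
object ReceiverCoresSketch {
  // Matches master URLs of the form "local[N]"; other forms are not handled here.
  private val LocalN = """local\[(\d+)\]""".r

  // Some(true)  if local[N] leaves at least one thread for processing after
  //             every receiver has taken one,
  // Some(false) if the receivers would use up all N threads,
  // None        if the master URL is not of the local[N] form.
  def leavesProcessingThread(master: String, receivers: Int): Option[Boolean] =
    master match {
      case LocalN(n) => Some(n.toInt > receivers)
      case _         => None
    }

  def main(args: Array[String]): Unit = {
    println(leavesProcessingThread("local[2]", 1))
    println(leavesProcessingThread("local[1]", 1))
  }
}
```

With a single socket receiver, `local[1]` leaves nothing to process the data, which is why the examples in this article use `local[2]`.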
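6 Transformations on DStreams
As with RDDs, transformations allow the data from an input DStream to be modified. DStreams support many of the transformations available on ordinary RDDs.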
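Transformation | Meaning |
---|---|
map(func) | Returns a new DStream by passing each element of the source DStream through the function func |
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items |
filter(func) | Returns a new DStream containing only the items of the source DStream for which func returns true |
repartition(numPartitions) | Changes the level of parallelism of this DStream by creating more or fewer partitions |
union(otherStream) | Returns a new DStream containing the union of the elements in the source DStream and otherStream |
count() | Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream |
reduce(func) | Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream with func. The function should be associative so that it can be computed in parallel |
countByValue() | When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the value of each key is its frequency in each RDD of the source DStream |
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated using the given reduce function. Note: by default this uses Spark's default number of parallel tasks to do the grouping; pass the optional numTasks argument to set a different number of tasks |
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs |
cogroup(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples |
transform(func) | Returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations on the DStream |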
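The key point of the table is that every transformation works on each RDD of the stream separately. The sketch below models that in plain Scala, without Spark: `Batches` is a hypothetical stand-in for a DStream (one inner `Seq` per batch), and the helper functions only imitate the per-batch semantics of the operators with the same names.

```scala
object BatchTransformSketch {
  // Stand-in for a DStream: one inner Seq per batch (per RDD).
  type Batches[T] = Seq[Seq[T]]

  // Each operator is applied to every batch independently.
  def flatMap[T, U](s: Batches[T])(f: T => Seq[U]): Batches[U] = s.map(_.flatMap(f))
  def filter[T](s: Batches[T])(p: T => Boolean): Batches[T] = s.map(_.filter(p))

  // count() yields one single-element batch per input batch.
  def count[T](s: Batches[T]): Batches[Long] = s.map(b => Seq(b.size.toLong))

  // reduceByKey aggregates values per key, again within each batch only.
  def reduceByKey[K, V](s: Batches[(K, V)])(f: (V, V) => V): Batches[(K, V)] =
    s.map(b => b.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }.toSeq)

  def main(args: Array[String]): Unit = {
    val lines: Batches[String] = Seq(Seq("a b a"), Seq("b"), Seq())
    val words = flatMap(lines)(_.split(" ").toList)
    val pairs = words.map(_.map(w => (w, 1)))
    println(count(words))
    println(reduceByKey(pairs)(_ + _))
  }
}
```

Note that the counts for "b" in the first and second batch are never combined: combining data across batches is what the window operations in the next section are for.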
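7 Window Operations
Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. The figure below illustrates the sliding window.
As the figure shows, each time the window slides over the source DStream, the source RDDs that fall within the window are combined and operated on to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This shows that any window operation needs two parameters:
- window length: the duration of the window (3 time units in the figure).
- sliding interval: the interval at which the window operation is performed (2 time units in the figure).
Both parameters must be multiples of the batch interval of the source DStream.
Let us illustrate window operations with an example. Suppose you want to extend the earlier WordCount example to count the words in the last 30 seconds of data, with a new batch arriving every 10 seconds. To do this, the reduceByKey operation has to be applied to the (word, 1) pairs of the DStream over the last 30 seconds of data, which is what the reduceByKeyAndWindow operation does. Some common window operations are described below. All of them take the two parameters above: the window length and the sliding interval.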
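The interaction of window length and sliding interval can be sketched in plain Scala, without Spark. In this hypothetical model both parameters are expressed in batch intervals (so the multiple-of-the-batch-interval rule holds by construction), batches are numbered from 0, and `windows` and `windowedCounts` are illustrative names, not Spark APIs.

```scala
object WindowSketch {
  // For each firing of the window, the indices of the batches it covers.
  // The window fires after every `slide` batches and looks back `length` batches;
  // the earliest windows are shorter because fewer batches exist yet.
  def windows(numBatches: Int, length: Int, slide: Int): Seq[Seq[Int]] = {
    require(length > 0 && slide > 0, "length and slide must be positive")
    (slide to numBatches by slide).map { end =>
      (math.max(0, end - length) until end).toList
    }
  }

  // Word count over each window: merge the batches in the window, then count.
  def windowedCounts(batches: Seq[Seq[String]], length: Int, slide: Int): Seq[Map[String, Int]] =
    windows(batches.size, length, slide).map { idx =>
      idx.flatMap(i => batches(i)).groupBy(identity).map { case (w, ws) => (w, ws.size) }
    }

  def main(args: Array[String]): Unit = {
    // The figure's setting: length 3, slide 2.
    println(windows(6, 3, 2))
    // The text's setting: 30 s window over 10 s batches, sliding every batch.
    println(windowedCounts(Seq(Seq("a", "b"), Seq("a"), Seq("c"), Seq("a")), 3, 1))
  }
}
```

With length 3 and slide 2, consecutive windows share exactly one batch, which matches the overlap described for the figure.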
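-------------------- Experimental data --------------------
(One item is drawn at random from this data every second and sent as input on the socket side.) The socket-side data simulator and the test programs are available through the Baidu Cloud link in the appendix.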
    // Input: the window length (implicitly, the slide interval is this DStream's own slide interval)
    // Output: a DStream containing all the elements that fall within the sliding window
    def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

    // Input: the window length and the slide interval
    // Output: a DStream containing all the elements that fall within the sliding window
    def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
      new WindowedDStream(this, windowDuration, slideDuration)
    }

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object windowOnStreaming {
      def main(args: Array[String]): Unit = {
        // Test of the Streaming window operation.
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the Window operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // Set the checkpoint directory.
        ssc.checkpoint("/Res")

        // Read the streaming data from the socket.
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.map(x => (x, 1))
        // def window(windowDuration: Duration): DStream[T]
        val getedData1 = data.window(Seconds(6))
        println("windowDuration only : ")
        getedData1.print()

        // The two-argument form works the same way:
        // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
        // Both durations must be multiples of the 2-second batch interval.
        //val getedData2 = data.window(Seconds(8), Seconds(4))
        //println("Duration and SlideDuration : ")
        //getedData2.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
-------------------- reduceByKeyAndWindow operation --------------------
    /**
     * Returns a new DStream by applying `reduceByKey` over each sliding window. This is similar
     * to `DStream.reduceByKey()`, but the function is applied only to the data in the window.
     * Hash partitioning with Spark's default number of partitions is used.
     * The slide interval defaults to one batch interval, and the number of partitions is the
     * RDD default (it depends on the cores of the Spark cluster).
     * @param reduceFunc     associative reduce function
     * @param windowDuration width of the window
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
    }

    /**
     * Returns a new DStream by applying `reduceByKey` over each sliding window. This is similar
     * to `DStream.reduceByKey()`, but the function is applied only to the data in the window.
     * Hash partitioning with Spark's default number of partitions is used.
     * @param reduceFunc     associative reduce function
     * @param windowDuration width of the window
     * @param slideDuration  sliding interval of the window
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
    }

    /**
     * Returns a new DStream by applying `reduceByKey` over each sliding window. This is similar
     * to `DStream.reduceByKey()`, but the function is applied only to the data in the window.
     * Hash partitioning with `numPartitions` partitions is used.
     * @param reduceFunc     associative reduce function
     * @param windowDuration width of the window
     * @param slideDuration  sliding interval of the window
     * @param numPartitions  number of partitions of each RDD
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration,
        numPartitions: Int
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
        defaultPartitioner(numPartitions))
    }

    /**
     * Returns a new DStream by applying `reduceByKey` over each sliding window. This is similar
     * to `DStream.reduceByKey()`, but the function is applied only to the data in the window.
     * @param reduceFunc     associative reduce function
     * @param windowDuration width of the window
     * @param slideDuration  sliding interval of the window
     * @param partitioner    partitioner that controls the partitioning of each RDD
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration,
        partitioner: Partitioner
      ): DStream[(K, V)] = ssc.withScope {
      // Reduce within each batch, gather the batches of the window, then reduce again.
      self.reduceByKey(reduceFunc, partitioner)
        .window(windowDuration, slideDuration)
        .reduceByKey(reduceFunc, partitioner)
    }

    /**
     * Returns a new DStream by applying `reduceByKey` over each sliding window, while
     * applying `invReduceFunc` to the old RDDs that leave the window.
     * Hash partitioning with Spark's default number of partitions is used.
     * @param reduceFunc     associative reduce function
     * @param invReduceFunc  inverse reduce function; such that for all y, invertible x:
     *                       `invReduceFunc(reduceFunc(x, y), x) = y`
     * @param windowDuration width of the window
     * @param slideDuration  sliding interval of the window
     * @param numPartitions  number of partitions of each RDD
     * @param filterFunc     optional function that keeps only the key-value pairs
     *                       satisfying a condition
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration = self.slideDuration,
        numPartitions: Int = ssc.sc.defaultParallelism,
        filterFunc: ((K, V)) => Boolean = null
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(
        reduceFunc, invReduceFunc, windowDuration,
        slideDuration, defaultPartitioner(numPartitions), filterFunc
      )
    }

    /**
     * Returns a new DStream by applying `reduceByKey` over each sliding window, while
     * applying `invReduceFunc` to the old RDDs that leave the window.
     * @param reduceFunc     associative reduce function
     * @param invReduceFunc  inverse reduce function; such that for all y, invertible x:
     *                       `invReduceFunc(reduceFunc(x, y), x) = y`
     * @param windowDuration width of the window
     * @param slideDuration  sliding interval of the window
     * @param partitioner    partitioner that controls the partitioning of each RDD
     * @param filterFunc     optional function that keeps only the key-value pairs
     *                       satisfying a condition
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration,
        partitioner: Partitioner,
        filterFunc: ((K, V)) => Boolean
      ): DStream[(K, V)] = ssc.withScope {
      val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
      val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
      val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
      new ReducedWindowedDStream[K, V](
        self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
        windowDuration, slideDuration, partitioner
      )
    }

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object reduceByWindowOnStreaming {
      def main(args: Array[String]): Unit = {
        // Test of the Streaming reduceByKeyAndWindow operation.
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // Set the checkpoint directory (required by the invReduceFunc variants).
        ssc.checkpoint("/Res")

        // Read the streaming data from the socket.
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.map(x => (x, 1))
        // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
        //val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

        // Experiment only: (a, b) => a + b * 0 is not a true inverse of _ + _.
        // It ignores the data leaving the window, so the counts never decrease.
        val getedData2 = data.reduceByKeyAndWindow(_ + _,
          (a, b) => a + b * 0,
          Seconds(6), Seconds(2))

        // Window length and slide interval must both be multiples of the
        // 2-second batch interval, so 8 seconds is used here rather than 9.
        val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(8), Seconds(6))

        println("reduceByKeyAndWindow : ")
        getedData1.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
How the invReduceFunc variant works is best explained by looking inside the ReducedWindowedDStream class:
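The idea behind the incremental variant can be modelled in plain Scala. This is a simplified sketch, not the code of `ReducedWindowedDStream`: it works on one key only, takes per-batch values that are already reduced, measures window length and slide in batch intervals, and assumes 0 is the identity of `reduceFunc`. All names are hypothetical.

```scala
object IncrementalWindowSketch {
  // Baseline: recompute every window from scratch by summing its batches.
  def recompute(batchSums: Seq[Int], length: Int, slide: Int): Seq[Int] =
    (slide to batchSums.size by slide).map { end =>
      batchSums.slice(math.max(0, end - length), end).sum
    }

  // Incremental: start from the previous window's value, reduce in the
  // batches that entered the window, then "inverse reduce" the ones that left.
  def incremental(batchSums: Seq[Int], length: Int, slide: Int)(
      reduceFunc: (Int, Int) => Int,
      invReduceFunc: (Int, Int) => Int): Seq[Int] = {
    require(slide >= 1 && slide <= length, "sketch assumes 1 <= slide <= length")
    var previous = 0 // assumed identity of reduceFunc
    (slide to batchSums.size by slide).map { end =>
      val entering = batchSums.slice(end - slide, end)
      val leaving = batchSums.slice(math.max(0, end - slide - length), math.max(0, end - length))
      val added = entering.foldLeft(previous)(reduceFunc)
      previous = leaving.foldLeft(added)(invReduceFunc)
      previous
    }
  }

  def main(args: Array[String]): Unit = {
    val sums = Seq(1, 2, 3, 4, 5, 6)
    println(recompute(sums, 3, 1))
    println(incremental(sums, 3, 1)(_ + _, _ - _))
    // A function that is not a real inverse: nothing is ever removed.
    println(incremental(sums, 3, 1)(_ + _, (a, b) => a + b * 0))
  }
}
```

With a correct inverse (`_ - _`) the incremental result equals the recomputed one while touching only the entering and leaving batches. With `(a, b) => a + b * 0`, as in the experiment in the program above, old data is never subtracted, so the value keeps growing.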
-------------------- reduceByWindow operation --------------------
    // Input: reduceFunc, the window length, and the slide interval
    // Output: the elements of the window are taken two at a time, from left to
    // right, as (a, b) and passed to reduceFunc
    def reduceByWindow(
        reduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T] = ssc.withScope {
      this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
    }

    /**
     * Input: reduceFunc, invReduceFunc, the window length, and the slide interval
     */
    def reduceByWindow(
        reduceFunc: (T, T) => T,
        invReduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T] = ssc.withScope {
      this.map((1, _))
        .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
        .map(_._2)
    }

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object reduceByWindow {
      def main(args: Array[String]): Unit = {
        // Test of the Streaming reduceByWindow operation.
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // Set the checkpoint directory.
        ssc.checkpoint("/Res")

        // Read the streaming data from the socket.
        val socketStreaming = ssc.socketTextStream("master", 9999)

        // The elements are strings, so _ + _ concatenates them. Concatenation
        // has no inverse, so the invReduceFunc variant, for example
        // reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2)), would give
        // wrong results here; the plain variant is used instead.
        val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))

        println("reduceByWindow: concatenate the elements in each window")
        data.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
-------------------- countByWindow operation --------------------
    /**
     * Takes the window length and the slide interval and returns the number of
     * elements in the window.
     * @param windowDuration width of the window
     * @param slideDuration  sliding interval of the window
     */
    def countByWindow(
        windowDuration: Duration,
        slideDuration: Duration): DStream[Long] = ssc.withScope {
      // Map every element to 1L, then sum the ones with reduceByWindow.
      this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
    }
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object countByWindow {
      def main(args: Array[String]): Unit = {
        // Test of the Streaming countByWindow operation.
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // Set the checkpoint directory (countByWindow uses an inverse reduce).
        ssc.checkpoint("/Res")

        // Read the streaming data from the socket.
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))

        println("countByWindow: count the number of elements")
        data.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
-------------------- countByValueAndWindow operation --------------------
    /**
     * Takes the window length, the slide interval, and the number of RDD partitions
     * (which defaults to the default parallelism).
     * @param windowDuration width of the window; must be a multiple of this DStream's
     *                       batching interval
     * @param slideDuration  sliding interval of the window (i.e., the interval after which
     *                       the new DStream will generate RDDs); must be a multiple of this
     *                       DStream's batching interval
     * @param numPartitions  number of partitions of each RDD in the new DStream.
     */
    def countByValueAndWindow(
        windowDuration: Duration,
        slideDuration: Duration,
        numPartitions: Int = ssc.sc.defaultParallelism)
        (implicit ord: Ordering[T] = null)
        : DStream[(T, Long)] = ssc.withScope {
      this.map((_, 1L)).reduceByKeyAndWindow(
        (x: Long, y: Long) => x + y,
        (x: Long, y: Long) => x - y,
        windowDuration,
        slideDuration,
        numPartitions,
        (x: (T, Long)) => x._2 != 0L)
    }
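The last argument in the definition above, the filter `x._2 != 0L`, matters because the counts are maintained incrementally. The plain-Scala sketch below models one slide of the window for illustration only: `slide` is a hypothetical helper, not a Spark API, and it works on in-memory collections rather than RDDs.

```scala
object CountByValueWindowSketch {
  // One slide of the window: add 1 for every value that entered, subtract 1
  // for every value that left, then drop the values whose count fell to zero.
  def slide[T](previous: Map[T, Long], entering: Seq[T], leaving: Seq[T]): Map[T, Long] = {
    val added = entering.foldLeft(previous) { (m, v) =>
      m.updated(v, m.getOrElse(v, 0L) + 1L)
    }
    val removed = leaving.foldLeft(added) { (m, v) =>
      m.updated(v, m.getOrElse(v, 0L) - 1L)
    }
    removed.filter { case (_, n) => n != 0L }
  }

  def main(args: Array[String]): Unit = {
    val previous = Map("a" -> 2L, "b" -> 1L)
    // "c" enters the window; one "a" and the only "b" leave it.
    println(slide(previous, Seq("c"), Seq("b", "a")))
  }
}
```

Without the final filter, "b" would stay in the result with a count of 0 after its last occurrence left the window.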

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object countByValueAndWindow {
      def main(args: Array[String]): Unit = {
        // Test of the Streaming countByValueAndWindow operation.
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // Set the checkpoint directory (countByValueAndWindow uses an inverse reduce).
        ssc.checkpoint("/Res")

        // Read the streaming data from the socket.
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))

        println("countByValueAndWindow: count the occurrences of each value")
        data.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }