Transformations on DStreams
Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are listed below.
Transformation | Meaning |
---|---|
map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. (Not demonstrated in the subsections below; see the sketch after this table.) |
union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. (Both streams must have the same element type.) |
count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. Its signature is (K, K) => K. |
countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism ) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. (Similar to an RDD join, but only RDDs belonging to the same batch can be joined, which limits its usefulness.) |
cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
updateStateByKey(func) | Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
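repartition() is the only operator in the table above that is not demonstrated in the subsections that follow. Here is a minimal sketch in the same queueStream style as the later examples (the application name and the target partition count of 10 are illustrative choices, not values from the original article); the imports at the top are the ones assumed by all examples in this section.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val conf = new SparkConf().setMaster("local[5]").setAppName("repartitionOperation")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue = new mutable.Queue[RDD[Int]]()
ssc.queueStream(queue)
  .repartition(10)  // only changes the number of partitions of each batch RDD, not the data
  .foreachRDD(rdd => println(s"partitions = ${rdd.getNumPartitions}"))
ssc.start()
for (i <- 0 to 10) {
  queue += ssc.sparkContext.makeRDD(List(1, 2, 3, 4, 5))
  Thread.sleep(5000)
}
ssc.stop()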
map(func)
Returns a new DStream by passing each element of the source DStream through the function func.
val conf = new SparkConf().setMaster("local[5]").setAppName("mapOperation")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue = new mutable.Queue[RDD[Int]]()
ssc.queueStream(queue)
  .map(item => item * 2)
  .print()
ssc.start()
for (i <- 0 to 10) {
  val value: RDD[Int] = ssc.sparkContext.makeRDD(List(1, 2, 3, 4, 5))
  queue += value
  Thread.sleep(5000)
}
ssc.stop()
flatMap(func)
Similar to map, but each input item can be mapped to zero or more output items.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue = new mutable.Queue[RDD[String]]()
ssc.queueStream(queue)
  .flatMap(item => item.split("\\W+"))
  .print()
ssc.start()
for (i <- 0 to 10) {
  val value: RDD[String] = ssc.sparkContext.makeRDD(List("this is a demo"))
  queue += value
  Thread.sleep(5000)
}
ssc.stop()
filter(func)
Returns a new DStream by selecting only the records of the source DStream on which func returns true.
val conf = new SparkConf().setMaster("local[5]").setAppName("filterOperation")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue = new mutable.Queue[RDD[Int]]()
ssc.queueStream(queue)
  .filter(item => item % 2 == 0)
  .print()
ssc.start()
for (i <- 0 to 10) {
  val value: RDD[Int] = ssc.sparkContext.makeRDD(List(1, 2, 3, 4, 5))
  queue += value
  Thread.sleep(5000)
}
ssc.stop()
union
Returns a new DStream that contains the union of the elements in the source DStream and otherDStream.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue1 = new mutable.Queue[RDD[String]]()
val queue2 = new mutable.Queue[RDD[String]]()
val stream1 = ssc.queueStream(queue1)
val stream2 = ssc.queueStream(queue2)
// stream1 and stream2 must have the same element type
stream1.union(stream2)
  .print()
ssc.start()
for (i <- 0 to 10) {
  val rdd1 = ssc.sparkContext.makeRDD(List("a", "b", "c"))
  val rdd2 = ssc.sparkContext.makeRDD(List("a", "d", "e"))
  queue1 += rdd1
  queue2 += rdd2
  Thread.sleep(1000)
}
ssc.stop()
count()
Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
val conf = new SparkConf()
  .setMaster("local[5]")
  .setAppName("wordCount")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))
ssc.socketTextStream("192.168.239.131", 9999)
  .flatMap(line => line.split(" "))
  .count()
  .print()
ssc.start()
ssc.awaitTermination()
ssc.stop(stopSparkContext = false)
reduce(func)
Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using the function func (which takes two arguments and returns one).
// reuses the StreamingContext (ssc) created in the count() example above
ssc.socketTextStream("192.168.239.131", 9999)
  .flatMap(line => line.split(" "))
  .reduce(_ + "," + _)
  .print()
countByValue()
When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
val conf = new SparkConf()
  .setMaster("local[5]")
  .setAppName("wordCount")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))
ssc.socketTextStream("192.168.239.131", 9999)
  .flatMap(line => line.split(" "))
  .countByValue()
  .print()
ssc.start()
ssc.awaitTermination()
ssc.stop(stopSparkContext = false)
reduceByKey(func, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
ssc.socketTextStream("192.168.239.131", 9999)
  .flatMap(line => line.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()
ssc.start()
ssc.awaitTermination()
ssc.stop(stopSparkContext = false)
join(otherStream, [numTasks])
When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue1 = new mutable.Queue[RDD[(String, String)]]()
val queue2 = new mutable.Queue[RDD[(String, (String, Double))]]()
val stream1 = ssc.queueStream(queue1)
val stream2 = ssc.queueStream(queue2)
stream1.join(stream2)
  .print()
ssc.start()
for (i <- 0 to 10) {
  val rdd1 = ssc.sparkContext.makeRDD(List("1 zhangsan", "2 lisi", "3 wangwu"))
    .map(item => {
      val tokens = item.split("\\W+")
      (tokens(0), tokens(1))
    })
  val rdd2 = ssc.sparkContext.makeRDD(List("1,苹果,9.0", "1,橘子,18.0", "2,机械键盘,15000"))
    .map(item => {
      val splits = item.split(",")
      (splits(0), (splits(1), splits(2).toDouble))
    })
  queue1 += rdd1
  queue2 += rdd2
  Thread.sleep(1000)
}
ssc.stop()
cogroup(otherStream, [numTasks])
When called on DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue1 = new mutable.Queue[RDD[(String, String)]]()
val queue2 = new mutable.Queue[RDD[(String, (String, Double))]]()
val stream1 = ssc.queueStream(queue1)
val stream2 = ssc.queueStream(queue2)
stream1.cogroup(stream2)
  .print()
ssc.start()
for (i <- 0 to 10) {
  val rdd1 = ssc.sparkContext.makeRDD(List("1 zhangsan", "2 lisi", "3 wangwu"))
    .map(item => {
      val tokens = item.split("\\W+")
      (tokens(0), tokens(1))
    })
  val rdd2 = ssc.sparkContext.makeRDD(List("1,苹果,9.0", "1,橘子,18.0", "2,机械键盘,15000"))
    .map(item => {
      val splits = item.split(",")
      (splits(0), (splits(1), splits(2).toDouble))
    })
  queue1 += rdd1
  queue2 += rdd2
  Thread.sleep(1000)
}
ssc.stop()
transform(func)
Returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to perform arbitrary RDD operations on the DStream.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.sparkContext.setLogLevel("FATAL")
val queue1 = new mutable.Queue[RDD[(String, (String, Double))]]()
val orderStream = ssc.queueStream(queue1)
val userRDD = ssc.sparkContext.makeRDD(List("1 zhangsan", "2 lisi", "3 wangwu"))
  .map(item => {
    val tokens = item.split("\\W+")
    (tokens(0), tokens(1))
  })
orderStream.transform(orderRDD => orderRDD.join(userRDD))
  .print()
ssc.start()
for (i <- 0 to 10) {
  val rdd2 = ssc.sparkContext.makeRDD(List("1,苹果,9.0", "1,橘子,18.0", "2,机械键盘,15000"))
    .map(item => {
      val splits = item.split(",")
      (splits(0), (splits(1), splits(2).toDouble))
    })
  queue1 += rdd2
  Thread.sleep(1000)
}
ssc.stop()
updateStateByKey(func)
Returns a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. In every batch, Spark applies the state update function to all existing keys, regardless of whether they have new data in that batch. If the update function returns None, the key-value pair is removed.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("file:///D://checkpoint")
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .updateStateByKey((values, options: Option[Int]) => Some(options.getOrElse(0) + values.sum))
  .print()
ssc.start()
ssc.awaitTermination()
Note: ssc.checkpoint("file:///D://checkpoint") must be enabled, otherwise the system cannot save the state.
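As noted above, returning None from the update function removes a key's state. The following sketch is only an illustration of that rule (it is not from the original example): it drops the state of any key that received no new values in the current batch, and it reuses the ssc created above, where checkpointing is already enabled.

val update = (values: Seq[Int], state: Option[Int]) => {
  if (values.isEmpty) None                      // no new data for this key in this batch: drop its state
  else Some(state.getOrElse(0) + values.sum)    // otherwise keep accumulating the running count
}
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .updateStateByKey(update)
  .print()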
Window Operations
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.
Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated on to produce the RDDs of the windowed DStream. For example, an operation might be applied over the last 3 time units of data and slide by 2 time units. This means that any window operation needs to specify two parameters:
- window length - the duration of the window.
- sliding interval - the interval at which the window operation is performed.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(3), Seconds(2), 3)
  .print()
ssc.start()
ssc.awaitTermination()
Both the window length and the slide interval must be integer multiples of the batchDuration.
Some of the common window operations are as follows.
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. (Not demonstrated in the subsections below; see the sketch after this table.) |
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism ) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow , the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
countByValueAndWindow(windowLength,slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow , the number of reduce tasks is configurable through an optional argument. |
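countByWindow is listed in the table above but not demonstrated below; the following is a minimal sketch in the style of the other examples (the host name, durations, and checkpoint path are placeholders). Because the windowed count is maintained incrementally, like the inverse-function variant of reduceByKeyAndWindow, a checkpoint directory is set here as well.

val conf = new SparkConf().setMaster("local[5]").setAppName("countByWindow")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")       // needed because the count is maintained incrementally
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .countByWindow(Seconds(10), Seconds(3))      // total number of words seen in the last 10s, emitted every 3s
  .print()
ssc.start()
ssc.awaitTermination()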
window
Returns a new DStream which is computed based on windowed batches of the source DStream.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .window(Seconds(10))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()
ssc.start()
ssc.awaitTermination()
reduceByWindow
Returns a new single-element stream, created by aggregating the elements in the stream over a sliding interval using func.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map(_.toString)
  .reduceByWindow((t1: String, t2: String) => t1 + "," + t2,
    Seconds(5),
    Seconds(1))
  .print()
ssc.start()
ssc.awaitTermination()
reduceByKeyAndWindow
When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated, over batches in a sliding window, using the given reduce function func.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .reduceByKeyAndWindow((t1: Int, t2: Int) => t1 + t2,
    Seconds(5),
    Seconds(1), 10)
  .print()
ssc.start()
ssc.awaitTermination()
Note: by default, this uses Spark's default number of parallel tasks (2 in local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
The variant below additionally takes an inverse ("subtract") function, so the reduced value of each window is computed incrementally from the previous window instead of being recomputed from scratch; this requires checkpointing to be enabled.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .reduceByKeyAndWindow(
    (t1: Int, t2: Int) => t1 + t2,
    (v1: Int, v2: Int) => v1 - v2,
    Seconds(5),
    Seconds(1),
    10)
  .print()
ssc.start()
ssc.awaitTermination()
When the inverse function is specified, the values of batches that leave the window are subtracted incrementally, so users may notice that keys whose count has dropped to 0 are still emitted. An optional filter function can be passed as the last argument to discard such entries once they have slid out of the window:
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .reduceByKeyAndWindow(
    (t1: Int, t2: Int) => t1 + t2,
    (v1: Int, v2: Int) => v1 - v2,
    Seconds(5),
    Seconds(1),
    10,
    (t: (String, Int)) => t._2 > 0
  )
  .print()
ssc.start()
ssc.awaitTermination()
countByValueAndWindow
Returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .countByValueAndWindow(Seconds(10), Seconds(3), 10)
  .print()
ssc.start()
ssc.awaitTermination()
Output Operations
Output operations allow the DStream's data to be pushed out to external systems, such as a database or a file system.
Output Operation | Meaning |
---|---|
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. Python API This is called pprint() in the Python API. |
saveAsTextFiles(prefix, [suffix]) | Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”. |
saveAsObjectFiles(prefix, [suffix]) | Save this DStream’s contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”. Python API This is not available in the Python API. |
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream’s contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”. Python API This is not available in the Python API. |
foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
print()
Prints the first ten elements of every batch of data in the DStream, on the driver node running the streaming application.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .countByValueAndWindow(Seconds(10), Seconds(3), 10)
  .print()
ssc.start()
ssc.awaitTermination()
saveAsTextFiles
Saves this DStream's contents as text files. The file name at each batch interval is generated based on the prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .countByValueAndWindow(Seconds(10), Seconds(3), 10)
  .saveAsTextFiles("file:///D:/checkpoints")
ssc.start()
ssc.awaitTermination()
foreachRDD(func)
The most generic output operator: it applies the function func to each RDD generated from the stream. The function should push the data in each RDD to an external system, for example saving the RDD to files or writing it over the network to a database.
Add the Jedis client dependency:
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.9.0</version>
</dependency>
import redis.clients.jedis.JedisPool

val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoint1")
ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\W+"))
  .map((_, 1))
  .reduceByKeyAndWindow(
    (t1: Int, t2: Int) => t1 + t2,
    (v1: Int, v2: Int) => v1 - v2,
    Seconds(5),
    Seconds(1),
    10
  ).foreachRDD(rdd => {
    rdd.foreachPartition(partitionRecords => {
      // create a pool and borrow one connection per partition
      val pool = new JedisPool("CentOS", 6379)
      val jedis = pool.getResource()
      partitionRecords.foreach(record => {
        println(record)
        record match {
          // count dropped to 0: the word has left the window, remove it from the hash
          case r: (String, Int) if r._2 == 0 => jedis.hdel("wordcount", record._1)
          // otherwise write/overwrite the current count
          case r: (String, Int) if r._2 > 0  => jedis.hset("wordcount", record._1, record._2 + "")
        }
      })
      jedis.close()
      pool.close()
    })
  })
ssc.start()
ssc.awaitTermination()
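A possible refinement of the example above (a sketch, not part of the original article): rather than creating and closing a JedisPool for every partition, keep a single lazily initialized pool per executor JVM and only borrow and return connections inside foreachPartition. wordCounts below is a placeholder for the windowed (word, count) DStream built above, and the Redis host "CentOS" is the same assumption as before.

object RedisPool {
  // created once per executor JVM, the first time it is referenced
  lazy val pool = new JedisPool("CentOS", 6379)
}

wordCounts.foreachRDD(rdd => {
  rdd.foreachPartition(partitionRecords => {
    val jedis = RedisPool.pool.getResource()       // borrow one connection per partition
    partitionRecords.foreach { case (word, count) =>
      if (count > 0) jedis.hset("wordcount", word, count.toString)
      else jedis.hdel("wordcount", word)           // count fell to 0: remove the field
    }
    jedis.close()                                  // returns the connection to the pool
  })
})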
Checkpointing
A streaming application must run 24/7, and therefore must be resilient to failures unrelated to the application logic. To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that it can recover from failures. Two types of data are checkpointed:
- Metadata checkpointing: saves the information that defines the streaming computation to fault-tolerant storage such as HDFS. This is used to recover from a failure of the node running the driver of the streaming application (discussed in more detail later). It includes:
  - the cluster configuration
  - the set of DStream operations that define the computation
  - incomplete (queued but not yet finished) batches
- Data checkpointing: saves the generated RDDs to reliable storage. This is necessary for stateful transformations that combine data across multiple batches.
Usage of stateful transformations: if updateStateByKey or reduceByKeyAndWindow (with the inverse function) is used in the application, a checkpoint directory must be provided to allow periodic RDD checkpointing.
val conf = new SparkConf().setMaster("local[5]").setAppName("wordCount")
val checkpoint = "file:///D://checkpoint"
val ssc = StreamingContext.getOrCreate(checkpoint, () => {
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpoint)
  ssc.sparkContext.setLogLevel("FATAL")
  ssc.socketTextStream("CentOS", 9999)
    .flatMap(line => line.split("\\W+"))
    .map((_, 1))
    .reduceByKey(_ + _)
    .updateStateByKey((values, options: Option[Int]) => Some(options.getOrElse(0) + values.sum))
    .checkpoint(Seconds(5))
    .print()
  ssc
})
ssc.start()
ssc.awaitTermination()