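Spark Streaming Stateful Transformation Examples
Spark Streaming has two main kinds of stateful transformations: sliding windows and updateStateByKey().
- Sliding windows
Window operations let you set a window length and a slide interval to track the running state of the stream over a moving time range. A window-based operation works over a span longer than the StreamingContext batch interval: it combines the results of several batches to compute a result for the whole window.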
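To make the window semantics concrete, here is a minimal sketch in plain Scala that does not use Spark at all. It assumes each inner `Seq` stands for the words of one batch, and the names `WindowSketch` and `windowedCounts` are hypothetical. Unlike a real DStream, it does not emit the partial windows that appear before the first full window has filled.

```scala
object WindowSketch {
  // Each Batch stands for the words received in one batch interval (made-up data).
  type Batch = Seq[String]

  /** Word counts for every window of `windowBatches` consecutive batches,
    * advancing by `slideBatches` batches each time. */
  def windowedCounts(batches: Seq[Batch], windowBatches: Int, slideBatches: Int): List[Map[String, Int]] =
    batches
      .sliding(windowBatches, slideBatches)
      .map(window => window.flatten.groupBy(identity).map { case (w, ws) => (w, ws.size) })
      .toList

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b"), Seq("a"), Seq("c"), Seq("a", "c"))
    // A window of 3 batches sliding by 1 batch corresponds to a
    // 30 s window with a 10 s slide when the batch interval is 10 s.
    windowedCounts(batches, 3, 1).foreach(println)
  }
}
```

With a 10-second batch interval, a window of 3 batches and a slide of 1 batch mirrors the 30-second window and 10-second slide used in the first case.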
- Case 1
Every 10 seconds, run a word count over the previous 30 seconds of data.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Every 10 seconds, compute the word count over the previous 30 seconds
object SparkStreamingStudy_Window {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Batch interval: 10 seconds
    val ssc = new StreamingContext(conf, Seconds(10))
    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("10.21.13.181", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair
    val pairs = words.map(word => (word, 1))
    // Window length 30 s, slide interval 10 s: every 10 seconds, reduce over the previous 30 seconds
    val wordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
    // ssc.stop()
  }
}
- Case 2
Every 5 seconds, sum the data received over the previous day (the input values are integers, and they are added together).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkStreamingStudy_window1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Batch interval: 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))
    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("10.21.13.181", 9999)
    // Split each line into tokens, dropping empty tokens produced by repeated spaces
    val words = lines.flatMap(_.split(" ")).filter(_.trim.nonEmpty)
    // Give every number the same key so that they are all reduced together
    // (toInt throws NumberFormatException if a token is not an integer)
    val pairs = words.map(word => ("Sum", word.trim.toInt))
    // Window length 24 hours, slide interval 5 s: every 5 seconds, sum the previous day's data.
    // The window length and the slide interval must both be multiples of the batch interval.
    val wordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(86400), Seconds(5))
    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()
    wordCounts.foreachRDD(rdd => {
      // Convert RDD[(String, Int)] to RDD[Int], keeping only the sum
      val rddSum = rdd.map(_._2)
      // first() throws on an empty RDD (no data in the window yet), so use take(1) instead
      rddSum.take(1).foreach(println)
    })
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
    // ssc.stop()
  }
}
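With a 24-hour window and a 5-second slide, consecutive windows overlap almost entirely, so recomputing each sum from scratch repeats a lot of work. The plain-Scala sketch below (no Spark involved; `IncrementalWindowSum` and its methods are hypothetical names) compares a from-scratch sum with an incremental one that adds the batch entering the window and subtracts the batch leaving it. Spark has an overload of `reduceByKeyAndWindow` that takes an inverse reduce function for this purpose and requires checkpointing; this sketch only illustrates the arithmetic and is not how the case above is written.

```scala
object IncrementalWindowSum {
  /** Sum of each window, recomputed from scratch at every step.
    * Position i covers the batches from i - windowBatches + 1 to i. */
  def naive(batchSums: Seq[Int], windowBatches: Int): Seq[Int] =
    batchSums.indices.map { i =>
      batchSums.slice(math.max(0, i - windowBatches + 1), i + 1).sum
    }

  /** The same sums, but each step adds the entering batch
    * and subtracts the batch that has just left the window. */
  def incremental(batchSums: Seq[Int], windowBatches: Int): Seq[Int] =
    batchSums.indices.scanLeft(0) { (prev, i) =>
      val entering = batchSums(i)
      val leaving = if (i >= windowBatches) batchSums(i - windowBatches) else 0
      prev + entering - leaving
    }.tail // drop the initial 0 seed produced by scanLeft

  def main(args: Array[String]): Unit = {
    val batchSums = Seq(1, 2, 3, 4, 5) // made-up per-batch sums
    println(naive(batchSums, 3))
    println(incremental(batchSums, 3))
  }
}
```

Both methods return the same sequence; the incremental version does a constant amount of work per step regardless of the window length.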
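- updateStateByKey
The updateStateByKey primitive keeps a running history. Sometimes we need to maintain state across batches in a DStream (for example, a cumulative word count in a streaming job). For this, updateStateByKey() gives access to a state variable for DStreams of key-value pairs. Given a DStream of (key, event) pairs and a function that specifies how to update each key's state from new events, it builds a new DStream whose contents are (key, state) pairs.
The result of updateStateByKey() is a new DStream whose sequence of RDDs consists of the (key, state) pairs for each time interval.
updateStateByKey lets you maintain arbitrary state while continuously updating it with new information. To use it, take two steps:
(1) Define the state. The state can be of any data type.
(2) Define the state update function, which specifies how to update the state from the previous state and the new values from the input stream.
updateStateByKey requires a checkpoint directory to be configured, because the state is saved through checkpoints.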
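The two steps above can be sketched in plain Scala without Spark. This is a simplified model under stated assumptions: `UpdateStateSketch` and `step` are hypothetical names, the state is an `Int` count per key, and each batch is a plain `Seq` of (key, value) pairs. It mimics how the update function is applied to every key that has either new values or existing state.

```scala
object UpdateStateSketch {
  // Same shape as an update function for updateStateByKey:
  // the new values for a key in this batch, plus the previous state if any.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  /** Apply one batch of (key, value) pairs to the current state. */
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Keys with existing state are visited even when the batch has no new values for them;
    // a key whose update returns None would be dropped from the state.
    (state.keySet ++ grouped.keySet).flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a" -> 1, "b" -> 1), Seq("a" -> 1), Seq.empty[(String, Int)])
    // Fold the batches into successive states, dropping the empty initial state
    val states = batches.scanLeft(Map.empty[String, Int])(step).tail
    states.foreach(println)
  }
}
```

In this model the state survives an empty batch unchanged, which is why the counts keep accumulating from the start of the run.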
An updated version of the word count:
- Case 1
Every three seconds, compute the cumulative frequency of every word seen so far.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkStreamingStudy_updateStateByKey {
  def main(args: Array[String]): Unit = {
    // State update function: values holds the counts for a word in the current batch,
    // state holds the count accumulated over earlier batches
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(3))
    // updateStateByKey needs a checkpoint directory to save its state
    //ssc.checkpoint("hdfs://hadoop102:9000/streamCheck")
    ssc.checkpoint("E://checkpoint")
    // Create a DStream that will connect to hostname:port, like hadoop102:9999
    val lines = ssc.socketTextStream("10.21.13.181", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    //import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
    // Map each word to a (word, 1) pair
    val pairs = words.map(word => (word, 1))
    // Use updateStateByKey to update the state: total count of each word since the job started
    val stateDstream = pairs.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    //val wordCounts = pairs.reduceByKey(_ + _)
    // Print the first ten elements of each RDD generated in this DStream to the console
    //wordCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
    // ssc.stop()
  }
}
- Case 2
Every three seconds, add the current numbers to the running sum of all earlier numbers.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkStreamingStudy_updateStateByKey1 {
  def main(args: Array[String]): Unit = {
    // State update function: values holds the numbers in the current batch,
    // state holds the sum of all earlier numbers
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(3))
    // updateStateByKey needs a checkpoint directory to save its state
    //ssc.checkpoint("hdfs://hadoop102:9000/streamCheck")
    ssc.checkpoint("E://checkpoint")
    // Create a DStream that will connect to hostname:port, like hadoop102:9999
    val lines = ssc.socketTextStream("10.21.13.181", 9999)
    // Split each line into tokens, dropping empty tokens produced by repeated spaces
    val words = lines.flatMap(_.split(" ")).filter(_.trim.nonEmpty)
    // updateStateByKey works per key, so giving every number the same key yields a running total
    // (toInt throws NumberFormatException if a token is not an integer)
    val pairs = words.map(word => ("SUM", word.trim.toInt))
    // Use updateStateByKey to update the state: sum of all numbers since the job started
    val stateDstream = pairs.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    stateDstream.foreachRDD(rdd => {
      // Convert RDD[(String, Int)] to RDD[Int], keeping only the sum
      val rddSum = rdd.map(_._2)
      // first() throws on an empty RDD (no data received yet), so use take(1) instead
      rddSum.take(1).foreach(println)
    })
    //val wordCounts = pairs.reduceByKey(_ + _)
    // Print the first ten elements of each RDD generated in this DStream to the console
    //wordCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
    // ssc.stop()
  }
}
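Both numeric cases call `toInt` on raw socket input, which throws on anything that is not an integer. A possible way to make the parsing tolerant is sketched below in plain Scala; `SafeIntParsing` and `parseInts` are hypothetical helpers, not part of the examples above, and silently skipping bad tokens is an assumption about the desired behavior.

```scala
import scala.util.Try

object SafeIntParsing {
  /** Split a line on whitespace and keep only the tokens that parse as Int.
    * Tokens that fail to parse (empty strings, words, overflowing numbers) are skipped. */
  def parseInts(line: String): Seq[Int] =
    line.split("\\s+").toSeq.flatMap(token => Try(token.trim.toInt).toOption)

  def main(args: Array[String]): Unit = {
    println(parseInts("1  2 x 3")) // the non-numeric token "x" is dropped
  }
}
```

In the streaming jobs, a function like this could be used inside `flatMap` in place of the split followed by `toInt`, at the cost of hiding malformed input rather than failing on it.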