1 updateStateByKey
The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps:
- Define the state. The state can be of any data type.
- Define the state update function. Specify a function that uses the previous state and the new values to compute the updated state.
In every batch, Spark applies the state update function to all existing keys, whether or not a key received new data in that batch.
The two parameters of the update function are (a small sketch of this contract follows the method signature below):
Seq[V]: the sequence of new values received for the key in the current batch
Option[S]: the state accumulated up to the previous batch
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
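To make the update function's contract concrete, here is a minimal sketch in plain Scala (no Spark involved; the values are made up for illustration):

object UpdateFuncDemo extends App {
  // Same shape as the updateFunc used in the full example below.
  def updateFunc(newValues: Seq[Int], runningState: Option[Int]): Option[Int] =
    Some(newValues.sum + runningState.getOrElse(0))

  println(updateFunc(Seq(1, 1, 1), None))  // Some(3) -- first batch, no previous state
  println(updateFunc(Seq(1, 1), Some(3)))  // Some(5) -- state carried over from the previous batch
  println(updateFunc(Seq(), Some(5)))      // Some(5) -- no new data for the key, state is kept
}

The full example below consumes from Kafka and keeps a running word count per word.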
package com.gc.sparkStreaming.day01.HaveStatusTransform

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Stateful transformation: caches the intermediate results so they can be accumulated across batches.
 */
object upDateStateByKey {
  // Requirement: consume from Kafka and keep a running sum over the input
  // Example: word count
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("upDateStateByKey")
    val streamingContext = new StreamingContext(conf, Seconds(4)) // one batch every 4 seconds
    val group: String = "guochao" // consumer group
    val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092" // Kafka broker list
    val topic: String = "first" // topic
    val kafkaParams = Map(
      ConsumerConfig.GROUP_ID_CONFIG -> group,
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers
    )
    val kafkaInputStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set(topic)
    )
    val wordOne: DStream[(String, Int)] = kafkaInputStream.map(_._2).flatMap(_.split("\\W+")).map((_, 1))
    val dsStream: DStream[(String, Int)] = wordOne.updateStateByKey(updateFunc)
    streamingContext.checkpoint("./checkpoint") // checkpoint directory where the intermediate state is stored
    dsStream.print(100)
    streamingContext.start()
    streamingContext.awaitTermination()
  }

  // State update function: the first parameter is the sequence of new values for the key,
  // the second is the result accumulated in previous batches.
  def updateFunc(newValue: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // add the sum of the new values to the previous state
    val sum: Int = newValue.sum
    val value: Int = runningCount.getOrElse(0)
    Some[Int](sum + value)
  }
}
2 Window operations
Spark Streaming also provides windowed computations, which let you apply a transformation to the data inside a sliding window.
By default a computation only covers the RDDs of a single batch; with a window, it is applied to all the RDDs that fall inside the specified window, and one window can span multiple batches. A window-based operation therefore combines the results of several batches and produces a result over a period longer than the StreamingContext's batch interval.
Each time the window slides over the source DStream, the RDDs that fall within the window are combined, the operation is applied to them, and the results form the RDDs of the windowed DStream.
For example, a window might cover the last 3 time units of data and slide by 2 time units at a time. A window operation therefore needs two parameters:
• window length – the duration of the window (3 in this example)
• slide interval – the interval at which the window operation is performed (2 in this example)
Note: both parameters must be multiples of the source DStream's batch interval.
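The arithmetic below is a minimal sketch of these two constraints; the 4-second batch interval and the 12-second/8-second window are the values assumed in the Kafka examples that follow, not anything fixed by the API:

object WindowArithmetic extends App {
  val batchIntervalSec = 4   // StreamingContext batch interval
  val windowLengthSec  = 12  // must be a multiple of the batch interval
  val slideIntervalSec = 8   // must be a multiple of the batch interval

  require(windowLengthSec % batchIntervalSec == 0, "window length must be a multiple of the batch interval")
  require(slideIntervalSec % batchIntervalSec == 0, "slide interval must be a multiple of the batch interval")

  println(s"each window covers ${windowLengthSec / batchIntervalSec} batches")       // 3
  println(s"the window slides every ${slideIntervalSec / batchIntervalSec} batches") // 2
}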
2.1 reduceByKeyAndWindow
/**
* Return a new DStream by applying `reduceByKey` over a sliding window. This is similar to
* `DStream.reduceByKey()` but applies it over a sliding window. Hash partitioning is used to
* generate the RDDs with Spark's default number of partitions.
* @param reduceFunc associative and commutative reduce function
* @param windowDuration width of the window; must be a multiple of this DStream's
* batching interval
* @param slideDuration sliding interval of the window (i.e., the interval after which
* the new DStream will generate RDDs); must be a multiple of this
* DStream's batching interval
*/
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}
The example below reads a stream from Kafka and does a windowed word count with a window length of 12 seconds and a slide interval of 8 seconds.
package com.gc.sparkStreaming.day02.window

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Windowed word count over a Kafka stream using reduceByKeyAndWindow.
 */
object ReduceByKeyAndWindow {
  // Requirement: consume from Kafka and sum the count of each word within a window
  // Example: word count
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeyAndWindow")
    val streamingContext = new StreamingContext(conf, Seconds(4)) // one batch every 4 seconds
    val group: String = "guochao" // consumer group
    val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092" // Kafka broker list
    val topic: String = "first" // topic
    val kafkaParams = Map(
      ConsumerConfig.GROUP_ID_CONFIG -> group,
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers
    )
    val kafkaInputStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set(topic)
    )
    val wordOne: DStream[(String, Int)] = kafkaInputStream.map(_._2).flatMap(_.split("\\W+")).map((_, 1))
    val resDStream: DStream[(String, Int)] = wordOne.reduceByKeyAndWindow((x: Int, y: Int) => {
      x + y
    }, Seconds(12), Seconds(8))
    // Window length is Seconds(12), slide interval is Seconds(8).
    // (x: Int, y: Int) => {x + y} is the reduce function; the parameter types are given explicitly.
    // Both the window length and the slide interval must be multiples of the batch interval.
    resDStream.print(100)
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
2.2 reduceByKeyAndWindow (with invReduceFunc)
This overload takes an additional invReduceFunc and is more efficient than the version without it, because it reuses the previous window's result instead of recomputing the whole window.
invReduceFunc: (V, V) => V — when the window slides, the old window and the new window overlap, so the overlapping part does not need to be reduced again. The first parameter is the previously reduced window value and the second is the old value that has just slid out of the window and must be "subtracted" from it.
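Before looking at the signature and the full example, here is a minimal sketch in plain Scala (no Spark) of how the incremental update works, assuming the reduce function is addition and the inverse is subtraction; the batch values are made up for illustration:

object InvReduceSketch extends App {
  val reduceFunc    = (x: Int, y: Int) => x + y
  val invReduceFunc = (acc: Int, old: Int) => acc - old // "undo" a previously reduced value

  val oldWindowSum  = 10 // reduced value of the previous window, covering batches [b1, b2, b3]
  val leavingValue  = 4  // contribution of b1, which slides out of the window
  val enteringValue = 7  // contribution of b4, which slides in

  // The new window [b2, b3, b4] is derived from the old result instead of re-reducing every batch:
  val newWindowSum = reduceFunc(invReduceFunc(oldWindowSum, leavingValue), enteringValue)
  println(newWindowSum) // 13
}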
Method signature:
/**
* Return a new DStream by applying incremental `reduceByKey` over a sliding window.
* The reduced value of over a new window is calculated using the old window's reduced value :
* 1. reduce the new values that entered the window (e.g., adding new counts)
*
* 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
*
* This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
* However, it is applicable to only "invertible reduce functions".
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param reduceFunc associative and commutative reduce function
* @param invReduceFunc inverse reduce function; such that for all y, invertible x:
* `invReduceFunc(reduceFunc(x, y), x) = y`
* @param windowDuration width of the window; must be a multiple of this DStream's
* batching interval
* @param slideDuration sliding interval of the window (i.e., the interval after which
* the new DStream will generate RDDs); must be a multiple of this
* DStream's batching interval
* @param filterFunc Optional function to filter expired key-value pairs;
* only pairs that satisfy the function are retained
*/
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
invReduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration = self.slideDuration,
numPartitions: Int = ssc.sc.defaultParallelism,
filterFunc: ((K, V)) => Boolean = null
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(
reduceFunc, invReduceFunc, windowDuration,
slideDuration, defaultPartitioner(numPartitions), filterFunc
)
}
The example below implements the same word count, this time supplying an invReduceFunc.
package com.gc.sparkStreaming.day02.window

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Windowed word count over a Kafka stream using reduceByKeyAndWindow with an inverse reduce function.
 */
object ReduceByKeyAndWindow2 {
  // Requirement: consume from Kafka and sum the count of each word within a window
  // Example: word count
  // Without a checkpoint this fails with "The checkpoint directory has not been set.
  // Please set it by StreamingContext.checkpoint()", so a checkpoint directory is required.
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeyAndWindow2")
    val streamingContext = new StreamingContext(conf, Seconds(4)) // one batch every 4 seconds
    val group: String = "guochao" // consumer group
    val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092" // Kafka broker list
    val topic: String = "first" // topic
    val kafkaParams = Map(
      ConsumerConfig.GROUP_ID_CONFIG -> group,
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers
    )
    val kafkaInputStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set(topic)
    )
    val wordOne: DStream[(String, Int)] = kafkaInputStream.map(_._2).flatMap(_.split("\\W+")).map((_, 1))
    val resDStream: DStream[(String, Int)] = wordOne.reduceByKeyAndWindow(
      (x: Int, y: Int) => {x + y},        // reduce function; the parameter types are given explicitly
      (newValue: Int, oldValue: Int) => { // invReduceFunc: newValue is the previous window's reduced value,
        println(newValue)                 // oldValue is the value that has just slid out of the window
        println(oldValue)
        newValue - oldValue
      },
      Seconds(12),                        // window length
      Seconds(8),                         // slide interval; both must be multiples of the batch interval
      filterFunc = _._2 > 0               // named argument: drop keys whose windowed count has fallen to 0
    )
    streamingContext.checkpoint("checkpoint1")
    resDStream.print(100)
    streamingContext.start()
    streamingContext.awaitTermination()
    // Once the stream has been running longer than the window, keys that received no new data keep
    // showing up with a count of 0, e.g.:
    // (d,0)
    // (b,0)
    // (f,0)
    // (s,0)
    // (gh,0)
    // (a,0)
    // (g,0)
    // The filterFunc passed above removes those entries.
  }
}
2.3 window(windowLength, slideInterval)
Applies the computation to windowed batches of the source DStream and returns a new DStream.
package com.gc.sparkStreaming.day02.window

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Feed input through a network socket, e.g. start a listener with: nc -lk 9999
object window2 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("window2").setMaster("local[2]")
    val sc: StreamingContext = new StreamingContext(conf, Seconds(3))
    val dsStream: DStream[String] = sc.socketTextStream("hadoop102", 9999).window(Seconds(9), Seconds(6))
    val rsDstream: DStream[(String, Int)] = dsStream.flatMap(_.split("\\W+")).map((_, 1)).reduceByKey(_ + _)
    rsDstream.print(100)
    sc.start()
    sc.awaitTermination()
    /* Sample output:
    -------------------------------------------
    Time: 1569327660000 ms
    -------------------------------------------
    -------------------------------------------
    Time: 1569327666000 ms
    -------------------------------------------
    (wangwu,1)
    (lisi,1)
    -------------------------------------------
    Time: 1569327672000 ms
    -------------------------------------------
    (zhangsan,1)
    (wangwu,1)
    (wangermazi,1)
    -------------------------------------------
    Time: 1569327678000 ms
    -------------------------------------------
    (wangermazi,1)
    -------------------------------------------
    Time: 1569327684000 ms
    -------------------------------------------
    */
  }
}
2.4 countByWindow(windowLength, slideInterval)
Returns a DStream in which each element is the number of elements in a sliding window over the source DStream. As the signature below shows, it is implemented with map and an inverse-reducing reduceByWindow, so a checkpoint directory must be set.
/**
* Return a new DStream in which each RDD has a single element generated by counting the number
* of elements in a sliding window over this DStream. Hash partitioning is used to generate
* the RDDs with Spark's default number of partitions.
* @param windowDuration width of the window; must be a multiple of this DStream's
* batching interval
* @param slideDuration sliding interval of the window (i.e., the interval after which
* the new DStream will generate RDDs); must be a multiple of this
* DStream's batching interval
*/
def countByWindow(
windowDuration: Duration,
slideDuration: Duration): DStream[Long] = ssc.withScope {
this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
package com.gc.sparkStreaming.day02.window

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object countByWindow {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("countByWindow").setMaster("local[2]")
    val sc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // a checkpoint directory is required, because countByWindow uses an inverse reduce internally
    sc.checkpoint("./window")
    val dsStream: DStream[String] = sc.socketTextStream("hadoop102", 9999)
    val rsDstream: DStream[(String, Int)] = dsStream.flatMap(_.split("\\W+")).map((_, 1)).reduceByKey(_ + _)
    rsDstream.print(100)
    val countDstream: DStream[Long] = rsDstream.countByWindow(Seconds(9), Seconds(6))
    countDstream.print(100)
    sc.start()
    sc.awaitTermination()
  }
}