Spark Streaming has three time-related parameters:
Window duration (windowDuration): how long a span of data the current window aggregates; must be an integer multiple of the batch duration.
Slide duration (slideDuration): how often the result is updated; must be an integer multiple of the batch duration.
Batch duration (batchDuration): how often a new batch is created. It is unrelated to the business logic and depends only on data volume (a larger volume allows a shorter interval, a smaller volume a longer one), but it must be no larger than the other two durations.
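The relationship between the three durations can be sketched in plain Python, independent of Spark; the function names below are made up for illustration only:

```python
# A minimal, Spark-independent sketch of how the three durations relate:
# batch = batchDuration, window = windowDuration, slide = slideDuration
# (all in seconds). Illustrative only; not Spark's API.

def validate_durations(batch, window, slide):
    """Window and slide must both be integer multiples of the batch interval."""
    if window % batch != 0:
        raise ValueError("windowDuration must be a multiple of batchDuration")
    if slide % batch != 0:
        raise ValueError("slideDuration must be a multiple of batchDuration")

def batches_in_window(window_end, batch, window):
    """Start times of the batches covered by the window ending at window_end."""
    return list(range(window_end - window, window_end, batch))

validate_durations(batch=2, window=8, slide=6)   # the values used below: OK
# A window of 8s ending at t=8, with 2s batches, covers exactly 4 batches:
print(batches_in_window(window_end=8, batch=2, window=8))  # [0, 2, 4, 6]
```

This is why the window and slide durations must be multiples of the batch duration: a window can only be assembled from whole batches.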
Example: Spark Streaming reads data from Kafka and processes it
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
object SparkWindowDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("kafkaDemo").setMaster("local[2]")
    val streamingContext = new StreamingContext(sparkConf, Seconds(2))
    streamingContext.checkpoint("checkpoint") // checkpoint directory; here it sits under the project root
    // Kafka configuration parameters
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.237.100:9092",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup"
    )
    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("sparkKafkaDemo"), kafkaParams)
    )
    // Note: the window length and the slide interval must be integer multiples of the batch interval
    val numStream: DStream[(String, Int)] = kafkaStream
      .flatMap(line => line.value().toString.split("\\s+"))
      .map((_, 1))
      .window(Seconds(8), Seconds(6))
    numStream.print()
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
1.window(windowLength, slideInterval)
Called on a DStream, this operation takes a window length and a slide interval, and returns a new DStream made of the elements that fall inside the current window.
// batch interval 2s, window length 8s, slide interval 6s
val numStream: DStream[(String, Int)] = kafkaStream
  .flatMap(line => line.value().toString.split("\\s+"))
  .map((_, Integer.parseInt(System.currentTimeMillis().toString.substring(5))))
  .window(Seconds(8), Seconds(6))
numStream.print()
Output:
-------------------------------------------
Time: 1608724478000 ms
-------------------------------------------
(1,24478115)
(2,24478115)
(3,24478115)
-------------------------------------------
Time: 1608724484000 ms
-------------------------------------------
(4,24484073)
(5,24484073)
(6,24484073)
(7,24484073)
1. windowDuration (window duration) must be greater than slideDuration (slide duration), otherwise an error is thrown.
2. slideDuration must also be an integer multiple of batchDuration (batch duration).
3. When receiving data from Kafka, windowDuration should be more than twice slideDuration; otherwise only the data received within a single slideDuration is printed.
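The interplay between window length and slide interval behind these notes can be simulated in plain Python; the helper below is an illustrative sketch, not Spark's implementation:

```python
# Simulates the window operator over a stream of batches. Illustrative only:
# batches is a list of per-batch element lists, one entry per batchDuration tick.

def windows(batches, batch, window, slide):
    """Return the element list seen by each window firing."""
    out = []
    w, s = window // batch, slide // batch   # sizes in units of whole batches
    for end in range(s, len(batches) + 1, s):
        start = max(0, end - w)
        out.append([x for b in batches[start:end] for x in b])
    return out

batches = [[1], [2], [3], [4], [5], [6]]     # one element per 1s batch
# window=4s, slide=3s: adjacent windows overlap, so element 3 appears twice
print(windows(batches, 1, 4, 3))  # [[1, 2, 3], [3, 4, 5, 6]]
# window=2s, slide=3s: elements 1 and 4 fall in no window at all,
# which is the data loss a too-small window causes
print(windows(batches, 1, 2, 3))  # [[2, 3], [5, 6]]
```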
The window effect is easier to observe with a socket source, so the tests below all use a socket instead:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
object SparkWindowDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("kafkaDemo").setMaster("local[2]")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))
    streamingContext.checkpoint("checkpoint")
    val stream: ReceiverInputDStream[String] = streamingContext.socketTextStream("192.168.237.100", 7777)
    val numStream: DStream[(String, Int)] = stream
      .flatMap(line => line.toString.split("\\s+"))
      .map((_, Integer.parseInt(System.currentTimeMillis().toString.substring(5))))
      .window(Seconds(8), Seconds(6))
    numStream.print()
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Output (one number entered every 2 seconds):
-------------------------------------------
Time: 1608778314000 ms
-------------------------------------------
(1,78308049)
(2,78314041)
(3,78314043)
(4,78314045)
-------------------------------------------
Time: 1608778320000 ms
-------------------------------------------
(4,78314045)
(5,78320043)
(1,78320047)
(2,78320050)
2.countByWindow(windowLength,slideInterval)
Returns the number of elements in the window of the given length.
Note: requires a checkpoint directory to be set.
val numStream: DStream[Long] = stream
  .flatMap(line => line.toString.split("\\s+"))
  .map((_, Integer.parseInt(System.currentTimeMillis().toString.substring(5))))
  .countByWindow(Seconds(10), Seconds(6))
numStream.print()
Output (one number entered every 2 seconds):
-------------------------------------------
Time: 1608779304000 ms
-------------------------------------------
5
-------------------------------------------
Time: 1608779310000 ms
-------------------------------------------
5
3.countByValueAndWindow(windowLength,slideInterval, [numTasks])
Counts, within the current window, how many times each distinct value occurs.
Note: requires a checkpoint directory to be set.
val numStream: DStream[(String, Long)] = stream
  .flatMap(line => line.toString.split("\\s+"))
  .countByValueAndWindow(Seconds(10), Seconds(6))
numStream.print()
Output (one number entered every 2 seconds):
-------------------------------------------
Time: 1608781714000 ms
-------------------------------------------
(4,1)
(6,1)
(2,1)
(5,1)
(3,1)
-------------------------------------------
Time: 1608781720000 ms
-------------------------------------------
(8,1)
(6,1)
(7,1)
(5,1)
(9,1)
4.reduceByWindow(func, windowLength,slideInterval)
First takes the elements of the window on the calling DStream to form a new DStream, then reduces over those window elements.
val numStream: DStream[String] = kafkaStream
  .flatMap(line => line.value().toString.split("\\s+"))
  .reduceByWindow(_ + ":" + _, Seconds(8), Seconds(2))
numStream.print()
Output (one number entered every 2 seconds):
-------------------------------------------
Time: 1608782036000 ms
-------------------------------------------
1:2:3
-------------------------------------------
Time: 1608782038000 ms
-------------------------------------------
2:3:4
5.reduceByKeyAndWindow(func,windowLength, slideInterval, [numTasks])
Reduces by key over all of the data that falls within the window of the calling DStream. The operation takes an optional parallelism (number of tasks) parameter.
val numStream: DStream[(String, Int)] = kafkaStream
  .flatMap(line => line.value().toString.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(8), Seconds(2))
numStream.print()
Output (one number entered every 2 seconds):
-------------------------------------------
Time: 1608782502000 ms
-------------------------------------------
(0,2)
(1,2)
-------------------------------------------
Time: 1608782504000 ms
-------------------------------------------
(0,2)
(1,2)
When input stops and elements slide out of the window, it shows:
-------------------------------------------
Time: 1608782854000 ms
-------------------------------------------
(0,1)
-------------------------------------------
Time: 1608782856000 ms
-------------------------------------------
The second parameter list adds an inverse reduce function, which removes the values of the batches that leave the window, so each new window can be computed incrementally from the previous one rather than from scratch (this form requires checkpointing):
val numStream: DStream[(String, Int)] = kafkaStream
  .flatMap(line => line.value().toString.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow((x: Int, y: Int) => x + y, (x: Int, y: Int) => x - y, Seconds(8), Seconds(2))
numStream.print()
Output (one number entered every 2 seconds):
-------------------------------------------
Time: 1608782502000 ms
-------------------------------------------
(0,2)
(1,2)
-------------------------------------------
Time: 1608782504000 ms
-------------------------------------------
(0,2)
(1,2)
When input stops, the counts fall to 0 but the keys remain in the output; with the inverse-function form, zero-count keys are kept unless an optional filter function is also supplied:
-------------------------------------------
Time: 1608782502000 ms
-------------------------------------------
(0,0)
(1,0)
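The incremental add/subtract mechanism behind this second form can be sketched in plain Python; the names and structure below are illustrative, not Spark's API, and the sketch assumes the slide is no larger than the window:

```python
# A pure-Python sketch of incremental reduceByKeyAndWindow: on each slide,
# add the batches entering the window and apply the inverse function
# (here: subtraction) to the batches leaving it, instead of re-reducing
# the whole window. Assumes slide <= window. Illustrative only.
from collections import Counter

def incremental_counts(batches, w, s):
    """batches: per-batch lists of words; w, s: window/slide sizes in batches."""
    state = Counter()
    results = []
    for end in range(s, len(batches) + 1, s):
        start = end - w           # index of the first batch still in the window
        for b in batches[end - s:end]:                      # batches entering
            state.update(b)
        for b in batches[max(0, start - s):max(0, start)]:  # batches leaving
            state.subtract(b)
        results.append({k: v for k, v in state.items() if v > 0})
    return results

batches = [["a"], ["a", "b"], ["b"], ["c"]]
print(incremental_counts(batches, w=2, s=1))
# [{'a': 1}, {'a': 2, 'b': 1}, {'a': 1, 'b': 2}, {'b': 1, 'c': 1}]
```

The sketch also shows why checkpointing is required here: the running state carries over from one window to the next, so it must survive failures. The final dict comprehension plays the role of the filter that drops zero-count keys.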