2020.12.23 Class Notes (Window Operations in Spark Streaming)

Spark Streaming has three time-related parameters:

windowDuration (window duration): how much data the current window covers; must be an integer multiple of the batch duration.
slideDuration (slide duration): how often the windowed result is updated; must be an integer multiple of the batch duration.
batchDuration (batch duration): how often a new batch is created. It is unrelated to the business logic and depends only on data volume: set it shorter for high-volume streams and longer for low-volume ones, but it must be smaller than the other two durations.
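The relationship between the three durations can be sketched with plain Scala collections: treating each completed batch as one element, a window covers windowDuration/batchDuration batches and advances slideDuration/batchDuration batches per trigger. The helper below is illustrative, not part of the Spark API:

```scala
object WindowSketch {
  /** Split per-batch results into sliding windows, mirroring
    * window(windowDuration, slideDuration) over batchDuration-sized batches. */
  def windowsOf[A](batches: Seq[A], batchSec: Int, windowSec: Int, slideSec: Int): List[Seq[A]] = {
    require(windowSec % batchSec == 0 && slideSec % batchSec == 0,
      "windowDuration and slideDuration must be multiples of batchDuration")
    batches.sliding(windowSec / batchSec, slideSec / batchSec).toList
  }

  def main(args: Array[String]): Unit = {
    // 12 batches of 2s each; window 8s, slide 6s -> 4-batch windows stepping 3 batches,
    // so consecutive windows overlap by one batch (8s - 6s = 2s = one batch)
    windowsOf((1 to 12).toList, 2, 8, 6).foreach(w => println(w.mkString(",")))
  }
}
```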

Example: Spark Streaming reads data from Kafka and processes it:

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object SparkWindowDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("kafkaDemo").setMaster("local[2]")
    val streamingContext = new StreamingContext(sparkConf, Seconds(2))
    streamingContext.checkpoint("checkpoint")  // checkpoint directory, here placed under the project root

    // Kafka consumer configuration
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.237.100:9092",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup"
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("sparkKafkaDemo"), kafkaParams)
    )
    
    // Note: the window length and the slide interval must both be integer multiples of the batch duration
    val numStream: DStream[(String, Int)] = kafkaStream
      .flatMap(line => line.value().split("\\s+"))
      .map((_, 1))
      .window(Seconds(8), Seconds(6))
    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

1. window(windowLength, slideInterval)

Called on a DStream with a window length and a slide interval; returns a new DStream containing the elements that fall within the current window.

// batch duration 2s, window length 8s, slide interval 6s
val numStream: DStream[(String, Int)] = kafkaStream
  .flatMap(line => line.value().split("\\s+"))
  // tag each word with a truncated timestamp so we can see which batch it arrived in
  .map((_, Integer.parseInt(System.currentTimeMillis().toString.substring(5))))
  .window(Seconds(8), Seconds(6))
numStream.print()

Output:

-------------------------------------------
Time: 1608724478000 ms
-------------------------------------------
(1,24478115)
(2,24478115)
(3,24478115)
-------------------------------------------
Time: 1608724484000 ms
-------------------------------------------
(4,24484073)
(5,24484073)
(6,24484073)
(7,24484073)

1. windowDuration (the window duration) must be greater than slideDuration (the slide duration), otherwise an error is thrown.
2. slideDuration must also be an integer multiple of batchDuration (the batch duration).
3. When receiving data from Kafka, windowDuration should be at least twice slideDuration; otherwise only the data received within one slideDuration is printed.
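These constraints can be checked up front, before the StreamingContext is built. The helper below is a hypothetical sketch working on plain milliseconds, not the Spark Duration type:

```scala
object DurationCheck {
  /** Validate the duration rules above; all arguments are in milliseconds. */
  def validate(batchMs: Long, windowMs: Long, slideMs: Long): Either[String, Unit] =
    if (windowMs % batchMs != 0)
      Left("windowDuration must be an integer multiple of batchDuration")
    else if (slideMs % batchMs != 0)
      Left("slideDuration must be an integer multiple of batchDuration")
    else if (windowMs < slideMs)
      Left("windowDuration must not be smaller than slideDuration")
    else Right(())

  def main(args: Array[String]): Unit = {
    println(validate(2000, 8000, 6000)) // valid: 8s window, 6s slide over 2s batches
    println(validate(2000, 7000, 6000)) // invalid: 7s is not a multiple of 2s
  }
}
```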

The window effect is easier to observe with a socket source, so the tests below all use a socket instead:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object SparkWindowDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("kafkaDemo").setMaster("local[2]")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))
    streamingContext.checkpoint("checkpoint")

    val stream: ReceiverInputDStream[String] = streamingContext.socketTextStream("192.168.237.100", 7777)

    val numStream: DStream[(String, Int)] = stream
      .flatMap(line => line.split("\\s+"))
      .map((_, Integer.parseInt(System.currentTimeMillis().toString.substring(5))))
      .window(Seconds(8), Seconds(6))

    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

Output (entering one number every 2 seconds):

-------------------------------------------
Time: 1608778314000 ms
-------------------------------------------
(1,78308049)
(2,78314041)
(3,78314043)
(4,78314045)
-------------------------------------------
Time: 1608778320000 ms
-------------------------------------------
(4,78314045)
(5,78320043)
(1,78320047)
(2,78320050)

2. countByWindow(windowLength, slideInterval)

Returns the number of elements in the window.
Note: checkpointing must be enabled.

val numStream: DStream[Long] = stream
  .flatMap(line => line.split("\\s+"))
  .map((_, Integer.parseInt(System.currentTimeMillis().toString.substring(5))))
  .countByWindow(Seconds(10), Seconds(6))
numStream.print()

Output (entering one number every 2 seconds):

-------------------------------------------
Time: 1608779304000 ms
-------------------------------------------
5
-------------------------------------------
Time: 1608779310000 ms
-------------------------------------------
5

3. countByValueAndWindow(windowLength, slideInterval, [numTasks])

Counts, within the current window, how many times each distinct value occurs.
Note: checkpointing must be enabled.

val numStream: DStream[(String, Long)] = stream
  .flatMap(line => line.split("\\s+"))
  .countByValueAndWindow(Seconds(10), Seconds(6))
numStream.print()

Output (entering one number every 2 seconds):

-------------------------------------------
Time: 1608781714000 ms
-------------------------------------------
(4,1)
(6,1)
(2,1)
(5,1)
(3,1)
-------------------------------------------
Time: 1608781720000 ms
-------------------------------------------
(8,1)
(6,1)
(7,1)
(5,1)
(9,1)
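Per window, countByValueAndWindow reduces to a groupBy count over the window's elements, which is why every number above appears with a count of 1. A pure-Scala sketch of that per-window semantics (illustrative only):

```scala
object CountByValueSketch {
  /** What countByValueAndWindow computes for the elements of one window:
    * the number of occurrences of each distinct value. */
  def countByValue[A](windowElems: Seq[A]): Map[A, Long] =
    windowElems.groupBy(identity).map { case (v, occ) => v -> occ.size.toLong }

  def main(args: Array[String]): Unit =
    println(countByValue(Seq("4", "6", "2", "5", "3")))  // every value seen once
}
```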

4. reduceByWindow(func, windowLength, slideInterval)

First collects the elements of the current window into a new DStream, then reduces those elements with func.

val numStream: DStream[String] = stream
  .flatMap(line => line.split("\\s+"))
  .reduceByWindow(_ + ":" + _, Seconds(8), Seconds(2))
numStream.print()

Output (entering one number every 2 seconds):

-------------------------------------------
Time: 1608782036000 ms
-------------------------------------------
1:2:3
-------------------------------------------
Time: 1608782038000 ms
-------------------------------------------
2:3:4

5. reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

reduceByKeyAndWindow reduces by key over all of the data that falls within the DStream's window. The operation takes an optional parallelism (number-of-tasks) parameter.

val numStream: DStream[(String, Int)] = stream
  .flatMap(line => line.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(8), Seconds(2))
numStream.print()

Output (entering one number every 2 seconds):

-------------------------------------------
Time: 1608782502000 ms
-------------------------------------------
(0,2)
(1,2)
-------------------------------------------
Time: 1608782504000 ms
-------------------------------------------
(0,2)
(1,2)

After the input stops, the trailing windows show:

-------------------------------------------
Time: 1608782854000 ms
-------------------------------------------
(0,1)
-------------------------------------------
Time: 1608782856000 ms
-------------------------------------------

The second parameter list (with an inverse reduce function):

val numStream: DStream[(String, Int)] = stream
  .flatMap(line => line.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow((x: Int, y: Int) => x + y, (x: Int, y: Int) => x - y, Seconds(8), Seconds(2))
numStream.print()

Output (entering one number every 2 seconds):

-------------------------------------------
Time: 1608782502000 ms
-------------------------------------------
(0,2)
(1,2)
-------------------------------------------
Time: 1608782504000 ms
-------------------------------------------
(0,2)
(1,2)

After the input stops, the trailing windows show:

-------------------------------------------
Time: 1608782502000 ms
-------------------------------------------
(0,0)
(1,0)
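The inverse function makes the window update incremental: instead of re-reducing everything in the window, Spark takes the previous window's result, reduces in the batches that slide into the window, and applies the inverse function to the batches that slide out (which is also why keys linger with a count of 0 above until they are filtered out). A plain-Scala sketch of the idea, not Spark's internal code:

```scala
object IncrementalWindow {
  /** Incremental windowed reduce:
    * newResult = invReduce(reduce(oldResult, entering batches), leaving batches). */
  def slide(oldResult: Int, entering: Seq[Int], leaving: Seq[Int],
            reduce: (Int, Int) => Int, invReduce: (Int, Int) => Int): Int = {
    val withNew = entering.foldLeft(oldResult)(reduce)   // add batches entering the window
    leaving.foldLeft(withNew)(invReduce)                 // "subtract" batches leaving it
  }

  def main(args: Array[String]): Unit = {
    // previous window held per-batch sums (3, 5, 2) -> 10; batch 3 leaves, (4, 1) enter
    println(slide(10, Seq(4, 1), Seq(3), _ + _, _ - _)) // 10 + 4 + 1 - 3 = 12
  }
}
```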