Big Data: The Flink Window Mechanism

Window

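  • Windows are at the heart of processing unbounded streams in Flink. A window splits the stream into finite-size "buckets" over which computations can be applied.

  • Flink treats batch as a special case of streaming, so its underlying engine is a streaming engine on which both stream processing and batch processing are implemented.

  • The window is the bridge from streaming to batch.

  • Flink provides a very complete windowing mechanism.

  • In a stream processing application the data arrives continuously, so we cannot wait for all of it to arrive before we start processing.

  • We could process every message as it arrives, but sometimes we need an aggregation, for example: how many users clicked on our page in the past minute?

  • In that case we must define a window that collects the data of the last minute and run the computation over the data in that window.

  • A window can be time-driven (Time Window, for example: every 30 seconds)

  • or data-driven (Count Window, for example: every 100 elements).

  • Depending on how elements are assigned to them, windows are further divided into the following kinds:

     Tumbling Window (no overlap)
     Sliding Window (overlapping)
     Session Window (separated by a gap of inactivity)
     Global Window (not covered here)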
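As a rough illustration of the difference between time-driven and count-driven windows, the sketch below buckets elements in plain Scala, without any Flink dependency. The names BucketSketch, timeBucket and countBuckets are made up for this example and are not part of the Flink API. It only mimics how elements end up in buckets, not how Flink schedules or triggers windows.

```scala
// Illustrative only: plain Scala, no Flink dependency. All names are hypothetical.
object BucketSketch {

  /** Time-driven: start of the fixed-size window [start, start + size) that contains `ts`. */
  def timeBucket(ts: Long, size: Long): Long = ts - Math.floorMod(ts, size)

  /** Count-driven: consecutive groups of exactly `n` elements; a trailing incomplete group is dropped. */
  def countBuckets[A](xs: Seq[A], n: Int): List[Seq[A]] =
    xs.grouped(n).filter(_.size == n).toList

  def main(args: Array[String]): Unit = {
    // 30-second time windows: timestamps (in seconds) 5 and 29 share a window, 31 starts the next one.
    println(Seq(5L, 29L, 31L).map(timeBucket(_, 30L))) // List(0, 0, 30)

    // Count windows of 3 elements: the 7th element stays behind in an incomplete group.
    println(countBuckets(1 to 7, 3).map(_.toList))     // List(List(1, 2, 3), List(4, 5, 6))
  }
}
```

The incomplete trailing group is dropped on purpose: a count window only produces a result once it has collected the configured number of elements.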
    

To work with windows in Flink, the source stream must first be converted into a WindowedStream (keyBy followed by a window call).

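Window operations and what they do:

  • Window (KeyedStream -> WindowedStream): defines windows on an already partitioned KeyedStream, i.e. data in key/value form.
  • WindowAll (DataStream -> AllWindowedStream): defines windows on a regular DataStream, i.e. data that is not in key/value form.
  • Window Apply (WindowedStream -> DataStream, AllWindowedStream -> DataStream): applies a function to the data in the window.
  • Window Reduce (WindowedStream -> DataStream): applies a reduce function to the data in the window and emits the reduced value.
  • Aggregations on Windows (WindowedStream -> DataStream): aggregates the data in the window: sum(), max(), min().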

Tumbling Window

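  • A tumbling window cuts the stream into non-overlapping windows; each event belongs to exactly one window.
  • Tumbling windows have a fixed size and do not overlap.
  • Code example: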
package window

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/*
Flink tumbling window with a window size of 10 seconds.
Finds the minimum vc value for each sensor id.
Reads strings of the form "ws_001,1577844001,45.0" from a socket
and converts the string stream into a stream of WaterSensor records (dataStream).
 */

case class WaterSensor(id: String, ts: Long, vc: Double)

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.100", 7777)

    // Parse each comma-separated line into a WaterSensor
    val dataStream: DataStream[WaterSensor] = stream.map(x => {
      val strings: Array[String] = x.split(",")
      WaterSensor(strings(0).trim, strings(1).trim.toLong, strings(2).trim.toDouble)
    })
    // Keep only (id, vc) and key the stream by id
    val dataStream2: DataStream[(String, Double)] = dataStream.map(data => (data.id, data.vc))
    val dataStream3: KeyedStream[(String, Double), String] = dataStream2.keyBy(_._1)
    // Tumbling window of 10 seconds
    // (timeWindow is deprecated since Flink 1.12; newer code calls window(...) with a window assigner)
    val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(10))

    // Within each window, keep the smaller vc of any two elements of the same key
    val minDataStream: DataStream[(String, Double)] = dataStream4.reduce((x, y) => (x._1, x._2.min(y._2)))

    dataStream.print("orig")

    minDataStream.print("min")

    env.execute("windowdemo")
  }
}

  • Open the port
[root@hadoop100 ~]# nc -lk 7777
  • Start the Scala program
  • Type the following lines into the port
ws_001,1609314670,45.0
ws_002,1609314671,33.0
ws_003,1609314672,32.0
ws_002,1609314673,23.0
ws_003,1609314674,31.0
ws_002,1609314675,45.0
ws_003,1609314676,18.0
ws_002,1609314677,34.0
ws_003,1609314678,47.0
ws_001,1609314679,55.0
ws_001,1609314680,25.0
ws_001,1609314681,25.0
ws_001,1609314682,25.0
ws_001,1609314683,26.0
ws_001,1609314684,21.0
ws_001,1609314685,24.0
ws_001,1609314686,15.0
  • The console prints the results: the minimum value for each key.
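To make the keyBy, window and reduce steps easier to follow, here is a small plain-Scala sketch that computes the same kind of result offline. It rests on one assumption that differs from the job above: the job never sets a time characteristic, so its windows follow the wall clock at the moment each line is typed, whereas this sketch groups by the ts field of each record. The object name TumblingMinSketch and the function minPerWindow are hypothetical, and only a few of the input lines are reused.

```scala
// Illustrative only: plain Scala, no Flink dependency. Groups by the record's own ts (seconds).
object TumblingMinSketch {
  case class WaterSensor(id: String, ts: Long, vc: Double)

  /** Minimum vc per (id, window start) for tumbling windows of `sizeSec` seconds. */
  def minPerWindow(events: Seq[WaterSensor], sizeSec: Long): Map[(String, Long), Double] =
    events
      .groupBy(e => (e.id, e.ts - e.ts % sizeSec)) // keyBy(id) plus window assignment
      .map { case (keyAndWindow, es) =>
        keyAndWindow -> es.map(_.vc).reduce(_ min _) // pairwise reduce, as in the Flink job
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      WaterSensor("ws_001", 1609314670L, 45.0),
      WaterSensor("ws_002", 1609314671L, 33.0),
      WaterSensor("ws_002", 1609314673L, 23.0),
      WaterSensor("ws_001", 1609314679L, 55.0),
      WaterSensor("ws_001", 1609314680L, 25.0),
      WaterSensor("ws_001", 1609314686L, 15.0)
    )
    // Prints one line per (id, window start):
    // ((ws_001,1609314670),45.0)
    // ((ws_001,1609314680),15.0)
    // ((ws_002,1609314670),23.0)
    minPerWindow(events, 10L).toSeq.sortBy(_._1).foreach(println)
  }
}
```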

Sliding Window

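  • A sliding window is similar to a tumbling window; the difference is that sliding windows can overlap.
  • In a sliding window, one element can belong to several windows.
  • Code example: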
package window

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/*
Flink sliding window with a window size of 10 seconds and a slide of 5 seconds.
Finds the minimum vc value for each sensor id.
Reads strings of the form "ws_001,1577844001,45.0" from a socket
and converts the string stream into a stream of WaterSensor records (dataStream).
 */

case class WaterSensor(id: String, ts: Long, vc: Double)

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.100", 7777)

    // Parse each comma-separated line into a WaterSensor
    val dataStream: DataStream[WaterSensor] = stream.map(x => {
      val strings: Array[String] = x.split(",")
      WaterSensor(strings(0).trim, strings(1).trim.toLong, strings(2).trim.toDouble)
    })
    // Keep only (id, vc) and key the stream by id
    val dataStream2: DataStream[(String, Double)] = dataStream.map(data => (data.id, data.vc))
    val dataStream3: KeyedStream[(String, Double), String] = dataStream2.keyBy(_._1)

    // Sliding window: size 10 seconds, slide 5 seconds
    val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(10), Time.seconds(5))

    // Within each window, keep the smaller vc of any two elements of the same key
    val minDataStream: DataStream[(String, Double)] = dataStream4.reduce((x, y) => (x._1, x._2.min(y._2)))

    dataStream.print("orig")

    minDataStream.print("min")

    env.execute("windowdemo")
  }
}

  • Open the port
[root@hadoop100 ~]# nc -lk 7777
  • Start the Scala program
  • Type the following lines into the port
ws_001,1609314670,45.0
ws_002,1609314671,33.0
ws_003,1609314672,32.0
ws_002,1609314673,23.0
ws_003,1609314674,31.0
ws_002,1609314675,45.0
ws_003,1609314676,18.0
ws_002,1609314677,34.0
ws_003,1609314678,47.0
ws_001,1609314679,55.0
ws_001,1609314680,25.0
ws_001,1609314681,25.0
ws_001,1609314682,25.0
ws_001,1609314683,26.0
ws_001,1609314684,21.0
ws_001,1609314685,24.0
ws_001,1609314686,15.0
  • The console prints the results: the minimum value for each key.
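The statement that one element can belong to several sliding windows can be checked with a few lines of plain Scala. The sketch below lists the start of every window that contains a given timestamp, assuming windows are aligned to multiples of the slide (no offset). SlidingAssignSketch and slidingStarts are hypothetical names, not Flink API.

```scala
// Illustrative only: plain Scala, no Flink dependency. All names are hypothetical.
object SlidingAssignSketch {

  /** Starts of all windows [start, start + size) with step `slide` that contain `ts`. */
  def slidingStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - Math.floorMod(ts, slide) // latest window start that is <= ts
    // Step backwards by `slide` for as long as the window still covers ts.
    Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > ts - size).toList
  }

  def main(args: Array[String]): Unit = {
    // Size 10 s, slide 5 s: each element falls into 10 / 5 = 2 windows.
    println(slidingStarts(1609314673L, 10L, 5L)) // List(1609314670, 1609314665)

    // Size 8 s, slide 2 s: each element falls into 8 / 2 = 4 windows.
    println(slidingStarts(1609314671L, 8L, 2L).size) // 4
  }
}
```

Because every element is copied into size / slide windows, a small slide relative to the window size multiplies the work done per element.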

Sliding Window with a Watermark

package window

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/*
Flink sliding window (size 8 seconds, slide 2 seconds) on event time with a watermark.
Finds the minimum vc value for each sensor id.
Reads strings of the form "ws_001,1577844001,45.0" from a socket
and converts the string stream into a stream of WaterSensor records (dataStream).
WaterSensor is the case class defined alongside WindowDemo in the same package.
 */

object WindowDemo2 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)

    /*
    Flink has three notions of time:
    EventTime         the time at which the event occurred
    IngestionTime     the time at which the event entered Flink
    ProcessingTime    the time at which the event is processed
     */
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream: DataStream[String] = env.socketTextStream("192.168.136.100", 7777)

    val dataStream: DataStream[WaterSensor] = stream.map(x => {
      val strings: Array[String] = x.split(",")
      WaterSensor(strings(0).trim, strings(1).trim.toLong, strings(2).trim.toDouble)
    })
      // Simplest option: assumes timestamps only ever ascend, so no lateness is tolerated
      //.assignAscendingTimestamps(x => x.ts * 1000)

      // Custom assigner class (defined below)
      .assignTimestampsAndWatermarks(new MyAssigner)

      // Alternative: the assigner that ships with Flink
      /*.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
      override def extractTimestamp(t: WaterSensor): Long = {
        t.ts * 1000
      }
    })*/

    val dataStream2: DataStream[(String, Double)] = dataStream.map(data => (data.id, data.vc))
    val dataStream3: KeyedStream[(String, Double), String] = dataStream2.keyBy(_._1)
    // Tumbling window
    //val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(10))

    // Sliding window: size 8 seconds, slide 2 seconds
    val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(8), Time.seconds(2))

    // Within each window, keep the smaller vc of any two elements of the same key
    val minDataStream: DataStream[(String, Double)] = dataStream4.reduce((x, y) => (x._1, x._2.min(y._2)))

    dataStream.print("orig")

    minDataStream.print("min")

    env.execute("windowdemo")
  }
}

class MyAssigner extends AssignerWithPeriodicWatermarks[WaterSensor] {
  val bound = 3000L // watermark lag in milliseconds: how long the closing of a window is delayed
  var maxTs = 0L    // largest timestamp seen so far, in milliseconds

  // Called periodically by Flink; the watermark trails the largest timestamp by `bound`
  override def getCurrentWatermark: Watermark = {
    new Watermark(maxTs - bound)
  }

  // ts is in seconds, Flink expects milliseconds
  override def extractTimestamp(t: WaterSensor, l: Long): Long = {
    maxTs = Math.max(t.ts * 1000, maxTs)
    t.ts * 1000
  }
}
  • Open the port
[root@hadoop100 ~]# nc -lk 7777
  • Start the Scala program
  • Type the following lines into the port
ws_002, 1609314671, 33.0
ws_002, 1609314673, 23.0
ws_002, 1609314673, 43.0
ws_002, 1609314673, 13.0
ws_002, 1609314673, 43.0
ws_001, 1609314683, 26.0
  • The console prints the results: the minimum value for each key.
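The interplay between the 3-second bound in MyAssigner and the moment a window produces its result can be traced by hand. The plain-Scala sketch below mirrors the watermark rule (largest timestamp seen minus the bound) and assumes the usual event-time rule that a window [start, end) fires once the watermark reaches end - 1. It evaluates the watermark after every element for simplicity, whereas Flink asks a periodic assigner for it at intervals. WatermarkSketch, BoundedWatermark and canFire are hypothetical names.

```scala
// Illustrative only: plain Scala, no Flink dependency. All names are hypothetical.
object WatermarkSketch {

  /** Mirrors MyAssigner: watermark = largest timestamp seen so far minus the bound (all in ms). */
  final class BoundedWatermark(boundMs: Long) {
    private var maxTs = 0L
    def observe(tsMs: Long): Unit = maxTs = Math.max(maxTs, tsMs)
    def current: Long = maxTs - boundMs
  }

  /** Assumed firing rule: the window ending at `windowEndMs` fires once the watermark reaches end - 1. */
  def canFire(windowEndMs: Long, watermarkMs: Long): Boolean = watermarkMs >= windowEndMs - 1

  def main(args: Array[String]): Unit = {
    val wm = new BoundedWatermark(3000L)
    // One of the 8 s / 2 s sliding windows covering the first inputs: [1609314666 s, 1609314674 s)
    val windowEndMs = 1609314674000L
    for (tsSec <- Seq(1609314671L, 1609314673L, 1609314683L)) {
      wm.observe(tsSec * 1000)
      println(s"ts=$tsSec watermark=${wm.current} fires=${canFire(windowEndMs, wm.current)}")
    }
    // fires=false for the first two records, fires=true after the record with ts 1609314683
  }
}
```

Under these assumptions the records of ws_002 stay buffered until the later ws_001 record pushes the watermark past the end of their windows, which is why the final input line carries a much larger timestamp.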