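Flink Windows

Flink Window Computation / Stream Processing

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html

Overview

Window computation is the core of stream processing. A window splits an unbounded data stream along the time axis into finite-size datasets (buckets), and the computation is then applied to each bucket. Flink divides window computation into two categories, depending on whether the stream is keyed.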

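  • Keyed Windows
stream
       .keyBy(...)               <-  group the data by key
       .window(...)              <-  required: "assigner", defines how elements are assigned to windows
      [.trigger(...)]            <-  optional: "trigger", decides when a window fires (every assigner has a default trigger)
      [.evictor(...)]            <-  optional: "evictor", removes elements from the window before and/or after the aggregation
      [.allowedLateness(...)]    <-  optional: "lateness", how late elements may arrive (event time only; no lateness allowed by default)
      [.sideOutputLateData(...)] <-  optional: "output tag", sends elements that arrive too late to a side output
       .reduce/aggregate/fold/apply() <-  required: "function", performs the window aggregation
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the stream of too-late elements
  • Non-Keyed Windows
stream
       .windowAll(...)           <-  required: "assigner", defines how elements are assigned to windows
      [.trigger(...)]            <-  optional: "trigger", decides when a window fires (every assigner has a default trigger)
      [.evictor(...)]            <-  optional: "evictor", removes elements from the window before and/or after the aggregation
      [.allowedLateness(...)]    <-  optional: "lateness", how late elements may arrive (event time only; no lateness allowed by default)
      [.sideOutputLateData(...)] <-  optional: "output tag", sends elements that arrive too late to a side output
       .reduce/aggregate/fold/apply() <-  required: "function", performs the window aggregation
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the stream of too-late elements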

Window Lifecycle

In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows.

In addition, each window will have a Trigger (see Triggers) and a function (ProcessWindowFunction, ReduceFunction, AggregateFunction or FoldFunction) (see Window Functions) attached to it. The function will contain the computation to be applied to the contents of the window, while the Trigger specifies the conditions under which the window is considered ready for the function to be applied. A triggering policy might be something like “when the number of elements in the window is more than 4”, or “when the watermark passes the end of the window”. A trigger can also decide to purge a window’s contents any time between its creation and removal. Purging in this case only refers to the elements in the window, and not the window metadata. This means that new data can still be added to that window.

Apart from the above, you can specify an Evictor (see Evictors) which will be able to remove elements from the window after the trigger fires and before and/or after the function is applied.
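
The removal rule described above can be sketched in plain Scala without any Flink dependency. This is only an illustration: the object and method names are hypothetical, and the exact comparison (the last timestamp of a window is end - 1, and the window goes away once the watermark reaches that timestamp plus the allowed lateness) is an assumption about event-time windows rather than a copy of Flink's code.

```scala
object WindowLifecycleSketch {
  // Last timestamp that still belongs to a window [start, end).
  def maxTimestamp(end: Long): Long = end - 1

  // Assumed cleanup rule: state is kept until the watermark
  // reaches maxTimestamp + allowedLateness.
  def cleanupTime(end: Long, allowedLateness: Long): Long =
    maxTimestamp(end) + allowedLateness

  // True once the window would be completely removed.
  def isRemoved(end: Long, allowedLateness: Long, watermark: Long): Boolean =
    cleanupTime(end, allowedLateness) <= watermark

  def main(args: Array[String]): Unit = {
    // Window [0, 5000) with 2000 ms of allowed lateness.
    println(isRemoved(5000L, 2000L, 5500L)) // false: late elements are still accepted
    println(isRemoved(5000L, 2000L, 6999L)) // true: the window has been dropped
  }
}
```

With an allowed lateness of 0 the window is removed as soon as the watermark passes its end, which matches the default behaviour mentioned in the overview.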

Window Assigners

Tumbling Windows (time-based)

The window size equals the slide step, so windows never overlap.

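import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)                                                   // key by the word (tuple field 0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))  // 5-second tumbling windows
  .reduce((v1, v2) => (v1._1, v1._2 + v2._2))                 // sum the counts per word
  .print()

fsEnv.execute("FlinkWordCountsTumblingWindow_ReduceFunction")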
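
To make the "no overlap" property concrete, here is a small dependency-free sketch of how a timestamp can be mapped to exactly one tumbling window. The object and method names are hypothetical; the start is computed by aligning the timestamp down to a multiple of the window size, with an optional offset.

```scala
object TumblingSketch {
  // Start of the tumbling window that contains `timestamp`.
  def windowStart(timestamp: Long, size: Long, offset: Long = 0L): Long =
    timestamp - (timestamp - offset + size) % size

  // The single window [start, end) an element is assigned to.
  def window(timestamp: Long, size: Long): (Long, Long) = {
    val start = windowStart(timestamp, size)
    (start, start + size)
  }

  def main(args: Array[String]): Unit = {
    // 5-second windows, timestamps in milliseconds.
    println(window(12345L, 5000L)) // (10000,15000)
    println(window(4999L, 5000L))  // (0,5000)
    println(window(5000L, 5000L))  // (5000,10000)
  }
}
```

Every timestamp lands in exactly one window, and the end of one window is the start of the next.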

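Sliding Windows (time-based)

The window size is normally greater than or equal to the slide step; if it is smaller, elements that fall between two windows are lost. When the size is larger than the slide, consecutive windows overlap.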

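import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  // 4-second windows that start every 2 seconds
  .window(SlidingProcessingTimeWindows.of(Time.seconds(4), Time.seconds(2)))
  .aggregate(new AggregateFunction[(String, Int), (String, Int), (String, Int)] {
    // empty accumulator: no word yet, count 0
    override def createAccumulator(): (String, Int) = ("", 0)

    // add one element to the running count
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) =
      (value._1, value._2 + accumulator._2)

    override def getResult(accumulator: (String, Int)): (String, Int) = accumulator

    // combine two partial accumulators
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) =
      (a._1, a._2 + b._2)
  })
  .print()

fsEnv.execute("FlinkWordCountsSlidingWindow_AggregateFunction")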
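
The overlap, and the data loss when the slide exceeds the size, can be illustrated with a short dependency-free sketch. The names are hypothetical and the assignment logic (walk backwards from the latest window start in steps of the slide while the window still covers the timestamp) is my understanding of how a sliding assigner works, not code taken from Flink.

```scala
object SlidingSketch {
  // All windows [start, end) that contain `timestamp`, newest first.
  def assign(timestamp: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Latest window start that is <= timestamp (aligned to the slide).
    val lastStart = timestamp - (timestamp + slide) % slide
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > timestamp - size)
      .map(start => (start, start + size))
      .toList
  }

  def main(args: Array[String]): Unit = {
    // size 4 s, slide 2 s: each element belongs to size / slide = 2 windows.
    println(assign(5000L, 4000L, 2000L)) // List((4000,8000), (2000,6000))
    // size 2 s, slide 4 s: an element at 3000 ms falls into the hole.
    println(assign(3000L, 2000L, 4000L)) // List()
  }
}
```

The second call returns no window at all, which is the "lost data" case described above.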

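Session Windows (time-based)

Every element initially produces its own window. If the gap between two windows is smaller than the configured window gap, the system merges them. Unlike the previous two window types, session windows are therefore mergeable windows and have no fixed length.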

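import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(t => t._1)
  // a session closes after 5 seconds without new elements for the key
  .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
  .apply(new WindowFunction[(String, Int), String, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
      val start = window.getStart
      val end = window.getEnd
      val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      // emit the session boundaries followed by all elements of the session
      out.collect(sdf.format(start) + " ~ " + sdf.format(end) + "\t" + input.mkString(","))
    }
  })
  .print()

fsEnv.execute("FlinkWordCountsSessionWindow_WindowFunction")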
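
The merging behaviour of session windows can be sketched without Flink as follows. The names are hypothetical, and one detail is an assumption on my part: two windows are merged here when the next element arrives at or before the end of the current session, so windows that merely touch are also merged.

```scala
object SessionSketch {
  // Each timestamp starts as its own window [t, t + gap); windows that
  // overlap (or touch, by assumption) are merged into one session.
  def sessions(timestamps: Seq[Long], gap: Long): List[(Long, Long)] =
    timestamps.sorted.foldLeft(List.empty[(Long, Long)]) {
      case ((start, end) :: rest, t) if t <= end =>
        (start, math.max(end, t + gap)) :: rest // extend the current session
      case (acc, t) =>
        (t, t + gap) :: acc                     // open a new session
    }.reverse

  def main(args: Array[String]): Unit = {
    // Gap of 5 s; the first three elements form one session.
    println(sessions(Seq(1000L, 3000L, 12000L, 4000L), 5000L))
    // List((1000,9000), (12000,17000))
  }
}
```

The resulting windows have different lengths (8 s and 5 s), which is why a session window has no fixed size.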

Global Windows (not time-based)

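import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(t => t._1)
  .window(GlobalWindows.create())   // one never-ending window per key
  .trigger(CountTrigger.of(2))      // a global window needs an explicit trigger
  .fold(("", 0))((z, v) => (v._1, z._2 + v._2))
  .print()

fsEnv.execute("FlinkWordCountsGlobalWindow_FoldFunction")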
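
What this job prints can be approximated with a small simulation in plain Scala. The names are hypothetical, and the sketch assumes that a count trigger fires every `maxCount` elements per key without purging the window, so the folded count keeps growing across firings.

```scala
import scala.collection.mutable

object CountTriggerSketch {
  // Returns the (word, count) pairs emitted each time the trigger fires.
  def run(words: Seq[String], maxCount: Int): List[(String, Int)] = {
    val folded = mutable.Map.empty[String, Int]    // fold accumulator per key, never purged
    val sinceFire = mutable.Map.empty[String, Int] // elements seen since the last firing
    val out = mutable.ListBuffer.empty[(String, Int)]
    for (w <- words) {
      folded(w) = folded.getOrElse(w, 0) + 1
      sinceFire(w) = sinceFire.getOrElse(w, 0) + 1
      if (sinceFire(w) >= maxCount) {
        sinceFire(w) = 0        // reset the trigger counter only
        out += ((w, folded(w))) // emit the current folded value
      }
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    println(run(Seq("a", "b", "a", "a", "b", "a"), 2))
    // List((a,2), (b,2), (a,4))
  }
}
```

Under that assumption the second firing for "a" reports 4 rather than 2, because only the trigger's counter is reset and the window contents are kept.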

Window Functions

After defining the window assigner, we need to specify the computation that we want to perform on each of these windows. This is the responsibility of the window function, which is used to process the elements of each (possibly keyed) window once the system determines that a window is ready for processing (see triggers for how Flink determines when a window is ready).

ReduceFunction

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }

AggregateFunction

class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
   
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  override def getResult(accumulator: (Long, Long)): Double = accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate)
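
The contract behind an aggregate function (create an accumulator, add elements one by one, extract the result, and merge partial accumulators) can be tried out without Flink. The sketch below uses a hypothetical `SimpleAggregate` trait that only mirrors the four methods used above; it is not Flink's interface.

```scala
object AggregateSketch {
  // Hypothetical stand-in for the four methods of an aggregate function.
  trait SimpleAggregate[IN, ACC, OUT] {
    def createAccumulator(): ACC
    def add(value: IN, accumulator: ACC): ACC
    def getResult(accumulator: ACC): OUT
    def merge(a: ACC, b: ACC): ACC
  }

  // Average of the second tuple field; the accumulator is (sum, count).
  object Average extends SimpleAggregate[(String, Long), (Long, Long), Double] {
    def createAccumulator(): (Long, Long) = (0L, 0L)
    def add(value: (String, Long), acc: (Long, Long)): (Long, Long) =
      (acc._1 + value._2, acc._2 + 1L)
    def getResult(acc: (Long, Long)): Double = acc._1.toDouble / acc._2
    def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
      (a._1 + b._1, a._2 + b._2)
  }

  // Feed the elements of one window through the aggregate incrementally.
  def run[IN, ACC, OUT](agg: SimpleAggregate[IN, ACC, OUT], window: Seq[IN]): OUT =
    agg.getResult(window.foldLeft(agg.createAccumulator())((acc, v) => agg.add(v, acc)))

  def main(args: Array[String]): Unit = {
    println(run(Average, Seq(("a", 2L), ("a", 4L), ("a", 6L)))) // 4.0
  }
}
```

Because only the small accumulator is kept per window, the elements themselves do not need to be buffered.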

FoldFunction

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .fold("") { (acc, v) => acc + v._2 }

fold() cannot be used with session windows or other mergeable windows.
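
The fold in this example appends the Long field of every element to an initially empty string. A minimal plain-Scala sketch of that behaviour, using `foldLeft` on an in-memory sequence as a stand-in for one window's contents (the object name is hypothetical):

```scala
object FoldSketch {
  // Same fold logic as above, applied to the elements of a single window.
  def foldWindow(window: Seq[(String, Long)]): String =
    window.foldLeft("")((acc, v) => acc + v._2)

  def main(args: Array[String]): Unit = {
    println(foldWindow(Seq(("a", 1L), ("a", 2L), ("a", 3L)))) // 123
  }
}
```

A fold only knows how to combine an accumulator with the next element; it has no way to combine two accumulators, which is a plausible reason why it does not fit windows that get merged.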

WindowFunction (Legacy)

In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and does not have some advanced features, such as per-window keyed state. This interface will be deprecated at some point.

trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .apply(new MyWindowFu