Flink Windows: Assigners, Functions, and Triggers

Windows

Windows are at the heart of processing infinite streams. A window splits the stream into "buckets" of finite size, and computations are then applied to the finite data in each bucket. This document focuses on how windowing is performed in Flink and how the programmer can get the most out of the functionality it offers.

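Flink divides window computation into two broad categories: windows on a keyed stream (keyed windows) and windows on a plain DataStream (non-keyed windows).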

Keyed Windows
stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger), decides when the window fires
      [.evictor(...)]            <-  optional: "evictor" (else no evictor), removes elements from the window
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero), how long late elements are still accepted
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function", the computation applied to the window contents
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the late data

Non-Keyed Windows
stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger), decides when the window fires
      [.evictor(...)]            <-  optional: "evictor" (else no evictor), removes elements from the window
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero), how long late elements are still accepted
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function", the computation applied to the window contents
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the late data

Window Lifecycle

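A window is created as soon as the first element that belongs to it arrives. Once time (the watermark, for event time) passes the window's end timestamp, the window is considered ready and the window function can be applied to its elements. Once time passes the end timestamp plus the allowed lateness, the window is removed. This lifecycle applies only to time-based windows: Flink also has a global window type that is not based on time, so it has no natural end and is not removed this way.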

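For example, with an event-time windowing strategy that creates non-overlapping (tumbling) windows every 5 minutes and allows a lateness of 1 minute, Flink creates a new window for the interval 12:00 to 12:05 when the first element with a timestamp in that interval arrives, and removes that window when the watermark passes 12:06.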
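
The arithmetic behind this example can be checked with a short plain-Scala sketch. It does not use the Flink API: WindowLifecycleSketch, windowFor, and cleanupTime are hypothetical names introduced here, timestamps are assumed to be non-negative milliseconds, and Flink's exact boundary handling is simplified.

```scala
// Plain-Scala sketch of the window lifecycle arithmetic (no Flink dependency).
object WindowLifecycleSketch {
  // A time window covers [start, end): start inclusive, end exclusive.
  final case class Window(start: Long, end: Long)

  // Tumbling window of the given size that contains `timestamp`.
  // Assumes non-negative timestamps, all values in milliseconds.
  def windowFor(timestamp: Long, size: Long): Window = {
    val start = timestamp - (timestamp % size)
    Window(start, start + size)
  }

  // Simplified rule: the window may be removed once time passes end + allowed lateness.
  def cleanupTime(w: Window, allowedLateness: Long): Long = w.end + allowedLateness

  def main(args: Array[String]): Unit = {
    val minute = 60 * 1000L
    val noon = 12 * 60 * minute                       // 12:00 as milliseconds since midnight
    val w = windowFor(noon + 2 * minute, 5 * minute)  // element stamped 12:02
    println(s"window = [${w.start / minute}, ${w.end / minute}) minutes")
    println(s"cleanup after minute ${cleanupTime(w, minute) / minute}")
  }
}
```

With an element stamped 12:02 (minute 722 of the day), the sketch places it in the window from minute 720 to minute 725 and reports minute 726, that is 12:06, as the cleanup point.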

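Every window has a trigger and a function attached to it. The function contains the computation to apply to the window contents, while the trigger decides when the window is ready, because the function is applied only to windows that are ready.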

In addition, you can specify an Evictor, which can remove elements from the window after the trigger fires, either before or after the function is applied.

Keyed vs Non-Keyed Windows

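Keyed Windows: the stream is partitioned by key, so at any moment several window tasks can run in parallel, depending on how many distinct keys there are.
Non-Keyed Windows: there is no key, so at any moment only one window task is running.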
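
The difference can be illustrated with ordinary Scala collections standing in for a stream. This is only a sketch: KeyedVsNonKeyedSketch and its two methods are hypothetical names, and grouping a finite Seq is an assumption made to keep the example runnable without Flink.

```scala
// Plain-Scala sketch: one window instance per (interval, key) versus one per interval.
object KeyedVsNonKeyedSketch {
  // (word, timestampMillis) pairs standing in for stream elements.
  type Event = (String, Long)

  // Keyed: group by window start and key, so each key gets its own window instance.
  def keyedCounts(events: Seq[Event], size: Long): Map[(Long, String), Int] =
    events
      .groupBy { case (word, ts) => (ts - ts % size, word) }
      .map { case (k, vs) => k -> vs.size }

  // Non-keyed: group by window start only, so there is one window instance per interval.
  def nonKeyedCounts(events: Seq[Event], size: Long): Map[Long, Int] =
    events
      .groupBy { case (_, ts) => ts - ts % size }
      .map { case (k, vs) => k -> vs.size }

  def main(args: Array[String]): Unit = {
    val events = Seq(("a", 1000L), ("b", 2000L), ("a", 4000L), ("a", 6000L))
    println(keyedCounts(events, 5000L))     // one entry per (window start, word)
    println(nonKeyedCounts(events, 5000L))  // one entry per window start
  }
}
```

In the keyed case the first 5-second interval yields separate results for "a" and "b"; in the non-keyed case it yields a single result covering all three elements.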

Window Assigners

A window assigner defines how elements are assigned to windows. You specify it by passing a WindowAssigner to window(...) for keyed streams or windowAll(...) for non-keyed streams.

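A window assigner assigns each incoming element to one or more windows. Flink ships with predefined assigners for the most common cases: tumbling windows, sliding windows, session windows, and global windows. You can also implement a custom assigner by extending the WindowAssigner class. All built-in assigners except global windows are time-based and produce TimeWindow instances. A time-based window has a start timestamp (inclusive) and an end timestamp (exclusive), which together describe the size of the window.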

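  • Tumbling Windows

A tumbling window assigner assigns each element to a window of a specified size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window is evaluated and a new window is started every five minutes.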

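import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Read lines of text from a socket (host "train", port 9999)
val text = env.socketTextStream("train",9999)

text.flatMap(_.split(" "))
  .map(word=>(word,1))
  .keyBy(0)                                                   // key by the word (tuple field 0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))  // 5-second tumbling windows
  .reduce((v1,v2)=>(v1._1,v1._2+v2._2))                       // sum the counts per word
  .print()

env.execute("Tumbling Window")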
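  • Sliding Windows
    A sliding window assigner assigns elements to windows of fixed length. As with the tumbling window assigner, the window size is set by the window size parameter. An additional window slide parameter controls how often a new sliding window is started. If the slide is smaller than the window size, sliding windows overlap, and an element is then assigned to multiple windows.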
object FlinkSlidingWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(0)
      .window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2))) // size 4s, slide 2s, so windows overlap
      .aggregate(new UserDefineAggregateFunction)
      .print()

    env.execute("Sliding Window")
  }
}
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),(String,Int)] {
  override def createAccumulator(): (String, Int) = ("",0)

  override def add(in: (String, Int), acc: (String, Int)): (String, Int) = {
    (in._1,in._2+acc._2)
  }

  override def getResult(acc: (String, Int)): (String, Int) = acc

  override def merge(acc: (String, Int), acc1: (String, Int)): (String, Int) = {
    (acc._1,acc._2+acc1._2)
  }
}
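
How one element ends up in several sliding windows can be sketched in plain Scala. The loop below follows the usual "step back from the latest window start by one slide at a time" rule; SlidingAssignSketch and windowStarts are hypothetical names, and non-negative millisecond timestamps with slide no larger than size are assumed.

```scala
// Plain-Scala sketch of sliding window assignment (no Flink dependency).
object SlidingAssignSketch {
  // Start timestamps of every window [start, start + size) that contains `timestamp`.
  // Assumes timestamp >= 0 and 0 < slide <= size, all in milliseconds.
  def windowStarts(timestamp: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = timestamp - (timestamp % slide) // latest window start at or before timestamp
    Iterator
      .iterate(lastStart)(_ - slide)                // step back one slide at a time
      .takeWhile(start => start > timestamp - size) // stop once the window no longer covers timestamp
      .toList
  }

  def main(args: Array[String]): Unit = {
    // Size 4s, slide 2s: the element at 5s belongs to two overlapping windows.
    println(windowStarts(5000L, 4000L, 2000L))
    // Size equal to slide: the windows no longer overlap and behave like tumbling windows.
    println(windowStarts(5000L, 4000L, 4000L))
  }
}
```

With a 4-second size and a 2-second slide, an element at 5 seconds falls into the windows starting at 4 seconds and at 2 seconds. When size and slide are equal, each element falls into exactly one window.
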
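  • Session Windows
    A session window assigner groups elements by sessions of activity. In contrast to tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. Instead, a session window closes when it has received no elements for a certain period, that is, when a gap of inactivity occurs.

A session window is configured with a session gap. Two consecutive elements that arrive less than one gap apart fall into the same window; an element that arrives more than one gap after the previous one starts a new window.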

object FlinkSessionWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
      .apply(new UserDefineWindowFunction)
      .print()

    env.execute("Session Window")
  }
}
class UserDefineWindowFunction extends WindowFunction[(String,Int),(String,Int),String,TimeWindow] {

  override def apply(key: String,
                     window: TimeWindow,
                     input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val start = sdf.format(window.getStart)
    val end = sdf.format(window.getEnd)
    val sum = input.map(_._2).sum
    out.collect((s"${key}\t${start}~~${end}",sum)) // emit an explicit (String, Int) tuple
  }
}
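
The gap rule for session windows can be sketched without Flink as a fold over sorted timestamps. SessionMergeSketch, Session, and sessions are hypothetical names; treating an element that arrives exactly one gap later as the start of a new session is an assumption of this sketch, not a statement about Flink's boundary behavior.

```scala
// Plain-Scala sketch of session window merging (no Flink dependency).
object SessionMergeSketch {
  // A session covers [start, end) and remembers how many elements it holds.
  final case class Session(start: Long, end: Long, count: Int)

  // Each element opens a window [ts, ts + gap); a window that overlaps the
  // current session is merged into it, otherwise a new session starts.
  def sessions(timestamps: Seq[Long], gap: Long): List[Session] =
    timestamps.sorted.foldLeft(List.empty[Session]) {
      case (last :: rest, ts) if ts < last.end =>
        Session(last.start, math.max(last.end, ts + gap), last.count + 1) :: rest
      case (acc, ts) =>
        Session(ts, ts + gap, 1) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    // Gap of 5s: the elements at 0s, 1s, 3s form one session, 10s and 12s another.
    sessions(Seq(0L, 1000L, 3000L, 10000L, 12000L), 5000L).foreach(println)
  }
}
```

Each new element extends the end of the running session to its own timestamp plus the gap, which is why a session has no fixed length.
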
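  • Global Windows
    A global window assigner assigns all elements with the same key to a single global window. This windowing scheme is useful only if you also specify a custom trigger. Otherwise no computation is ever performed, because a global window has no natural end at which the aggregated elements could be processed.

A global window has no notion of time.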

object FlinkGlobalWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(GlobalWindows.create())
      .trigger(CountTrigger.of(4))
      .apply(new UserDefineGlobalWindowFunction)
      .print()

    env.execute("Global Window")
  }
}
class UserDefineGlobalWindowFunction extends WindowFunction[(String,Int),(String,Int),String,GlobalWindow] {
  override def apply(key: String,
                     window: GlobalWindow,
                     input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    val sum = input.map(_._2).sum
    out.collect((key,sum)) // emit an explicit (String, Int) tuple
  }
}
Window Functions

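After defining the window assigner, you specify the computation to perform on each window. This is the job of the window function, which processes the elements of a window once the system determines that the window is ready. The window function can be a ReduceFunction, AggregateFunction, FoldFunction, ProcessWindowFunction, or the legacy WindowFunction.

ReduceFunction and AggregateFunction run more efficiently than ProcessWindowFunction because they aggregate incrementally: the system calls them as each element arrives in the window. A ProcessWindowFunction instead buffers all received elements until the window fires and only then processes them as a batch, but in return it has access to the window metadata. You can get both benefits by combining a ProcessWindowFunction with a ReduceFunction, AggregateFunction, or FoldFunction: the elements are aggregated incrementally, and the ProcessWindowFunction receives the aggregated result together with the window metadata.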
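
The cost difference between the two styles comes down to what has to be kept as window state. The following plain-Scala sketch models that idea only; IncrementalSum, BufferedSum, and stateSize are hypothetical names, and the classes are not Flink's window functions.

```scala
// Plain-Scala sketch: incremental aggregation versus buffering all elements.
object IncrementalVsBufferedSketch {
  // Incremental style (ReduceFunction / AggregateFunction): only the running
  // result is kept while the window is open.
  final class IncrementalSum {
    private var acc = 0
    def add(value: Int): Unit = acc += value
    def result: Int = acc
    def stateSize: Int = 1 // one accumulator, however many elements arrive
  }

  // Buffered style (ProcessWindowFunction): every element is kept until the
  // window fires, and the result is computed over the whole batch.
  final class BufferedSum {
    private val buffer = scala.collection.mutable.ArrayBuffer.empty[Int]
    def add(value: Int): Unit = { buffer += value; () }
    def result: Int = buffer.sum
    def stateSize: Int = buffer.size
  }

  def main(args: Array[String]): Unit = {
    val inc = new IncrementalSum
    val buf = new BufferedSum
    Seq(1, 2, 3, 4).foreach { v => inc.add(v); buf.add(v) }
    println(s"incremental: result=${inc.result}, state=${inc.stateSize}")
    println(s"buffered:    result=${buf.result}, state=${buf.stateSize}")
  }
}
```

Both variants produce the same sum, but the buffered one holds as many values as the window has received, which is the overhead that the combined reduce-plus-process form avoids.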

  • ReduceFunction
object FlinkReduceFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new UserDefineReduceFunction)
      .print()

    env.execute("Reduce Window")
  }
}

class UserDefineReduceFunction extends ReduceFunction[(String,Int)] {
  
  override def reduce(t: (String, Int), t1: (String, Int)): (String, Int) = {
    println("reduce:"+t+"\t"+t1)
    (t._1,t._2+t1._2)
  }
  
}
  • AggregateFunction
object FlinkAggregateFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .aggregate(new UserDefineAggregateFunction)
      .print()

    env.execute("Aggregate Window")
  }
}
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),(String,Int)] {
  override def createAccumulator(): (String, Int) = ("",0)

  override def add(in: (String, Int), acc: (String, Int)): (String, Int) = {
    (in._1,in._2+acc._2)
  }

  override def getResult(acc: (String, Int)): (String, Int) = acc

  override def merge(acc: (String, Int), acc1: (String, Int)): (String, Int) = {
    println("merge:"+acc+"\t"+acc1)
    (acc._1,acc._2+acc1._2)
  }
}
  • FoldFunction
object FlinkFoldFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .fold(("",0),new UserDefineFoldFunction)
      .print()

    env.execute("Fold Window")
  }
}
class UserDefineFoldFunction extends FoldFunction[(String,Int),(String,Int)] {
  override def fold(acc: (String, Int), value: (String, Int)): (String, Int) = {
    println("fold:"+acc+"\t"+value)
    (value._1,acc._2+value._2)
  }
}

Note: FoldFunction cannot be used with session windows or other merging windows.

  • ProcessWindowFunction
object FlinkProcessWindowFunction { // renamed so the object does not shadow Flink's ProcessWindowFunction class
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .process(new UserDefineProcessWindowFunction)
      .print()

    env.execute("Process Window")
  }
}
class UserDefineProcessWindowFunction extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val w = context.window // window metadata
    val start = sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)
    val total = elements.map(_._2).sum

    out.collect((key+"\t["+start+"~"+end+"]",total))
  }
}
  • ProcessWindowFunction & Reduce/Aggregate/Fold
object FlinkProcessingTimeTumblingWindowFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new UserDefineReduceFunction2,new UserDefineProcessWindowFunction2)
      .print()

    env.execute("Reduce and Process Window")
  }
}
class UserDefineProcessWindowFunction2 extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val w = context.window // window metadata
    val start = sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)

    val list = elements.toList

    println("list:"+list)


    val total = elements.map(_._2).sum

    out.collect((key+"\t["+start+"~"+end+"]",total))
  }
}

class UserDefineReduceFunction2 extends ReduceFunction[(String,Int)] {
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    println("reduce:"+v1+"\t"+v2)
    (v1._1,v2._2+v1._2)
  }
}

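The ReduceFunction runs first and aggregates incrementally as elements arrive; when the window fires, the ProcessWindowFunction receives only the reduced result and produces the final output.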

  • Per-window State in ProcessWindowFunction
object FlinkPerStateProcessingTimeFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .process(new UserDefineProcessWindowFunction3)
      .print()

    env.execute("Per-window State")
  }
}
class UserDefineProcessWindowFunction3 extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {
  val sdf = new SimpleDateFormat("HH:mm:ss")

  var wvsd:ValueStateDescriptor[Int]=_
  var gvsd:ValueStateDescriptor[Int]=_

  override def open(parameters: Configuration): Unit = {
    wvsd = new ValueStateDescriptor[Int]("ws",createTypeInformation[Int])
    gvsd = new ValueStateDescriptor[Int]("gs",createTypeInformation[Int])
  }

  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
    val w = context.window // window metadata
    val start = sdf.format(w.getStart)
    val end = sdf.format(w.getEnd)

    val list = elements.toList

    val total = list.map(_._2).sum

    val wvs = context.windowState.getState(wvsd)
    val gvs = context.globalState.getState(gvsd)

    wvs.update(wvs.value()+total)
    gvs.update(gvs.value()+total)

    println("Window Count:"+wvs.value()+"\t"+"Global Count:"+gvs.value())

    out.collect((key+"\t["+start+"~"+end+"]",total))

  }
}
  • WindowFunction (Legacy)

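In some places where a ProcessWindowFunction can be used, you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and lacks some advanced features, such as per-window keyed state.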

object FlinkProcessingTimeSessionWithWindowFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .map(word=>(word,1))
      .keyBy(t=>t._1)
      .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
      .apply(new UserDefineSessionWindowFunction)
      .print()

    env.execute("Session Window")
  }
}
class UserDefineSessionWindowFunction extends WindowFunction[(String,Int),
  (String,Int),String,TimeWindow] {
  override def apply(key: String,
                     window: TimeWindow,
                     input: Iterable[(String, Int)],
                     out: Collector[(String, Int)]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val start = sdf.format(window.getStart)
    val end = sdf.format(window.getEnd)
    val sum = input.map(_._2).sum
    out.collect((s"${key}\t${start}~${end}",sum))
  }
}
Triggers

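A trigger determines when a window is ready; once it is ready, the window function can be applied to it. Every WindowAssigner comes with a default trigger. If the default trigger does not fit your needs, you can specify a custom one.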
The trigger interface has five methods that allow a trigger to react to different events:

public abstract class Trigger<T, W extends Window> implements Serializable {

    /**
     * Called for every element that is added to a window.
     *
     * @param element   the element that arrived
     * @param timestamp the timestamp of the element that arrived
     * @param window    the window to which the element is being added
     * @param ctx       a context object, typically used to register timer
     *                  (processing-time/event-time) callbacks
     */
    public abstract TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception;

    /**
     * Callback for a processing-time timer.
     *
     * @param time   the timestamp at which the timer fired
     * @param window the window for which the timer fired
     * @param ctx    a context object, typically used to register timer
     *               (processing-time/event-time) callbacks
     */
    public abstract TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception;

    /**
     * Callback for an event-time timer.
     *
     * @param time   the timestamp at which the timer fired
     * @param window the window for which the timer fired
     * @param ctx    a context object, typically used to register timer
     *               (processing-time/event-time) callbacks
     */
    public abstract TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception;

    /**
     * Called when several windows have been merged into one window by the
     * {@link org.apache.flink.streaming.api.windowing.assigners.WindowAssigner},
     * for example by a session window assigner.
     *
     * @param window the new window that results from the merge
     * @param ctx    a context object, typically used to register timer
     *               (processing-time/event-time) callbacks and to access state
     */
    public void onMerge(W window, OnMergeContext ctx) throws Exception {
        throw new UnsupportedOperationException("This trigger does not support merging.");
    }

    /**
     * Performs any action needed when the window is removed, for example
     * clearing timers or deleting state.
     */
    public abstract void clear(W window, TriggerContext ctx) throws Exception;
}

Two things to note about the methods above:
1) The first three methods decide whether the window is ready by returning a TriggerResult.

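public enum TriggerResult {
    /**
     * Do not fire, and do not remove any elements.
     */
    CONTINUE(false, false),
    /**
     * Fire the window, then remove the elements in the window.
     */
    FIRE_AND_PURGE(true, true),
    /**
     * Fire the window, but keep its elements.
     */
    FIRE(true, false),
    /**
     * Do not fire the window; discard the window and remove its elements.
     */
    PURGE(false, true);

    private final boolean fire;  // whether to fire the window
    private final boolean purge; // whether to purge the window elements
    ...
}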

2) Any of these methods can be used to register processing-time or event-time timers for future actions.
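
Before the Flink example, the interaction between onElement and TriggerResult can be sketched in plain Scala. This is a simplified model, not the Flink API: CountTriggerSketch, its two-value TriggerResult, and run are hypothetical names, and the count is kept in a plain field rather than in Flink's partitioned state.

```scala
// Plain-Scala sketch of a count-based trigger that fires and purges.
object CountTriggerSketch {
  // Reduced model of the trigger decision: keep collecting, or fire and purge.
  sealed trait TriggerResult
  case object Continue extends TriggerResult
  case object FireAndPurge extends TriggerResult

  final class CountTrigger(maxCount: Int) {
    private var count = 0
    // Called once per element; fires when maxCount elements have been seen.
    def onElement(): TriggerResult = {
      count += 1
      if (count >= maxCount) { count = 0; FireAndPurge } else Continue
    }
  }

  // Feeds elements through the trigger and returns the batch emitted on each fire.
  def run[T](elements: Seq[T], maxCount: Int): List[List[T]] = {
    val trigger = new CountTrigger(maxCount)
    val window = scala.collection.mutable.ListBuffer.empty[T]
    val fired = scala.collection.mutable.ListBuffer.empty[List[T]]
    elements.foreach { e =>
      window += e
      trigger.onElement() match {
        case FireAndPurge =>
          fired += window.toList // "fire": hand the window contents to the function
          window.clear()         // "purge": drop the elements afterwards
        case Continue => ()
      }
    }
    fired.toList
  }

  def main(args: Array[String]): Unit = {
    // With maxCount = 2, "e" stays in the window because the trigger has not fired again.
    println(run(Seq("a", "b", "c", "d", "e"), 2))
  }
}
```

Returning a fire-without-purge result instead would leave the elements in the window, so each later firing would see the earlier elements again.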

Example
class UserDefineCountTrigger(maxCount:Long) extends Trigger[String,TimeWindow]{

  private val rsd = new ReducingStateDescriptor[Long]("rsd", new ReduceFunction[Long] {
    override def reduce(t: Long, t1: Long): Long = {
      t + t1
    }
  }, createTypeInformation[Long])


  override def onElement(element: String, timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    val count = ctx.getPartitionedState(rsd)
    count.add(1L)
    if (count.get() >= maxCount) {
      count.clear()
      TriggerResult.FIRE_AND_PURGE // fire the window and drop its elements
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onProcessingTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def onEventTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def clear(w: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    println("===clear===")
    ctx.getPartitionedState(rsd).clear()
  }
}
object FlinkCountTrigger {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("train",9999)

    text.flatMap(_.split(" "))
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .trigger(new UserDefineCountTrigger(4L))
      .apply(new UserDefineGlobalWindowFunction2)
      .print()

    env.execute("Count Trigger")
  }
}

class UserDefineGlobalWindowFunction2 extends AllWindowFunction[String,String,TimeWindow] {
  override def apply(window: TimeWindow,
                     input: Iterable[String],
                     out: Collector[String]): Unit = {
    val list = input.toList
    println(list)
    out.collect(list.mkString(",")) // emit the batch so the downstream print() has output
  }
}