Flink Watermark

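An important concept tied to event time in Flink is the watermark, a special element generated inside the data stream. A window's computation is driven by the watermark: once the watermark reaches the window's end time (strictly, its last timestamp, end time minus 1 ms), the window fires. Watermarks can be generated in two ways, periodically or per event, as the interface shows: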

public interface WatermarkGenerator<T> {

	/**
	 * Called for every event, allows the watermark generator to examine and remember the
	 * event timestamps, or to emit a watermark based on the event itself.
	 */
	void onEvent(T event, long eventTimestamp, WatermarkOutput output);

	/**
	 * Called periodically, and might emit a new watermark, or not.
	 *
	 * <p>The interval in which this method is called and Watermarks are generated
	 * depends on {@link ExecutionConfig#getAutoWatermarkInterval()}.
	 */
	void onPeriodicEmit(WatermarkOutput output);
}

By default, onPeriodicEmit is called on a timer at the interval returned by ExecutionConfig#getAutoWatermarkInterval(); the default value is 200 ms.

ExecutionConfig.java
private long autoWatermarkInterval = 200;
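
To make the two callbacks concrete, here is a minimal plain-Java sketch of a periodic generator that tolerates a fixed amount of out-of-orderness. It is modeled on the idea behind Flink's bounded-out-of-orderness strategy but does not depend on Flink: the class name BoundedOutOfOrdernessSketch and the emitted list (a stand-in for WatermarkOutput) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for a periodic watermark generator (not a Flink class). */
class BoundedOutOfOrdernessSketch {
    private final long outOfOrdernessMillis;
    private long maxTimestamp;

    // Stand-in for WatermarkOutput: collects every watermark that gets emitted.
    final List<Long> emitted = new ArrayList<>();

    BoundedOutOfOrdernessSketch(long outOfOrdernessMillis) {
        this.outOfOrdernessMillis = outOfOrdernessMillis;
        // Start low, but high enough that the subtraction below cannot underflow.
        this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
    }

    /** Per-event hook: only remembers the largest timestamp seen so far. */
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    /** Periodic hook: emits "largest timestamp - allowed delay - 1 ms". */
    void onPeriodicEmit() {
        emitted.add(maxTimestamp - outOfOrdernessMillis - 1);
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(2000L);
        generator.onEvent(1615539510756L);
        generator.onEvent(1615539509000L); // out of order, does not lower the maximum
        generator.onPeriodicEmit();        // in Flink this call would come from the timer
        System.out.println(generator.emitted);
    }
}
```

In a real job the timer that drives onPeriodicEmit fires at the auto-watermark interval; in this sketch the caller plays that role.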

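In real streaming scenarios, network conditions often keep data from reaching the processing engine on time, so we have to decide how to handle delayed data. Watermarks are designed for exactly this: they let you configure how late the data for each window may arrive. Let's look at a few cases: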

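Without data delay
Five records with event times 16:49:10 through 16:49:14 all fall into the window [16:49:10, 16:49:15), and the result is correct.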
(TextElement(1,bob,1615538950839,21),Fri Mar 12 16:49:10 CST 2021)
(TextElement(1,bob,1615538951844,22),Fri Mar 12 16:49:11 CST 2021)
(TextElement(1,bob,1615538952848,23),Fri Mar 12 16:49:12 CST 2021)
(TextElement(1,bob,1615538953849,24),Fri Mar 12 16:49:13 CST 2021)
(TextElement(1,bob,1615538954854,25),Fri Mar 12 16:49:14 CST 2021)
Window start: 20210312,16:49:10 Window end: 20210312,16:49:15 Data: TextElement(1,bob,1615538950839,21) EventTime: 20210312,16:49:10 Watermark: 20210312,16:49:15
Window start: 20210312,16:49:10 Window end: 20210312,16:49:15 Data: TextElement(1,bob,1615538951844,22) EventTime: 20210312,16:49:11 Watermark: 20210312,16:49:15
Window start: 20210312,16:49:10 Window end: 20210312,16:49:15 Data: TextElement(1,bob,1615538952848,23) EventTime: 20210312,16:49:12 Watermark: 20210312,16:49:15
Window start: 20210312,16:49:10 Window end: 20210312,16:49:15 Data: TextElement(1,bob,1615538953849,24) EventTime: 20210312,16:49:13 Watermark: 20210312,16:49:15
Window start: 20210312,16:49:10 Window end: 20210312,16:49:15 Data: TextElement(1,bob,1615538954854,25) EventTime: 20210312,16:49:14 Watermark: 20210312,16:49:15
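
The window boundaries in this output can be reproduced with a little arithmetic. The sketch below assumes 5-second tumbling windows with no offset and non-negative timestamps; the class and method names are made up for illustration, and the firing rule is a simplified version of what an event-time trigger checks.

```java
/** Window arithmetic for tumbling event-time windows (illustrative, not Flink code). */
class TumblingWindowSketch {

    /** Start of the tumbling window (offset 0) that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    /** A window fires once the watermark reaches its last timestamp (end - 1 ms). */
    static boolean shouldFire(long windowEndMillis, long watermarkMillis) {
        return watermarkMillis >= windowEndMillis - 1;
    }

    public static void main(String[] args) {
        long size = 5000L;
        long[] eventTimes = {
            1615538950839L, 1615538951844L, 1615538952848L,
            1615538953849L, 1615538954854L
        };
        for (long t : eventTimes) {
            long start = windowStart(t, size);
            System.out.println(t + " -> [" + start + ", " + (start + size) + ")");
        }
    }
}
```

All five timestamps map to the same window, starting at 1615538950000 (16:49:10 CST) and ending at 1615538955000 (16:49:15 CST).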
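With data delay

What should we do when data is delayed? A Flink watermark lets the window compute a little later, and the delay is configured through WatermarkStrategy.

As the output shows, the window fired two seconds late (compare the current time with the window end time): current time 20210312,16:58:37, window start 20210312,16:58:30, window end 20210312,16:58:35.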
(TextElement(1,bob,1615539510756,1),Fri Mar 12 16:58:30 CST 2021)
(TextElement(1,bob,1615539511760,2),Fri Mar 12 16:58:31 CST 2021)
(TextElement(1,bob,1615539512764,3),Fri Mar 12 16:58:32 CST 2021)
(TextElement(1,bob,1615539513765,4),Fri Mar 12 16:58:33 CST 2021)
(TextElement(1,bob,1615539514769,5),Fri Mar 12 16:58:34 CST 2021)
(TextElement(1,bob,1615539515774,6),Fri Mar 12 16:58:35 CST 2021)
(TextElement(1,bob,1615539516778,7),Fri Mar 12 16:58:36 CST 2021)
(TextElement(1,bob,1615539517780,8),Fri Mar 12 16:58:37 CST 2021)
Current time: 20210312,16:58:37 Window start: 20210312,16:58:30 Window end: 20210312,16:58:35 Data: TextElement(1,bob,1615539510756,1) EventTime: 20210312,16:58:30 Watermark: 20210312,16:58:35
Current time: 20210312,16:58:37 Window start: 20210312,16:58:30 Window end: 20210312,16:58:35 Data: TextElement(1,bob,1615539511760,2) EventTime: 20210312,16:58:31 Watermark: 20210312,16:58:35
Current time: 20210312,16:58:37 Window start: 20210312,16:58:30 Window end: 20210312,16:58:35 Data: TextElement(1,bob,1615539512764,3) EventTime: 20210312,16:58:32 Watermark: 20210312,16:58:35
Current time: 20210312,16:58:37 Window start: 20210312,16:58:30 Window end: 20210312,16:58:35 Data: TextElement(1,bob,1615539513765,4) EventTime: 20210312,16:58:33 Watermark: 20210312,16:58:35
Current time: 20210312,16:58:37 Window start: 20210312,16:58:30 Window end: 20210312,16:58:35 Data: TextElement(1,bob,1615539514769,5) EventTime: 20210312,16:58:34 Watermark: 20210312,16:58:35
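
The two-second shift can be replayed without Flink. The sketch below assumes the job tolerated 2 seconds of out-of-orderness (in Flink this kind of bound is typically set with WatermarkStrategy.forBoundedOutOfOrderness(Duration)) and, as a simplification, recomputes the watermark after every event rather than on the periodic timer. The class and method names are illustrative.

```java
/** Replays event times to find when a window fires under a given delay (illustrative). */
class DelayedFiringSketch {

    /** Index of the first event whose watermark closes the window, or -1 if none does. */
    static int firstFiringIndex(long[] eventTimes, long windowEndMillis, long delayMillis) {
        long maxTimestamp = Long.MIN_VALUE + delayMillis + 1;
        for (int i = 0; i < eventTimes.length; i++) {
            maxTimestamp = Math.max(maxTimestamp, eventTimes[i]);
            // Watermark = largest timestamp seen - allowed delay - 1 ms.
            long watermark = maxTimestamp - delayMillis - 1;
            if (watermark >= windowEndMillis - 1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] eventTimes = {
            1615539510756L, 1615539511760L, 1615539512764L, 1615539513765L,
            1615539514769L, 1615539515774L, 1615539516778L, 1615539517780L
        };
        long windowEnd = 1615539515000L; // 16:58:35 CST
        System.out.println(firstFiringIndex(eventTimes, windowEnd, 2000L));
        System.out.println(firstFiringIndex(eventTimes, windowEnd, 0L));
    }
}
```

With the assumed 2-second delay the window only closes at the eighth record (event time 16:58:37), which matches the current time in the output; with no delay it would already close at the sixth record (16:58:35).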
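Beyond the allowed delay

Records that arrive beyond the allowed delay are dropped. Here the record with event time 18:13:49 was dropped: by the time it arrived, the watermark had already passed the end of its window.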

(TextElement(1,bob,1615544030000,21),Fri Mar 12 18:13:50 CST 2021)
**Late element, belongs to the previous window** (TextElement(1,bob,1615544029000,21),Fri Mar 12 18:13:49 CST 2021)
(TextElement(1,bob,1615544033000,21),Fri Mar 12 18:13:53 CST 2021)
(TextElement(1,bob,1615544034000,22),Fri Mar 12 18:13:54 CST 2021)
(TextElement(1,bob,1615544035000,23),Fri Mar 12 18:13:55 CST 2021)
(TextElement(1,bob,1615544036000,24),Fri Mar 12 18:13:56 CST 2021)
(TextElement(1,bob,1615544037000,25),Fri Mar 12 18:13:57 CST 2021)
Current time: 20210312,18:13:58 Window start: 20210312,18:13:50 Window end: 20210312,18:13:55 Data: TextElement(1,bob,1615544030000,18) EventTime: 20210312,18:13:50 Watermark: 20210312,18:13:54
Current time: 20210312,18:13:58 Window start: 20210312,18:13:50 Window end: 20210312,18:13:55 Data: TextElement(1,bob,1615544031000,19) EventTime: 20210312,18:13:51 Watermark: 20210312,18:13:54
Current time: 20210312,18:13:58 Window start: 20210312,18:13:50 Window end: 20210312,18:13:55 Data: TextElement(1,bob,1615544032000,20) EventTime: 20210312,18:13:52 Watermark: 20210312,18:13:54
Current time: 20210312,18:13:58 Window start: 20210312,18:13:50 Window end: 20210312,18:13:55 Data: TextElement(1,bob,1615544030000,21) EventTime: 20210312,18:13:50 Watermark: 20210312,18:13:54
Current time: 20210312,18:13:58 Window start: 20210312,18:13:50 Window end: 20210312,18:13:55 Data: TextElement(1,bob,1615544033000,21) EventTime: 20210312,18:13:53 Watermark: 20210312,18:13:54
Current time: 20210312,18:13:58 Window start: 20210312,18:13:50 Window end: 20210312,18:13:55 Data: TextElement(1,bob,1615544034000,22) EventTime: 20210312,18:13:54 Watermark: 20210312,18:13:54
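
Whether a record counts as too late depends on where the watermark stands when the record arrives. The sketch below shows a simplified form of that check for 5-second tumbling windows. The watermark values in main are hypothetical, since the listing above does not show which watermark was current when the 18:13:49 record arrived, and the allowedLatenessMillis parameter is included only to show where extra lateness would enter the comparison.

```java
/** Simplified lateness check for tumbling event-time windows (illustrative). */
class LateElementSketch {

    /**
     * A record is too late when the watermark has already reached the last
     * timestamp of its window plus any extra allowed lateness.
     */
    static boolean isLate(long eventTimeMillis, long windowSizeMillis,
                          long allowedLatenessMillis, long watermarkMillis) {
        long windowStart = eventTimeMillis - (eventTimeMillis % windowSizeMillis);
        long lastTimestamp = windowStart + windowSizeMillis - 1;
        return lastTimestamp + allowedLatenessMillis <= watermarkMillis;
    }

    public static void main(String[] args) {
        long lateRecord = 1615544029000L; // event time 18:13:49, window [18:13:45, 18:13:50)

        // Hypothetical watermark that has not yet passed the window: the record is kept.
        System.out.println(isLate(lateRecord, 5000L, 0L, 1615544027999L));

        // Hypothetical watermark that has reached 18:13:49.999: the record is dropped.
        System.out.println(isLate(lateRecord, 5000L, 0L, 1615544029999L));
    }
}
```

The same record can therefore be accepted or dropped depending only on how far the watermark has advanced, which is why the configured delay matters.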
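Watermarks with multiple parallel instances

With a parallelism greater than one, an operator receives watermarks from several parallel inputs and takes the smallest of them as its own watermark. In the result below, the watermark is at Thu Mar 18 09:46:59.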

***(75,TextElement(1,bob,1616032019000,0),Thu Mar 18 09:46:59 CST 2021)
(76,TextElement(1,bob,1616032020000,1),Thu Mar 18 09:47:00 CST 2021)
(75,TextElement(1,bob,1616032021000,2),Thu Mar 18 09:47:01 CST 2021)
(76,TextElement(1,bob,1616032022000,3),Thu Mar 18 09:47:02 CST 2021)
(75,TextElement(1,bob,1616032020000,4),Thu Mar 18 09:47:00 CST 2021)
(76,TextElement(1,bob,1616032019000,4),Thu Mar 18 09:46:59 CST 2021)
(75,TextElement(1,bob,1616032023000,4),Thu Mar 18 09:47:03 CST 2021)
09:47:03,791 WARN  Window(TumblingEventTimeWindows(5000), EventTimeTrigger, ScalaProcessWindowFunctionWrapper) (1/1)#0 [] - 
Current time: 20210318,09:47:03
 Window start: 20210318,09:46:55 Window end: 20210318,09:47:00 
 ***Watermark:20210318,09:46:59 
 Window(TumblingEventTimeWindows(5000), EventTimeTrigger, ScalaProcessWindowFunctionWrapper)
 Data: TextElement(1,bob,1616032019000,0)
 EventTime:20210318,09:46:59
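
The minimum rule can be sketched in plain Java. The example assumes that the first field of each tuple above (75 or 76) identifies the parallel source instance and that each instance tolerates 2 seconds of out-of-orderness; both are assumptions, not something the output states. The class and method names are illustrative.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Combines per-channel watermarks the way a multi-input operator would (illustrative). */
class MinWatermarkSketch {
    private final Map<Integer, Long> channelWatermarks = new HashMap<>();

    MinWatermarkSketch(int... channelIds) {
        for (int id : channelIds) {
            channelWatermarks.put(id, Long.MIN_VALUE); // nothing received yet
        }
    }

    /** Records a channel's watermark (never moving backwards) and returns the minimum. */
    long onWatermark(int channelId, long watermark) {
        channelWatermarks.merge(channelId, watermark, Long::max);
        return Collections.min(channelWatermarks.values());
    }

    /** Replays (channel, event time) pairs and returns the final combined watermark. */
    static long replay(int[] channels, long[] eventTimes, long delayMillis) {
        MinWatermarkSketch operator = new MinWatermarkSketch(75, 76);
        Map<Integer, Long> maxTimestamps = new HashMap<>();
        long combined = Long.MIN_VALUE;
        for (int i = 0; i < channels.length; i++) {
            long max = maxTimestamps.merge(channels[i], eventTimes[i], Long::max);
            combined = operator.onWatermark(channels[i], max - delayMillis - 1);
        }
        return combined;
    }

    public static void main(String[] args) {
        int[] channels = {75, 76, 75, 76, 75, 76, 75};
        long[] eventTimes = {
            1616032019000L, 1616032020000L, 1616032021000L, 1616032022000L,
            1616032020000L, 1616032019000L, 1616032023000L
        };
        System.out.println(replay(channels, eventTimes, 2000L));
    }
}
```

Under these assumptions the combined watermark after the last record is 1616032019999, which falls in second 09:46:59 and is just enough to close the window ending at 09:47:00.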