Handling out-of-order event-time data in Flink with watermark + window

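Out-of-order data

Collected business data is not always guaranteed to arrive in order, and records that arrive in the wrong order can cause business-level errors. For example, take the following data: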

01,1635867066000
01,1635867067000
01,1635867068000
01,1635867069000
01,1635867070000
01,1635867071000

Worked example

POM file

<dependencies>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-java</artifactId>
			<version>1.13.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-core</artifactId>
			<version>1.13.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-java_2.12</artifactId>
			<version>1.13.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-clients_2.12</artifactId>
			<version>1.13.2</version>
		</dependency>
		
		<dependency>
		    <groupId>com.alibaba</groupId>
		    <artifactId>fastjson</artifactId>
		    <version>1.2.29</version>
		</dependency>
	</dependencies>

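Implementation

Settings explained

  • env.getConfig().setAutoWatermarkInterval(1000L): emit a watermark automatically every second
  • TumblingEventTimeWindows.of(Time.seconds(4)): tumbling window of 4 s
  • private long maxOutOfOrderness = 3000L: maximum allowed delay (out-of-orderness) of 3 s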
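
To make the interaction of these settings concrete, here is a minimal plain-Java sketch (no Flink dependency) of how the watermark trails the largest timestamp seen so far. The class and method names are hypothetical and only mirror the onEvent / onPeriodicEmit callbacks used in the program below. It computes the watermark after every record, whereas Flink only emits it once per auto-watermark interval.

```java
import java.util.ArrayList;
import java.util.List;

public class WatermarkProgressionSketch {

    /** Plain-Java stand-in for the generator's two callbacks (hypothetical, not a Flink class). */
    static final class BoundedGenerator {
        private final long maxOutOfOrderness;
        private long maxTimestamp = 0L; // same starting value as the article's generator

        BoundedGenerator(long maxOutOfOrderness) {
            this.maxOutOfOrderness = maxOutOfOrderness;
        }

        // Mirrors onEvent: remember the largest event timestamp seen so far.
        void onEvent(long eventTimestamp) {
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        }

        // Mirrors onPeriodicEmit: the watermark trails the maximum by the allowed delay.
        long currentWatermark() {
            return maxTimestamp - maxOutOfOrderness;
        }
    }

    /** Returns the watermark value after each event, in arrival order. */
    static List<Long> watermarksFor(long[] eventTimestamps, long maxOutOfOrderness) {
        BoundedGenerator generator = new BoundedGenerator(maxOutOfOrderness);
        List<Long> watermarks = new ArrayList<>();
        for (long ts : eventTimestamps) {
            generator.onEvent(ts);
            watermarks.add(generator.currentWatermark());
        }
        return watermarks;
    }

    public static void main(String[] args) {
        // The sample records from the article, in arrival order.
        long[] sample = {
            1635867066000L, 1635867067000L, 1635867068000L,
            1635867069000L, 1635867070000L, 1635867071000L
        };
        List<Long> watermarks = watermarksFor(sample, 3000L);
        for (int i = 0; i < sample.length; i++) {
            System.out.println(sample[i] + " -> watermark " + watermarks.get(i));
        }
    }
}
```

A record that arrives with a smaller timestamp than the current maximum does not move the watermark backwards, which is what lets the window wait up to 3 s for stragglers.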

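Note

// Originally this was initialized to Long.MIN_VALUE, which makes
// new Watermark(maxTimeStamp - maxOutOfOrderness) overflow the long range.
// The failure raises no error, so it is very hard to track down.
//private long maxTimeStamp = Long.MIN_VALUE;

// After tracking it down, changed to: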
 private long maxTimeStamp = 0L;
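
The overflow described in this note can be reproduced in isolation. The sketch below is plain Java with hypothetical names; it only performs the same subtraction as onPeriodicEmit. Subtracting 3000 from Long.MIN_VALUE wraps around to a value close to Long.MAX_VALUE. A watermark that large presumably makes every window look long past, so records are dropped as late without any exception, which would explain why nothing was reported.

```java
public class WatermarkOverflowSketch {
    static final long MAX_OUT_OF_ORDERNESS = 3000L;

    // The same subtraction that onPeriodicEmit performs before emitting.
    static long watermarkFor(long maxTimestamp) {
        return maxTimestamp - MAX_OUT_OF_ORDERNESS;
    }

    // An alternative initial value that leaves room for the subtraction
    // (an assumption of this sketch, not taken from the article's code).
    static long safeInitialMaxTimestamp() {
        return Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS + 1;
    }

    public static void main(String[] args) {
        long wrapped = watermarkFor(Long.MIN_VALUE);
        System.out.println("Long.MIN_VALUE - 3000 = " + wrapped);
        System.out.println("wrapped to a positive value: " + (wrapped > 0));

        // Starting from 0 simply gives a small negative watermark.
        System.out.println("0 - 3000 = " + watermarkFor(0L));

        // Starting from the offset value stays at the low end of the range.
        System.out.println("safe initial value -> " + watermarkFor(safeInitialMaxTimestamp()));

        // Math.subtractExact turns the silent wraparound into an exception.
        try {
            Math.subtractExact(Long.MIN_VALUE, MAX_OUT_OF_ORDERNESS);
        } catch (ArithmeticException e) {
            System.out.println("subtractExact reports: " + e.getMessage());
        }
    }
}
```

Starting from 0L, as the article does, works for any realistic epoch-millisecond timestamp. Starting from Long.MIN_VALUE + maxOutOfOrderness + 1 is another option if "no data seen yet" should map to the lowest possible watermark.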

import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.Iterator;

public class IoTMain4 {
	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		env.enableCheckpointing(60 * 1000, CheckpointingMode.EXACTLY_ONCE);
		// Interval for automatic watermark emission
		env.getConfig().setAutoWatermarkInterval(1000L);
		env.setParallelism(1);
		DataStreamSource<String> sourceDs = env.socketTextStream("localhost", 9000);
		SingleOutputStreamOperator<Tuple2<String, Long>> mapDs = sourceDs
				.map(new MapFunction<String, Tuple2<String, Long>>() {
					
					private static final long serialVersionUID = -5181351998053732122L;

					@Override
					public Tuple2<String, Long> map(String value) throws Exception {
						String[] split = value.split(",");
						return Tuple2.of(split[0], Long.valueOf(split[1]));
					}
				});
		// Emit watermarks periodically
		SingleOutputStreamOperator<Tuple2<String, Long>> watermarks = mapDs
				.assignTimestampsAndWatermarks(new WatermarkStrategy<Tuple2<String, Long>>() {
					
					private static final long serialVersionUID = -8873639694196414860L;

					@Override
					public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(
							WatermarkGeneratorSupplier.Context context) {
						return new WatermarkGenerator<Tuple2<String, Long>>() {
							private long maxTimeStamp = 0L;
							private long maxOutOfOrderness = 3000L; // maximum allowed delay (ms)

							@Override
							public void onEvent(Tuple2<String, Long> event, long eventTimestamp,
									WatermarkOutput output) {
								// Called once for every incoming record
								maxTimeStamp = Math.max(maxTimeStamp, event.f1);
							}

							@Override
							public void onPeriodicEmit(WatermarkOutput output) {
								// Emit the watermark periodically
								output.emitWatermark(new Watermark(maxTimeStamp - maxOutOfOrderness));
							}
						};
					}
				}.withTimestampAssigner(((element, recordTimestamp) -> element.f1)));

		watermarks.keyBy(x -> x.f0).window(TumblingEventTimeWindows.of(Time.seconds(4)))
				.apply(new WindowFunction<Tuple2<String, Long>, String, String, TimeWindow>() {
					
					private static final long serialVersionUID = 65693184846116387L;

					@Override
					public void apply(String s, TimeWindow window, Iterable<Tuple2<String, Long>> input,
							Collector<String> out) throws Exception {
						Iterator<Tuple2<String, Long>> iterator = input.iterator();
						int count = 0;
						while (iterator.hasNext()) {
							count++;
							iterator.next();
						}
						out.collect(window.getStart() + "->" + window.getEnd() + " " + s + ":" + count);
					}
				}).print();
		env.execute();
	}

}

Test

Use netcat to send the test data above to port 9000

C:\Users\xxx> nc -l -p 9000
01,1635867066000
01,1635867067000
01,1635867068000
01,1635867069000
01,1635867070000
01,1635867071000

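When the last record, 01,1635867071000, is processed, the window [1635867064000, 1635867068000) fires and no longer accepts records for that interval (how late records are handled can be customized);
The tumbling window splits every minute into 4-second intervals, each closed on the left and open on the right.
For example, minute 23:31 of 2021-11-02 (UTC+8) is divided into: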

[2021-11-02 23:31:00 , 2021-11-02 23:31:04)
[2021-11-02 23:31:04 , 2021-11-02 23:31:08)
....
[2021-11-02 23:31:56 , 2021-11-02 23:32:00)
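
The window boundaries listed above can be reproduced with a little arithmetic. The sketch below is plain Java with hypothetical names. It assumes windows aligned to the epoch with no offset and non-negative timestamps, and it formats the boundaries in UTC+8 to match the wall-clock times in this article.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TumblingWindowSketch {
    static final long WINDOW_SIZE_MS = 4000L;

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.ofHours(8));

    // Start (inclusive) of the epoch-aligned window that contains the timestamp.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // End (exclusive) of that window.
    static long windowEnd(long timestampMs) {
        return windowStart(timestampMs) + WINDOW_SIZE_MS;
    }

    // Renders an epoch-millisecond value as a UTC+8 wall-clock time.
    static String format(long epochMs) {
        return FORMAT.format(Instant.ofEpochMilli(epochMs));
    }

    public static void main(String[] args) {
        long[] sample = {
            1635867066000L, 1635867067000L, 1635867068000L,
            1635867069000L, 1635867070000L, 1635867071000L
        };
        for (long ts : sample) {
            System.out.println(ts + " -> [" + format(windowStart(ts)) + " , " + format(windowEnd(ts)) + ")");
        }
    }
}
```

Because 4 s divides 60 s evenly, the epoch-aligned windows also line up with minute boundaries, which is why every minute splits into the same fifteen intervals.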

[1635867064000, 1635867068000) corresponds to [2021-11-02 23:31:04, 2021-11-02 23:31:08);
The maximum delay is 3 s, so the window [1635867064000, 1635867068000) fires once a record arrives with a timestamp >= 1635867071000 (1635867068000 + 3000 = 1635867071000)
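
The firing condition can be checked with a small calculation. The sketch below is plain Java with hypothetical names. It assumes, as I understand Flink's default event-time trigger, that a window [start, end) fires once the watermark reaches end - 1 ms, and that with no allowed lateness a record is dropped when its window has already been passed. Strictly, a timestamp of 1635867070999 would therefore be enough, but with whole-second data 1635867071000 is the first record that qualifies.

```java
public class WindowFiringSketch {
    static final long MAX_OUT_OF_ORDERNESS_MS = 3000L;

    // Watermark produced by the article's generator for a given maximum event timestamp.
    static long watermarkFor(long maxEventTimestampMs) {
        return maxEventTimestampMs - MAX_OUT_OF_ORDERNESS_MS;
    }

    // Assumed trigger rule: the window fires once the watermark reaches end - 1.
    static boolean fires(long windowEndMs, long maxEventTimestampMs) {
        return watermarkFor(maxEventTimestampMs) >= windowEndMs - 1;
    }

    // Assumed lateness rule with zero allowed lateness: the record's window is already passed.
    static boolean isLate(long recordWindowEndMs, long currentWatermarkMs) {
        return recordWindowEndMs - 1 <= currentWatermarkMs;
    }

    public static void main(String[] args) {
        long windowEnd = 1635867068000L; // end of [1635867064000, 1635867068000)

        System.out.println("fires after 1635867070000: " + fires(windowEnd, 1635867070000L));
        System.out.println("fires after 1635867071000: " + fires(windowEnd, 1635867071000L));

        // A straggler for the fired window that arrives once the watermark is 1635867068000.
        long watermark = watermarkFor(1635867071000L);
        System.out.println("record for the fired window is late: " + isLate(windowEnd, watermark));
        System.out.println("record for the next window is late: " + isLate(1635867072000L, watermark));
    }
}
```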
