Re-collecting Late Data and Merging It with On-Time Data
1. Test Data
1000,hadoop
3000,spark
4999,flink
5999,flink
6999,a
9999,b
10999,c
11999,a
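Each line has the form `eventTimeMillis,word`. A minimal plain-Java sketch of how the job below splits a line (the method names here are illustrative, not taken from the job):

```java
public class ParseDemo {
    // Mirrors extractTimestamp() in the job below: the first field is the event time.
    static long eventTime(String line) { return Long.parseLong(line.split(",")[0]); }

    // Mirrors flatMap() in the job below: the second field is the word that gets counted.
    static String word(String line) { return line.split(",")[1]; }

    public static void main(String[] args) {
        System.out.println(eventTime("1000,hadoop") + " -> " + word("1000,hadoop")); // 1000 -> hadoop
    }
}
```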
2. Code
1. Extract the event-time field as the watermark source and set the allowed out-of-orderness (delay).
2. Map each record to (word, 1) and aggregate with sum.
3. Assign 5-second tumbling windows and tag late records for side output:
   keyed.window(TumblingEventTimeWindows.of(Time.seconds(5)))
        // route late records to the side output
        .sideOutputLateData(lateDataTag)
4. Retrieve the late records from the side output.
5. Union the late stream with the on-time stream and aggregate them together.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
public class WindowLateDataDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<String> lines = env.socketTextStream("linux01", 8888);
        // 1. Extract the event-time field as the watermark source and set the allowed out-of-orderness
        SingleOutputStreamOperator<String> timeStamp =
                lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(2)) {
                    @Override
                    public long extractTimestamp(String element) {
                        return Long.parseLong(element.split(",")[0]);
                    }
                });
        // 2. Map each record to (word, 1)
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne =
                timeStamp.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String input, Collector<Tuple2<String, Integer>> out) throws Exception {
                        String word = input.split(",")[1];
                        out.collect(Tuple2.of(word, 1));
                    }
                });
        KeyedStream<Tuple2<String, Integer>, Tuple> keyed = wordAndOne.keyBy(0);
        // 3. Assign 5-second tumbling windows and tag late records for side output
        OutputTag<Tuple2<String, Integer>> lateDataTag = new OutputTag<Tuple2<String, Integer>>("late-data") {};
        SingleOutputStreamOperator<Tuple2<String, Integer>> summed =
                keyed.window(TumblingEventTimeWindows.of(Time.seconds(5)))
                        // route late records to the side output
                        .sideOutputLateData(lateDataTag)
                        .sum(1);
        // 4. Retrieve the late records from the side output
        DataStream<Tuple2<String, Integer>> lateDataStream = summed.getSideOutput(lateDataTag);
        // 5. Union the late stream with the on-time stream, then aggregate them together
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = summed.union(lateDataStream).keyBy(0).sum(1);
        result.print();
        env.execute("WindowLateDataDemo");
    }
}
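The routing that `sideOutputLateData` performs can be modeled outside Flink: with an allowed lateness of 0, a record is late when its window's maximum timestamp (window end − 1 ms) is already at or below the current watermark. A simplified plain-Java sketch of that check (this is not Flink's internal code, and the out-of-order input sequence is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class LateRoutingDemo {
    static final long WINDOW_SIZE = 5000;      // tumbling window length, ms
    static final long OUT_OF_ORDERNESS = 2000; // allowed delay, ms

    // Returns the timestamps that a window operator would route to the side output.
    static List<Long> routeLate(long[] timestamps) {
        long watermark = Long.MIN_VALUE;
        List<Long> late = new ArrayList<>();
        for (long ts : timestamps) {
            long windowEnd = ts - (ts % WINDOW_SIZE) + WINDOW_SIZE; // end of the record's window
            if (windowEnd - 1 <= watermark) {                       // window already fired?
                late.add(ts);                                       // would go to sideOutputLateData
            }
            watermark = Math.max(watermark, ts - OUT_OF_ORDERNESS); // advance the watermark
        }
        return late;
    }

    public static void main(String[] args) {
        // 2000 arrives after 6999 has pushed the watermark to 4999,
        // which already fired the [0, 5000) window, so 2000 is late.
        System.out.println(routeLate(new long[]{1000, 3000, 6999, 2000})); // [2000]
    }
}
```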
3. Output Analysis
4> (hadoop,1)
1> (spark,1)
4> (flink,1)
1> (b,1)
4> (flink,2)
3> (a,1)
- The window size is 5000 ms.
- The allowed delay (out-of-orderness) is 2000 ms.
- So the watermark = current max event time − 2000 ms, and a window fires when the watermark reaches its end − 1 ms, i.e. at 4999, 9999, 14999, …
Key point:
A window still emits only the records that fall inside it; the delay only postpones when the window fires, it does not change the window boundaries.
- So when 6999,a arrives, it only triggers the previous window ([0, 5000)); the record itself still belongs to the next window.
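The trigger arithmetic above can be checked with plain Java (a sketch assuming the bounded-out-of-orderness formula watermark = maxEventTime − 2000 and 5000 ms tumbling windows; the helper names are invented):

```java
public class TriggerMathDemo {
    static final long WINDOW = 5000, DELAY = 2000;

    // Watermark produced after seeing an event with this max timestamp.
    static long watermark(long maxEventTime) { return maxEventTime - DELAY; }

    // Start of the tumbling window an event-time timestamp falls into.
    static long windowStart(long ts) { return ts - (ts % WINDOW); }

    // Does this watermark trigger the window ending at windowEnd?
    static boolean fires(long wm, long windowEnd) { return wm >= windowEnd - 1; }

    public static void main(String[] args) {
        // 6999,a: its watermark of 4999 fires the [0, 5000) window ...
        System.out.println(fires(watermark(6999), 5000)); // true
        // ... but the record itself belongs to the next window, [5000, 10000)
        System.out.println(windowStart(6999)); // 5000
    }
}
```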