Preface
These notes summarize what I learned from "flink入门与实战" and the 网易云课堂 course "flink大数据项目实战".
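I. Time
1. A stream has three notions of time: Event Time, Ingestion Time, and Processing Time.
2. How the three relate: Event Time is when the event occurred at its source, Ingestion Time is when the event enters Flink at the source operator, and Processing Time is the wall-clock time of the machine running the operator that processes the event.
3. Setting the time characteristic: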
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
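II. How Flink handles out-of-order data
Events often reach Flink out of order. A window cannot wait indefinitely for stragglers, so some mechanism must guarantee that the window computation is triggered once a certain point in time has passed. That mechanism is the watermark.
Watermarks and timestamps only need to be specified when using Event Time, and both are expressed in milliseconds.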
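To make the triggering rule concrete, here is a minimal plain-Java simulation. It is not Flink code: the class and method names are hypothetical, and it assumes tumbling windows plus a watermark that trails the largest timestamp seen so far by a fixed bound. A window fires once the watermark reaches the last timestamp that belongs to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java sketch (no Flink dependency) of watermark-triggered window firing. */
final class WatermarkWindowSketch {

    /**
     * Consumes event timestamps (ms) in arrival order and returns the end
     * timestamps of the tumbling windows that fired along the way.
     */
    static List<Long> firedWindowEnds(long[] eventTimesMs, long windowSizeMs, long maxOutOfOrdernessMs) {
        TreeSet<Long> openWindowStarts = new TreeSet<>();
        List<Long> firedEnds = new ArrayList<>();
        long maxTimestamp = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE;
        for (long t : eventTimesMs) {
            long windowStart = t - Math.floorMod(t, windowSizeMs);
            long lastTimestampInWindow = windowStart + windowSizeMs - 1;
            // An element whose window was already passed by the watermark is late and is ignored here.
            if (lastTimestampInWindow > watermark) {
                openWindowStarts.add(windowStart);
            }
            // The watermark trails the largest timestamp seen so far and never moves backwards.
            maxTimestamp = Math.max(maxTimestamp, t);
            watermark = Math.max(watermark, maxTimestamp - maxOutOfOrdernessMs);
            // Fire every open window whose last timestamp the watermark has reached.
            while (!openWindowStarts.isEmpty()
                    && openWindowStarts.first() + windowSizeMs - 1 <= watermark) {
                firedEnds.add(openWindowStarts.pollFirst() + windowSizeMs);
            }
        }
        return firedEnds;
    }
}
```

With 10-second windows and a bound of 3500 ms, an event at 12000 ms does not yet close the window [0, 10000) because the watermark is only 8500, so a straggler at 9000 ms is still counted. The window fires when an event at 14000 ms pushes the watermark to 10500.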
2.1 watermark
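1. Scenarios:
Watermarks in an in-order stream:
Watermarks in an out-of-order stream:
Watermarks in a stream with multiple parallel instances:
2. Watermark alignment across parallel inputs:
When an operator has multiple inputs, its watermark is the minimum of the watermarks received on all of its inputs.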
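The alignment rule can be sketched in plain Java as follows. This is an illustration only, not Flink's internal implementation; the class name and the onWatermark method are hypothetical. Each input channel remembers its latest watermark, and the operator advances to the minimum across channels.

```java
import java.util.Arrays;

/** Plain-Java sketch of min-based watermark alignment across input channels. */
final class WatermarkAlignmentSketch {
    private final long[] channelWatermarks;
    private long operatorWatermark = Long.MIN_VALUE;

    /** Assumes at least one input channel. */
    WatermarkAlignmentSketch(int numInputChannels) {
        channelWatermarks = new long[numInputChannels];
        // Until a channel has reported a watermark, it holds the operator back.
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one channel and returns the operator's watermark afterwards. */
    long onWatermark(int channel, long watermark) {
        // A channel's watermark never moves backwards.
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermark);
        long minAcrossChannels = Arrays.stream(channelWatermarks).min().getAsLong();
        // The operator's watermark is the minimum over all channels, and it is also monotonic.
        operatorWatermark = Math.max(operatorWatermark, minAcrossChannels);
        return operatorWatermark;
    }
}
```

The slowest input therefore decides how far event time can advance in the operator: a channel reporting 150 has no effect while another channel is still at 100.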
2.2 How watermarks are generated
1. Where they are generated:
a. Directly at the source, as soon as data is received
b. After an operation such as map/filter (timestamp assigner/watermark generator)
Example code:
package com.zzh.testWindow;
import com.zzh.testJoin.Transcript;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.sql.Timestamp;
public class testWindow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        // Use event time as the time characteristic
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStream<Transcript> dataStream = env.fromElements(getTranscriptDataSource());
        // Assign timestamps and watermarks after an operator (here, after the filter)
        DataStream<Transcript> dataStreamWithTimeStamp = dataStream.filter(new FilterFunction<Transcript>() {
            @Override
            public boolean filter(Transcript transcript) throws Exception {
                // Keep only scores above 60
                return transcript.getScore() > 60;
            }
        }).assignTimestampsAndWatermarks(new MyWaterMark(3500));
        dataStreamWithTimeStamp.timeWindowAll(Time.seconds(10)).reduce(new ReduceFunction<Transcript>() {
            @Override
            public Transcript reduce(Transcript lastData, Transcript newData) throws Exception {
                System.out.println(lastData);
                System.out.println(newData);
                System.out.println("=====================");
                // Merge the two records by averaging their scores
                lastData.setScore((lastData.getScore() + newData.getScore()) / 2);
                return lastData;
            }
        }).print();
        env.execute("finish");
    }
    // Sample data; the last field is the event timestamp in milliseconds
    private static Transcript[] getTranscriptDataSource() {
        return new Transcript[]{
            new Transcript("1", "张三", "语文", 100, Timestamp.valueOf("2020-07-01 11:01:01").getTime()),
            new Transcript("2", "李四", "语文", 78, Timestamp.valueOf("2020-07-01 11:03:01").getTime()),
            new Transcript("3", "王五", "语文", 99, Timestamp.valueOf("2020-07-01 11:03:04").getTime()),
            new Transcript("4", "赵六", "语文", 81, Timestamp.valueOf("2020-07-01 11:03:09").getTime()),
            new Transcript("5", "钱七", "语文", 59, Timestamp.valueOf("2020-07-01 11:01:10").getTime()),
            new Transcript("6", "马二", "语文", 97, Timestamp.valueOf("2020-07-01 11:01:12").getTime()),
        };
    }
}
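2. Generation strategies:
a. with periodic watermarks
Overview:
Flink periodically calls getCurrentWatermark(); if the returned watermark is non-null and greater than the previous one, it is emitted downstream.
Characteristics:
- Triggered periodically
- A watermark is injected into the stream automatically at a fixed interval
- A maximum allowed out-of-orderness can be defined
- Implement the AssignerWithPeriodicWatermarks interface
- The emission interval is configurable (in milliseconds):
env.getConfig().setAutoWatermarkInterval(200L); // e.g. every 200 ms
Example code: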
package com.zzh.testWindow;
import com.zzh.testJoin.Transcript;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import javax.annotation.Nullable;
public class MyWaterMark implements AssignerWithPeriodicWatermarks<Transcript> {
    // Largest event timestamp seen so far (ms)
    private long currentMaxTimeStamp;
    // Maximum allowed out-of-orderness (ms)
    private final long timeBounded;
    public MyWaterMark(long timeBounded) {
        this.timeBounded = timeBounded;
    }
    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        // Flink emits the watermark only when it is greater than the previous one,
        // so return the largest timestamp seen so far minus the allowed out-of-orderness
        return new Watermark(this.currentMaxTimeStamp - this.timeBounded);
    }
    @Override
    public long extractTimestamp(Transcript transcript, long previousElementTimestamp) {
        // Track the largest timestamp seen so far
        long currentTimeStamp = transcript.getTime();
        this.currentMaxTimeStamp = Math.max(currentTimeStamp, this.currentMaxTimeStamp);
        return currentTimeStamp;
    }
}
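The periodic rule described above (emit only when the new watermark is greater than the previous one) can be traced with a small plain-Java sketch. It does not use Flink: the class name and the onEvent/onPeriodicTick methods are hypothetical stand-ins for the assigner and for the timer that Flink drives at the auto-watermark interval.

```java
/** Plain-Java sketch of periodic watermark emission with a bounded out-of-orderness. */
final class PeriodicEmitSketch {
    private final long maxOutOfOrdernessMs;
    private long currentMaxTimestamp = Long.MIN_VALUE;
    private long lastEmittedWatermark = Long.MIN_VALUE;

    PeriodicEmitSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Called for every element: only tracks the largest timestamp seen so far. */
    void onEvent(long eventTimestampMs) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestampMs);
    }

    /** Called on every timer tick: returns the watermark to emit, or null if it has not advanced. */
    Long onPeriodicTick() {
        if (currentMaxTimestamp == Long.MIN_VALUE) {
            return null; // no element seen yet
        }
        long candidate = currentMaxTimestamp - maxOutOfOrdernessMs;
        if (candidate <= lastEmittedWatermark) {
            return null; // not greater than the previous watermark, so nothing is sent
        }
        lastEmittedWatermark = candidate;
        return candidate;
    }
}
```

With a bound of 3500 ms, an element at 10000 ms yields the watermark 6500 on the next tick. An out-of-order element at 8000 ms does not change the maximum, so the following tick emits nothing.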
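b. with punctuated watermarks
Characteristics:
- Watermark generation is triggered by specific events
- Every element is checked to decide whether a watermark should be generated
- Implement the AssignerWithPunctuatedWatermarks interface
Example code: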
package com.zzh.testWindow;
import com.zzh.testJoin.Transcript;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import javax.annotation.Nullable;
public class PunctuatedWaterMark implements AssignerWithPunctuatedWatermarks<Transcript> {
    @Nullable
    @Override
    public Watermark checkAndGetNextWatermark(Transcript transcript, long extractedTimestamp) {
        // extractedTimestamp is the value extractTimestamp() returned for this element;
        // returning null means no watermark is generated for it
        return transcript.getTime() > 0 ? new Watermark(extractedTimestamp) : null;
    }
    @Override
    public long extractTimestamp(Transcript transcript, long previousElementTimestamp) {
        return transcript.getTime();
    }
}
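The example above produces a watermark for practically every element. A punctuated generator more typically reacts only to special marker elements, since every watermark causes some work downstream. The following plain-Java sketch shows that idea without Flink. The Event type and its endOfBatch flag are hypothetical and stand in for whatever marker the real data carries.

```java
/** Plain-Java sketch of punctuated watermark generation driven by marker elements. */
final class PunctuatedEmitSketch {

    /** Hypothetical element type: endOfBatch plays the role of the punctuation marker. */
    static final class Event {
        final long timestampMs;
        final boolean endOfBatch;

        Event(long timestampMs, boolean endOfBatch) {
            this.timestampMs = timestampMs;
            this.endOfBatch = endOfBatch;
        }
    }

    private long lastEmittedWatermark = Long.MIN_VALUE;

    /** Inspected for every element: returns a watermark only for markers that advance event time. */
    Long onEvent(Event event) {
        if (!event.endOfBatch || event.timestampMs <= lastEmittedWatermark) {
            return null; // ordinary element, or a marker that would move the watermark backwards
        }
        lastEmittedWatermark = event.timestampMs;
        return event.timestampMs;
    }
}
```

Ordinary elements pass through without generating a watermark, and a marker whose timestamp is not greater than the last emitted watermark is ignored as well.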
III. Predefined Timestamp Extractors and Watermark Emitters
3.1 For monotonically increasing timestamps
.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Transcript>() {
    @Override
    public long extractAscendingTimestamp(Transcript element) {
        return element.getTime();
    }
});
3.2 For a fixed maximum delay
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Transcript>(Time.seconds(10)) {
    @Override
    public long extractTimestamp(Transcript element) {
        return element.getTime();
    }
});
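3.3 Handling late data
1. allowedLateness() sets how long after the watermark has passed the end of a window late elements are still accepted
2. sideOutputLateData() routes elements that arrive even later to a side output identified by an OutputTag, so they are not dropped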
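How these two settings interact can be sketched in plain Java. This is a simplified model rather than Flink code: the class, enum, and method names are hypothetical, and tumbling windows are assumed. An element is on time while the watermark has not reached the last timestamp of its window, still accepted while the watermark is within the allowed lateness after that, and sent to the side output afterwards.

```java
/** Plain-Java sketch of how an element is treated relative to the watermark and allowed lateness. */
final class LatenessSketch {

    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    static Outcome classify(long eventTimestampMs, long currentWatermarkMs,
                            long windowSizeMs, long allowedLatenessMs) {
        long windowStart = eventTimestampMs - Math.floorMod(eventTimestampMs, windowSizeMs);
        long lastTimestampInWindow = windowStart + windowSizeMs - 1;
        if (lastTimestampInWindow > currentWatermarkMs) {
            return Outcome.ON_TIME; // the window has not fired yet
        }
        if (lastTimestampInWindow + allowedLatenessMs > currentWatermarkMs) {
            return Outcome.LATE_BUT_ACCEPTED; // the window fired, but its state is still kept
        }
        return Outcome.SIDE_OUTPUT; // too late for the window, routed to the late-data side output
    }
}
```

For a 10-second window with 10 seconds of allowed lateness, an element at 9000 ms is on time while the watermark is 8500, late but accepted at watermark 12000, and goes to the side output once the watermark reaches 19999.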
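Example code:
// The trailing {} creates an anonymous subclass so that Flink can capture the element type
OutputTag<Transcript> lateOutputTag = new OutputTag<Transcript>("late-data") {};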
dataStreamWithTimeStamp.timeWindowAll(Time.seconds(10)).
allowedLateness(Time.seconds(10)).
sideOutputLateData(lateOutputTag).