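1. Trigger
A Trigger determines when a window is ready to be processed by the window function. Every WindowAssigner comes with a default Trigger. If the default trigger does not fit your needs, you can specify a custom trigger with trigger(...).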
WindowAssigner | Default trigger |
---|---|
global window | NeverTrigger |
event-time window | EventTimeTrigger |
processing-time window | ProcessingTimeTrigger |
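The Trigger interface has five methods that allow a Trigger to react to different events:
- onElement(): called for each element that is added to a window.
- onEventTime(): called when a registered event-time timer fires.
- onProcessingTime(): called when a registered processing-time timer fires.
- onMerge(): relevant for stateful triggers; it merges the state of two triggers when their corresponding windows merge, for example when using session windows.
- clear(): performs any action needed when the corresponding window is removed.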
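To make the return values of these callbacks concrete, here is a minimal plain-Java sketch of how a window buffer reacts to each result. It does not use Flink: `Result` and `WindowBuffer` are hypothetical stand-ins for Flink's TriggerResult and the window contents. In Flink, CONTINUE does nothing, FIRE evaluates the window and keeps its elements, PURGE drops the elements without evaluating, and FIRE_AND_PURGE evaluates and then drops them.

```java
import java.util.ArrayList;
import java.util.List;

final class TriggerResultDemo {
    // Hypothetical stand-in for Flink's TriggerResult (not the real class).
    enum Result {
        CONTINUE(false, false), FIRE(true, false), PURGE(false, true), FIRE_AND_PURGE(true, true);

        final boolean fire;
        final boolean purge;

        Result(boolean fire, boolean purge) {
            this.fire = fire;
            this.purge = purge;
        }
    }

    // Hypothetical window buffer; the "window function" here just sums the elements.
    static final class WindowBuffer {
        final List<Integer> elements = new ArrayList<>();
        final List<Integer> emitted = new ArrayList<>();

        void add(int value) {
            elements.add(value);
        }

        void apply(Result result) {
            if (result.fire) {
                int sum = 0;
                for (int e : elements) {
                    sum += e;
                }
                emitted.add(sum);
            }
            if (result.purge) {
                elements.clear();
            }
        }
    }
}
```

After FIRE the buffered elements are still there, so the next firing sees them again; after FIRE_AND_PURGE the next firing starts from an empty buffer.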
Example (1): fire once the accumulated value reaches 2
new Trigger<Tuple3<String, Long, Integer>, TimeWindow>() {
    private static final long serialVersionUID = 2742133264310093792L;

    // Per-key, per-window running sum of the third tuple field
    ValueStateDescriptor<Integer> sumStateDescriptor = new ValueStateDescriptor<Integer>("sum", Integer.class);

    @Override
    public TriggerResult onElement(Tuple3<String, Long, Integer> element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        ValueState<Integer> sumState = ctx.getPartitionedState(sumStateDescriptor);
        if (null == sumState.value()) {
            sumState.update(0);
        }
        sumState.update(element.f2 + sumState.value());
        if (sumState.value() >= 2) {
            // The trigger state can be reset manually here if desired.
            // The default triggers return TriggerResult.FIRE, which does not purge the window contents.
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        System.out.println("Clearing window state, stored value: " + ctx.getPartitionedState(sumStateDescriptor).value());
        ctx.getPartitionedState(sumStateDescriptor).clear();
    }
}
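One consequence of the trigger above is worth spelling out. In Flink, purging a window drops the buffered elements, but the trigger's own partitioned state is only cleaned up in clear(), when the window itself is removed. The sum therefore stays at or above 2 after the first firing, and every later element fires again unless the state is reset by hand. The plain-Java sketch below contrasts the two choices; it does not use Flink, and `SumTrigger` and `resetOnFire` are hypothetical names introduced for illustration.

```java
final class SumTrigger {
    private final int threshold;
    private final boolean resetOnFire;
    private int sum; // plays the role of the ValueState<Integer> named "sum"

    SumTrigger(int threshold, boolean resetOnFire) {
        this.threshold = threshold;
        this.resetOnFire = resetOnFire;
    }

    /** Returns true when the trigger would fire for this element. */
    boolean onElement(int value) {
        sum += value;
        if (sum >= threshold) {
            if (resetOnFire) {
                sum = 0; // manual reset, so the next firing needs a fresh sum
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SumTrigger keep = new SumTrigger(2, false);
        SumTrigger reset = new SumTrigger(2, true);
        for (int i = 0; i < 4; i++) {
            // Each line prints: <fires without reset> <fires with reset>
            System.out.println(keep.onElement(1) + " " + reset.onElement(1));
        }
    }
}
```

Without the reset the sketch fires on the second element and on every element after it; with the reset it fires on every second element.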
(2) Fire when the new value exceeds the last value that fired by more than 2
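import org.apache.flink.streaming.api.functions.windowing.delta.DeltaFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.DeltaTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Threshold 2.0; the delta is the signed difference new - old
val deltaTrigger = DeltaTrigger.of[(String, Double), GlobalWindow](2.0, new DeltaFunction[(String, Double)] {
  override def getDelta(oldDataPoint: (String, Double), newDataPoint: (String, Double)): Double = {
    newDataPoint._2 - oldDataPoint._2
  }
}, createTypeInformation[(String, Double)].createSerializer(env.getConfig))
// Each input line: <key> <value>
env.socketTextStream("centos", 7788)
  .map(_.split("\\s+"))
  .map(ts => (ts(0), ts(1).toDouble))
  .keyBy(_._1)
  .window(GlobalWindows.create())
  .trigger(deltaTrigger)
  .reduce((v1: (String, Double), v2: (String, Double)) => (v1._1, v1._2 + v2._2))
  .print()
env.execute("window")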
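The firing rule of the delta trigger can be sketched in plain Java. This is a simplified model, not Flink's code: it assumes the trigger remembers the last element that caused a firing (the first element only becomes the reference point) and fires when the delta to the new element is strictly greater than the threshold. `DeltaSketch` is a hypothetical name, and the delta is the same signed difference `new - old` used in the example above.

```java
final class DeltaSketch {
    private final double threshold;
    private Double lastFired; // null until the first element arrives

    DeltaSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true when the element would cause the window to fire. */
    boolean onElement(double value) {
        if (lastFired == null) {
            lastFired = value; // the first element only sets the reference point
            return false;
        }
        if (value - lastFired > threshold) {
            lastFired = value; // the firing element becomes the new reference point
            return true;
        }
        return false;
    }
}
```

Because the difference is signed, a falling value never fires in this sketch; taking the absolute value of the difference would make the rule react to changes in both directions.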
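2. Evictor
An Evictor can remove elements from a window after the trigger fires, either before or after the window function is applied. To do so, the Evictor interface has two methods: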
public interface Evictor<T, W extends Window> extends Serializable {
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
}
evictBefore: removes elements before the window function is applied.
evictAfter: removes elements after the window function is applied.
Example:
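import java.lang
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.AllWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.evictors.Evictor
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue
import org.apache.flink.util.Collector

// isBefore = true evicts before the window function runs, false evicts after it
class ErrorEvictor(isBefore: Boolean) extends Evictor[String, TimeWindow] {
  override def evictBefore(elements: lang.Iterable[TimestampedValue[String]], size: Int, window: TimeWindow, evictorContext: Evictor.EvictorContext): Unit = {
    if (isBefore) {
      evictor(elements, size, window, evictorContext)
    }
  }

  override def evictAfter(elements: lang.Iterable[TimestampedValue[String]], size: Int, window: TimeWindow, evictorContext: Evictor.EvictorContext): Unit = {
    if (!isBefore) {
      evictor(elements, size, window, evictorContext)
    }
  }

  private def evictor(elements: lang.Iterable[TimestampedValue[String]], size: Int, window: TimeWindow, evictorContext: Evictor.EvictorContext): Unit = {
    val iterator = elements.iterator()
    while (iterator.hasNext) {
      val it = iterator.next()
      if (it.getValue.contains("error")) { // drop records that contain "error"
        iterator.remove()
      }
    }
  }
}

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.socketTextStream("CentOS", 7788)
  .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .evictor(new ErrorEvictor(true))
  .apply(new AllWindowFunction[String, String, TimeWindow] {
    override def apply(window: TimeWindow, input: Iterable[String], out: Collector[String]): Unit = {
      // Emit whatever survived eviction
      for (e <- input) {
        out.collect(e)
      }
    }
  })
  .print()
fsEnv.execute("window")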
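The core of the evictor above is an in-place removal through the iterator. The following plain-Java sketch shows only that step, without Flink; `ErrorEvictorSketch` is a hypothetical name, and a plain list of strings stands in for the window's `TimestampedValue` elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ErrorEvictorSketch {
    /** Removes every element containing "error" from the given list, in place. */
    static void evict(List<String> elements) {
        Iterator<String> it = elements.iterator();
        while (it.hasNext()) {
            if (it.next().contains("error")) {
                it.remove(); // removing via the iterator avoids ConcurrentModificationException
            }
        }
    }

    public static void main(String[] args) {
        List<String> window = new ArrayList<>(Arrays.asList("ok 1", "error 2", "ok 3"));
        evict(window);
        System.out.println(window); // prints [ok 1, ok 3]
    }
}
```

With evictBefore the window function only sees the surviving elements; with evictAfter it sees all of them, and the removal only affects what stays in the window for later firings.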
(3) Event Time
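Flink supports the following time semantics for window computation: Processing time, Event time, and Ingestion time.
- Processing time: windows are computed from the clock of the node that processes the record.
- Event time: windows are computed from the time at which the event was produced, which gives the most accurate results.
- Ingestion time: the time at which the record enters Flink, usually assigned by the SourceFunction.
By default Flink uses processing time (this holds for versions before 1.12; from 1.12 on, event time is the default and setStreamTimeCharacteristic is deprecated), so to use event time or ingestion time you normally have to set the time characteristic: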
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
.....
// window operations
....
fsEnv.execute("event time")
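Once event-time processing is enabled, the user must declare how watermarks are computed. The system computes a watermark time T for each stream, and a window fires only once the watermark reaches its end time, T' <= watermark(T) (the built-in EventTimeTrigger compares the watermark against the window's maximum timestamp, end - 1 ms). In the examples here the watermark computation is implemented by the user. There are two ways to generate watermarks: ① periodically, at a fixed interval (recommended), and ② per record (punctuated), where every incoming record can update the watermark immediately.
① Periodic
For periodic watermarks, configure the interval with fsEnv.getConfig.setAutoWatermarkInterval(interval in milliseconds).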
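import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class AccessLogAssignerWithPeriodicWatermarks extends AssignerWithPeriodicWatermarks[AccessLog] {
  private var maxSeeTime: Long = 0L      // largest event timestamp seen so far
  private val maxOrderness: Long = 2000L // tolerated out-of-orderness in milliseconds

  // Called periodically, at the configured auto-watermark interval
  override def getCurrentWatermark: Watermark = {
    new Watermark(maxSeeTime - maxOrderness)
  }

  override def extractTimestamp(element: AccessLog, previousElementTimestamp: Long): Long = {
    maxSeeTime = Math.max(maxSeeTime, element.timestamp)
    element.timestamp
  }
}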
② Per record (punctuated)
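import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class AccessLogAssignerWithPunctuatedWatermarks extends AssignerWithPunctuatedWatermarks[AccessLog] {
  private var maxSeeTime: Long = 0L      // largest event timestamp seen so far
  private val maxOrderness: Long = 2000L // tolerated out-of-orderness in milliseconds

  // Called for every record, right after extractTimestamp
  override def checkAndGetNextWatermark(lastElement: AccessLog, extractedTimestamp: Long): Watermark = {
    new Watermark(maxSeeTime - maxOrderness)
  }

  override def extractTimestamp(element: AccessLog, previousElementTimestamp: Long): Long = {
    maxSeeTime = Math.max(maxSeeTime, element.timestamp)
    element.timestamp
  }
}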
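Both assigners compute the watermark the same way: the largest timestamp seen so far minus a fixed out-of-orderness bound. The plain-Java sketch below isolates that arithmetic without Flink; `BoundedWatermark` is a hypothetical name, and the 2000 ms bound used in the usage note matches the value in the classes above.

```java
final class BoundedWatermark {
    private final long maxOutOfOrderness; // tolerated lateness in milliseconds
    private long maxSeenTime = 0L;        // largest event timestamp seen so far

    BoundedWatermark(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    /** Records the event timestamp and returns it unchanged. */
    long extractTimestamp(long eventTime) {
        maxSeenTime = Math.max(maxSeenTime, eventTime);
        return eventTime;
    }

    /** The watermark trails the largest seen timestamp by the configured bound. */
    long currentWatermark() {
        return maxSeenTime - maxOutOfOrderness;
    }
}
```

With a bound of 2000 ms, timestamps 5000, 4000, 7000 give watermarks 3000, 3000, 5000: an out-of-order record never moves the watermark backwards.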
Example:
// Assumed definition, inferred from how AccessLog is used below
case class AccessLog(channel: String, timestamp: Long)

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
fsEnv.getConfig.setAutoWatermarkInterval(1000) // recompute the watermark once per second
fsEnv.setParallelism(1)
// Each input line: <channel> <timestamp>
fsEnv.socketTextStream("CentOS", 8888)
  .map(line => line.split("\\s+"))
  .map(ts => AccessLog(ts(0), ts(1).toLong))
  .assignTimestampsAndWatermarks(new AccessLogAssignerWithPeriodicWatermarks)
  .keyBy(accessLog => accessLog.channel)
  .window(TumblingEventTimeWindows.of(Time.seconds(4)))
  .process(new ProcessWindowFunction[AccessLog, String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[AccessLog], out: Collector[String]): Unit = {
      val sdf = new SimpleDateFormat("HH:mm:ss")
      val window = context.window
      val currentWatermark = context.currentWatermark
      println("window:" + sdf.format(window.getStart) + "\t" + sdf.format(window.getEnd) + " \t watermark:" + sdf.format(currentWatermark))
      for (e <- elements) {
        val AccessLog(channel: String, timestamp: Long) = e
        out.collect(channel + "\t" + sdf.format(timestamp))
      }
    }
  })
  .print()
fsEnv.execute("event time")
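To predict what the example prints, it helps to know which window a timestamp falls into and when that window fires. The plain-Java sketch below models both for tumbling windows, assuming non-negative timestamps and a zero window offset. `TumblingWindowSketch` is a hypothetical name; the firing rule mirrors Flink's EventTimeTrigger, which fires once the watermark reaches the window's maximum timestamp (end - 1 ms).

```java
final class TumblingWindowSketch {
    /** Start of the tumbling window that contains the timestamp (zero offset assumed). */
    static long windowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    /** Exclusive end of the tumbling window that contains the timestamp. */
    static long windowEnd(long timestamp, long size) {
        return windowStart(timestamp, size) + size;
    }

    /** True once the watermark has reached the window's last timestamp (end - 1). */
    static boolean fires(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long size = 4000L;
        long end = windowEnd(5000L, size);
        System.out.println(windowStart(5000L, size) + " " + end); // prints 4000 8000
        System.out.println(fires(end, 7998L) + " " + fires(end, 7999L)); // prints false true
    }
}
```

With the 2000 ms out-of-orderness bound used above, the window [4000, 8000) can therefore fire only after a record with a timestamp of at least 9999 has been seen and the next periodic watermark has been emitted.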
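(4) Handling late data
Flink can handle late data. If watermark - window end < allowed lateness, a late record still takes part in the window computation; otherwise Flink treats it as too late and discards it.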
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
fsEnv.getConfig.setAutoWatermarkInterval(1000) // recompute the watermark once per second
fsEnv.setParallelism(1)
// Each input line: <channel> <timestamp>
fsEnv.socketTextStream("CentOS", 8888)
  .map(line => line.split("\\s+"))
  .map(ts => AccessLog(ts(0), ts(1).toLong))
  .assignTimestampsAndWatermarks(new AccessLogAssignerWithPeriodicWatermarks)
  .keyBy(accessLog => accessLog.channel)
  .window(TumblingEventTimeWindows.of(Time.seconds(4)))
  .allowedLateness(Time.seconds(2)) // keep each window open for 2 more seconds of event time
  .process(new ProcessWindowFunction[AccessLog, String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[AccessLog], out: Collector[String]): Unit = {
      val sdf = new SimpleDateFormat("HH:mm:ss")
      val window = context.window
      val currentWatermark = context.currentWatermark
      println("window:" + sdf.format(window.getStart) + "\t" + sdf.format(window.getEnd) + " \t watermark:" + sdf.format(currentWatermark))
      for (e <- elements) {
        val AccessLog(channel: String, timestamp: Long) = e
        out.collect(channel + "\t" + sdf.format(timestamp))
      }
    }
  })
  .print()
fsEnv.execute("event time")
By default Flink drops too-late data. To capture those records instead, route them to a side output:
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
fsEnv.getConfig.setAutoWatermarkInterval(1000) // recompute the watermark once per second
fsEnv.setParallelism(1)
val lateTag = new OutputTag[AccessLog]("latetag")
// Each input line: <channel> <timestamp>
val keyedWindowStream = fsEnv.socketTextStream("CentOS", 8888)
  .map(line => line.split("\\s+"))
  .map(ts => AccessLog(ts(0), ts(1).toLong))
  .assignTimestampsAndWatermarks(new AccessLogAssignerWithPeriodicWatermarks)
  .keyBy(accessLog => accessLog.channel)
  .window(TumblingEventTimeWindows.of(Time.seconds(4)))
  .allowedLateness(Time.seconds(2))
  .sideOutputLateData(lateTag) // too-late records go to the side output instead of being dropped
  .process(new ProcessWindowFunction[AccessLog, String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[AccessLog], out: Collector[String]): Unit = {
      val sdf = new SimpleDateFormat("HH:mm:ss")
      val window = context.window
      val currentWatermark = context.currentWatermark
      println("window:" + sdf.format(window.getStart) + "\t" + sdf.format(window.getEnd) + " \t watermark:" + sdf.format(currentWatermark))
      for (e <- elements) {
        val AccessLog(channel: String, timestamp: Long) = e
        out.collect(channel + "\t" + sdf.format(timestamp))
      }
    }
  })
keyedWindowStream.print("on time:")
keyedWindowStream.getSideOutput(lateTag).print("too late:")
fsEnv.execute("event time")
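The three possible fates of a record can be summarised in a small plain-Java sketch. It is a simplified model, not Flink's code: `LatenessSketch` and `Fate` are hypothetical names, and the thresholds assume Flink's convention that a window is kept until the watermark reaches its maximum timestamp (end - 1 ms) plus the allowed lateness.

```java
final class LatenessSketch {
    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, TOO_LATE }

    /** Classifies a record for a given window at the moment the record arrives. */
    static Fate classify(long windowEnd, long allowedLateness, long watermark) {
        long maxTimestamp = windowEnd - 1;
        if (watermark < maxTimestamp) {
            return Fate.ON_TIME; // the window has not fired yet
        }
        if (watermark < maxTimestamp + allowedLateness) {
            return Fate.LATE_BUT_ACCEPTED; // the window fired but is still kept around
        }
        return Fate.TOO_LATE; // the window is gone: dropped, or sent to the side output
    }

    public static void main(String[] args) {
        // Window [4000, 8000) with 2000 ms allowed lateness, as in the example above
        System.out.println(classify(8000L, 2000L, 7000L)); // prints ON_TIME
        System.out.println(classify(8000L, 2000L, 9000L)); // prints LATE_BUT_ACCEPTED
        System.out.println(classify(8000L, 2000L, 9999L)); // prints TOO_LATE
    }
}
```

In Flink, a record that arrives late but within the allowed lateness causes the window to fire again with the updated contents, so downstream consumers should expect more than one result per window.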