Excerpt for study purposes only. Original article: "Flink详解系列之五--水位线(watermark)" (Flink in Depth, Part 5: Watermarks), 简书 (Jianshu)
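1. Concept
In Flink, a watermark is a mechanism for measuring the progress of event time. It is used to handle out-of-order events in real-time data, usually in combination with windows.
From the moment a device produces an event, through the Flink source, to processing by multiple operators, factors such as network latency and backpressure can cause data to arrive out of order. Window processing cannot wait indefinitely for delayed data. Once a particular watermark is reached, all data with timestamps earlier than the watermark is assumed to have arrived (even though late data may still follow), and the window computation can be triggered. This mechanism is the watermark, as illustrated by the figure in the original article.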
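Note that the moment at which a window is created depends on wall-clock time (processing time), namely on when the first element belonging to it arrives; which window an element falls into is still decided by its event time.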
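2. Computing the watermark
A watermark is essentially a timestamp. It changes dynamically and is derived from the largest event time seen so far. Specifically:
watermark = maxEventTime - t, where maxEventTime is the largest event time that has entered the Flink window so far and t is the configured allowed delay
When the watermark timestamp is greater than or equal to the window's end time, the window is considered complete and its computation is triggered.
In essence, the watermark is a function that maps event time to the wall-clock time (processing time) at which the window computation is triggered.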
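The formula and the trigger condition above can be traced with a small plain-Java sketch. It is not Flink code: the class name `WatermarkMath`, the 2 s delay, and the window [0, 5000) are assumptions chosen for illustration, and the trigger check follows the simplified rule stated above (watermark >= window end).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of watermark = maxEventTime - t; not Flink code. */
final class WatermarkMath {

    /** Watermark after seeing the given event timestamps, with allowed delay t (all in ms). */
    static long watermark(List<Long> eventTimes, long delayMs) {
        if (eventTimes.isEmpty()) {
            return Long.MIN_VALUE; // nothing seen yet, so no progress in event time
        }
        long maxEventTime = Long.MIN_VALUE;
        for (long ts : eventTimes) {
            maxEventTime = Math.max(maxEventTime, ts);
        }
        return maxEventTime - delayMs;
    }

    /** Simplified rule from the text: the window fires once the watermark reaches its end. */
    static boolean windowFires(long watermark, long windowEndMs) {
        return watermark >= windowEndMs;
    }

    public static void main(String[] args) {
        long delayMs = 2_000L;     // hypothetical allowed delay t
        long windowEndMs = 5_000L; // hypothetical window [0, 5000)
        List<Long> seen = new ArrayList<>();
        // Events arrive out of order: 3000 comes after 4000.
        for (long ts : new long[] {1_000L, 4_000L, 3_000L, 6_500L, 7_000L}) {
            seen.add(ts);
            long wm = watermark(seen, delayMs);
            System.out.println("event=" + ts + " watermark=" + wm
                    + " fires=" + windowFires(wm, windowEndMs));
        }
    }
}
```

In this trace the out-of-order event 3000 does not move the watermark, and the window only fires once an event with timestamp 7000 (window end plus delay) has been seen.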
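3. Generating watermarks
3.1 When to generate
The best place to generate watermarks is as close to the data source as possible, because watermark generation makes assumptions about the order of elements relative to their timestamps. Since sources read data in parallel, any operation that makes Flink redistribute elements across parallel stream partitions (for example changing the parallelism, or keyBy) will put element timestamps out of order. Operations such as an initial filter or map, which do not redistribute elements, can still be applied before the watermarks are generated.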
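Why redistribution disturbs timestamp order can be seen in a minimal plain-Java sketch. It is not Flink code: the class `RepartitionDisorder` and the element-by-element interleaving are assumptions standing in for one possible arrival order at a downstream task that reads from two partitions.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch: merging two individually ordered partitions can break timestamp order. */
final class RepartitionDisorder {

    /** Interleaves two partitions element by element, as one possible downstream arrival order. */
    static List<Long> interleave(List<Long> p0, List<Long> p1) {
        List<Long> merged = new ArrayList<>();
        int n = Math.max(p0.size(), p1.size());
        for (int i = 0; i < n; i++) {
            if (i < p0.size()) {
                merged.add(p0.get(i));
            }
            if (i < p1.size()) {
                merged.add(p1.get(i));
            }
        }
        return merged;
    }

    /** Counts elements whose timestamp is smaller than a timestamp seen before them. */
    static int countOutOfOrder(List<Long> timestamps) {
        int count = 0;
        long max = Long.MIN_VALUE;
        for (long ts : timestamps) {
            if (ts < max) {
                count++;
            }
            max = Math.max(max, ts);
        }
        return count;
    }

    public static void main(String[] args) {
        List<Long> partition0 = List.of(1L, 2L, 3L);    // ascending within partition 0
        List<Long> partition1 = List.of(10L, 11L, 12L); // ascending within partition 1
        List<Long> merged = interleave(partition0, partition1);
        System.out.println(merged + " out-of-order=" + countOutOfOrder(merged));
    }
}
```

Each partition is perfectly ordered on its own, yet the merged stream is not, which is why assumptions about ordering are easiest to make before any redistribution happens.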
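3.2 Watermark assigners
- Periodic Watermarks
Periodic watermarks are the more common choice: the system is instructed to emit watermarks at a fixed interval. When the time characteristic is set to event time, this interval defaults to 200 ms and can be adjusted as needed. The following example manually sets it so that a watermark is emitted every 1 s.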
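final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Use event time (this call is deprecated since Flink 1.12, where event time is the default).
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Manually set the watermark interval to 1 s.
env.getConfig().setAutoWatermarkInterval(1000);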
Periodic watermarks require implementing the AssignerWithPeriodicWatermarks interface. An example:
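import javax.annotation.Nullable;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class TestPeriodWatermark implements AssignerWithPeriodicWatermarks<Tuple2<String, Long>> {
    // Largest event timestamp seen so far, in milliseconds.
    private long currentMaxTimestamp = 0L;
    // Maximum tolerated delay: 1 s.
    private final long maxOutOfOrderness = 1000L;

    // Called periodically, at the configured auto-watermark interval.
    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }

    // Called for every element: extracts its timestamp and tracks the maximum.
    @Override
    public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
        long timestamp = element.f1;
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }
}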
- Punctuated Watermarks
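Punctuated watermarks are less common. They are intended for input streams that contain special tuples or markers indicating progress, so that watermarks can be generated from the input elements themselves.
Because every element with an increasing event time can produce a watermark, the punctuated approach generates a large number of watermarks when TPS is high, which puts some pressure on downstream operators. In production it is therefore usually chosen only when real-time requirements are very strict.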
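import javax.annotation.Nullable;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class TestPunctuateWatermark implements AssignerWithPunctuatedWatermarks<Tuple2<String, Long>> {
    // Called after extractTimestamp for every element; here each element yields a watermark.
    @Nullable
    @Override
    public Watermark checkAndGetNextWatermark(Tuple2<String, Long> lastElement, long extractedTimestamp) {
        return new Watermark(extractedTimestamp);
    }

    // The event time is carried in the second field of the tuple.
    @Override
    public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
        return element.f1;
    }
}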
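The example above emits a watermark for every element. For the marker scenario described earlier, a punctuated generator would emit only when it sees a progress marker. The following plain-Java sketch shows that decision in isolation. It is not Flink code: the `Event` type and its `isMarker` flag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of punctuated watermark generation driven by marker elements. */
final class PunctuatedSketch {

    /** Hypothetical event type: a timestamp plus a flag identifying progress markers. */
    static final class Event {
        final long timestamp;
        final boolean isMarker;

        Event(long timestamp, boolean isMarker) {
            this.timestamp = timestamp;
            this.isMarker = isMarker;
        }
    }

    /** Returns the watermarks emitted: one per marker that advances the current watermark. */
    static List<Long> emittedWatermarks(List<Event> events) {
        List<Long> emitted = new ArrayList<>();
        long current = Long.MIN_VALUE;
        for (Event e : events) {
            // Ordinary elements never emit; markers emit only if they move the watermark forward.
            if (e.isMarker && e.timestamp > current) {
                current = e.timestamp;
                emitted.add(current);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(1_000L, false),
                new Event(2_000L, true),
                new Event(1_500L, false),
                new Event(1_800L, true),  // marker, but older than the current watermark
                new Event(3_000L, true));
        System.out.println(emittedWatermarks(events));
    }
}
```

Emitting only on markers keeps the number of watermarks proportional to the number of markers rather than to the number of elements, which eases the downstream pressure mentioned above.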
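4. Watermarks and data completeness
Watermarks balance latency against the completeness of results: they control how long a computation waits before it runs. That waiting time is an estimate. A perfect watermark does not exist in practice, because some records will always arrive late. To configure watermarks well, you need a thorough understanding of the whole path from data generation to the data source, so that you can estimate an upper bound on the delay.
If the watermark is too loose, the benefit is that as much data as possible is collected before the computation runs; but because the watermark then lags far behind the timestamps of the records being processed, results are produced with a large delay.
If the watermark is too tight, results are of course more timely; but because the watermark is then ahead of the timestamps of some records, data completeness suffers.
Setting the watermark therefore requires a good understanding of the data and a deliberate trade-off between timeliness and completeness.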
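The trade-off can be made concrete with a small plain-Java simulation. It is not Flink code: the arrival sequence, the 5 s tumbling window, and the rule "a record is late if the watermark has already reached the end of its window when it arrives" are simplifying assumptions for illustration.

```java
/** Plain-Java sketch of the completeness/latency trade-off; not Flink code. */
final class WatermarkTradeoff {

    /**
     * Counts records that arrive after the watermark (largest timestamp seen so far minus
     * delayMs) has already reached the end of their tumbling window.
     * Assumes non-negative timestamps in milliseconds.
     */
    static int countLate(long[] arrivalOrder, long delayMs, long windowSizeMs) {
        int late = 0;
        boolean seenAny = false;
        long maxSeen = 0L;
        for (long ts : arrivalOrder) {
            long windowEnd = (ts / windowSizeMs + 1) * windowSizeMs;
            if (seenAny && maxSeen - delayMs >= windowEnd) {
                late++; // the window this record belongs to has already fired
            }
            maxSeen = seenAny ? Math.max(maxSeen, ts) : ts;
            seenAny = true;
        }
        return late;
    }

    public static void main(String[] args) {
        long[] arrivalOrder = {1000, 2000, 7000, 4000, 9000, 3000, 12000, 8000}; // made-up data
        long windowSizeMs = 5000;
        for (long delayMs : new long[] {0, 3000, 5000}) {
            System.out.println("delay=" + delayMs
                    + " lateRecords=" + countLate(arrivalOrder, delayMs, windowSizeMs)
                    + " firstWindowFiresAtEventTime>=" + (windowSizeMs + delayMs));
        }
    }
}
```

With this data, a larger delay leaves fewer records behind, while the first window can only fire once correspondingly later events have been seen.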
===================================
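As of Flink 1.12, watermarks are generated through the WatermarkStrategy-based interfaces (these were introduced in Flink 1.11); the underlying principle is unchanged.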
WatermarkStrategy
-- TimestampAssigner
-- WatermarkGenerator
TimestampAssigner
/**
 * A {@code TimestampAssigner} assigns event time timestamps to elements. These timestamps are used
 * by all functions that operate on event time, for example event time windows.
 *
 * <p>Timestamps can be an arbitrary {@code long} value, but all built-in implementations represent
 * it as the milliseconds since the Epoch (midnight, January 1, 1970 UTC), the same way as {@link
 * System#currentTimeMillis()} does it.
 *
 * @param <T> The type of the elements to which this assigner assigns timestamps.
 */
@Public
@FunctionalInterface
public interface TimestampAssigner<T> {

    /**
     * The value that is passed to {@link #extractTimestamp} when there is no previous timestamp
     * attached to the record.
     */
    long NO_TIMESTAMP = Long.MIN_VALUE;

    /**
     * Assigns a timestamp to an element, in milliseconds since the Epoch. This is independent of
     * any particular time zone or calendar.
     *
     * <p>The method is passed the previously assigned timestamp of the element. That previous
     * timestamp may have been assigned from a previous assigner. If the element did not carry a
     * timestamp before, this value is {@link #NO_TIMESTAMP} (= {@code Long.MIN_VALUE}: {@value
     * Long#MIN_VALUE}).
     *
     * @param element The element that the timestamp will be assigned to.
     * @param recordTimestamp The current internal timestamp of the element, or a negative value, if
     *     no timestamp has been assigned yet.
     * @return The new timestamp.
     */
    long extractTimestamp(T element, long recordTimestamp);
}
Default implementation: RecordTimestampAssigner
/**
 * A {@link TimestampAssigner} that forwards the already-assigned timestamp. This is for use when
 * records come out of a source with valid timestamps, for example from the Kafka Metadata.
 */
@Public
public final class RecordTimestampAssigner<E> implements TimestampAssigner<E> {

    @Override
    public long extractTimestamp(E element, long recordTimestamp) {
        return recordTimestamp;
    }
}
WatermarkGenerator
/**
 * The {@code WatermarkGenerator} generates watermarks either based on events or periodically (in a
 * fixed interval).
 *
 * <p><b>Note:</b> This WatermarkGenerator subsumes the previous distinction between the {@code
 * AssignerWithPunctuatedWatermarks} and the {@code AssignerWithPeriodicWatermarks}.
 */
@Public
public interface WatermarkGenerator<T> {

    /**
     * Called for every event, allows the watermark generator to examine and remember the event
     * timestamps, or to emit a watermark based on the event itself.
     */
    void onEvent(T event, long eventTimestamp, WatermarkOutput output);

    /**
     * Called periodically, and might emit a new watermark, or not.
     *
     * <p>The interval in which this method is called and Watermarks are generated depends on {@link
     * ExecutionConfig#getAutoWatermarkInterval()}.
     */
    void onPeriodicEmit(WatermarkOutput output);
}
The main implementations are:
BoundedOutOfOrdernessWatermarks
WatermarksWithIdleness
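To show how the two callbacks divide the work in a bounded-out-of-orderness generator, here is a self-contained plain-Java sketch. The `Output` and `Generator` interfaces are simplified, hypothetical stand-ins for Flink's `WatermarkOutput` and `WatermarkGenerator` (they pass raw `long` values instead of `Watermark` objects), and the class body reflects my understanding of how `BoundedOutOfOrdernessWatermarks` behaves rather than a copy of its source. In a real job the built-in generator is normally obtained through `WatermarkStrategy.forBoundedOutOfOrderness(Duration)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of a periodic, bounded-out-of-orderness generator; not Flink code. */
final class GeneratorSketch {

    /** Simplified stand-in for Flink's WatermarkOutput (hypothetical: takes a raw long). */
    interface Output {
        void emitWatermark(long timestamp);
    }

    /** Simplified stand-in mirroring the two callbacks of WatermarkGenerator<T>. */
    interface Generator<T> {
        void onEvent(T event, long eventTimestamp, Output output);

        void onPeriodicEmit(Output output);
    }

    /** Remembers the largest timestamp per event and emits only on the periodic callback. */
    static final class BoundedOutOfOrderness<T> implements Generator<T> {
        private final long outOfOrdernessMs;
        private long maxTimestamp;

        BoundedOutOfOrderness(long outOfOrdernessMs) {
            this.outOfOrdernessMs = outOfOrdernessMs;
            // Start high enough that the subtraction in onPeriodicEmit cannot underflow.
            this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMs + 1;
        }

        @Override
        public void onEvent(T event, long eventTimestamp, Output output) {
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp); // remember, do not emit
        }

        @Override
        public void onPeriodicEmit(Output output) {
            // The extra 1 ms makes the watermark strictly smaller than (max - bound).
            output.emitWatermark(maxTimestamp - outOfOrdernessMs - 1);
        }
    }

    public static void main(String[] args) {
        List<Long> emitted = new ArrayList<>();
        Output out = ts -> emitted.add(ts);
        Generator<String> generator = new BoundedOutOfOrderness<>(1_000L);
        generator.onEvent("a", 5_000L, out);
        generator.onEvent("b", 4_200L, out); // out of order, does not lower the maximum
        generator.onPeriodicEmit(out);       // stands in for the periodic timer firing
        System.out.println(emitted);
    }
}
```

A punctuated-style generator would do the opposite: emit from `onEvent` and leave `onPeriodicEmit` empty, which is how one interface covers both of the older assigner types.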
References: