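Preface:
In stream processing, data is handled differently depending on the notion of time attached to incoming events, such as event time or processing time.
How, then, is time coordinated and managed inside an operator? Each operator maintains a TimerService dedicated to time-related operations, for example reading the operator's current processing time and watermark, or registering timers of different time types.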
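I. Time and Watermarks
The Watermark data structure:
A watermark tells the Flink system that no more data with a timestamp less than or equal to watermark.timestamp is expected to arrive; elements that still show up with such a timestamp are treated as late. In essence a watermark is just a timestamp: in Flink's Watermark class, the only meaningful field is timestamp.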
public final class Watermark extends StreamElement {
    public static final Watermark MAX_WATERMARK = new Watermark(Long.MAX_VALUE);
    public static final Watermark UNINITIALIZED = new Watermark(Long.MIN_VALUE);
    private final long timestamp;

    public Watermark(long timestamp) {
        this.timestamp = timestamp;
    }

    public long getTimestamp() {
        return timestamp;
    }

    @Override
    public boolean equals(Object o) {
        return this == o
                || o != null
                        && o.getClass() == Watermark.class
                        && ((Watermark) o).timestamp == timestamp;
    }

    @Override
    public int hashCode() {
        return (int) (timestamp ^ (timestamp >>> 32));
    }

    @Override
    public String toString() {
        return "Watermark @ " + timestamp;
    }
}
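To make these semantics concrete, here is a minimal plain-Java sketch (no Flink dependency) of how a component might keep track of its current watermark and decide whether an element is late. The class `WatermarkTracker` and its method names are hypothetical and exist only for illustration; they are not Flink APIs.

```java
/** Hypothetical helper, not a Flink class: remembers the latest watermark seen. */
final class WatermarkTracker {
    // Same starting value as Watermark.UNINITIALIZED: no watermark seen yet.
    private long currentWatermark = Long.MIN_VALUE;

    /**
     * Advances the watermark. A value that is not larger than the current
     * one is ignored, so event time never moves backwards.
     */
    boolean advance(long watermarkTimestamp) {
        if (watermarkTimestamp > currentWatermark) {
            currentWatermark = watermarkTimestamp;
            return true;
        }
        return false;
    }

    /** An element is late if its timestamp is <= the current watermark. */
    boolean isLate(long elementTimestamp) {
        return elementTimestamp <= currentWatermark;
    }

    long currentWatermark() {
        return currentWatermark;
    }
}
```

With a watermark of 100, an element stamped 100 counts as late while an element stamped 101 does not, which matches the "less than or equal to" wording above.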
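1.1 Extracting timestamps and generating watermarks in a SourceFunction
When a SourceFunction reads data elements, the SourceContext interface defines the methods used to attach a timestamp to an element and to emit a watermark: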
@PublicEvolving
void collectWithTimestamp(T element, long timestamp);
@PublicEvolving
void emitWatermark(Watermark mark);
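Which SourceContext implementation handles the watermarks depends on the time characteristic. StreamSourceContexts picks ManualWatermarkContext for event time, AutomaticWatermarkContext for ingestion time, and NonTimestampContext for processing time: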
final SourceFunction.SourceContext<OUT> ctx;
switch (timeCharacteristic) {
    case EventTime:
        ctx =
                new ManualWatermarkContext<>(
                        output,
                        processingTimeService,
                        checkpointLock,
                        idleTimeout,
                        emitProgressiveWatermarks);
        break;
    case IngestionTime:
        Preconditions.checkState(
                emitProgressiveWatermarks,
                "Ingestion time is not available when emitting progressive watermarks "
                        + "is disabled.");
        ctx =
                new AutomaticWatermarkContext<>(
                        output,
                        watermarkInterval,
                        processingTimeService,
                        checkpointLock,
                        idleTimeout);
        break;
    case ProcessingTime:
        ctx = new NonTimestampContext<>(checkpointLock, output);
        break;
    // omitted...
}
The WatermarkContext.collectWithTimestamp method:
@Override
public final void collectWithTimestamp(T element, long timestamp) {
    synchronized (checkpointLock) {
        processAndEmitWatermarkStatus(WatermarkStatus.ACTIVE);
        if (nextCheck != null) {
            this.failOnNextCheck = false;
        } else {
            scheduleNextIdleDetectionTask();
        }
        // Attach the timestamp to the element and emit it
        processAndCollectWithTimestamp(element, timestamp);
    }
}
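Watermarks are generated mainly by calling WatermarkContext.emitWatermark(). Emitting a watermark first updates the current watermark of the source operator and then forwards the watermark to the downstream operators; when a downstream operator receives the watermark event, it updates its own internal current watermark as well.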
@Override
public final void emitWatermark(Watermark mark) {
    if (allowWatermark(mark)) {
        synchronized (checkpointLock) {
            processAndEmitWatermarkStatus(WatermarkStatus.ACTIVE);
            if (nextCheck != null) {
                this.failOnNextCheck = false;
            } else {
                scheduleNextIdleDetectionTask();
            }
            // Process the watermark and send it to the downstream operators
            processAndEmitWatermark(mark);
        }
    }
}
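The propagation described above can be pictured with a small plain-Java sketch. It is a deliberately simplified model, not Flink code: `SketchOperator` is a hypothetical class, each operator has exactly one downstream operator, and a watermark that does not advance time is simply dropped (an assumption made here to keep the model small).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an operator in a linear chain; not a Flink class. */
final class SketchOperator {
    private final String name;
    private final SketchOperator downstream; // null for the last operator
    private final List<String> log;          // records who saw which watermark
    private long currentWatermark = Long.MIN_VALUE;

    SketchOperator(String name, SketchOperator downstream, List<String> log) {
        this.name = name;
        this.downstream = downstream;
        this.log = log;
    }

    /** Updates the local current watermark, then forwards it downstream. */
    void processWatermark(long timestamp) {
        if (timestamp <= currentWatermark) {
            return; // simplification: ignore watermarks that do not advance time
        }
        currentWatermark = timestamp;
        log.add(name + "@" + timestamp);
        if (downstream != null) {
            downstream.processWatermark(timestamp);
        }
    }

    long currentWatermark() {
        return currentWatermark;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        SketchOperator sink = new SketchOperator("sink", null, log);
        SketchOperator map = new SketchOperator("map", sink, log);
        SketchOperator source = new SketchOperator("source", map, log);
        source.processWatermark(10L);
        System.out.println(log); // prints [source@10, map@10, sink@10]
    }
}
```

Every operator in the chain ends up with the same current watermark, each one updating its own copy before passing the watermark on.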
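1.2 Extracting timestamps and generating watermarks with a dedicated DataStream operator
Besides assigning timestamps and generating watermarks directly in the SourceFunction, the same can be done during a DataStream transformation; the operators behind the subsequent transformations can then use the generated timestamps and watermarks.
This is done through DataStream.assignTimestampsAndWatermarks():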
// assignTimestampsAndWatermarks takes a WatermarkStrategy as its parameter,
// for example WatermarkStrategy.<ElementType>forMonotonousTimestamps()
// The WatermarkStrategy is what generates the watermarks
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
        WatermarkStrategy<T> watermarkStrategy) {
    final WatermarkStrategy<T> cleanedStrategy = clean(watermarkStrategy);
    final int inputParallelism = getTransformation().getParallelism();
    final TimestampsAndWatermarksTransformation<T> transformation =
            new TimestampsAndWatermarksTransformation<>(
                    "Timestamps/Watermarks",
                    inputParallelism,
                    getTransformation(),
                    cleanedStrategy,
                    false);
    getExecutionEnvironment().addOperator(transformation);
    return new SingleOutputStreamOperator<>(getExecutionEnvironment(), transformation);
}
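1.3 Watermark types
Flink uses a WatermarkStrategy to configure custom watermark generation, and WatermarkGenerator is the base interface for watermark generators. Two styles are supported: punctuated watermarks, which are emitted from onEvent() in response to individual events, and periodic watermarks, which are emitted from onPeriodicEmit() at a regular interval.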
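public interface WatermarkGenerator<T> {
    // Called for every event; a punctuated generator emits watermarks from here
    void onEvent(T event, long eventTimestamp, WatermarkOutput output);

    // Called periodically; a periodic generator emits watermarks from here
    void onPeriodicEmit(WatermarkOutput output);
}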
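The difference between the two styles is easier to see side by side. The following plain-Java sketch mirrors the shape of WatermarkGenerator but is not Flink code: `SketchWatermarkGenerator` is a hypothetical interface whose output is reduced to a `LongConsumer`, and the "MARKER" prefix used by the punctuated variant is an invented convention. The periodic variant follows the usual bounded-out-of-orderness idea of trailing the largest timestamp seen by a fixed delay.

```java
import java.util.function.LongConsumer;

/** Hypothetical, simplified stand-in for Flink's WatermarkGenerator. */
interface SketchWatermarkGenerator<T> {
    void onEvent(T event, long eventTimestamp, LongConsumer output);

    void onPeriodicEmit(LongConsumer output);
}

/** Periodic style: only remembers the max timestamp; emits when asked. */
final class BoundedOutOfOrdernessSketch<T> implements SketchWatermarkGenerator<T> {
    private final long maxOutOfOrderness;
    private long maxTimestamp;

    BoundedOutOfOrdernessSketch(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
        // Start value chosen so the first emit yields Long.MIN_VALUE without underflow.
        this.maxTimestamp = Long.MIN_VALUE + maxOutOfOrderness + 1;
    }

    @Override
    public void onEvent(T event, long eventTimestamp, LongConsumer output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(LongConsumer output) {
        // Watermark trails the largest timestamp seen so far by the allowed delay.
        output.accept(maxTimestamp - maxOutOfOrderness - 1);
    }
}

/** Punctuated style: emits immediately when an event carries a marker. */
final class PunctuatedSketch implements SketchWatermarkGenerator<String> {
    @Override
    public void onEvent(String event, long eventTimestamp, LongConsumer output) {
        if (event.startsWith("MARKER")) { // invented convention for this sketch
            output.accept(eventTimestamp);
        }
    }

    @Override
    public void onPeriodicEmit(LongConsumer output) {
        // Nothing to do: this generator reacts to events only.
    }
}
```

In the periodic sketch, how often a watermark appears is decided by whoever calls onPeriodicEmit(); in the punctuated sketch it is decided by the data itself.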
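1.4 How watermarks are produced
Watermarks are produced by the TimestampsAndWatermarksOperator. The WatermarkStrategy plays a role similar to a user-defined function (UDF) that is wrapped inside the TimestampsAndWatermarksOperator.
processElement() generates watermarks from events, processWatermark() blocks watermarks coming from upstream (only the final Long.MAX_VALUE watermark is forwarded), and onProcessingTime() generates watermarks periodically.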
public class TimestampsAndWatermarksOperator<T>
        extends AbstractStreamOperator<T>
        implements OneInputStreamOperator<T, T>, ProcessingTimeCallback {
    // omitted...

    @Override
    public void open() throws Exception {
        super.open();
        // Initialize the timestampAssigner
        timestampAssigner = watermarkStrategy.createTimestampAssigner(this::getMetricGroup);
        watermarkGenerator =
                emitProgressiveWatermarks
                        ? watermarkStrategy.createWatermarkGenerator(this::getMetricGroup)
                        : new NoWatermarksGenerator<>();
        wmOutput = new WatermarkEmitter(output);
        // Configured interval for periodic watermark generation
        watermarkInterval = getExecutionConfig().getAutoWatermarkInterval();
        // Register a timer only if an interval is configured
        if (watermarkInterval > 0 && emitProgressiveWatermarks) {
            final long now = getProcessingTimeService().getCurrentProcessingTime();
            // Register a timer that fires after watermarkInterval; passing this as the
            // callback means this object's onProcessingTime method will be invoked
            getProcessingTimeService().registerTimer(now + watermarkInterval, this);
        }
    }

    @Override
    public void processElement(final StreamRecord<T> element) throws Exception {
        final T event = element.getValue();
        final long previousTimestamp = element.hasTimestamp() ? element.getTimestamp() : Long.MIN_VALUE;
        final long newTimestamp = timestampAssigner.extractTimestamp(event, previousTimestamp);
        element.setTimestamp(newTimestamp);
        output.collect(element);
        // Generate a watermark from the event
        watermarkGenerator.onEvent(event, newTimestamp, wmOutput);
    }

    // Block watermarks coming from upstream
    @Override
    public void processWatermark(org.apache.flink.streaming.api.watermark.Watermark mark) throws Exception {
        // if we receive a Long.MAX_VALUE watermark we forward it since it is used
        // to signal the end of input and to not block watermark progress downstream
        if (mark.getTimestamp() == Long.MAX_VALUE) {
            wmOutput.emitWatermark(Watermark.MAX_WATERMARK);
        }
    }

    @Override
    public void onProcessingTime(long timestamp) throws Exception {
        // Timer fired: generate a watermark periodically
        watermarkGenerator.onPeriodicEmit(wmOutput);
        final long now = getProcessingTimeService().getCurrentProcessingTime();
        // Re-register the timer for the next period
        getProcessingTimeService().registerTimer(now + watermarkInterval, this);
    }
    // omitted...
}
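The pattern in open() and onProcessingTime(), where a one-shot timer re-registers itself to obtain periodic behaviour, can be tried out without Flink. The sketch below uses a simulated clock; `SimulatedProcessingTime` and `PeriodicEmitterSketch` are hypothetical classes that only imitate the shape of the calls above, and the 200 ms interval in the usage note is an arbitrary value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical simulated clock with one-shot timers; not Flink's ProcessingTimeService. */
final class SimulatedProcessingTime {
    interface Callback {
        void onProcessingTime(long timestamp);
    }

    private static final class Timer {
        final long time;
        final Callback callback;

        Timer(long time, Callback callback) {
            this.time = time;
            this.callback = callback;
        }
    }

    private final PriorityQueue<Timer> timers =
            new PriorityQueue<>(Comparator.comparingLong((Timer t) -> t.time));
    private long now;

    long getCurrentProcessingTime() {
        return now;
    }

    void registerTimer(long time, Callback callback) {
        timers.add(new Timer(time, callback));
    }

    /** Moves the clock forward, firing every due timer in time order. */
    void advanceTo(long newTime) {
        while (!timers.isEmpty() && timers.peek().time <= newTime) {
            Timer timer = timers.poll();
            now = timer.time;
            timer.callback.onProcessingTime(timer.time);
        }
        now = newTime;
    }
}

/** Imitates the operator: arm a timer in open(), re-arm it on every firing. */
final class PeriodicEmitterSketch implements SimulatedProcessingTime.Callback {
    private final SimulatedProcessingTime timeService;
    private final long interval;
    final List<Long> firedAt = new ArrayList<>();

    PeriodicEmitterSketch(SimulatedProcessingTime timeService, long interval) {
        this.timeService = timeService;
        this.interval = interval;
    }

    void open() {
        timeService.registerTimer(timeService.getCurrentProcessingTime() + interval, this);
    }

    @Override
    public void onProcessingTime(long timestamp) {
        firedAt.add(timestamp); // the real operator calls onPeriodicEmit at this point
        timeService.registerTimer(timeService.getCurrentProcessingTime() + interval, this);
    }
}
```

Starting the simulated clock at 0 with an interval of 200 and advancing it to 650 fires the callback at 200, 400 and 600; the timer for 800 stays pending until the clock gets there.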
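II. The TimerService
Operators that rely on timers for data processing need the TimerService component to manage those timers; what a timer actually does when it fires is defined by a callback. While each StreamOperator is created and initialized, its TimerService objects are created through the InternalTimeServiceManager, which manages all time-related services within a task and offers every operator methods to create and obtain a TimerService.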
public interface TimerService {
    String UNSUPPORTED_REGISTER_TIMER_MSG = "Setting timers is only supported on a keyed streams.";
    String UNSUPPORTED_DELETE_TIMER_MSG = "Deleting timers is only supported on a keyed streams.";

    long currentProcessingTime();

    long currentWatermark();

    void registerProcessingTimeTimer(long time);

    void registerEventTimeTimer(long time);

    void deleteProcessingTimeTimer(long time);

    void deleteEventTimeTimer(long time);
}
The main implementation of TimerService is SimpleTimerService:
public class SimpleTimerService implements TimerService {
    // SimpleTimerService implements the TimerService interface and holds an
    // InternalTimerService as a member, so the methods it offers are essentially
    // delegated to InternalTimerService
    private final InternalTimerService<VoidNamespace> internalTimerService;

    public SimpleTimerService(InternalTimerService<VoidNamespace> internalTimerService) {
        this.internalTimerService = internalTimerService;
    }
    // overridden methods omitted...
}
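InternalTimerService is effectively the internal counterpart of TimerService, whereas TimerService is the interface exposed to users. InternalTimerService is scoped by key and namespace and provides the internal methods for working with time and timers.
SimpleTimerService is therefore not its only consumer: other built-in operators also perform their time-related operations through the methods that InternalTimerService provides.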
public interface InternalTimerService<N> {
    /** Returns the current processing time. */
    long currentProcessingTime();

    /** Returns the current event-time watermark. */
    long currentWatermark();

    /**
     * Registers a timer to be fired when processing time passes the given time. The namespace you
     * pass here will be provided when the timer fires.
     */
    void registerProcessingTimeTimer(N namespace, long time);

    /** Deletes the timer for the given key and namespace. */
    void deleteProcessingTimeTimer(N namespace, long time);

    /**
     * Registers a timer to be fired when event time watermark passes the given time. The namespace
     * you pass here will be provided when the timer fires.
     */
    void registerEventTimeTimer(N namespace, long time);

    /** Deletes the timer for the given key and namespace. */
    void deleteEventTimeTimer(N namespace, long time);

    /**
     * Performs an action for each registered timer. The timer service will set the key context for
     * the timers key before invoking the action.
     */
    void forEachEventTimeTimer(BiConsumerWithException<N, Long, Exception> consumer)
            throws Exception;

    /**
     * Performs an action for each registered timer. The timer service will set the key context for
     * the timers key before invoking the action.
     */
    void forEachProcessingTimeTimer(BiConsumerWithException<N, Long, Exception> consumer)
            throws Exception;
}
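What "scoped by key and namespace" means in practice can be shown with a small plain-Java model. `KeyedTimerSketch` below is a hypothetical class, not Flink's implementation: keys and namespaces are plain strings, only event-time timers are modelled, and timers are kept in a sorted set so that registering the same (key, namespace, time) twice yields a single timer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of a keyed, namespaced event-time timer service. */
final class KeyedTimerSketch {
    /** A timer is identified by (time, key, namespace). */
    private static final class Timer {
        final long time;
        final String key;
        final String namespace;

        Timer(long time, String key, String namespace) {
            this.time = time;
            this.key = key;
            this.namespace = namespace;
        }

        @Override
        public String toString() {
            return key + "/" + namespace + "@" + time;
        }
    }

    // Sorted by time first, so due timers sit at the head; duplicates collapse.
    private final TreeSet<Timer> eventTimeTimers =
            new TreeSet<>(
                    Comparator.comparingLong((Timer t) -> t.time)
                            .thenComparing((Timer t) -> t.key)
                            .thenComparing((Timer t) -> t.namespace));
    private long currentWatermark = Long.MIN_VALUE;
    private String currentKey;

    /** Timers are always registered for the key that is currently in context. */
    void setCurrentKey(String key) {
        this.currentKey = key;
    }

    long currentWatermark() {
        return currentWatermark;
    }

    void registerEventTimeTimer(String namespace, long time) {
        eventTimeTimers.add(new Timer(time, currentKey, namespace));
    }

    void deleteEventTimeTimer(String namespace, long time) {
        eventTimeTimers.remove(new Timer(time, currentKey, namespace));
    }

    /** Advances the watermark and returns the timers that fire, in firing order. */
    List<String> advanceWatermark(long watermark) {
        currentWatermark = watermark;
        List<String> fired = new ArrayList<>();
        while (!eventTimeTimers.isEmpty() && eventTimeTimers.first().time <= watermark) {
            fired.add(eventTimeTimers.pollFirst().toString());
        }
        return fired;
    }
}
```

Two timers with the same timestamp but different keys or namespaces are distinct timers here, which is why a window operator can use the window as the namespace while a user function can get by with a single void namespace.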
III. TimerService usage example
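SingleOutputStreamOperator<String> process = wordAndOne.keyBy(data -> data.f0).process(
        new KeyedProcessFunction<String, Tuple2<String, Long>, String>() {
            // open() is for one-time initialization; timers are registered per key
            // in processElement(), where a key is in context
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
            }

            @Override
            public void processElement(Tuple2<String, Long> value, Context ctx, Collector<String> out) throws Exception {
                // Register a processing-time timer for the current key
                // (the 10 s delay is an arbitrary value chosen for illustration)
                long fireAt = ctx.timerService().currentProcessingTime() + 10_000L;
                ctx.timerService().registerProcessingTimeTimer(fireAt);
            }

            // Invoked when a timer registered for the current key fires
            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                out.collect(ctx.getCurrentKey() + " timer fired at " + timestamp);
            }
        }
);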