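I. Watermark Overview and Purpose
1. Watermarks are used to handle out-of-order events, and they take effect only in combination with windows.
2. Events can be disordered in two ways: out-of-order events and late elements.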
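II. Watermark Strategy
1. Purpose of the watermark strategy
To use the watermark mechanism, you designate one field of the event as the event time and generate watermarks from that event time.
The WatermarkStrategy interface provides two factory methods for this (only createWatermarkGenerator is abstract; createTimestampAssigner has a default implementation):
1. TimestampAssigner createTimestampAssigner(TimestampAssignerSupplier.Context context)
Instantiates a timestamp assigner for this strategy, i.e. the component that extracts the event time.
2. WatermarkGenerator createWatermarkGenerator(WatermarkGeneratorSupplier.Context context)
Instantiates a watermark generator for this strategy.
2. WatermarkStrategy source code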
package org.apache.flink.api.common.eventtime;
import org.apache.flink.annotation.Public;
import java.io.Serializable;
import java.time.Duration;
import static org.apache.flink.util.Preconditions.checkArgument;
import static org.apache.flink.util.Preconditions.checkNotNull;
/**
* The WatermarkStrategy defines how to generate {@link Watermark}s in the stream sources. The
* WatermarkStrategy is a builder/factory for the {@link WatermarkGenerator} that generates the
* watermarks and the {@link TimestampAssigner} which assigns the internal timestamp of a record.
*
* <p>This interface is split into three parts: 1) methods that an implementor of this interface
* needs to implement, 2) builder methods for building a {@code WatermarkStrategy} on a base
* strategy, 3) convenience methods for constructing a {code WatermarkStrategy} for common built-in
* strategies or based on a {@link WatermarkGeneratorSupplier}
*
* <p>Implementors of this interface need only implement {@link #createWatermarkGenerator(WatermarkGeneratorSupplier.Context)}.
* Optionally, you can implement {@link #createTimestampAssigner(TimestampAssignerSupplier.Context)}.
*
* <p>The builder methods, like {@link #withIdleness(Duration)} or {@link
* #createTimestampAssigner(TimestampAssignerSupplier.Context)} create a new {@code
* WatermarkStrategy} that wraps and enriches a base strategy. The strategy on which the method is
* called is the base strategy.
*
* <p>The convenience methods, for example {@link #forBoundedOutOfOrderness(Duration)}, create a
* {@code WatermarkStrategy} for common built in strategies.
*
* <p>This interface is {@link Serializable} because watermark strategies may be shipped
* to workers during distributed execution.
*/
@Public
public interface WatermarkStrategy<T> extends
TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {
// ------------------------------------------------------------------------
// Methods that implementors need to implement.
// ------------------------------------------------------------------------
/**
* Instantiates a WatermarkGenerator that generates watermarks according to this strategy.
*/
@Override
WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);
/**
* Instantiates a {@link TimestampAssigner} for assigning timestamps according to this
* strategy.
*/
@Override
default TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
// By default, this is {@link RecordTimestampAssigner},
// for cases where records come out of a source with valid timestamps, for example from Kafka.
return new RecordTimestampAssigner<>();
}
// ------------------------------------------------------------------------
// Builder methods for enriching a base WatermarkStrategy
// ------------------------------------------------------------------------
/**
* Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
* {@link TimestampAssigner} (via a {@link TimestampAssignerSupplier}).
*
* <p>You can use this when a {@link TimestampAssigner} needs additional context, for example
* access to the metrics system.
*
* <pre>
* {@code WatermarkStrategy<Object> wmStrategy = WatermarkStrategy
* .forMonotonousTimestamps()
* .withTimestampAssigner((ctx) -> new MetricsReportingAssigner(ctx));
* }</pre>
*/
default WatermarkStrategy<T> withTimestampAssigner(TimestampAssignerSupplier<T> timestampAssigner) {
checkNotNull(timestampAssigner, "timestampAssigner");
return new WatermarkStrategyWithTimestampAssigner<>(this, timestampAssigner);
}
/**
* Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
* {@link SerializableTimestampAssigner}.
*
* <p>You can use this in case you want to specify a {@link TimestampAssigner} via a lambda
* function.
*
* <pre>
* {@code WatermarkStrategy<CustomObject> wmStrategy = WatermarkStrategy
* .forMonotonousTimestamps()
* .withTimestampAssigner((event, timestamp) -> event.getTimestamp());
* }</pre>
*/
default WatermarkStrategy<T> withTimestampAssigner(SerializableTimestampAssigner<T> timestampAssigner) {
checkNotNull(timestampAssigner, "timestampAssigner");
return new WatermarkStrategyWithTimestampAssigner<>(this,
TimestampAssignerSupplier.of(timestampAssigner));
}
/**
* Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
* created {@link WatermarkGenerator}.
*
* <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
* stream for that amount of time, then that partition is considered "idle" and will not hold
* back the progress of watermarks in downstream operators.
*
* <p>Idleness can be important if some partitions have little data and might not have events
* during some periods. Without idleness, these streams can stall the overall event time
* progress of the application.
*/
default WatermarkStrategy<T> withIdleness(Duration idleTimeout) {
checkNotNull(idleTimeout, "idleTimeout");
checkArgument(!(idleTimeout.isZero() || idleTimeout.isNegative()),
"idleTimeout must be greater than zero");
return new WatermarkStrategyWithIdleness<>(this, idleTimeout);
}
// ------------------------------------------------------------------------
// Convenience methods for common watermark strategies
// ------------------------------------------------------------------------
/**
* Creates a watermark strategy for situations with monotonously ascending timestamps.
*
* <p>The watermarks are generated periodically and tightly follow the latest
* timestamp in the data. The delay introduced by this strategy is mainly the periodic interval
* in which the watermarks are generated.
*
* @see AscendingTimestampsWatermarks
*/
static <T> WatermarkStrategy<T> forMonotonousTimestamps() {
return (ctx) -> new AscendingTimestampsWatermarks<>();
}
/**
* Creates a watermark strategy for situations where records are out of order, but you can place
* an upper bound on how far the events are out of order. An out-of-order bound B means that
* once the an event with timestamp T was encountered, no events older than {@code T - B} will
* follow any more.
*
* <p>The watermarks are generated periodically. The delay introduced by this watermark
* strategy is the periodic interval length, plus the out of orderness bound.
*
* @see BoundedOutOfOrdernessWatermarks
*/
static <T> WatermarkStrategy<T> forBoundedOutOfOrderness(Duration maxOutOfOrderness) {
return (ctx) -> new BoundedOutOfOrdernessWatermarks<>(maxOutOfOrderness);
}
/**
* Creates a watermark strategy based on an existing {@link WatermarkGeneratorSupplier}.
*/
static <T> WatermarkStrategy<T> forGenerator(WatermarkGeneratorSupplier<T> generatorSupplier) {
return generatorSupplier::createWatermarkGenerator;
}
/**
* Creates a watermark strategy that generates no watermarks at all. This may be useful in
* scenarios that do pure processing-time based stream processing.
*/
static <T> WatermarkStrategy<T> noWatermarks() {
return (ctx) -> new NoWatermarksGenerator<>();
}
}
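3. Watermark strategy test case
Note: when the source has multiple partitions, always set withIdleness on the strategy; otherwise a partition that receives no data keeps the watermark from advancing.
SingleOutputStreamOperator<Tuple2<String, Long>> watermarkDS = inputMapDS
.assignTimestampsAndWatermarks(new MyWatermarkStrategy().withIdleness(Duration.ofSeconds(10))); // 30 s of allowed out-of-orderness (set in the generator), 10 s idleness timeout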
package com.scallion.transform;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.java.tuple.Tuple2;
/**
* created by gaowj.
* created on 2021-06-01.
* function: watermark strategy
* origin ->
*/
public class MyWatermarkStrategy implements WatermarkStrategy<Tuple2<String, Long>> {
// create the assigner that extracts the event time
@Override
public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new MyTimestampAssigner(); // implements the TimestampAssigner interface
}
// create the generator that emits watermarks
@Override
public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new BoundedOutOfOrdernessGenerator(); // implements the WatermarkGenerator interface
}
}
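III. Custom Event-Time Assigner
When processing data by event time, each element must be given an event-time timestamp explicitly, which is done by implementing the TimestampAssigner interface.
1. TimestampAssigner interface source code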
package org.apache.flink.api.common.eventtime;
import org.apache.flink.annotation.Public;
/**
* A {@code TimestampAssigner} assigns event time timestamps to elements.
* These timestamps are used by all functions that operate on event time,
* for example event time windows.
*
* <p>Timestamps can be an arbitrary {@code long} value, but all built-in implementations
* represent it as the milliseconds since the Epoch (midnight, January 1, 1970 UTC),
* the same way as {@link System#currentTimeMillis()} does it.
*
* @param <T> The type of the elements to which this assigner assigns timestamps.
*/
@Public
@FunctionalInterface
public interface TimestampAssigner<T> {
/**
* The value that is passed to {@link #extractTimestamp} when there is no previous timestamp
* attached to the record.
*/
long NO_TIMESTAMP = Long.MIN_VALUE;
/**
* Assigns a timestamp to an element, in milliseconds since the Epoch. This is independent of
* any particular time zone or calendar.
*
* <p>The method is passed the previously assigned timestamp of the element.
* That previous timestamp may have been assigned from a previous assigner. If the element did
* not carry a timestamp before, this value is {@link #NO_TIMESTAMP} (= {@code Long.MIN_VALUE}:
* {@value Long#MIN_VALUE}).
*
* @param element The element that the timestamp will be assigned to.
* @param recordTimestamp The current internal timestamp of the element,
* or a negative value, if no timestamp has been assigned yet.
* @return The new timestamp.
*/
long extractTimestamp(T element, long recordTimestamp);
}
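2. TimestampAssigner test case
public class MyWatermarkStrategy implements WatermarkStrategy<Tuple2<String, Long>> {
// create the assigner that extracts the event time
@Override
public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new MyTimestampAssigner(); // implements the TimestampAssigner interface
}
// createWatermarkGenerator(...) is omitted in this excerpt; see the full MyWatermarkStrategy listing
}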
package com.scallion.transform;
import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.java.tuple.Tuple2;
/**
* created by gaowj.
* created on 2021-06-01.
* function: assigns the event time
* origin ->
*/
public class MyTimestampAssigner implements TimestampAssigner<Tuple2<String, Long>> {
@Override
public long extractTimestamp(Tuple2<String, Long> event, long l) {
return event.f1;
}
}
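IV. Custom Watermark Generator
1. WatermarkGenerator interface source code
A custom watermark generator implements the WatermarkGenerator interface and fills in its methods according to the business logic.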
package org.apache.flink.api.common.eventtime;
import org.apache.flink.annotation.Public;
import org.apache.flink.api.common.ExecutionConfig;
/**
* The {@code WatermarkGenerator} generates watermarks either based on events or
* periodically (in a fixed interval).
*
* <p><b>Note:</b> This WatermarkGenerator subsumes the previous distinction between the
* {@code AssignerWithPunctuatedWatermarks} and the {@code AssignerWithPeriodicWatermarks}.
*/
@Public
public interface WatermarkGenerator<T> {
/**
* Called for every event, allows the watermark generator to examine and remember the
* event timestamps, or to emit a watermark based on the event itself.
*/
void onEvent(T event, long eventTimestamp, WatermarkOutput output);
/**
* Called periodically, and might emit a new watermark, or not.
*
* <p>The interval in which this method is called and Watermarks are generated
* depends on {@link ExecutionConfig#getAutoWatermarkInterval()}.
*/
void onPeriodicEmit(WatermarkOutput output);
}
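There are essentially two ways to generate watermarks: periodic and punctuated.
1. A periodic generator usually observes incoming events through onEvent() and then emits a watermark when the framework calls onPeriodicEmit().
2. A punctuated generator inspects events in onEvent() and waits for special marker events or punctuations in the stream that carry watermark information. When it sees one, it emits a watermark immediately. Punctuated generators normally do not emit watermarks from onPeriodicEmit().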
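To make the difference concrete, here is a minimal plain-Java sketch that mimics the two styles without depending on Flink. The class WatermarkModes, its method names, the sample timestamps, and the "tick every N events" stand-in for the periodic timer are all assumptions made for illustration; only the rule "largest timestamp seen minus the out-of-orderness bound" is taken from the generators in this article.

```java
import java.util.ArrayList;
import java.util.List;

public class WatermarkModes {
    /** Periodic style: track the max timestamp on every event, emit only on a "tick". */
    public static List<Long> periodic(long[] timestamps, long bound, int tickEvery) {
        List<Long> emitted = new ArrayList<>();
        long currentMax = 0L; // same starting value as an uninitialized long field
        for (int i = 0; i < timestamps.length; i++) {
            currentMax = Math.max(currentMax, timestamps[i]); // the role of onEvent()
            if ((i + 1) % tickEvery == 0) {                   // hypothetical stand-in for the timer
                emitted.add(currentMax - bound);              // the role of onPeriodicEmit()
            }
        }
        return emitted;
    }

    /** Punctuated style: emit immediately, but only for events flagged as markers. */
    public static List<Long> punctuated(long[] timestamps, boolean[] isMarker) {
        List<Long> emitted = new ArrayList<>();
        for (int i = 0; i < timestamps.length; i++) {
            if (isMarker[i]) {
                emitted.add(timestamps[i]);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] ts = {1000L, 4000L, 2000L, 7000L};
        System.out.println(periodic(ts, 3000L, 2));                                    // [1000, 4000]
        System.out.println(punctuated(ts, new boolean[] {false, true, false, false})); // [4000]
    }
}
```

In the periodic run the out-of-order event 2000 never lowers the watermark, because only the running maximum is tracked; in the punctuated run nothing is emitted until the flagged event arrives.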
2. Custom periodic watermark generator
package com.scallion.transform;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.java.tuple.Tuple2;
/**
* created by gaowj.
* created on 2021-06-01.
* function: a watermark generator for data that is out of order by a bounded amount
*/
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<Tuple2<String, Long>> {
private final long maxOutOfOrderness = 30000; // allow 30 seconds of out-of-orderness
private long currentMaxTimestamp;
@Override
public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness));
}
}
/**
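* This generator emits watermarks that lag behind processing time by a fixed amount. It assumes that elements arrive in Flink after a bounded delay.
*/
public class TimeLagWatermarkGenerator implements WatermarkGenerator<MyEvent> {
private final long maxTimeLag = 5000; // 5 seconds
@Override
public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
// nothing to do here: this generator works off processing time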
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(System.currentTimeMillis() - maxTimeLag));
}
}
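3. Custom punctuated watermark generator
A punctuated watermark generator observes the stream of events and emits a watermark whenever it sees a special element that carries watermark information.
The following shows one way to implement a punctuated generator that emits a watermark whenever an event carries a certain marker: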
public class PunctuatedAssigner implements WatermarkGenerator<MyEvent> {
@Override
public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
if (event.hasWatermarkMarker()) {
output.emitWatermark(new Watermark(event.getWatermarkTimestamp()));
}
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
// nothing to do here: watermarks are already emitted from onEvent
}
}
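V. Where a Watermark Strategy Can Be Applied
1. On the Kafka connector
I have rarely run into this in production; I will write it up later.
2. On an operator
VI. Points to Note When Using Watermarks
1. Window trigger conditions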
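Note that a window computation is not triggered merely because data flows in and watermarks are generated. Both of the following conditions must hold at the same time:
1. watermark >= window_end_time (strictly, Flink's EventTimeTrigger fires once the watermark reaches window.maxTimestamp(), which is window_end_time - 1 ms)
2. The window contains data, i.e. at least one element falls in [window_start_time, window_end_time)
The reasoning is straightforward: for a window to fire, all data up to the window end should have arrived. A watermark asserts that all data with timestamps up to the watermark has arrived, so once the watermark reaches window_end_time, everything that belongs to the window is assumed to be in.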
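The two conditions can be checked by hand with a small plain-Java sketch. It does not use Flink; the class WindowTriggerCheck and its methods are hypothetical, the window start formula assumes a tumbling window with offset 0 and non-negative timestamps, and the firing rule follows the EventTimeTrigger comparison against window_end_time - 1. The sample timestamp is the one used in FlinkUtil.getText() in the complete test code, and the watermark is computed as "event timestamp minus 30 s", as in the BoundedOutOfOrdernessGenerator of this article.

```java
public class WindowTriggerCheck {
    /** Start of the tumbling window that contains ts (offset 0, ts >= 0). */
    public static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** True once the window has data and the watermark has reached its last timestamp. */
    public static boolean shouldFire(long watermark, long windowEnd, boolean hasData) {
        return hasData && watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long size = 30_000L;   // 30-second tumbling window
        long bound = 30_000L;  // 30 seconds of allowed out-of-orderness

        long first = 1487225040000L;
        long start = windowStart(first, size);
        long end = start + size;
        System.out.println("[" + start + ", " + end + ")"); // [1487225040000, 1487225070000)

        // After the first event the watermark is still far behind the window end.
        System.out.println(shouldFire(first - bound, end, true)); // false

        // An event 60 s later moves the watermark to exactly the window end.
        long later = 1487225100000L;
        System.out.println(shouldFire(later - bound, end, true)); // true
    }
}
```

With a 30-second window and a 30-second bound, the first window therefore fires only after an event arrives whose timestamp is roughly 60 seconds past the window start.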
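2. Handling idle sources
This problem is serious. I ran into it in the complete test case at the end of this article: I set the downstream parallelism to 10 but started only one source thread, so only 1 of the 10 downstream parallel instances had data,
i.e. env.setParallelism(10).
Because a downstream operator computes its watermark as the minimum of the watermarks from all of its upstream parallel inputs, its watermark never changes.
So during my tests, no matter how large I made the time field in the data, the computation never fired; it only fired normally after I changed the downstream parallelism to 1, i.e. env.setParallelism(1).
The officially recommended way to handle idle sources, of course, is idleness detection:
i.e. .withIdleness(Duration.ofSeconds(10))
SingleOutputStreamOperator<Tuple2<String, Long>> watermarkDS = inputMapDS
.assignTimestampsAndWatermarks(new MyWatermarkStrategy().withIdleness(Duration.ofSeconds(10))); // 30 s of allowed out-of-orderness (set in the generator), 10 s idleness timeout
This means that if a partition produces no data for 10 seconds, that partition is left out of the watermark computation.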
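The effect of the minimum rule, and of skipping idle inputs, can be reproduced with a few lines of plain Java. This is only a model, not Flink code: the class MinWatermark and the combine method are hypothetical, and returning Long.MIN_VALUE when every input is idle is a choice made for this sketch. The value -30000 is what the BoundedOutOfOrdernessGenerator in this article emits before it has seen any event (0 - 30000).

```java
import java.util.Arrays;

public class MinWatermark {
    /** Combined watermark = minimum over all inputs; idle inputs are skipped if skipIdle is set. */
    public static long combine(long[] watermarks, boolean[] idle, boolean skipIdle) {
        long min = Long.MAX_VALUE;
        boolean any = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (skipIdle && idle[i]) {
                continue; // an idle input no longer holds the watermark back
            }
            min = Math.min(min, watermarks[i]);
            any = true;
        }
        return any ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] watermarks = new long[10];
        boolean[] idle = new boolean[10];
        Arrays.fill(watermarks, -30_000L); // inputs that have never seen an event
        Arrays.fill(idle, true);
        watermarks[0] = 1487225070000L;    // the only input that receives data
        idle[0] = false;

        System.out.println(combine(watermarks, idle, false)); // -30000
        System.out.println(combine(watermarks, idle, true));  // 1487225070000
    }
}
```

Without idleness detection the nine silent inputs pin the combined watermark at -30000, so no window can ever fire; once they are treated as idle, the single active input alone determines the watermark.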
VII. Complete Test Code
1. Entry class
package com.scallion.entry;
import com.scallion.job.TumblingWindowAccumulatingJob;
import com.scallion.utils.FlinkUtil;
/**
* created by gaowj.
* created on 2021-05-28.
* function: watermark allowedLateness RichWindowFunction
* origin -> https://blog.csdn.net/lmalds/article/details/55259718
*/
public class TumblingWindowAccumulatingTest {
public static void main(String[] args) {
FlinkUtil.run(new TumblingWindowAccumulatingJob());
FlinkUtil.execution("TumblingWindowAccumulatingTest");
}
}
1.1. Flink utility class
package com.scallion.utils;
import com.scallion.common.Common;
import com.scallion.job.Job;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
/**
* created by gaowj.
* created on 2021-03-01.
* function: Flink context configuration and source data access
* origin ->
*/
public class FlinkUtil {
private static StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
static {
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.enableCheckpointing(1000);
env.setParallelism(10);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3,
Time.of(10, TimeUnit.SECONDS)
));
}
public static void run(Job job) {
job.run();
}
public static DataStream<String> getSocketTextStream(String ip, int port) {
DataStreamSource<String> source = env.socketTextStream(ip, port);
// DataStreamSource<String> source = env.fromElements("a a a b c c c d d f h");
return source;
}
public static DataStream<String> getSocketTextStream() {
DataStreamSource<String> source = env.socketTextStream(Common.SOCKET_IP, Common.SOCKET_PORT);
return source;
}
public static DataStream<String> getText() {
DataStreamSource<String> source = env.fromElements("key2,1487225040000");
return source;
}
public static DataStream<String> getKafkaStream(String broker, String topic, String groupId) {
Properties prop = new Properties();
prop.setProperty("bootstrap.servers", broker);
prop.setProperty("group.id", groupId);
FlinkKafkaConsumer011<String> kafkaConsumer = new FlinkKafkaConsumer011<>(topic, new SimpleStringSchema(), prop);
kafkaConsumer.setStartFromLatest();
DataStreamSource<String> source = env.addSource(kafkaConsumer).setParallelism(1);
return source;
}
public static void execution(String jobName) {
try {
env.execute(jobName);
} catch (Exception e) {
e.printStackTrace();
}
}
}
2. Job class
package com.scallion.job;
import com.scallion.transform.MyWatermarkStrategy;
import com.scallion.transform.WordCountProcessWindowFunction;
import com.scallion.utils.FlinkUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.time.Duration;
/**
* created by gaowj.
* created on 2021-05-28.
* function: watermark allowedLateness RichWindowFunction
* origin -> https://blog.csdn.net/lmalds/article/details/55259718
*/
public class TumblingWindowAccumulatingJob implements Job {
@Override
public void run() {
/**
* Source
*/
DataStream<String> sourceDS = FlinkUtil.getSocketTextStream();
/**
* Transform
*/
SingleOutputStreamOperator<Tuple2<String, Long>> inputMapDS = sourceDS
.map(new MapFunction<String, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(String input) throws Exception {
String[] split = input.split(",");
return new Tuple2<>(split[0], Long.parseLong(split[1]));
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> watermarkDS = inputMapDS
.assignTimestampsAndWatermarks(new MyWatermarkStrategy().withIdleness(Duration.ofSeconds(10))); // 30 s of allowed out-of-orderness (set in the generator), 10 s idleness timeout
SingleOutputStreamOperator<String> outDS = watermarkDS
.keyBy(tuple -> tuple.f0)
.window(TumblingEventTimeWindows.of(Time.seconds(30))) // 30-second tumbling window
.allowedLateness(Time.seconds(60)) // accept late data for up to 60 seconds
.process(new WordCountProcessWindowFunction());
/**
* Sink
*/
outDS.print();
}
}
3. Watermark strategy class
package com.scallion.transform;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.java.tuple.Tuple2;
/**
* created by gaowj.
* created on 2021-06-01.
* function: watermark strategy
* origin ->
*/
public class MyWatermarkStrategy implements WatermarkStrategy<Tuple2<String, Long>> {
// create the assigner that extracts the event time
@Override
public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new MyTimestampAssigner(); // implements the TimestampAssigner interface
}
// create the generator that emits watermarks
@Override
public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new BoundedOutOfOrdernessGenerator(); // implements the WatermarkGenerator interface
}
}
4. Event-time assigner
package com.scallion.transform;
import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.java.tuple.Tuple2;
/**
* created by gaowj.
* created on 2021-06-01.
* function: assigns the event time
* origin ->
*/
public class MyTimestampAssigner implements TimestampAssigner<Tuple2<String, Long>> {
@Override
public long extractTimestamp(Tuple2<String, Long> event, long l) {
return event.f1;
}
}
5. Watermark generator
package com.scallion.transform;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.java.tuple.Tuple2;
/**
* created by gaowj.
* created on 2021-06-01.
* function: a watermark generator for data that is out of order by a bounded amount
*/
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<Tuple2<String, Long>> {
private final long maxOutOfOrderness = 30000; // allow 30 seconds of out-of-orderness
private long currentMaxTimestamp;
@Override
public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness));
}
}
6. Window function
package com.scallion.transform;
import com.scallion.utils.TimeUtil;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.collect.Iterables;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
/**
* created by gaowj.
* created on 2021-06-01.
* function: counts the elements in each window
* origin ->
*/
public class WordCountProcessWindowFunction extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
ValueState<Integer> state;
int count;
@Override
public void open(Configuration parameters) throws Exception {
// this state is global (per key) rather than per window; it is not cleared when a window ends
state = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("WordCountProcess test", Integer.class));
}
@Override
public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> collector) throws Exception {
int inputSize = Iterables.size(elements);
if (state.value() == null)
count = inputSize;
else
count = state.value() + inputSize;
state.update(count);
String res = "key:" + key + "\t"
+ "windowStart:" + TimeUtil.getTimestampToDate(context.window().getStart()) + "\t"
+ "windowEnd:" + TimeUtil.getTimestampToDate(context.window().getEnd()) + "\t"
+ "inputSize:" + inputSize + "\t"
+ "allCount:" + count;
collector.collect(res);
}
}