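Flink provides several layers of APIs, structured as follows:
The process function (ProcessFunction) is Flink's lowest-level API. It does not define any specific operator; everything goes through a single, unified process operation. Inside a process function you work directly with the most basic building blocks of a stream: events, state, and time.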
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
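Flink provides 8 different process functions:
- ProcessFunction: the most basic process function, passed as the argument when process() is called directly on a DataStream.
- KeyedProcessFunction: the process function for a stream partitioned by key, passed as the argument when process() is called on a KeyedStream. Timers and keyed state are only available on a KeyedStream.
- ProcessWindowFunction: the process function applied after windowing, and the typical full-window function. It is passed as the argument when process() is called on a WindowedStream.
- ProcessAllWindowFunction: also applied after windowing, passed as the argument when process() is called on an AllWindowedStream.
- CoProcessFunction: the process function applied after two streams are connected (connect).
- ProcessJoinFunction: the process function applied after an interval join of two streams.
- BroadcastProcessFunction: the process function for a broadcast connected stream. Here the BroadcastConnectedStream is the result of connecting (connect) a plain DataStream that has not been keyed with a broadcast stream (BroadcastStream).
- KeyedBroadcastProcessFunction: the process function for a keyed broadcast connected stream, which is the result of connecting a KeyedStream with a broadcast stream (BroadcastStream).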
ProcessFunction
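For a stream that has not been keyed with keyBy, use the ProcessFunction class. For each element of the stream it can emit 0, 1, or many elements (similar to RichFlatMapFunction).
● ProcessFunction<IN, OUT>: IN is the input type, OUT is the output type.
● processElement: called once for every incoming record.
● Apply it with .process(new ProcessFunction<IN, OUT>() {...}).
On a non-keyed stream, a ProcessFunction still exposes the timer and state APIs, and code that uses them compiles. It fails at runtime, however: accessing keyed state throws an exception with the message "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation".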
DataStreamSource<Integer> sourceStream = env
.addSource(new SourceFunction<Integer>() {
private boolean running = true;
private Random random = new Random();
@Override
public void run(SourceContext<Integer> ctx) throws Exception {
while (running) {
ctx.collect(random.nextInt(100));
Thread.sleep(1000);
}
}
@Override
public void cancel() {
running = false;
}
});
sourceStream
.process(new ProcessFunction<Integer, Integer>() {
@Override
public void processElement(Integer value, Context ctx, Collector<Integer> out) throws Exception {
if (value % 10 == 1){
out.collect(value);
out.collect(value);
}
}
})
.print();
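The "0, 1, or many outputs per element" behaviour of the snippet above can be tried out without a Flink cluster. The following is a minimal plain-Java sketch, not Flink code: a `Consumer<Integer>` stands in for Flink's `Collector`, and the class name `EmitTwiceSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java stand-in for the processElement logic above (no Flink dependency).
// EmitTwiceSketch and its method names are hypothetical.
class EmitTwiceSketch {

    // Mirrors processElement: the Consumer plays the role of Flink's Collector.
    static void processElement(int value, Consumer<Integer> out) {
        if (value % 10 == 1) {
            out.accept(value); // first copy
            out.accept(value); // second copy
        }
        // any other value produces no output at all
    }

    // Feeds every input through processElement and gathers what was emitted.
    static List<Integer> run(List<Integer> input) {
        List<Integer> emitted = new ArrayList<>();
        for (int value : input) {
            processElement(value, emitted::add);
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(7, 11, 40, 91))); // prints [11, 11, 91, 91]
    }
}
```

Values whose last digit is 1 are emitted twice and everything else is dropped, which is why a single operator can both filter and multiply records.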
# Exception demo
public class ProcessFunctions {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// A ProcessFunction on a non-keyed stream cannot use keyed state or timers: the code compiles but fails at runtime
// Compute a running average (map-reduce style)
DataStreamSource<Integer> sourceStream = env
.addSource(new SourceFunction<Integer>() {
private boolean running = true;
private Random random = new Random();
@Override
public void run(SourceContext<Integer> ctx) throws Exception {
while (running) {
ctx.collect(random.nextInt(100));
Thread.sleep(1000);
}
}
@Override
public void cancel() {
running = false;
}
});
// Non-keyed ProcessFunction: the keyed state and timers used below fail at runtime
sourceStream
.process(new ProcessFunction<Integer, String>() {
private ValueState<Tuple2<Integer, Integer>> avgState;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
avgState = getRuntimeContext().getState(new ValueStateDescriptor<Tuple2<Integer, Integer>>("avg-state", Types.TUPLE(Types.INT, Types.INT)));
}
@Override
public void processElement(Integer value, Context ctx, Collector<String> out) throws Exception {
if (avgState.value() == null) {
avgState.update(Tuple2.of(value, 1));
} else {
avgState.update(Tuple2.of(avgState.value().f0 + value, avgState.value().f1 + 1));
}
ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 10L);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
super.onTimer(timestamp, ctx, out);
ctx.timerService().registerProcessingTimeTimer(timestamp + 10L);
// f0 is the running sum and f1 the count, so the average is f0 / f1
out.collect("avg = " + ((double) avgState.value().f0 / avgState.value().f1));
}
})
.print("avg:");
env.execute();
}
}
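Leaving the keyed-state restriction aside, the bookkeeping the demo is aiming for is a running average kept as a (sum, count) pair, where the average is the sum divided by the count. Here is a minimal plain-Java sketch of just that arithmetic, with no Flink dependency; the class name `RunningAverageSketch` is hypothetical.

```java
// Plain-Java sketch of the (sum, count) bookkeeping (no Flink dependency).
// RunningAverageSketch is a hypothetical name.
class RunningAverageSketch {
    private long sum;   // plays the role of f0 in the Tuple2 state
    private long count; // plays the role of f1 in the Tuple2 state

    // Equivalent of the state update in processElement.
    void add(int value) {
        sum += value;
        count++;
    }

    // Equivalent of the value reported from onTimer: sum divided by count.
    double average() {
        if (count == 0) {
            throw new IllegalStateException("no values seen yet");
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        RunningAverageSketch avg = new RunningAverageSketch();
        for (int v : new int[] {10, 20, 60}) {
            avg.add(v);
        }
        System.out.println("avg = " + avg.average()); // prints avg = 30.0
    }
}
```

In a real job the same pair would live in keyed state, which means calling keyBy before process and using a KeyedProcessFunction.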
KeyedProcessFunction
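For a keyed stream (KeyedStream) produced by keyBy, use KeyedProcessFunction.
● KeyedProcessFunction<KEY, IN, OUT>: KEY is the key type, IN is the input type, OUT is the output type.
● processElement: called once for every incoming record.
● onTimer: the timer callback, called when time reaches the timestamp of a registered timer.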
public class NumValueDataContinuousRiseKeyedProcessFunction {
public static void main(String[] args) throws Exception {
// Alert when the integer keeps rising for 1 s
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<Integer> randomNumSourceStream = env
.addSource(new SourceFunction<Integer>() {
private boolean running = true;
private Random random = new Random();
@Override
public void run(SourceContext<Integer> ctx) throws Exception {
while (running) {
ctx.collect(random.nextInt(100000));
try {
TimeUnit.MILLISECONDS.sleep(200);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
@Override
public void cancel() {
running = false;
}
});
randomNumSourceStream
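// Send every record to the same key, so a single parallel subtask handles them all
.keyBy(r -> 1)
// KEY, INPUT, OUTPUT
.process(new KeyedProcessFunction<Integer, Integer, String>() {
// Holds the previous value
private ValueState<Integer> lastValueState;
// Holds the timestamp of the pending timer, if any
private ValueState<Long> timerTs;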
private final Long oneS = 1000L;
@Override
public void open(Configuration parameters) throws Exception {
lastValueState = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("lastValueState", Types.INT));
timerTs = getRuntimeContext().getState(new ValueStateDescriptor<Long>("timerTs", Types.LONG));
}
@Override
public void processElement(Integer value, Context ctx, Collector<String> out) throws Exception {
// =====================
Integer lastVal = lastValueState.value() == null ? Integer.MIN_VALUE : lastValueState.value();
// Update the previous value
lastValueState.update(value);
Long ts = null;
if (timerTs.value() != null){
ts = timerTs.value();
}
if (lastVal >= value){
if (ts != null){
System.out.println("Current value is not greater than the previous one, curr = " + value + ", last = " + lastVal + ", deleting timer = " + ts);
ctx.timerService().deleteProcessingTimeTimer(ts);
timerTs.clear();
}
}else {
System.out.println("Current value is greater than the previous one -->>>, curr = " + value + ", last = " + lastVal);
if (ts == null){
long timer = ctx.timerService().currentProcessingTime() + oneS;
System.out.println("Current value is greater than the previous one and no timer is pending, curr = " + value + ", last = " + lastVal + ", registering timer = " + timer);
ctx.timerService().registerProcessingTimeTimer(timer);
timerTs.update(timer);
}
}
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
out.collect("The value has been rising continuously for one second");
timerTs.clear();
}
})
.print();
env.execute();
}
}
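The interplay between the two pieces of state and the timer in the example above is easier to follow when processing time is passed in by hand. The sketch below is a simplified plain-Java model, not Flink code: plain fields stand in for `lastValueState` and `timerTs`, `advanceTo` plays the role of Flink invoking `onTimer`, and the class name `RiseAlertSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the "rising for 1 s" logic with explicit processing time.
// RiseAlertSketch and its method names are hypothetical.
class RiseAlertSketch {
    private Integer lastValue; // stands in for lastValueState
    private Long timerTs;      // stands in for timerTs; null means no timer is pending
    private final List<Long> alerts = new ArrayList<>();

    // Fires the pending timer once processing time has reached it (Flink would call onTimer).
    void advanceTo(long now) {
        if (timerTs != null && timerTs <= now) {
            alerts.add(timerTs); // record the time at which the alert fired
            timerTs = null;      // same as timerTs.clear() in onTimer
        }
    }

    // Same branching as processElement in the example above.
    void processElement(long now, int value) {
        advanceTo(now);
        int previous = lastValue == null ? Integer.MIN_VALUE : lastValue;
        lastValue = value;
        if (previous >= value) {
            timerTs = null;        // the rise was interrupted: delete the timer
        } else if (timerTs == null) {
            timerTs = now + 1000L; // a rise starts: register a timer one second ahead
        }
    }

    List<Long> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        RiseAlertSketch s = new RiseAlertSketch();
        s.processElement(0, 5);    // first value counts as a rise: timer set for 1000
        s.processElement(200, 3);  // drop: timer deleted
        s.processElement(400, 4);  // rise: timer set for 1400
        s.processElement(800, 6);  // still rising: timer kept
        s.processElement(1200, 9); // still rising: timer kept
        s.advanceTo(1400);         // timer fires
        System.out.println(s.alerts()); // prints [1400]
    }
}
```

Only one timer is ever pending, and it survives exactly as long as no value fails to exceed its predecessor, which is what makes the alert mean "one full second of uninterrupted rise".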
ProcessWindowFunction
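For a WindowedStream (a keyed stream after window assignment), use ProcessWindowFunction.
● ProcessWindowFunction<IN, OUT, KEY, W extends Window>: IN is the input type, OUT is the output type, KEY is the key type, and W is the window type.
● process(KEY key, Context context, Iterable<IN> elements, Collector<OUT> out): called when the window fires, with all elements of that window for the given key.
● clear(Context context): called when the window is purged, to clean up any per-window state.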
public class WinElemsCountProcessWindowFunction {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple2<String, Integer>> withWatermarkStream = env
.fromCollection(
Arrays.asList(
Tuple2.of("Alick", 1),
Tuple2.of("Alick", 2),
Tuple2.of("Alick", 3),
Tuple2.of("BOb", 3),
Tuple2.of("BOb", 5),
Tuple2.of("BOb", 7),
Tuple2.of("BOb", 10),
Tuple2.of("Alick", 7)
)
)
.assignTimestampsAndWatermarks(
new WatermarkStrategy<Tuple2<String, Integer>>() {
@Override
public WatermarkGenerator<Tuple2<String, Integer>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new WatermarkGenerator<Tuple2<String, Integer>>() {
private Long delay = 0L;
private Long watermark = -Long.MAX_VALUE + delay + 1L;
@Override
public void onEvent(Tuple2<String, Integer> event, long eventTimestamp, WatermarkOutput output) {
// Track the largest event timestamp in milliseconds (event.f1 itself is in seconds)
watermark = Math.max(eventTimestamp, watermark);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(watermark - delay - 1L));
}
};
}
@Override
public TimestampAssigner<Tuple2<String, Integer>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new TimestampAssigner<Tuple2<String, Integer>>() {
@Override
public long extractTimestamp(Tuple2<String, Integer> element, long recordTimestamp) {
return element.f1 * 1000L;
}
};
}
}
);
// .assignTimestampsAndWatermarks(
// WatermarkStrategy
// .<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(0L))
// .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Integer>>() {
// @Override
// public long extractTimestamp(Tuple2<String, Integer> element, long recordTimestamp) {
// return element.f1 * 1000L;
// }
// })
// );
KeyedStream<Tuple2<String, Integer>, String> keyedStream = withWatermarkStream
.keyBy(r -> r.f0);
WindowedStream<Tuple2<String, Integer>, String, TimeWindow> winStream = keyedStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)));
SingleOutputStreamOperator<String> countWinElemsStream = winStream
.process(
new ProcessWindowFunction<Tuple2<String, Integer>, String, String, TimeWindow>() {
@Override
public void process(String key, Context context, Iterable<Tuple2<String, Integer>> elements, Collector<String> out) throws Exception {
Timestamp winStart = new Timestamp(context.window().getStart());
Timestamp winEnd = new Timestamp(context.window().getEnd());
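// Count by iterating: getExactSizeIfKnown() returns -1 when the Iterable does not report its size
long count = 0L;
for (Tuple2<String, Integer> ignored : elements) {
count++;
}
out.collect("key = " + key + " , win [ " + winStart + " - " + winEnd + " ) contains " + count + " elements");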
}
}
);
countWinElemsStream.print();
env.execute();
}
}
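To see which window each sample record of the example above falls into, the window assignment can be reproduced by hand. The sketch below is plain Java, not Flink code, and it ignores watermarks and late data. It assumes non-negative timestamps and a window offset of 0, so the window start is simply the timestamp rounded down to a multiple of the window size. The class name `TumblingWindowCountSketch` is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hand-rolled per-key tumbling-window count for the sample data (no Flink dependency).
// TumblingWindowCountSketch is a hypothetical name.
class TumblingWindowCountSketch {
    static final long WINDOW_SIZE_MS = 5_000L;

    // Start of the 5 s tumbling window containing a non-negative timestamp (offset 0).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // Counts records per (key, window); the i-th key is paired with the i-th second value.
    static Map<String, Long> count(String[] keys, int[] seconds) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            long ts = seconds[i] * 1000L; // same conversion as extractTimestamp
            long start = windowStart(ts);
            String window = keys[i] + " [" + start + ", " + (start + WINDOW_SIZE_MS) + ")";
            counts.merge(window, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"Alick", "Alick", "Alick", "BOb", "BOb", "BOb", "BOb", "Alick"};
        int[] seconds = {1, 2, 3, 3, 5, 7, 10, 7};
        count(keys, seconds).forEach((window, n) -> System.out.println(window + " -> " + n));
    }
}
```

For the sample data this gives 3 elements for Alick in [0, 5000) and 1 in [5000, 10000), and for BOb 1 element in [0, 5000), 2 in [5000, 10000), and 1 in [10000, 15000).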
ProcessAllWindowFunction
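For an AllWindowedStream (a non-keyed stream after windowAll), use ProcessAllWindowFunction.
● ProcessAllWindowFunction<IN, OUT, W extends Window>: IN is the input type, OUT is the output type, and W is the window type.
● process(Context context, Iterable<IN> elements, Collector<OUT> out): runs when the window fires.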
public class WinElemsCountProcessAllWindowFunction {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// Event and ClickSource are user-defined classes that are not shown in this section
SingleOutputStreamOperator<Event> withWatermarkStream = env
.addSource(new ClickSource())
.assignTimestampsAndWatermarks(
new WatermarkStrategy<Event>() {
@Override
public WatermarkGenerator<Event> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new WatermarkGenerator<Event>() {
private long delay = 0L;
private long maxWatermark = -Long.MAX_VALUE + delay + 1L;
@Override
public void onEvent(Event event, long eventTimestamp, WatermarkOutput output) {
maxWatermark = Math.max(event.timestamp, maxWatermark);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(maxWatermark - delay - 1L));
}
};
}
@Override
public TimestampAssigner<Event> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new TimestampAssigner<Event>() {
@Override
public long extractTimestamp(Event element, long recordTimestamp) {
return element.timestamp;
}
};
}
}
);
AllWindowedStream<Event, GlobalWindow> allWindowedStream = withWatermarkStream
.windowAll(GlobalWindows.create());
SingleOutputStreamOperator<String> countAlWinElemsStream = allWindowedStream
.trigger(
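// Fire the window computation every 5 s
new Trigger<Event, GlobalWindow>() {
// 5 seconds in milliseconds (the value is 5 s even though the field is named tenS)
public final Long tenS = 5_000L;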
@Override
public TriggerResult onElement(Event element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
ValueState<Long> tenSTriggerState = ctx.getPartitionedState(new ValueStateDescriptor<Long>("tenSTriggerState", Types.LONG));
if (tenSTriggerState.value() == null) {
long timer = ctx.getCurrentWatermark() + tenS;
ctx.registerEventTimeTimer(timer);
tenSTriggerState.update(timer);
}
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
ValueState<Long> tenSTriggerState = ctx.getPartitionedState(new ValueStateDescriptor<Long>("tenSTriggerState", Types.LONG));
tenSTriggerState.clear();
return TriggerResult.FIRE;
}
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("=============================");
}
}
)
.process(
new ProcessAllWindowFunction<Event, String, GlobalWindow>() {
@Override
public void process(Context context, Iterable<Event> elements, Collector<String> out) throws Exception {
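// Count by iterating: getExactSizeIfKnown() returns -1 when the Iterable does not report its size
long count = 0L;
for (Event ignored : elements) {
count++;
}
out.collect("The window contains " + count + " elements");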
}
}
);
countAlWinElemsStream.print();
env.execute();
}
}
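One consequence of the trigger above is worth spelling out: it returns TriggerResult.FIRE, which evaluates the window but keeps its contents, so each firing of the global window reports a cumulative count. TriggerResult.FIRE_AND_PURGE would clear the contents after each firing. The sketch below is a deliberately simplified plain-Java model of that difference, not Flink code: the watermark is approximated by the timestamp of the incoming event, and a pending timer fires before that event is buffered. The class name `GlobalWindowFireSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of periodic firing on a global window (no Flink dependency).
// GlobalWindowFireSketch is a hypothetical name.
class GlobalWindowFireSketch {

    // Returns the element count reported at each firing.
    // purge = false models TriggerResult.FIRE, purge = true models FIRE_AND_PURGE.
    static List<Integer> countsPerFiring(long[] eventTimesMs, long intervalMs, boolean purge) {
        List<Integer> reported = new ArrayList<>();
        int buffered = 0;    // number of elements currently held by the window
        long nextFire = -1L; // -1 means no timer is registered
        for (long ts : eventTimesMs) {
            // Simplification: time advances to this event's timestamp before it is buffered.
            if (nextFire >= 0 && ts >= nextFire) {
                reported.add(buffered);
                if (purge) {
                    buffered = 0; // FIRE_AND_PURGE drops the window contents
                }
                nextFire = -1L;   // the timer state is cleared after firing
            }
            buffered++;
            if (nextFire < 0) {
                nextFire = ts + intervalMs; // register the next timer
            }
        }
        return reported;
    }

    public static void main(String[] args) {
        long[] events = {0L, 1_000L, 2_000L, 6_000L, 7_000L, 12_000L};
        System.out.println(countsPerFiring(events, 5_000L, false)); // prints [3, 5]
        System.out.println(countsPerFiring(events, 5_000L, true));  // prints [3, 2]
    }
}
```

With FIRE the second report includes the three elements already counted in the first one. If per-interval counts are wanted instead, purging or a tumbling window is the more natural fit.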
CoProcessFunction, ProcessJoinFunction, BroadcastProcessFunction, and KeyedBroadcastProcessFunction are covered in detail under multi-stream operations.
How ProcessFunction works internally is covered in detail under time and windows.
References
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/