Source version
flink-release-1.11.0
Code location: org.apache.flink.streaming.api.functions (including its co and windowing sub-packages)
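Flink provides eight Process Functions:
ProcessFunction: for a DataStream
KeyedProcessFunction: for a KeyedStream, i.e. the stream after keyBy
CoProcessFunction: for streams combined with connect
ProcessJoinFunction: for join operations on streams
BroadcastProcessFunction: for broadcast streams
KeyedBroadcastProcessFunction: for broadcast streams after keyBy
ProcessWindowFunction: for keyed windows; it receives all elements of a window at once (full-window processing, not incremental aggregation)
ProcessAllWindowFunction: for non-keyed windows (windowAll); it also receives all elements of a window at once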
Source analysis of KeyedProcessFunction and ProcessFunction
The source of KeyedProcessFunction and ProcessFunction is similar, so only KeyedProcessFunction is analyzed here.
KeyedProcessFunction structure
ProcessFunction structure
KeyedProcessFunction class structure
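1 Context
Information available in an invocation of processElement(Object, Context, Collector) or onTimer(long, OnTimerContext, Collector).
2 OnTimerContext
Information available in an invocation of onTimer(long, OnTimerContext, Collector).
3 processElement
Processes one element from the input stream. This function can emit zero or more elements through the Collector parameter, and can update internal state or set timers through the Context parameter.
4 onTimer
Called when a timer set using the TimerService fires.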
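KeyedProcessFunction operates on a KeyedStream.
It processes every element of the stream (each record is handled as it arrives) and emits zero, one, or more elements.
All Process Functions implement the RichFunction interface (KeyedProcessFunction does so by extending AbstractRichFunction), which provides lifecycle hooks and access to state through the runtime context,
so they all have methods such as open(), close(), and getRuntimeContext(). Querying the watermark and registering timers is done through the TimerService, not through RichFunction itself.
In addition, KeyedProcessFunction<K, I, O> provides two methods:
① processElement(I value, Context ctx, Collector<O> out): called for every element of the stream; results are emitted through the Collector.
The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit records to other streams (side outputs).
② onTimer(long timestamp, OnTimerContext ctx, Collector<O> out): a callback invoked when a previously registered timer fires.
The timestamp parameter is the time the firing timer was set for. The Collector receives the emitted results. Like the Context of processElement, OnTimerContext provides contextual information,
for example the time domain of the firing timer: event time or processing time.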
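The "zero, one, or more elements" contract of processElement can be illustrated without a Flink dependency. What follows is only a minimal sketch in plain Java: SketchCollector, SketchProcessFunction, WordSplitter, and ProcessElementSketch are hypothetical stand-ins invented for this illustration, not Flink classes, and the driver loop merely approximates how the runtime calls processElement once per element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Flink's Collector<T>: it only receives emitted records.
interface SketchCollector<T> {
    void collect(T record);
}

// Hypothetical stand-in for a process function: one call per input element.
abstract class SketchProcessFunction<I, O> {
    abstract void processElement(I value, SketchCollector<O> out);
}

// Splits a line into words, so one input element can yield zero, one, or many outputs.
class WordSplitter extends SketchProcessFunction<String, String> {
    @Override
    void processElement(String value, SketchCollector<String> out) {
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.collect(word);
            }
        }
    }
}

class ProcessElementSketch {
    // Drives the function roughly the way a runtime would: one processElement call per element.
    static List<String> run(List<String> input) {
        List<String> emitted = new ArrayList<>();
        WordSplitter fn = new WordSplitter();
        for (String element : input) {
            fn.processElement(element, emitted::add);
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("", "flink", "keyed process function")));
    }
}
```

With the three input lines used in main, the empty line emits nothing, "flink" emits one record, and the last line emits three, so four records are collected in total.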
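TimerService and timers
The TimerService held by Context and OnTimerContext has the following methods:
long currentProcessingTime(): returns the current processing time.
long currentWatermark(): returns the timestamp of the current watermark.
void registerProcessingTimeTimer(long timestamp): registers a processing-time timer for the current key. The timer fires when processing time reaches the given timestamp.
void registerEventTimeTimer(long timestamp): registers an event-time timer for the current key. The timer fires, and the callback runs, when the watermark is greater than or equal to the timer's timestamp.
void deleteProcessingTimeTimer(long timestamp): deletes a previously registered processing-time timer. If no timer with this timestamp exists, the call has no effect.
void deleteEventTimeTimer(long timestamp): deletes a previously registered event-time timer. If no timer with this timestamp exists, the call has no effect.
When a timer fires, the onTimer() callback is executed. Note that timers can only be used on keyed streams.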
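The event-time rules listed above (timers are scoped to a key, fire once the watermark is greater than or equal to their timestamp, and deleting a missing timer does nothing) can be modelled in a few lines of plain Java. This is a simplified sketch, not Flink's internal timer implementation: the class EventTimeTimerSketch, its explicit key parameter, and the "key@timestamp" result strings are assumptions made for illustration. In Flink the key is taken implicitly from the element being processed, and firing a timer invokes onTimer().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Simplified model of per-key event-time timers; NOT Flink's internal implementation.
class EventTimeTimerSketch {
    // timestamp -> keys that registered a timer for it; the TreeMap keeps timestamps ordered.
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    long currentWatermark() {
        return currentWatermark;
    }

    // Registering the same (key, timestamp) pair twice still results in a single timer.
    void registerEventTimeTimer(String key, long timestamp) {
        timers.computeIfAbsent(timestamp, t -> new TreeSet<>()).add(key);
    }

    // Deleting a timer that was never registered is a no-op.
    void deleteEventTimeTimer(String key, long timestamp) {
        TreeSet<String> keys = timers.get(timestamp);
        if (keys != null) {
            keys.remove(key);
            if (keys.isEmpty()) {
                timers.remove(timestamp);
            }
        }
    }

    // Advances the watermark and returns "key@timestamp" for every timer whose
    // timestamp is less than or equal to the new watermark, in timestamp order.
    List<String> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= currentWatermark) {
            Map.Entry<Long, TreeSet<String>> entry = timers.pollFirstEntry();
            for (String key : entry.getValue()) {
                fired.add(key + "@" + entry.getKey());
            }
        }
        return fired;
    }
}
```

For example, after registering a timer for key "a" at 100 twice and one for key "b" at 50, advancing the watermark to 49 fires nothing, while advancing it to 100 fires b@50 and then a@100 exactly once each.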
The KeyedProcessFunction source is as follows
package org.apache.flink.streaming.api.functions;
import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.functions.AbstractRichFunction;
import org.apache.flink.streaming.api.TimeDomain;
import org.apache.flink.streaming.api.TimerService;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
/**
 * A keyed function that processes elements of a stream.
 *
 * <p>For every element in the input stream {@link #processElement(Object, Context, Collector)}
 * is invoked. This can produce zero or more elements as output. Implementations can also
 * query the time and set timers through the provided {@link Context}. For firing timers
 * {@link #onTimer(long, OnTimerContext, Collector)} will be invoked. This can again produce
 * zero or more elements as output and register further timers.
 *
 * <p><b>NOTE:</b> Access to keyed state and timers (which are also scoped to a key) is only
 * available if the {@code KeyedProcessFunction} is applied on a {@code KeyedStream}.
 *
 * <p><b>NOTE:</b> A {@code KeyedProcessFunction} is always a
 * {@link org.apache.flink.api.common.functions.RichFunction}. Therefore, access to the
 * {@link org.apache.flink.api.common.functions.RuntimeContext} is always available and setup and
 * teardown methods can be implemented. See
 * {@link org.apache.flink.api.common.functions.RichFunction#open(org.apache.flink.configuration.Configuration)}
 * and {@link org.apache.flink.api.common.functions.RichFunction#close()}.
 *
 * @param <K> Type of the key.
 * @param <I> Type of the input elements.
 * @param <O> Type of the output elements.
 */
@PublicEvolving
public abstract class KeyedProcessFunction<K, I, O> extends AbstractRichFunction {
    private static final long serialVersionUID = 1L;

    /**
     * Process one element from the input stream.
     *
     * <p>This function can output zero or more elements using the {@link Collector} parameter
     * and also update internal state or set timers using the {@link Context} parameter.
     *
     * @param value The input value.
     * @param ctx A {@link Context} that allows querying the timestamp of the element and getting
     *            a {@link TimerService} for registering timers and querying the time. The
     *            context is only valid during the invocation of this method, do not store it.
     * @param out The collector for returning result values.
     *
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
     *                   to fail and may trigger recovery.
     */
    public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

    /**
     * Called when a timer set using {@link TimerService} fires.
     *
     * @param timestamp The timestamp of the firing timer.
     * @param ctx An {@link OnTimerContext} that allows querying the timestamp, the {@link TimeDomain}, and the key
     *            of the firing timer and getting a {@link TimerService} for registering timers and querying the time.
     *            The context is only valid during the invocation of this method, do not store it.
     * @param out The collector for returning result values.
     *
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
     *                   to fail and may trigger recovery.
     */
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {
    }

    /**
     * Information available in an invocation of {@link #processElement(Object, Context, Collector)}
     * or {@link #onTimer(long, OnTimerContext, Collector)}.
     */
    public abstract class Context {

        /**
         * Timestamp of the element currently being processed or timestamp of a firing timer.
         *
         * <p>This might be {@code null}, for example if the time characteristic of your program
         * is set to {@link org.apache.flink.streaming.api.TimeCharacteristic#ProcessingTime}.
         */
        public abstract Long timestamp();

        /**
         * A {@link TimerService} for querying time and registering timers.
         */
        public abstract TimerService timerService();

        /**
         * Emits a record to the side output identified by the {@link OutputTag}.
         *
         * @param outputTag the {@code OutputTag} that identifies the side output to emit to.
         * @param value The record to emit.
         */
        public abstract <X> void output(OutputTag<X> outputTag, X value);

        /**
         * Get key of the element being processed.
         */
        public abstract K getCurrentKey();
    }

    /**
     * Information available in an invocation of {@link #onTimer(long, OnTimerContext, Collector)}.
     */
    public abstract class OnTimerContext extends Context {

        /**
         * The {@link TimeDomain} of the firing timer.
         */
        public abstract TimeDomain timeDomain();

        /**
         * Get key of the firing timer.
         */
        @Override
        public abstract K getCurrentKey();
    }
}
KeyedProcessFunction usage examples
Example 1
Class that holds the state
public class CountWithTimestampState {
    private String key;
    private long count;
    private long lastModified;

    public CountWithTimestampState() {
    }

    public CountWithTimestampState(String key, long count, long lastModified) {
        this.key = key;
        this.count = count;
        this.lastModified = lastModified;
    }

    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }

    public long getCount() {
        return count;
    }

    public void setCount(long count) {
        this.count = count;
    }

    public long getLastModified() {
        return lastModified;
    }

    public void setLastModified(long lastModified) {
        this.lastModified = lastModified;
    }

    @Override
    public String toString() {
        return "CountWithTimestampState{" +
                "key='" + key + '\'' +
                ", count=" + count +
                ", lastModified=" + lastModified +
                '}';
    }
}
Input element data class
public class WordWithCount {
    private String key;
    private long count;

    public WordWithCount() {
    }

    public WordWithCount(String key, long count) {
        this.key = key;
        this.count = count;
    }

    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }

    public long getCount() {
        return count;
    }

    public void setCount(long count) {
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordWithCount{" +
                "key='" + key + '\'' +
                ", count=" + count +
                '}';
    }
}
Process function
import com.scallion.bean.CountWithTimestampState;
import com.scallion.bean.WordWithCount;
import com.scallion.utils.TimeUtil;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api