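The general steps of a typical stream-processing job:
1. Source: handle the data source. A custom source can be written by implementing the SourceFunction interface; to keep checkpoints consistent, it usually implements CheckpointedFunction as well.
For example, data read from an external file can be handled as follows. How the checkpoint lock should best be used still needs further study.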
/**
* @author smilezmh
*/
public class MySourceFunction implements SourceFunction<String>, CheckpointedFunction {
private volatile boolean isRunning = true;
private transient ListState<String> checkPointedLines;
private String oneline;
@Override
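public void run(SourceContext<String> ctx) throws Exception {
if (CommonParams.END_READ.equals(oneline)) {
return;// the file was fully read before the restored checkpoint
}
String resumeAfter = oneline;// last line emitted before the restored checkpoint, null on a fresh start
// read from the external source
try (BufferedReader reader = new BufferedReader(new InputStreamReader(getClass().getResourceAsStream("/UserBehavior.csv"), "UTF-8"))) {
String line;
while (isRunning && (line = reader.readLine()) != null && !line.isEmpty()) {
if (resumeAfter != null) {// skip lines emitted before the checkpoint; assumes lines are unique
if (line.equals(resumeAfter)) {
resumeAfter = null;
}
continue;
}
// this synchronized block ensures that state checkpointing,
// internal state updates and emission of elements are an atomic operation;
// it is taken per record so that checkpoints can run between records
synchronized (ctx.getCheckpointLock()) {
ctx.collect(line);
oneline = line;
}
}
if (isRunning) {
synchronized (ctx.getCheckpointLock()) {
oneline = CommonParams.END_READ;// finished reading
}
}
}
}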
@Override
public void cancel() {
isRunning = false;
}
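// called on every checkpoint: store the last line that was read
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
this.checkPointedLines.clear();
if (oneline != null) {// null until the first line has been read
this.checkPointedLines.add(oneline);
}
}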
// called on start-up and on recovery: register the state and restore the last line
@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
this.checkPointedLines = context
.getOperatorStateStore()
.getListState(new ListStateDescriptor<>(CommonParams.SOURCE_HOT_ITEM, String.class));
if (context.isRestored()) {
for (String line : this.checkPointedLines.get()) {
this.oneline = line;
}
}
}
}
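The idea behind the checkpointed state can be illustrated without Flink. The following is a minimal plain-Java sketch, not Flink API: the class ResumableLineReader and its methods are hypothetical names chosen for this illustration. Instead of the last line, it keeps the number of lines already emitted as its "state"; after a simulated failure, a new reader restores that offset and continues with the first line that was not emitted yet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (not Flink API): resume reading from a checkpointed offset. */
final class ResumableLineReader {
    private int emitted; // number of lines emitted so far: the "state"

    /** Returns the value that a checkpoint would store. */
    int snapshot() {
        return emitted;
    }

    /** Restores the offset taken from an earlier snapshot. */
    void restore(int offset) {
        emitted = offset;
    }

    /** Emits at most maxLines lines, starting after the lines already emitted. */
    List<String> read(List<String> lines, int maxLines) {
        List<String> out = new ArrayList<>();
        for (int i = emitted; i < lines.size() && out.size() < maxLines; i++) {
            out.add(lines.get(i));
        }
        emitted += out.size();
        return out;
    }
}
```

An offset is used here rather than the line content because it stays correct even when the file contains duplicate lines.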
Using the SourceFunction:
environment.addSource(new MySourceFunction()).returns(Types.STRING).map(data -> {
String[] splits = data.split(",");
return new UserBehavior(
Long.valueOf(splits[0].trim()),
Long.valueOf(splits[1].trim()),
Integer.valueOf(splits[2].trim()),
splits[3].trim(),
Long.valueOf(splits[4].trim()));
}).returns(UserBehavior.class)// convert each line into a UserBehavior
The file can also be read directly:
URL url = getClass().getResource("/UserBehavior.csv");
DataStream<String> sumStream = environment.readTextFile(url.getPath(), "UTF-8");
2. Generate timestamps and watermarks
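Watermarks and timestamps: a timestamp is extracted from each record. Flink expects epoch milliseconds, so as a long the value should have 13 digits; a 10-digit value is in seconds and has to be multiplied by 1000.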
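The seconds-versus-milliseconds rule above can be written as a small helper. This is only a sketch: the class EpochMillis and the threshold of 10^11 are assumptions made for illustration (epoch seconds stay below 10^11 until roughly the year 5138, while epoch milliseconds have been above it since 1973).

```java
/** Hypothetical helper: normalize an epoch timestamp to milliseconds. */
final class EpochMillis {
    private EpochMillis() {}

    /**
     * Heuristic, assuming present-day dates: values below 10^11 are treated
     * as seconds and multiplied by 1000, larger values are returned unchanged.
     */
    static long toMillis(long timestamp) {
        return timestamp < 100_000_000_000L ? timestamp * 1000L : timestamp;
    }
}
```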
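Timestamps:
The time characteristic is usually set to event time (in Flink 1.12 event time is already the default and setStreamTimeCharacteristic is deprecated):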
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
How to generate watermarks:
The following two methods are deprecated:
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
AssignerWithPeriodicWatermarks<T> timestampAndWatermarkAssigner)
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
AssignerWithPunctuatedWatermarks<T> timestampAndWatermarkAssigner)
Version 1.12, the latest at the time of writing, uses the following method instead:
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
WatermarkStrategy<T> watermarkStrategy)
WatermarkStrategy has four implementing classes. A commonly used one is Strategy, an inner class of AssignerWithPeriodicWatermarksAdapter:
public static final class Strategy<T> implements WatermarkStrategy<T> {
private static final long serialVersionUID = 1L;
private final AssignerWithPeriodicWatermarks<T> wms;
public Strategy(AssignerWithPeriodicWatermarks<T> wms) {
this.wms = checkNotNull(wms);
}
@Override
public TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return wms;
}
@Override
public WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new AssignerWithPeriodicWatermarksAdapter<>(wms);
}
}
This class has a constructor that takes an AssignerWithPeriodicWatermarks, and BoundedOutOfOrdernessTimestampExtractor is the only implementation of AssignerWithPeriodicWatermarks that is not deprecated. Its constructor takes a Time maxOutOfOrderness:
public BoundedOutOfOrdernessTimestampExtractor(Time maxOutOfOrderness)
Hence the following code:
DataStream<Tuple2<String, Integer>> sumStream = environment.readTextFile(url.getPath(), "UTF-8")
.map(data -> {
final String[] splits = data.split(",");
return new UserBehavior(Long.valueOf(splits[0]), Long.valueOf(splits[1]), Long.valueOf(splits[2]), splits[3], Long.valueOf(splits[4]));
}).returns(UserBehavior.class)
.filter(data -> "pv".equals(data.getBehavior()))
.assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarksAdapter.Strategy<UserBehavior>(
new BoundedOutOfOrdernessTimestampExtractor<UserBehavior>(Time.seconds(0)) {
@Override
public long extractTimestamp(UserBehavior element) {
return element.getTimestamp() * 1000;// seconds to milliseconds
}
}))
.returns(TypeInformation.of(UserBehavior.class))
// Tuple2.of keeps the tuple typed; a raw "new Tuple2(...)" would make the element type a raw Tuple2
.map(data -> Tuple2.of(data.getBehavior(), 1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.keyBy(data -> data.f0)
.timeWindow(Time.hours(1))// deprecated in 1.12 in favor of window(TumblingEventTimeWindows.of(Time.hours(1)))
.sum(1);
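What a bounded-out-of-orderness assigner computes can be modeled in a few lines of plain Java. The sketch below is not Flink code: BoundedWatermarkModel is a hypothetical class that only mirrors the rule "watermark = largest timestamp seen so far minus the allowed out-of-orderness".

```java
/** Plain-Java model (not Flink API) of a bounded-out-of-orderness watermark. */
final class BoundedWatermarkModel {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampMs;

    BoundedWatermarkModel(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // start high enough that the subtraction in currentWatermark cannot underflow
        this.maxTimestampMs = Long.MIN_VALUE + maxOutOfOrdernessMs;
    }

    /** Records an element's event time; a late element never moves the maximum back. */
    void onEvent(long eventTimeMs) {
        maxTimestampMs = Math.max(maxTimestampMs, eventTimeMs);
    }

    /** The watermark trails the largest timestamp seen by the allowed out-of-orderness. */
    long currentWatermark() {
        return maxTimestampMs - maxOutOfOrdernessMs;
    }
}
```

With Time.seconds(0), as in the code above, the watermark equals the largest timestamp seen, which is only appropriate when the input is already ordered by time.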
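Return types:
The JVM erases generic type information but keeps the generic information of a superclass, so the type is usually preserved by using an anonymous inner class with explicit type arguments or by extending a parent class. The returns method can also be used. For example:
If the result is a Tuple, the type arguments of the Tuple are erased, so Flink should be told the types this way: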
.returns(Types.TUPLE(Types.STRING, Types.INT))
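Returning the type in the following form, however, fails with an inferred-type-mismatch error; this problem is noted here for the record. A likely cause is a map lambda that builds the tuple with a raw "new Tuple2(...)": the element type of the stream is then the raw Tuple2, which does not match a hint for Tuple2<String, Long>.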
.returns(new TypeHint<Tuple2<String, Long>>(){})
or
.returns(TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {}))
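The erasure rule described above can be observed with plain Java reflection. In this sketch, Hint is a hypothetical stand-in for a type-hint class and is not Flink's TypeHint; it only shows that an anonymous subclass records its type argument in the class file, while an ordinary generic instance knows nothing but its raw class at runtime.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java demo of why an anonymous subclass keeps its generic type arguments. */
final class ErasureDemo {
    private ErasureDemo() {}

    /** Hypothetical stand-in for a type-hint class: a subclass captures T at compile time. */
    abstract static class Hint<T> {
        Type captured() {
            ParameterizedType superType = (ParameterizedType) getClass().getGenericSuperclass();
            return superType.getActualTypeArguments()[0];
        }
    }

    /** The type argument is recoverable from the anonymous subclass. */
    static String viaAnonymousSubclass() {
        return new Hint<List<String>>() {}.captured().getTypeName();
    }

    /** A plain generic instance only exposes its raw class. */
    static String viaPlainInstance() {
        List<String> list = new ArrayList<>();
        return list.getClass().getTypeName();
    }
}
```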
3. Filter, usually with the filter function
// filter
// => SingleOutputStreamOperator<UserBehavior>
.filter(data -> "pv".equals(data.getBehavior()))
4. Group by a field with keyBy
// partition by item
// previous step: SingleOutputStreamOperator<UserBehavior>
// data => KeyedStream<data, key> = [UserBehavior, itemId:Long]
.keyBy(data -> data.getItemId())
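5. Usually open a sliding window, which computes statistics over a fixed window length at a fixed interval (here: the last hour, every 5 minutes)
// sliding time window
// previous step: KeyedStream<data, key> = [UserBehavior, itemId:Long]
// => WindowedStream<data, key, window> = [UserBehavior, itemId:Long, TimeWindow]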
.timeWindow(Time.hours(1), Time.minutes(5))
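With a 1-hour window that slides every 5 minutes, each element belongs to 12 overlapping windows. The plain-Java sketch below shows how those windows can be computed from a timestamp. SlidingWindowModel is a hypothetical class written for this illustration; it assumes a window offset of 0 and a non-negative timestamp, and it is not taken from the Flink sources.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch (not Flink API) of sliding-window assignment with offset 0. */
final class SlidingWindowModel {
    private SlidingWindowModel() {}

    /**
     * Returns the start times of all windows [start, start + size) that contain
     * the timestamp, latest window first. Assumes a non-negative timestamp.
     */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // latest window start that is not after the timestamp
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }
}
```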
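6. Because sliding windows overlap (each element falls into several windows), aggregate incrementally with the aggregate function
For a sliding window, first define a window entity class; it is the output type of the aggregate call that follows.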
/**
* Result of a window aggregation
*/
@Data
public class ItemViewCount {
private Long itemId;// item id
private Long windowEnd;// end time of the window (epoch milliseconds)
private Long count;
public ItemViewCount(Long itemId, Long windowEnd, Long count) {
this.itemId = itemId;
this.windowEnd = windowEnd;
this.count = count;
}
public ItemViewCount() {
}
}
Details of the aggregate call:
// custom aggregation
// previous step: WindowedStream<data, key, window> = [UserBehavior, itemId:Long, TimeWindow]; the AggregateFunction outputs the accumulated view count
// => AggregateFunction<IN, ACC, OUT> = [UserBehavior:IN, acc:Long, out:Long] (out:Long, the accumulated view count, is the input of MyWindowFunction)
// => WindowFunction<IN, OUT, KEY, W extends Window> = (in:Long (the AggregateFunction's output), out:ItemViewCount, key itemId:Long, TimeWindow), which emits ItemViewCount
.aggregate(new CountAggregateFunction(), new MyWindowFunction())
CountAggregateFunction implements AggregateFunction:
/**
* Accumulated view count
*/
public class CountAggregateFunction implements AggregateFunction<UserBehavior, Long, Long> {
@Override
public Long createAccumulator() {
return 0L;// initial state of the accumulator
}
@Override
public Long add(UserBehavior value, Long accumulator) {
return accumulator + 1;// +1 for every incoming record
}
@Override
public Long getResult(Long accumulator) {
return accumulator;
}
@Override
public Long merge(Long a, Long b) {// merge two partial accumulators (needed for merging windows such as session windows)
return a + b;
}
}
Another example of an aggregate function:
/**
* Average of the timestamps; the output is the input of the WindowFunction
* UserBehavior: input
* Tuple2<Long, Long>: accumulator holding the sum (f0) and the count (f1)
* Double: output, the average
*/
public class AvgAggregateFunction implements AggregateFunction<UserBehavior, Tuple2<Long, Long>, Double> {
@Override
public Tuple2<Long, Long> createAccumulator() {
return new Tuple2<>(0L, 0L);
}
@Override
public Tuple2<Long, Long> add(UserBehavior value, Tuple2<Long, Long> accumulator) {
// Flink's Tuple2 exposes its fields as f0 and f1
return new Tuple2<>(accumulator.f0 + value.getTimestamp(), accumulator.f1 + 1);
}
@Override
public Double getResult(Tuple2<Long, Long> accumulator) {
return accumulator.f0 / (double) accumulator.f1;
}
@Override
public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
}
}
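The life cycle of such an aggregate function (create an accumulator, add elements, optionally merge partial accumulators, then read the result) can be tried out in plain Java. AverageModel below is a hypothetical stand-in that mirrors the four methods with a long[] {sum, count} accumulator; it does not depend on Flink.

```java
/** Plain-Java model (not Flink API) of createAccumulator / add / merge / getResult. */
final class AverageModel {
    private AverageModel() {}

    /** Empty accumulator: {sum, count}. */
    static long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    /** Returns a new accumulator that includes the value. */
    static long[] add(long value, long[] acc) {
        return new long[] {acc[0] + value, acc[1] + 1};
    }

    /** Combines two partial accumulators. */
    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Average of all added values, 0.0 for an empty accumulator. */
    static double getResult(long[] acc) {
        return acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1];
    }
}
```

Merging two partial accumulators gives the same result as adding all values to a single one, which is the property merge has to guarantee.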
The WindowFunction:
/**
* The input of the WindowFunction is the output of the AggregateFunction
*/
public class MyWindowFunction implements WindowFunction<Long, ItemViewCount, Long, TimeWindow> {
@Override
public void apply(Long keyLong, TimeWindow window, Iterable<Long> input, Collector<ItemViewCount> out) throws Exception {
Iterator<Long> inputIterator = input.iterator();
Long itemId = keyLong;
Long count = 0L;
if (inputIterator.hasNext()) {
count = inputIterator.next();
}
out.collect(new ItemViewCount(itemId, window.getEnd(), count));
}
}
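7. Key the ItemViewCount records emitted by the WindowFunction in the previous step by windowEnd, so that all records of the same window (same end time) are grouped under one key.
// group by the window end time windowEnd
// previous step: out:ItemViewCount
// data => KeyedStream<data, key> = [ItemViewCount, windowEnd:Long]
// windowEnd example: 1511690100000 (13 digits)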
.keyBy(data -> data.getWindowEnd())
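8. Process the records of each window with a process function to get the final result, for example by summing them up or by sorting them and taking the largest few.
// find the three most viewed items of every window
// previous step: KeyedStream<data, key> = [ItemViewCount, windowEnd:Long]
// => KeyedProcessFunction<K, I, O> = [windowEnd:Long, ItemViewCount, O:String]
// out:String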
.process(new TopNHotItemKeyProcessFunction(3));
This involves stateful programming with ValueState<T> or ListState<T>:
/**
* Every 5 minutes, output the 3 items with the most views
*/
public class TopNHotItemKeyProcessFunction extends KeyedProcessFunction<Long, ItemViewCount, String> {
private ListState<ItemViewCount> itemStates;// buffered records of the current window
RuntimeContext runtimeContext;
private final int topSize;
public TopNHotItemKeyProcessFunction(int topSize) {
this.topSize = topSize;
}
public void setValue(RuntimeContext runtimeContext) {
itemStates = runtimeContext.getListState(new ListStateDescriptor<>("itemStates", ItemViewCount.class));
}
@Override
public void processElement(ItemViewCount value, Context ctx, Collector<String> out) throws Exception {
// store every record in the list state
itemStates.add(value);
// current key: windowEnd
// System.out.println(" current key:" + ctx.getCurrentKey());
// register an event-time timer at windowEnd + 1 ms; it fires once the watermark passes that time,
// i.e. after all records of this window have arrived. Registering the same timestamp again for the same key has no effect.
ctx.timerService().registerEventTimeTimer(value.getWindowEnd() + 1L);
}
// start of the life cycle: define the state in open()
@Override
public void open(Configuration parameters) throws Exception {
if (runtimeContext == null) {
runtimeContext = getRuntimeContext();
setValue(runtimeContext);
}
}
// when the timer fires: sort and emit
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
// copy all records out of the state
List<ItemViewCount> allItems = new ArrayList<>();
for (ItemViewCount itemViewCount : itemStates.get()) {
allItems.add(itemViewCount);
}
// sort by count in descending order and keep the top entries
Stream<ItemViewCount> sortedStream = allItems.stream().sorted(Comparator.comparingLong(ItemViewCount::getCount).reversed()).limit(topSize);
// clear the state
itemStates.clear();
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append("time: ").append(new Timestamp(timestamp - 1)).append("\n");
sortedStream.forEach(data -> stringBuilder.append("item id: ").append(data.getItemId())
.append(" views: ").append(data.getCount()).append("\n"));
out.collect(stringBuilder.toString());
Thread.sleep(1000);// slows the output down, for the demo only
}
}
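The sorting step that runs when the timer fires can be tested in isolation. The sketch below is plain Java with hypothetical names (TopNModel, topItemIds); it takes the view counts of one window as a map from item id to count and returns the ids of the n most viewed items. Breaking ties by the smaller item id is an assumption made here so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch (not Flink API) of the top-N selection for one window. */
final class TopNModel {
    private TopNModel() {}

    /** Returns the ids of the n items with the highest counts, highest first. */
    static List<Long> topItemIds(Map<Long, Long> countsByItemId, int n) {
        List<Map.Entry<Long, Long>> entries = new ArrayList<>(countsByItemId.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // larger count first
            return byCount != 0 ? byCount : Long.compare(a.getKey(), b.getKey());
        });
        List<Long> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Long.compare is used instead of negating the count because a negated key overflows for Long.MIN_VALUE.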
Another example:
/**
* Summarize and emit once per time window
*/
public class NetWorkProcessFunction extends ProcessFunction<UrlViewCount, String> {
private RuntimeContext runtimeContext;
private final Integer topSize;
public NetWorkProcessFunction(Integer topSize) {
this.topSize = topSize;
}
private ListState<UrlViewCount> urlViewCountStates;
private Set<Long> timerStamps = new HashSet<>();// timestamps that already have a timer
private void getSates() {
urlViewCountStates = runtimeContext.getListState(new ListStateDescriptor<UrlViewCount>(CommonParams.URL_VIEW_COUNT_STATE, UrlViewCount.class));
}
@Override
public void processElement(UrlViewCount value, Context ctx, Collector<String> out) throws Exception {
urlViewCountStates.add(value);
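// register the timer; a timestamp that already has a timer is skipped
if (V.isEmpty(timerStamps) || (V.notEmpty(timerStamps) && !timerStamps.contains(value.getWindowEnd()))) {
// fire 1 second of event time after the window end, i.e. once the watermark passes windowEnd + 1000 ms; 1L would also work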
ctx.timerService().registerEventTimeTimer(value.getWindowEnd() + 1000L);
if (timerStamps.size() > 100000) {
timerStamps.clear();
}
timerStamps.add(value.getWindowEnd());
}
}
@Override
public void open(Configuration parameters) throws Exception {
if (runtimeContext == null) {
runtimeContext = getRuntimeContext();
}
getSates();
}
/**
* @param timestamp the time at which the timer fires
* @param ctx
* @param out
* @throws Exception
*/
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
StringBuilder stringBuilder = new StringBuilder();
List<UrlViewCount> urlViewCounts = new ArrayList<>();
for (UrlViewCount urlViewCount : urlViewCountStates.get()) {
urlViewCounts.add(urlViewCount);
}
// descending order
final Supplier<Stream<UrlViewCount>> limitStreamSupplier = () -> urlViewCounts.stream()
.sorted(Comparator.comparingLong((data) -> -data.getCount())).limit(topSize);
// stringBuilder.append("\ntime: ").append(CommonParams.dfSupplier.get().format(
// new Date(limitStreamSupplier.get().map(x -> x.getWindowEnd()).distinct().findFirst().orElse(0L))));
stringBuilder.append("\ntime: ").append(CommonParams.dfSupplier.get().format(
new Date(timestamp - 1000)));
limitStreamSupplier.get().forEach(data ->
stringBuilder.append("\nurl: ").append(data.getUrl())
.append(" clicks: ").append(data.getCount())
);
out.collect(stringBuilder.toString());
urlViewCountStates.clear();
Thread.sleep(1000);
}
}
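These are the basic steps of processing streaming data. Without a sliding window it is even simpler:
only source => generate timestamps and watermarks => filter => map => keyBy => timeWindow => aggregate (or apply with a WindowFunction) is needed,
as in the following example: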
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
URL url = getClass().getResource("/UserBehavior.csv");
DataStream<Tuple2<String, Long>> sumStream = environment.readTextFile(url.getPath(), "UTF-8")
.map(data -> {
final String[] splits = data.split(",");
return new UserBehavior(Long.valueOf(splits[0]), Long.valueOf(splits[1]), Long.valueOf(splits[2]), splits[3], Long.valueOf(splits[4]));
}).returns(UserBehavior.class)
.filter(data -> "pv".equals(data.getBehavior()))
.assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarksAdapter.Strategy<UserBehavior>(
new BoundedOutOfOrdernessTimestampExtractor<UserBehavior>(Time.seconds(0)) {
@Override
public long extractTimestamp(UserBehavior element) {
return element.getTimestamp() * 1000;// seconds to milliseconds
}
}))
.returns(TypeInformation.of(UserBehavior.class))
.map(data -> Tuple2.of(data.getBehavior(), 1L))// Tuple2.of keeps the tuple typed
.returns(Types.TUPLE(Types.STRING, Types.LONG))
.keyBy(data -> data.f0)
.timeWindow(Time.hours(1))
.sum(1);
sumStream.print();
environment.execute("behaviorAnalysis");
A combined example:
@SpringBootTest
@SuppressWarnings("deprecation")
class FlinkbizApplicationTests {
@Test
void contextLoads() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(6);
// env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9000);
DataStream<SensorData> dataStream = streamSource
.map(data -> {
String[] arr = data.split(" ");
return new SensorData(arr[0], Long.valueOf(arr[1]), Double.valueOf(arr[2]));
})
.returns(SensorData.class);
DataStream<Tuple2<String, Double>> minStream = dataStream
// a watermark is a timestamp-only record inserted into the stream; it asserts that no element with a smaller timestamp will arrive afterwards
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<SensorData>(Time.seconds(1)) {
@Override
public long extractTimestamp(SensorData element) {
return element.getTime();
}
})
.map(data -> Tuple2.of(data.getId(), data.getTemp()))
.returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
.keyBy(t -> t.f0)
// open a window (start inclusive, end exclusive) that is evaluated once per 10 seconds of event time; how window starts are computed is best read in the source code; a sliding window would open several windows per element
.timeWindow(Time.seconds(10))
.reduce((data1, data2) -> data1.f1 < data2.f1 ? data1 : data2);// keep the tuple with the lower temperature
// dataStream.print("input");
// minStream.print("min");
// env.execute("Window");
// processFunction
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
KeyedStream<SensorData, String> processKeyStream = dataStream
.keyBy(data -> data.getId());
DataStream<String> processStrStream = processKeyStream.process(new MyProcessFunction());
// processStrStream.print("processFunction");
// env.execute("processFunction");
// side output
SingleOutputStreamOperator<SensorData> containsSlidOutStream = dataStream.process(new SlideOutFunction()).returns(SensorData.class);
DataStream<String> slideOutStream = containsSlidOutStream.getSideOutput(new OutputTag<String>("alert", Types.STRING));
slideOutStream.print("slidOutStream");
env.execute("slidOutStream");
}
/**
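* Watermark generation logic
* The watermark trails the largest timestamp seen so far by 1 minute. A watermark is a timestamp-only record inserted into the stream; it asserts that no element with a smaller timestamp will arrive afterwards
*/
static class MyAssigner implements AssignerWithPeriodicWatermarks<SensorData> {
int timePeriod = 60 * 1000;
long timeMax = Long.MIN_VALUE + timePeriod;// the offset prevents an underflow in getCurrentWatermark
/**
* Emit a monotonically increasing watermark
*
* @return the current watermark
*/
@Nullable
@Override
public Watermark getCurrentWatermark() {
return new Watermark(timeMax - timePeriod);
}
/**
* Extract the timestamp from an element
*
* @param element the incoming element
* @param recordTimestamp the timestamp previously assigned to the element
* @return the element's own event time
*/
@Override
public long extractTimestamp(SensorData element, long recordTimestamp) {
long time = element.getTime();
timeMax = Math.max(time, timeMax);
return time;// return the element's timestamp, not the running maximum
}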
}
/**
* ProcessFunction logic:
* raise an alert when the temperature keeps rising for 10 seconds
*/
static class MyProcessFunction extends KeyedProcessFunction<String, SensorData, String> {
// last temperature
ValueState<Double> lastTemp;
// timestamp of the registered timer
ValueState<Long> serviceTimeStamp;// null while no timer is registered
RuntimeContext runtimeContext;
public MyProcessFunction() {
}
public void setValue(RuntimeContext runtimeContext) {
lastTemp = runtimeContext.getState(new ValueStateDescriptor<>("lastTemp", Types.DOUBLE));
serviceTimeStamp = runtimeContext.getState(new ValueStateDescriptor<>("serviceTimeStamp", Types.LONG));
}
// alert when the temperature rises continuously for 10 s
@Override
public void processElement(SensorData value, Context ctx, Collector<String> out) throws Exception {
if (runtimeContext == null) {
runtimeContext = getRuntimeContext();
setValue(runtimeContext);
}
if (value != null) {
// read the previous value from the state
Double lastT = lastTemp.value();
// update the state with the new value
lastTemp.update(value.getTemp());
if (lastT != null && value.getTemp() > lastT && serviceTimeStamp.value() == null) {// the temperature rises and no timer is pending
// timer time: 10 seconds after the current processing time
long timerTs = ctx.timerService().currentProcessingTime() + 10000L;
// register the timer
ctx.timerService().registerProcessingTimeTimer(timerTs);
serviceTimeStamp.update(timerTs);// remember the timer time so that exactly this timer can be deleted later
} else if (lastT != null && value.getTemp() < lastT && serviceTimeStamp.value() != null) {// the temperature drops and a timer is pending
ctx.timerService().deleteProcessingTimeTimer(serviceTimeStamp.value());// delete the timer
serviceTimeStamp.clear();// clear the state, it becomes null again
}
}
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
// emit the alert
out.collect("sensor " + ctx.getCurrentKey() + " temperature has been rising continuously");
serviceTimeStamp.clear();// allow a new timer after this one has fired
}
}
class SlideOutFunction extends ProcessFunction<SensorData, SensorData> {
OutputTag<String> alertOutFlag;
@Override
public void processElement(SensorData value, Context ctx, Collector<SensorData> out) throws Exception {
if (alertOutFlag == null) {
alertOutFlag = new OutputTag<>("alert", Types.STRING);
}
if (value != null && value.getTemp() > 0) {
if (value.getTemp() < 20) {
ctx.output(alertOutFlag, String.valueOf(value.getTemp()));
} else {
out.collect(value);
}
}
}
}
}
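The timer logic of MyProcessFunction is easier to reason about when it is separated from Flink. The following plain-Java sketch uses hypothetical names (RisingTempModel, onReading, onTime) and an explicit clock value instead of Flink's timer service; it arms an alarm 10 seconds after the first rise, cancels it when the temperature drops, and reports when the alarm fires.

```java
/** Plain-Java model (not Flink API) of the rising-temperature alarm logic. */
final class RisingTempModel {
    private Double lastTemp; // previous reading, null before the first one
    private Long timerTs;    // time of the pending alarm, null when none is armed

    /** Feeds one reading that was observed at nowMs. */
    void onReading(double temp, long nowMs) {
        Double previous = lastTemp;
        lastTemp = temp;
        if (previous == null) {
            return; // first reading, nothing to compare with
        }
        if (temp > previous && timerTs == null) {
            timerTs = nowMs + 10_000L; // arm the alarm 10 s ahead
        } else if (temp < previous && timerTs != null) {
            timerTs = null; // a drop cancels the pending alarm
        }
    }

    /** Returns true if the alarm fires at nowMs; the alarm is disarmed afterwards. */
    boolean onTime(long nowMs) {
        if (timerTs != null && nowMs >= timerTs) {
            timerTs = null;
            return true;
        }
        return false;
    }
}
```

Storing the alarm time itself, rather than the time at which it was armed, is what makes cancelling the right timer possible.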