I. Emit the PV count of each window once per hour
First, define two POJO classes: UserBehavior is the input type and PageViewCount is the output type.
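// Lombok annotations used by both classes
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;

@Data
@AllArgsConstructor
@NoArgsConstructor
@ToString
public class UserBehavior {
    private Long userId;        // user ID
    private Long itemId;        // item ID
    private Integer categoryId; // category ID
    private String behavior;    // user behavior type, e.g. "pv"
    private Long timestamp;     // event timestamp in seconds
}

@Data
@AllArgsConstructor
@NoArgsConstructor
@ToString
public class PageViewCount {
    private String rand;    // random key
    private Long windowEnd; // window end time
    private Long count;     // pv count
}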
1 Read the data and create a DataStream (this example reads from a file; if the input comes from Kafka, the code inside map needs to be changed)
DataStreamSource<String> inputDS = env.readTextFile("***.csv");
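2 Parse the data into beans and assign timestamps and watermarks
SingleOutputStreamOperator<UserBehavior> mapDS = inputDS
        .map(line -> {
            // Expected CSV layout: userId,itemId,categoryId,behavior,timestamp
            String[] fields = line.split(",");
            return new UserBehavior(Long.valueOf(fields[0]), Long.valueOf(fields[1]), Integer.valueOf(fields[2]), fields[3], Long.valueOf(fields[4]));
        })
        // Timestamps are assumed to be ascending, so no out-of-orderness bound is configured
        .assignTimestampsAndWatermarks(WatermarkStrategy.<UserBehavior>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<UserBehavior>() {
                    @Override
                    public long extractTimestamp(UserBehavior userBehavior, long recordTimestamp) {
                        // The source timestamp is in seconds; Flink expects milliseconds
                        return userBehavior.getTimestamp() * 1000L;
                    }
                }));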
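To make the timestamp handling concrete, here is a minimal plain-Java sketch with no Flink dependency. It converts the seconds field of one CSV line to milliseconds and tracks a watermark the way an ascending-timestamp strategy is expected to: the highest timestamp seen so far minus 1 ms. The class name TimestampSketch, its helper methods, and the sample record are assumptions made for illustration and are not part of the job above.

```java
import java.util.Arrays;
import java.util.List;

class TimestampSketch {
    /** Extracts the event time in milliseconds from a CSV line whose 5th field is epoch seconds. */
    static long eventTimeMillis(String line) {
        String[] fields = line.split(",");
        return Long.parseLong(fields[4]) * 1000L;
    }

    /** Watermark after seeing the given event times: highest timestamp so far minus 1 ms. */
    static long watermarkAfter(List<Long> eventTimesMillis) {
        long max = Long.MIN_VALUE + 1; // so that max - 1 cannot underflow
        for (long t : eventTimesMillis) {
            max = Math.max(max, t);
        }
        return max - 1;
    }

    public static void main(String[] args) {
        // Hypothetical sample record: userId,itemId,categoryId,behavior,timestamp(seconds)
        String line = "1001,2002,3003,pv,1511658000";
        long ts = eventTimeMillis(line);
        System.out.println(ts);                                            // prints 1511658000000
        System.out.println(watermarkAfter(Arrays.asList(ts, ts + 5000L))); // prints 1511658004999
    }
}
```

A window fires only once the watermark passes its last timestamp, so forgetting the multiplication by 1000 would place all events in windows near the epoch.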
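3 Key by, window, and aggregate
Generate a random key to avoid data skew. Use a one-hour tumbling window and count with an incremental pre-aggregation function passed to aggregate.
SingleOutputStreamOperator<PageViewCount> windowAggDS = mapDS
        .filter(line -> "pv".equals(line.getBehavior())) // keep only pv events
        .map(new MapFunction<UserBehavior, Tuple2<Integer, Long>>() {
            // Reuse one generator instead of creating a new Random per record
            private final Random random = new Random();

            @Override
            public Tuple2<Integer, Long> map(UserBehavior userBehavior) throws Exception {
                // Random key in [0, 10) spreads the records over up to 10 keys
                return new Tuple2<>(random.nextInt(10), 1L);
            }
        })
        .keyBy(line -> line.f0)
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        .aggregate(new PvCountAgg(), new PvCountResult());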
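The effect of the random key plus the tumbling window can be sketched in plain Java without Flink. The sketch below assigns each event a random key in [0, 10), computes the end of its one-hour window (assuming non-negative timestamps and a window offset of 0), and counts per key and window. SaltedWindowSketch and all of its members are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class SaltedWindowSketch {
    static final long WINDOW_MS = 60L * 60L * 1000L; // one hour
    static final int SALT_BUCKETS = 10;              // same bound as random.nextInt(10)

    /** End (exclusive) of the tumbling window containing a non-negative timestamp, offset 0. */
    static long windowEnd(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS) + WINDOW_MS;
    }

    /** Counts events per "salt@windowEnd" key, mimicking keyBy(random key) + window + count. */
    static Map<String, Long> partialCounts(long[] timestampsMs, Random random) {
        Map<String, Long> counts = new HashMap<>();
        for (long ts : timestampsMs) {
            String key = random.nextInt(SALT_BUCKETS) + "@" + windowEnd(ts);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = new long[1000];
        for (int i = 0; i < events.length; i++) {
            events[i] = i * 1000L; // 1000 events, one per second, all in the first hour
        }
        Map<String, Long> partial = partialCounts(events, new Random(42));
        long total = partial.values().stream().mapToLong(Long::longValue).sum();
        System.out.println(partial.size() + " partial counts, total = " + total);
    }
}
```

Each window yields at most 10 partial counts, and their sum equals the number of events in that window, which is why a second aggregation step is needed afterwards.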
3.1 Implementation of the pre-aggregation function PvCountAgg
public static class PvCountAgg implements AggregateFunction<Tuple2<Integer, Long>, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L; // each window starts counting from zero
    }

    @Override
    public Long add(Tuple2<Integer, Long> value, Long accumulator) {
        return accumulator + 1; // one more pv event
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}
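The call order behind an incremental aggregate can be traced with a small plain-Java sketch. The SimpleAggregate interface below is a hypothetical stand-in that only mirrors the four method names used by PvCountAgg; it is not Flink's interface, and the driver loop is an assumption about how such a function is typically invoked: one accumulator per window, updated element by element.

```java
class AccumulatorSketch {
    /** Hypothetical stand-in that mirrors the four methods of an incremental aggregate. */
    interface SimpleAggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);
    }

    /** Counting aggregate: ignores the element's content and adds 1 per element. */
    static final class CountAggregate implements SimpleAggregate<String, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(String value, Long accumulator) { return accumulator + 1; }
        public Long getResult(Long accumulator) { return accumulator; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    /** Feeds the elements one at a time, keeping only the accumulator between calls. */
    static long count(String... elements) {
        CountAggregate agg = new CountAggregate();
        Long acc = agg.createAccumulator();
        for (String e : elements) {
            acc = agg.add(e, acc);
        }
        return agg.getResult(acc);
    }

    public static void main(String[] args) {
        System.out.println(count("pv", "pv", "pv")); // prints 3
        CountAggregate agg = new CountAggregate();
        System.out.println(agg.merge(2L, 3L));       // prints 5
    }
}
```

Because only a single Long is kept per key and window, the state stays small no matter how many events arrive within the hour.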
3.2 Implementation of the PvCountResult function
public static class PvCountResult implements WindowFunction<Long, PageViewCount, Integer, TimeWindow> {
    @Override
    public void apply(Integer key, TimeWindow window, Iterable<Long> input, Collector<PageViewCount> out) throws Exception {
        // input holds a single element: the count produced by PvCountAgg for this key and window
        out.collect(new PageViewCount(key.toString(), window.getEnd(), input.iterator().next()));
    }
}
4 Sum up the partial counts from all partitions
SingleOutputStreamOperator<PageViewCount> resultDS2 = windowAggDS
        .keyBy(line -> line.getWindowEnd())
        .process(new TotalPvCount());
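Keying by windowEnd brings all partial counts of the same window together so they can be added up. The following plain-Java sketch shows only that summation, independent of how TotalPvCount is implemented; a streaming job additionally has to decide when all partial counts of a window have arrived, which this sketch ignores. TotalPvSketch and the sample numbers are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class TotalPvSketch {
    /** Sums partial counts per window end. Each row is {windowEnd, count}. */
    static Map<Long, Long> totalsByWindowEnd(long[][] partials) {
        Map<Long, Long> totals = new TreeMap<>(); // sorted by window end
        for (long[] p : partials) {
            totals.merge(p[0], p[1], Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical partial results {windowEnd, count} produced under different random keys
        long[][] partials = {
            {3600000L, 120L},
            {3600000L, 95L},
            {7200000L, 40L},
        };
        System.out.println(totalsByWindowEnd(partials)); // prints {3600000=215, 7200000=40}
    }
}
```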
4.1 Custom Ke