Flink in Practice: E-commerce User Behavior Analysis
Related post:
Flink E-commerce Project Day 1: E-commerce User Behavior Analysis with Step-by-Step Walkthrough - Implementing Hot Items TopN
1. E-commerce User Behavior Analysis
E-commerce business analysis falls into three main categories:
- Statistical analysis
  - Clicks, page views
  - Hot items, recently hot items, hot items by category, traffic statistics
- Preference analysis
  - Favorites, likes, ratings, tags
  - User profiles, recommendation lists
- Risk control
  - Order placement, payment, login
  - Monitoring for fake orders, expired orders, and malicious logins (frequent failed logins within a short time)
1.1 Project Module Design
By traffic versus business, e-commerce analysis splits into two broad classes.
By statistics type, it breaks down as follows:
1.2 Data Sources
Data structures:
- UserBehavior
- ApacheLogEvent
2. Project Modules
This project implements five analyses.
The general processing pattern: key the stream by some id and define a window, then incrementally aggregate a field (emitting a specified output format), and finally key by window and accumulate the records that belong to the same window. A minimal runnable sketch of that shape follows.
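The sketch below uses toy tuples standing in for the project's POJOs; the class name and sample data are illustrative, not from the project. The real jobs later replace sum with aggregate plus a second keyBy on windowEnd and a process function:
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PipelineShape {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        // (itemId, count, eventTimeSeconds): toy stand-ins for user behaviors
        env.fromElements(
                Tuple3.of(1L, 1L, 1L), Tuple3.of(1L, 1L, 20L),
                Tuple3.of(2L, 1L, 25L), Tuple3.of(1L, 1L, 70L))
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Long, Long>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple3<Long, Long, Long> e) {
                        return e.f2 * 1000L;
                    }
                })
                .keyBy(0)                                       // 1. group by item id
                .timeWindow(Time.seconds(60), Time.seconds(30)) // 2. sliding window
                .sum(1)                                         // 3. aggregate within each window
                .print();                                       // real jobs continue: keyBy(windowEnd) + process()
        env.execute("pipeline shape");
    }
}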
2.1 Real-Time Hot Items Statistics
- Basic requirements
  - Compute the hottest items over the last hour, updated every five minutes
  - Popularity is measured by page views ("pv")
- Approach
  - Filter the user-behavior stream down to page-view ("pv") events
  - Build a sliding window (size 1 hour, slide 5 minutes) and count the views of each item
  - For each window, output the 5 most-viewed items
The second step works roughly as follows.
First, partition by item id.
Next, assign the records to sliding time windows.
A time window is a left-closed, right-open interval [start, end), and with sliding windows the same record is assigned to several overlapping windows. For example, with a 60 s window sliding every 30 s, a record at t = 95 s falls into both [60 s, 120 s) and [90 s, 150 s).
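A small plain-Java sketch of the assignment rule (it mirrors the logic of Flink's SlidingEventTimeWindows with zero offset; the class and variable names are illustrative):
public class SlidingWindowAssignment {
    public static void main(String[] args) {
        long size = 60_000L;   // 60 s window
        long slide = 30_000L;  // 30 s slide
        long t = 95_000L;      // element timestamp: 95 s
        // Start of the latest window containing t, then step back by the slide
        long lastStart = t - (t % slide);
        for (long start = lastStart; start > t - size; start -= slide) {
            // Left-closed, right-open: start <= t < start + size
            System.out.printf("[%d, %d)%n", start, start + size);
        }
        // Prints: [90000, 150000) and [60000, 120000)
    }
}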
Then comes the window aggregation.
aggregate() takes two parameters: the first is the incremental aggregation rule for the window, the second defines the structure of the output records.
窗口聚合函数
窗口聚合策略—每出现一条记录就加一。
需要实现AggragateFunction接口,并需要实现4个函数createAccumulator
、add
、merge
、getResult
interface AggregateFunction<IN, ACC, OUT>
/**
 * IN : input type
 * ACC: accumulator type
 * OUT: output type
 */
Window output function
Output record: ItemViewCount(itemId, windowEnd, count)
- itemId: item id
- windowEnd: window end time
- count: view count
interface WindowFunction<IN, OUT, KEY, W extends Window>
/**
 * IN : input type, i.e. the aggregate function's output type
 * OUT: the type you actually want to emit
 * KEY: the grouping key as a Tuple; here itemId, since windows are keyed by itemId
 * W  : the window type; w.getEnd() returns the window's end time
 */
Illustrated example:
First key by item id, then aggregate per window.
After that comes the ranking step: records belonging to the same window are collected together and the top 5 are emitted.
This requires an intermediate buffer that holds all records of one window, which is where state comes in.
Final sorted output: KeyedProcessFunction
- A low-level API for stateful streams
- KeyedProcessFunction processes each keyed substream after partitioning
- Keying by windowEnd guarantees that every record in a substream belongs to the same time window
- Read the substream's accumulated records from ListState, sort them, and emit
A ProcessFunction defines the processing logic of a KeyedStream.
After partitioning, each keyed stream has its own lifecycle:
- open: initialization; the stream's state handles can be obtained here
- processElement: called for every element in the stream
- onTimer: the callback that fires after a registered Timer expires
Creating the POJOs
Each needs getters/setters, no-arg and all-args constructors, and toString:
- ItemViewCount
  private Long itemId; private Long windowEnd; private Long count;
- UserBehavior
  private Long userId; private Long itemId; private Integer categoryId; private String behavior; private Long timestamp;
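Written out by hand, ItemViewCount looks roughly like the minimal sketch below (UserBehavior follows the same pattern; Lombok's @Data/@NoArgsConstructor/@AllArgsConstructor would generate the equivalent):
public class ItemViewCount {
    private Long itemId;     // item id
    private Long windowEnd;  // window end time
    private Long count;      // view count

    public ItemViewCount() {}  // Flink POJOs need a public no-arg constructor

    public ItemViewCount(Long itemId, Long windowEnd, Long count) {
        this.itemId = itemId;
        this.windowEnd = windowEnd;
        this.count = count;
    }

    public Long getItemId() { return itemId; }
    public void setItemId(Long itemId) { this.itemId = itemId; }
    public Long getWindowEnd() { return windowEnd; }
    public void setWindowEnd(Long windowEnd) { this.windowEnd = windowEnd; }
    public Long getCount() { return count; }
    public void setCount(Long count) { this.count = count; }

    @Override
    public String toString() {
        return "ItemViewCount{itemId=" + itemId + ", windowEnd=" + windowEnd + ", count=" + count + "}";
    }
}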
Code
For testing, a 60-second window with a 30-second slide is used instead of 1 hour / 5 minutes.
import beans.ItemViewCount;
import beans.UserBehavior;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.collect.Lists;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
/**
* @author Kewei
* @Date 2022/3/5 15:10
*/
public class HotItems {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("D:\\IdeaProjects\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv");
// Convert each line into a POJO and assign event timestamps
SingleOutputStreamOperator<UserBehavior> dataStream = inputStream.map(line -> {
String[] filed = line.split(",");
return new UserBehavior(new Long(filed[0]), new Long(filed[1]), new Integer(filed[2]), filed[3], new Long(filed[4]));
}).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
@Override
public long extractAscendingTimestamp(UserBehavior userBehavior) {
return userBehavior.getTimestamp() * 1000;
}
});
// Keep only pv events, key by item id, assign sliding time windows, incrementally aggregate each window, and emit results as ItemViewCount
SingleOutputStreamOperator<ItemViewCount> windowAggStream = dataStream
.filter(data -> "pv".equals(data.getBehavior()))
.keyBy("itemId")
.timeWindow(Time.seconds(60), Time.seconds(30))
.aggregate(new CountAgg(), new WindowResultFunction());
// Key the aggregated results by window end time, then emit the top N per window on a timer
SingleOutputStreamOperator<String> resultStream = windowAggStream
.keyBy("windowEnd")
.process(new TopNHotItems(5));
resultStream.print();
env.execute("hot items");
}
// Incremental aggregation: count the records of one item
/**
 * AggregateFunction<IN, ACC, OUT>
 * IN : input type
 * ACC: accumulator type
 * OUT: final result type
 */
private static class CountAgg implements AggregateFunction<UserBehavior, Long, Long> {
@Override
public Long createAccumulator() {
return 0L;
}
@Override
public Long add(UserBehavior userBehavior, Long aLong) {
return aLong + 1;
}
@Override
public Long getResult(Long aLong) {
return aLong;
}
@Override
public Long merge(Long aLong, Long acc1) {
return aLong + acc1;
}
}
// Defines the output format of each window
/**
 * WindowFunction<IN, OUT, KEY, W extends Window>
 * IN : input type, i.e. the aggregate function's output type
 * OUT: the type to emit
 * KEY: the grouping key as a Tuple; here itemId
 * W  : the window type; w.getEnd() returns the window's end time
 */
private static class WindowResultFunction implements WindowFunction<Long, ItemViewCount, Tuple, TimeWindow> {
@Override
public void apply(Tuple tuple, TimeWindow timeWindow, Iterable<Long> iterable, Collector<ItemViewCount> out) throws Exception {
Long itemId = tuple.getField(0);
Long end = timeWindow.getEnd();
Long next = iterable.iterator().next();
out.collect(new ItemViewCount(itemId, end, next));
}
}
/**
 * KeyedProcessFunction<KEY, IN, OUT>
 * KEY: type of the grouping key
 * IN : input type
 * OUT: output type
 */
private static class TopNHotItems extends KeyedProcessFunction<Tuple, ItemViewCount, String> {
private Integer topSize;
public TopNHotItems(Integer topSize) {
this.topSize = topSize;
}
ListState<ItemViewCount> itemViewCountListState;
@Override
public void open(Configuration parameters) throws Exception {
itemViewCountListState = getRuntimeContext().getListState(new ListStateDescriptor<ItemViewCount>("Item-view-count-list",ItemViewCount.class));
}
@Override
public void processElement(ItemViewCount count, KeyedProcessFunction<Tuple, ItemViewCount, String>.Context ctx, Collector<String> out) throws Exception {
itemViewCountListState.add(count);
// Register a timer for 1 ms after the window end; all records of a window share the same windowEnd, so when the timer fires every record of that window has been collected
ctx.timerService().registerEventTimeTimer(count.getWindowEnd() + 1);
}
// The timer task: sort the window's records and format the output
@Override
public void onTimer(long timestamp, KeyedProcessFunction<Tuple, ItemViewCount, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
// Copy the ListState into an ArrayList
ArrayList<ItemViewCount> itemViewCounts = Lists.newArrayList(itemViewCountListState.get().iterator());
// Sort by count in descending order (compareTo avoids the overflow risk of int subtraction)
itemViewCounts.sort(new Comparator<ItemViewCount>() {
@Override
public int compare(ItemViewCount o1, ItemViewCount o2) {
return o2.getCount().compareTo(o1.getCount());
}
});
// Format the result
StringBuffer stringBuffer = new StringBuffer();
stringBuffer.append("===========\n");
stringBuffer.append("窗口结束时间:").append(new Timestamp(timestamp - 1)).append("\n");
for (int i = 0; i < Math.min(topSize, itemViewCounts.size()); i++) {
ItemViewCount itemViewCount = itemViewCounts.get(i);
stringBuffer.append("NO ").append(i+1).append(":")
.append(" 商品id = ").append(itemViewCount.getItemId())
.append(" 热门度 = ").append(itemViewCount.getCount())
.append("\n");
}
stringBuffer.append("============\n\n");
// Throttle the output rate
Thread.sleep(1000L);
// Emit the result
out.collect(stringBuffer.toString());
}
}
}
Output:
===========
Window end time: 2017-11-26 09:00:30.0
NO 1: item id = 2455388 popularity = 2
NO 2: item id = 1715 popularity = 1
NO 3: item id = 2244074 popularity = 1
NO 4: item id = 3076029 popularity = 1
NO 5: item id = 176722 popularity = 1
============
...
2.2 Real-Time Traffic Statistics: Hot Pages
- Basic requirements
  - Compute the hottest pages in real time from the web server log
  - Count page visits per minute and output the 5 most-visited URLs, refreshed every five seconds
- Approach
  - Convert the time field in the Apache server log into a timestamp and use it as the Event Time
  - Keep only GET requests for pages, filtering out requests for static resources
  - Key by url, build a sliding window (size 1 minute, slide 5 seconds), incrementally aggregate, and emit in the specified format
  - Finally key by window end time, collect each window's records, and emit formatted output
Creating the POJOs (same pattern as the ItemViewCount sketch above)
- ApacheLogEvent
  private String ip; private String userId; private Long timestamp; private String method; private String url;
- PageViewCount
  private String url; private Long windowEnd; private Long count;
Code
import bean.ApacheLogEvent;
import bean.PageViewCount;
import org.apache.commons.compress.utils.Lists;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.regex.Pattern;
/**
* @author Kewei
* @Date 2022/3/10 15:57
*/
public class HotPages {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("D:\\IdeaProjects\\UserBehaviorAnalysis\\HotPages\\apache.log");
// Convert the string DataStream into POJOs, assign event time and watermarks (1 s of allowed out-of-orderness)
SingleOutputStreamOperator<ApacheLogEvent> dataStream = inputStream.map(line -> {
String[] fields = line.split(" ");
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss");
long time = simpleDateFormat.parse(fields[3]).getTime();
return new ApacheLogEvent(fields[0], fields[1], time, fields[5], fields[6]);
})
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<ApacheLogEvent>(Time.seconds(1)) {
@Override
public long extractTimestamp(ApacheLogEvent element) {
return element.getTimestamp();
}
});
/**
 * Filter: keep GET requests for pages, dropping static-resource requests.
 * Then key by url and apply a sliding window (size 1 minute, slide 5 seconds).
 * Finally aggregate incrementally and emit results in the PageViewCount format.
 */
SingleOutputStreamOperator<PageViewCount> resStream = dataStream
.filter(data -> "GET".equals(data.getMethod()))
.filter(data -> {
// Drop requests for static resources (paths ending in .css/.js/.png/.ico)
String regex = "^((?!\\.(css|js|png|ico)$).)*$";
return Pattern.matches(regex, data.getUrl());
})
.keyBy(ApacheLogEvent::getUrl)
.timeWindow(Time.minutes(1), Time.seconds(5))
.aggregate(new PageCountAgg(), new PageView());
/**
 * Key by window end time, then format each window's records for output
 */
SingleOutputStreamOperator<String> resultStream = resStream
.keyBy(PageViewCount::getWindowEnd)
.process(new MyProcessFunc());
resultStream.print();
env.execute();
}
public static class PageCountAgg implements AggregateFunction<ApacheLogEvent, Long, Long>{
@Override
public Long createAccumulator() {
return 0L;
}
@Override
public Long add(ApacheLogEvent apacheLogEvent, Long aLong) {
return aLong+1;
}
@Override
public Long getResult(Long aLong) {
return aLong;
}
@Override
public Long merge(Long aLong, Long acc1) {
return aLong+acc1;
}
}
public static class PageView implements WindowFunction<Long, PageViewCount,String, TimeWindow>{
@Override
public void apply(String s, TimeWindow window, Iterable<Long> input, Collector<PageViewCount> out) throws Exception {
out.collect(new PageViewCount(s, window.getEnd(), input.iterator().next()));
}
}
public static class MyProcessFunc extends KeyedProcessFunction<Long,PageViewCount,String>{
private ListState<PageViewCount> list;
@Override
public void onTimer(long timestamp, KeyedProcessFunction<Long, PageViewCount, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
// Copy the ListState into an ArrayList and sort by count, descending
ArrayList<PageViewCount> pageViewCountArrayList = Lists.newArrayList(list.get().iterator());
pageViewCountArrayList.sort((p1, p2) -> p2.getCount().compareTo(p1.getCount()));
StringBuffer result = new StringBuffer();
result.append("====================\n");
result.append("Window end time: ").append(new Timestamp(timestamp - 1)).append("\n");
for (int i = 0; i < Math.min(5, pageViewCountArrayList.size()); i++) {
PageViewCount pageView = pageViewCountArrayList.get(i);
result.append(pageView.getUrl()).append(" ").append(pageView.getCount()).append("\n");
}
result.append("===============\n\n\n");
Thread.sleep(1000);
out.collect(result.toString());
}
@Override
public void open(Configuration parameters) throws Exception {
list = getRuntimeContext().getListState(new ListStateDescriptor<PageViewCount>("PageViewCount",PageViewCount.class));
}
@Override
public void processElement(PageViewCount value, KeyedProcessFunction<Long, PageViewCount, String>.Context ctx, Collector<String> out) throws Exception {
list.add(value);
ctx.timerService().registerEventTimeTimer(value.getWindowEnd()+1);
}
}
}
Out-of-order output
Not entirely clear to me yet; revisit later.
2.3 Real-Time Traffic Statistics: PV and UV
- Basic requirements
  - Compute real-time PV and UV from the tracking logs
  - Count page views per hour (PV) and deduplicate users (UV)
- Approach
  - For PV, filter the events, apply a tumbling time window, and sum the counts
  - For UV, deduplicate user ids with a Set (see the sketch after the PV code below)
  - For very large data volumes, a Bloom filter can be used for deduplication instead
PV statistics code
import beans.UserBehavior;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
/**
* @author Kewei
* @Date 2022/3/10 17:15
*/
public class PageView {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("D:\\IdeaProjects\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv");
SingleOutputStreamOperator<UserBehavior> dataStream = inputStream.map(line -> {
String[] field = line.split(",");
return new UserBehavior(new Long(field[0]), new Long(field[1]), new Integer(field[2]), field[3], new Long(field[4]));
})
.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
@Override
public long extractAscendingTimestamp(UserBehavior element) {
return element.getTimestamp() * 1000L;
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> result = dataStream
.filter(data -> "pv".equals(data.getBehavior()))
.map(new MapFunction<UserBehavior, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(UserBehavior userBehavior) throws Exception {
return new Tuple2<>("pv", 1L);
}
})
.keyBy(data -> data.f0)
.timeWindow(Time.hours(1))
.sum(1);
result.print();
env.execute();
}
}
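The UV side is not shown in the original code. Below is a minimal sketch of the Set-based approach from the requirements, meant to drop into the same main method as the PV job above. It assumes the UserBehavior POJO exposes getUserId; since an AllWindowFunction buffers the entire window, this only suits moderate volumes, which is exactly why a Bloom filter is suggested for larger ones.
// Extra imports needed for this sketch:
// import java.util.HashSet;
// import java.util.Set;
// import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
// import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
// import org.apache.flink.util.Collector;
SingleOutputStreamOperator<Tuple2<String, Integer>> uvStream = dataStream
        .filter(data -> "pv".equals(data.getBehavior()))
        .timeWindowAll(Time.hours(1))  // one tumbling window over the whole stream
        .apply(new AllWindowFunction<UserBehavior, Tuple2<String, Integer>, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<UserBehavior> values,
                              Collector<Tuple2<String, Integer>> out) {
                Set<Long> userIds = new HashSet<>();  // dedupe user ids within the window
                for (UserBehavior ub : values) {
                    userIds.add(ub.getUserId());
                }
                out.collect(Tuple2.of("uv", userIds.size()));
            }
        });
uvStream.print();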