Flink E-commerce User Behavior Analysis Project

欧阳喇嘛

Category: Flink. Tags: flink, big data

Article link: https://blog.csdn.net/LIANGFANGWEI/article/details/122266755

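1. Real-time statistical analysis

1.1 Hot item statistics

Requirement: every 5 minutes, display the Top N hot items on the site over the past hour.
Output format:

Time window information:

NO 1: item ID + view count 1

NO 2: item ID + view count 2

NO 3: item ID + view count 3
Implementation approach:
- 1. The final output needs the window information plus the item ID, so after keyBy a full-window function is required; only it has access to both the window time and the key.
- 2. The view count is also needed, so an incremental aggregate function is used: after keyBy, each incoming record updates the running count.
- 3. Steps 1 and 2 only yield the view count of a single item per window. To gather all items of the same one-hour window, keyBy the window end time and use a process function: store the window's items in ListState, and when a timer reaches the window end time, output the ListState contents.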
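
To see why a 1-hour window that slides every 5 minutes produces a fresh result every 5 minutes, it may help to work out which windows a single event belongs to. The following is a minimal plain-Java sketch of sliding-window assignment, not Flink's actual implementation. The class name `SlidingWindowSketch`, the method `windowStarts`, and the sample timestamp are hypothetical, and windows are assumed to be aligned to the epoch with no offset.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    /** Returns the start timestamps (ms) of every sliding window that contains the event. */
    public static List<Long> windowStarts(long timestamp, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that begins at or before the event.
        long lastStart = timestamp - Math.floorMod(timestamp, slideMs);
        // Walk backwards one slide at a time while the window still covers the event.
        for (long start = lastStart; start > timestamp - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        long fiveMinutes = 5 * 60 * 1000L;
        long eventTime = 1_511_658_000_000L; // hypothetical event time in epoch ms
        List<Long> starts = windowStarts(eventTime, hour, fiveMinutes);
        System.out.println(starts.size()); // prints 12
        for (long start : starts) {
            System.out.println("window [" + start + ", " + (start + hour) + ")");
        }
    }
}
```

Under these assumptions each event falls into 12 overlapping windows, so keeping one counter per item and window through incremental aggregation is much cheaper than buffering the raw events of every window.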

Code

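import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;

// ItemBean and ItemViewCount are the project's own POJOs; they are not shown in this article.

/**
 * What: compute the hot items of the past hour, refreshing the result every 5 minutes.
 * How:
 * 1. The output covers one hour of item data (i.e. historical data) and is triggered every 5 minutes,
 * that is, once each time a window ends, emitting the state saved for that window.
 * Output: window end time, item ID, popularity
 * <p>
 * 2. So for each item we need the window end time, the item ID and the popularity:
 * popularity: incremental aggregate function
 * window end time + item ID: full-window function
 * <p>
 * Sample output:
 * Window end time: 2017-11-26 12:20:00.0
 * Window contents:
 * NO 1: item ID = 2338453 popularity = 27
 * NO 2: item ID = 812879 popularity = 18
 * NO 3: item ID = 4443059 popularity = 18
 * NO 4: item ID = 3810981 popularity = 14
 * NO 5: item ID = 2364679 popularity = 14
 */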
public class HotItemsPractise {

    public static void main(String[] args) throws Exception {
        // 1. Set up the environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        DataStreamSource<String> inputStream = env.readTextFile("/Users/liangfangwei/IdeaProjects/flinkUserAnalays/data_file/UserBehavior.csv");
        // 2. Prepare the data source: parse each CSV line and keep only page-view ("pv") events
        DataStream<ItemBean> filterStream = inputStream.map(line -> {
            String[] split = line.split(",");
            return new ItemBean(Long.parseLong(split[0]), Long.parseLong(split[1]), Integer.parseInt(split[2]), split[3], Long.parseLong(split[4]));
        }).filter(item -> "pv".equals(item.getBehavior()));
        // 3. Aggregate the view count of each item per sliding window
        DataStream<ItemViewCount> windowsResult = filterStream.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<ItemBean>() {
            @Override
            public long extractAscendingTimestamp(ItemBean element) {
                // Convert seconds to the milliseconds Flink expects
                return element.getTimestamp() * 1000L;
            }
        }).keyBy("itemId")
                .timeWindow(Time.hours(1), Time.minutes(5))
                .aggregate(new MyAggreateCount(), new MyAllWindowsView());
        // 4. Collect the per-item results of each one-hour window and rank them
        SingleOutputStreamOperator<String> windowEnd = windowsResult
                .keyBy("windowEnd")
                .process(new ItemHotTopN(5));
        windowEnd.print();

        env.execute("HotItemsPractise");
    }

    /**
     * Incremental aggregate function: counts the views of a single item within a window.
     */
    public static class MyAggreateCount implements AggregateFunction<ItemBean, Long, Long> {

        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(ItemBean value, Long accumulator) {
            return accumulator + 1L;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            // Only called for merging windows (e.g. session windows), but returning null would be a latent bug
            return a + b;
        }
    }
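    /**
     * Full-window function: its input is the incremental aggregate result plus the key.
     * It attaches the window information, producing window end time + item ID + count.
     */
    public static class MyAllWindowsView implements WindowFunction<Long, ItemViewCount, Tuple, TimeWindow> {

        /**
         * @param tuple  the key (item ID) wrapped in a Tuple
         * @param window the time window being evaluated
         * @param input  the single pre-aggregated count for this key and window
         * @param out    collector for the resulting ItemViewCount
         * @throws Exception
         */
        @Override
        public void apply(Tuple tuple, TimeWindow window, Iterable<Long> input, Collector<ItemViewCount> out) throws Exception {
            long windowEnd = window.getEnd();
            long count = input.iterator().next();
            Long itemId = tuple.getField(0);

            out.collect(new ItemViewCount(itemId, windowEnd, count));
        }
    }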

    /**
     * Buffers the item counts of one window; when the window's timer fires, sorts them and emits the Top N.
     */
    public static class ItemHotTopN extends KeyedProcessFunction<Tuple, ItemViewCount, String> {

        ListState<ItemViewCount> itemViewCountListState;
        private int topN;

        public ItemHotTopN(int topN) {
            this.topN = topN;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            itemViewCountListState = getRuntimeContext().getListState(new ListStateDescriptor<ItemViewCount>("itemViewCount", ItemViewCount.class));
        }

        @Override
        public void processElement(ItemViewCount itemViewCount, Context ctx, Collector<String> out) throws Exception {
            itemViewCountListState.add(itemViewCount);
            // Fires 1 ms after the window end; registering the same timestamp again for a key is a no-op
            ctx.timerService().registerEventTimeTimer(itemViewCount.getWindowEnd() + 1L);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // Copy the ListState contents into an ArrayList so they can be sorted
            ArrayList<ItemViewCount> arraylist = new ArrayList<>();
            for (ItemViewCount itemViewCount : itemViewCountListState.get()) {
                arraylist.add(itemViewCount);
            }
            // This window is complete, so release its buffered state
            itemViewCountListState.clear();

            // Sort by count, descending; Long.compare avoids truncating the counts to int
            arraylist.sort(new Comparator<ItemViewCount>() {
                @Override
                public int compare(ItemViewCount o1, ItemViewCount o2) {
                    return Long.compare(o2.getCount(), o1.getCount());
                }
            });
            StringBuilder resultStringBuilder = new StringBuilder();
            resultStringBuilder.append("===================================" + "\n");
            // The timer was registered at windowEnd + 1, so subtract 1 to print the window end itself
            resultStringBuilder.append("Window end time: ").append(new Timestamp(timestamp - 1L).toString()).append("\n");

            for (int i = 0; i < Math.min(topN, arraylist.size()); i++) {
                resultStringBuilder
                        .append("NO ")
                        .append(i + 1)
                        .append(": item ID = ")
                        .append(arraylist.get(i).getItemId())
                        .append(" popularity = ")
                        .append(arraylist.get(i).getCount())
                        .append("\n");
            }
            resultStringBuilder.append("===================================\n");
            out.collect(resultStringBuilder.toString());
            // Pause between results, presumably to make the console output easier to follow
            Thread.sleep(1000L);
        }

    }
}
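
The second keyBy and the process function boil down to "group by window end, sort by count, keep the first N". The following is a rough plain-Java sketch of just that ranking step, without Flink state or timers. `TopNSketch`, `ViewCount`, `topItemIds`, and the window end timestamps are hypothetical names and values used only for illustration; `ViewCount` stands in for the article's `ItemViewCount` bean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopNSketch {

    /** Hypothetical stand-in for the article's ItemViewCount bean. */
    public static final class ViewCount {
        public final long itemId;
        public final long windowEnd;
        public final long count;

        public ViewCount(long itemId, long windowEnd, long count) {
            this.itemId = itemId;
            this.windowEnd = windowEnd;
            this.count = count;
        }
    }

    /** Groups counts by window end, then keeps the item IDs of the n largest counts per window. */
    public static Map<Long, List<Long>> topItemIds(List<ViewCount> input, int n) {
        Map<Long, List<ViewCount>> byWindow = new TreeMap<>();
        for (ViewCount vc : input) {
            byWindow.computeIfAbsent(vc.windowEnd, k -> new ArrayList<>()).add(vc);
        }
        Map<Long, List<Long>> result = new TreeMap<>();
        for (Map.Entry<Long, List<ViewCount>> entry : byWindow.entrySet()) {
            List<ViewCount> counts = entry.getValue();
            // Descending by count; Long.compare is safe where int subtraction could overflow.
            counts.sort((a, b) -> Long.compare(b.count, a.count));
            List<Long> ids = new ArrayList<>();
            for (int i = 0; i < Math.min(n, counts.size()); i++) {
                ids.add(counts.get(i).itemId);
            }
            result.put(entry.getKey(), ids);
        }
        return result;
    }

    public static void main(String[] args) {
        long w1 = 1_511_670_000_000L; // hypothetical window end (epoch ms)
        long w2 = w1 + 300_000L;      // the next window ends 5 minutes later
        List<ViewCount> input = Arrays.asList(
                new ViewCount(812879L, w1, 18L),
                new ViewCount(2338453L, w1, 27L),
                new ViewCount(3810981L, w1, 14L),
                new ViewCount(2338453L, w2, 9L));
        System.out.println(topItemIds(input, 2));
    }
}
```

In the Flink job the grouping is done by `keyBy("windowEnd")` and the "window is complete" signal comes from the event-time timer, so the sort only runs once all counts for that window end have arrived.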

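1.2 Hot page statistics

Requirement: every 5 minutes, output the hot pages viewed within the past hour.
Sample output:

Window end time: 2015-05-18 13:08:50.0

NO 1: page URL = /blog/tags/puppet?flav=rss20 popularity = 11
NO 2: page URL = /projects/xdotool/xdotool.xhtml popularity = 5
NO 3: page URL = /projects/xdotool/ popularity = 4
NO 4: page URL = /?flav=rss20 popularity = 4
NO 5: page URL = /robots.txt popularity = 4

Implementation approach: unlike the previous case, the timestamps in this data source are not in ascending order.
- How to make sure out-of-order data is not lost
  - 1. Configure the watermark with a bound on how far out of order the source data can be.
  - 2. Set an allowed lateness so the window closes later: when the original window end is reached, the data is aggregated first; after that, each late record that still belongs to the window is computed and output as it arrives.
  - 3. Data that arrives even later goes straight to the side output.
- How to make sure a late record overwrites the earlier result
  - 1. First window with incremental aggregation, then apply the full-window function, then group by window end time.
  - 2. keyBy the window end time, collect all data for that window end time, then sort and output.
  - 3. If late data arrives afterwards, the earlier result must be updated, so the earlier data is stored in MapState, with the page URL as key and the output result as value.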
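
The two ideas above can be sketched in plain Java: deciding what happens to a record given the current watermark, and keying results by URL so that a late update replaces the earlier count. This is a simplified model under stated assumptions, not Flink code: the watermark is passed in directly, a window's last valid timestamp is taken to be `windowEnd - 1`, and `LateDataSketch`, `Fate`, `classify`, `topUrls`, and all sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LateDataSketch {

    public enum Fate { ON_TIME, LATE_UPDATE, SIDE_OUTPUT }

    /** Simplified model of how a window treats a record, given the current watermark. */
    public static Fate classify(long windowEnd, long watermark, long allowedLatenessMs) {
        long maxTimestamp = windowEnd - 1; // last timestamp that belongs to the window
        if (watermark < maxTimestamp) {
            return Fate.ON_TIME;           // window has not fired yet
        }
        if (watermark < maxTimestamp + allowedLatenessMs) {
            return Fate.LATE_UPDATE;       // window already fired; its result is updated
        }
        return Fate.SIDE_OUTPUT;           // too late for the window; goes to the side output
    }

    /** Ranks URLs by count; the map is keyed by URL, so a newer count replaces the older one. */
    public static List<String> topUrls(Map<String, Long> countsByUrl, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(countsByUrl.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            urls.add(entries.get(i).getKey());
        }
        return urls;
    }

    public static void main(String[] args) {
        long windowEnd = 3_600_000L; // hypothetical window end (ms)
        long lateness = 60_000L;     // hypothetical allowed lateness: 1 minute
        System.out.println(classify(windowEnd, 3_500_000L, lateness)); // prints ON_TIME
        System.out.println(classify(windowEnd, 3_620_000L, lateness)); // prints LATE_UPDATE
        System.out.println(classify(windowEnd, 3_700_000L, lateness)); // prints SIDE_OUTPUT

        Map<String, Long> counts = new HashMap<>();
        counts.put("/a", 3L);
        counts.put("/b", 4L);
        counts.put("/a", 5L); // a late update overwrites the earlier count for /a
        System.out.println(topUrls(counts, 2)); // prints [/a, /b]
    }
}
```

With a list instead of a map, the late update for `/a` would appear as a second entry and the page could be ranked twice, which is the motivation for keying the stored results by URL.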

Code

/**
 *
 * -                   _ooOoo_
 * -                  o8888888o
 * -                  88" . "88
 * -                  (| -_- |)
 * -                   O\ = /O
 * -               ____/`---'\____
 * -             .   ' \\| |// `.
 * -              / \\||| : |||// \
 * -            / _||||| -:- |||||- \
 * -              | | \\\ - /// | |
 * -            | \_| ''\---/'' | |
 * -             \ .-\__ `-` ___/-. /
 * -          ___`. .' /--.--\ `. . __
 * -       ."" '< `.___\_<|>_/___.' >'"".
 * -      | | : `- \`.;`\ _ /`;.`/ - ` : | |
 * -        \ \ `-. \_ __\ /__ _/ .-` / /
 * ======`-.____`-.___\_____/___.-`____.-'======
 * `=---='
 * .............................................
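 * Buddha bless        never a BUG
 * <p>
 * Requirement
 * Every 5 minutes, output the top 5 pages of the past hour.
 * Results are computed per hour: the window size is one hour, the results within that hour are collected, and the results of each window are output by window end time. The window slide is set to 5 min.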
 */

public class HotPages {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        executionEnvironment.setParallelism(1);

        DataStreamSource<String> stringDataStreamSource = executionEnvironment.readTextFile("/Users/liangfangwei/IdeaProjects/flinkUserAnalays/data_file/apache.log");

        SimpleDateFormat simpleFormatter = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss");

        // Side-output tag for records that arrive after the allowed lateness
        OutputTag<ApacheLogEvent> lateTag = new OutputTag<ApacheLogEvent>("late_date") {
        };

        DataStream<PageViewCount> streamPageViewCount = stringDataStreamSource.map(line -> {
            String[] s = line.split(" ");
            // Convert the date string to an epoch timestamp in milliseconds
            Long timestamp = simpleFormatter.parse(s[3]).getTime();
            return new ApacheLogEvent(s[0], s[1], timestamp, s[
