Flink in Practice: E-commerce User Behavior Analysis
Related post:
Flink E-commerce Project Day 1: E-commerce User Behavior Analysis with Step-by-Step Walkthrough - Implementing Hot Items TopN
1. E-commerce User Behavior Analysis
E-commerce business analysis falls into three main categories:
- Statistical analysis
  - Clicks, page views
  - Hot items, recently hot items, hot items by category, traffic statistics
- Preference analysis
  - Favorites, likes, ratings, tags
  - User profiles, recommendation lists
- Risk control
  - Order placement, payment, login
  - Monitoring for fake orders, expired orders, and malicious logins (frequent failed logins within a short time)
1.1 Project Module Design
By traffic versus business, e-commerce analysis splits into two broad classes.
By statistics type, it breaks down as follows:
1.2 Data Sources
Data structures:
- UserBehavior
- ApacheLogEvent
2. Project Modules
This project implements five analyses.
The general processing pattern: key the stream by some id and define a window, then incrementally aggregate a field (emitting a specified output format), and finally key by window and accumulate the records that belong to the same window. A minimal runnable sketch of that shape follows.
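The sketch below uses toy tuples standing in for the project's POJOs; the class name and sample data are illustrative, not from the project. The real jobs later replace sum with aggregate plus a second keyBy on windowEnd and a process function:
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PipelineShape {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        // (itemId, count, eventTimeSeconds): toy stand-ins for user behaviors
        env.fromElements(
                Tuple3.of(1L, 1L, 1L), Tuple3.of(1L, 1L, 20L),
                Tuple3.of(2L, 1L, 25L), Tuple3.of(1L, 1L, 70L))
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Long, Long>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple3<Long, Long, Long> e) {
                        return e.f2 * 1000L;
                    }
                })
                .keyBy(0)                                       // 1. group by item id
                .timeWindow(Time.seconds(60), Time.seconds(30)) // 2. sliding window
                .sum(1)                                         // 3. aggregate within each window
                .print();                                       // real jobs continue: keyBy(windowEnd) + process()
        env.execute("pipeline shape");
    }
}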
2.1 Real-Time Hot Items Statistics
- Basic requirements
  - Compute the hottest items over the last hour, updated every five minutes
  - Popularity is measured by page views ("pv")
- Approach
  - Filter the user-behavior stream down to page-view ("pv") events
  - Build a sliding window (size 1 hour, slide 5 minutes) and count the views of each item
  - For each window, output the 5 most-viewed items
The second step works roughly as follows.
First, partition by item id.
Next, assign the records to sliding time windows.
A time window is a left-closed, right-open interval [start, end), and with sliding windows the same record is assigned to several overlapping windows. For example, with a 60 s window sliding every 30 s, a record at t = 95 s falls into both [60 s, 120 s) and [90 s, 150 s).
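A small plain-Java sketch of the assignment rule (it mirrors the logic of Flink's SlidingEventTimeWindows with zero offset; the class and variable names are illustrative):
public class SlidingWindowAssignment {
    public static void main(String[] args) {
        long size = 60_000L;   // 60 s window
        long slide = 30_000L;  // 30 s slide
        long t = 95_000L;      // element timestamp: 95 s
        // Start of the latest window containing t, then step back by the slide
        long lastStart = t - (t % slide);
        for (long start = lastStart; start > t - size; start -= slide) {
            // Left-closed, right-open: start <= t < start + size
            System.out.printf("[%d, %d)%n", start, start + size);
        }
        // Prints: [90000, 150000) and [60000, 120000)
    }
}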
Then comes the window aggregation.
aggregate() takes two parameters: the first is the incremental aggregation rule for the window, the second defines the structure of the output records.
窗口聚合函数
窗口聚合策略—每出现一条记录就加一。
需要实现AggragateFunction接口,并需要实现4个函数createAccumulator
、add
、merge
、getResult
interface AggregateFunction<IN, ACC, OUT>
/**
 * IN : input type
 * ACC: accumulator type
 * OUT: output type
 */
Window output function
Output record: ItemViewCount(itemId, windowEnd, count)
- itemId: item id
- windowEnd: window end time
- count: view count
interface WindowFunction<IN, OUT, KEY, W extends Window>
/**
 * IN : input type, i.e. the aggregate function's output type
 * OUT: the type you actually want to emit
 * KEY: the grouping key as a Tuple; here itemId, since windows are keyed by itemId
 * W  : the window type; w.getEnd() returns the window's end time
 */
Illustrated example:
First key by item id, then aggregate per window.
After that comes the ranking step: records belonging to the same window are collected together and the top 5 are emitted.
This requires an intermediate buffer that holds all records of one window, which is where state comes in.
Final sorted output: KeyedProcessFunction
- A low-level API for stateful streams
- KeyedProcessFunction processes each keyed substream after partitioning
- Keying by windowEnd guarantees that every record in a substream belongs to the same time window
- Read the substream's accumulated records from ListState, sort them, and emit
A ProcessFunction defines the processing logic of a KeyedStream.
After partitioning, each keyed stream has its own lifecycle:
- open: initialization; the stream's state handles can be obtained here
- processElement: called for every element in the stream
- onTimer: the callback that fires after a registered Timer expires
Creating the POJOs
Each needs getters/setters, no-arg and all-args constructors, and toString:
- ItemViewCount
  private Long itemId; private Long windowEnd; private Long count;
- UserBehavior
  private Long userId; private Long itemId; private Integer categoryId; private String behavior; private Long timestamp;
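Written out by hand, ItemViewCount looks roughly like the minimal sketch below (UserBehavior follows the same pattern; Lombok's @Data/@NoArgsConstructor/@AllArgsConstructor would generate the equivalent):
public class ItemViewCount {
    private Long itemId;     // item id
    private Long windowEnd;  // window end time
    private Long count;      // view count

    public ItemViewCount() {}  // Flink POJOs need a public no-arg constructor

    public ItemViewCount(Long itemId, Long windowEnd, Long count) {
        this.itemId = itemId;
        this.windowEnd = windowEnd;
        this.count = count;
    }

    public Long getItemId() { return itemId; }
    public void setItemId(Long itemId) { this.itemId = itemId; }
    public Long getWindowEnd() { return windowEnd; }
    public void setWindowEnd(Long windowEnd) { this.windowEnd = windowEnd; }
    public Long getCount() { return count; }
    public void setCount(Long count) { this.count = count; }

    @Override
    public String toString() {
        return "ItemViewCount{itemId=" + itemId + ", windowEnd=" + windowEnd + ", count=" + count + "}";
    }
}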
Code
For testing, a 60-second window with a 30-second slide is used instead of 1 hour / 5 minutes.
import beans.ItemViewCount;
import beans.UserBehavior;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.collect.Lists;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
/**
* @author Kewei
* @Date 2022/3/5 15:10
*/
public class HotItems {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("D:\\IdeaProjects\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv");
// Convert each line into a POJO and assign event timestamps
SingleOutputStreamOperator<UserBehavior> dataStream = inputStream.map(line -> {
String[] filed = line.split(",");
return new UserBehavior(new Long(filed[0]), new Long(filed[1]), new Integer(filed[2]), filed[3], new Long(filed[4]));
}).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
@Override
public long extractAscendingTimestamp(UserBehavior userBehavior) {
return userBehavior.getTimestamp() * 1000;
}
});
// Keep only pv events, key by item id, assign sliding time windows, incrementally aggregate each window, and emit results as ItemViewCount
SingleOutputStreamOperator<ItemViewCount> windowAggStream = dataStream
.filter(data -> "pv".equals(data.getBehavior()))
.keyBy("itemId")
.timeWindow(Time.seconds(60), Time.seconds(30))
.aggregate(new CountAgg(), new WindowResultFunction());
// Key the aggregated results by window end time, then emit the top N per window on a timer
SingleOutputStreamOperator<String> resultStream = windowAggStream
.keyBy("windowEnd")
.process(new TopNHotItems(5));
resultStream.print();
env.execute("hot items");
}
// Incremental aggregation: count the records of one item
/**
 * AggregateFunction<IN, ACC, OUT>
 * IN : input type
 * ACC: accumulator type
 * OUT: final result type
 */
private static class CountAgg implements AggregateFunction<UserBehavior, Long, Long> {
@Override
public Long createAccumulator() {
return 0L;
}
@Override
public Long add(UserBehavior userBehavior, Long aLong) {
return aLong + 1;
}
@Override
public Long getResult(Long aLong) {
return aLong;
}
@Override
public Long merge(Long aLong, Long acc1) {
return aLong + acc1;
}
}
// Defines the output format of each window
/**
 * WindowFunction<IN, OUT, KEY, W extends Window>
 * IN : input type, i.e. the aggregate function's output type
 * OUT: the type to emit
 * KEY: the grouping key as a Tuple; here itemId
 * W  : the window type; w.getEnd() returns the window's end time
 */
private static class WindowResultFunction implements WindowFunction<Long, ItemViewCount, Tuple, TimeWindow> {
@Override
public void apply(Tuple tuple, TimeWindow timeWindow, Iterable<Long> iterable, Collector<ItemViewCount> out) throws Exception {
Long itemId = tuple.getField(0);
Long end = timeWindow.getEnd();
Long next = iterable.iterator().next();
out.collect(new ItemViewCount(itemId, end, next));
}
}
/**
 * KeyedProcessFunction<KEY, IN, OUT>
 * KEY: type of the grouping key
 * IN : input type
 * OUT: output type
 */
private static class TopNHotItems extends KeyedProcessFunction<Tuple, ItemViewCount, String> {
private Integer topSize;
public TopNHotItems(Integer topSize) {
this.topSize = topSize;
}
ListState<ItemViewCount> itemViewCountListState;
@Override
public void open(Configuration parameters) throws Exception {
itemViewCountListState = getRuntimeContext().getListState(new ListStateDescriptor<ItemViewCount>("Item-view-count-list",ItemViewCount.class));
}
@Override
public void processElement(ItemViewCount count, KeyedProcessFunction<Tuple, ItemViewCount, String>.Context ctx, Collector<String> out) throws Exception {
itemViewCountListState.add(count);
// Register a timer for 1 ms after the window end; all records of a window share the same windowEnd, so when the timer fires every record of that window has been collected
ctx.timerService().registerEventTimeTimer(count.getWindowEnd() + 1);
}
// The timer task: sort the window's records and format the output
@Override
public void onTimer(long timestamp, KeyedProcessFunction<Tuple, ItemViewCount, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
// Copy the ListState into an ArrayList
ArrayList<ItemViewCount> itemViewCounts = Lists.newArrayList(itemViewCountListState.get().iterator());
// Sort by count in descending order (compareTo avoids the overflow risk of int subtraction)
itemViewCounts.sort(new Comparator<ItemViewCount>() {
@Override
public int compare(ItemViewCount o1, ItemViewCount o2) {
return o2.getCount().compareTo(o1.getCount());
}
});
// Format the result
StringBuffer stringBuffer = new StringBuffer();
stringBuffer.append("===========\n");
stringBuffer.append("窗口结束时间:").append(new Timestamp(timestamp - 1)).append("\n");
for (int i = 0; i < Math.min(topSize, itemViewCounts.size()); i++) {
ItemViewCount itemViewCount = itemViewCounts.get(i);
stringBuffer.append("NO ").append(i+1).append(":")
.append(" 商品id = ").append(itemViewCount.getItemId())
.append(" 热门度 = ").append(itemViewCount.getCount())
.append("\n");
}
stringBuffer.append("============\n\n");
// Throttle the output rate
Thread.sleep(1000L);
// Emit the result
out.collect(stringBuffer.toString());
}
}
}
Output:
===========
Window end time: 2017-11-26 09:00:30.0
NO 1: item id = 2455388 popularity = 2
NO 2: item id = 1715 popularity = 1
NO 3: item id = 2244074 popularity = 1
NO 4: item id = 3076029 popularity = 1
NO 5: item id = 176722 popularity = 1
============
...
2.2 Real-Time Traffic Statistics: Hot Pages
- Basic requirements
  - Compute the hottest pages in real time from the web server log
  - Count page visits per minute and output the 5 most-visited URLs, refreshed every five seconds
- Approach
  - Convert the time field in the Apache server log into a timestamp and use it as the Event Time
  - Keep only GET requests for pages, filtering out requests for static resources
  - Key by url, build a sliding window (size 1 minute, slide 5 seconds), incrementally aggregate, and emit in the specified format
  - Finally key by window end time, collect each window's records, and emit formatted output
Creating the POJOs (same pattern as the ItemViewCount sketch above)
- ApacheLogEvent
  private String ip; private String userId; private Long timestamp; private String method; private String url;
- PageViewCount
  private String url; private Long windowEnd; private Long count;
Code
import bean.ApacheLogEvent;
import bean.PageViewCount;
import org.apache.commons.compress.utils.Lists;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.regex.Pattern;
/**
* @author Kewei
* @Date 2022/3/10 15:57
*/
public class HotPages {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("D:\\IdeaProjects\\UserBehaviorAnalysis\\HotPages\\apache.log");
// Convert the string DataStream into POJOs, assign event time and watermarks (1 s of allowed out-of-orderness)
SingleOutputStreamOperator<ApacheLogEvent> dataStream = inputStream.map(line -> {
String[] fields = line.split(" ");
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss");
long time = simpleDateFormat.parse(fields[3]).getTime();
return new ApacheLogEvent(fields[0], fields[1], time, fields[5], fields[6]);
})
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<ApacheLogEvent>(Time.seconds(1)) {
@Override
public long extractTimestamp(ApacheLogEvent element) {
return element.getTimestamp();
}
});
/**
 * Filter: keep GET requests for pages, dropping static-resource requests.
 * Then key by url and apply a sliding window (size 1 minute, slide 5 seconds).
 * Finally aggregate incrementally and emit results in the PageViewCount format.
 */
SingleOutputStreamOperator<PageViewCount> resStream = dataStream
.filter(data -> "GET".equals(data.getMethod()))
.filter(data -> {
// Drop requests for static resources (paths ending in .css/.js/.png/.ico)
String regex = "^((?!\\.(css|js|png|ico)$).)*$";
return Pattern.matches(regex, data.getUrl());
})
.keyBy(ApacheLogEvent::getUrl)
.timeWindow(Time.minutes(1), Time.seconds(5))
.aggregate(new PageCountAgg(), new PageView());
/**
 * Key by window end time, then format each window's records for output
 */
SingleOutputStreamOperator<String> resultStream = resStream
.keyBy(PageViewCount::getWindowEnd)
.process(new MyProcessFunc());
resultStream.print();
env.execute();
}
public static class PageCountAgg implements AggregateFunction<ApacheLogEvent, Long, Long>{
@Override
public Long createAccumulator() {
return 0L;
}
@Override
public Long add(ApacheLogEvent apacheLogEvent, Long aLong) {
return aLong+1;
}
@Override
public Long getResult(Long aLong) {
return aLong;
}
@Override
public Long merge(Long aLong, Long acc1) {
return aLong+acc1;
}
}
public static class PageView implements WindowFunction<Long, PageViewCount,String, TimeWindow>{
@Override
public void apply(String s, TimeWindow window, Iterable<Long> input, Collector<PageViewCount> out) throws Exception {
out.collect(new PageViewCount(s, window.getEnd(), input.iterator().next()));
}
}
public static class MyProcessFunc extends KeyedProcessFunction<Long,PageViewCount,String>{
private ListState<PageViewCount> list;
@Override
public void onTimer(long timestamp, KeyedProcessFunction<Long, PageViewCount, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
// Copy the ListState into an ArrayList and sort by count, descending
ArrayList<PageViewCount> pageViewCountArrayList = Lists.newArrayList(list.get().iterator());
pageViewCountArrayList.sort((p1, p2) -> p2.getCount().compareTo(p1.getCount()));
StringBuffer result = new StringBuffer();
result.append("====================\n");
result.append("Window end time: ").append(new Timestamp(timestamp - 1)).append("\n");
for (int i = 0; i < Math.min(5, pageViewCountArrayList.size()); i++) {
PageViewCount pageView = pageViewCountArrayList.get(i);
result.append(pageView.getUrl()).append(" ").append(pageView.getCount()).append("\n");
}
result.append("===============\n\n\n");
Thread.sleep(1000);
out.collect(result.toString());
}
@Override
public void open(Configuration parameters) throws Exception {
list = getRuntimeContext().getListState(new ListStateDescriptor<PageViewCount>("PageViewCount",PageViewCount.class));
}
@Override
public void processElement(PageViewCount value, KeyedProcessFunction<Long, PageViewCount, String>.Context ctx, Collector<String> out) throws Exception {
list.add(value);
ctx.timerService().registerEventTimeTimer(value.getWindowEnd()+1);
}
}
}
Out-of-order output
Not entirely clear to me yet; revisit later.
2.3 Real-Time Traffic Statistics: PV and UV
- Basic requirements
  - Compute real-time PV and UV from the tracking logs
  - Count page views per hour (PV) and deduplicate users (UV)
- Approach
  - For PV, filter the events, apply a tumbling time window, and sum the counts
  - For UV, deduplicate user ids with a Set (see the sketch after the PV code below)
  - For very large data volumes, a Bloom filter can be used for deduplication instead
PV statistics code
import beans.UserBehavior;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
/**
* @author Kewei
* @Date 2022/3/10 17:15
*/
public class PageView {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("D:\\IdeaProjects\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv");
SingleOutputStreamOperator<UserBehavior> dataStream = inputStream.map(line -> {
String[] field = line.split(",");
return new UserBehavior(new Long(field[0]), new Long(field[1]), new Integer(field[2]), field[3], new Long(field[4]));
})
.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
@Override
public long extractAscendingTimestamp(UserBehavior element) {
return element.getTimestamp() * 1000L;
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> result = dataStream
.filter(data -> "pv".equals(data.getBehavior()))
.map(new MapFunction<UserBehavior, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(UserBehavior userBehavior) throws Exception {
return new Tuple2<>("pv", 1L);
}
})
.keyBy(data -> data.f0)
.timeWindow(Time.hours(1))
.sum(1);
result.print();
env.execute();
}
}
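The UV side is not shown in the original code. Below is a minimal sketch of the Set-based approach from the requirements, meant to drop into the same main method as the PV job above. It assumes the UserBehavior POJO exposes getUserId; since an AllWindowFunction buffers the entire window, this only suits moderate volumes, which is exactly why a Bloom filter is suggested for larger ones.
// Extra imports needed for this sketch:
// import java.util.HashSet;
// import java.util.Set;
// import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
// import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
// import org.apache.flink.util.Collector;
SingleOutputStreamOperator<Tuple2<String, Integer>> uvStream = dataStream
        .filter(data -> "pv".equals(data.getBehavior()))
        .timeWindowAll(Time.hours(1))  // one tumbling window over the whole stream
        .apply(new AllWindowFunction<UserBehavior, Tuple2<String, Integer>, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<UserBehavior> values,
                              Collector<Tuple2<String, Integer>> out) {
                Set<Long> userIds = new HashSet<>();  // dedupe user ids within the window
                for (UserBehavior ub : values) {
                    userIds.add(ub.getUserId());
                }
                out.collect(Tuple2.of("uv", userIds.size()));
            }
        });
uvStream.print();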