How to Implement a TopN Leaderboard with Flink

轩裳已逝铭崖

Tags: flink, java, database

Article link: https://blog.csdn.net/qq_43147161/article/details/129886112

TopN

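Goal: find the 5 most-visited URLs within the last 10 seconds, refreshed every 5 seconds.

The requirement is to aggregate over 10 seconds of data and refresh every 5 seconds, so the first decision is clear: this is a sliding window with a size of 10 seconds and a slide of 5 seconds.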
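
To make the sliding-window idea concrete, here is a minimal plain-Java sketch (no Flink dependency) that computes which 10-second windows, sliding every 5 seconds, contain a given event timestamp. The class and method names are illustrative assumptions, and windows are assumed to be aligned to multiples of the slide with no offset.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowDemo {
    /** Returns the start timestamps (ms) of every sliding window that contains the given event time. */
    static List<Long> windowStarts(long timestamp, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is <= timestamp, aligned to multiples of the slide
        long lastStart = timestamp - Math.floorMod(timestamp, slideMs);
        // Walk back one slide at a time while [start, start + size) still covers the timestamp
        for (long start = lastStart; start > timestamp - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // An event at t = 12 s with a 10 s window sliding every 5 s
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L)); // prints [10000, 5000]
    }
}
```

Each event therefore belongs to size / slide = 2 windows, which is why every request is counted in two consecutive results.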

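Option 1 (poor): use a bounded-out-of-orderness watermark strategy (out-of-order stream plus a delay) and do not partition the computation. All data in the 10-second window goes through an incremental aggregate function plus a full-window function that counts the visits of every URL, sorts all URLs by visit count, and emits the top 5.

Option 2 (acceptable): use the same watermark strategy, but partition by URL. Each partition uses an incremental aggregate function plus a full-window function to count the total visits of its URL. The per-URL results are then put into a distributed cache, and a delayed event-time timer is registered with a trigger time of window end + 1; when it fires, the results are sorted and emitted.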

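A few questions to think through:

Why put the results into a distributed cache?

With a parallelism greater than 1 there are multiple subtasks. Whether a parallel task counts one URL or several, the subtasks generally do not all run on the same task node; the task nodes may be deployed on different machines and share no data. The per-URL results therefore have to be gathered in one place before the results of the parallel operator tasks can be merged.

The "distributed cache" here is really Flink's list state (ListState): each element is the aggregated result of one partition. It is keyed state rather than a cache that every task node can read: the keyBy on windowEnd routes all per-URL results of the same window to the same subtask, and the state is scoped to that key.

The timer is registered when the first partition's result arrives. Registering it again for later results is a no-op, because the trigger time is identical (every partition has the same windowEnd).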

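Why use a timer?

By the time the partitions send their per-URL results to the single final operator task, each partition's window has already reached its end time and none of them will wait for more data. The final operator, however, cannot tell how many partitions will send a result or when the last one has arrived (it is not a full-window computation). So when the first partition's result arrives, it is stored in the state and a timer is set. Later results are stored as well; they also try to set the timer, but it is already registered for the same time, so nothing new is scheduled. When the timer fires, the state holds the results of all partitions and they only need to be ranked.

Why does the timer fire with a delay, and why 1 millisecond?

This simply follows the principle of "letting the partition totals arrive a little late"; it could be 1 millisecond or 100 milliseconds. Because it is an event-time timer, it fires when the final operator's watermark passes windowEnd + 1, not after 1 millisecond of wall-clock time.

However, this delay alone should not be relied on to handle late data. If no watermark delay is configured upstream, the delay set here effectively means "the total time to wait for all partitions' results to reach the final operator task", and setting it too short will cause problems.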
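
The buffering and timer behaviour described above can be modelled in plain Java. This is only a sketch of the idea, not Flink's implementation: the class `TimerModel`, its methods, and the string-based results are assumptions made for illustration. It shows that registering the same trigger time twice leaves a single timer, and that nothing is emitted until the watermark reaches window end + 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TimerModel {
    // Buffered "url=count" results, grouped by the window end they belong to
    private final Map<Long, List<String>> buffer = new HashMap<>();
    // Registered timers: fire time -> window end. A key exists only once, so re-registering is a no-op
    private final TreeMap<Long, Long> timers = new TreeMap<>();

    /** A per-URL result arrives: buffer it and (re-)register the timer at windowEnd + 1. */
    void onElement(long windowEnd, String url, long count) {
        buffer.computeIfAbsent(windowEnd, k -> new ArrayList<>()).add(url + "=" + count);
        timers.put(windowEnd + 1, windowEnd);
    }

    /** The watermark advances: fire every timer whose time is <= watermark and emit its buffer. */
    List<String> onWatermark(long watermark) {
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            long windowEnd = timers.pollFirstEntry().getValue();
            fired.add(windowEnd + " -> " + buffer.remove(windowEnd));
        }
        return fired;
    }

    int pendingTimers() {
        return timers.size();
    }

    public static void main(String[] args) {
        TimerModel model = new TimerModel();
        model.onElement(10_000L, "./home", 4);
        model.onElement(10_000L, "./cart", 6);          // same windowEnd: still one timer
        System.out.println(model.pendingTimers());      // prints 1
        System.out.println(model.onWatermark(10_000L)); // prints []
        System.out.println(model.onWatermark(10_001L)); // prints [10000 -> [./home=4, ./cart=6]]
    }
}
```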

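Option 2 is better than Option 1 in the following respect:

Option 1: the incremental aggregation of all data is done by a single operator task.
Option 2: the data is partitioned by URL, and multiple operator tasks incrementally aggregate the visit counts of their own partitions.
In both options the final sort happens in one place per window, so there is no difference there; a top-N has to bring all the counts together to rank them.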

(1) Prerequisites for the implementation:

Source POJO:

    // Requires: import java.sql.Timestamp;
    public static class Event {
        /**
         * User name
         */
        public String user;
        /**
         * Path of the visited URL
         */
        public String url;
        /**
         * Time of the visit (epoch milliseconds)
         */
        public Long timestamp;

        public Event() {
        }

        public String getUser() {
            return user;
        }

        public String getUrl() {
            return url;
        }

        public Long getTimestamp() {
            return timestamp;
        }

        public Event(String user, String url, Long timestamp) {
            this.user = user;
            this.url = url;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return "Event{" +
                    "user='" + user + '\'' +
                    ", url='" + url + '\'' +
                    ", timestamp=" + new Timestamp(timestamp) +
                    '}';
        }
    }

Data source:

A simulated source that emits one random URL request per second and never stops on its own.

    // Requires: import java.util.Random;
    //           import org.apache.flink.streaming.api.functions.source.SourceFunction;
    public static class customSource implements SourceFunction<Event> {
        // volatile because cancel() is called from a different thread than run()
        private volatile boolean running = true;

        /**
         * sourceContext.collect emits a record downstream.
         * Once run returns, the source stops emitting.
         *
         * @param sourceContext context used to emit records
         */
        @Override
        public void run(SourceContext<Event> sourceContext) throws InterruptedException {
            Random random = new Random();
            String[] users = {"Mary", "Alice", "Bob", "Cary"};
            String[] urls = {"./home", "./cart", "./fav", "./prod?id=1",
                    "./prod?id=2"};
            while (running) {
                // Pick a random user and URL and stamp the event with the current time
                sourceContext.collect(new Event(users[random.nextInt(users.length)],
                        urls[random.nextInt(urls.length)], System.currentTimeMillis()
                ));
                Thread.sleep(1000);
            }

        }

        /**
         * Stops the source.
         */
        @Override
        public void cancel() {
            running = false;
        }
    }

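(2) Option 1

Overall flow:

assignTimestampsAndWatermarks has no special requirements: use a bounded-out-of-orderness strategy with a 3-second delay.
Because all data of the 10-second window has to be collected together, keyBy must send every record to the same partition, so the key selector always returns true.
It is a sliding window, so use SlidingEventTimeWindows with a 10-second size and a 5-second slide.
Incremental aggregation is combined with a full-window function, so aggregate is given two functions.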

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setAutoWatermarkInterval(200);
        env.addSource(new aggregateTest.customSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<aggregateTest.Event>forBoundedOutOfOrderness(Duration.ofSeconds(3)).
                                withTimestampAssigner((SerializableTimestampAssigner<aggregateTest.Event>) 
                                                      (event, l) -> event.timestamp))
                .keyBy(event -> true)
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                .aggregate(new CustomAggregateFunction(), new CustomProcessWindowFunction())
                .print();
        env.execute();
    }

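Incremental aggregate function

The accumulator is a HashMap whose key is the URL and whose value is the visit count.
add folds each event into the map: if the URL is already present its count is incremented by 1, otherwise its count is initialised to 1.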

    private static class CustomAggregateFunction implements AggregateFunction<aggregateTest.Event, HashMap<String, Long>, HashMap<String, Long>> {


        @Override
        public HashMap<String, Long> createAccumulator() {
            return new HashMap<>();
        }

        @Override
        public HashMap<String, Long> add(aggregateTest.Event event, HashMap<String, Long> tmpHashMap) {
            if (tmpHashMap.containsKey(event.url)) {
                tmpHashMap.put(event.url, tmpHashMap.get(event.url) + 1);
            } else {
                tmpHashMap.put(event.url, 1L);
            }
            return tmpHashMap;
        }

        @Override
        public HashMap<String, Long> getResult(HashMap<String, Long> tmpHashMap) {
            return tmpHashMap;
        }

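        @Override
        public HashMap<String, Long> merge(HashMap<String, Long> stringLongHashMap, HashMap<String, Long> acc1) {
            // Only called for merging windows (e.g. session windows); combine the counts instead of returning null
            acc1.forEach((url, count) -> stringLongHashMap.merge(url, count, Long::sum));
            return stringLongHashMap;
        }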
    }
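
The accumulator logic can be tried outside Flink. The following is a minimal plain-Java sketch, with the class name `UrlCountAccumulator` chosen only for illustration: `add` uses `Map.merge` to cover both the "already present" and "first visit" cases in one call, and `merge` combines two accumulators by summing the counts per URL.

```java
import java.util.HashMap;
import java.util.Map;

class UrlCountAccumulator {
    /** Folds one visit into the accumulator: inserts 1 for a new URL, otherwise increments. */
    static Map<String, Long> add(String url, Map<String, Long> acc) {
        acc.merge(url, 1L, Long::sum);
        return acc;
    }

    /** Combines two accumulators by summing the counts of each URL into the first one. */
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        b.forEach((url, count) -> a.merge(url, count, Long::sum));
        return a;
    }

    public static void main(String[] args) {
        Map<String, Long> first = new HashMap<>();
        add("./home", first);
        add("./home", first);
        add("./cart", first);

        Map<String, Long> second = new HashMap<>();
        add("./cart", second);

        Map<String, Long> merged = merge(first, second);
        System.out.println(merged.get("./home")); // prints 2
        System.out.println(merged.get("./cart")); // prints 2
    }
}
```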

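Full-window function

The data here is already complete (the map covers the whole window), so taking the first element from the iterator is enough.
To sort the map, use Java 8 streams to turn its entries into a list ordered by value.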

    public static class CustomProcessWindowFunction extends ProcessWindowFunction<HashMap<String, Long>, String, Boolean, TimeWindow> {
        /**
         * @param key       value returned by the key selector
         * @param context   context (gives access to the current processing time, current watermark, and window state)
         * @param iterable  iterator over the window's contents
         * @param collector collector used to emit results
         * @throws Exception
         */
        @Override
        public void process(Boolean key, Context context, Iterable<HashMap<String, Long>> iterable, Collector<String> collector) throws Exception {
            // With an incremental aggregate in front, the iterable holds exactly one pre-aggregated map
            HashMap<String, Long> valueMap = iterable.iterator().next();
            List<Map.Entry<String, Long>> finalValue = sortByMap(valueMap);
            TimeWindow window = context.window();
            long start = window.getStart();
            long end = window.getEnd();
            collector.collect("Current time: " + new Date() + " key: " + key + " window start: " + new Date(start) + "\t window end: " + new Date(end) + "\t top five: " + finalValue.toString());
        }

        private List<Map.Entry<String, Long>> sortByMap(HashMap<String, Long> valueMap) {
            // Sort by visit count, highest first, and keep only the top five
            return valueMap.entrySet().stream()
                    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                    .limit(5)
                    .collect(Collectors.toList());
        }
    }

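Looking at the final output:
- The printed current time advances by the slide interval (5 seconds) from one result to the next.
To check the visit counts: one request is simulated per second and the source draws from exactly 5 URLs, so the visit counts of all URLs in a 10-second window should add up to 10, and they do.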

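(3) Option 2

Overall flow

The watermark strategy and the window are the same as in Option 1 and are not repeated here.
Note that keyBy now uses the visited URL instead of always returning true.
Another difference from Option 1 is that a second keyBy and a process function follow the aggregate.

The reason for .keyBy(urlCount -> urlCount.windowEnd) is that only results with the same window end time may be ranked together.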

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setAutoWatermarkInterval(200);
        SingleOutputStreamOperator<String> returns = env.addSource(new aggregateTest.customSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<aggregateTest.Event>forBoundedOutOfOrderness(Duration.ofSeconds(3)).
                                withTimestampAssigner((SerializableTimestampAssigner<aggregateTest.Event>) (event, l) -> event.timestamp))
                .keyBy(event -> event.url)
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                .aggregate(new CustomAggregateFunction(), new CustomProcessWindowFunction())
                .keyBy(urlCount -> urlCount.windowEnd)
                .process(new CustomKeyProcessFunction())
                .returns(String.class);
        returns.print();
        env.execute();
    }

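Incremental aggregate function per partition

Each partition counts the visits of a single URL, so one object is enough as the accumulator.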

 private static class CustomAggregateFunction implements AggregateFunction<aggregateTest.Event, EventUrlCount, EventUrlCount> {


        @Override
        public EventUrlCount createAccumulator() {
            return new EventUrlCount("", 0L, 0L);
        }

        @Override
        public EventUrlCount add(aggregateTest.Event event, EventUrlCount tmpEventUrlCount) {
            tmpEventUrlCount.count += 1;
            tmpEventUrlCount.url = event.url;
            return tmpEventUrlCount;
        }

        @Override
        public EventUrlCount getResult(EventUrlCount urlCount) {
            return urlCount;
        }

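        @Override
        public EventUrlCount merge(EventUrlCount urlCount1, EventUrlCount urlCount2) {
            // Only called for merging windows (e.g. session windows); add the counts instead of returning null
            urlCount1.count += urlCount2.count;
            return urlCount1;
        }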
    }
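
The EventUrlCount type used by this function is not shown in the article. A minimal sketch of what it might look like follows, with the fields, the three-argument constructor, and getCount inferred from how the surrounding code uses them; the other accessors and toString are assumptions. In the job itself the class would need to be public (for example a public static nested class, like Event) with a public no-argument constructor so that Flink can treat it as a POJO.

```java
class EventUrlCount {
    /** Visited URL this count belongs to. */
    public String url;
    /** Number of visits within the window. */
    public Long count;
    /** End timestamp (ms) of the window the count was computed for. */
    public Long windowEnd;

    // No-argument constructor, needed for POJO serialization
    public EventUrlCount() {
    }

    public EventUrlCount(String url, Long count, Long windowEnd) {
        this.url = url;
        this.count = count;
        this.windowEnd = windowEnd;
    }

    public String getUrl() {
        return url;
    }

    public Long getCount() {
        return count;
    }

    public Long getWindowEnd() {
        return windowEnd;
    }

    @Override
    public String toString() {
        return "EventUrlCount{url='" + url + "', count=" + count + ", windowEnd=" + windowEnd + '}';
    }
}
```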

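Full-window function per partition

The final operator task groups the results by their shared window end time, so the window end must be set here for the later keyBy to use.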

    public static class CustomProcessWindowFunction extends ProcessWindowFunction<EventUrlCount, EventUrlCount, String, TimeWindow> {

        /**
         * Because the incremental aggregate and the full-window function are used together,
         * the iterable holds exactly one element.
         *
         * @param s         the key (the URL)
         * @param context   window context
         * @param iterable  the single pre-aggregated result
         * @param collector collector used to emit results
         * @throws Exception
         */
        @Override
        public void process(String s, Context context, Iterable<EventUrlCount> iterable, Collector<EventUrlCount> collector) throws Exception {
            EventUrlCount finalEventUrlCount = iterable.iterator().next();
            // Attach the window end so the downstream keyBy can group results of the same window
            finalEventUrlCount.windowEnd = context.window().getEnd();
            collector.collect(finalEventUrlCount);
        }
    }

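Final merge function + timer

In open, the "distributed cache" is defined; it is really a list state, assigned to eventUrlCounts.
Whenever a partition's result arrives, it is added to the list and a timer is registered whose trigger time is window end + 1.
When the timer fires, the list is sorted by count and emitted.
⚠️ Note that the number of ranks to keep (the N in top N) is not passed in as a parameter here.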

    public static class CustomKeyProcessFunction extends KeyedProcessFunction<Long, EventUrlCount, String> {

        // Number of ranks to emit; hard-coded here rather than passed in as a parameter
        private static final int TOP_N = 5;

        ListState<EventUrlCount> eventUrlCounts;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            eventUrlCounts = getRuntimeContext().getListState(new ListStateDescriptor<>("123-demo-list", EventUrlCount.class));
        }

        @Override
        public void processElement(EventUrlCount urlCount, Context context, Collector<String> collector) throws Exception {
            // Buffer the per-URL result; registering the same timer again is a no-op
            eventUrlCounts.add(urlCount);
            context.timerService().registerEventTimeTimer(urlCount.windowEnd + 1);
        }

        @Override
        public void onTimer(long timestamp, KeyedProcessFunction<Long, EventUrlCount, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
            List<EventUrlCount> finalList = sortByUrlCount(this.eventUrlCounts);
            // The timer fires at windowEnd + 1, so subtract 1 to report the window end itself
            out.collect("Window end: " + new java.sql.Time(timestamp - 1) + "\t current ranking: " + JSON.toJSONString(finalList));
            this.eventUrlCounts.clear();
        }

        private List<EventUrlCount> sortByUrlCount(ListState<EventUrlCount> eventUrlCounts) throws Exception {
            List<EventUrlCount> list = new ArrayList<>();
            for (EventUrlCount urlCount : eventUrlCounts.get()) {
                list.add(urlCount);
            }
            // Sort by count, highest first, and only then keep the first TOP_N entries
            return list.stream()
                    .sorted(Comparator.comparing(EventUrlCount::getCount).reversed())
                    .limit(TOP_N)
                    .collect(Collectors.toList());
        }
    }
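
The ranking step itself can be isolated into a small helper that takes N as a parameter, which addresses the warning about the top-N size not being passed in. This is a minimal plain-Java sketch: `TopNHelper` and its nested `UrlCount` class are hypothetical stand-ins for the article's EventUrlCount. The order of operations matters: the list is sorted first and truncated afterwards, because truncating before sorting would rank only whichever N results happened to arrive first.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TopNHelper {
    /** Hypothetical stand-in for EventUrlCount, reduced to the two fields needed here. */
    static class UrlCount {
        final String url;
        final long count;

        UrlCount(String url, long count) {
            this.url = url;
            this.count = count;
        }

        long getCount() {
            return count;
        }

        @Override
        public String toString() {
            return url + "=" + count;
        }
    }

    /** Sorts by count, highest first, then keeps at most n entries. */
    static List<UrlCount> topN(List<UrlCount> all, int n) {
        return all.stream()
                .sorted(Comparator.comparingLong(UrlCount::getCount).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<UrlCount> counts = Arrays.asList(
                new UrlCount("./home", 1), new UrlCount("./cart", 7), new UrlCount("./fav", 3));
        System.out.println(topN(counts, 2)); // prints [./cart=7, ./fav=3]
    }
}
```

Inside the KeyedProcessFunction, N could be supplied the same way, for example as a constructor argument stored in a field; that wiring is an assumption and is not part of the original code.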