Flink in Real Time: Top-N Hot Pages Within a Time Window, Solution 1

I披荆斩棘I

Article link: https://blog.csdn.net/gene20200509/article/details/127094459

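Window computation is one of the most common processing patterns in real Flink applications. A typical requirement: report the three most popular URLs over the last 10 seconds, refreshed every 5 seconds.
There are several ways to solve this Top-N hot pages problem. The simplest idea is to collect the URL click data with a single sliding window, without partitioning the stream by URL, and to use each page's visit count within the window as its measure of popularity.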
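To make the "10 seconds, refreshed every 5 seconds" requirement concrete, here is a small sketch in plain Java (no Flink dependency) of how a sliding window assigner decides which windows an event belongs to. The class and method names are hypothetical, and the sketch assumes non-negative timestamps and a zero window offset.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {
    /** Returns the start timestamps (ms) of every sliding window that contains the given timestamp. */
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that begins at or before the timestamp
        long lastStart = timestamp - (timestamp % slide);
        // Step backwards by one slide while the window still covers the timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 10_000L;  // window length: 10 seconds
        long slide = 5_000L;  // a new window opens every 5 seconds
        for (long start : windowStarts(1664361381459L, size, slide)) {
            System.out.println("[" + start + ", " + (start + size) + ")");
        }
    }
}
```

With a 10-second size and a 5-second slide, every event falls into two overlapping windows, so each click is counted twice overall, once per window. For the sample timestamp above the two windows are [1664361380000, 1664361390000) and [1664361375000, 1664361385000).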

First, prepare some mock data: define a ClickSource class that generates simulated user click events:

package com.atguigu.bean;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Random;

/** Emits one random click Event per second until the job is cancelled. */
public class ClickSource implements SourceFunction<Event> {
    // volatile and per-instance: cancel() is called from a different thread than run()
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        Random random = new Random();
        String[] users = {"tom", "mary", "bibby", "diff", "alex"};
        String[] urls = {"/home", "page1", "order", "favor", "display1"};

        while (running) {
            // Pick a random user and url, stamped with the current wall-clock time
            String user = users[random.nextInt(users.length)];
            String url = urls[random.nextInt(urls.length)];
            long timestamp = System.currentTimeMillis();
            sourceContext.collect(new Event(user, url, timestamp));

            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
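The Event class itself is not shown in this article. The following is a hypothetical reconstruction inferred only from how it is used here: the three-argument constructor, the direct field accesses event.url and event.timestamp, and the Event{...} format in the printed output. The field types are assumptions, and in the project it would sit in the com.atguigu.bean package.

```java
// Hypothetical reconstruction of the Event POJO (not taken from the original article)
public class Event {
    // Public fields plus a public no-argument constructor let Flink treat this as a POJO type
    public String user;
    public String url;
    public long timestamp; // event time in epoch milliseconds

    public Event() {
    }

    public Event(String user, String url, long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + timestamp + "}";
    }
}
```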

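Next, open a window on the DataStream and process it with a full-window function. Note that the code below does not call windowAll with ProcessAllWindowFunction. Instead it sends every record to one constant key with keyBy and then uses ProcessWindowFunction, which has the same effect. For simplicity it also uses a 10-second tumbling window; replacing the assigner with SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)) would give the 10-second window refreshed every 5 seconds described at the start: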

package com.atguigu.ProcessFunction;

import com.atguigu.bean.ClickSource;
import com.atguigu.bean.Event;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;

/**
 * Computes the top-2 URLs within each time window in real time.
 */
public class TopNExample_ProcessWindowFunction1 {
    public static void main(String[] args) throws Exception {
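        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use parallelism 1 to make testing easier
        env.setParallelism(1);
        // Read the input stream and assign event-time timestamps and watermarks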
        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
                                .withTimestampAssigner(((event, l) -> event.timestamp))
                );

        stream.print("input");
        // Send every record to the same key, open a window on the resulting KeyedStream,
        // and call .process with an anonymous subclass of the ProcessWindowFunction abstract class
        SingleOutputStreamOperator<String> process = stream.keyBy(value -> true)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<Event, String, Boolean, TimeWindow>() {
                    @Override
                    public void process(Boolean aBoolean, ProcessWindowFunction<Event, String, Boolean, TimeWindow>.Context context, Iterable<Event> iterable, Collector<String> collector) throws Exception {
                        // Count visits per url in a HashMap: each event increments that url's count by 1
                        HashMap<String, Long> hashMap = new HashMap<>();
                        for (Event event : iterable) {
                            Long count = hashMap.getOrDefault(event.url, 0L);
                            hashMap.put(event.url, count + 1);
                        }

                        // Copy the HashMap entries into a list so they can be sorted
                        ArrayList<Tuple2<String, Long>> urlList = new ArrayList<>();
                        for (String url : hashMap.keySet()) {
                            urlList.add(Tuple2.of(url, hashMap.get(url)));
                        }
                        // Sort by count in descending order
                        urlList.sort(new Comparator<Tuple2<String, Long>>() {
                            @Override
                            public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
                                return Long.compare(o2.f1, o1.f1);
                            }
                        });

                        // For convenience, format the result as a String for printing
                        // ("窗口" means "window", "的访问量" means "visit count")
                        StringBuilder buffer = new StringBuilder();
                        buffer.append("===================\n");
                        // Get the window start and end timestamps from the context
                        buffer.append("窗口[" + context.window().getStart() + "~" + context.window().getEnd() + "]\n");
                        // Walk the first two entries of the sorted list (indices 0 and 1) to get the top 2
                        for (int i = 0; i < Math.min(2, urlList.size()); i++) {
                            Tuple2<String, Long> tuple2 = urlList.get(i);
                            buffer.append("NO." + (i + 1) + ":url: " + tuple2.f0 + " 的访问量： " + tuple2.f1 + "\n");
                        }
                        collector.collect(buffer.toString());
                    }
                });
        process.print("processed"); // Print the results
        // Start the job
        env.execute();
    }
}

Output (in the printed text, 窗口 means "window" and 的访问量 means "visit count"):

input> Event{user='alex', url='favor', timestamp=1664361379435}
input> Event{user='diff', url='page1', timestamp=1664361380449}
processed> ===================
窗口[1664361370000~1664361380000]
NO.1:url: favor 的访问量： 1

input> Event{user='bibby', url='order', timestamp=1664361381459}
input> Event{user='bibby', url='order', timestamp=1664361382465}
input> Event{user='diff', url='display1', timestamp=1664361383474}
input> Event{user='bibby', url='order', timestamp=1664361384487}
input> Event{user='diff', url='/home', timestamp=1664361385502}
input> Event{user='mary', url='display1', timestamp=1664361386511}
input> Event{user='diff', url='display1', timestamp=1664361387517}
input> Event{user='diff', url='order', timestamp=1664361388526}
input> Event{user='diff', url='favor', timestamp=1664361389538}
input> Event{user='tom', url='/home', timestamp=1664361390545}
processed> ===================
窗口[1664361380000~1664361390000]
NO.1:url: order 的访问量： 4
NO.2:url: display1 的访问量： 3
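As a sanity check, the count-then-sort step can be reproduced in plain Java without Flink. The sketch below uses a hypothetical class TopNCheck and feeds it the URLs of the ten events whose timestamps fall in [1664361380000, 1664361390000), copied from the log above. It is only an illustration of the logic inside process, not part of the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopNCheck {
    /** Counts each url and returns the n most frequent entries, highest count first. */
    static List<Map.Entry<String, Long>> topN(List<String> urls, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum); // add 1 to the url's count, starting from 1
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // descending by count
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // urls of the events in the second window, in arrival order
        List<String> window = Arrays.asList("page1", "order", "order", "display1", "order",
                "/home", "display1", "display1", "order", "favor");
        for (Map.Entry<String, Long> entry : topN(window, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

This prints order 4 followed by display1 3, matching the second window in the log. The event with timestamp 1664361390545 belongs to the next window and is not counted here.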

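This implementation does not partition the stream by URL. Every URL is sent to a single key and therefore into one window handled by one subtask, which is effectively the same as forcing the parallelism to 1. This should be avoided in real applications wherever possible. The same caveat applies to AllWindowedStream, which the Flink documentation notes is in many cases a non-parallel transformation.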