Flink Real-Time Processing: Top-N Hot Pages Within a Time Window, Solution 1

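Window computation is one of the most common processing patterns in real-world Flink applications. For example, you may need to find the three most popular URLs over the last 10 seconds and refresh the result every 5 seconds.
There are several ways to solve this kind of hot-page Top-N problem. The simplest idea is to collect the URL visit data in a sliding window without grouping by URL, and to use each page's visit count as the measure of its popularity.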
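
To make the "10 seconds, updated every 5 seconds" requirement concrete, here is a minimal sketch in plain Java (no Flink dependency) of how a sliding window scheme maps one event timestamp to the windows that contain it. The class and method names are hypothetical, and the arithmetic assumes non-negative timestamps and a zero window offset; it only illustrates the idea and is not Flink's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    /** Returns the start timestamps of every sliding window that contains the given timestamp. */
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Start of the latest window that begins at or before the timestamp
        // (assumes timestamp >= 0 and no window offset).
        long lastStart = timestamp - (timestamp % slide);
        // Step back one slide at a time while the window [start, start + size) still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A 10-second window sliding every 5 seconds; the timestamp is one of the sample events shown later.
        System.out.println(windowStarts(1664361381459L, 10_000L, 5_000L));
        // prints [1664361380000, 1664361375000]
    }
}
```

With a 10-second size and a 5-second slide, every event belongs to two overlapping windows, which is why a result can be emitted every 5 seconds while still covering the last 10 seconds.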

First, prepare some mock data by defining a ClickSource class that simulates user click events:

package com.atguigu.bean;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Calendar;
import java.util.Random;

public class ClickSource implements SourceFunction<Event> {
    // Per-instance flag; volatile because cancel() is called from a different thread than run()
    private volatile boolean run = true;
    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        Random random = new Random();
        String[] users = {"tom","mary","bibby","diff","alex"};
        String[] urls = {"/home","page1","order","favor","display1"};

        while (run){
            String user = users[random.nextInt(users.length)];
            String url = urls[random.nextInt(urls.length)];
            long timestamps = Calendar.getInstance().getTimeInMillis();
            Event event = new Event(user, url, timestamps);
            sourceContext.collect(event);

            Thread.sleep(1000);
        }

    }

    @Override
    public void cancel() {
        run = false;
    }
}
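
The Event class used by ClickSource is not shown in this post. The following is a hypothetical sketch of what it might look like, inferred only from how it is used here: a three-argument constructor, public `url` and `timestamp` fields, and the `toString` format visible in the sample output. The field types are assumptions, and in the article's project the class would sit in the `com.atguigu.bean` package.

```java
// Hypothetical sketch of the Event bean; the original post does not include this class.
// In the article's project it would live in package com.atguigu.bean next to ClickSource.
public class Event {
    // Public fields plus a public no-argument constructor allow Flink to treat this as a POJO type.
    public String user;
    public String url;
    public long timestamp; // assumed to be epoch milliseconds, as produced by ClickSource

    public Event() {
    }

    public Event(String user, String url, long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        // Matches the format seen in the sample output, e.g. Event{user='alex', url='favor', timestamp=...}
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + timestamp + "}";
    }
}
```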

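Next, open a window on the DataStream and process each window with a full-window function. The code below sends every event to a single key with keyBy(value -> true), opens a 10-second tumbling event-time window, and implements ProcessWindowFunction to emit the top 2 URLs of each window (a simplified variant of the sliding-window top 3 described at the start):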

package com.atguigu.ProcessFunction;

import com.atguigu.bean.ClickSource;
import com.atguigu.bean.Event;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;

/**
 * Computes the top 2 URLs of each time window in real time.
 */
public class TopNExample_ProcessWindowFunction1 {
    public static void main(String[] args) throws Exception {
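        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use parallelism 1 so the test output is easy to follow
        env.setParallelism(1);
        // Read the source stream and assign event-time timestamps and watermarks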
        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
                                .withTimestampAssigner(((event, l) -> event.timestamp))
                );

        stream.print("input");
        // Send all URLs to a single key, window the resulting KeyedStream, and implement ProcessWindowFunction via .process()
        SingleOutputStreamOperator<String> process = stream.keyBy(value -> true)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<Event, String, Boolean, TimeWindow>() {
                    @Override
                    public void process(Boolean aBoolean, ProcessWindowFunction<Event, String, Boolean, TimeWindow>.Context context, Iterable<Event> iterable, Collector<String> collector) throws Exception {
                        // Count visits per URL in a HashMap: each occurrence of a URL increments its count by 1
                        HashMap<String, Long> hashMap = new HashMap<>();
                        for (Event event : iterable) {
                            Long count = hashMap.getOrDefault(event.url, 0L);
                            hashMap.put(event.url, count + 1);
                        }

                        // Copy the HashMap entries into a list so they can be sorted
                        ArrayList<Tuple2<String, Long>> urlList = new ArrayList<>();
                        for (String url : hashMap.keySet()) {
                            urlList.add(Tuple2.of(url, hashMap.get(url)));
                        }
                        // Sort by visit count in descending order
                        urlList.sort(new Comparator<Tuple2<String, Long>>() {
                            @Override
                            public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {

                                return Long.compare(o2.f1, o1.f1);
                            }
                        });

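                        // For simplicity, format the result as a single String for printing
                        StringBuilder buffer = new StringBuilder();
                        buffer.append("===================\n");
                        // Get the window's start and end timestamps from the context ("窗口" means "window")
                        buffer.append("窗口[" + context.window().getStart() + "~" + context.window().getEnd() + "]\n");
                        // Iterate over the first two list entries (indices 0 and 1) to emit the top 2;
                        // "的访问量" in the output means "visit count"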
                        for (int i = 0; i < Math.min(2, urlList.size()); i++) {
                            Tuple2<String, Long> tuple2 = urlList.get(i);
                            buffer.append("NO." + (i + 1) + ":url: " + tuple2.f0 + " 的访问量: " + tuple2.f1 + "\n");
                        }
                        collector.collect(buffer.toString());
                    }
                });
        // Print the results
        process.print("processed");
        // Trigger job execution
        env.execute();
    }
}

Output:

input> Event{user='alex', url='favor', timestamp=1664361379435}
input> Event{user='diff', url='page1', timestamp=1664361380449}
processed> ===================
窗口[1664361370000~1664361380000]
NO.1:url: favor 的访问量: 1

input> Event{user='bibby', url='order', timestamp=1664361381459}
input> Event{user='bibby', url='order', timestamp=1664361382465}
input> Event{user='diff', url='display1', timestamp=1664361383474}
input> Event{user='bibby', url='order', timestamp=1664361384487}
input> Event{user='diff', url='/home', timestamp=1664361385502}
input> Event{user='mary', url='display1', timestamp=1664361386511}
input> Event{user='diff', url='display1', timestamp=1664361387517}
input> Event{user='diff', url='order', timestamp=1664361388526}
input> Event{user='diff', url='favor', timestamp=1664361389538}
input> Event{user='tom', url='/home', timestamp=1664361390545}
processed> ===================
窗口[1664361380000~1664361390000]
NO.1:url: order 的访问量: 4
NO.2:url: display1 的访问量: 3
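
The counting and sorting done inside process() can be checked without starting a Flink job. The sketch below is plain Java with hypothetical names (TopNSketch, topN), not the author's code; it replays the URLs of the ten sample events whose timestamps fall in the window [1664361380000, 1664361390000) and applies the same count-then-sort logic. Ties between URLs with equal counts come out in an unspecified order, as they do with the HashMap used in the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopNSketch {

    /** Counts occurrences of each URL and returns the n most frequent entries, highest count first. */
    static List<Map.Entry<String, Long>> topN(List<String> urls, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum); // add 1 to the URL's running count
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Descending by count; the order among equal counts is unspecified.
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // URLs of the sample events that fall into the second window of the output above.
        List<String> window = Arrays.asList(
                "page1", "order", "order", "display1", "order",
                "/home", "display1", "display1", "order", "favor");
        for (Map.Entry<String, Long> e : topN(window, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints:
        // order -> 4
        // display1 -> 3
    }
}
```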

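This implementation does not partition the stream by URL. Instead, it sends every URL to a single window and processes them all together, which effectively forces the window operator to run with a parallelism of 1. This should be avoided wherever possible in real applications, and Flink itself likewise does not recommend processing data with an AllWindowedStream.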
