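Window computation is one of the most common processing patterns in real Flink applications. For example, suppose we need to find the three most popular URLs over the last 10 seconds and refresh the result every 5 seconds.
There are several ways to solve this Top N problem for popular pages. The simplest idea is to collect the URL access data with a single sliding window, without grouping by URL, and to use each page's visit count as its measure of popularity.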
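To make the "10-second window, updated every 5 seconds" requirement concrete, the following plain-Java sketch lists which sliding windows a single event timestamp falls into. It mirrors the usual sliding-window assignment arithmetic under the assumption of a zero window offset and non-negative timestamps; the class SlidingWindowSketch and the method windowsFor are hypothetical names used only for illustration and are not part of Flink. The timestamp is taken from the sample output later in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: lists the [start, end) sliding windows
// that contain a given event timestamp (all values in milliseconds).
class SlidingWindowSketch {

    static List<long[]> windowsFor(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        // Start of the latest window that still contains the timestamp
        // (assumes a non-negative timestamp and a zero window offset).
        long lastStart = timestamp - (timestamp % slide);
        // Step back one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        // A 10-second window that slides every 5 seconds.
        for (long[] w : windowsFor(1664361379435L, 10_000L, 5_000L)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
        // prints:
        // [1664361375000, 1664361385000)
        // [1664361370000, 1664361380000)
    }
}
```

With size 10 s and slide 5 s, every event belongs to two overlapping windows, which is why a result can be emitted every 5 seconds while still covering the last 10 seconds.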
First, prepare the simulated data: define a ClickSource class that generates mock user click events:
package com.atguigu.bean;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Calendar;
import java.util.Random;

// Mock source that emits one random click event per second
public class ClickSource implements SourceFunction<Event> {
    // Per-instance and volatile, so that cancel() called from another thread reliably stops run()
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        Random random = new Random();
        String[] users = {"tom", "mary", "bibby", "diff", "alex"};
        String[] urls = {"/home", "page1", "order", "favor", "display1"};
        while (running) {
            // Pick a random user and a random URL
            String user = users[random.nextInt(users.length)];
            String url = urls[random.nextInt(urls.length)];
            // Use the current wall-clock time as the event timestamp
            long timestamp = Calendar.getInstance().getTimeInMillis();
            Event event = new Event(user, url, timestamp);
            sourceContext.collect(event);
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
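ClickSource depends on an Event class that is not shown in this article. The following is a minimal sketch of what it might look like, inferred only from how it is used here: the three-argument constructor, the public url and timestamp fields, and the toString format visible in the sample output. The field types and the no-argument constructor are assumptions.

```java
// Assumed shape of the Event class; in the article's project it would live in
// package com.atguigu.bean next to ClickSource.
public class Event {
    // Public fields plus a public no-argument constructor, following Flink's POJO conventions
    public String user;
    public String url;
    public long timestamp;

    public Event() {
    }

    public Event(String user, String url, long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    // Matches the format seen in the sample output, e.g. Event{user='alex', url='favor', timestamp=...}
    @Override
    public String toString() {
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + timestamp + "}";
    }
}
```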
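Next, open a window on the stream and process each window's full contents with a full-window function. The listing below does not call windowAll with a ProcessAllWindowFunction; instead it sends every event to the same key with keyBy(value -> true) and applies a ProcessWindowFunction to the keyed stream, which has the same effect. Note that it uses a 10-second tumbling window and reports the top 2 URLs: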
package com.atguigu.ProcessFunction;
import com.atguigu.bean.ClickSource;
import com.atguigu.bean.Event;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
/**
 * Computes the top 2 URLs of each time window in real time
 */
public class TopNExample_ProcessWindowFunction1 {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1 to make testing easier
        env.setParallelism(1);
        // Read the source stream and assign event-time timestamps and watermarks (zero out-of-orderness)
        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
                                .withTimestampAssigner((event, l) -> event.timestamp)
                );
        stream.print("input");
        // Send all URLs to a single key, open a window on the keyed stream,
        // and implement the abstract class ProcessWindowFunction in .process()
        SingleOutputStreamOperator<String> process = stream.keyBy(value -> true)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<Event, String, Boolean, TimeWindow>() {
                    @Override
                    public void process(Boolean aBoolean, Context context,
                                        Iterable<Event> iterable, Collector<String> collector) throws Exception {
                        // Count the visits per URL: a HashMap holds url -> count, incremented for each event
                        HashMap<String, Long> hashMap = new HashMap<>();
                        for (Event event : iterable) {
                            Long count = hashMap.getOrDefault(event.url, 0L);
                            hashMap.put(event.url, count + 1);
                        }
                        // Copy the HashMap entries into a list so that they can be sorted
                        ArrayList<Tuple2<String, Long>> urlList = new ArrayList<>();
                        for (String url : hashMap.keySet()) {
                            urlList.add(Tuple2.of(url, hashMap.get(url)));
                        }
                        // Sort by count in descending order
                        urlList.sort(new Comparator<Tuple2<String, Long>>() {
                            @Override
                            public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
                                return Long.compare(o2.f1, o1.f1);
                            }
                        });
                        // For convenience, format the result as a String for printing
                        // (the labels "窗口" and "的访问量" mean "window" and "visit count")
                        StringBuilder buffer = new StringBuilder();
                        buffer.append("===================\n");
                        // Get the window start and end times from the context
                        buffer.append("窗口[" + context.window().getStart() + "~" + context.window().getEnd() + "]\n");
                        // Take at most the first two entries of the sorted list: the top 2
                        for (int i = 0; i < Math.min(2, urlList.size()); i++) {
                            Tuple2<String, Long> tuple2 = urlList.get(i);
                            buffer.append("NO." + (i + 1) + ":url: " + tuple2.f0 + " 的访问量: " + tuple2.f1 + "\n");
                        }
                        collector.collect(buffer.toString());
                    }
                });
        // Print the result
        process.print("processed");
        // Run the job
        env.execute();
    }
}
Output (the labels 窗口 and 的访问量 mean "window" and "visit count"):
input> Event{user='alex', url='favor', timestamp=1664361379435}
input> Event{user='diff', url='page1', timestamp=1664361380449}
processed> ===================
窗口[1664361370000~1664361380000]
NO.1:url: favor 的访问量: 1
input> Event{user='bibby', url='order', timestamp=1664361381459}
input> Event{user='bibby', url='order', timestamp=1664361382465}
input> Event{user='diff', url='display1', timestamp=1664361383474}
input> Event{user='bibby', url='order', timestamp=1664361384487}
input> Event{user='diff', url='/home', timestamp=1664361385502}
input> Event{user='mary', url='display1', timestamp=1664361386511}
input> Event{user='diff', url='display1', timestamp=1664361387517}
input> Event{user='diff', url='order', timestamp=1664361388526}
input> Event{user='diff', url='favor', timestamp=1664361389538}
input> Event{user='tom', url='/home', timestamp=1664361390545}
processed> ===================
窗口[1664361380000~1664361390000]
NO.1:url: order 的访问量: 4
NO.2:url: display1 的访问量: 3
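The Flink job above does not partition the stream by URL. Instead, keyBy(value -> true) sends every event to the same key, so all URLs are processed together in a single window. This is equivalent to forcing the window operator's parallelism to 1, whatever parallelism is configured for the job. That should be avoided as far as possible in real applications, and Flink itself does not recommend processing data through an AllWindowedStream.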
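As a sanity check, the count-and-sort step can be reproduced outside Flink. The sketch below is plain Java with hypothetical names (TopNSketch, topN); it is fed the URLs of the ten events from the sample output whose timestamps fall in the window [1664361380000~1664361390000]. It uses a lambda and Map.merge rather than the anonymous Comparator of the listing, which is a stylistic substitution rather than the article's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical standalone check, independent of Flink.
class TopNSketch {

    // Counts each URL and returns the n most frequent entries, highest count first.
    static List<Map.Entry<String, Long>> topN(List<String> urls, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        // URLs of the events with timestamps from 1664361380449 to 1664361389538.
        List<String> urls = Arrays.asList("page1", "order", "order", "display1", "order",
                "/home", "display1", "display1", "order", "favor");
        for (Map.Entry<String, Long> e : topN(urls, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints:
        // order -> 4
        // display1 -> 3
    }
}
```

The result agrees with the second window of the sample output. Because this step needs every event of the window in one place, a single-key window like this one cannot be spread across parallel subtasks.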