Learning Flink from Scratch: Real-Time Top 5 Best-Sellers (Case Study)


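As promised in the previous article, today we look at how to compute a real-time Top 5 best-seller list with Flink. This case study builds on material from that article, so if you have not read it yet, please start with "Learning Flink from Scratch: Joining Two Streams" (《零基础学Flink:Join两个流》).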


The code for this case study is available at https://github.com/dafei1288/flink_casestudy


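In the previous article we joined two streams, combining the exchange rate with the order price to compute the final transaction price. Each record of the joined stream has the following structure:


Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>

f0: order timestamp (Long)
f1: product category (String)
f2: product sub-item (Integer)
f3: order currency (String)
f4: price (Integer)
f5: exchange-rate timestamp (Long)
f6: exchange-rate currency (String)
f7: exchange rate (Integer)
f8: transaction price (Integer)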

We start from this joined stream. The first step is to redefine its event time: we use the order timestamp (field f0) as the event time.

DataStream<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>> joinedTimedStream =
        joinedStream.assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> element) {
                        return element.f0; // f0 is the order timestamp
                    }
                });
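To make the relationship between ascending timestamps and watermarks more concrete, here is a minimal plain-Java sketch with no Flink dependency. It models the behaviour as I understand it: the watermark trails the largest timestamp seen so far by one millisecond. The class `AscendingWatermarkSketch` and its methods are hypothetical names used for illustration only, not Flink API.

```java
class AscendingWatermarkSketch {
    private long currentTimestamp = Long.MIN_VALUE;

    // Records an element's timestamp; returns false when it breaks the ascending order.
    boolean onElement(long timestamp) {
        if (timestamp >= currentTimestamp) {
            currentTimestamp = timestamp;
            return true;
        }
        return false;
    }

    // The watermark sits one millisecond behind the largest timestamp seen so far.
    long currentWatermark() {
        return currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1;
    }

    public static void main(String[] args) {
        AscendingWatermarkSketch sketch = new AscendingWatermarkSketch();
        sketch.onElement(1_000L);
        sketch.onElement(2_000L);
        System.out.println(sketch.onElement(1_500L));   // false: out of order
        System.out.println(sketch.currentWatermark());  // 1999
    }
}
```

If the order stream cannot be guaranteed to arrive in timestamp order, a bounded-out-of-orderness extractor, as used for the rate stream in the full listing, is probably the safer choice.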


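In this case study we do not identify a product by an explicit product ID. Instead there are two fields, the product category and the product sub-item, and we group by the string formed by concatenating them. This design also gives us a chance to see how KeySelector is used.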


DataStream<OrderView> windowedData = joinedTimedStream
        .keyBy(new KeySelector<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>, String>() {
            @Override
            public String getKey(Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> value) throws Exception {
                return value.f1 + value.f2; // category + sub-item
            }
        })
        .timeWindow(Time.seconds(30), Time.seconds(10)) // 30-second window, sliding every 10 seconds
        .aggregate(new SumAgg(), new WindowResultFunction());
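One thing to keep in mind with a concatenated key: joining the two parts directly can map two different (category, sub-item) pairs to the same string. Whether that can actually happen depends on the data. The minimal sketch below shows the collision and one possible way around it; `naiveKey` and `safeKey` are hypothetical helper names, and the `|` separator assumes that category names never contain that character.

```java
class CompositeKeySketch {
    // Mirrors the key used in the article: category immediately followed by sub-item.
    static String naiveKey(String category, int subItem) {
        return category + subItem;
    }

    // Hypothetical variant: a separator keeps the two parts apart.
    static String safeKey(String category, int subItem) {
        return category + "|" + subItem;
    }

    public static void main(String[] args) {
        // ("A1", 2) and ("A", 12) both become "A12" without a separator.
        System.out.println(naiveKey("A1", 2).equals(naiveKey("A", 12))); // true
        System.out.println(safeKey("A1", 2).equals(safeKey("A", 12)));   // false
    }
}
```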


Here we show this figure once more to reinforce how the stream types are converted into one another.

[Figure: stream transformation diagram]

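We also need to aggregate the data at this point. We do not rank best-sellers by order count; instead we rank them by the sum of the final transaction price.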
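The `timeWindow(Time.seconds(30), Time.seconds(10))` call above defines a 30-second window that slides every 10 seconds, so each order contributes to three overlapping windows. The plain-Java sketch below illustrates how a timestamp maps to its windows. It assumes non-negative timestamps and a zero window offset, and `windowsFor` is a hypothetical helper rather than a Flink method.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowSketch {
    // Returns {start, end} pairs (end exclusive) of every sliding window containing the timestamp.
    static List<long[]> windowsFor(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide); // latest window start <= timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        // An order at t = 25 s with a 30 s window sliding every 10 s
        for (long[] w : windowsFor(25_000L, 30_000L, 10_000L)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
        // prints:
        // [20000, 50000)
        // [10000, 40000)
        // [0, 30000)
    }
}
```

Because the windows overlap, the same order is counted in three consecutive Top 5 results, one per window end.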


Next we define how the aggregation is computed. Here it is a simple sum.


public static class SumAgg implements AggregateFunction<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>, Long, Long> {

    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> value, Long acc) {
        return acc + value.f8; // f8 is the transaction price
    }

    @Override
    public Long getResult(Long acc) {
        return acc;
    }

    @Override
    public Long merge(Long acc1, Long acc2) {
        return acc1 + acc2;
    }
}
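To see how the four methods cooperate, the sketch below replays the same summing logic without Flink. `SimpleAgg` is a hypothetical stand-in for the shape of Flink's AggregateFunction contract, and `fold` is an illustrative helper; the prices are made-up sample values.

```java
import java.util.Arrays;
import java.util.List;

class SumAggSketch {
    // Hypothetical stand-in for the four-method aggregate contract.
    interface SimpleAgg<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC acc);
        OUT getResult(ACC acc);
        ACC merge(ACC a, ACC b);
    }

    static final SimpleAgg<Integer, Long, Long> SUM = new SimpleAgg<Integer, Long, Long>() {
        public Long createAccumulator() { return 0L; }
        public Long add(Integer price, Long acc) { return acc + price; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    };

    // Folds a batch of prices into one accumulator, one add() call per element.
    static long fold(List<Integer> prices) {
        Long acc = SUM.createAccumulator();
        for (Integer price : prices) {
            acc = SUM.add(price, acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        long left = fold(Arrays.asList(100, 250)); // partial accumulator: 350
        long right = fold(Arrays.asList(50));      // partial accumulator: 50
        System.out.println(SUM.getResult(SUM.merge(left, right))); // 400
    }
}
```

The point of the incremental style is that only a single Long is kept per key and window, rather than every order that fell into the window.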


Then we define how the window result is emitted.


/** Emits the result of a window. */
// Type parameters: IN, OUT, KEY, W extends Window
public static class WindowResultFunction implements WindowFunction<Long, OrderView, String, TimeWindow> {

    @Override
    public void apply(
            String key,                     // key of the window (category + sub-item)
            TimeWindow window,              // the window
            Iterable<Long> aggregateResult, // result of the aggregate function
            Collector<OrderView> collector  // output type is OrderView
    ) throws Exception {
        Long sum = aggregateResult.iterator().next();
        collector.collect(OrderView.of(key, window.getEnd(), sum));
    }

}


public static class OrderView {
    public String itemId;  // item ID (category + sub-item)
    public long windowEnd; // window end timestamp
    public long allsum;    // total sales amount of the item in the window

    public static OrderView of(String itemId, long windowEnd, long allsum) {
        OrderView result = new OrderView();
        result.itemId = itemId;
        result.windowEnd = windowEnd;
        result.allsum = allsum;
        return result;
    }

    @Override
    public String toString() {
        return "OrderView{" +
                "itemId='" + itemId + '\'' +
                ", windowEnd=" + windowEnd +
                ", allsum=" + allsum +
                '}';
    }
}


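After these steps we have per-item totals aggregated over time windows. All that remains is to group them by window end time and take the top five. We use a ProcessFunction (here its keyed variant, KeyedProcessFunction) to implement a custom TopN function, TopNHot, which computes the five best-selling items and formats the ranking as a string for output.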


DataStream<String> topNHots = windowedData
        .keyBy("windowEnd")
        .process(new TopNHot(5));


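ProcessFunction is a low-level API provided by Flink for implementing more advanced functionality. Its main feature is timers, which support both event time and processing time. In this case study we use a timer to determine when the sales totals of all items for a given window have arrived. Because watermark progress is global, in processElement we register a timer for windowEnd+1 each time an OrderView arrives (Flink automatically ignores repeated registrations for the same key and timestamp). When the windowEnd+1 timer fires, the watermark has reached windowEnd+1, which means the window results of every item for that windowEnd have been collected. In onTimer() we sort the collected items by sales total, pick the Top N, format the ranking as a string, and emit it.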
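The timer mechanism described above can be imitated in a few lines of plain Java. This is only a model under simplifying assumptions (single thread, no Flink, one buffer per window end); `TimerSketch`, `processElement`, and `advanceWatermark` are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class TimerSketch {
    private final Map<Long, List<String>> buffered = new HashMap<>(); // windowEnd -> items received
    private final TreeSet<Long> timers = new TreeSet<>();             // a set, so duplicate timers collapse
    final List<String> fired = new ArrayList<>();

    // Buffer the item and register a timer one millisecond after the window end.
    void processElement(String item, long windowEnd) {
        buffered.computeIfAbsent(windowEnd, k -> new ArrayList<>()).add(item);
        timers.add(windowEnd + 1);
    }

    // Fire every timer whose timestamp the watermark has reached.
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            long timerTimestamp = timers.pollFirst();
            List<String> items = buffered.remove(timerTimestamp - 1);
            fired.add((timerTimestamp - 1) + " -> " + items);
        }
    }

    public static void main(String[] args) {
        TimerSketch sketch = new TimerSketch();
        sketch.processElement("A1", 30_000L);
        sketch.processElement("B2", 30_000L); // same timer timestamp, registered only once
        sketch.advanceWatermark(30_000L);     // nothing fires yet
        sketch.advanceWatermark(30_001L);     // the timer for window end 30000 fires
        System.out.println(sketch.fired);     // [30000 -> [A1, B2]]
    }
}
```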

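We also use ListState<OrderView> to store every OrderView received, so that the state is neither lost nor left inconsistent when a failure occurs. ListState is a state API provided by Flink with an interface similar to Java's List. It is integrated with the framework's checkpoint mechanism, which gives the state exactly-once guarantees automatically.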


public static class TopNHot extends KeyedProcessFunction<Tuple, OrderView, String> {

    private final int topSize;

    public TopNHot(int topSize) {
        this.topSize = topSize;
    }

    // State holding each item's sales total; the Top N is computed once all results of a window have arrived
    private ListState<OrderView> orderState;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // Register the state
        ListStateDescriptor<OrderView> itemsStateDesc = new ListStateDescriptor<>(
                "orderState-state",
                OrderView.class);
        orderState = getRuntimeContext().getListState(itemsStateDesc);
    }

    @Override
    public void processElement(
            OrderView input,
            Context context,
            Collector<String> collector) throws Exception {
        // Store every record in state
        orderState.add(input);
        // Register an event-time timer for windowEnd+1; when it fires, all item results for this window have arrived
        context.timerService().registerEventTimeTimer(input.windowEnd + 1);
    }

    @Override
    public void onTimer(
            long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Collect the sales totals of all items received for this window
        List<OrderView> allItems = new ArrayList<>();
        orderState.get().forEach(allItems::add);
        // Clear the state early to free space
        orderState.clear();
        // Sort by sales total, largest first; Long.compare avoids the overflow of casting a long difference to int
        allItems.sort((x1, x2) -> Long.compare(x2.allsum, x1.allsum));
        // Format the ranking as a String for printing
        StringBuilder result = new StringBuilder();
        result.append("====================================\n");
        result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n");
        for (int i = 0; i < topSize && i < allItems.size(); i++) {
            OrderView currentItem = allItems.get(i);
            // e.g. No1: item ID=12224 sales=2413
            result.append("No").append(i + 1).append(":")
                    .append(" item ID=").append(currentItem.itemId)
                    .append(" sales=").append(currentItem.allsum)
                    .append("\n");
        }
        result.append("====================================\n\n");
        out.collect(result.toString());
    }
}
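The ranking step is easy to get subtly wrong when the totals are `long` values: a comparator that subtracts two totals and casts the difference to `int` can flip its sign once the difference no longer fits in 32 bits. The sketch below is a standalone illustration with made-up numbers; `topN` is a hypothetical helper, not part of the case-study code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TopNSketch {
    // Returns the n largest values in descending order.
    static List<Long> topN(List<Long> sums, int n) {
        List<Long> sorted = new ArrayList<>(sums);
        sorted.sort(Comparator.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        long big = 3_000_000_000L;
        long small = 1L;
        // The cast truncates to 32 bits, so the result is negative although big > small.
        System.out.println((int) (big - small) < 0);           // true
        // Long.compare reports the ordering without overflow.
        System.out.println(Long.compare(big, small) > 0);      // true

        System.out.println(topN(Arrays.asList(5L, 40L, 12L, 7L), 2)); // [40, 12]
    }
}
```

For a handful of items per window a full sort is fine. With many items, a bounded priority queue of size N would avoid sorting the whole list, at the cost of slightly more code.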


Here is the complete code:


package cn.flinkhub.topndemo;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.*;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.util.Collector;
import java.io.IOException;
import java.sql.Timestamp;
import java.util.*;
public class App {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka consumer settings, plus the names of the two topics
        Map<String, String> properties = new HashMap<>();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("group.id", "test");
        properties.put("enable.auto.commit", "true");
        properties.put("auto.commit.interval.ms", "1000");
        properties.put("auto.offset.reset", "earliest");
        properties.put("session.timeout.ms", "30000");
        // properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("topicOrder", "order");
        properties.put("topicRate", "rate");
        ParameterTool parameterTool = ParameterTool.fromMap(properties);

        // Exchange-rate source; each message is "timestamp,currency,rate"
        FlinkKafkaConsumer010<Tuple3<Long,String,Integer>> consumer010Rate = new FlinkKafkaConsumer010<>(
                parameterTool.getRequired("topicRate"),
                new DeserializationSchema<Tuple3<Long,String,Integer>>() {
                    @Override
                    public TypeInformation<Tuple3<Long,String,Integer>> getProducedType() {
                        return TypeInformation.of(new TypeHint<Tuple3<Long,String,Integer>>(){});
                    }

                    @Override
                    public Tuple3<Long,String,Integer> deserialize(byte[] message) throws IOException {
                        String[] res = new String(message).split(",");
                        Long timestamp = Long.valueOf(res[0]);
                        String dm = res[1];
                        Integer value = Integer.valueOf(res[2]);
                        return Tuple3.of(timestamp, dm, value);
                    }

                    @Override
                    public boolean isEndOfStream(Tuple3<Long,String,Integer> nextElement) {
                        return false;
                    }
                },
                parameterTool.getProperties());

        // Order source; each message is "timestamp,category,subItem,currency,price" (%d,%s,%d,%s,%d)
        FlinkKafkaConsumer010<Tuple5<Long,String,Integer,String,Integer>> consumer010Order = new FlinkKafkaConsumer010<>(
                parameterTool.getRequired("topicOrder"),
                new DeserializationSchema<Tuple5<Long,String,Integer,String,Integer>>() {
                    @Override
                    public TypeInformation<Tuple5<Long,String,Integer,String,Integer>> getProducedType() {
                        return TypeInformation.of(new TypeHint<Tuple5<Long,String,Integer,String,Integer>>(){});
                    }

                    @Override
                    public Tuple5<Long,String,Integer,String,Integer> deserialize(byte[] message) throws IOException {
                        String[] res = new String(message).split(",");
                        Long timestamp = Long.valueOf(res[0]);
                        String catlog = res[1];
                        Integer subcat = Integer.valueOf(res[2]);
                        String dm = res[3];
                        Integer value = Integer.valueOf(res[4]);
                        return Tuple5.of(timestamp, catlog, subcat, dm, value);
                    }

                    @Override
                    public boolean isEndOfStream(Tuple5<Long,String,Integer,String,Integer> nextElement) {
                        return false;
                    }
                },
                parameterTool.getProperties());

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        DataStream<Tuple3<Long,String,Integer>> rateStream = env.addSource(consumer010Rate);
        DataStream<Tuple5<Long,String,Integer,String,Integer>> orderStream = env.addSource(consumer010Order);

        // Exchange rates may arrive up to 1 second out of order
        long delay = 1000;
        DataStream<Tuple3<Long,String,Integer>> rateTimedStream = rateStream.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple3<Long,String,Integer>>(Time.milliseconds(delay)) {
                    @Override
                    public long extractTimestamp(Tuple3<Long,String,Integer> element) {
                        return element.f0;
                    }
                });

        // Orders are assumed to arrive with ascending timestamps
        DataStream<Tuple5<Long,String,Integer,String,Integer>> orderTimedStream = orderStream.assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<Tuple5<Long,String,Integer,String,Integer>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple5<Long,String,Integer,String,Integer> value) {
                        return value.f0;
                    }
                });

        // Join orders with exchange rates on the currency code, in 10-second tumbling windows
        DataStream<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>> joinedStream = orderTimedStream
                .join(rateTimedStream)
                .where(new KeySelector<Tuple5<Long,String,Integer,String,Integer>, String>() {
                    @Override
                    public String getKey(Tuple5<Long,String,Integer,String,Integer> value) throws Exception {
                        return value.f3;
                    }
                })
                .equalTo(new KeySelector<Tuple3<Long,String,Integer>, String>() {
                    @Override
                    public String getKey(Tuple3<Long,String,Integer> value) throws Exception {
                        return value.f1;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .apply(new JoinFunction<Tuple5<Long,String,Integer,String,Integer>, Tuple3<Long,String,Integer>, Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>>() {
                    @Override
                    public Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> join(Tuple5<Long,String,Integer,String,Integer> first, Tuple3<Long,String,Integer> second) throws Exception {
                        // transaction price = exchange rate * order price
                        Integer res = second.f2 * first.f4;
                        return Tuple9.of(first.f0, first.f1, first.f2, first.f3, first.f4, second.f0, second.f1, second.f2, res);
                    }
                });

        // Use the order timestamp (f0) as the event time of the joined stream
        DataStream<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>> joinedTimedStream = joinedStream.assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> element) {
                        return element.f0;
                    }
                });

        // Sum the transaction price per item over a 30-second window sliding every 10 seconds
        DataStream<OrderView> windowedData = joinedTimedStream
                .keyBy(new KeySelector<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>, String>() {
                    @Override
                    public String getKey(Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> value) throws Exception {
                        return value.f1 + value.f2;
                    }
                })
                .timeWindow(Time.seconds(30), Time.seconds(10))
                .aggregate(new SumAgg(), new WindowResultFunction());

        // Group the per-item totals by window end and rank them
        DataStream<String> topNHots = windowedData
                .keyBy("windowEnd")
                .process(new TopNHot(5));
        topNHots.print();
        env.execute("done!");
    }

    public static class SumAgg implements AggregateFunction<Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer>, Long, Long> {

        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(Tuple9<Long,String,Integer,String,Integer,Long,String,Integer,Integer> value, Long acc) {
            return acc + value.f8;
        }

        @Override
        public Long getResult(Long acc) {
            return acc;
        }

        @Override
        public Long merge(Long acc1, Long acc2) {
            return acc1 + acc2;
        }
    }

    /** Emits the result of a window. */
    // Type parameters: IN, OUT, KEY, W extends Window
    public static class WindowResultFunction implements WindowFunction<Long, OrderView, String, TimeWindow> {

        @Override
        public void apply(
                String key,                     // key of the window (category + sub-item)
                TimeWindow window,              // the window
                Iterable<Long> aggregateResult, // result of the aggregate function
                Collector<OrderView> collector  // output type is OrderView
        ) throws Exception {
            Long sum = aggregateResult.iterator().next();
            collector.collect(OrderView.of(key, window.getEnd(), sum));
        }
    }


    public static class OrderView {
        public String itemId;  // item ID (category + sub-item)
        public long windowEnd; // window end timestamp
        public long allsum;    // total sales amount of the item in the window

        public static OrderView of(String itemId, long windowEnd, long allsum) {
            OrderView result = new OrderView();
            result.itemId = itemId;
            result.windowEnd = windowEnd;
            result.allsum = allsum;
            return result;
        }

        @Override
        public String toString() {
            return "OrderView{" +
                    "itemId='" + itemId + '\'' +
                    ", windowEnd=" + windowEnd +
                    ", allsum=" + allsum +
                    '}';
        }
    }

    public static class TopNHot extends KeyedProcessFunction<Tuple, OrderView, String> {

        private final int topSize;

        public TopNHot(int topSize) {
            this.topSize = topSize;
        }

        // State holding each item's sales total; the Top N is computed once all results of a window have arrived
        private ListState<OrderView> orderState;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            // Register the state
            ListStateDescriptor<OrderView> itemsStateDesc = new ListStateDescriptor<>(
                    "orderState-state",
                    OrderView.class);
            orderState = getRuntimeContext().getListState(itemsStateDesc);
        }

        @Override
        public void processElement(
                OrderView input,
                Context context,
                Collector<String> collector) throws Exception {
            // Store every record in state
            orderState.add(input);
            // Register an event-time timer for windowEnd+1; when it fires, all item results for this window have arrived
            context.timerService().registerEventTimeTimer(input.windowEnd + 1);
        }

        @Override
        public void onTimer(
                long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // Collect the sales totals of all items received for this window
            List<OrderView> allItems = new ArrayList<>();
            orderState.get().forEach(allItems::add);
            // Clear the state early to free space
            orderState.clear();
            // Sort by sales total, largest first; Long.compare avoids the overflow of casting a long difference to int
            allItems.sort((x1, x2) -> Long.compare(x2.allsum, x1.allsum));
            // Format the ranking as a String for printing
            StringBuilder result = new StringBuilder();
            result.append("====================================\n");
            result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n");
            for (int i = 0; i < topSize && i < allItems.size(); i++) {
                OrderView currentItem = allItems.get(i);
                // e.g. No1: item ID=12224 sales=2413
                result.append("No").append(i + 1).append(":")
                        .append(" item ID=").append(currentItem.itemId)
                        .append(" sales=").append(currentItem.allsum)
                        .append("\n");
            }
            result.append("====================================\n\n");
            out.collect(result.toString());
        }
    }
}


Now let's take a look at the result.


[Figure: screenshot of the program output]



Reference:

http://wuchong.me/blog/2018/11/07/use-flink-calculate-hot-items/

