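I. Requirement Overview
1. Background:
Anyone reading this will probably think that ranking daily downloads sounds like a T+1 offline batch job, and I thought so too. So why write about it? The requirement is an old one: the counts were already computed in real time, but the ranking was done later in an API layer that read the data back from the database. Sorting in the API on every read turned out to be inefficient, so the goal now is to write the sorted result straight into the database. That is also why this job is written in Java, even though I normally write Spark and Flink in Scala: it lives in an existing project, so it has to follow that project's conventions. As for why I turned it into an article: the job uses Flink state programming, watermarks, time semantics, state backends, and timers, and I could not find much material on this when I started, so it seemed worth recording.
2. Requirement:
At first I understood the requirement as a real-time ranking of every game's downloads from midnight today up to the moment of computation, and implemented it that way. Only after finishing did I learn that only the single most-downloaded game is needed. The two are close, which is why part of the code is commented out.
3. Implementation approach:
The total number of games is small, so one piece of state holds the download count of every game, which makes it easy to pick the game with the most downloads and write it to the database. All records of the same day share the same state. That raises a problem: once the day is over its state is no longer needed, and with one state per day they would pile up indefinitely. So I register a timer that fires two days later and clears that day's state, which requires a second state to hold the timer's timestamp. A third state is used for deduplication.
One detail: if a record arrives more than two days late, the state for its day has already been cleared, and processing it would corrupt the result. So each record's time is compared with the watermark, and the record is dropped if its time plus two days is not later than the watermark.
That is the general idea. The requirement is the top download count from midnight until now, not per hour or per fixed interval. For something like the top downloads of every five minutes, a Window would be far more convenient and would not need this much state handling. A Window could still be added here to emit a result every few minutes; I did not add one, and the outcome is much the same. The output goes to MongoDB; the sink code is not shown here.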
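The late-data rule described above can be sketched in plain Java, without any Flink dependency. The class name `LateRecordCheck`, the constant `RETENTION_MS`, and the method `isTooLate` are made up for this illustration; the two-day retention is the value given in the text.

```java
class LateRecordCheck {
    // Two days in milliseconds (hypothetical constant name for this sketch).
    static final long RETENTION_MS = 2L * 24 * 60 * 60 * 1000;

    // A record is too late when its event time plus the retention period
    // is not later than the current watermark.
    static boolean isTooLate(long eventTimeMs, long watermarkMs) {
        return eventTimeMs + RETENTION_MS <= watermarkMs;
    }

    public static void main(String[] args) {
        long day0 = 1576684800L * 1000L;            // sample timestamp from the article, in ms
        long day1 = day0 + 24L * 60 * 60 * 1000;    // one day later
        long day2 = day0 + RETENTION_MS;            // two days later
        System.out.println(isTooLate(day0, day1));  // false: still inside the retention period
        System.out.println(isTooLate(day0, day2));  // true: state for day0 may already be cleared
    }
}
```

Using one constant for both the cleanup timer and this check keeps the two from drifting apart, so a record is never accepted after its day's state has been cleared.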
II. Implementation
pom.xml:
<!-- Apache Flink dependencies -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>3.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-hadoop-compatibility_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.mongodb.mongo-hadoop</groupId>
<artifactId>mongo-hadoop-core</artifactId>
<version>2.0.0</version>
</dependency>
Data format:
game_id timestamp(seconds) device_id version channel
id1 1576684800 imei1 version1 channel1
Each record is one download. The record above means that device imei1 downloaded game id1 at timestamp 1576684800.
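As a small illustration of the format, here is a standalone sketch that parses one record. `DownloadLineParser` and `Download` are hypothetical names, not classes from the project. The job below splits on `\\W+` and parses the timestamp as an `int`; this sketch splits on whitespace and uses a `long` instead, on the assumption that version strings may contain dots.

```java
class DownloadLineParser {
    // Plain holder for the five fields of one record (hypothetical type).
    static final class Download {
        final String gameId;
        final long tsSeconds;
        final String imei;
        final String version;
        final String channel;

        Download(String gameId, long tsSeconds, String imei, String version, String channel) {
            this.gameId = gameId;
            this.tsSeconds = tsSeconds;
            this.imei = imei;
            this.version = version;
            this.channel = channel;
        }
    }

    // Split on whitespace and reject records that do not have exactly five fields.
    static Download parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + f.length);
        }
        return new Download(f[0], Long.parseLong(f[1]), f[2], f[3], f[4]);
    }

    public static void main(String[] args) {
        Download d = parse("id1 1576684800 imei1 version1 channel1");
        System.out.println(d.gameId + " downloaded by " + d.imei + " at " + d.tsSeconds);
    }
}
```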
- Topology.java
import com.mongodb.client.model.WriteModel;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.bson.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
public class Topology {
public static final String ProjectName = "halo-download";
public static void main(String[] args) throws Exception {
Logger logger = LoggerFactory.getLogger(Topology.class);
// Each project loads its configuration from its own path
ParameterTool params = ParameterTool
.fromPropertiesFile(
Topology.class.getResourceAsStream("/normal.properties")
).mergeWith(ParameterTool.fromPropertiesFile(
Topology.class.getResourceAsStream("/mongodb.properties")
)).mergeWith(ParameterTool.fromPropertiesFile(
Topology.class.getResourceAsStream("/loghub.properties")
)).mergeWith(
ParameterTool.fromArgs(args)
);
String hdfsMaster = params.get("hdfs", "hdfs://");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
.enableCheckpointing(Time.seconds(
Integer.valueOf(params.get("checkpoint.sec", "300"))
).toMilliseconds())
.setStateBackend(new RocksDBStateBackend(params.get("RocksDBStateBackend", hdfsMaster + "/flink/checkpoints")));
env.getConfig().setGlobalJobParameters(params);
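// Execution environment
String execEnv = params.get("exec.env", "dev");
// Enable exactly-once checkpointing
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Enable logging to stdout
env.getConfig().enableSysoutLogging();
// Use event-time semantics (EventTime)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Latency tracking interval
env.getConfig().setLatencyTrackingInterval(1000);
// Emit watermarks every 100 milliseconds
env.getConfig().setAutoWatermarkInterval(100L);
logger.info("execEnv: " + execEnv);
// Real production data is not used here; the records below are made up to look similar, for testing
// Each test record must stay on a single line; a record broken across lines will not parse its timestamp correctly
// Data format: id1 1576684800 imei1 version1 channel1
//12-22 id2 1576944000 imei2 version2 channel1 (the leading 12-22 is the date of the timestamp, there only to make testing easier; it is not part of the data)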
//12-23 id3 1577030400 imei2 version2 channel1
//12-24 id4 1577116800 imei2 version2 channel1
//12-25 id5 1577203200 imei2 version2 channel1
//12-26 id6 1577289600 imei2 version2 channel1
//12-27 id7 1577376000 imei2 version2 channel1
//12-28 id8 1577462400 imei2 version2 channel1
SingleOutputStreamOperator<SortDowComplete> localhost = env.socketTextStream("localhost", 8888).map(new MapFunction<String, SortDowComplete>() {
@Override
public SortDowComplete map(String s) throws Exception {
String[] splits = s.split("\\W+");
SortDowComplete sortDowComplete = new SortDowComplete(splits[0], Integer.parseInt(splits[1]), splits[2], splits[3], splits[4]);
return sortDowComplete;
}
});//.setParallelism(1)
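SingleOutputStreamOperator<List<WriteModel<? extends Document>>> process =
// Assign timestamps and generate watermarks from the time field, treating the stream as ascending with no allowed lateness. The watermark produced by the current record only becomes visible when the next record arrives.
//https://ci.apache.org/projects/flink/flink-docs-release-1.10/zh/dev/event_timestamp_extractors.html
localhost.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<SortDowComplete>() {
@Override
public long extractAscendingTimestamp(SortDowComplete sortDowComplete) {
return sortDowComplete.getTs() * 1000L;
}
}).setParallelism(1)
// Key by the timestamp so that records of the same day reach the same KeyedProcessFunction state. SortKeyByFunction returns the raw timestamp, so this holds only if timestamps are already truncated to the start of the day, as in the sample data above.
.keyBy(new SortKeyByFunction()).process(new KeyTimeProcessFunction());
process.print();// print the result
// Submit the job; nothing runs until execute() is called
env.execute(ProjectName);
}
}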
}
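The processing function further down adds 1 to `currentWatermark()`. A plausible reason is that `AscendingTimestampExtractor` emits a watermark equal to the largest timestamp seen so far minus 1. The following is a simplified stand-alone model of that behaviour, not Flink code; `AscendingWatermarkModel` and its methods are names invented for this sketch, and it ignores the periodic emission interval.

```java
class AscendingWatermarkModel {
    // Largest event timestamp observed so far; Long.MIN_VALUE means "none yet".
    private long currentTimestamp = Long.MIN_VALUE;

    // Record an event timestamp; a smaller (out-of-order) value leaves the maximum unchanged.
    void observe(long timestampMs) {
        if (timestampMs >= currentTimestamp) {
            currentTimestamp = timestampMs;
        }
    }

    // Watermark = largest timestamp seen minus 1, or Long.MIN_VALUE before any record.
    long currentWatermark() {
        return currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1;
    }

    public static void main(String[] args) {
        AscendingWatermarkModel m = new AscendingWatermarkModel();
        m.observe(1576684800000L);
        System.out.println(m.currentWatermark());      // 1576684799999
        System.out.println(m.currentWatermark() + 1L); // 1576684800000, the largest timestamp seen
        m.observe(1576598400000L);                     // an older record does not move the watermark
        System.out.println(m.currentWatermark());      // 1576684799999
    }
}
```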
- SortDowComplete.java
/**
* @Author: fseast
* @Date: 2020/3/24 8:32 PM
* @Description:
*/
public class SortDowComplete{
private String gameId;
private long ts;
private String imei;
private String haloVersion;
private String haloChannel;
public SortDowComplete(){}
public SortDowComplete(String gameId, int ts, String imei, String haloVersion, String haloChannel) {
this.gameId = gameId;
this.ts = ts;
this.imei = imei;
this.haloVersion = haloVersion;
this.haloChannel = haloChannel;
}
public String getGameId() {
return gameId;
}
public void setGameId(String gameId) {
this.gameId = gameId;
}
public Long getTs() {
return ts;
}
public void setTs(int ts) {
this.ts = ts;
}
public String getImei() {
return imei;
}
public void setImei(String imei) {
this.imei = imei;
}
public String getHaloVersion() {
return haloVersion;
}
public void setHaloVersion(String haloVersion) {
this.haloVersion = haloVersion;
}
public String getHaloChannel() {
return haloChannel;
}
public void setHaloChannel(String haloChannel) {
this.haloChannel = haloChannel;
}
@Override
public String toString() {
return "SortDowComplete{" +
"gameId='" + gameId + '\'' +
", ts=" + ts +
", haloVersion='" + haloVersion + '\'' +
", haloChannel='" + haloChannel + '\'' +
", imei='" + imei + '\'' +
'}';
}
}
- SortKeyByFunction.java
import org.apache.flink.api.java.functions.KeySelector;
/**
* @Author: fseast
* @Date: 2020/3/24 11:14 PM
* @Description:
*/
public class SortKeyByFunction implements KeySelector<SortDowComplete,Long> {
@Override
public Long getKey(SortDowComplete sortDowComplete) throws Exception {
return sortDowComplete.getTs();
}
}
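The key selector above returns the raw timestamp, which groups records by day only when every timestamp already sits at the start of its day. If the input carried arbitrary timestamps, the key could be truncated to the day first. A minimal sketch of that truncation, using only `java.time`: `DayBucket` and `startOfDaySeconds` are hypothetical names, and the `Asia/Shanghai` zone is an assumption based on the sample timestamps falling on midnight UTC+8.

```java
import java.time.Instant;
import java.time.ZoneId;

class DayBucket {
    // Truncate an epoch-second timestamp to 00:00 of the same calendar day in the given zone.
    static long startOfDaySeconds(long epochSeconds, ZoneId zone) {
        return Instant.ofEpochSecond(epochSeconds)
                .atZone(zone)
                .toLocalDate()
                .atStartOfDay(zone)
                .toEpochSecond();
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Asia/Shanghai"); // assumed zone, see the note above
        // One hour after the sample timestamp still maps to the same day key.
        System.out.println(startOfDaySeconds(1576684800L + 3600L, zone)); // 1576684800
        // One second before it belongs to the previous day.
        System.out.println(startOfDaySeconds(1576684800L - 1L, zone));    // 1576598400
    }
}
```

A `KeySelector` could return this value instead of `getTs()`, in which case the timer and the MongoDB query document would also need to use the truncated time.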
- KeyTimeProcessFunction.java
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.WriteModel;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.bson.Document;
import java.util.*;
/**
* @Author: fseast
* @Date: 2020/3/24 5:23 PM
* @Description:
*/
public class KeyTimeProcessFunction extends KeyedProcessFunction<Long,SortDowComplete,List<WriteModel<? extends Document>>> {
//private ValueState<String> gameState1;
// Records of the same day share the same state
// Download count per game: a MapState whose key is the gameId and whose value is the download count
private transient MapState<String,Integer> gameState;
// Timestamp of the registered cleanup timer
private transient ValueState<Long> timerState;
// Deduplication state: the same imei and gameId count only once per day
private transient MapState<String,Integer> distinctState;
//List<SortGameDownloads> list ;
//Map<Long,Integer> map;
@Override
public void open(Configuration parameters) throws Exception {
//System.out.println("====");
//super.open(parameters);
gameState = getRuntimeContext().getMapState(new MapStateDescriptor<>("gameState", String.class,Integer.class));
timerState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("timerState",Long.class));
distinctState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Integer>("distinctState", String.class, Integer.class));
//map = new HashMap<>();
//list = new ArrayList<>();
}
@Override
public void processElement(SortDowComplete sortDowComplete, Context context, Collector<List<WriteModel<? extends Document>>> collector) throws Exception {
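// The set must be created per call, not declared outside the method
TreeSet<SortGameDownloads> set = new TreeSet<>();
// Event time of the record; multiply by 1000 to get milliseconds
long oneTime = sortDowComplete.getTs() * 1000L;
// AscendingTimestampExtractor emits watermarks of (largest timestamp seen - 1); adding 1 recovers that timestamp
long watermark = context.timerService().currentWatermark() + 1L;
// Retention of two days (48 hours) in milliseconds, shared by the cleanup timer and the late-data check below
long retentionMs = 172800000L;
//System.out.println(day);// null or 1
System.out.println("current processing time: " + context.timerService().currentProcessingTime() + ", watermark: "+ watermark);
System.out.println("event time of incoming sortDowComplete: " + oneTime);
//System.out.println("timer state: "+ timerState.value());
// While the state is still empty, value() returns null
// The watermark is the furthest event time the stream has reached; a timer is registered only when the watermark is behind this record's time.
// A watermark ahead of the record's time means the record is late. Not registering a timer for it avoids re-arming cleanup for state that was already cleared when data shows up many days afterwards.
// Register a timer only if none has been set for this key yet
if (oneTime > watermark && timerState.value() == null) {
// Clear the state two days after the record's time, e.g. state for the 3rd is cleared at 00:00 on the 5th
long timerTs = oneTime + retentionMs;
context.timerService().registerEventTimeTimer(timerTs);
// Remember the timer's timestamp in state
timerState.update(timerTs);
System.out.println("timer registered for: " + timerTs);
}
//System.out.println("timer state: "+ timerState.value());
/*Iterator<Map.Entry<String, Integer>> iterator = gameState.iterator();
while (iterator.hasNext()) {
Map.Entry<String, Integer> next = iterator.next();
System.out.println("gameState key:"+next.getKey()+"==,value:"+next.getValue());
}
gameState.put(sortDowComplete.getGameId(),101);*/
// Ignore records that are two days or more late: for example, data of the 3rd that arrives on or after the 5th is dropped
long judge = oneTime + retentionMs;
//System.out.println("judge: "+judge+" ,watermark:"+watermark);
if (judge <= watermark){
System.out.println("skipped");
}else {
System.out.println("processed");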
// Look up the current download count for this game
Integer gameDowNum = gameState.get(sortDowComplete.getGameId());
String stateKey = sortDowComplete.getGameId() + "_" + sortDowComplete.getImei();
System.out.println(stateKey+", "+distinctState.get(stateKey));
// The game already has a count in state: increment it if this device is new
if (gameDowNum != null){
Integer dist = distinctState.get(stateKey);
// Count the download only if this device has not downloaded this game yet
if (dist == null || dist == 0){
gameDowNum +=1;
// Store the updated download count for the game
gameState.put(sortDowComplete.getGameId(),gameDowNum);
// Mark the gameId + imei combination as seen
distinctState.put(stateKey,1);
}
} else {// The game has not been added to the state yet today: start its count at 1
gameState.put(sortDowComplete.getGameId(),1);
distinctState.put(stateKey,1);// Mark the gameId + imei combination as seen
}
Iterator<Map.Entry<String, Integer>> iterator = gameState.iterator();
while (iterator.hasNext()){
Map.Entry<String, Integer> next = iterator.next();
//System.out.println("key:"+next.getKey()+",value:"+next.getValue());
SortGameDownloads sortGameDownloads = new SortGameDownloads(sortDowComplete.getTs(),next.getKey(), next.getValue());
set.add(sortGameDownloads);
}
List<WriteModel<? extends Document>> writeModels = new ArrayList<>();
// Keep only the game with the most downloads
SortGameDownloads first = set.first();
first.setTop(1);
writeModels.add(new UpdateOneModel<Document>(first.getQueryDoc(),first.getUpdateDoc(),first.getOption()));
/* The code above keeps only the record with the most downloads; this block walks all records in sorted order
Iterator<SortGameDownloads> iteSort = set.iterator();
Integer i = 1;
Integer temp = -1;
while (iteSort.hasNext()) {
SortGameDownloads sortNext = iteSort.next();
if (i == 1){
temp = sortNext.getDownloads();
// add the record
sortNext.setTop(1);
//writeModels.add(new DeleteManyModel(sortNext.deleteFilter()));
writeModels.add(new UpdateOneModel(sortNext.getQueryDoc(),sortNext.getUpdateDoc(),sortNext.getOption()));
i = 2;
} else if (i == 2){// several games may tie for first place
if (temp.equals(sortNext.getDownloads())){// this one ties with the first; equals() because == compares Integer references
// add the record
sortNext.setTop(1);
writeModels.add(new UpdateOneModel(sortNext.getQueryDoc(),sortNext.getUpdateDoc(),sortNext.getOption()));
i = 2;
} else {
i ++;
break;
}
}
}*/
System.out.println("set start");
System.out.println(first);
System.out.println(set);
System.out.println("set end");
/*for (SortGameDownloads sortGameDow : set) {
writeModels.add(new UpdateOneModel(sortGameDow.getQueryDoc(),sortGameDow.getUpdateDoc(),sortGameDow.getOption()));
}*/
System.out.println(writeModels);
System.out.println("===1111111");
collector.collect(writeModels);
}
System.out.println();
}
// The timer fires once the watermark reaches or passes the registered timestamp
// Each key gets one timer, registered by its first record; the timestamp passed in is the time that timer was registered for
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<WriteModel<? extends Document>>> out) throws Exception {
//super.onTimer(timestamp, ctx, out);
// Clear the state of the key that owns this timer; the state of other keys is not touched
gameState.clear();
timerState.clear();
distinctState.clear();
System.out.println("== timer fired at: "+timestamp);
}
}
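The counting and deduplication logic of `processElement` can be tried out without a Flink cluster by replacing the keyed state with ordinary collections. This is only a model of the logic for a single day's key: `DedupCounter` and its methods are invented for the sketch, a `HashMap` stands in for `gameState`, and a `HashSet` stands in for `distinctState`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DedupCounter {
    private final Map<String, Integer> downloads = new HashMap<>(); // stands in for gameState
    private final Set<String> seen = new HashSet<>();               // stands in for distinctState

    // Count one download unless this (gameId, imei) pair was already counted.
    void add(String gameId, String imei) {
        if (seen.add(gameId + "_" + imei)) {
            downloads.merge(gameId, 1, Integer::sum);
        }
    }

    int count(String gameId) {
        return downloads.getOrDefault(gameId, 0);
    }

    // Game with the most downloads; ties go to the smaller gameId. Returns null when empty.
    String top() {
        String best = null;
        for (Map.Entry<String, Integer> e : downloads.entrySet()) {
            if (best == null
                    || e.getValue() > downloads.get(best)
                    || (e.getValue().equals(downloads.get(best)) && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DedupCounter c = new DedupCounter();
        c.add("id1", "imei1");
        c.add("id1", "imei1"); // duplicate: same device, same game
        c.add("id1", "imei2");
        c.add("id2", "imei1");
        System.out.println(c.count("id1")); // 2
        System.out.println(c.top());        // id1
    }
}
```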
- SortGameDownloads.java
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;
import java.util.Date;
/**
* @Author: fseast
* @Date: 2020/3/26 5:24 PM
* @Description:
*/
public class SortGameDownloads implements Comparable , IMongoUpdate {
private Long time;
private String gameId;
private Integer downloads;
private Integer top = -1;
public SortGameDownloads(Long time, String gameId, Integer downloads) {
this.time = time;
this.gameId = gameId;
this.downloads = downloads;
}
public Long getTime() {
return time;
}
public void setTime(Long time) {
this.time = time;
}
public String getGameId() {
return gameId;
}
public void setGameId(String gameId) {
this.gameId = gameId;
}
public Integer getDownloads() {
return downloads;
}
public void setDownloads(Integer downloads) {
this.downloads = downloads;
}
public Integer getTop() {
return top;
}
public void setTop(Integer top) {
this.top = top;
}
@Override
public String toString() {
return "SortGameDownloads{" +
"time=" + time +
", gameId='" + gameId + '\'' +
", downloads=" + downloads +
", top=" + top +
'}';
}
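@Override
public int compareTo(Object o) {
if (o instanceof SortGameDownloads){
SortGameDownloads sort = (SortGameDownloads) o;
// Sort by downloads in descending order; Integer.compare avoids the overflow that subtraction can cause
int num = Integer.compare(sort.downloads, this.downloads);
// Equal downloads: fall back to ordering by game id
if (num == 0){
return this.gameId.compareTo(sort.gameId);
}
return num;
}
return 0;
}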
@Override
public Document getQueryDoc() {
Document query = new Document();
//query.append("game_id",gameId);
query.append("time",new Date(time * 1000L));
return query;
}
@Override
public Document getUpdateDoc() {
Document inc = new Document();
inc.append("downloads",downloads);
inc.append("top",top);
inc.append("game_id",gameId);
return new Document("$set",inc);
}
@Override
public UpdateOptions getOption() {
return new UpdateOptions().upsert(true);
}
// Delete operation
/*public Document deleteFilter(){
Document delete = new Document();
delete.append("time",new Date(time * 1000L));
return delete;
}*/
}
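The ordering that `compareTo` defines (downloads descending, then game id ascending) is what makes `set.first()` return the top game. The same ordering can be written as a `Comparator` and checked in isolation. `RankingDemo` and `GameCount` are names made up for this sketch and are not part of the project.

```java
import java.util.Comparator;
import java.util.TreeSet;

class RankingDemo {
    // Minimal stand-in for SortGameDownloads: just the two fields used for ordering.
    static final class GameCount {
        final String gameId;
        final int downloads;

        GameCount(String gameId, int downloads) {
            this.gameId = gameId;
            this.downloads = downloads;
        }
    }

    // Downloads descending, then gameId ascending to break ties.
    static final Comparator<GameCount> BY_DOWNLOADS_DESC =
            Comparator.comparingInt((GameCount g) -> g.downloads)
                    .reversed()
                    .thenComparing((GameCount g) -> g.gameId);

    public static void main(String[] args) {
        TreeSet<GameCount> set = new TreeSet<>(BY_DOWNLOADS_DESC);
        set.add(new GameCount("id3", 5));
        set.add(new GameCount("id2", 7));
        set.add(new GameCount("id1", 7));
        System.out.println(set.first().gameId); // id1: ties with id2 on downloads, smaller id wins
        System.out.println(set.last().gameId);  // id3
    }
}
```

Because a `TreeSet` treats elements that compare as equal as duplicates, the tie-break on game id is what keeps two games with the same download count from collapsing into one entry.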
- IMongoUpdate.java
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;
public interface IMongoUpdate {
Document getQueryDoc();
Document getUpdateDoc();
UpdateOptions getOption();
}