Background
A recruitment app. The front end emits a tracking event for every chat action performed in the app.
Analysis
Sample raw events:
{"position_id": "111","user_id": "10001","identity": "1","friend_id": "10003","timestamp": 1000}
{"position_id": "101","user_id": "10002","identity": "2","friend_id": "10001","timestamp": 1001}
{"position_id": "101","user_id": "10002","identity": "2","friend_id": "10001","timestamp": 2002}
Field descriptions:
position_id: job position ID
user_id: user ID
identity: role (1 = recruiter, 2 = job seeker)
friend_id: the other party's user ID
timestamp: event time of the tracking record
There are two roles: 1. recruiter, 2. job seeker.
Chats happen only between a recruiter and a job seeker, and every message sent produces one tracking record.
Goal: compute the chat PV and UV for each job position over 30-minute windows.
PV: count all chat records within the 30-minute window, grouped by position_id.
UV: within the 30-minute window, count the distinct job seekers per position. Because the job seeker's ID sits in a different field depending on who sent the message (a recruiter's record stores it in friend_id, while a job seeker's own record stores it in user_id), the two cases have to be handled separately.
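Take the sample events above, assuming all three fall into the same 30-minute window: position 101 has two records, both sent by job seeker 10002, so its PV is 2 and its UV is 1; position 111 has one record sent by a recruiter whose friend_id is 10003, so its PV is 1 and its UV is 1.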
The Kafka messages are turned into JSONObject with a map function, and records with empty values are filtered out, yielding job_chat_dataStream_filtered (the map and filter code is omitted in the post; a sketch follows below).
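A minimal sketch of that omitted step, assuming a Kafka source stream of JSON strings named kafkaSourceStream (a hypothetical name) and fastjson for parsing; the watermark strategy shown here is also an assumption, but event-time windows do need timestamps and watermarks assigned somewhere upstream:
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import java.time.Duration

// parse the raw Kafka messages into JSONObject and drop records missing key fields
val job_chat_dataStream_filtered = kafkaSourceStream // hypothetical DataStream[String] read from Kafka
  .map(msg => JSON.parseObject(msg))
  .filter(json => json != null
    && json.getString("position_id") != null
    && json.getString("user_id") != null
    && json.getString("identity") != null)
  // event-time windows need timestamps and watermarks; here they come from the "timestamp" field
  .assignTimestampsAndWatermarks(
    WatermarkStrategy
      .forBoundedOutOfOrderness[JSONObject](Duration.ofSeconds(10))
      .withTimestampAssigner(new SerializableTimestampAssigner[JSONObject] {
        override def extractTimestamp(json: JSONObject, recordTs: Long): Long = json.getLongValue("timestamp")
      }))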
// 30-minute features
val job_char_features_jch30 = job_chat_dataStream_filtered
  .keyBy(x => x.getString("position_id"))
  .window(TumblingEventTimeWindows.of(Time.minutes(30)))
  .process(new ProcessWindowFunction[JSONObject, JobFeatureBean, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[JSONObject], out: Collector[JobFeatureBean]): Unit = {
      val set = new mutable.HashSet[String]()
      var timestamp = 0L
      for (elem <- elements) {
        // sender is a recruiter: dedup on friend_id (the job seeker)
        if ("1".equals(elem.getString("identity"))) {
          set.add(elem.getString("friend_id"))
        // sender is a job seeker: dedup on user_id
        } else if ("2".equals(elem.getString("identity"))) {
          set.add(elem.getString("user_id"))
        }
        // keep the latest event time in the window (read as a long, not as a String)
        if (elem.getLongValue("timestamp") > timestamp) {
          timestamp = elem.getLongValue("timestamp")
        }
      }
      out.collect(new JobFeatureBean(key.reverse, timestamp, elements.size.toString, "pv"))
      out.collect(new JobFeatureBean(key.reverse, timestamp, set.size.toString, "uv"))
      set.clear()
    }
  })
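JobFeatureBean itself is not shown in the post; judging from the constructor calls above it carries (key, timestamp, value, feature name), so it presumably looks roughly like this (the field names are my guess):
// hypothetical shape of JobFeatureBean, inferred from the calls above
class JobFeatureBean(val positionId: String,   // the keyed position_id (reversed in the code above)
                     val timestamp: Long,      // event time / window end time
                     val value: String,        // the metric value as a string
                     val featureType: String)  // "pv" or "uv"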
Functionally and logically the code above is fine. The problems only showed up once it ran on the cluster: after running for a while the job kept dying (a restart strategy had of course been configured in the program), with errors roughly along the lines of connection timeouts and out-of-memory. Searching online, some posts said to change the JDK version used by YARN, but I was already on a newer JDK, so that was not it. Others suggested changing
akka.ask.timeout = 120s
but mine was already 120s, so that was not the problem either. Then, looking at the checkpoints in the Flink monitoring UI myself, I found that checkpoints sometimes failed, and some of them were around 4 GB, so they still had not finished when the checkpointTimeout threshold was reached; since the job is configured to tolerate at most 3 failed checkpoints, it failed once that limit was exceeded. My first idea was to raise checkpointTimeout, but even with a larger value the job still restarted occasionally because the sheer amount of data caused out-of-memory errors. I then increased the TaskManager memory, which stabilized things for the time being, but it ties up a lot of cluster resources, so I started thinking about how to optimize.
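For reference, a minimal sketch of the kind of checkpoint and restart settings discussed above; the concrete interval, timeout, and retry values are assumptions, and only the tolerance of 3 failed checkpoints reflects the limit mentioned in the post:
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// take a checkpoint every 5 minutes (interval is an assumption)
env.enableCheckpointing(5 * 60 * 1000L)
val ckConf = env.getCheckpointConfig
// a checkpoint that runs longer than this is counted as failed
ckConf.setCheckpointTimeout(10 * 60 * 1000L)
// tolerate at most 3 failed checkpoints before failing the job, matching the limit described above
ckConf.setTolerableCheckpointFailureNumber(3)

// restart strategy so the job is retried after a failure (values are assumptions)
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(30)))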
Problem
The checkpoints are large because a 30-minute window accumulates a lot of data: process() keeps every record of those 30 minutes in the window state. PV and UV, however, can be computed incrementally with aggregate().
Solution
job_chat_dataStream_filtered
  .keyBy(x => x.getString("position_id"))
  .window(TumblingEventTimeWindows.of(Time.minutes(30)))
  .aggregate(new PVAggUtil, new WindowResultPV)

job_chat_dataStream_filtered
  .keyBy(x => x.getString("position_id"))
  .window(TumblingEventTimeWindows.of(Time.minutes(30)))
  .aggregate(new UVAggUtil, new WindowResultUV)
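Here aggregate() takes two functions: the AggregateFunction updates a small accumulator for every incoming record, so the window state holds only the accumulator instead of all the raw records, and the window function is then called once per window with the single pre-aggregated result plus the window metadata (key and window end time).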
Aggregate functions
PVAggUtil
// AggregateFunction[IN, ACC, OUT]: input record, accumulator (running count), output
class PVAggUtil extends AggregateFunction[JSONObject, Int, Int] {
  override def createAccumulator(): Int = 0
  override def add(in: JSONObject, acc: Int): Int = acc + 1
  override def getResult(acc: Int): Int = acc
  override def merge(acc: Int, acc1: Int): Int = acc + acc1
}
WindowResultPV
// emits the pre-aggregated PV count together with the window end time; input holds exactly one element
class WindowResultPV extends WindowFunction[Int, JobFeatureBean, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Int], out: Collector[JobFeatureBean]): Unit = {
    out.collect(new JobFeatureBean(key.reverse, window.getEnd, input.iterator.next().toString, "pv"))
  }
}
UVAggUtil
class UVAggUtil extends AggregateFunction[JSONObject, mutable.HashSet[String], Int] {
  // initialize the accumulator
  override def createAccumulator(): mutable.HashSet[String] = new mutable.HashSet[String]()
  // accumulate: collect the job seeker's ID, which lives in a different field depending on the sender
  override def add(in: JSONObject, acc: mutable.HashSet[String]): mutable.HashSet[String] = {
    // sender is a recruiter: dedup on friend_id (the job seeker)
    if ("1".equals(in.getString("identity"))) {
      acc.add(in.getString("friend_id"))
    // sender is a job seeker: dedup on user_id
    } else if ("2".equals(in.getString("identity"))) {
      acc.add(in.getString("user_id"))
    }
    acc
  }
  // output OUT: the number of distinct job seekers
  override def getResult(acc: mutable.HashSet[String]): Int = acc.size
  // merge accumulators from different partitions
  override def merge(acc: mutable.HashSet[String], acc1: mutable.HashSet[String]): mutable.HashSet[String] = acc ++ acc1
}
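Note that the accumulator still grows with the number of distinct job seekers chatting about a position within the window, but that is one ID per job seeker instead of one full JSONObject per message.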
WindowResultUV
// WindowFunction type parameters: 1. input (the accumulator's output), 2. output OUT, 3. keyBy key type, 4. window type
class WindowResultUV extends WindowFunction[Int, JobFeatureBean, String, TimeWindow] {
  // key: the keyBy key, window: the window, input: the aggregate result (a single element), out: collector
  override def apply(key: String, window: TimeWindow, input: Iterable[Int], out: Collector[JobFeatureBean]): Unit = {
    out.collect(new JobFeatureBean(key.reverse, window.getEnd, input.iterator.next().toString, "uv"))
  }
}
After switching to aggregate(), the checkpoint size dropped from about 4 GB to a few hundred MB, checkpoint time shrank dramatically, and most importantly the job now needs far fewer cluster resources.
This is just my own write-up; corrections are very welcome.