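Flink Real-Time Project
Recommendation module
The flink-2hbase module consists mainly of four Flink jobs.
MySQL stores user and product information. It acts as a dimension table and is used to join extra attributes onto events.
HBase stores the results produced by Flink.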
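Log import
Records received from Kafka are written directly into an HBase fact table, preserving the complete log. Each log record contains the user ID, the ID of the product the user acted on, the time of the action, and the action type (purchase, click, recommendation, and so on).
The metrics needed by the dashboard are computed per time window and returned to the front end for display.
The data is stored in the HBase con table.
The job reads from the Kafka con topic and implements MapFunction to parse each log line into a LogEntity (userid, produceid, time, action). It then builds the HBase rowkey by concatenating the user ID, product ID and timestamp, and inserts each record into the HBase con table.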
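The rowkey construction described above can be sketched in plain Java without any HBase dependency. The underscore separator and the field order are assumptions; the notes only say that the key is concatenated from user ID, product ID and timestamp.

```java
// Plain-Java sketch; the "_" separator and the field order are assumptions.
final class RowKeySketch {
    // Concatenates userId, productId and the event timestamp into one rowkey string.
    static String rowKey(int userId, int productId, long timestamp) {
        return userId + "_" + productId + "_" + timestamp;
    }

    public static void main(String[] args) {
        System.out.println(rowKey(101, 7, 1651199620000L));
    }
}
```

Putting the user ID first keeps all records of one user adjacent in HBase, which suits per-user scans; a different field order would favor different queries.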
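User-product browsing history -> item-based collaborative filtering recommendation
Flink records which products in a category each user has viewed, in preparation for item-based collaborative filtering. User ratings are written to HBase in real time for later offline processing.
The job reads from the Kafka con topic and implements MapFunction. The map function writes the user ID and product ID into the corresponding rows of the u_history and p_history tables and increments the counters.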
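User profile computation -> tag-based recommendation
The user profile is computed along three dimensions: the user's color interest, country-of-origin interest, and style interest. The profile is updated continuously from the logs and stored in the HBase user table.
The job reads from the Kafka con topic and looks up the product attributes (country, color, style). In the row whose rowkey is the userId, it finds the cells whose columns are (country, color, style) and increments their counters. The table is named user.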
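The "find the cell and increment its counter" step can be illustrated with an in-memory stand-in for the HBase table. This is only a sketch: the class name, the nested-map layout and the `color:red` column naming are assumptions made for illustration, not the project's actual HBase client code.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for an HBase counter table: rowkey -> column -> count.
final class ProfileCounterSketch {
    private final Map<String, Map<String, Long>> table = new HashMap<>();

    // Adds 1 to the counter in the given row and column and returns the new value.
    long increment(String rowKey, String column) {
        return table.computeIfAbsent(rowKey, k -> new HashMap<>())
                    .merge(column, 1L, Long::sum);
    }

    // Reads a counter; cells that were never written count as 0.
    long get(String rowKey, String column) {
        Map<String, Long> row = table.get(rowKey);
        return row == null ? 0L : row.getOrDefault(column, 0L);
    }

    public static void main(String[] args) {
        ProfileCounterSketch user = new ProfileCounterSketch();
        user.increment("101", "color:red");
        user.increment("101", "color:red");
        System.out.println(user.get("101", "color:red"));
    }
}
```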
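Product profile recording -> tag-based recommendation
The product profile is recorded along two dimensions: the age groups that like the product, and gender. The data is stored in the HBase prod table.
The job reads from the Kafka con topic, implements MapFunction, looks up the user's sex and age in the MySQL user table, and increments the corresponding counters in the HBase prod table.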
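Real-time hot ranking -> popularity-based recommendation
Flink's time-window mechanism computes the current popularity, and ListState holds the entries of one ranking. The results are cached in Redis as a list keyed by timestamp.
# Aggregate the data and write it to Redis for the real-time hot ranking
First key the stream by productId and apply a sliding window per productId. Within each window, an aggregate operation counts the occurrences of the product and wraps the result in a topProduct object.
The aggregate step adds one for every incoming record. The results are then keyed by windowEnd so that all products of the same window end up together, where they are sorted and ranked. Finally the ranking is written out. To see why the second keyBy on windowEnd is needed, consider what a window function emits downstream: one record per (key, window) pair.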
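# Understanding the stream-processing concepts
Applying keyBy to a stream logically splits it into n partitions, and each incoming record is routed to one of them. Any operation after keyBy is applied to each partition separately, so an aggregation yields one result per key. Is that result accumulated over the whole stream, or does it cover only the current record? Without a window it is accumulated over the whole stream: each new record updates the state, and the updated value is sent to the next task. With a window, the state belongs to a single window; once that window has fired it is finished, and nothing carries over to the next window.
The problem here is how to combine the aggregates of different keys that belong to the same window. Without keying by window, the results of all windows are handled together by the same downstream operator, and any aggregation without a window is cumulative.
The key is to pin down four things: 1. when the computation is triggered; 2. which data it operates on; 3. whether the result is accumulated state; 4. where the result is sent.
With a window, the computation is triggered when the watermark reaches windowEnd. It operates on the data inside that window, restricted to the current key if the stream is keyed. The result is not accumulated state. The number of results is (number of keys) × (number of windows), and all of them are sent to the downstream task. If the downstream task needs the data of one specific window, it has to key by the window time. Keying alone cannot trigger a computation, so a timer is registered that fires when the watermark passes windowEnd and then processes everything collected for that window. The final flatMap unpacks the resulting list into individual records.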
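To make the "(number of keys) × (number of windows)" point concrete, the sketch below computes which sliding windows a single event falls into. It is plain Java rather than Flink code and assumes a window offset of 0 and non-negative millisecond timestamps; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of sliding-window assignment (offset 0, timestamps >= 0 assumed).
final class SlidingWindowSketch {
    // Returns the end timestamp of every window [start, start + size) that contains ts,
    // starting with the window that opened most recently.
    static List<Long> windowEnds(long ts, long sizeMs, long slideMs) {
        List<Long> ends = new ArrayList<>();
        long lastStart = ts - (ts % slideMs); // latest window start that is <= ts
        for (long start = lastStart; start > ts - sizeMs; start -= slideMs) {
            ends.add(start + sizeMs);
        }
        return ends;
    }

    public static void main(String[] args) {
        // A 60 s window sliding every 10 s puts each event into 6 windows.
        System.out.println(windowEnds(65_000L, 60_000L, 10_000L));
    }
}
```

Because every event lands in size/slide windows and each (key, window) pair emits its own result, the downstream operator receives many partial results per window end. That is why the results are regrouped by windowEnd before ranking.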
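aggregate: computes statistics by aggregating the data.
process: applies more complex logic to a stream, including registering and firing timers. Besides the per-element method it offers lifecycle and callback methods such as open and onTimer. It is a lower-level API with richer functionality.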
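Real-time metrics
Window processing
Window processing can be done in the following ways, which differ in the window function used:
1. Without keyBy: use windowAll with a window process function, which handles the data of all keys in the window.
2. With keyBy, then window: use an aggregate function to aggregate the data of each (key, window) pair.
3. With keyBy, then window: use a window process function (or full-window function) to aggregate the data of each (key, window) pair.
Functions at the same level: 1. incremental aggregation functions (reduce, aggregate); 2. full-window functions; 3. window process functions.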
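Log structure
Login log
Login failure log
# Page data, event data, startup data and error data
Of the event, page, exposure, error and startup log modules, the Flink project needs events (for collaborative filtering), pages (for UV/PV analysis), exposures (for impression counts) and startup logs (for daily active users).
In addition, the business log of product transactions is needed to compute the Top N products and first-order users.
Some data can come from either the business database or the behavior logs. Adding to cart and adding to favorites, for example, can be recorded by front-end logging or read from the business database.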
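Top N products
Every 10 s, compute the product popularity ranking over the last minute and output the current Top N.
windowAll
No keyBy
# The original idea: skip keyBy and run on a single partition, maintaining a HashMap whose key is the url and whose value is that url's popularity, measured for example by visit count.
# Here windowAll opens the window directly, and a window process function computes over the full window contents.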
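package com.demo.task.practice;
import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.util.*;
// TopProductEntity and ClickSource come from the project's own packages (not shown here)
public class ProcessAllWindowTopN {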
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
SingleOutputStreamOperator<LogEntity> sourceStream=env.addSource(new ClickSource())
.assignTimestampsAndWatermarks(
WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
.withTimestampAssigner(new SerializableTimestampAssigner<LogEntity>() {
@Override
public long extractTimestamp(LogEntity element, long recordTimestamp) {
return element.getTime();
}
})
);
// SingleOutputStreamOperator<Integer> result=sourceStream.map(new MapFunction<LogEntity, Integer>() {
// @Override
// public Integer map(LogEntity value) throws Exception {
// return value.getProductId();
// }
// });
SingleOutputStreamOperator<TopProductEntity> result=sourceStream.map(logEntity->logEntity.getProductId())
.windowAll(SlidingEventTimeWindows.of(Time.seconds(60),Time.seconds(10)))
.process(new ProcessAllWindowFunction<Integer, TopProductEntity, TimeWindow>() {
@Override
public void process(Context context, Iterable<Integer> iterable, Collector<TopProductEntity> collector) throws Exception {
HashMap<Integer,Long> productCountMap=new HashMap<>();
for(Integer productId:iterable){
if(productCountMap.containsKey(productId))
{
Long oldValue=productCountMap.get(productId);
productCountMap.put(productId,oldValue+1L);
}
else{
productCountMap.put(productId,1L);
}
}
ArrayList<Tuple2<Integer,Long>> productIdCountList=new ArrayList<>();
for(Map.Entry<Integer,Long> entry:productCountMap.entrySet()){
productIdCountList.add(Tuple2.of(entry.getKey(),entry.getValue()));
}
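// Sort by count in descending order; Long.compare avoids the overflow risk of subtracting int values
productIdCountList.sort(new Comparator<Tuple2<Integer, Long>>() {
@Override
public int compare(Tuple2<Integer, Long> o1, Tuple2<Integer, Long> o2) {
return Long.compare(o2.f1, o1.f1);
}
});
// Emit at most 10 entries; a window may contain fewer than 10 distinct products
for(int i=0;i<Math.min(10,productIdCountList.size());i++){
Tuple2<Integer,Long> temp=productIdCountList.get(i);
collector.collect(TopProductEntity.of(temp.f0,context.window().getEnd(),temp.f1));
}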
}
}
);
result.print();
env.execute();
}
}
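The count-then-rank step inside the process function can be isolated as a small plain-Java helper, which makes it easy to check on its own. This is a sketch with hypothetical names (`TopNSketch`, `topN`), not part of the project; it assumes the per-product counts have already been collected into a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the ranking step: sort counts in descending order and keep at most n.
final class TopNSketch {
    static List<Map.Entry<Integer, Long>> topN(Map<Integer, Long> counts, int n) {
        List<Map.Entry<Integer, Long>> entries = new ArrayList<>(counts.entrySet());
        // Long.compare keeps the ordering correct even for very large counts
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        // Never read past the end when fewer than n products were seen
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        Map<Integer, Long> counts = new HashMap<>();
        counts.put(101, 5L);
        counts.put(102, 9L);
        counts.put(103, 1L);
        System.out.println(topN(counts, 2));
    }
}
```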
Elasticsearch Sink
PUT /topproduct
{
"settings": {
"number_of_shards": 3
},
"mappings": {
"properties":{
"productid":{
"type":"keyword"
},
"times":{
"type":"keyword"
},
"windowEnd":{
"type":"date"
}
}
}
}
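Real-time processing throughput
Running on a Windows host, the job used about 60% of an i7-9700 CPU and 16 GB of memory. The custom SourceFunction emitted about 7.04 million simulated log records per product per minute, roughly 70 million records per minute across 10 products. The program kept up in real time, producing the 1-minute Top N ranking with the sliding window advancing every 10 s.
Throughput would presumably be higher on a cluster.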
2> TopProductEntity{productId=110, actionTimes=7022101, windowEnd=1651199620000, rankName='1651199620000'}
6> TopProductEntity{productId=104, actionTimes=7049810, windowEnd=1651199630000, rankName='1651199630000'}
7> TopProductEntity{productId=108, actionTimes=7049275, windowEnd=1651199630000, rankName='1651199630000'}
8> TopProductEntity{productId=102, actionTimes=7049250, windowEnd=1651199630000, rankName='1651199630000'}
5> TopProductEntity{productId=105, actionTimes=7049997, windowEnd=1651199630000, rankName='1651199630000'}
3> TopProductEntity{productId=106, actionTimes=7047867, windowEnd=1651199630000, rankName='1651199630000'}
1> TopProductEntity{productId=109, actionTimes=7048695, windowEnd=1651199630000, rankName='1651199630000'}
2> TopProductEntity{productId=101, actionTimes=7048377, windowEnd=1651199630000, rankName='1651199630000'}
5> TopProductEntity{productId=107, actionTimes=7045300, windowEnd=1651199630000, rankName='1651199630000'}
4> TopProductEntity{productId=103, actionTimes=7051025, windowEnd=1651199630000, rankName='1651199630000'}
4> TopProductEntity{productId=110, actionTimes=7047220, windowEnd=1651199630000, rankName='1651199630000'}
1> TopProductEntity{productId=101, actionTimes=7075255, windowEnd=1651199640000, rankName='1651199640000'}
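# Submit the job to the cluster in session mode
# The application contains a single job, so per-job mode and application mode differ little here and give similar resource isolation. The cluster is a test cluster with no other jobs competing for CPU, so only session mode is tested.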
bin/flink run -h
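# Standalone submission: specify which jar to submit, which class to run, and where to submit it
# On YARN, choose session, per-job or application mode directly; no JobManager address is needed. -m specifies the JobManager address, -c the entry class, and -d runs the job in detached mode.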
bin/flink run -m hbase1:8081 -c com.demo.task.practice.ProcessAllWindowTopN /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar
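Window process function Top N
Count the frequency of each product, then sort.
keyBy Top N: first key by productId, then apply a sliding window and count.
After that, key the results of the same window together, rank them with a KeyedProcessFunction, and emit the ranking.
All results of the same window are stored in state and a timer is registered. When the watermark passes the window end the timer fires, the per-product counts are sorted, and the top entries are the hot products.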
DataStream<TopProductEntity> topProduct = dataStream.map(new TopProductMapFunction()).
// Extract the event timestamp for watermarks; the log time is in seconds, so convert it to milliseconds
assignTimestampsAndWatermarks(new AscendingTimestampExtractor<LogEntity>() {
@Override
public long extractAscendingTimestamp(LogEntity logEntity) {
return logEntity.getTime() * 1000;
}
})
// Key by productId and apply a sliding window
.keyBy("productId").timeWindow(Time.seconds(60),Time.seconds(5))
// Count the records in the window and wrap the result in a TopProductEntity; windowEnd is carried along so results can be regrouped per window
.aggregate(new CountAgg(), new WindowResultFunction())
// Bring the results of the same window together
.keyBy("windowEnd")
// When the watermark reaches windowEnd the timer fires, the entries are sorted, and an ArrayList is emitted.
// The flatMap below unpacks each ArrayList and emits one TopProductEntity per entry, in rank order.
// What is sent downstream is the ArrayList for one windowEnd, not accumulated state.
.process(new TopNHotItems(topSize))
.flatMap(new FlatMapFunction<List<TopProductEntity>, TopProductEntity>() {
@Override
public void flatMap(List<TopProductEntity> TopProductEntitys, Collector<TopProductEntity> collector) throws Exception {
System.out.println("-------------Top N Product------------");
for (int i = 0; i < TopProductEntitys.size(); i++) {
TopProductEntity top = TopProductEntitys.get(i);
// Print and emit the ranked result
System.out.println(top);
collector.collect(top);
}
}
});
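Daily active users
For each incoming record, check whether the user has already logged in today. A first login of the day is recorded, and the number of daily active users is counted.
Counting the distinct users who logged in today requires deduplication by user, for example with a HashSet or an external store such as Redis.
The deduplicated user records are retained.
# All user IDs active today
# Check whether the Redis set for the day exists and create it if not. Otherwise use the user ID to check the set and filter out users who have already logged in. Write each newly logged-in user into the set and insert the record into Elasticsearch.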
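The deduplication logic can be tried out without Redis by replacing the per-day Redis set with an in-memory map of sets. This is a sketch under assumptions: the `flinkdau` key prefix follows the job below, while the UTC time zone, class name and method names are chosen only for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for the per-day Redis set used for DAU deduplication.
final class DauSketch {
    // The day is derived in UTC here; the real job's time zone is an open assumption.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
    private final Map<String, Set<Long>> store = new HashMap<>();

    // One key per calendar day, e.g. flinkdau2022-04-29.
    static String dayKey(long tsMillis) {
        return "flinkdau" + DAY.format(Instant.ofEpochMilli(tsMillis));
    }

    // Returns true only for the user's first login on that day (like SADD returning 1).
    boolean firstLoginToday(long userId, long tsMillis) {
        return store.computeIfAbsent(dayKey(tsMillis), k -> new HashSet<>()).add(userId);
    }

    // Number of distinct users seen on the day containing tsMillis.
    int dailyActiveUsers(long tsMillis) {
        Set<Long> users = store.get(dayKey(tsMillis));
        return users == null ? 0 : users.size();
    }
}
```

Keying the set by day means yesterday's users do not suppress today's first logins, and old keys can simply expire.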
package com.demo.task.practice;
import com.demo.domain.LogEntity;
import com.demo.util.Property;
import com.demo.util.RedisUtil;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.apache.http.client.methods.HttpPost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import redis.clients.jedis.Jedis;
import java.text.SimpleDateFormat;
import java.util.*;
public class DailyActiveUser {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
SingleOutputStreamOperator<StartLog> sourceStream=env.addSource(new StartLogSource())
.process(new ProcessFunction<StartLog, StartLog>() {
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
jedis= RedisUtil.connectRedis(Property.getStrValue("redis.host"));
if(jedis!=null){
System.out.println("jedis connected: "+jedis);
}
simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd");
}
private Jedis jedis;
private SimpleDateFormat simpleDateFormat;
@Override
public void processElement(StartLog value, Context ctx, Collector<StartLog> out) throws Exception {
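// Check the per-day Redis set: sadd returns 1 only for the user's first login of the day
Long userId=value.getUserId();
Long time=value.getTs();
// Build the day key once so that sadd, ttl and expire all use the same key
Date date=new Date(time);
String key="flinkdau"+simpleDateFormat.format(date);
Long flag=jedis.sadd(key, String.valueOf(userId));
// ttl returns -1 for a key without expiry; set a 24-hour expiry in that case
if(jedis.ttl(key)==-1L)
{
jedis.expire(key, 3600 * 24);
}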
if(flag==1){
out.collect(value);
}
}
@Override
public void close() throws Exception {
super.close();
jedis.close();
}
});
// sourceStream.print();
// httppost
//elasticsearchSinkFunction
ElasticsearchSinkFunction<StartLog> elasticsearchSinkFunction=new ElasticsearchSinkFunction<StartLog>() {
@Override
public void process(StartLog element, RuntimeContext ctx, RequestIndexer indexer) {
Map<String,String> result=new HashMap<>();
result.put("userId",Long.valueOf(element.getUserId()).toString());
result.put("entry",element.getEntry());
result.put("ts",Long.valueOf(element.getTs()).toString());
IndexRequest indexRequest= Requests.indexRequest().index("flinkdaustartlog").type("startLog").source(result).id(Long.valueOf(element.getUserId()).toString());
indexer.add(indexRequest);
}
};
List<HttpHost> httpPosts=new ArrayList<>();
httpPosts.add(new HttpHost("hbase",9200,"http"));
ElasticsearchSink.Builder<StartLog> builder=new ElasticsearchSink.Builder<StartLog>(httpPosts, elasticsearchSinkFunction);
builder.setBulkFlushMaxActions(1);
sourceStream.addSink(builder.build());
sourceStream.print();
env.execute("flinkdau");
}
}
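First-order users of the day
In e-commerce, the requirement is to record which users placed their first order today and how many there are, which means filtering out users who have ordered before.
On a content platform, the equivalent is the number of users making their first purchase today.
Why not Redis? The first-order flag has to be kept permanently, so Redis is not a good fit. HBase is used instead to record whether a user has already ordered.
Elasticsearch index creation: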
PUT /newpurchaseuser
{
"settings": {
"number_of_shards": 3
},
"mappings": {
"properties":{
"userId":{
"type":"keyword"
},
"productId":{
"type":"keyword"
},
"time":{
"type":"date"
}
}
}
}
# Create the table with Phoenix
create table npuser(userid varchar not null primary key, flag varchar) salt_buckets=16;
upsert into npuser values('101','1');
# Check whether this is the user's first order; if so, keep the record and insert it into HBase through Phoenix.
package com.demo.task.practice;
import com.demo.domain.LogEntity;
import com.demo.util.Property;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.*;
import java.util.concurrent.Executors;
public class NewPurchaseUser {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
SingleOutputStreamOperator<LogEntity> sourceStream=env.addSource(new ClickSource())
// Keep only first-order users
.process(new ProcessFunction<LogEntity, LogEntity>() {
private Connection conn;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
String url = "jdbc:phoenix:hbase,hbase1,hbase2:2181";
conn = DriverManager.getConnection(url);
}
@Override
public void close() throws Exception {
super.close();
conn.close();
}
@Override
public void processElement(LogEntity value, Context ctx, Collector<LogEntity> out) throws Exception {
int userId=value.getUserId();
// Look up whether the user has ordered before; the statement and result set are closed after use
String sql="select userid from npuser where userid ='"+userId+"'";
boolean firstOrder;
try(Statement statement=conn.createStatement();
ResultSet resultSet=statement.executeQuery(sql)){
firstOrder=!resultSet.next();
}
// First order: mark the user in HBase and emit the record
if(firstOrder){
String insertsql="upsert into npuser values('"+userId+"','1')";
System.out.println(insertsql);
try(Statement insertStatement=conn.createStatement()){
insertStatement.execute(insertsql);
}
conn.commit();
out.collect(value);
}
}
});
ElasticsearchSinkFunction<LogEntity> elasticsearchSinkFunction=new ElasticsearchSinkFunction<LogEntity>() {
@Override
public void process(LogEntity element, RuntimeContext ctx, RequestIndexer indexer) {
Map<String,String> result=new HashMap<>();
result.put("userId",Integer.valueOf(element.getUserId()).toString());
result.put("productId",Integer.valueOf(element.getProductId()).toString());
result.put("time",Long.valueOf(element.getTime()).toString());
IndexRequest indexRequest= Requests.indexRequest().index("newpurchaseuser").source(result).id(Integer.valueOf(element.getUserId()).toString());
indexer.add(indexRequest);
}
};
List<HttpHost> httpPosts=new ArrayList<>();
httpPosts.add(new HttpHost(new Property().getElasProperties().getProperty("host"),9201,"http"));
ElasticsearchSink.Builder<LogEntity> builder=new ElasticsearchSink.Builder<LogEntity>(httpPosts, elasticsearchSinkFunction);
builder.setBulkFlushMaxActions(1);
sourceStream.addSink(builder.build());
sourceStream.print();
env.execute("npuser");
}
}
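UV PV
UV (Unique Visitor): the number of distinct users who viewed a page.
PV (Page View): the number of times a page was viewed.
PV/UV is the average number of views per user, that is, how many times each user viewed the page on average. To some extent it reflects user stickiness.
UV and PV are also frequency counts over an object, in this case pages. UV additionally requires deduplication.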
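The relationship between PV, UV and the PV/UV ratio can be shown with a small accumulator in plain Java. The names are illustrative. It mirrors the shape of an incremental aggregation (add, merge, result) but has no Flink dependency.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch of a PV/UV accumulator: a set of user IDs for UV and a counter for PV.
final class PvUvSketch {
    static final class Acc {
        final Set<Integer> users = new HashSet<>(); // distinct visitors (UV)
        long pv = 0L;                               // total views (PV)
    }

    // Called once per view: records the visitor and counts the view.
    static Acc add(Acc acc, int userId) {
        acc.users.add(userId);
        acc.pv++;
        return acc;
    }

    // Combines two partial accumulators into the first one.
    static Acc merge(Acc a, Acc b) {
        a.users.addAll(b.users);
        a.pv += b.pv;
        return a;
    }

    // Average views per distinct user; 0 when nothing has been seen.
    static double pvPerUv(Acc acc) {
        return acc.users.isEmpty() ? 0.0 : (double) acc.pv / acc.users.size();
    }
}
```

Holding every user ID in a set is exact but grows with the number of visitors; for very large windows an approximate structure would be the usual trade-off.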
Elasticsearch index creation:
PUT /uvpv
{
"settings": {
"number_of_shards": 3
},
"mappings": {
"properties":{
"uvpv":{
"type":"keyword"
},
"productId":{
"type":"keyword"
},
"ts":{
"type":"date"
}
}
}
}
package com.demo.task.practice;
import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import java.util.*;
public class UVPV {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<Integer,Double>> stream=env.addSource(new ClickSource())
.assignTimestampsAndWatermarks(
WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
.withTimestampAssigner(
new SerializableTimestampAssigner<LogEntity>() {
@Override
public long extractTimestamp(LogEntity element, long recordTimestamp) {
return element.getTime();
}
}
)
)
.keyBy(LogEntity-> LogEntity.getProductId())
.window( SlidingEventTimeWindows.of(Time.seconds(60),Time.seconds(10)))
.aggregate(new AggregateFunction<LogEntity, Tuple2<HashSet<String>, Long>, Double>() {
@Override
public Tuple2<HashSet<String>, Long> createAccumulator() {
return Tuple2.of(new HashSet<String>(), 0L);
}
@Override
public Tuple2<HashSet<String>, Long> add(LogEntity value, Tuple2<HashSet<String>, Long> accumulator) {
accumulator.f0.add(Integer.valueOf(value.getUserId()).toString());
return Tuple2.of(accumulator.f0, accumulator.f1 + 1L);
}
@Override
public Double getResult(Tuple2<HashSet<String>, Long> accumulator) {
return (double) accumulator.f1 / accumulator.f0.size();
}
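@Override
public Tuple2<HashSet<String>, Long> merge(Tuple2<HashSet<String>, Long> a, Tuple2<HashSet<String>, Long> b) {
// Combine two partial accumulators: union the user sets and sum the view counts
a.f0.addAll(b.f0);
return Tuple2.of(a.f0, a.f1 + b.f1);
}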
}, new WindowFunction<Double, Tuple2<Integer,Double>, Integer, TimeWindow>() {
@Override
public void apply(Integer integer, TimeWindow window, Iterable<Double> input, Collector<Tuple2<Integer,Double>> out) throws Exception {
out.collect(Tuple2.of(integer,input.iterator().next()));
}
});
ElasticsearchSinkFunction<Tuple2<Integer,Double>> elasticsearchSinkFunction=new ElasticsearchSinkFunction<Tuple2<Integer,Double>>() {
@Override
public void process(Tuple2<Integer,Double> element, RuntimeContext ctx, RequestIndexer indexer) {
Map<String,String> result=new HashMap<>();
result.put("productId",Integer.valueOf(element.f0).toString());
result.put("pvuv",Double.valueOf(element.f1).toString());
IndexRequest indexRequest= Requests.indexRequest().index("flinkpvuv").type("logEntity").source(result).id(Integer.valueOf(element.f0).toString());
indexer.add(indexRequest);
}
};
List<HttpHost> httpPosts=new ArrayList<>();
httpPosts.add(new HttpHost("hbase",9200,"http"));
ElasticsearchSink.Builder<Tuple2<Integer,Double>> builder=new ElasticsearchSink.Builder<Tuple2<Integer,Double>>(httpPosts, elasticsearchSinkFunction);
builder.setBulkFlushMaxActions(1);
stream.print();
stream.addSink(builder.build());
env.execute("UVPV");
}
}
# Flink SQL version
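CEP: consecutive login failures
Next, consider a concrete requirement: monitor user behavior and emit an alert when a user fails to log in three times in a row. This is clearly a complex-event detection problem, and Flink CEP can be used to implement it.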
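Before looking at the CEP job, the matching rule itself can be written as a small per-user state machine in plain Java. This is a sketch of the strict-contiguity idea only, not Flink CEP code; the class and method names are hypothetical, and events are assumed to arrive in timestamp order for each user.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch: alert when a user has three login failures in a row.
final class ConsecutiveFailSketch {
    // Per user: timestamps of the most recent uninterrupted run of failures (at most 3 kept).
    private final Map<Integer, Deque<Long>> recentFails = new HashMap<>();

    // Returns the three failure timestamps when an alert is due, otherwise null.
    long[] onEvent(int userId, boolean failed, long ts) {
        Deque<Long> fails = recentFails.computeIfAbsent(userId, k -> new ArrayDeque<>());
        if (!failed) {
            fails.clear(); // a successful login breaks the run
            return null;
        }
        fails.addLast(ts);
        if (fails.size() > 3) {
            fails.removeFirst(); // keep only the latest three failures
        }
        if (fails.size() < 3) {
            return null;
        }
        long[] alert = new long[3];
        int i = 0;
        for (long t : fails) {
            alert[i++] = t;
        }
        return alert;
    }
}
```

With four failures in a row this sketch alerts twice (events 1 to 3, then 2 to 4). A pattern library expresses the same rule declaratively and additionally handles timing and out-of-order events.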
PUT /cepfailbehavior
{
"settings": {
"number_of_shards": 3
},
"mappings": {
"properties":{
"userId":{
"type":"keyword"
},
"first":{
"type":"date"
},
"second":{
"type":"date"
},
"third":{
"type":"date"
}
}
}
}
package com.demo.task.practice;
import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class FailBehavior {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
System.setProperty("HADOOP_USER_NAME", "root");
System.setProperty("user.name", "root");
env.setParallelism(4);
// checkpoint
env.enableCheckpointing(1000);
CheckpointConfig config=env.getCheckpointConfig();
config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
config.setMinPauseBetweenCheckpoints(500);
config.setCheckpointTimeout(60000);
config.setMaxConcurrentCheckpoints(1);
config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
config.enableUnalignedCheckpoints();
config.setCheckpointStorage("hdfs://hbase:9000/flink/checkpoints");
DataStream<LogEntity> sourceStream=env.addSource(new ClickSource())
.assignTimestampsAndWatermarks(
WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
.withTimestampAssigner(
new SerializableTimestampAssigner<LogEntity>() {
@Override
public long extractTimestamp(LogEntity value, long l)
{
return value.getTime();
}
}
)
)
.keyBy(LogEntity->LogEntity.getUserId());
// sourceStream.print();
Pattern<LogEntity,LogEntity> pattern=Pattern
.<LogEntity>begin("first")
.where(new SimpleCondition<LogEntity>() {
@Override
public boolean filter(LogEntity value) throws Exception {
return value.getAction().equals("1");
}
})
.next("second")
.where(new SimpleCondition<LogEntity>() {
@Override
public boolean filter(LogEntity value) throws Exception {
return value.getAction().equals("1");
}
})
.next("third")
.where(new SimpleCondition<LogEntity>() {
@Override
public boolean filter(LogEntity value) throws Exception {
return value.getAction().equals("1");
}
});
PatternStream<LogEntity> patternStream=CEP.pattern(sourceStream,pattern);
DataStream<Tuple4<Integer,Long,Long,Long>> stream=patternStream.select(new PatternSelectFunction<LogEntity, Tuple4<Integer,Long,Long,Long>>() {
@Override
public Tuple4<Integer,Long,Long,Long> select(Map<String, List<LogEntity>> map) throws Exception {
LogEntity first=map.get("first").get(0);
LogEntity second=map.get("second").get(0);
LogEntity third=map.get("third").get(0);
return Tuple4.of(first.getUserId(),first.getTime(),second.getTime(),third.getTime());
}
});
stream.print("warning");
// sink function
// HTTP hosts
// Elasticsearch sink builder
ElasticsearchSinkFunction<Tuple4<Integer,Long,Long,Long>> elasticsearchSinkFunction=new ElasticsearchSinkFunction<Tuple4<Integer,Long,Long,Long>>() {
@Override
public void process(Tuple4<Integer,Long,Long,Long> element, RuntimeContext ctx, RequestIndexer indexer) {
Map<String,String> result=new HashMap<>();
result.put("userId",Integer.valueOf(element.f0).toString());
result.put("first",Long.valueOf(element.f1).toString());
result.put("second",Long.valueOf(element.f2).toString());
result.put("third",Long.valueOf(element.f3).toString());
IndexRequest indexRequest= Requests.indexRequest().index("flinkwarning").type("logEntity").source(result).id(Integer.valueOf(element.f0).toString());
indexer.add(indexRequest);
}
};
List<HttpHost> httpPosts=new ArrayList<>();
httpPosts.add(new HttpHost("hbase",9200,"http"));
ElasticsearchSink.Builder<Tuple4<Integer,Long,Long,Long>> builder=new ElasticsearchSink.Builder<Tuple4<Integer,Long,Long,Long>>(httpPosts, elasticsearchSinkFunction);
builder.setBulkFlushMaxActions(1);
stream.addSink(builder.build());
env.execute("FailBehavior");
}
}
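Checkpoints
The CEP example above enables checkpointing.
1. When the streaming job has to be stopped, the checkpoint preserves its state. 2. On recovery, the checkpoint restores the job so that in-memory state and the state of external storage systems are brought back to a correct state after a failure.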
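Broadcast state: keyword filtering
Broadcast state provides global configuration: settings that may need to change at runtime are distributed to all parallel instances as broadcast state. In this example the broadcast configuration defines the job's processing rule.
https://blog.csdn.net/wangpei1949/article/details/99698978
The configuration is fetched from MySQL periodically and broadcast.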
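The effect of a broadcast rule can be seen in a minimal plain-Java model: one method receives configuration updates, the other filters data against the latest value. This is only a single-threaded sketch with hypothetical names; it does not reproduce how Flink distributes the configuration to parallel instances.

```java
// Plain-Java sketch of a dynamically updated filter rule.
final class DynamicRuleSketch {
    // Latest keyword received from the config stream; null until the first config arrives.
    private String keyword = null;

    // Config side: replace the current rule.
    void onConfig(String value) {
        keyword = value;
    }

    // Data side: pass only records whose productId matches the current keyword.
    boolean accept(int productId) {
        return keyword != null && productId == Integer.parseInt(keyword);
    }

    public static void main(String[] args) {
        DynamicRuleSketch rule = new DynamicRuleSketch();
        System.out.println(rule.accept(101)); // no rule yet
        rule.onConfig("101");
        System.out.println(rule.accept(101)); // rule now matches
    }
}
```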
PUT /boradcastf
{
"settings": {
"number_of_shards": 3
},
"mappings": {
"properties":{
"userId":{
"type":"keyword"
},
"first":{
"type":"date"
},
"second":{
"type":"date"
},
"third":{
"type":"date"
}
}
}
}
# Main function
package com.demo.task.practice;
import com.demo.domain.LogEntity;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;
public class KeywordsFilter {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<LogEntity> sourceStream=env.addSource(new ClickSource());
DataStream<String> configStream=env.addSource(new MysqlSource("hbase",3306,"con","root","root",1));
MapStateDescriptor<String,String> configStateDescriptor=new MapStateDescriptor<String, String>("config", Types.STRING,Types.STRING);
BroadcastStream<String> broadcastConfigStream=configStream.broadcast(configStateDescriptor);
BroadcastConnectedStream<LogEntity,String> broadcastConnectedStream=sourceStream.connect(broadcastConfigStream);
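DataStream<LogEntity> filterStream=broadcastConnectedStream.process(new BroadcastProcessFunction<LogEntity,String,LogEntity>(){
@Override
public void processElement(LogEntity value, ReadOnlyContext ctx, Collector<LogEntity> out) throws Exception {
// Read the current keyword from broadcast state; nothing passes until a config has arrived
String keyword=ctx.getBroadcastState(configStateDescriptor).get("keyword");
if(keyword!=null && value.getProductId()==Integer.parseInt(keyword))
{
out.collect(value);
}
}
@Override
public void processBroadcastElement(String value, Context ctx, Collector<LogEntity> out) throws Exception {
// Store the latest keyword in broadcast state so that it is checkpointed with the job
ctx.getBroadcastState(configStateDescriptor).put("keyword",value);
}
});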
filterStream.print();
env.execute("keywordFilter");
}
}
# Custom MySQL source
package com.demo.task.practice;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class MysqlSource extends RichSourceFunction<String> {
// volatile: cancel() is called from a different thread than run()
private volatile boolean running=true;
private Connection connection;
private String host;
private Integer port;
private String db;
private String user;
private String passwd;
private Integer secondInterval;
private PreparedStatement preparedStatement;
public MysqlSource(String host, Integer port, String db, String user, String passwd, Integer secondInterval) {
this.host = host;
this.port = port;
this.db = db;
this.user = user;
this.passwd = passwd;
this.secondInterval = secondInterval;
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
Class.forName("com.mysql.jdbc.Driver");
connection= DriverManager.getConnection("jdbc:mysql://"+host+":"+port+"/"+db+"?useUnicode=true&characterEncoding=UTF-8",user,passwd);
String sql="select keyword from config";
preparedStatement=connection.prepareStatement(sql);
}
@Override
public void run(SourceContext<String> ctx) throws Exception {
while (running){
// Re-read the keyword on every cycle and emit it; the result set is closed after each query
try(ResultSet resultset=preparedStatement.executeQuery()){
while (resultset.next()){
ctx.collect(resultset.getString("keyword"));
}
}
Thread.sleep(1000*secondInterval);
}
}
@Override
public void cancel() {
running=false;
}
@Override
public void close() throws Exception {
super.close();
// Close the statement before the connection that owns it
if(preparedStatement!=null){
preparedStatement.close();
}
if(connection!=null){
connection.close();
}
}
}
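Savepoints
Once checkpointing is enabled, you configure the checkpoint interval, whether checkpoints are stored in heap memory or in external storage such as HDFS, and exactly-once mode. Checkpoints are managed by Flink itself: when the trigger fires, the job state is saved automatically. After a failure, the saved checkpoint restores the application state and processing continues, which provides fault tolerance.
Savepoints, in contrast, are configured and managed by the user, including when they are triggered and how they are used. You can take savepoints of an application on a planned basis and later restore the application from one. To make a direct restore possible, a savepoint stores some extra metadata compared with a checkpoint, such as operator IDs.
Common scenarios:
1. Version management and archiving
2. Upgrading the Flink version
3. Updating the application
4. Changing the parallelism
5. Pausing the application
Note that program changes are compatible only under certain conditions: the topology of the state and its data types must not change. When using savepoints, set the operator IDs (uid) explicitly.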
package com.demo.task.practice;
import com.demo.domain.LogEntity;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class SavePoint {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
env.addSource(new ClickSource())
.uid("source-id")
.map(LogEntity->LogEntity)
.uid("mapper-id")
.print();
env.execute("savepoint");
}
}
bin/flink run -m hbase1:8081 -c com.demo.task.practice.SavePoint /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar
bin/flink savepoint jobId file:///opt/module/flink/savepoints
bin/flink savepoint jobId hdfs://hbase:9000/flink/savepoints
# Restore from a savepoint
bin/flink run -s hdfs://hbase:9000/flink/savepoints -c com.demo.task.practice.SavePoint /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar
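Project challenges
Changing configuration at runtime
A custom SourceFunction reads the data source and broadcasts it.
The broadcast side of the connected stream reads the configuration and updates the state. The data side reads the configuration state, so the rules change dynamically.
Stopping a job gracefully
The job polls a file path; when the file exists, the job stops.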
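The stop-file idea can be sketched as a loop that checks a stop condition before emitting each record. The class, method names and the example path are hypothetical, and the loop is bounded so that the sketch terminates; in a real source the same check would sit inside the `run` loop.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.BooleanSupplier;

// Plain-Java sketch of a stop-file check for ending a processing loop cleanly.
final class StopFlagSketch {
    // True once the marker file has been created (for example by an operator).
    static boolean stopFileExists(String path) {
        return Files.exists(Paths.get(path));
    }

    // Emits up to maxRecords, checking the stop condition before each one.
    // Returns how many records were emitted before stopping.
    static int runUntilStopped(BooleanSupplier stopRequested, int maxRecords) {
        int emitted = 0;
        while (emitted < maxRecords && !stopRequested.getAsBoolean()) {
            emitted++; // a real source would emit a record here
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical marker path; create the file to stop the loop early.
        int n = runUntilStopped(() -> stopFileExists("/tmp/flink-job.stop"), 3);
        System.out.println("emitted " + n + " records");
    }
}
```

Checking the flag between records lets the loop finish the current record and release resources in an orderly way instead of being killed mid-write.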