5. Core Programming
5.1 Environment
[Image: https://i.loli.net/2020/11/30/tM8sr2XA1LUZiwV.jpg]
# Submitting a jar to Flink
bin/flink run \
  -m hadoop102:6123 \
  -c Flink02_WordCount_BoundStream \
  /opt/module/data/flink-wc.jar
# -m: JobManager address, -c: main class, followed by the path to the jar
When a Flink job is submitted for execution, it must first establish a connection to the Flink framework, that is, obtain the current Flink runtime environment. Only once this environment information is available can tasks be scheduled to the various TaskManagers for execution. Obtaining the environment object is straightforward:
// Batch environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Streaming environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
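getExecutionEnvironment() adapts to wherever the job runs (IDE or cluster). If you need to target a specific cluster from code, a remote environment can also be created explicitly; a minimal sketch, reusing the host, port and jar path from the submit command above purely for illustration:
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "hadoop102",                       // JobManager host
        6123,                              // JobManager RPC port
        "/opt/module/data/flink-wc.jar");  // jar(s) containing the user code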
5.2 Source —— where the data is read (consumed) from
5.2.1 Reading file data
--main
1. Create the environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Source
// readTextFile: read data from a file
DataStreamSource<String> fileDS = env.readTextFile("input/word.txt");
// fromCollection: read data from a collection
DataStreamSource<String> collectionDS = env.fromCollection(Arrays.asList("1", "2", "3"));
// fromElements: read data from individual elements
DataStreamSource<Integer> elementDS = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8);
env.execute("source job");
5.2.2 Reading Kafka data
--main
1. Create the environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data from Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "hadoop102:9092");
properties.setProperty("group.id", "aaaaaa");
// Add the source: create a Kafka consumer that starts from the earliest offset
DataStreamSource<String> kafkaSource = env.addSource(
    new FlinkKafkaConsumer<String>(
        "sensor0621",              // topic
        new SimpleStringSchema(),  // deserialization schema
        properties                 // connection properties
    ).setStartFromEarliest()
);
kafkaSource.print();
env.execute();
5.2.3 Custom source
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Custom source
DataStreamSource<WaterSensor> inputDS = env.addSource(new MySourceFunction());
inputDS.print();
env.execute();
--Custom class implementing SourceFunction, parameterized with the POJO type
public static class MySourceFunction implements SourceFunction<WaterSensor> {
    // volatile for cross-thread visibility
    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<WaterSensor> ctx) throws Exception {
        Random random = new Random();
        while (isRunning) {
            ctx.collect(
                new WaterSensor(
                    "sensor_" + random.nextInt(3),  // random id among 3 sensors
                    System.currentTimeMillis(),
                    random.nextInt(10) + 40)
            );
            Thread.sleep(1000); // sleep one second per record
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
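The WaterSensor POJO used throughout these examples is never shown in the notes; judging from the constructor and the getId()/getTs()/getVc() calls, it is assumed to look roughly like this:
// Assumed POJO (sketch): fields inferred from WaterSensor(id, ts, vc) and the getters used below.
public class WaterSensor {
    private String id;   // sensor id
    private Long ts;     // timestamp in ms
    private Integer vc;  // water level value

    public WaterSensor() { }

    public WaterSensor(String id, Long ts, Integer vc) {
        this.id = id;
        this.ts = ts;
        this.vc = vc;
    }

    public String getId() { return id; }
    public Long getTs() { return ts; }
    public Integer getVc() { return vc; }
    public void setId(String id) { this.id = id; }
    public void setTs(Long ts) { this.ts = ts; }
    public void setVc(Integer vc) { this.vc = vc; }

    @Override
    public String toString() {
        return "WaterSensor{id='" + id + "', ts=" + ts + ", vc=" + vc + "}";
    }
}
The no-arg constructor and getters/setters matter: Flink treats the class as a POJO only if they exist, which is what makes field expressions like sum("vc") work later.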
5.3 Transform —— processing the data
[Image: https://i.loli.net/2020/11/30/YtQFk4oXsxT1en8.jpg]
5.3.1 Map
Plain function: MapFunction
Rich function: RichMapFunction
--Rich functions
1. Lifecycle methods: useful for managing external connections
   open() is called on initialization
   close() is called when there is no more data
   (when reading a file, close() is called twice)
2. Runtime context
   gives access to environment information and state
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
DataStreamSource<Integer> numDS = env.fromElements(1, 2, 3, 4, 5);
//TODO Map
SingleOutputStreamOperator<String> resultDS = numDS.map(new MyMapFunction());
resultDS.print();
env.execute();
--Custom class
public static class MyMapFunction implements MapFunction<Integer, String> {
    @Override
    public String map(Integer value) throws Exception {
        return String.valueOf(value * 2) + " =========== ";
    }
}
5.3.2 RichMapFunction
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
DataStreamSource<Integer> numDS = env.fromElements(1, 2, 3, 4, 5);
/*
Reading a file gets special treatment:
env.readTextFile("input/word.txt")
    .map(new MyRichMapFunction())
    .print();
*/
//TODO RichFunction
//1. Lifecycle methods: open, close => useful for managing external connections
//2. Runtime context: RuntimeContext => gives access to environment information, state, ...
SingleOutputStreamOperator<String> resultDS = numDS.map(new MyRichMapFunction());
resultDS.print();
env.execute();
--Custom class
public static class MyRichMapFunction extends RichMapFunction<Integer, String> {
    @Override
    public void open(Configuration parameters) throws Exception {
        System.out.println("open ...");
    }

    @Override
    public void close() throws Exception {
        System.out.println("close ...");
    }

    @Override
    public String map(Integer value) throws Exception {
        return value + "===========" + getRuntimeContext().getTaskNameWithSubtasks();
    }
}
5.3.3 FlatMap
1. Output goes through the collector: call collect() on it
2. Can achieve a filter-like effect: records that don't meet the condition are simply not sent downstream
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
//TODO flatMap: flattening, one record in, zero or more records out
//1. Can achieve a filter-like effect: if the condition isn't met, don't send the record downstream
env
    .fromElements(
        Arrays.asList(1, 2, 3, 4)
    )
    .flatMap(new FlatMapFunction<List<Integer>, String>() {
        @Override
        public void flatMap(List<Integer> value, Collector<String> out) throws Exception {
            for (Integer num : value) {
                if (num % 2 == 0) {
                    out.collect(num + " ");
                }
            }
        }
    })
    .print();
env.execute();
5.3.4 Filter
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
DataStreamSource<Integer> numDS = env.fromElements(1, 2, 3, 4, 5);
//TODO Filter
numDS
    /*
    .filter(new FilterFunction<Integer>() {
        @Override
        public boolean filter(Integer value) throws Exception {
            return value % 2 == 0;
        }
    })
    */
    .filter(data -> data % 2 == 0)
    .print();
env.execute();
5.3.5 KeyBy
The return type has two type parameters; the index/field-name variants always give Tuple as the key type:
  with a position index, only the position is known and the program cannot infer the type, so the key type is given as Tuple
  with a field name, only the name is known and the program cannot infer the type either, so the key type is also Tuple
--Correct usage:
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(sensor -> sensor.getId());
--Source-code overview:
the key is hashed twice:
  first: the key's own hashCode()
  second: a murmur hash
the double hash is taken modulo the default maxParallelism of 128, yielding a key-group id;
keyGroupId * parallelism / 128 then gives the channel index used by selectChannel
--keyBy is a logical grouping; it is not strongly bound to resources
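A minimal standalone sketch of that channel computation (it mirrors Flink's KeyGroupRangeAssignment; 128 is the default maxParallelism, and the method name here is illustrative):
import org.apache.flink.util.MathUtils;

// Illustrative only: how a key is mapped to a downstream channel.
public static int selectChannel(Object key, int parallelism) {
    int maxParallelism = 128;                                               // default
    int keyGroupId = MathUtils.murmurHash(key.hashCode()) % maxParallelism; // two hashes + modulo
    return keyGroupId * parallelism / maxParallelism;                       // channel index
}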
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0],
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]));
        }
    });
//TODO KeyBy
/*
keyBy groups the data => records of the same group end up together
keyBy is a logical grouping; it is not strongly bound to resources
*/
/*
Source-code overview:
the key is hashed twice:
  first: the key's own hashCode()
  second: a murmur hash
the double hash is taken modulo the default maxParallelism of 128, yielding a key-group id;
keyGroupId * parallelism / 128 then gives the channel index used by selectChannel
*/
SingleOutputStreamOperator<Tuple3<String, Long, Integer>> sensorTupleDS = sensorDS
    .map(new MapFunction<WaterSensor, Tuple3<String, Long, Integer>>() {
        @Override
        public Tuple3<String, Long, Integer> map(WaterSensor value) throws Exception {
            return Tuple3.of(value.getId(), value.getTs(), value.getVc());
        }
    });
// position index: only the position is known, the type cannot be inferred, so the key type is Tuple
KeyedStream<Tuple3<String, Long, Integer>, Tuple> sensorTupleKS = sensorTupleDS.keyBy(0);
// field name: only the name is known, the type cannot be inferred, so the key type is Tuple
KeyedStream<WaterSensor, Tuple> sensorFieldKS = sensorDS.keyBy("id");
/*
Correct usage
*/
// Variant 1: explicit KeySelector
KeyedStream<WaterSensor, String> sensorKS1 = sensorDS.keyBy(new KeySelector<WaterSensor, String>() {
    // extract (specify) the key from the record
    @Override
    public String getKey(WaterSensor value) throws Exception {
        return value.getId();
    }
});
// Variant 2: lambda
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(sensor -> sensor.getId());
sensorKS.print();
env.execute();
5.3.6 Shuffle
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
/*
read the file,
use map
to wrap each line into the POJO
*/
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0],
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]));
        }
    });
sensorDS.print("sensor");
//TODO Shuffle
DataStream<WaterSensor> shuffleDS = sensorDS.shuffle();
shuffleDS.print("shuffle");
env.execute();
5.3.7 Split —— new OutputSelector
--Splits a stream logically by tagging each record
--The records actually stay in one stream; when needed, pull them out by tag via select
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0], Long.valueOf(datas[1]), Integer.valueOf(datas[2]));
        }
    });
//TODO Split & Select
//a logical split: records are tagged but still live in one stream
//when needed, use select(tag) to retrieve the matching records
SplitStream<WaterSensor> sensorSS = sensorDS.split(new OutputSelector<WaterSensor>() {
    @Override
    public Iterable<String> select(WaterSensor value) {
        if (value.getVc() < 5) {
            return Arrays.asList("low", "hahaha");
        } else if (value.getVc() < 8) {
            return Arrays.asList("middle", "hahaha");
        } else {
            return Arrays.asList("high");
        }
    }
});
sensorSS.select("hahaha").print();
env.execute();
Merging streams
5.3.8 Connect
--connect joins two streams whose types may differ
1. Connects exactly two streams
2. The two streams may have different element types
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0],
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]));
        }
    }).setParallelism(2);
DataStreamSource<Integer> numDS = env.fromElements(1, 2, 3, 4);
//connect
ConnectedStreams<WaterSensor, Integer> sensorNumCS = sensorDS.connect(numDS);
//TODO connect: join two streams
//1. connects exactly two streams
//2. the two streams may have different element types
sensorNumCS
    .map(new CoMapFunction<WaterSensor, Integer, Object>() {
        @Override
        public Object map1(WaterSensor value) throws Exception {
            return "I am a WaterSensor: " + value;
        }

        @Override
        public Object map2(Integer value) throws Exception {
            return "I am the number " + value;
        }
    })
    .print("aaa");
env.execute();
5.3.9 Union
--union merges multiple streams of the same type
1. Can merge more than two streams
2. Every stream must have the same element type
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0], Long.valueOf(datas[1]), Integer.valueOf(datas[2]));
        }
    });
DataStreamSource<Integer> numDS = env.fromElements(1, 2, 3, 4);
DataStreamSource<Integer> numDS1 = env.fromElements(11, 12, 13, 14);
DataStreamSource<Integer> numDS2 = env.fromElements(111, 112, 113, 114);
//TODO union
//1. can merge more than two streams
//2. every stream must have the same element type
DataStream<Integer> resultDS = numDS.union(numDS1).union(numDS2);
resultDS.print();
env.execute();
5.4 Operator —— computation
5.4.1 RollAgg
--These rolling aggregations can only be called after keyBy
--Only the aggregated field is updated; all other fields keep the values of the group's first record (see the max/maxBy illustration after the example below)
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0], Long.valueOf(datas[1]), Integer.valueOf(datas[2]));
        }
    });
3. Key by sensor id
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(sensor -> sensor.getId());
4. Sum, max, min
sensorKS.sum("vc").print("sum");
sensorKS.max("vc").print("max");
sensorKS.min("vc").print("min");
env.execute();
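To make the "other fields keep the first record's values" rule concrete, here is a small illustration with hypothetical data (maxBy, by contrast, emits the entire record that holds the max):
// hypothetical input, all with key "sensor_1": (ts=1, vc=5), (ts=2, vc=3), (ts=3, vc=9)
DataStreamSource<WaterSensor> ds = env.fromElements(
        new WaterSensor("sensor_1", 1L, 5),
        new WaterSensor("sensor_1", 2L, 3),
        new WaterSensor("sensor_1", 3L, 9));
ds.keyBy(WaterSensor::getId).max("vc").print();
// max("vc") keeps ts from the FIRST record of the group:
//   (sensor_1, 1, 5) -> (sensor_1, 1, 5) -> (sensor_1, 1, 9)
// maxBy("vc") would emit the complete record holding the max vc instead:
//   (sensor_1, 1, 5) -> (sensor_1, 1, 5) -> (sensor_1, 3, 9)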
5.4.2 Reduce —— new ReduceFunction
--The return type follows the input: whatever the input type is, the output type is the same
1. [the return type must equal the input type]
2. reduce runs within each key group
3. The first record of each group does not go through the reduce method
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0], Long.valueOf(datas[1]), Integer.valueOf(datas[2]));
        }
    });
3. Key by sensor id
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(sensor -> sensor.getId());
//TODO Reduce
/*
the return type must match the input type
reduce runs within each key group
the first record of a group does not enter the reduce method
*/
SingleOutputStreamOperator<WaterSensor> resultDS = sensorKS.reduce(
    new ReduceFunction<WaterSensor>() {
        @Override
        public WaterSensor reduce(WaterSensor value1, WaterSensor value2) throws Exception {
            System.out.println(value1 + " -------- " + value2);
            // value1 is the aggregate so far, value2 the new record;
            // e.g. keep the latest ts and sum the vc values
            return new WaterSensor(value1.getId(), value2.getTs(), value1.getVc() + value2.getVc());
        }
    }
);
resultDS.print();
env.execute();
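An illustration of point 3 with hypothetical input for key "sensor_1", with vc summed by the reduce above:
// in:  (sensor_1, 1, 5)  -> out: (sensor_1, 1, 5)    first record: reduce is NOT called
// in:  (sensor_1, 2, 3)  -> out: (sensor_1, 2, 8)    reduce(previous aggregate, new record)
// in:  (sensor_1, 3, 9)  -> out: (sensor_1, 3, 17)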
5.4.3 Process
Type parameters:
<key type, input type, output type>
--sources are added via env
--sinks are added on a stream
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0], Long.valueOf(datas[1]), Integer.valueOf(datas[2]));
        }
    });
3. Key by sensor id
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(sensor -> sensor.getId());
//TODO Process
sensorKS
    .process(new MyKeyedProcessFunction())
    .print();
env.execute();
--Custom class
//type parameters: <key type, input type, output type>
public static class MyKeyedProcessFunction extends KeyedProcessFunction<String, WaterSensor, String> {
    @Override
    public void processElement(WaterSensor value, Context ctx, Collector<String> out) throws Exception {
        out.collect(value + ",key=" + ctx.getCurrentKey());
    }
}
5.5 Sink —— where the data is written
[Image: https://i.loli.net/2020/11/30/UGSAwI8un2ydOH3.jpg]
5.5.1 Kafka
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<String> sensorDS = env
    .readTextFile("input/sensor-data.log");
sensorDS.addSink(
    // create a Kafka producer
    new FlinkKafkaProducer<String>(
        // broker list
        "hadoop102:9092,hadoop103:9092",
        // topic
        "sensor0621",
        new SimpleStringSchema())
);
env.execute();
5.5.2 Redis
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the parallelism
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env.readTextFile("input/sensor-data.log")
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0],
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]));
        }
    });
//TODO Sink Redis
// first argument: the connection config
FlinkJedisPoolConfig config = new FlinkJedisPoolConfig.Builder()
    .setHost("hadoop102")
    .setPort(6379)
    .build();
// second argument: the mapper
MyRedisMapper myRedisMapper = new MyRedisMapper();
sensorDS.addSink(
    new RedisSink<WaterSensor>(config, myRedisMapper)
);
env.execute();
}
--Custom class
public static class MyRedisMapper implements RedisMapper<WaterSensor> {
    @Override
    public RedisCommandDescription getCommandDescription() {
        // for HSET, the additional key is the name of the Redis hash
        return new RedisCommandDescription(RedisCommand.HSET, "sensor0621");
    }

    // for a hash, this is the hash field
    @Override
    public String getKeyFromData(WaterSensor data) {
        return data.getTs().toString();
    }

    // for a hash, this is the hash value
    @Override
    public String getValueFromData(WaterSensor data) {
        return data.getVc().toString();
    }
}
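With this mapper, every record turns into one HSET on the single Redis hash sensor0621, with ts as the field and vc as the value; a quick check from redis-cli (the values here are made up for illustration) would look like:
// in redis-cli:
//   HGETALL sensor0621
//   1) "1607527992000"      (field: the record's ts)
//   2) "45"                 (value: the record's vc)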
5.5.3 ES
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env.socketTextStream("localhost", 9999)
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0],
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]));
        }
    });
//TODO Sink ElasticSearch
3. First Builder argument: the HTTP hosts (note: Elasticsearch's REST port is 9200 by default)
List<HttpHost> httpHosts = new ArrayList<>();
httpHosts.add(new HttpHost("hadoop102", 9200));
httpHosts.add(new HttpHost("hadoop103", 9200));
httpHosts.add(new HttpHost("hadoop104", 9200));
4. Second Builder argument: the sink function
MyElasticSearchSinkSFunction myElasticSearchSinkSFunction = new MyElasticSearchSinkSFunction();
ElasticsearchSink.Builder<WaterSensor> esBuilder = new ElasticsearchSink.Builder<>(httpHosts, myElasticSearchSinkSFunction);
// bulk capacity: flush after every single record
//TODO never set this to 1 in production, it hurts performance; here it just lets us
//see the unbounded-stream writes show up in ES immediately
esBuilder.setBulkFlushMaxActions(1);
sensorDS.addSink(esBuilder.build());
env.execute();
}
--Custom class
public static class MyElasticSearchSinkSFunction implements ElasticsearchSinkFunction<WaterSensor> {
    @Override
    public void process(WaterSensor element, RuntimeContext ctx, RequestIndexer indexer) {
        Map<String, String> sourceMap = new HashMap<>();
        sourceMap.put("data", element.toString());
        // build an IndexRequest
        IndexRequest indexRequest = Requests.indexRequest("sensor0621")
                .type("read")
                .source(sourceMap);
        // hand it to the indexer
        indexer.add(indexRequest);
    }
}
5.5.4 MySQL
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
SingleOutputStreamOperator<WaterSensor> sensorDS = env
    .readTextFile("input/sensor-data.log")
    // .socketTextStream("localhost", 9999)
    .map(new MapFunction<String, WaterSensor>() {
        @Override
        public WaterSensor map(String value) throws Exception {
            String[] datas = value.split(",");
            return new WaterSensor(datas[0],
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]));
        }
    });
//TODO Sink MySQL
sensorDS.addSink(new MySQLSink());
env.execute();
}
--Custom class
public static class MySQLSink extends RichSinkFunction<WaterSensor> {
    Connection conn = null;
    PreparedStatement pstmt = null;

    @Override
    public void open(Configuration parameters) throws Exception {
        // create the MySQL connection once per parallel instance
        conn = DriverManager.getConnection("jdbc:mysql://hadoop102:3306/test", "root", "000000");
        pstmt = conn.prepareStatement("INSERT INTO sensor VALUES (?,?,?)");
    }

    @Override
    public void invoke(WaterSensor value, Context context) throws Exception {
        // write one row per record
        pstmt.setString(1, value.getId());
        pstmt.setLong(2, value.getTs());
        pstmt.setInt(3, value.getVc());
        pstmt.execute();
    }

    @Override
    public void close() throws Exception {
        pstmt.close();
        conn.close();
    }
}
5.6 Hands-on Cases
5.6.1 Web traffic statistics based on tracking-log data
5.6.1.1 Case_PV
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
SingleOutputStreamOperator<UserBehavior> userbehaviorDS = env
    .readTextFile("input/UserBehavior.csv")
    .map(new MapFunction<String, UserBehavior>() {
        @Override
        public UserBehavior map(String value) throws Exception {
            String[] datas = value.split(",");
            return new UserBehavior(
                    Long.valueOf(datas[0]),
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]),
                    datas[3],
                    Long.valueOf(datas[4])
            );
        }
    });
3. Process the data
3.1 Filter as early as possible
SingleOutputStreamOperator<UserBehavior> pvDS = userbehaviorDS.filter(data -> "pv".equals(data.getBehavior()));
3.2 Following the wordcount idea, map each record to (pv, 1)
SingleOutputStreamOperator<Tuple2<String, Integer>> pvAndOneDS = pvDS.map(new MapFunction<UserBehavior, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(UserBehavior userBehavior) throws Exception {
        return Tuple2.of("pv", 1);
    }
});
3.3 Key by the pv behavior
KeyedStream<Tuple2<String, Integer>, String> pvAndOneKS = pvAndOneDS.keyBy(data -> data.f0);
3.4 Aggregate
SingleOutputStreamOperator<Tuple2<String, Integer>> pv = pvAndOneKS.sum(1);
4. Output
pv.print();
env.execute();
5.6.1.2 Case_PVByFlatMap
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Process
env.readTextFile("input/UserBehavior.csv")
    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
            String[] datas = value.split(",");
            // for pv records, map to (pv,1) and send downstream
            if ("pv".equals(datas[3])) {
                out.collect(Tuple2.of("pv", 1));
            }
        }
    })
    .keyBy(r -> r.f0)
    .sum(1)
    .print();
env.execute();
5.6.1.3 Case_PVByProcess
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
2. Read data
SingleOutputStreamOperator<UserBehavior> userbehaviorDS = env
    .readTextFile("input/UserBehavior.csv")
    .map(new MapFunction<String, UserBehavior>() {
        @Override
        public UserBehavior map(String value) throws Exception {
            String[] datas = value.split(",");
            return new UserBehavior(
                    Long.valueOf(datas[0]),
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]),
                    datas[3],
                    Long.valueOf(datas[4])
            );
        }
    });
3. Process the data
3.1 Filter as early as possible
SingleOutputStreamOperator<UserBehavior> pvDS = userbehaviorDS.filter(data -> "pv".equals(data.getBehavior()));
3.2 Key by the pv behavior
KeyedStream<UserBehavior, String> userbehavioKS = pvDS.keyBy(r -> r.getBehavior());
3.3 Count (even with parallelism 10, all records share the single key "pv" and therefore land in one subtask, so the per-instance counter is still correct)
SingleOutputStreamOperator<Long> pv = userbehavioKS.process(new KeyedProcessFunction<String, UserBehavior, Long>() {
    private long pvCount = 0L;

    @Override
    public void processElement(UserBehavior value, Context ctx, Collector<Long> out) throws Exception {
        // count each record as it arrives
        pvCount++;
        out.collect(pvCount);
    }
});
pv.print();
env.execute();
5.6.1.4 Case_PVByAcc
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);
2. Read data
SingleOutputStreamOperator<UserBehavior> userbehaviorDS = env
    // .readTextFile("input/UserBehavior.csv")
    .socketTextStream("localhost", 9999)
    .map(new MapFunction<String, UserBehavior>() {
        @Override
        public UserBehavior map(String value) throws Exception {
            String[] datas = value.split(",");
            return new UserBehavior(
                    Long.valueOf(datas[0]),
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]),
                    datas[3],
                    Long.valueOf(datas[4])
            );
        }
    });
3. Process the data
3.1 Filter as early as possible
SingleOutputStreamOperator<UserBehavior> pvDS = userbehaviorDS.filter(data -> "pv".equals(data.getBehavior()));
3.2 Count with an accumulator (accumulators are aggregated across all parallel instances when the job finishes)
pvDS
    .map(
        new RichMapFunction<UserBehavior, UserBehavior>() {
            // TODO 1. create the accumulator
            private LongCounter pvCount = new LongCounter();

            @Override
            public void open(Configuration parameters) throws Exception {
                // TODO 2. register the accumulator
                getRuntimeContext().addAccumulator("pvCount", pvCount);
            }

            @Override
            public UserBehavior map(UserBehavior value) throws Exception {
                // TODO 3. count with the accumulator
                pvCount.add(1L);
                System.out.println(value + " <-------------------> " + pvCount.getLocalValue());
                return value;
            }
        });
4. Fetch the accumulator value from the job execution result
JobExecutionResult result = env.execute();
Object pvCount = result.getAccumulatorResult("pvCount");
System.out.println("PV count: " + pvCount);
}
5.6.1.5 Case_UV
--Deduplicate by user id
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
SingleOutputStreamOperator<UserBehavior> userbehaviorDS = env
    .readTextFile("input/UserBehavior.csv")
    .map(new MapFunction<String, UserBehavior>() {
        @Override
        public UserBehavior map(String value) throws Exception {
            String[] datas = value.split(",");
            return new UserBehavior(
                    Long.valueOf(datas[0]),
                    Long.valueOf(datas[1]),
                    Integer.valueOf(datas[2]),
                    datas[3],
                    Long.valueOf(datas[4])
            );
        }
    });
3. Process the data
3.1 Filter as early as possible
SingleOutputStreamOperator<UserBehavior> pvDS = userbehaviorDS.filter(data -> "pv".equals(data.getBehavior()));
// deduplicate by userId
3.2 Map to (uv, userId)
// => first element: the fixed string "uv", used for keyBy
// => second element: the userId, to be added to a Set for deduplication; Set.size() then gives the UV count
SingleOutputStreamOperator<Tuple2<String, Long>> uvDS = pvDS.map(new MapFunction<UserBehavior, Tuple2<String, Long>>() {
    @Override
    public Tuple2<String, Long> map(UserBehavior value) throws Exception {
        return Tuple2.of("uv", value.getUserId());
    }
});
3.3 Key by "uv"
KeyedStream<Tuple2<String, Long>, String> uvKS = uvDS.keyBy(r -> r.f0);
3.4 Add the userIds to a Set
SingleOutputStreamOperator<Long> uvCount = uvKS.process(
    new KeyedProcessFunction<String, Tuple2<String, Long>, Long>() {
        // a Set holding the userIds, for deduplication
        Set<Long> uvSet = new HashSet<>();

        @Override
        public void processElement(Tuple2<String, Long> value, Context ctx, Collector<Long> out) throws Exception {
            // take the userId
            Long userId = value.f1;
            uvSet.add(userId);
            out.collect(Long.valueOf(uvSet.size()));
        }
    }
);
4. Output
uvCount.print();
env.execute();
5.6.2 Marketing business-metric analysis
5.6.2.1 Case_APPMarketingAnalysis —— by channel
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
DataStreamSource<MarketingUserBehavior> appDS = env.addSource(new AppMarketingLog());
3. Process the data
3.1 Key by the statistics dimension (behavior, channel) => sum
appDS
    .map(new MapFunction<MarketingUserBehavior, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> map(MarketingUserBehavior value) throws Exception {
            return Tuple2.of(value.getBehavior() + "_" + value.getChannel(), 1);
        }
    })
    .keyBy(r -> r.f0)
    .sum(1)
    .print();
env.execute();
--Custom class
public static class AppMarketingLog implements SourceFunction<MarketingUserBehavior> {
    private volatile boolean isRunning = true;
    private List<String> behaviorList = Arrays.asList("DOWNLOAD", "INSTALL", "UPDATE", "UNINSTALL");
    private List<String> channelList = Arrays.asList("XIAOMI", "HUAWEI", "OPPO", "VIVO", "APPSTORE");

    @Override
    public void run(SourceContext<MarketingUserBehavior> ctx) throws Exception {
        Random random = new Random();
        while (isRunning) {
            ctx.collect(
                new MarketingUserBehavior(
                    random.nextLong(),
                    behaviorList.get(random.nextInt(behaviorList.size())),
                    channelList.get(random.nextInt(channelList.size())),
                    System.currentTimeMillis())
            );
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
5.6.2.2 Case_APPMarketingAnalysisWithoutChannel —— without channel breakdown
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
DataStreamSource<MarketingUserBehavior> appDS = env.addSource(new AppMarketingLog());
3. Process the data
3.1 Key by the statistics dimension (behavior) => sum
appDS
    .map(new MapFunction<MarketingUserBehavior, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> map(MarketingUserBehavior value) throws Exception {
            return Tuple2.of(value.getBehavior(), 1);
        }
    })
    .keyBy(r -> r.f0)
    .sum(1)
    .print();
env.execute();
--Custom class: the same AppMarketingLog source as in 5.6.2.1.
5.6.3 Page advertisement analysis
5.6.3.1 Case_AdClickAnalysis
--Click counts per province and per advertisement
1. Get the environment
2. Read the data, wrapping each record into the POJO
3. Analyze the data:
   wrap into a Tuple with a MapFunction
   keyBy
   sum
   print
4. Execute
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
2. Read data
SingleOutputStreamOperator<AdClickLog> adClickDS = env
    .readTextFile("input/AdClickLog.csv")
    .map(new MapFunction<String, AdClickLog>() {
        @Override
        public AdClickLog map(String value) throws Exception {
            String[] datas = value.split(",");
            return new AdClickLog(
                    Long.valueOf(datas[0]),
                    Long.valueOf(datas[1]),
                    datas[2],
                    datas[3],
                    Long.valueOf(datas[4])
            );
        }
    });
3. Process the data
3.1 Key by the statistics dimension (province, ad) => sum
adClickDS
    .map(new MapFunction<AdClickLog, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> map(AdClickLog value) throws Exception {
            return Tuple2.of(value.getProvince() + "_" + value.getAdId(), 1);
        }
    })
    .keyBy(r -> r.f0)
    .sum(1)
    .print();
env.execute();
5.6.4 Real-time order payment monitoring
5.6.4.1 Case_OrderTxAnalysis
1. Create the environment
2. Read the data: two streams, each record wrapped into its POJO
3. Process the data with process() on the connected streams:
   connect the two streams
   keyBy(key1, key2)
   ====================================
   alternatively, keyBy each stream first and then connect
   ====================================
   use process with a
   CoProcessFunction<first stream's input type, second stream's input type, output type>
   (whichever stream calls process is the first stream)
   and override two methods:
   processElement1:
     check whether the transaction record for this txId has already arrived
       arrived: reconciliation succeeds, remove the cached transaction record => remove(txId)
       not yet: cache this business record => put(txId [value.getTxId()], value)
   processElement2:
     check whether the business record for this txId has already arrived
       arrived => reconciliation succeeds, remove the cached business record => remove(txId)
       not yet => cache itself (the transaction record) => put(txId, value)
--main
1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
2. Read data
2.1 Read the business-system data
SingleOutputStreamOperator<OrderEvent> orderDS = env
    .readTextFile("input/OrderLog.csv")
    .map(new MapFunction<String, OrderEvent>() {
        @Override
        public OrderEvent map(String value) throws Exception {
            String[] datas = value.split(",");
            return new OrderEvent(
                    Long.valueOf(datas[0]),
                    datas[1],
                    datas[2],
                    Long.valueOf(datas[3]));
        }
    });
2.2 Read the transaction-system data
SingleOutputStreamOperator<TxEvent> txDS = env
    .readTextFile("input/ReceiptLog.csv")
    .map(new MapFunction<String, TxEvent>() {
        @Override
        public TxEvent map(String value) throws Exception {
            String[] datas = value.split(",");
            return new TxEvent(
                    datas[0],
                    datas[1],
                    Long.valueOf(datas[2]));
        }
    });
3. Process the data: join the two streams on the transaction id
3.1 Connect the two streams; with connect, remember to keyBy
ConnectedStreams<OrderEvent, TxEvent> orderTxCS = orderDS
    .keyBy(order -> order.getTxId())
    .connect(txDS.keyBy(tx -> tx.getTxId()));
3.2 Use process
SingleOutputStreamOperator<String> resultDS = orderTxCS
    // .keyBy(order -> order.getTxId(), tx -> tx.getTxId())
    .process(new OrderTxDetect());
4. Output
resultDS.print();
env.execute();
}
--Custom class
public static class OrderTxDetect extends CoProcessFunction<OrderEvent, TxEvent, String> {
    // cache for transaction-system records
    private Map<String, TxEvent> txEventMap = new HashMap<>();
    // cache for business-system records
    private Map<String, OrderEvent> orderEventMap = new HashMap<>();

    /**
     * Handles business-system records.
     */
    @Override
    public void processElement1(OrderEvent value, Context ctx, Collector<String> out) throws Exception {
        // a business record arrived:
        // has the transaction record with the same txId arrived yet?
        if (value.getTxId() != null) {
            if (txEventMap.containsKey(value.getTxId())) {
                // 1. it has => reconciliation succeeds; remove the cached transaction record
                out.collect("Order " + value.getOrderId() + " reconciled successfully!");
                txEventMap.remove(value.getTxId());
            } else {
                // 2. it hasn't => cache this business record
                orderEventMap.put(value.getTxId(), value);
            }
        }
    }

    /**
     * Handles transaction-system records.
     */
    @Override
    public void processElement2(TxEvent value, Context ctx, Collector<String> out) throws Exception {
        // a transaction record arrived:
        // has the business record with the same txId arrived yet?
        if (orderEventMap.containsKey(value.getTxId())) {
            // 1. it has => reconciliation succeeds; remove the cached business record
            out.collect("Order " + orderEventMap.get(value.getTxId()).getOrderId() + " reconciled successfully!");
            orderEventMap.remove(value.getTxId());
        } else {
            // 2. it hasn't => cache this transaction record
            txEventMap.put(value.getTxId(), value);
        }
    }
}