Flink Transform Operators
Data Preparation
- All of the examples below use the following sample data:
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547728199,25.8
sensor_6,1547712201,35.4
sensor_7,1547718102,16.7
sensor_10,1547712205,28.1
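The examples below all rely on a `SenSorReading` POJO imported from a `beans` package that is never shown. A minimal sketch of what it presumably looks like (the field names are inferred from the getters used later; this is an assumption, not the original class):

```java
// Assumed shape of the SenSorReading POJO used throughout these examples.
public class SenSorReading {
    private String id;          // sensor id, e.g. "sensor_1"
    private Long timeStamp;     // event timestamp in seconds
    private Double temperature; // temperature reading

    // Flink POJOs need a public no-arg constructor
    public SenSorReading() {}

    public SenSorReading(String id, Long timeStamp, Double temperature) {
        this.id = id;
        this.timeStamp = timeStamp;
        this.temperature = temperature;
    }

    public String getId() { return id; }
    public Long getTimeStamp() { return timeStamp; }
    public Double getTemperature() { return temperature; }

    public void setId(String id) { this.id = id; }
    public void setTimeStamp(Long timeStamp) { this.timeStamp = timeStamp; }
    public void setTemperature(Double temperature) { this.temperature = temperature; }

    @Override
    public String toString() {
        return "SenSorReading{id='" + id + "', timeStamp=" + timeStamp
                + ", temperature=" + temperature + "}";
    }
}
```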
Basic Transformation Operators (Map, FlatMap, Filter)
Map
- map transforms each element from one form into another, one output per input.
- Example: convert each input string into its length.
SingleOutputStreamOperator<Integer> mapStream = inputStream.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String s) throws Exception {
        return s.length();
    }
});
- The above is the full anonymous-class form; a shorter lambda form (or the method reference String::length) does the same thing:
SingleOutputStreamOperator<Integer> map = inputStream.map(str -> str.length());
- Output:
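The per-element transformation that map performs can be mimicked with plain Java streams over the sample lines; this is just an illustration of the semantics, not Flink code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "sensor_1,1547718199,35.8",
                "sensor_6,1547718201,15.4");
        // Same one-in/one-out transformation as the Flink map above:
        // each line is replaced by its length.
        List<Integer> lengths = lines.stream()
                .map(String::length)
                .collect(Collectors.toList());
        System.out.println(lengths); // [24, 24]
    }
}
```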
flatMap
- flatMap maps each input element to zero, one, or many output elements.
- Example: split each line on commas and emit every field as a separate element.
SingleOutputStreamOperator<String> flatMapStream = inputStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String s, Collector<String> collector) throws Exception {
        String[] fields = s.split(",");
        for (String field : fields) {
            collector.collect(field);
        }
    }
});
- Output:
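flatMap's zero-or-more output contract can likewise be sketched in plain Java (not Flink): one input line fans out into several field tokens, mirroring `collector.collect(field)` above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlatMapSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "sensor_1,1547718199,35.8",
                "sensor_6,1547718201,15.4");
        // Each line produces three output elements, so two inputs
        // become six outputs.
        List<String> fields = new ArrayList<>();
        for (String line : lines) {
            fields.addAll(Arrays.asList(line.split(",")));
        }
        System.out.println(fields.size()); // 6
    }
}
```

Note: if you rewrite the Flink flatMap as a lambda, generic-type erasure typically forces an explicit output-type hint such as `.returns(Types.STRING)`.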
Filter
- filter keeps only the elements that satisfy a predicate.
- Example: keep the records whose id starts with sensor_1.
SingleOutputStreamOperator<String> filterStream = inputStream.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String s) throws Exception {
        return s.startsWith("sensor_1");
    }
});
- Output:
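One caveat with the `startsWith("sensor_1")` predicate: it also matches the `sensor_10` records. A quick plain-Java check over the sample data shows the difference between a prefix match and an exact id match (illustration only, not Flink code):

```java
import java.util.Arrays;
import java.util.List;

public class FilterSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "sensor_1,1547718199,35.8", "sensor_6,1547718201,15.4",
                "sensor_7,1547718202,6.7",  "sensor_10,1547718205,38.1",
                "sensor_1,1547728199,25.8", "sensor_6,1547712201,35.4",
                "sensor_7,1547718102,16.7", "sensor_10,1547712205,28.1");
        // Prefix match: catches sensor_1 AND sensor_10 records.
        long prefixMatches = lines.stream()
                .filter(s -> s.startsWith("sensor_1")).count();
        // Exact id match: compare the first comma-separated field instead.
        long exactMatches = lines.stream()
                .filter(s -> s.split(",")[0].equals("sensor_1")).count();
        System.out.println(prefixMatches + " vs " + exactMatches); // 4 vs 2
    }
}
```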
KeyBy
- DataStream → KeyedStream: logically splits a stream into disjoint partitions, each containing the elements that share the same key; internally this is implemented via hash partitioning.
- Rolling aggregation operators can then aggregate each partition of the KeyedStream.
- Code:
package transform;

import beans.SenSorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByAndRollAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // read data from a file
        DataStreamSource<String> inputStream = env.readTextFile("src/main/resources/sensor.txt");
        // convert to the custom type
        // SingleOutputStreamOperator<SenSorReading> dataStream = inputStream.map(new MapFunction<String, SenSorReading>() {
        //     @Override
        //     public SenSorReading map(String s) throws Exception {
        //         String[] fields = s.split(",");
        //         return new SenSorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        //     }
        // });
        SingleOutputStreamOperator<SenSorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SenSorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
        /**
         * Two ways to specify the key:
         * 1. If the data type is a tuple (Tuple), use the field position directly.
         * 2. Use the field name, i.e. the name of the key attribute of the custom class.
         */
        KeyedStream<SenSorReading, Tuple> keyByStream = dataStream.keyBy("id");
        // KeyedStream<SenSorReading, String> keyByStream02 = dataStream.keyBy(data -> data.getId());
        // rolling aggregation
        /**
         * What is a rolling aggregation?
         * The result is continuously updated: every incoming record produces an updated output.
         */
        SingleOutputStreamOperator<SenSorReading> resultStream = keyByStream.max("temperature");
        /**
         * What is the difference between maxBy and max?
         * max only guarantees the maximum of the specified field; the other fields
         * of the output record do not necessarily come from the same input record.
         * maxBy keeps the entire record that holds the maximum.
         */
        SingleOutputStreamOperator<SenSorReading> maxByresultStream = keyByStream.maxBy("temperature");
        maxByresultStream.print("maxBy");
        resultStream.print("max");
        env.execute();
    }
}
- What is the difference between max and maxBy? max returns the maximum of the specified field, but the other fields of the output record do not necessarily come from the same input record; maxBy keeps the entire record that holds the maximum.
- From the original data, the record with the maximum temperature should be sensor_10,1547718205,38.1. Let's see how the outputs of max and maxBy differ.
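The difference can be sketched in plain Java with two hypothetical records where the hotter reading arrives second. For maxBy the whole winning record survives; for max, in commonly observed Flink behavior, only the aggregated field is guaranteed to be the maximum while the other fields stay as in the earlier record (this sketch encodes that assumed behavior, it is not Flink itself):

```java
public class MaxVsMaxBySketch {
    public static void main(String[] args) {
        // Two hypothetical records for the same key, as (timeStamp, temperature):
        long tsA = 100; double tempA = 10.0; // arrives first
        long tsB = 200; double tempB = 30.0; // arrives second, hotter

        // maxBy keeps the entire winning record:
        long maxByTs = tempB > tempA ? tsB : tsA;
        double maxByTemp = Math.max(tempA, tempB);

        // max (assumed behavior) only guarantees the aggregated field;
        // the other fields remain those of the earlier record:
        long maxTs = tsA;
        double maxTemp = Math.max(tempA, tempB);

        System.out.println("maxBy -> (" + maxByTs + ", " + maxByTemp + ")"); // (200, 30.0)
        System.out.println("max   -> (" + maxTs + ", " + maxTemp + ")");     // (100, 30.0)
    }
}
```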
Custom Reduce
- KeyedStream → DataStream: an aggregation over a keyed stream that merges the current element with the previous aggregation result to produce a new value. The returned stream contains the result of every aggregation step, not just the final one.
- Code:
package transform;

import beans.SenSorReading;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class reduceTest01 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // read data from a file
        DataStreamSource<String> inputStream = env.readTextFile("src/main/resources/sensor.txt");
        SingleOutputStreamOperator<SenSorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SenSorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
        KeyedStream<SenSorReading, String> keyByStream = dataStream.keyBy(data -> data.getId());
        // reduce: keep the maximum temperature together with the latest timestamp
        // SingleOutputStreamOperator<SenSorReading> reduceStream = keyByStream.reduce(new ReduceFunction<SenSorReading>() {
        //     /**
        //      * @param senSorReading the previously aggregated result
        //      * @param t1 the newly arrived record
        //      */
        //     @Override
        //     public SenSorReading reduce(SenSorReading senSorReading, SenSorReading t1) throws Exception {
        //         return new SenSorReading(senSorReading.getId(), t1.getTimeStamp(), Math.max(senSorReading.getTemperature(), t1.getTemperature()));
        //     }
        // });
        SingleOutputStreamOperator<SenSorReading> reduceStream = keyByStream.reduce((curr, newData) ->
                new SenSorReading(curr.getId(), newData.getTimeStamp(), Math.max(curr.getTemperature(), newData.getTemperature())));
        reduceStream.print();
        env.execute();
    }
}
- Output:
- Note that the timestamp is updated to the latest one.
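The reduce semantics here (maximum temperature, latest timestamp, per key) can be reproduced with a plain-Java fold over the two sensor_1 sample records; this mirrors the lambda above without needing a Flink runtime:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "sensor_1,1547718199,35.8",
                "sensor_1,1547728199,25.8");
        // per-key state: [timeStamp, temperature]
        Map<String, double[]> state = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            long ts = Long.parseLong(f[1]);
            double temp = Double.parseDouble(f[2]);
            state.merge(f[0], new double[]{ts, temp},
                    // same logic as the ReduceFunction above:
                    // newest timestamp, maximum temperature
                    (curr, next) -> new double[]{next[0], Math.max(curr[1], next[1])});
        }
        double[] r = state.get("sensor_1");
        System.out.println((long) r[0] + "," + r[1]); // 1547728199,35.8
    }
}
```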
Splitting and Combining
Split and Select
- Split: DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some criterion.
- Select: SplitStream → DataStream: retrieves one or more DataStreams from a SplitStream.
- Example: split the data at a temperature of 30 degrees.
package transform;

import beans.SenSorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;

import java.util.Collections;

public class CollectAndColMap {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // read data from a file
        DataStreamSource<String> inputStream = env.readTextFile("src/main/resources/sensor.txt");
        SingleOutputStreamOperator<SenSorReading> mapStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SenSorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
        // split the stream at the 30-degree threshold
        SplitStream<SenSorReading> splitStream = mapStream.split(new OutputSelector<SenSorReading>() {
            @Override
            public Iterable<String> select(SenSorReading senSorReading) {
                return (senSorReading.getTemperature() > 30 ? Collections.singletonList("high") : Collections.singletonList("low"));
            }
        });
        DataStream<SenSorReading> highStream = splitStream.select("high");
        DataStream<SenSorReading> lowStream = splitStream.select("low");
        DataStream<SenSorReading> allStream = splitStream.select("high", "low");
        highStream.print("high");
        lowStream.print("low");
        allStream.print("all");
        env.execute();
    }
}
- Output:
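Worth knowing: `split`/`select` was deprecated in later Flink releases, with side outputs (`OutputTag` plus a process function) as the replacement. The 30-degree routing rule itself is simple enough to check with plain Java over the sample temperatures:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    public static void main(String[] args) {
        double[] temps = {35.8, 15.4, 6.7, 38.1, 25.8, 35.4, 16.7, 28.1};
        List<Double> high = new ArrayList<>();
        List<Double> low = new ArrayList<>();
        // Same 30-degree routing rule as the OutputSelector above.
        for (double t : temps) {
            (t > 30 ? high : low).add(t);
        }
        System.out.println(high.size() + " high, " + low.size() + " low"); // 3 high, 5 low
    }
}
```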
Connect and CoMap
- Connect: DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After being connected, the two streams merely share one container; internally each keeps its own data and form unchanged, and the two remain independent.
- CoMap: ConnectedStreams → DataStream: works on ConnectedStreams with the same role as map and flatMap, applying a separate map or flatMap to each of the two streams.
- Connect and CoMap can join streams of different element types; below we connect the high- and low-temperature streams.
package transform;

import beans.SenSorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;

import java.util.Collections;

public class CollectAndColMap {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // read data from a file
        DataStreamSource<String> inputStream = env.readTextFile("src/main/resources/sensor.txt");
        SingleOutputStreamOperator<SenSorReading> mapStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SenSorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
        // split the stream at the 30-degree threshold
        SplitStream<SenSorReading> splitStream = mapStream.split(new OutputSelector<SenSorReading>() {
            @Override
            public Iterable<String> select(SenSorReading senSorReading) {
                return (senSorReading.getTemperature() > 30 ? Collections.singletonList("high") : Collections.singletonList("low"));
            }
        });
        DataStream<SenSorReading> highStream = splitStream.select("high");
        DataStream<SenSorReading> lowStream = splitStream.select("low");
        DataStream<SenSorReading> allStream = splitStream.select("high", "low");
        highStream.print("high");
        lowStream.print("low");
        allStream.print("all");
        /**
         * Merge the streams:
         * convert the high-temperature stream into tuples, connect it with
         * the low-temperature stream, then output status information.
         */
        SingleOutputStreamOperator<Tuple2<String, Double>> highMapStream = highStream.map(new MapFunction<SenSorReading, Tuple2<String, Double>>() {
            @Override
            public Tuple2<String, Double> map(SenSorReading senSorReading) throws Exception {
                return new Tuple2<>(senSorReading.getId(), senSorReading.getTemperature());
            }
        });
        // connect the two streams
        ConnectedStreams<Tuple2<String, Double>, SenSorReading> connectStream = highMapStream.connect(lowStream);
        /**
         * The meaning of the three type parameters:
         * the first two are the element types of the two connected streams,
         * the third is the output type to convert to.
         */
        SingleOutputStreamOperator<Object> map = connectStream.map(new CoMapFunction<Tuple2<String, Double>, SenSorReading, Object>() {
            @Override
            public Object map1(Tuple2<String, Double> stringDoubleTuple2) throws Exception {
                return new Tuple3<>(stringDoubleTuple2.f0, stringDoubleTuple2.f1, "high temp warning");
            }

            @Override
            public Object map2(SenSorReading senSorReading) throws Exception {
                return new Tuple2<>(senSorReading.getId(), "normal");
            }
        });
        map.print();
        env.execute();
    }
}
- Output:
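The CoMap idea, one handler per input type with both handlers feeding a single output, can be sketched without Flink like this (the lists and messages are illustrative only):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoMapSketch {
    public static void main(String[] args) {
        // Two inputs of different element types, like the connected streams above.
        List<Double> highTemps = Arrays.asList(35.8, 38.1);
        List<String> lowIds = Arrays.asList("sensor_6", "sensor_7");

        List<String> out = new ArrayList<>();
        // map1: applied to elements of the first stream
        for (double t : highTemps) out.add(t + " high temp warning");
        // map2: applied to elements of the second stream
        for (String id : lowIds) out.add(id + " normal");

        System.out.println(out.size()); // 4
    }
}
```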
Union
- Is there a way to combine more than two streams? Yes: union can, but union can only combine streams of the same element type.
DataStream<SenSorReading> union = highStream.union(lowStream, allStream);
Rich Functions
- A "rich function" is a function-class interface provided by the DataStream API; every Flink function class has a Rich version. It differs from a regular function in that it can access the context of the runtime environment and has lifecycle methods, which makes more complex functionality possible.
RichMapFunction
RichFlatMapFunction
RichFilterFunction
- A Rich Function has a lifecycle. Typical lifecycle methods are:
open() is the rich function's initialization method; it is called before an operator such as map or filter is invoked.
close() is the last method called in the lifecycle and does cleanup work.
getRuntimeContext() provides information from the function's RuntimeContext, such as the parallelism of the function, the task name, and state.
- The main uses of rich functions are accessing runtime information and doing preparation work before the program processes data.
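The open → map… → close ordering can be illustrated without Flink; the `RichLike` class below is a simplified stand-in for a rich function, with the framework's lifecycle driven by hand:

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    // Simplified stand-in for a RichMapFunction: open() runs before any
    // elements, close() after the last one.
    static abstract class RichLike {
        void open() {}
        abstract String map(String in);
        void close() {}
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        RichLike fn = new RichLike() {
            @Override void open() { calls.add("open"); }
            @Override String map(String in) { calls.add("map"); return in; }
            @Override void close() { calls.add("close"); }
        };
        // The framework drives the lifecycle; here we do it by hand.
        fn.open();
        for (String s : new String[]{"a", "b"}) fn.map(s);
        fn.close();
        System.out.println(calls); // [open, map, map, close]
    }
}
```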
package transform;

import beans.SenSorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RichFunctionTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // read data from a file
        DataStreamSource<String> inputStream = env.readTextFile("src/main/resources/sensor.txt");
        SingleOutputStreamOperator<SenSorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SenSorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
        DataStream<Tuple2<String, Integer>> res = dataStream.map(new MyMapper());
        res.print();
        env.execute();
    }

    // a plain MapFunction, for comparison
    public static class MyMapper01 implements MapFunction<SenSorReading, Tuple2<String, Integer>> {
        @Override
        public Tuple2<String, Integer> map(SenSorReading senSorReading) throws Exception {
            return new Tuple2<>(senSorReading.getId(), senSorReading.getId().length());
        }
    }

    /**
     * The main benefit is access to a lot of runtime information.
     */
    // custom rich function class
    public static class MyMapper extends RichMapFunction<SenSorReading, Tuple2<String, Integer>> {
        @Override
        public Tuple2<String, Integer> map(SenSorReading senSorReading) throws Exception {
            return new Tuple2<>(senSorReading.getId(), getRuntimeContext().getIndexOfThisSubtask() + 1);
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // initialization work: typically setting up state or opening a database connection
            System.out.println("open");
        }

        @Override
        public void close() throws Exception {
            // cleanup work: release caches, close connections
            System.out.println("close");
        }
    }
}
- Output: