Basic Transformation Operators
- map: takes one input record and emits exactly one result; it cannot emit nothing
- flatMap: takes one input record and can emit zero or more results
- filter: emits the record only if the predicate returns true
```java
package transform;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import wc.WordCountSet;

/**
 * @Author: yingtian
 * @Date: 2021/05/13
 * @Description: basic operators map, flatMap, filter
 */
public class TransformTest1_Base {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Read the input data
        String inputPath = WordCountSet.class.getClassLoader().getResource("sensor.txt").getFile();
        DataStreamSource<String> dataStream = env.readTextFile(inputPath);
        // map: emit the length of each line (exactly one output per input)
        SingleOutputStreamOperator<Integer> mapStream = dataStream.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String s) throws Exception {
                return s.length();
            }
        });
        // flatMap: split each line on commas and emit every field (zero or more outputs per input)
        SingleOutputStreamOperator<String> flatmapStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String s, Collector<String> collector) throws Exception {
                String[] split = s.split(",");
                for (String str : split) {
                    collector.collect(str);
                }
            }
        });
        // filter: keep only the lines that start with "sensor_1"
        SingleOutputStreamOperator<String> filterStream = dataStream.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String s) throws Exception {
                return s.startsWith("sensor_1");
            }
        });
        // Print the results
        mapStream.print("map");
        flatmapStream.print("flatmap");
        filterStream.print("filter");

        env.execute();
    }
}
```
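The one-in/one-out versus zero-or-more contracts of the three operators can be mirrored with plain `java.util.stream` operations. This is a JDK-only sketch of the semantics, not the Flink API (the class name and sample lines below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BaseOperatorSemantics {
    // map: exactly one output per input (here: line -> its length)
    public static List<Integer> mapLengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: zero or more outputs per input (here: split each line on commas)
    public static List<String> flatMapFields(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(",")))
                .collect(Collectors.toList());
    }

    // filter: a record is emitted only when the predicate is true
    public static List<String> filterSensor1(List<String> lines) {
        return lines.stream().filter(l -> l.startsWith("sensor_1")).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("sensor_1,1547718199,35.8", "sensor_6,1547718201,15.4");
        System.out.println(mapLengths(lines));     // one length per input line
        System.out.println(flatMapFields(lines));  // three fields per input line
        System.out.println(filterSensor1(lines));  // only the sensor_1 line survives
    }
}
```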
Aggregation Operators
In Flink's design, data must be grouped before it can be aggregated: first call keyBy to obtain a KeyedStream, then call its aggregation methods such as reduce or sum (group first, aggregate second).
The common aggregation operators are:
- keyBy: DataStream -> KeyedStream, logically partitions the stream by key so that records with the same key are processed together
- rolling aggregations (Rolling Aggregation): sum, min, max, minBy, maxBy on a KeyedStream
```java
package transform;

import bean.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import wc.WordCountSet;

/**
 * @Author: yingtian
 * @Date: 2021/05/08
 * @Description: testing max and maxBy
 */
public class TransformTest2_RollingAggregation {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Read the input data
        String inputPath = WordCountSet.class.getClassLoader().getResource("sensor.txt").getFile();
        DataStreamSource<String> dataStream = env.readTextFile(inputPath);
        // Parse each line into a SensorReading
        SingleOutputStreamOperator<SensorReading> sensorStream = dataStream.map((MapFunction<String, SensorReading>) line -> {
            String[] split = line.split(",");
            return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
        });
        // Group by sensor id
        KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);
        // Rolling maximum temperature per group
        SingleOutputStreamOperator<SensorReading> maxStream = keyedStream.max("temperature");
        SingleOutputStreamOperator<SensorReading> maxByStream = keyedStream.maxBy("temperature");

        maxStream.print("max");
        maxByStream.print("maxBy");

        env.execute();
    }
}
```
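The difference between `max` and `maxBy` above is easy to miss: `max("temperature")` updates only the aggregated field and keeps the other fields from the first record seen for that key, while `maxBy("temperature")` returns the entire record that holds the current maximum. A plain-Java sketch of that semantics (JDK only, not the Flink API; the `Reading` class stands in for `SensorReading`):

```java
import java.util.ArrayList;
import java.util.List;

public class MaxVsMaxBy {
    static class Reading {
        final String id;
        final long ts;
        final double temp;
        Reading(String id, long ts, double temp) { this.id = id; this.ts = ts; this.temp = temp; }
    }

    // max("temperature"): only the aggregated field is updated; the other
    // fields stay as they were in the first record of the key
    public static Reading rollingMax(List<Reading> in) {
        Reading acc = null;
        for (Reading r : in) {
            acc = (acc == null) ? r : new Reading(acc.id, acc.ts, Math.max(acc.temp, r.temp));
        }
        return acc;
    }

    // maxBy("temperature"): the whole record holding the maximum is kept
    public static Reading rollingMaxBy(List<Reading> in) {
        Reading acc = null;
        for (Reading r : in) {
            acc = (acc == null || r.temp > acc.temp) ? r : acc;
        }
        return acc;
    }

    // Returns "timestamp from max,timestamp from maxBy" for a two-record key
    public static String demo() {
        List<Reading> in = new ArrayList<>();
        in.add(new Reading("sensor_1", 1L, 35.8));
        in.add(new Reading("sensor_1", 2L, 37.2));
        return rollingMax(in).ts + "," + rollingMaxBy(in).ts;
    }

    public static void main(String[] args) {
        // max keeps the first record's timestamp (1); maxBy returns the newer record (2)
        System.out.println(demo());
    }
}
```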
- reduce

reduce covers more general aggregation scenarios; in Java you implement the ReduceFunction functional interface.

```java
package transform;

import bean.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import wc.WordCountSet;

/**
 * @Author: yingtian
 * @Date: 2021/05/14
 * @Description: testing reduce
 */
public class TransformTest2_Reduce {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Read the input data
        String inputPath = WordCountSet.class.getClassLoader().getResource("sensor.txt").getFile();
        DataStreamSource<String> dataStream = env.readTextFile(inputPath);
        // Parse each line into a SensorReading
        SingleOutputStreamOperator<SensorReading> sensorStream = dataStream.map((MapFunction<String, SensorReading>) line -> {
            String[] split = line.split(",");
            return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
        });
        // Group by sensor id
        KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);
        // Per key, keep the highest temperature seen so far while always
        // carrying the latest timestamp (value2 is the newest record)
        SingleOutputStreamOperator<SensorReading> reduceStream = keyedStream.reduce((ReduceFunction<SensorReading>) (value1, value2) ->
                new SensorReading(value1.getId(), value2.getTimestamp(), Math.max(value1.getTemperature(), value2.getTemperature())));

        reduceStream.print();

        env.execute();
    }
}
```
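The reduce lambda above is a fold over each key's records: advance to the newest timestamp but retain the maximum temperature. Its merge step can be checked outside Flink with the same logic (a JDK-only sketch; the `{timestamp, temperature}` pair encoding is an assumption for brevity):

```java
public class ReduceSemantics {
    // Same merge logic as the ReduceFunction above: take the newer record's
    // timestamp, keep the maximum temperature. acc/next are {timestamp, temperature}.
    public static double[] merge(double[] acc, double[] next) {
        return new double[] { next[0], Math.max(acc[1], next[1]) };
    }

    public static void main(String[] args) {
        double[] state = {1547718199, 35.8};
        state = merge(state, new double[] {1547718201, 15.4});
        // The timestamp advances to the newest record, the temperature stays at the max
        System.out.println(state[0] + "," + state[1]);
    }
}
```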
Multi-Stream Transformation Operators
- OutputTag

An OutputTag splits one stream into several according to a condition (side outputs).

```java
package transform;

import bean.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import wc.WordCountSet;

/**
 * @Author: yingtian
 * @Date: 2021/05/14
 * @Description: splitting a stream with side outputs
 */
public class TransformTest4_MultipleStreams {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Read the input data
        String inputPath = WordCountSet.class.getClassLoader().getResource("sensor.txt").getFile();
        DataStreamSource<String> dataStream = env.readTextFile(inputPath);
        // Parse each line into a SensorReading
        SingleOutputStreamOperator<SensorReading> sensorStream = dataStream.map((MapFunction<String, SensorReading>) line -> {
            String[] split = line.split(",");
            return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
        });
        // Define the side-output tags: split the stream at the 30-degree threshold
        OutputTag<SensorReading> high = new OutputTag<SensorReading>("high"){};
        OutputTag<SensorReading> low = new OutputTag<SensorReading>("low"){};
        // Route each record to the matching side output
        SingleOutputStreamOperator<SensorReading> outputStream = sensorStream.process(new ProcessFunction<SensorReading, SensorReading>() {
            @Override
            public void processElement(SensorReading sensorReading, Context context, Collector<SensorReading> collector) throws Exception {
                collector.collect(sensorReading); // regular output
                if (sensorReading.getTemperature() > 30) { // side outputs
                    context.output(high, sensorReading);
                } else {
                    context.output(low, sensorReading);
                }
            }
        });
        // Fetch the side-output streams
        DataStream<SensorReading> highStream = outputStream.getSideOutput(high);
        DataStream<SensorReading> lowStream = outputStream.getSideOutput(low);
        // Print
        outputStream.print("out");
        highStream.print("high");
        lowStream.print("low");

        env.execute();
    }
}
```
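The ProcessFunction above routes every record to the `high` or `low` side output around the 30-degree threshold, while the main stream still carries all records. The routing rule itself amounts to a boolean partition, which can be checked with the JDK alone (a semantic sketch, not Flink API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitSemantics {
    // Partition temperatures the same way the ProcessFunction routes them:
    // > 30 goes to the "high" side output (key true), the rest to "low" (key false)
    public static Map<Boolean, List<Double>> split(List<Double> temps) {
        return temps.stream().collect(Collectors.partitioningBy(t -> t > 30));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Double>> parts = split(Arrays.asList(35.8, 15.4, 6.7, 38.1));
        System.out.println("high=" + parts.get(true) + " low=" + parts.get(false));
    }
}
```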
- Connect

DataStream, DataStream -> ConnectedStreams: connects two data streams while preserving their (possibly different) element types. After connect, the two streams merely live inside one ConnectedStreams; internally each keeps its own data and form unchanged, and the two remain independent of each other.

- CoMap

ConnectedStreams -> DataStream: operates on a ConnectedStreams with the same functionality as map and flatMap, applying a separate map or flatMap function to each of the two contained streams.
```java
// NOTE: the tutorial shows only the main method; the package, imports and
// class wrapper below are filled in to make the example self-contained.
package transform;

import bean.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import wc.WordCountSet;

public class TransformTest5_CoMap {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Read the input data
        String inputPath = WordCountSet.class.getClassLoader().getResource("sensor.txt").getFile();
        DataStreamSource<String> dataStream = env.readTextFile(inputPath);
        // Parse each line into a SensorReading
        SingleOutputStreamOperator<SensorReading> sensorStream = dataStream.map((MapFunction<String, SensorReading>) line -> {
            String[] split = line.split(",");
            return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
        });
        // Define the side-output tags: split the stream at the 30-degree threshold
        OutputTag<SensorReading> high = new OutputTag<SensorReading>("high"){};
        OutputTag<SensorReading> low = new OutputTag<SensorReading>("low"){};
        // Route each record to the matching side output
        SingleOutputStreamOperator<SensorReading> outputStream = sensorStream.process(new ProcessFunction<SensorReading, SensorReading>() {
            @Override
            public void processElement(SensorReading sensorReading, Context context, Collector<SensorReading> collector) throws Exception {
                collector.collect(sensorReading); // regular output
                if (sensorReading.getTemperature() > 30) { // side outputs
                    context.output(high, sensorReading);
                } else {
                    context.output(low, sensorReading);
                }
            }
        });
        // Fetch the side-output streams
        DataStream<SensorReading> highStream = outputStream.getSideOutput(high);
        DataStream<SensorReading> lowStream = outputStream.getSideOutput(low);
        // Connect the two streams
        ConnectedStreams<SensorReading, SensorReading> connectStream = highStream.connect(lowStream);
        // Process the two streams separately with a CoMapFunction
        SingleOutputStreamOperator<String> coMapStream = connectStream.map(new CoMapFunction<SensorReading, SensorReading, String>() {
            @Override
            public String map1(SensorReading value) throws Exception {
                return value.getId() + ":" + value.getTemperature() + ":high warning";
            }

            @Override
            public String map2(SensorReading value) throws Exception {
                return value.getId() + ":" + value.getTemperature() + ":low warning";
            }
        });
        // Print
        coMapStream.print();

        env.execute();
    }
}
```
- Union

DataStream -> DataStream: unions two or more DataStreams, producing a new DataStream that contains all the elements of the input streams.
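In Flink this is a single call such as `highStream.union(lowStream)`, and any number of streams of the same element type can be merged at once. The type-level contrast with connect can be sketched with plain collections (JDK only, not the Flink API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionSemantics {
    // union: any number of inputs, all of the SAME element type, one output
    @SafeVarargs
    public static <T> List<T> union(List<T>... streams) {
        List<T> out = new ArrayList<>();
        for (List<T> s : streams) {
            out.addAll(s);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three same-typed "streams" merge into one; mixing a List<Integer>
        // into this call would not compile, mirroring union's type constraint
        List<String> merged = union(Arrays.asList("a"), Arrays.asList("b"), Arrays.asList("c"));
        System.out.println(merged);
    }
}
```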
Summary
- map must emit exactly one result per input
- flatMap can emit zero or more results per input
- filter emits the record only when the predicate is true
- min, max, minBy, maxBy, reduce and the other aggregations require a prior keyBy. If the records are tuples, a field position (index) can be used as the grouping/aggregation argument; for a POJO only field names can be used, and the POJO must have getters/setters and a no-argument constructor
- connect can merge exactly two streams, whose element types may differ; coMap then applies a separate function to each of the two streams and returns one result stream
- union can merge any number of streams, but they must all have the same element type
- Stream transformation diagram:

P.S.: The notes above are organized from the SGG (尚硅谷) tutorial.