Flink Stream Processing API
1 Environment
1.1 getExecutionEnvironment
Creates an execution environment that represents the context of the current program. If the program is invoked standalone, this method returns a local execution environment; if the program is submitted to a cluster through the command-line client, it returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is being run, which makes it the most common way to create an execution environment.
// batch (DataSet) environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// streaming (DataStream) environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
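If the environment should not be inferred, it can also be requested explicitly. A minimal sketch; the host, port, and jar path are hypothetical placeholders:
// force a local environment (here with parallelism 1)
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(1);
// connect to a remote cluster; the jar containing the job classes is shipped along
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 6123, "path/to/your-job.jar");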
2 Source
2.1 Reading from a collection
public class SourceFromCollectionTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStream = env.fromCollection(Arrays.asList("1", "2", "3"));
        dataStream.print().setParallelism(1);
        env.execute();
    }
}
2.2 Reading from a file
DataStreamSource<String> dataStream = env.readTextFile("file_path");
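The path is not limited to the local filesystem; readTextFile accepts any URI supported by Flink's file systems. A sketch with a hypothetical HDFS address:
DataStreamSource<String> hdfsStream = env.readTextFile("hdfs://namenode:9000/path/to/file");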
2.3 Reading from a socket
DataStream<String> dataStream = env.socketTextStream("127.0.0.1", 60000);
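For a quick local test, start the server side first with netcat, e.g. nc -lk 60000, then type lines into it.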
2.4 Reading from Kafka
- Code
public class SourceKafkaTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");
        // read data from Kafka
        DataStream<String> dataStream = env.addSource(
                new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
        dataStream.print().setParallelism(1);
        env.execute();
    }
}
- Test: with the job running, produce a few messages to the test topic, e.g. with the console producer that ships with Kafka (kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic test); every line sent should be printed by the job.
2.5 Custom source
- Custom source function
/**
 * Custom source: emits sourceTxt plus a counter every `sleep` milliseconds.
 */
public class MySourceFunction implements SourceFunction<String> {
    // volatile so that cancel(), called from another thread, is visible in run()
    private volatile boolean isRunning = true;
    private String sourceTxt;
    private Long sleep;

    public MySourceFunction(String sourceTxt, Long sleep) {
        this.sourceTxt = sourceTxt;
        this.sleep = sleep;
    }

    public void run(SourceContext<String> sourceContext) throws Exception {
        int i = 0;
        while (isRunning) {
            i = i + 1;
            sourceContext.collect(sourceTxt + i);
            Thread.sleep(this.sleep);
        }
    }

    public void cancel() {
        this.isRunning = false;
    }
}
- Usage
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> dataStream = env.addSource(new MySourceFunction("thread_1, name,", 1000L));
dataStream.print().setParallelism(1);
env.execute();
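With the arguments above, the job prints one line per second: "thread_1, name,1", "thread_1, name,2", and so on, until the source is cancelled.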
4 Transform operators
4.1 map
Role: DataStream → DataStream: takes one element and produces exactly one element.
public class MapTransFormTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");
        // read data from Kafka
        DataStream<String> dataStream = env.addSource(
                new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
        // map each record to a tuple and count the occurrences of each word
        dataStream.map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
        env.execute();
    }

    public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {
        public Tuple2<String, Integer> map(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
        }
    }
}
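Since MapFunction has a single method, the mapper can also be written as a lambda. With generic tuples Flink then needs an explicit type hint, because the lambda erases the tuple's type parameters. A sketch, assuming the same dataStream as above and org.apache.flink.api.common.typeinfo.Types on the classpath:
dataStream
        .map(s -> Tuple2.of(s, 1))
        // the lambda hides the tuple's generic types, so declare them explicitly
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .sum(1)
        .print()
        .setParallelism(1);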
4.2 flatMap
Role: DataStream → DataStream: takes one element and produces zero or more elements.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    dataStream.flatMap(new MyFlatFunction()).keyBy(0).sum(1).print().setParallelism(1);
    env.execute();
}

public static class MyFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
        collector.collect(new Tuple2<String, Integer>(s, 1));
    }
}
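MyFlatFunction above emits exactly one element per input, so it does not yet show the zero-or-more behavior. A sketch of a variant that splits each line on commas and emits one tuple per non-empty token:
public static class SplitFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
        // one input line may produce zero, one, or many output elements
        for (String token : s.split(",")) {
            if (!token.isEmpty()) {
                collector.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }
}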
4.3 filter
Role: filters elements, keeping only those for which the predicate returns true.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    //dataStream.map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
    //dataStream.flatMap(new MyFlatFunction()).keyBy(0).sum(1).print().setParallelism(1);
    dataStream.filter(new MyFilterFunction()).map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
    env.execute();
}

public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<String, Integer>(s, 1);
    }
}

public static class MyFilterFunction implements FilterFunction<String> {
    // keep only strings that start with "a"
    public boolean filter(String s) throws Exception {
        return s.startsWith("a");
    }
}
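The same predicate can also be written inline as a lambda; no type hint is needed here since the output type is not generic:
dataStream.filter(s -> s.startsWith("a")).map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);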
4.4 KeyBy
DataStream → KeyedStream: logically partitions a stream into disjoint partitions, with each partition containing the elements that share the same key; internally this is implemented with hash partitioning.
- Code
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    DataStream<Tuple2<String, Integer>> flatMapStream = dataStream.flatMap(new MyFlatFunction());
    flatMapStream.keyBy(0).max(1).print().setParallelism(1);
    flatMapStream.keyBy(0).maxBy(1).print().setParallelism(2);
    env.execute();
}

public static class MyFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
        // attach a random value so that max/maxBy have something to aggregate
        Random random = new Random();
        collector.collect(new Tuple2<String, Integer>(s, random.nextInt(100)));
    }
}
4.5 Rolling aggregation operators
- sum()
- min()
- max()
- minBy()
- maxBy()
max, min and sum return the maximum, minimum and sum of the specified field respectively, leaving the other non-key fields at the values of the first element seen; minBy and maxBy instead return the entire element that contains the minimum or maximum.
Example reference: https://blog.csdn.net/Jackson_mvp/article/details/105825898
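To make the difference concrete, a sketch with a keyed stream of Tuple3<String, Integer, Long> elements (as in the Reduce example below): given the inputs ("a", 1, 100) and then ("a", 5, 200) for the same key, keyBy(0).max(1) emits ("a", 1, 100) followed by ("a", 5, 100), i.e. only the aggregated field f1 is updated while the remaining fields keep the values of the first element; keyBy(0).maxBy(1) emits ("a", 1, 100) followed by ("a", 5, 200), the complete element that holds the maximum.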
4.6 Reduce
KeyedStream → DataStream: an aggregation over a keyed stream that combines the current element with the last aggregated value to produce a new value. The returned stream contains the result of every aggregation step, not just the final result of the last aggregation.
- Code
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    dataStream.map(new MyMapFunction()).keyBy(0).reduce(new MyReduceFunction()).print().setParallelism(1);
    env.execute();
}

// <String, Integer, Long> ==> key, value, timestamp
public static class MyMapFunction implements MapFunction<String, Tuple3<String, Integer, Long>> {
    public Tuple3<String, Integer, Long> map(String s) throws Exception {
        String[] split = s.split(",");
        return new Tuple3<String, Integer, Long>(split[0], Integer.valueOf(split[1]), Long.valueOf(split[2]));
    }
}

public static class MyReduceFunction implements ReduceFunction<Tuple3<String, Integer, Long>> {
    /**
     * Keep the maximum value seen so far, and always carry the latest timestamp.
     */
    public Tuple3<String, Integer, Long> reduce(Tuple3<String, Integer, Long> oldVal, Tuple3<String, Integer, Long> newVal) throws Exception {
        if (oldVal.f1 < newVal.f1) {
            return newVal;
        }
        return new Tuple3<String, Integer, Long>(oldVal.f0, oldVal.f1, newVal.f2);
    }
}
- Result: for example, sending a,3,1000, a,1,2000, a,5,3000 to the topic should yield (a,3,1000), (a,3,2000), (a,5,3000): the maximum value is carried forward while the timestamp always advances to the latest one.
4.7 Split and Select (stream splitting)
4.7.1 Concepts
Note: deprecated and removed as of Flink 1.13.0; newer versions use side outputs (an OutputTag with a ProcessFunction) instead.
- Split: DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some criteria.
- Select: SplitStream → DataStream: retrieves one or more DataStreams from a SplitStream.
4.7.2 Example: split the reported data into two streams around the threshold 50
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> dataStream = env.fromCollection(Arrays.asList(
            "a,1",
            "a,49",
            "a,55",
            "a,58"
    ));
    SplitStream<Tuple2<String, Integer>> splitStream = dataStream.map(new MyMapFunction()).split(new MySplitFunction());
    splitStream.select("high").print("high").setParallelism(1);
    splitStream.select("low").print("low").setParallelism(1);
    splitStream.select("high", "low").print("all").setParallelism(1);
    env.execute();
}

public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {
    public Tuple2<String, Integer> map(String s) throws Exception {
        String[] split = s.split(",");
        return new Tuple2<String, Integer>(split[0], Integer.valueOf(split[1].trim()));
    }
}

public static class MySplitFunction implements OutputSelector<Tuple2<String, Integer>> {
    public Iterable<String> select(Tuple2<String, Integer> data) {
        // tag each element "high" or "low" around the threshold of 50
        if (data.f1 > 50) {
            return Collections.singletonList("high");
        }
        return Collections.singletonList("low");
    }
}
- Result: the expected output (modulo interleaving of the three prints) is low> (a,1) and low> (a,49) on the low stream, high> (a,55) and high> (a,58) on the high stream, and all four elements prefixed with all> on the combined stream.
4.8 Connect and CoMap
4.8.1 Concepts
- Connect: DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After connect, the two streams are merely placed inside one shared stream; internally each keeps its own data and form unchanged, and the two remain independent of each other.
- CoMap/CoFlatMap: ConnectedStreams → DataStream: operates on ConnectedStreams with the same functionality as map and flatMap, applying a map or flatMap transformation to each of the two streams separately.
- Code
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Tuple2<String, Integer>> dataStream1 = env.fromCollection(Arrays.asList(
            new Tuple2<String, Integer>("a", 1),
            new Tuple2<String, Integer>("a", 2),
            new Tuple2<String, Integer>("a", 3),
            new Tuple2<String, Integer>("a", 4),
            new Tuple2<String, Integer>("a", 5),
            new Tuple2<String, Integer>("a", 6)
    ));
    DataStreamSource<Tuple2<String, String>> dataStream2 = env.fromCollection(Arrays.asList(
            new Tuple2<String, String>("b", "1"),
            new Tuple2<String, String>("b", "2"),
            new Tuple2<String, String>("b", "3"),
            new Tuple2<String, String>("b", "4"),
            new Tuple2<String, String>("b", "5"),
            new Tuple2<String, String>("b", "6"),
            new Tuple2<String, String>("b", "7")
    ));
    // connect the streams; the data of the two streams stays independent
    ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, String>> connectStream = dataStream1.connect(dataStream2);
    connectStream.map(new CoMapFunction<Tuple2<String, Integer>, Tuple2<String, String>, Object>() {
        // arbitrary transformation of elements from the first stream
        @Override
        public Object map1(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
            return stringIntegerTuple2;
        }

        // arbitrary transformation of elements from the second stream
        @Override
        public Object map2(Tuple2<String, String> stringStringTuple2) throws Exception {
            return stringStringTuple2;
        }
    }).print().setParallelism(1);
    env.execute();
}
4.9 Union
DataStream → DataStream: unions two or more DataStreams, producing a new DataStream that contains all elements from all of the input streams.
4.9.1 Differences between Connect and Union:
- Union requires the two streams to have the same type; Connect allows different types, which can then be reconciled in the subsequent coMap.
- Connect can only operate on two streams, whereas Union can operate on more than two.
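Union has no code example above; a minimal sketch, assuming three existing DataStream<String> variables stream1, stream2 and stream3:
// union accepts any number of streams, as long as they all share the same type
DataStream<String> unioned = stream1.union(stream2, stream3);
unioned.print().setParallelism(1);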
5 Rich Functions
"Rich functions" are a function-class interface provided by the DataStream API; every Flink function class has a Rich version. They differ from regular functions in that they can access the runtime context and provide lifecycle methods, which makes more complex functionality possible.
- RichMapFunction
- RichFlatMapFunction
- RichFilterFunction
- …
A Rich Function has a notion of lifecycle. Typical lifecycle methods are:
- open() is the initialization method of a rich function; it is called before an operator such as map or filter is invoked.
- close() is the last method called in the lifecycle; it performs cleanup work.
- getRuntimeContext() provides information from the function's RuntimeContext, such as the parallelism the function runs with, the task name, and the state.
public static class MyMapFunction extends RichMapFunction<SensorReading, Tuple2<Integer, String>> {
    @Override
    public Tuple2<Integer, String> map(SensorReading value) throws Exception {
        return new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), value.getId());
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        System.out.println("my map open");
        // initialization work goes here, e.g. establishing a connection to HDFS
    }

    @Override
    public void close() throws Exception {
        System.out.println("my map close");
        // cleanup work goes here, e.g. closing the connection to HDFS
    }
}
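SensorReading is not defined in this article; a minimal hypothetical POJO sufficient for the example above could look like this:
public static class SensorReading {
    // hypothetical: only the id field used by the example is included
    private String id;

    public SensorReading() {}

    public SensorReading(String id) {
        this.id = id;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }
}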