Flink Stream Processing API
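1. Environment: Creating the Execution Environment
As the figure above shows, data flows through a Flink program as follows:
- First, Flink creates an execution environment (Environment).
- The environment's source produces the streaming data as a DataStream.
- Transformation operators (transform) turn that DataStream into another DataStream.
- A sink specifies where the output goes.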
1.1 getExecutionEnvironment
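Creates an execution environment that represents the context in which the current program runs. If the program is invoked standalone, this method returns a local execution environment; if the program is submitted to a cluster from the command-line client, it returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is run, which makes it the most commonly used way to create an execution environment.
// Batch environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Streaming environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
If parallelism is not set in code, a job submitted to a cluster uses the value configured in flink-conf.yaml (parallelism.default), which defaults to 1. When the program runs locally, for example from an IDE, the default parallelism is the number of CPU cores.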
1.2 createLocalEnvironment
Returns a local execution environment. The default parallelism can be passed as an argument.
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(1);
1.3 createRemoteEnvironment
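Returns a cluster execution environment and submits the Jar to a remote server. You must specify the JobManager's host and port, and the Jar file(s) to run on the cluster.
// Signature: createRemoteEnvironment(String host, int port, String... jarFiles)
// The host, port, and Jar path below are placeholders, not real values.
StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 6123, "path/to/job.jar");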
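2. Source: Reading Data
Related blog post:
Flink: three ways to create an Environment and four ways to read from a Source (collection, Kafka, file, custom)
2.1 Creating a data stream from a collection or file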
package com.root.source;
import com.root.SensorReading;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.Arrays;
/**
* @author Kewei
* @Date 2022/3/4 17:28
*/
public class SourceTest1_Collection {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read data from a collection
DataStreamSource<SensorReading> sensorDataStreamSource = env.fromCollection(
Arrays.asList(
new SensorReading("sensor_1", 1547718199L, 35.8),
new SensorReading("sensor_6", 1547718201L, 15.4),
new SensorReading("sensor_7", 1547718202L, 6.7),
new SensorReading("sensor_10", 1547718205L, 38.1)
)
);
// Read data from a file
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
sensorDataStreamSource.print();
env.execute();
}
}
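2.2 Reading data from Kafka
Omitted for now because Kafka is not installed on my virtual machine; to be added later.
2.3 Custom Source
A custom source implements the SourceFunction interface, overriding its run method (and cancel, which stops the loop in run).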
package com.root.source;
import com.root.SensorReading;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.util.HashMap;
import java.util.Random;
/**
* @author Kewei
* @Date 2022/3/4 17:39
*/
public class SourceTest2_MySource {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<SensorReading> dataStream = env.addSource(new MySource());
dataStream.print();
env.execute();
}
public static class MySource implements SourceFunction<SensorReading>{
// volatile, because cancel() is called from a different thread than run()
private volatile boolean running = true;
@Override
public void run(SourceContext<SensorReading> sourceContext) throws Exception {
Random random = new Random();
// Holds the current temperature of each sensor
HashMap<String, Double> sensorTempMap = new HashMap<>();
// Initialize 10 sensors with random starting temperatures
for (int i = 0; i < 10; i++) {
sensorTempMap.put("sensor_" + (i + 1), 60 + random.nextGaussian() * 20);
}
// Until cancelled: nudge every temperature, stamp it with the current time, and emit it
while (running){
for (String sensorId : sensorTempMap.keySet()) {
double newTemp = sensorTempMap.get(sensorId) + random.nextGaussian();
sensorTempMap.put(sensorId, newTemp);
sourceContext.collect(new SensorReading(sensorId, System.currentTimeMillis(), newTemp));
}
// Throttle the output: pause for 1 second after each round of 10 readings
Thread.sleep(1000L);
}
}
@Override
public void cancel() {
this.running = false;
}
}
}
3. Transform: Transformation Operators
A transformation operator converts one DataStream into another DataStream.
3.1 Basic transformation operators: map/flatMap/filter
Example:
package com.root.transfrom;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
/**
* @author Kewei
* @Date 2022/3/4 20:30
*/
public class TransFormTest1_Base {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("data/text.txt");
// map: replace each line with its length
SingleOutputStreamOperator<Integer> mapStream = inputStream.map(new MapFunction<String, Integer>() {
@Override
public Integer map(String s) throws Exception {
return s.length();
}
});
// flatMap: split each line on "," and emit the pieces one by one
SingleOutputStreamOperator<String> flatMapStream = inputStream.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String s, Collector<String> collector) throws Exception {
String[] words = s.split(",");
for (String word : words) {
collector.collect(word);
}
}
});
// filter: keep lines that start with sensor_1 (note: this prefix also matches sensor_10, sensor_11, ...)
SingleOutputStreamOperator<String> filterStream = inputStream.filter(new FilterFunction<String>() {
@Override
public boolean filter(String s) throws Exception {
return s.startsWith("sensor_1");
}
});
mapStream.print();
flatMapStream.print();
filterStream.print();
env.execute();
}
}
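To see what each of the three operators produces without starting a Flink job, the same logic can be applied to an in-memory list with java.util.stream. This is only an analogy, not Flink code: the class BasicOperatorsSketch and its method names are hypothetical, and the sample lines are assumed to look like the sensor readings used elsewhere in this article.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BasicOperatorsSketch {
    // map: exactly one output per input (here, the length of each line)
    public static List<Integer> lengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: zero or more outputs per input (here, the comma-separated fields)
    public static List<String> fields(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(",")))
                .collect(Collectors.toList());
    }

    // filter: keep only the inputs that satisfy the predicate
    public static List<String> startingWith(List<String> lines, String prefix) {
        return lines.stream()
                .filter(line -> line.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed sample input, one reading per line
        List<String> lines = Arrays.asList(
                "sensor_1,1547718199,35.8",
                "sensor_6,1547718201,15.4",
                "sensor_10,1547718205,38.1");
        System.out.println(lengths(lines));
        System.out.println(fields(lines));
        System.out.println(startingWith(lines, "sensor_1"));
    }
}
```

With this input the filter keeps two lines, sensor_1 and sensor_10, because both start with the prefix "sensor_1". Filtering on "sensor_1," would keep only the first.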
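3.2 Aggregation operators
DataStream itself has no aggregation methods such as reduce or sum; in Flink's design, data must be grouped by key before it can be aggregated.
Call keyBy on a DataStream to get a KeyedStream, then call reduce, sum, or another aggregation on it: group first, aggregate second.
Common aggregation operators:
- KeyBy: group by key
- Rolling Aggregation: rolling aggregation operators
- reduce: general aggregation
KeyBy
DataStream -> KeyedStream: logically splits a stream into disjoint partitions, each containing the elements that share a key. Internally this is implemented with hashing.
1. keyBy repartitions the stream.
2. Different keys may land in the same partition, because assignment is hash-based; records with the same key always go to the same partition.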
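To make the second point concrete, here is a minimal plain-Java sketch (no Flink dependency) of hash-based partition assignment. The class KeyPartitionSketch and the helpers partitionFor and assign are hypothetical names for illustration. Flink's actual assignment goes through key groups and a different hash function, so the partition numbers printed here will not match a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyPartitionSketch {
    // Simplified assumption: partition = hash(key) mod numPartitions.
    // Flink uses key groups and a murmur hash instead; only the principle is the same.
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups the given keys by the partition they are assigned to.
    public static Map<Integer, List<String>> assign(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (String key : keys) {
            result.computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                  .add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("sensor_1", "sensor_6", "sensor_7", "sensor_10");
        // Four distinct keys, two partitions: at least two keys must share a partition.
        System.out.println(assign(keys, 2));
    }
}
```

The same key always maps to the same partition, which is what lets a keyed aggregation see every record for that key. With more keys than partitions, several keys necessarily share one.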
Rolling Aggregation
- sum()
- min()
- max()
- minBy()
- maxBy()
These operators aggregate each keyed sub-stream of a KeyedStream separately.
Test code:
package com.root.transfrom;
import com.root.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author Kewei
* @Date 2022/3/4 20:46
*/
public class TransFormTest2_RollingAggregation {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
// The argument to map can be an anonymous class (mapStream0) or a lambda (mapStream)
SingleOutputStreamOperator<SensorReading> mapStream0 = inputStream.map(new MapFunction<String, SensorReading>() {
@Override
public SensorReading map(String s) throws Exception {
String[] fields = s.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
}
});
SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
// Two ways to key the stream: by field name (deprecated in newer Flink versions) or with a key selector
KeyedStream<SensorReading, Tuple> keyedStream = mapStream.keyBy("id");
KeyedStream<SensorReading, String> keyedStream1 = mapStream.keyBy(SensorReading::getId);
DataStreamSource<Long> dataStream = env.fromElements(1L, 34L, 4L, 456L, 23L);
KeyedStream<Long, Integer> keyedStream2 = dataStream.keyBy(new KeySelector<Long, Integer>() {
@Override
public Integer getKey(Long aLong) throws Exception {
return aLong.intValue() * 2;
}
});
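// Find the maximum temperature per key.
// Rolling aggregation, max vs. maxBy: max updates only the compared field and keeps the other fields of the first record seen for the key;
// maxBy returns the whole record that holds the maximum, so its other fields (such as the timestamp) change as well.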
SingleOutputStreamOperator<SensorReading> maxStream = keyedStream.maxBy("temperature");
// mapStream.print();
//
// mapStream0.print();
//
// keyedStream.print();
// keyedStream1.print("key1");
//
// keyedStream2.sum(0).print("key2");
maxStream.print("result");
env.execute();
}
}
Output:
result> SensorReading{id='sensor_1', timestamp=1547718199, temperature=35.8}
result> SensorReading{id='sensor_6', timestamp=1547718201, temperature=15.4}
result> SensorReading{id='sensor_7', timestamp=1547718202, temperature=6.7}
result> SensorReading{id='sensor_10', timestamp=1547718205, temperature=38.1}
result> SensorReading{id='sensor_1', timestamp=1547718207, temperature=36.3}
result> SensorReading{id='sensor_1', timestamp=1547718207, temperature=36.3}
result> SensorReading{id='sensor_1', timestamp=1547718212, temperature=37.1}
Because the aggregation is rolling, every incoming record emits the maximum seen so far for its key. That is why the 36.3 record appears twice: the record that followed it for sensor_1 had a lower temperature.
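The difference between max and maxBy is easy to see in a plain-Java simulation of the two rolling aggregations for a single key. This is a sketch of the semantics, not Flink's implementation: MaxVsMaxBySketch, Reading, rollingMax, and rollingMaxBy are hypothetical names, and the temperature of the third reading (32.8) is an assumed value, chosen only to be lower than 36.3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaxVsMaxBySketch {
    // Minimal stand-in for the SensorReading POJO used in this article.
    public static final class Reading {
        public final String id;
        public final long timestamp;
        public final double temperature;

        public Reading(String id, long timestamp, double temperature) {
            this.id = id;
            this.timestamp = timestamp;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            return id + "@" + timestamp + "=" + temperature;
        }
    }

    // max-like: only the compared field is updated; other fields stay those of the first record.
    public static List<Reading> rollingMax(List<Reading> input) {
        List<Reading> out = new ArrayList<>();
        Reading acc = null;
        for (Reading r : input) {
            acc = (acc == null)
                    ? r
                    : new Reading(acc.id, acc.timestamp, Math.max(acc.temperature, r.temperature));
            out.add(acc);
        }
        return out;
    }

    // maxBy-like: the whole record that holds the maximum is kept (first one wins on ties).
    public static List<Reading> rollingMaxBy(List<Reading> input) {
        List<Reading> out = new ArrayList<>();
        Reading acc = null;
        for (Reading r : input) {
            if (acc == null || r.temperature > acc.temperature) {
                acc = r;
            }
            out.add(acc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Reading> sensor1 = Arrays.asList(
                new Reading("sensor_1", 1547718199L, 35.8),
                new Reading("sensor_1", 1547718207L, 36.3),
                new Reading("sensor_1", 1547718209L, 32.8), // assumed value, below 36.3
                new Reading("sensor_1", 1547718212L, 37.1));
        System.out.println("max:   " + rollingMax(sensor1));
        System.out.println("maxBy: " + rollingMaxBy(sensor1));
    }
}
```

In the max variant the timestamp stays at 1547718199 throughout while the temperature climbs to 37.1. In the maxBy variant the timestamp follows the record that holds the maximum, matching the output shown above.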
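reduce
Reduce covers more general aggregation scenarios. It takes an implementation of the ReduceFunction functional interface.
Building on the previous example, which tracked the highest temperature seen per sensor, the requirement here is to also keep the timestamp up to date with the latest record.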
package com.root.transfrom;
import com.root.SensorReading;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author Kewei
* @Date 2022/3/4 21:08
*/
public class TransFormTest3_Reduce {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
KeyedStream<SensorReading, Tuple> keyedStream = mapStream.keyBy("id");
// Implement ReduceFunction so that the timestamp is updated on every record
// (the same logic can be written as a lambda, see reduceStream1)
SingleOutputStreamOperator<SensorReading> reduceStream = keyedStream.reduce(new ReduceFunction<SensorReading>() {
@Override
public SensorReading reduce(SensorReading sensorReading, SensorReading t1) throws Exception {
return new SensorReading(sensorReading.getId(), t1.getTimestamp(), Math.max(sensorReading.getTemperature(), t1.getTemperature()));
}
});
SingleOutputStreamOperator<SensorReading> reduceStream1 = keyedStream.reduce((s1, s2) ->
new SensorReading(s1.getId(), s2.getTimestamp(), Math.max(s1.getTemperature(), s2.getTemperature())));
// reduceStream.print();
reduceStream1.print();
env.execute();
}
}
Output:
SensorReading{id='sensor_1', timestamp=1547718199, temperature=35.8}
SensorReading{id='sensor_6', timestamp=1547718201, temperature=15.4}
SensorReading{id='sensor_7', timestamp=1547718202, temperature=6.7}
SensorReading{id='sensor_10', timestamp=1547718205, temperature=38.1}
SensorReading{id='sensor_1', timestamp=1547718207, temperature=36.3}
SensorReading{id='sensor_1', timestamp=1547718209, temperature=36.3}
SensorReading{id='sensor_1', timestamp=1547718212, temperature=37.1}
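Unlike in the "Rolling Aggregation" section, the second-to-last record carries the timestamp of the most recent record compared, not the timestamp of the record that held the maximum.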
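The rolling behavior of reduce can be sketched in plain Java as a fold that emits its accumulator after every element. This is a simulation for one key, not Flink code: RollingReduceSketch, rollingReduce, reading, and demo are hypothetical names, each reading is modeled as a (timestamp, temperature) pair, and the third temperature (32.8) is an assumed value below 36.3.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class RollingReduceSketch {
    // Emits the running result after every element, as reduce does per key on a keyed stream.
    public static <T> List<T> rollingReduce(List<T> input, BinaryOperator<T> reducer) {
        List<T> out = new ArrayList<>();
        T acc = null;
        for (T value : input) {
            acc = (acc == null) ? value : reducer.apply(acc, value);
            out.add(acc);
        }
        return out;
    }

    // A reading for one sensor, modeled as (timestamp, temperature).
    public static SimpleEntry<Long, Double> reading(long timestamp, double temperature) {
        return new SimpleEntry<>(timestamp, temperature);
    }

    public static List<SimpleEntry<Long, Double>> demo() {
        List<SimpleEntry<Long, Double>> sensor1 = Arrays.asList(
                reading(1547718199L, 35.8),
                reading(1547718207L, 36.3),
                reading(1547718209L, 32.8), // assumed value, below 36.3
                reading(1547718212L, 37.1));
        // Same rule as the ReduceFunction above: newest timestamp, highest temperature so far.
        return rollingReduce(sensor1,
                (a, b) -> reading(b.getKey(), Math.max(a.getValue(), b.getValue())));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

After the third reading the accumulator holds timestamp 1547718209 with temperature 36.3, which is the same shape as the second-to-last line of the output above.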
3.3 Multi-stream transformation operators
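The multi-stream transformation operators are:
- Split and Select (removed in newer versions)
- Connect and CoMap
- Union
Split and Select
Newer versions of Flink no longer have these two operators (DataStream#split was removed in Flink 1.12); side outputs are the replacement.
Split
DataStream -> SplitStream: splits a DataStream into a SplitStream according to some characteristic of the data.
Although a SplitStream looks like two streams, it is really a single, specially tagged stream.
Select
SplitStream -> DataStream: obtains one or more DataStreams from a SplitStream.
Combining split and select turns one DataStream into several DataStreams.
Test:
Split the sensor readings into two groups by temperature, high and low (readings above 30 go to high).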
package com.root.transfrom;
import com.root.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;
import java.util.Collections;
/**
* @author Kewei
* @Date 2022/3/4 21:24
*/
public class TransFormTest4_MultipleStreams {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
SplitStream<SensorReading> splitStream = mapStream.split(new OutputSelector<SensorReading>() {
@Override
public Iterable<String> select(SensorReading sensorReading) {
return (sensorReading.getTemperature() > 30) ? Collections.singletonList("high") : Collections.singletonList("low");
}
});
DataStream<SensorReading> high = splitStream.select("high");
DataStream<SensorReading> low = splitStream.select("low");
DataStream<SensorReading> all = splitStream.select("high", "low");
high.print("high");
low.print("low");
all.print("all");
env.execute();
}
}
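Connect and CoMap
Connect
DataStream, DataStream -> ConnectedStreams: connects two data streams while preserving their types. After connect, the two streams are merely placed in one stream; internally each keeps its own data and form unchanged, and the two remain independent of each other.
CoMap
ConnectedStreams -> DataStream: operates on a ConnectedStreams. It works like map and flatMap, but applies a separate map or flatMap function to each of the two connected streams.
Test: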
package com.root.transfrom;
import com.root.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;
import java.util.Collections;
/**
* @author Kewei
* @Date 2022/3/4 21:24
*/
public class TransFormTest4_MultipleStreams {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
SplitStream<SensorReading> splitStream = mapStream.split(new OutputSelector<SensorReading>() {
@Override
public Iterable<String> select(SensorReading sensorReading) {
return (sensorReading.getTemperature() > 30) ? Collections.singletonList("high") : Collections.singletonList("low");
}
});
DataStream<SensorReading> high = splitStream.select("high");
DataStream<SensorReading> low = splitStream.select("low");
DataStream<SensorReading> all = splitStream.select("high", "low");
// high.print("high");
// low.print("low");
// all.print("all");
SingleOutputStreamOperator<Tuple2<String, Double>> warningStream = high.map(new MapFunction<SensorReading, Tuple2<String, Double>>() {
@Override
public Tuple2<String, Double> map(SensorReading sensorReading) throws Exception {
return new Tuple2<String, Double>(sensorReading.getId(), sensorReading.getTemperature());
}
});
ConnectedStreams<Tuple2<String, Double>, SensorReading> connectStream = warningStream.connect(low);
SingleOutputStreamOperator<Object> resultStream = connectStream.map(new CoMapFunction<Tuple2<String, Double>, SensorReading, Object>() {
@Override
public Object map1(Tuple2<String, Double> value) throws Exception {
return new Tuple3<>(value.f0, value.f1, "high temp warning");
}
@Override
public Object map2(SensorReading value) throws Exception {
return new Tuple2<>(value.getId(), "normal");
}
});
resultStream.print();
DataStream<SensorReading> unionStream = high.union(low);
unionStream.print();
env.execute();
}
}
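Union
DataStream -> DataStream: unions two or more DataStreams, producing a new DataStream that contains all elements of all the input streams.
How do Union and Connect differ?
- Connect accepts streams of different data types, but can only combine two streams.
- Union can combine many streams, but they must all have the same data type.
// 3. union: combine several streams
// warningStream.union(lowTempStream); does not compile, because warningStream is a DataStream<Tuple2<String, Double>> while lowTempStream is a DataStream<SensorReading>
highTempStream.union(lowTempStream, allTempStream);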
3.4 Operator transformations
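In Storm (another stream processing framework), the flow of data is usually expressed as a hierarchy of Bolts that together form a topology.
In Flink, a transformation operator converts one or more DataStreams into a new DataStream, and multiple transformations can be combined into a complex dataflow topology. As the figure below shows, a DataStream is transformed, filtered, and aggregated into other streams by different transformations to meet the business requirements.
4. Data Types Supported by Flink
A Flink streaming application processes streams of events represented as data objects, so Flink must be able to handle these objects internally. They have to be serialized and deserialized to be sent over the network, or to be read from state backends, checkpoints, and savepoints. To do this efficiently, Flink needs to know exactly which data types the application processes. Flink uses the concept of type information (TypeInformation) to represent data types, and generates a specific serializer, deserializer, and comparator for each type.
Flink also has a type extraction system that analyzes the input and return types of functions to obtain type information automatically, and from it the serializers and deserializers. In some cases, however, such as lambda functions or generic types, type information must be provided explicitly for the application to work correctly or to perform well.
Flink supports all common data types in Java and Scala. The most widely used are the following.
4.1 Basic data types
Flink supports all Java and Scala basic data types: Int, Double, Long, String, ...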
DataStream<Integer> numberStream = env.fromElements(1, 2, 3, 4);
numberStream.map(data -> data * 2);
4.2 Java and Scala tuples
Unlike Scala, Java has no built-in tuple type; Flink's Java API provides its own tuple classes, Tuple0 through Tuple25.
DataStream<Tuple2<String, Integer>> personStream = env.fromElements(
Tuple2.of("Adam", 17),
Tuple2.of("Sarah", 23)
);
personStream.filter(p -> p.f1 > 18);
4.3 Scala case classes
case class Person(name: String, age: Int)
val persons: DataStream[Person] = env.fromElements(
Person("Zhang San", 12),
Person("Li Si", 23)
)
4.4 Java plain objects (POJOs)
A Java POJO must be a public class with a public no-argument constructor.
Every field must either be public, or be private with a getter and a setter.
package com.root;
/**
* @author Kewei
* @Date 2022/3/4 17:55
*/
public class SensorReading {
// Fields: id, timestamp, temperature
private String id;
private Long timestamp;
private Double temperature;
public SensorReading() {
}
public SensorReading(String id, Long timestamp, Double temperature) {
this.id = id;
this.timestamp = timestamp;
this.temperature = temperature;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public Long getTimestamp() {
return timestamp;
}
public void setTimestamp(Long timestamp) {
this.timestamp = timestamp;
}
public Double getTemperature() {
return temperature;
}
public void setTemperature(Double temperature) {
this.temperature = temperature;
}
@Override
public String toString() {
return "SensorReading{" +
"id='" + id + '\'' +
", timestamp=" + timestamp +
", temperature=" + temperature +
'}';
}
}
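The POJO rules above can be checked mechanically with reflection. The following is a simplified sketch, not Flink's own type extraction: PojoRulesSketch, looksLikePojo, and the three sample classes are hypothetical, and the check ignores details such as inherited fields and boolean "is" getters.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class PojoRulesSketch {
    // Returns true if the class follows the rules listed above (simplified check).
    public static boolean looksLikePojo(Class<?> type) {
        if (!Modifier.isPublic(type.getModifiers())) {
            return false; // the class itself must be public
        }
        try {
            type.getConstructor(); // requires a public no-argument constructor
        } catch (NoSuchMethodException e) {
            return false;
        }
        for (Field field : type.getDeclaredFields()) {
            int mods = field.getModifiers();
            if (field.isSynthetic() || Modifier.isStatic(mods)
                    || Modifier.isTransient(mods) || Modifier.isPublic(mods)) {
                continue; // public fields need no accessors; static/transient fields are skipped
            }
            String name = field.getName();
            String suffix = Character.toUpperCase(name.charAt(0)) + name.substring(1);
            if (!hasPublicMethod(type, "get" + suffix, 0)
                    || !hasPublicMethod(type, "set" + suffix, 1)) {
                return false; // non-public field without getter and setter
            }
        }
        return true;
    }

    private static boolean hasPublicMethod(Class<?> type, String name, int paramCount) {
        for (Method m : type.getMethods()) { // getMethods() returns public methods only
            if (m.getName().equals(name) && m.getParameterCount() == paramCount) {
                return true;
            }
        }
        return false;
    }

    // Follows every rule.
    public static class Good {
        private String id;
        public Good() {}
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    // Breaks the constructor rule.
    public static class NoDefaultConstructor {
        private String id;
        public NoDefaultConstructor(String id) { this.id = id; }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    // Breaks the accessor rule: private field with a getter but no setter.
    public static class NoSetter {
        private String id;
        public NoSetter() {}
        public String getId() { return id; }
    }

    public static void main(String[] args) {
        System.out.println(looksLikePojo(Good.class));
        System.out.println(looksLikePojo(NoDefaultConstructor.class));
        System.out.println(looksLikePojo(NoSetter.class));
    }
}
```

The SensorReading class above has the same shape as Good: a public class, a public no-argument constructor, and a getter and setter for each private field.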
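4.5 Others (Arrays, Lists, Maps, Enums, and so on)
Flink also supports some special-purpose types in Java and Scala, such as Java's ArrayList, HashMap, and Enum.
5. Implementing UDFs: Finer-Grained Control over Streams
5.1 Function classes
Flink exposes interfaces for all of its UDFs (as interfaces or abstract classes), for example MapFunction, FilterFunction, and ProcessFunction.
For example:
// A function class for filter: checks whether the input value contains "flink"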
public static class FlinkFilter implements FilterFunction<String> {
@Override public boolean filter(String value) throws Exception {
return value.contains("flink");
}
}
DataStream<String> flinkTweets = tweets.filter(new FlinkFilter());
An anonymous class works too:
DataStream<String> flinkTweets = tweets.filter(
new FilterFunction<String>() {
@Override public boolean filter(String value) throws Exception {
return value.contains("flink");
}
}
);
Because the interface has only one method to implement, a lambda can be used instead:
DataStream<String> flinkTweets = tweets.filter( tweet -> tweet.contains("flink") );
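5.2 Rich functions
Rich functions are function classes provided by the DataStream API; every Flink function class has a Rich version.
Unlike a regular function, a rich function can access the runtime context and has lifecycle methods, so it can implement more complex functionality, such as keeping state in the runtime.
- RichMapFunction
- RichFlatMapFunction
- RichFilterFunction
and so on.
A rich function has a lifecycle. The typical lifecycle methods are:
- open() is the initialization method of a rich function; it is called before an operator such as map processes its first element. It is typically used to initialize state.
- close() is the last method called in the lifecycle and does cleanup work. It is typically used to clean up state.
- getRuntimeContext() gives access to the function's RuntimeContext, for example the parallelism of the function, the task name, and state.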
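The call order of the lifecycle methods can be illustrated with a small plain-Java stand-in. This is not Flink's API: RichLifecycleSketch, RichMapper, runSubtask, and demo are hypothetical names that only mimic what one parallel subtask does with a rich function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RichLifecycleSketch {
    // Hypothetical stand-in for a rich function: lifecycle hooks around the per-element call.
    public abstract static class RichMapper<I, O> {
        public void open() {}
        public abstract O map(I value);
        public void close() {}
    }

    // Hypothetical runner: what a single subtask does with the function.
    public static <I, O> List<O> runSubtask(RichMapper<I, O> mapper, List<I> input) {
        List<O> out = new ArrayList<>();
        mapper.open(); // once, before the first element
        try {
            for (I value : input) {
                out.add(mapper.map(value)); // once per element
            }
        } finally {
            mapper.close(); // once, after the last element, even if map threw
        }
        return out;
    }

    // Records the order in which the lifecycle methods are called.
    public static List<String> demo() {
        List<String> calls = new ArrayList<>();
        runSubtask(new RichMapper<String, Integer>() {
            @Override
            public void open() {
                calls.add("open");
            }

            @Override
            public Integer map(String value) {
                calls.add("map(" + value + ")");
                return value.length();
            }

            @Override
            public void close() {
                calls.add("close");
            }
        }, Arrays.asList("sensor_1", "sensor_6"));
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [open, map(sensor_1), map(sensor_6), close]
    }
}
```

In a real job each parallel subtask has its own function instance, so open and close run once per subtask rather than once per job.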
Test code:
package com.root.transfrom;
import com.root.SensorReading;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author Kewei
* @Date 2022/3/5 9:46
*/
public class TransFormTest5_RichFunction {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
SingleOutputStreamOperator<SensorReading> sensorStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
// Use the custom rich function
SingleOutputStreamOperator<Tuple2<String, Integer>> mapStream = sensorStream.map(new MyMapper());
mapStream.print();
env.execute();
}
// RichMapFunction is an abstract class, so it is extended rather than implemented
private static class MyMapper extends RichMapFunction<SensorReading,Tuple2<String, Integer>> {
@Override
public Tuple2<String, Integer> map(SensorReading value) throws Exception {
// getRuntimeContext().getIndexOfThisSubtask()
// getRuntimeContext().getState()
return new Tuple2<>(value.getId(),getRuntimeContext().getIndexOfThisSubtask());
}
@Override
public void open(Configuration par) throws Exception{
// Initialization work, for example opening a database connection
System.out.println("open");
}
@Override
public void close() throws Exception{
// Typically used to release resources and clear state
System.out.println("close");
}
}
}
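6. Repartitioning Data
DataStream offers several repartitioning operations; in the DataStream class you will find many classes with Partitioner in their names.
Among them, the partitionCustom(...) method lets you define custom repartitioning.
Test code: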
package com.root.transfrom;
import com.root.SensorReading;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author Kewei
* @Date 2022/3/5 10:00
*/
public class TransFormTest6_Partiton {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
mapStream.print("input");
// 1.shuffle
DataStream<SensorReading> shuffle = mapStream.shuffle();
shuffle.print("shuffle");
mapStream.keyBy("id").print("keyby");
// global: send all records to a single partition (the first downstream subtask)
inputStream.global().print("global");
env.execute();
}
}
7. Sink: Writing Output
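Flink has no equivalent of Spark's foreach method for iterating over records yourself. All output to external systems goes through a Sink, and a job completes its output with a call like the one shown here.
Flink ships sinks for a number of systems; for anything else, you implement a custom sink.
// Custom sink
stream.addSink(new MySink(xxxx));
7.1 A custom JDBC sink
A custom sink extends RichSinkFunction and implements the invoke method.
Testing against MySQL:
First, add the dependency to pom.xml: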
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.23</version>
</dependency>
Create the table in MySQL:
create table sensor_temp(
id varchar(30),
temp double
)
Test code:
package com.root.sink;
import com.root.SensorReading;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
/**
* @author Kewei
* @Date 2022/3/5 10:19
*/
public class SinkTest_Jdbc {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> inputStream = env.readTextFile("data/sensor.txt");
SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});
mapStream.addSink(new MyJdbcSink());
env.execute();
}
public static class MyJdbcSink extends RichSinkFunction<SensorReading>{
// The connection and the prepared statements
Connection connection = null;
PreparedStatement insertStmt = null;
PreparedStatement updateStmt = null;
@Override
public void open(Configuration par) throws Exception{
// Open the database connection and prepare the statements
connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/text?useSSL=false","root","123456");
insertStmt = connection.prepareStatement("insert into sensor_temp(id, temp) values(?, ?)");
updateStmt = connection.prepareStatement("update sensor_temp set temp=? where id=?");
}
// For each incoming record, execute SQL over the connection
@Override
public void invoke(SensorReading value, Context context) throws Exception{
updateStmt.setDouble(1,value.getTemperature());
updateStmt.setString(2,value.getId());
updateStmt.execute();
// The update runs first; if it changed no row, insert the record instead
if (updateStmt.getUpdateCount() == 0) {
insertStmt.setString(1,value.getId());
insertStmt.setDouble(2,value.getTemperature());
insertStmt.execute();
}
}
@Override
public void close() throws Exception{
insertStmt.close();
updateStmt.close();
connection.close();
}
}
}
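7.2 Other sinks (Kafka, Redis, Elasticsearch)
Related blog post:
Flink 1.12.1 Elasticsearch connector Sink
The required services are not available on my virtual machine at the moment, so these are not tested here.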