DataStream API - Transformations
Map: DataStream → DataStream
Each element of the new DataStream corresponds one-to-one to an element of the original DataStream.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MapDemo {
    private static int index = 1;

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.readTextFile("F:/test.txt");
        // In MapFunction<String, String>, the first String is the input type and the second the output type.
        DataStream<String> newDataStream = dataStream.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                return (index++) + ". You entered: " + value;
            }
        });
        newDataStream.print();
        env.execute("map demo start");
    }
}
Contents of F:/test.txt:
Takes
one
element
and
produces
one
element
Output:
2> 1. You entered: one
1> 2. You entered: Takes
5> 3. You entered: produces
2> 4. You entered: element
4> 5. You entered: and
7> 6. You entered: element
6> 7. You entered: one
Converting the String stream read from the file into an Integer stream:
public class MapDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.readTextFile("F:/test.txt");
        // In MapFunction<String, Integer>, String is the input type and Integer the output type.
        DataStream<Integer> newDataStream = dataStream.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String value) throws Exception {
                return Integer.parseInt(value);
            }
        });
        newDataStream.print();
        env.execute("map demo start");
    }
}
FlatMap: DataStream → DataStream
Each element of the original DataStream maps to zero, one, or more elements in the new DataStream.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class FlatMapDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.readTextFile("F:/test.txt");
        DataStream<String> newDataStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // Split each line on whitespace and emit every word as a separate element.
                String[] strAry = value.split("\\s");
                for (String str : strAry) {
                    out.collect(str);
                }
            }
        });
        newDataStream.print();
        env.execute("flatMap demo start");
    }
}
Contents of F:/test.txt:
Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words
Output:
1> Takes
1> one
1> element
1> and
1> produces
1> zero,
1> one,
1> or
1> more
1> elements.
1> A
1> flatmap
1> function
1> that
1> splits
1> sentences
1> to
1> words
Filter: DataStream → DataStream
Filters the elements of a DataStream.
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class FilterDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.readTextFile("F:/test.txt");
        DataStream<Integer> newDataStream = dataStream.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String value) throws Exception {
                return Integer.parseInt(value);
            }
        }).filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer value) throws Exception {
                // Filter out all values equal to 0.
                return value != 0;
            }
        });
        newDataStream.print();
        env.execute("filter demo start");
    }
}
Contents of F:/test.txt:
0
1
0
2
3
4
Output:
2> 1
5> 2
8> 4
6> 3
KeyBy: DataStream → KeyedStream
Logically partitions the stream into different partitions. All records with the same key are assigned to the same partition. Internally, keyBy() implements the partitioning with hashing. There are several ways to specify the key:
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
A type cannot be used as a key if:
- it is a POJO type that does not override hashCode() and relies on the Object.hashCode() implementation;
- it is an array of any type.
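The hash partitioning behind keyBy can be sketched in plain Java without Flink. This is a simplified model (the class and method names here are mine); Flink's actual assignment goes through murmur-hashed key groups and differs in detail, but the core property is the same: equal keys always map to the same partition.

```java
// Simplified sketch of keyBy-style hash partitioning (not Flink's exact
// algorithm, which murmur-hashes the key's hashCode and maps it to a key group).
public class KeyBySketch {
    // All records with the same key map to the same partition index.
    public static int partitionFor(Object key, int numPartitions) {
        // Math.floorMod avoids a negative index when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] words = {"hello", "word", "hello", "word"};
        for (String w : words) {
            System.out.println(w + " -> partition " + partitionFor(w, 8));
        }
        // Equal keys always land in the same partition:
        System.out.println(partitionFor("hello", 8) == partitionFor("hello", 8)); // true
    }
}
```

This is also why a POJO relying on Object.hashCode() cannot be a key: two equal-valued instances would hash to different partitions.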
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class KeyByDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.readTextFile("F:/test.txt");
        DataStream<WordWithCount> newDataStream = dataStream.map(new MapFunction<String, WordWithCount>() {
            @Override
            public WordWithCount map(String value) throws Exception {
                return new WordWithCount(value, 1L);
            }
        }).keyBy("word");
        newDataStream.print();
        env.execute("keyBy demo start");
    }

    /**
     * Data type for words with count.
     */
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
Contents of F:/test.txt:
hello
word
hello word
hello
word
hello word
Output:
4> hello word : 1
4> hello word : 1
6> word : 1
6> word : 1
3> hello : 1
3> hello : 1
Reduce (incremental aggregation): KeyedStream → DataStream
Incremental aggregation: a computation is performed, and a result emitted, for every element received.
Aggregates the elements within a logical partition into a single element. Each reduce step produces a new value.
reduce cannot be applied directly to a SingleOutputStreamOperator, because that object represents an unbounded stream, and merging unbounded data is meaningless. reduce therefore has to run against a grouping or a window, i.e. against data produced by keyBy or window/timeWindow; the ReduceFunction merges each element with the previous reduce result and emits the merged value.
Note: start the socket on the Linux side first (nc -l 9000), then launch the main method.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
public class ReduceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.socketTextStream("192.168.220.150", 9000);
        DataStream<WordWithCount> newDataStream = dataStream.map(new MapFunction<String, WordWithCount>() {
            @Override
            public WordWithCount map(String value) throws Exception {
                return new WordWithCount(value, 1L);
            }
        }).keyBy("word")
          .timeWindow(Time.seconds(5)) // process once every 5 seconds
          .reduce(new ReduceFunction<WordWithCount>() {
              // Takes two elements; the result is fed into the next reduce call.
              @Override
              public WordWithCount reduce(WordWithCount a, WordWithCount b) throws Exception {
                  return new WordWithCount(a.word, a.count + b.count);
              }
          });
        newDataStream.print().setParallelism(1);
        env.execute("reduce demo start");
    }

    /**
     * Data type for words with count.
     */
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
Socket input:
Output:
b : 3
a : 4
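The incremental, per-element merging that reduce performs can be simulated in plain Java without Flink. The sketch below (class name and helper are mine) drops the windowing and shows the keyed rolling semantics: each incoming word is merged with the previous result for its key, and the merged value is emitted immediately.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of rolling reduce semantics per key (no windowing):
// every incoming element is merged with the previous result for its key,
// and the merged result is emitted immediately -- one output per input.
public class RollingReduceSketch {
    public static List<String> reducePerKey(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (String w : words) {
            long c = counts.merge(w, 1L, Long::sum); // merge with previous result
            emitted.add(w + " : " + c);              // emit the merged value
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(reducePerKey(List.of("a", "b", "a", "a")));
        // [a : 1, b : 1, a : 2, a : 3]
    }
}
```

With the timeWindow in the demo above, the same merging happens per window, but only the final merged value of each window is emitted.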
Aggregations (incremental aggregation): KeyedStream → DataStream
Performs a rolling aggregation on a KeyedStream. The difference between min and minBy is that min returns the minimum value, while minBy returns the element that holds the minimum value in the given field (max and maxBy behave analogously).
keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");
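The min/minBy distinction can be sketched in plain Java over Tuple2-like pairs for a single key. This is a simplified model (the class, record, and method names are mine): in this sketch min updates only the aggregated field while the other field keeps its first-seen value, whereas minBy returns the whole element containing the minimum, mirroring the behavior described above.

```java
import java.util.List;

// Sketch of min(1) vs minBy(1) over (f0, f1) pairs, all belonging to one key.
public class MinVsMinBySketch {
    public record Pair(long f0, long f1) {}

    // min(1): rolling minimum of field 1; field 0 keeps the first element's value.
    public static Pair min1(List<Pair> elems) {
        Pair acc = elems.get(0);
        for (Pair p : elems.subList(1, elems.size())) {
            acc = new Pair(acc.f0, Math.min(acc.f1, p.f1));
        }
        return acc;
    }

    // minBy(1): the entire element whose field 1 is minimal.
    public static Pair minBy1(List<Pair> elems) {
        Pair best = elems.get(0);
        for (Pair p : elems) {
            if (p.f1 < best.f1) best = p;
        }
        return best;
    }

    public static void main(String[] args) {
        List<Pair> elems = List.of(new Pair(1, 5), new Pair(3, 2), new Pair(2, 7));
        System.out.println(min1(elems));   // Pair[f0=1, f1=2] -- field 0 from the first element
        System.out.println(minBy1(elems)); // Pair[f0=3, f1=2] -- the whole minimal element
    }
}
```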
sum - Demo
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
public class SumDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        KeyedStream keyedStream = env
                .fromElements(Tuple2.of(2L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0); // key by the first field of the tuple
        // sum over the first field (position 0)
        SingleOutputStreamOperator<Tuple2> sumStream = keyedStream.sum(0);
        // The first field (i.e. the key) is accumulated, while the second field keeps the first arriving element's value.
        sumStream.addSink(new PrintSinkFunction<>()).setParallelism(1);
        env.execute("execute");
    }
}
Result:
(1,5)
(2,5)
(3,5)
(2,3)
(4,3)
min - Demo
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
public class MinDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        KeyedStream keyedStream = env
                .fromElements(Tuple2.of(2L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0); // key by the first field of the tuple
        // min over the second field (position 1)
        SingleOutputStreamOperator<Tuple2> minStream = keyedStream.min(1);
        // min(1) keeps the rolling minimum of the second field; the key field keeps the first arriving element's value.
        minStream.addSink(new PrintSinkFunction<>()).setParallelism(1);
        env.execute("execute");
    }
}
Result:
(1,5)
(1,5)
(1,2)
(2,3)
(2,3)
Joins
Window Join
Elements from two sources that fall into the same window interval are combined into pairs; the window interval can be measured in event time or processing time, and the concrete form of each pair is defined by the apply function. This combination is similar to an inner join on tables and follows the general pattern:
stream.join(otherStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>)
Tumbling Window Join
When performing a tumbling window join, all elements sharing a common key and a common tumbling window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, an element of one stream is not emitted if its tumbling window contains no elements from the other stream.
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
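Which pairs such a join emits can be modeled without Flink: a tumbling window of size `size` assigns each timestamp to window index `timestamp / size`, and two elements pair up only when they share a window. The sketch below (class and record names are mine) is a deliberately simplified model that ignores keys, watermarks, and late data.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a tumbling-window inner join: elements pair up only when
// they fall into the same window (timestamp / size) -- key handling omitted.
public class TumblingJoinSketch {
    public record Event(int value, long timestamp) {}

    public static List<String> join(List<Event> orange, List<Event> green, long size) {
        List<String> out = new ArrayList<>();
        for (Event o : orange) {
            for (Event g : green) {
                // Same tumbling window index => the pair is emitted (inner join).
                if (o.timestamp() / size == g.timestamp() / size) {
                    out.add(o.value() + "," + g.value());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> orange = List.of(new Event(0, 0), new Event(1, 1), new Event(2, 2));
        List<Event> green  = List.of(new Event(0, 0), new Event(3, 3));
        System.out.println(join(orange, green, 2)); // [0,0, 1,0, 2,3]
    }
}
```

Note the inner-join behavior: an orange element whose window contains no green element produces nothing.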
Sliding Window Join
When performing a sliding window join, all elements sharing a common key and a common sliding window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. An element of this stream that has no elements from the other stream in the current sliding window is not emitted.
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
Session Window Join
When performing a session window join, all elements within a window that satisfy the same session criteria are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Again this performs an inner join, so if a session window contains elements from only one stream, no output is emitted!
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
Interval Join
On the event-time axis, each element of the driving source (orangeStream below) is combined only with elements of the other source whose timestamps fall within an interval relative to its own timestamp. The interval's lower bound (negative) and upper bound are the two arguments of between in the example below:
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
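The between(lower, upper) condition above reduces to a simple predicate: an orange element with timestamp t joins green elements whose timestamps fall in [t + lowerBound, t + upperBound]. A plain-Java sketch of that predicate (the class name is mine):

```java
// The interval-join pairing condition: an orange element at orangeTs joins a
// green element at greenTs iff greenTs lies in [orangeTs + lower, orangeTs + upper].
public class IntervalJoinSketch {
    public static boolean joins(long orangeTs, long greenTs, long lower, long upper) {
        return greenTs >= orangeTs + lower && greenTs <= orangeTs + upper;
    }

    public static void main(String[] args) {
        // Using between(Time.milliseconds(-2), Time.milliseconds(1)) from the example:
        System.out.println(joins(3, 1, -2, 1)); // true  (1 lies within [1, 4])
        System.out.println(joins(3, 5, -2, 1)); // false (5 lies outside [1, 4])
    }
}
```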
Data Partitioning
Communication overhead in a distributed system is usually high, and transferring large volumes of data in data processing scenarios makes it worse. Controlling how data is distributed across the transport channels to achieve the best network performance is an important topic when implementing a streaming engine.
Without data partitioning, every parallel instance of a Join node has to aggregate data from all Source node instances, and the heavy data transfer overloads the network. With data partitioning, Source and Join nodes are connected one-to-one, and two connected tasks can run in a single thread within the same slot.
- Custom Partition: partitions the data according to the key at the specified position.
dataStream.partitionCustom(partitioner, "someKey")
- Random Partition: distributes the data uniformly to the downstream nodes.
dataStream.shuffle();
- Rebalance Partition: distributes the data evenly to the downstream nodes using round-robin scheduling. For some physical topologies this is the most effective partitioning method, e.g. when the Source nodes and the operator nodes are deployed on different physical machines.
dataStream.rebalance();
- Rescale Partition: the Flink engine dynamically adjusts the data distribution within a job according to resource usage and to the resource sharing of the deployed physical instances, so that data stays within the same slot as much as possible and network overhead is reduced.
dataStream.rescale();
- Broadcasting Partition: every element is broadcast to all downstream nodes.
dataStream.broadcast();
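The difference between shuffle() (random) and rebalance() (round-robin) can be sketched as two simple channel selectors. These are simplified models of my own, not Flink's internal classes:

```java
import java.util.Random;

// Simplified channel selectors mirroring shuffle() (random) and rebalance()
// (round-robin). Flink's real selectors live in its runtime; this is a model.
public class PartitionerSketch {
    private final Random random = new Random();
    private int nextChannel = -1;

    // shuffle(): each record goes to a uniformly random channel.
    public int shuffleChannel(int numChannels) {
        return random.nextInt(numChannels);
    }

    // rebalance(): records cycle through channels 0, 1, 2, ... in turn,
    // so the load is spread exactly evenly regardless of the data.
    public int rebalanceChannel(int numChannels) {
        nextChannel = (nextChannel + 1) % numChannels;
        return nextChannel;
    }

    public static void main(String[] args) {
        PartitionerSketch p = new PartitionerSketch();
        for (int i = 0; i < 6; i++) {
            System.out.print(p.rebalanceChannel(3) + " "); // 0 1 2 0 1 2
        }
        System.out.println();
    }
}
```

Both ignore the record's key, which is why neither can replace keyBy when per-key state is needed.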
Resource Sharing
Flink chains multiple tasks into one task that runs in a single thread, which lowers context-switching overhead and buffer usage, raising throughput while reducing latency. This mechanism is configurable:
// Start a new chain: the last two map functions below are chained together, while the first map function is not part of that chain.
dataStream.map(...).map(...).startNewChain().map(...);
// Disable chaining, so that no other operator instance shares a thread with this one.
dataStream.map(...).disableChaining();
// Slot sharing group: all tasks in the same group run their instances in the same slot, isolated from instances outside the group.
dataStream.map(...).slotSharingGroup("groupName");