13. How Flink Joins Work

江城子v3

Column: Flink原理解析 (Flink internals). Tag: flink

Article link: https://blog.csdn.net/jiang7chengzi/article/details/107897707

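1 Operator Overview

Operators transform one or more DataStreams into a new DataStream. An application can combine multiple transformation operators into a complex dataflow topology.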

Transformation	Description
Map DataStream → DataStream	Takes one element and produces one element. A map function that doubles the values of the input stream: `DataStream<Integer> dataStream = /* ... */; dataStream.map(new MapFunction<Integer, Integer>() { @Override public Integer map(Integer value) throws Exception { return 2 * value; } });`
FlatMap DataStream → DataStream	Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words: `dataStream.flatMap(new FlatMapFunction<String, String>() { @Override public void flatMap(String value, Collector<String> out) throws Exception { for(String word: value.split(" ")){ out.collect(word); } } });`
Filter DataStream → DataStream	Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values: `dataStream.filter(new FilterFunction<Integer>() { @Override public boolean filter(Integer value) throws Exception { return value != 0; } });`
KeyBy DataStream → KeyedStream	Logically partitions a stream into disjoint partitions. All elements with the same key are sent to the same partition. Internally, keyBy() distributes the data by hash partitioning. This transformation returns a KeyedStream, which is also what keyed state requires. `dataStream.keyBy(value -> value.getSomeKey()); /* key by field "someKey" */ dataStream.keyBy(value -> value.f0); /* key by the first element of a Tuple */` Attention: a type cannot be a key if (1) it is a POJO type but does not override the hashCode() method and relies on the Object.hashCode() implementation, or (2) it is an array of any type.
Reduce KeyedStream → DataStream	A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value. A reduce function that creates a stream of partial sums: `keyedStream.reduce(new ReduceFunction<Integer>() { @Override public Integer reduce(Integer value1, Integer value2) throws Exception { return value1 + value2; } });`
Aggregations KeyedStream → DataStream	Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy). `keyedStream.sum(0); keyedStream.sum("key"); keyedStream.min(0); keyedStream.min("key"); keyedStream.max(0); keyedStream.max("key"); keyedStream.minBy(0); keyedStream.minBy("key"); keyedStream.maxBy(0); keyedStream.maxBy("key");`
Window KeyedStream → WindowedStream	Windows can be defined on already partitioned KeyedStreams. Windows group the data in each key according to some characteristic (e.g., the data that arrived within the last 5 seconds). See windows for a complete description of windows. `dataStream.keyBy(value -> value.f0).window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data`
WindowAll DataStream → AllWindowedStream	Windows can be defined on regular DataStreams. Windows group all the stream events according to some characteristic (e.g., the data that arrived within the last 5 seconds). See windows for a complete description of windows. WARNING: This is in many cases a non-parallel transformation. All records will be gathered in one task for the windowAll operator. `dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data`
Window Apply WindowedStream → DataStream AllWindowedStream → DataStream	Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window. Note: If you are using a windowAll transformation, you need to use an AllWindowFunction instead. `windowedStream.apply(new WindowFunction<Tuple2<String, Integer>, Integer, Tuple, Window>() { @Override public void apply(Tuple tuple, Window window, Iterable<Tuple2<String, Integer>> values, Collector<Integer> out) throws Exception { int sum = 0; for (Tuple2<String, Integer> t : values) { sum += t.f1; } out.collect(sum); } }); /* applying an AllWindowFunction on a non-keyed window stream */ allWindowedStream.apply(new AllWindowFunction<Tuple2<String, Integer>, Integer, Window>() { @Override public void apply(Window window, Iterable<Tuple2<String, Integer>> values, Collector<Integer> out) throws Exception { int sum = 0; for (Tuple2<String, Integer> t : values) { sum += t.f1; } out.collect(sum); } });`
Window Reduce WindowedStream → DataStream	Applies a functional reduce function to the window and returns the reduced value. `windowedStream.reduce (new ReduceFunction<Tuple2<String,Integer>>() { public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception { return new Tuple2<String,Integer>(value1.f0, value1.f1 + value2.f1); } });`
Aggregations on windows WindowedStream → DataStream	Aggregates the contents of a window. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy). `windowedStream.sum(0); windowedStream.sum("key"); windowedStream.min(0); windowedStream.min("key"); windowedStream.max(0); windowedStream.max("key"); windowedStream.minBy(0); windowedStream.minBy("key"); windowedStream.maxBy(0); windowedStream.maxBy("key");`
Union DataStream* → DataStream	Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream. `dataStream.union(otherStream1, otherStream2, ...);`
Window Join DataStream,DataStream → DataStream	Join two data streams on a given key and a common window. `dataStream.join(otherStream) .where(<key selector>).equalTo(<key selector>) .window(TumblingEventTimeWindows.of(Time.seconds(3))) .apply (new JoinFunction () {...});`
Interval Join KeyedStream,KeyedStream → DataStream	Join two elements e1 and e2 of two keyed streams with a common key over a given time interval, so that e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound. `/* this will join the two streams so that key1 == key2 && leftTs - 2 < rightTs < leftTs + 2 */ keyedStream.intervalJoin(otherKeyedStream) .between(Time.milliseconds(-2), Time.milliseconds(2)) /* lower and upper bound */ .upperBoundExclusive() /* optional */ .lowerBoundExclusive() /* optional */ .process(new ProcessJoinFunction() {...});`
Window CoGroup DataStream,DataStream → DataStream	Cogroups two data streams on a given key and a common window. `dataStream.coGroup(otherStream) .where(0).equalTo(1) .window(TumblingEventTimeWindows.of(Time.seconds(3))) .apply (new CoGroupFunction () {...});`
Connect DataStream,DataStream → ConnectedStreams	"Connects" two data streams retaining their types. Connecting allows for shared state between the two streams. `DataStream<Integer> someStream = /* ... */; DataStream<String> otherStream = /* ... */; ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);`
CoMap, CoFlatMap ConnectedStreams → DataStream	Similar to map and flatMap on a connected data stream. `connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() { @Override public Boolean map1(Integer value) { return true; } @Override public Boolean map2(String value) { return false; } }); connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() { @Override public void flatMap1(Integer value, Collector<String> out) { out.collect(value.toString()); } @Override public void flatMap2(String value, Collector<String> out) { for (String word: value.split(" ")) { out.collect(word); } } });`
Iterate DataStream → IterativeStream → DataStream	Creates a "feedback" loop in the flow, by redirecting the output of one operator to some previous operator. This is especially useful for defining algorithms that continuously update a model. The following code starts with a stream and applies the iteration body continuously. Elements that are greater than 0 are sent back to the feedback channel, and the rest of the elements are forwarded downstream. See iterations for a complete description. `IterativeStream<Long> iteration = initialStream.iterate(); DataStream<Long> iterationBody = iteration.map(/* do something */); DataStream<Long> feedback = iterationBody.filter(new FilterFunction<Long>(){ @Override public boolean filter(Long value) throws Exception { return value > 0; } }); iteration.closeWith(feedback); DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>(){ @Override public boolean filter(Long value) throws Exception { return value <= 0; } });`

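Source: Flink documentation, DataStream API - Operators

Window operators were covered in the article on the window mechanism; this article focuses on the join operators.

The dual-stream join is a frequent Flink interview question. In most cases, covering the following points is enough:

Joins fall into two broad categories: Window Join and Interval Join. Window Join can be further divided by window type into three kinds: Tumbling Window Join, Sliding Window Join, and Session Window Join;
All window-based joins rely on the window mechanism: elements are first buffered in window state, and the join is executed when the window fires;
Interval join also buffers elements in state before processing them; the difference is that the buffered data has an expiry mechanism, with cleanup triggered by the data itself;
Currently, the result of a stream join is the Cartesian product of the matching elements from both sides;
Issues that come up in day-to-day use, such as late data and window serialization.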

2 DataStream API

2.1 Window Join

A window join joins events that share the same key and fall into the same window.

Usage:

stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)

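Official example:

This is the Tumbling Window Join; other window types, such as sliding windows and session windows, work on the same principle.

As the figure in the official documentation shows, we define a tumbling window with a size of 2 milliseconds, which results in windows of the form [0,1], [2,3], .... The figure shows the pairwise combinations of all elements in each window, which are passed on to the JoinFunction. Note that the tumbling window [6,7] emits nothing, because the green stream contains no elements to be joined with the orange elements ⑥ and ⑦.

A window-based join is an inner join: it joins events that have the same key and fall into the same window, and only elements whose keys match exactly are joined. Besides the join API provided by Flink, the same result can be implemented with coGroup.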
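
The pairwise, inner-join behaviour of a tumbling window join can be illustrated without a Flink cluster. The following is a minimal plain-Java sketch of the semantics only, not Flink's implementation: the `Event` class, the `windowStart` helper, and the sample timestamps are hypothetical, and it assumes non-negative timestamps and a window offset of zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TumblingWindowJoinSketch {

    /** Hypothetical minimal event: a join key plus an event-time timestamp in milliseconds. */
    static final class Event {
        final String key;
        final long ts;

        Event(String key, long ts) {
            this.key = key;
            this.ts = ts;
        }
    }

    /** Start of the tumbling window that contains ts (assumes ts >= 0 and no window offset). */
    static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** Inner join: emits "key:leftTs,rightTs" for every pair sharing both key and window. */
    static List<String> join(List<Event> left, List<Event> right, long size) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow = windowStart(l.ts, size) == windowStart(r.ts, size);
                if (sameKey && sameWindow) {
                    out.add(l.key + ":" + l.ts + "," + r.ts);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> orange = Arrays.asList(
                new Event("k", 0), new Event("k", 1), new Event("k", 6), new Event("k", 7));
        List<Event> green = Arrays.asList(new Event("k", 0), new Event("k", 3));
        // Window [0,1] yields two pairs; [2,3] and [6,7] have only one side and emit nothing.
        System.out.println(join(orange, green, 2)); // [k:0,0, k:1,0]
    }
}
```

In a real job the buffering and firing are handled by window state and triggers; the nested loop here only mirrors the Cartesian product that is produced per key and per window.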

2.2 Window coGroup

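This operation groups two data streams (or data sets) by key and then processes the elements that share a key. It differs slightly from join: when one stream (or data set) has no element matching the other side, the group is still handed to the user function, so output can still be produced.

coGroup is used much like join; the difference is that apply takes a CoGroupFunction rather than a JoinFunction.

Usage: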

dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new CoGroupFunction () {...});
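
Because the user function sees both groups for a key even when one of them is empty, coGroup can express outer joins. The following plain-Java sketch illustrates that idea with a left outer join; it is not Flink code. The `coGroup` method merely plays the role of a CoGroupFunction, the `{key, value}` string arrays are a hypothetical record format, and all elements are assumed to fall into one and the same window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class CoGroupSketch {

    /** Groups {key, value} pairs by key; TreeMap keeps the iteration order deterministic. */
    static Map<String, List<String>> groupByKey(List<String[]> input) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : input) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    /** Stand-in for a CoGroupFunction: receives both groups of one key, either may be empty. */
    static void coGroup(String key, List<String> left, List<String> right, List<String> out) {
        for (String l : left) {
            if (right.isEmpty()) {
                out.add(key + ":" + l + ",null"); // unmatched left element is still emitted
            }
            for (String r : right) {
                out.add(key + ":" + l + "," + r);
            }
        }
    }

    /** Left outer join of two inputs that are assumed to share a single window. */
    static List<String> leftOuterJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String>> l = groupByKey(left);
        Map<String, List<String>> r = groupByKey(right);
        TreeSet<String> keys = new TreeSet<>(l.keySet());
        keys.addAll(r.keySet());
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            coGroup(key,
                    l.getOrDefault(key, Collections.emptyList()),
                    r.getOrDefault(key, Collections.emptyList()),
                    out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(new String[]{"a", "L1"}, new String[]{"b", "L2"});
        List<String[]> right = Arrays.asList(new String[]{"a", "R1"}, new String[]{"c", "R3"});
        System.out.println(leftOuterJoin(left, right)); // [a:L1,R1, b:L2,null]
    }
}
```

Key "c" exists only on the right side, so the function is called with an empty left group and emits nothing, which is what a left outer join requires. Emitting in that case as well would turn the sketch into a full outer join.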

2.3 Interval Join

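In an interval join of stream A with stream B, each element of A is joined with the elements of B that have the same key and whose event-time timestamps fall within a range defined relative to the A element's timestamp. The result is an inner join. Interval Join supports Event Time only.

In the example from the official documentation, two streams, "orange" and "green", are joined with a lower bound of -2 milliseconds and an upper bound of +1 millisecond.

Using more formal notation again, this translates to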

orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound
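
The condition can be checked directly in code. Below is a minimal plain-Java sketch of the predicate with both bounds inclusive, applied to hypothetical orange and green timestamps chosen for illustration; key matching is left out, so every element is assumed to share one key.

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalJoinSketch {

    /** True if rightTs lies in [leftTs + lowerBound, leftTs + upperBound], bounds inclusive. */
    static boolean inInterval(long leftTs, long rightTs, long lowerBound, long upperBound) {
        return leftTs + lowerBound <= rightTs && rightTs <= leftTs + upperBound;
    }

    /** Inner interval join over finite inputs; emits "leftTs,rightTs" for each matching pair. */
    static List<String> join(long[] leftTs, long[] rightTs, long lowerBound, long upperBound) {
        List<String> out = new ArrayList<>();
        for (long l : leftTs) {
            for (long r : rightTs) {
                if (inInterval(l, r, lowerBound, upperBound)) {
                    out.add(l + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] orange = {2, 3, 4, 5}; // hypothetical timestamps in milliseconds
        long[] green = {0, 1, 6, 7};
        // lowerBound = -2 ms, upperBound = +1 ms
        System.out.println(join(orange, green, -2, 1)); // [2,0, 2,1, 3,1, 5,6]
    }
}
```

With these inputs the orange element at timestamp 4 covers the range [2,5], which contains no green element, so it produces no output.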

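Question: an interval join has inner join semantics; in the figure, for instance, the orange element with timestamp 4 is not joined with anything. How would you implement left join semantics instead?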
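
One possible answer, offered here as an assumption rather than something the article states: connect the two keyed streams, buffer elements in keyed state inside a KeyedCoProcessFunction, and use an event-time timer at leftTs + upperBound to emit a left element with a null right side if nothing matched by then. The plain-Java sketch below shows only the target semantics over finite inputs; the timestamps are hypothetical, a single key is assumed, and state and timers are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalLeftJoinSketch {

    /** Left outer interval join: every left element appears at least once in the output. */
    static List<String> leftJoin(long[] leftTs, long[] rightTs, long lowerBound, long upperBound) {
        List<String> out = new ArrayList<>();
        for (long l : leftTs) {
            boolean matched = false;
            for (long r : rightTs) {
                if (l + lowerBound <= r && r <= l + upperBound) {
                    out.add(l + "," + r);
                    matched = true;
                }
            }
            if (!matched) {
                // In a streaming job this decision can only be made once event time
                // has passed l + upperBound, because a match could still arrive before then.
                out.add(l + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] orange = {2, 3, 4, 5}; // hypothetical timestamps in milliseconds
        long[] green = {0, 1, 6, 7};
        System.out.println(leftJoin(orange, green, -2, 1)); // [2,0, 2,1, 3,1, 4,null, 5,6]
    }
}
```

The cost of left join semantics in a stream is latency: an unmatched element cannot be emitted until the watermark shows that no matching right element can still arrive.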

3 SQL/Table API

4 Common Join Problems

