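Preface
As mentioned earlier, the example on the official site is essentially an Inner Join (Windows Join Example)
GitHub repository: flink-learn
Now let's walk through the Tumbling Window Join.
Translation of the official documentation
Tumbling Window Join
When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed on to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, elements of one stream that have no elements from the other stream in their tumbling window are not emitted!
As illustrated in the figure, we define a tumbling window with a size of 2 milliseconds, which results in windows of the form [0,1], [2,3], …. The image shows the pairwise combinations of all elements in each window, which are passed on to the JoinFunction. Note that nothing is emitted in the tumbling window [6,7], because the green stream contains no elements to join with the orange elements ⑥ and ⑦.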
```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    // 2 ms window, matching the description above
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
```
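To make the window arithmetic concrete, here is a minimal plain-Java sketch (no Flink dependency) of how timestamps fall into tumbling windows and how the pairs are formed. The class name `TumblingWindowSketch`, the single shared key, and the sample timestamps are assumptions for illustration; they are chosen to match the description of the figure, where window [6,7] has no green elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TumblingWindowSketch {
    /** Start of the tumbling window (no offset, non-negative timestamps) containing ts. */
    static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** Groups timestamps by the start of the window they fall into. */
    static Map<Long, List<Long>> assign(List<Long> timestamps, long size) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            windows.computeIfAbsent(windowStart(ts, size), k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    /** Emits every orange/green pair that shares a window; all elements are assumed to share one key. */
    static List<String> join(List<Long> orange, List<Long> green, long size) {
        Map<Long, List<Long>> greenWindows = assign(green, size);
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> window : assign(orange, size).entrySet()) {
            List<Long> partners = greenWindows.get(window.getKey());
            if (partners == null) {
                continue; // inner join: a window with no partner emits nothing
            }
            for (long o : window.getValue()) {
                for (long g : partners) {
                    out.add(o + "," + g);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> orange = Arrays.asList(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L);
        List<Long> green = Arrays.asList(0L, 1L, 3L, 4L); // hypothetical sample values
        for (String pair : join(orange, green, 2)) {
            System.out.println(pair);
        }
    }
}
```

With these sample values the sketch emits eight pairs (0,0 0,1 1,0 1,1 2,3 3,3 4,4 5,4) and nothing for orange elements 6 and 7.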
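The official snippet above is built around one piece: JoinFunction.
So let's write a JoinFunction of our own.
Same routine as before:
- 1. Set up the Flink execution environment
```java
/*
 * Create the Flink execution environment.
 * Wrapped in a method so it does not have to be rewritten in every demo.
 */
public static StreamExecutionEnvironment getEnv() {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setParallelism(1);
    return env;
}
```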
- 2. Set up the data sources
```java
// Create the environment once and reuse it: every getEnv() call returns a new
// environment, and operators added to one are not visible to another.
StreamExecutionEnvironment env = getEnv();

// Set up the data sources
DataStream<Tuple3<String, String, Long>> leftSource =
        env.addSource(new StreamDataSource()).name("Demo Source");
DataStream<Tuple3<String, String, Long>> rightSource =
        env.addSource(new StreamDataSource1()).name("Demo Source");

static DataStream<Tuple3<String, String, Long>>
getDataStream(DataStream<Tuple3<String, String, Long>> source) {
    long delay = 5100L;
    // Assign timestamps and watermarks
    return source.assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, String, Long>>(Time.milliseconds(delay)) {
                private static final long serialVersionUID = 518406720598977074L;

                @Override
                public long extractTimestamp(Tuple3<String, String, Long> element) {
                    return element.f2;
                }
            }
    );
}
```
- 3. Join
```java
// Join
JoinUtil.getDataStream(leftSource).join(JoinUtil.getDataStream(rightSource))
        .where(new LeftSelectKey())
        .equalTo(new RightSelectKey())
        .window(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
        .apply((JoinFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>,
                Tuple5<String, String, String, Long, Long>>)
                (first, second) -> new Tuple5<>(first.f0, first.f1, second.f1, first.f2, second.f2))
        .print();
```
- 4. Run the program
```java
// Execute on the same environment the sources were added to
env.execute("TimeWindowDemo");
```
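The 5100 ms delay in step 2 interacts with the out-of-order record "b"/"5" in the left source. The plain-Java sketch below is a simplified model under stated assumptions: the watermark is taken as the highest timestamp seen minus the delay, only the left stream is considered (the join operator actually follows the minimum watermark of both inputs), periodic emission is ignored, and allowed lateness is zero. The class and method names are hypothetical.

```java
class WatermarkSketch {
    /** Simplified bounded-out-of-orderness watermark: highest timestamp seen so far minus the delay. */
    static long watermarkAfter(long[] timestamps, long delay) {
        long max = Long.MIN_VALUE + delay; // avoids underflow when no element has arrived yet
        for (long ts : timestamps) {
            max = Math.max(max, ts);
        }
        return max - delay;
    }

    /** A window [start, start + size) is closed once the watermark reaches its last timestamp. */
    static boolean windowClosed(long windowStart, long size, long watermark) {
        return windowStart + size - 1 <= watermark;
    }

    public static void main(String[] args) {
        // Left-stream timestamps emitted before the record ("b", "5", 1000000100000L) arrives.
        long[] seen = {1000000050000L, 1000000054000L, 1000000079900L, 1000000115000L};
        long windowStart = 1000000100000L; // 10 s window that "b"/"5" belongs to
        long size = 10_000L;

        System.out.println("closed with 5100 ms delay: "
                + windowClosed(windowStart, size, watermarkAfter(seen, 5100L)));
        System.out.println("closed with 5000 ms delay: "
                + windowClosed(windowStart, size, watermarkAfter(seen, 5000L)));
    }
}
```

Under this model the window is still open with a 5100 ms delay (watermark 1000000109900) but already closed with 5000 ms (watermark 1000000110000), in which case the late "b" records would be dropped. That is one plausible reading of why the delay sits slightly above 5 seconds; the source does not state the reason.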
Data sources
StreamDataSource
```java
package com.king.learn.Flink.streaming.join.source;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

import static com.king.learn.Flink.streaming.join.JoinUtil.test;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO data source
 */
public class StreamDataSource extends RichParallelSourceFunction<Tuple3<String, String, Long>> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Tuple3<String, String, Long>> ctx) throws Exception {
        // (key, value, event timestamp in ms)
        Tuple3[] elements = new Tuple3[]{
                Tuple3.of("a", "1", 1000000050000L),
                Tuple3.of("a", "2", 1000000054000L),
                Tuple3.of("a", "3", 1000000079900L),
                Tuple3.of("a", "4", 1000000115000L),
                Tuple3.of("b", "5", 1000000100000L),
                Tuple3.of("b", "6", 1000000108000L)
        };
        test(ctx, elements, running);
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```
StreamDataSource1
```java
package com.king.learn.Flink.streaming.join.source;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

import static com.king.learn.Flink.streaming.join.JoinUtil.test;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO
 */
public class StreamDataSource1 extends RichParallelSourceFunction<Tuple3<String, String, Long>> {
    private static final long serialVersionUID = -8338462943401114121L;
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Tuple3<String, String, Long>> ctx) throws Exception {
        Tuple3[] elements = new Tuple3[]{
                Tuple3.of("a", "hangzhou", 1000000059000L),
                Tuple3.of("b", "beijing", 1000000105000L),
        };
        test(ctx, elements, running);
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```
StreamDataSource2
```java
package com.king.learn.Flink.streaming.join.source;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import static com.king.learn.Flink.streaming.join.JoinUtil.test;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO
 */
public class StreamDataSource2 extends RichParallelSourceFunction<Tuple3<String, String, Long>> {
    private volatile boolean running = true;

    @Override
    public void run(SourceFunction.SourceContext<Tuple3<String, String, Long>> ctx) throws InterruptedException {
        Tuple3[] elements = new Tuple3[]{
                Tuple3.of("a", "beijing", 1000000058000L),
                Tuple3.of("c", "beijing", 1000000055000L),
                Tuple3.of("d", "beijing", 1000000106000L),
        };
        test(ctx, elements, running);
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```
Utility class JoinUtil
```java
package com.king.learn.Flink.streaming.join;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO
 */
public class JoinUtil {
    /*
     * Create the Flink execution environment.
     * Each call returns a new environment, so a job should keep a single reference.
     */
    public static StreamExecutionEnvironment getEnv() {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        return env;
    }

    static DataStream<Tuple3<String, String, Long>>
    getDataStream(DataStream<Tuple3<String, String, Long>> source) {
        long delay = 5100L;
        // Assign timestamps and watermarks
        return source.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, String, Long>>(Time.milliseconds(delay)) {
                    private static final long serialVersionUID = 518406720598977074L;

                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> element) {
                        return element.f2;
                    }
                }
        );
    }

    // Emits the elements one per second. Note: `running` is passed by value, so a
    // later cancel() on the source does not interrupt this loop; it still ends
    // after the last element.
    public static void test(SourceFunction.SourceContext<Tuple3<String, String, Long>> ctx, Tuple3[] elements, boolean running) throws InterruptedException {
        int count = 0;
        while (running && count < elements.length) {
            ctx.collect(new Tuple3<>((String) elements[count].f0, (String) elements[count].f1, (long) elements[count].f2));
            count++;
            Thread.sleep(1000);
        }
    }
}
```
InnerJoin
```java
package com.king.learn.Flink.streaming.join;

import com.king.learn.Flink.streaming.join.source.StreamDataSource;
import com.king.learn.Flink.streaming.join.source.StreamDataSource1;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import static com.king.learn.Flink.streaming.join.JoinUtil.getEnv;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO Inner Join
 */
public class FlinkTumblingWindowsInnerJoinDemo {
    public static void main(String[] args) throws Exception {
        int windowSize = 10;
        // Create the environment once: every getEnv() call returns a new environment,
        // and execute() only sees operators registered on the same instance.
        final StreamExecutionEnvironment env = getEnv();

        // Set up the data sources
        DataStream<Tuple3<String, String, Long>> leftSource =
                env.addSource(new StreamDataSource()).name("Demo Source");
        DataStream<Tuple3<String, String, Long>> rightSource =
                env.addSource(new StreamDataSource1()).name("Demo Source");

        // Join
        JoinUtil.getDataStream(leftSource).join(JoinUtil.getDataStream(rightSource))
                .where(new LeftSelectKey())
                .equalTo(new RightSelectKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
                .apply((JoinFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>, Tuple5<String, String, String, Long, Long>>)
                        (first, second) -> new Tuple5<>(first.f0, first.f1, second.f1, first.f2, second.f2))
                .print();

        env.execute("TimeWindowDemo");
    }

    public static class LeftSelectKey implements KeySelector<Tuple3<String, String, Long>, String> {
        private static final long serialVersionUID = 3962206049185587477L;

        @Override
        public String getKey(Tuple3<String, String, Long> w) {
            return w.f0;
        }
    }

    public static class RightSelectKey implements KeySelector<Tuple3<String, String, Long>, String> {
        private static final long serialVersionUID = -5385125386985167962L;

        @Override
        public String getKey(Tuple3<String, String, Long> w) {
            return w.f0;
        }
    }
}
```
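To see which records the inner join should pair up, the sample data can be traced by hand. The sketch below is plain Java without Flink; it assumes 10-second tumbling windows with no offset and that no record is dropped as late. The class name `InnerJoinTrace` is hypothetical, and the rows are copied from StreamDataSource and StreamDataSource1.

```java
import java.util.ArrayList;
import java.util.List;

class InnerJoinTrace {
    // Rows are {key, value, timestamp in ms}.
    static final Object[][] LEFT = {
        {"a", "1", 1000000050000L}, {"a", "2", 1000000054000L},
        {"a", "3", 1000000079900L}, {"a", "4", 1000000115000L},
        {"b", "5", 1000000100000L}, {"b", "6", 1000000108000L}
    };
    static final Object[][] RIGHT = {
        {"a", "hangzhou", 1000000059000L}, {"b", "beijing", 1000000105000L}
    };

    /** Start of the tumbling window (no offset) that contains the timestamp. */
    static long windowStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    /** Pairs every left row with every right row that has the same key and the same window. */
    static List<String> innerJoin(Object[][] left, Object[][] right, long sizeMs) {
        List<String> out = new ArrayList<>();
        for (Object[] l : left) {
            for (Object[] r : right) {
                boolean sameKey = l[0].equals(r[0]);
                boolean sameWindow =
                        windowStart((Long) l[2], sizeMs) == windowStart((Long) r[2], sizeMs);
                if (sameKey && sameWindow) {
                    // Same field order as the Tuple5 built in the demo's JoinFunction
                    out.add("(" + l[0] + "," + l[1] + "," + r[1] + "," + l[2] + "," + r[2] + ")");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String row : innerJoin(LEFT, RIGHT, 10_000L)) {
            System.out.println(row);
        }
    }
}
```

The trace yields four rows: "a"/"1" and "a"/"2" joined with "hangzhou", and "b"/"5" and "b"/"6" joined with "beijing". The left records "a"/"3" and "a"/"4" have no partner in their windows and are not emitted. The order printed by an actual Flink run may differ.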
LeftJoin
```java
package com.king.learn.Flink.streaming.join;

import com.king.learn.Flink.streaming.join.source.StreamDataSource;
import com.king.learn.Flink.streaming.join.source.StreamDataSource1;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import static com.king.learn.Flink.streaming.join.JoinUtil.getEnv;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO Left Outer Join
 */
public class FlinkTumblingWindowsLeftJoinDemo {
    public static void main(String[] args) throws Exception {
        int windowSize = 10;
        // Create the environment once: every getEnv() call returns a new environment,
        // and execute() only sees operators registered on the same instance.
        final StreamExecutionEnvironment env = getEnv();

        // Set up the data sources
        DataStream<Tuple3<String, String, Long>> leftSource =
                env.addSource(new StreamDataSource()).name("Demo Source");
        DataStream<Tuple3<String, String, Long>> rightSource =
                env.addSource(new StreamDataSource1()).name("Demo Source");

        // Join
        JoinUtil.getDataStream(leftSource).coGroup(JoinUtil.getDataStream(rightSource))
                .where(new LeftSelectKey()).equalTo(new RightSelectKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
                .apply(new LeftJoin())
                .print();

        env.execute("TimeWindowDemo");
    }

    public static class LeftJoin implements CoGroupFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>, Tuple5<String, String, String, Long, Long>> {
        private static final long serialVersionUID = 3583938761914965374L;

        @Override
        public void coGroup(Iterable<Tuple3<String, String, Long>> leftElements, Iterable<Tuple3<String, String, Long>> rightElements, Collector<Tuple5<String, String, String, Long, Long>> out) {
            for (Tuple3<String, String, Long> leftElem : leftElements) {
                boolean hadElements = false;
                for (Tuple3<String, String, Long> rightElem : rightElements) {
                    out.collect(new Tuple5<>(leftElem.f0, leftElem.f1, rightElem.f1, leftElem.f2, rightElem.f2));
                    hadElements = true;
                }
                if (!hadElements) {
                    // No match on the right side: pad with placeholder values
                    out.collect(new Tuple5<>(leftElem.f0, leftElem.f1, "null", leftElem.f2, -1L));
                }
            }
        }
    }

    public static class LeftSelectKey implements KeySelector<Tuple3<String, String, Long>, String> {
        private static final long serialVersionUID = -4996755192016797420L;

        @Override
        public String getKey(Tuple3<String, String, Long> w) {
            return w.f0;
        }
    }

    public static class RightSelectKey implements KeySelector<Tuple3<String, String, Long>, String> {
        private static final long serialVersionUID = -4959317241606342598L;

        @Override
        public String getKey(Tuple3<String, String, Long> w) {
            return w.f0;
        }
    }
}
```
As the example above shows, the left join is implemented on top of coGroup().
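The reason coGroup() works for a left join is that it hands over both sides of one key and one window together, even when one side is empty, so the function can decide what to emit for unmatched elements. A minimal plain-Java sketch of that per-group logic follows; the class name `LeftJoinSketch` and the use of plain strings instead of tuples are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class LeftJoinSketch {
    /** Left outer join of the two sides of one (key, window) group. */
    static List<String> leftJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            if (right.isEmpty()) {
                out.add(l + ",null"); // keep the left element, pad the right side
            }
            for (String r : right) {
                out.add(l + "," + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Key "a", window starting at 1000000050000: both sides have elements.
        System.out.println(leftJoin(Arrays.asList("1", "2"), Arrays.asList("hangzhou")));
        // Key "a", window starting at 1000000070000: the right side is empty.
        System.out.println(leftJoin(Arrays.asList("3"), Collections.<String>emptyList()));
    }
}
```

A group whose left side is empty emits nothing, which is what distinguishes the left join from a full outer join.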
OuterJoin
```java
package com.king.learn.Flink.streaming.join;

import com.king.learn.Flink.streaming.join.bean.Element;
import com.king.learn.Flink.streaming.join.source.StreamDataSource1;
import com.king.learn.Flink.streaming.join.source.StreamDataSource2;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.HashSet;

import static com.king.learn.Flink.streaming.join.JoinUtil.getEnv;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO outer join; both sides must be deduplicated
 */
public class FlinkTumblingWindowsOuterJoinDemo {
    public static void main(String[] args) throws Exception {
        int windowSize = 10;
        // Create the environment once: every getEnv() call returns a new environment,
        // and execute() only sees operators registered on the same instance.
        final StreamExecutionEnvironment env = getEnv();

        // Set up the data sources
        DataStream<Tuple3<String, String, Long>> leftSource =
                env.addSource(new StreamDataSource1()).name("Demo Source");
        DataStream<Tuple3<String, String, Long>> rightSource =
                env.addSource(new StreamDataSource2()).name("Demo Source");

        // Join
        JoinUtil.getDataStream(leftSource).coGroup(JoinUtil.getDataStream(rightSource))
                .where(new LeftSelectKey()).equalTo(new RightSelectKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(windowSize)))
                .apply(new OuterJoin())
                .print();

        env.execute("TimeWindowDemo");
    }

    public static class OuterJoin implements CoGroupFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>, Tuple5<String, String, String, Long, Long>> {
        private static final long serialVersionUID = 844632302486386586L;

        @Override
        public void coGroup(Iterable<Tuple3<String, String, Long>> leftElements, Iterable<Tuple3<String, String, Long>> rightElements, Collector<Tuple5<String, String, String, Long, Long>> out) {
            HashMap<String, Element> left = new HashMap<>();
            HashMap<String, Element> right = new HashMap<>();
            HashSet<String> set = new HashSet<>();
            // The maps are keyed by the join key, so a second element with the
            // same key on the same side overwrites the first one.
            for (Tuple3<String, String, Long> leftElem : leftElements) {
                set.add(leftElem.f0);
                left.put(leftElem.f0, new Element(leftElem.f1, leftElem.f2));
            }
            for (Tuple3<String, String, Long> rightElem : rightElements) {
                set.add(rightElem.f0);
                right.put(rightElem.f0, new Element(rightElem.f1, rightElem.f2));
            }
            for (String key : set) {
                Element leftElem = getHashMapByDefault(left, key, new Element("null", -1L));
                Element rightElem = getHashMapByDefault(right, key, new Element("null", -1L));
                out.collect(new Tuple5<>(key, leftElem.getName(), rightElem.getName(), leftElem.getNumber(), rightElem.getNumber()));
            }
        }

        private Element getHashMapByDefault(HashMap<String, Element> map, String key, Element defaultValue) {
            return map.get(key) == null ? defaultValue : map.get(key);
        }
    }

    public static class LeftSelectKey implements KeySelector<Tuple3<String, String, Long>, String> {
        private static final long serialVersionUID = -8189893569324632208L;

        @Override
        public String getKey(Tuple3<String, String, Long> w) {
            return w.f0;
        }
    }

    public static class RightSelectKey implements KeySelector<Tuple3<String, String, Long>, String> {
        private static final long serialVersionUID = 2249963842374426629L;

        @Override
        public String getKey(Tuple3<String, String, Long> w) {
            return w.f0;
        }
    }
}
```
This one is also built on coGroup(), but with a restriction: the code requires that a join key value appears at most once per stream within a window, because the elements are stored in a HashMap keyed by the join key and a repeated key overwrites the earlier element. It uses two source classes, StreamDataSource1 and StreamDataSource2, plus a POJO class, Element. Both source classes are listed above; Element is shown below.
bean Element
```java
package com.king.learn.Flink.streaming.join.bean;

/**
 * @Author: king
 * @Date: 2019-01-14
 * @Desc: TODO
 */
public class Element {
    public String name;
    public long number;

    public Element() {
    }

    public Element(String name, long number) {
        this.name = name;
        this.number = number;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getNumber() {
        return number;
    }

    public void setNumber(long number) {
        this.number = number;
    }

    @Override
    public String toString() {
        return this.name + ":" + this.number;
    }
}
```
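The restriction on repeated join keys comes from the HashMap in the OuterJoin function: all elements passed to one coGroup() call share the same key, so each map holds at most one entry per side. One possible way around it, sketched here in plain Java and not taken from the original project, is to branch on which side is empty and otherwise emit the full cross product. The class name `OuterJoinSketch` and the string-based rows are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class OuterJoinSketch {
    /** Full outer join of the two sides of one (key, window) group; duplicates are kept. */
    static List<String> outerJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        if (left.isEmpty()) {
            for (String r : right) {
                out.add("null," + r); // right-only group: pad the left side
            }
        } else if (right.isEmpty()) {
            for (String l : left) {
                out.add(l + ",null"); // left-only group: pad the right side
            }
        } else {
            for (String l : left) {
                for (String r : right) {
                    out.add(l + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Key "a": both sides present. Key "b": left only. Key "c": right only.
        System.out.println(outerJoin(Arrays.asList("hangzhou"), Arrays.asList("beijing")));
        System.out.println(outerJoin(Arrays.asList("beijing"), Collections.<String>emptyList()));
        System.out.println(outerJoin(Collections.<String>emptyList(), Arrays.asList("beijing")));
    }
}
```

In a CoGroupFunction the same branching would operate on the two Iterable arguments; copying them into lists first makes the emptiness check straightforward.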