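In the previous post I covered transformations, grouping, and aggregation. This post continues with joining streams. The joining operators are:
- union: merges streams of the same data type into one stream.
- connect: combines two streams whose data types may differ into one connected stream, with each side keeping its own type.
- cogroup: combines two streams of different data types and buffers their records in a window. When the window fires, the function receives all records from both sides for a given key.
- join: combines two streams of different data types and buffers their records in a window. When the window fires, records from the two sides that share a key are paired as a Cartesian product.
- interval join: the processing logic is much the same as join, with the addition that you can widen the matching range between the two streams. For example, if A is a record from stream1 and B is a record from stream2, A is matched with every B that satisfies A.timestamp - interval <= B.timestamp <= A.timestamp + interval.
- broadcast: a broadcast stream sends all of its data to every partition of the other stream, where the computation then takes place. This makes it possible to implement dimension-table lookups. Another approach to dimension-table lookups is async I/O, which I will cover in a separate post.
The examples below walk through each of these operators.
union usage: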
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
DataStreamSource<String> src2 = env.socketTextStream("127.0.0.1", 8888);
DataStream<String> union = src1.union(src2);
union.print("------");
env.execute("test-union");
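Both sources read data from a socket and produce String records. union then merges the two streams into a single stream of the same type.
connect usage. When a result can only be computed after data from two topics has arrived, use connect to bring the data of both topics into one operator. The example below imitates an inner join (the intersection of the two streams): keyed state (a ValueState holding a list, one per stream) buffers the records that have already arrived, and when the matching record shows up on the other stream, the joined result is sent downstream.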
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
DataStreamSource<String> src2 = env.socketTextStream("127.0.0.1", 8888);
KeyedStream<Integer, String> intSrc = src1.map(new RichMapFunction<String, Integer>() {
@Override
public Integer map(String record) throws Exception {
return Integer.parseInt(record);
}
}).keyBy(new KeySelector<Integer, String>() {
@Override
public String getKey(Integer integer) throws Exception {
return integer.toString();
}
});
KeyedStream<String, String> keyedSrc2 = src2.keyBy(x -> x);
/**
* Imitates inner-join logic: only records present in both streams are emitted
* */
intSrc.connect(keyedSrc2).process(new CoProcessFunction<Integer, String, Tuple2<String,Integer>>() {
private ValueState<List<String>> stream1Buffer = null ;
private ValueState<List<String>> stream2Buffer = null ;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
ValueStateDescriptor<List<String>> stream1BufferDesc = new ValueStateDescriptor<List<String>>("Stream1Buffer"
, TypeInformation.of(new TypeHint<List<String>>() {})
);
ValueStateDescriptor<List<String>> stream2BufferDesc = new ValueStateDescriptor<List<String>>("Stream2Buffer"
, TypeInformation.of(new TypeHint<List<String>>() {})
);
stream1Buffer = getRuntimeContext().getState(stream1BufferDesc);
stream2Buffer = getRuntimeContext().getState(stream2BufferDesc);
}
@Override
public void processElement1(Integer record, CoProcessFunction<Integer, String, Tuple2<String, Integer>>.Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
join(record.toString() , collector , stream2Buffer , stream1Buffer);
}
@Override
public void processElement2(String record, CoProcessFunction<Integer, String, Tuple2<String, Integer>>.Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
join(record , collector , stream1Buffer, stream2Buffer);
}
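// Looks for a match in the other stream's buffer; if there is none, buffers the record on its own side.
private void join(String record , Collector<Tuple2<String, Integer>> collector , ValueState<List<String>> otherBuffer , ValueState<List<String>> ownBuffer) throws IOException {
List<String> others = otherBuffer.value();
// indexOf rather than binarySearch: the buffer is never sorted
int idx = Objects.isNull(others) ? -1 : others.indexOf(record);
if(idx>=0){
String s = others.remove(idx);
// write the removal back so it also takes effect on state backends that serialize values
otherBuffer.update(others);
collector.collect(new Tuple2<String,Integer>(record+" join " + s , 1));
}else{
List<String> own = ownBuffer.value();
if(Objects.isNull(own)){
own = new ArrayList<>(); // needs java.util.ArrayList
}
own.add(record);
ownBuffer.update(own);
}
}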
}).print("------");
env.execute();
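To make the match-or-buffer idea easier to follow outside of Flink, here is a minimal plain-Java sketch of it. The class name `BufferedInnerJoin` and its methods are hypothetical, and two in-memory maps stand in for Flink keyed state, so this illustrates only the matching semantics under those assumptions, not checkpointing or state cleanup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the CoProcessFunction above: plain maps replace keyed state.
final class BufferedInnerJoin {
    private final Map<String, Deque<String>> leftBuffer = new HashMap<>();
    private final Map<String, Deque<String>> rightBuffer = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    // A record from the first stream: match against the right buffer, else wait in the left one.
    void onLeft(String key, String record) {
        matchOrBuffer(key, record, rightBuffer, leftBuffer);
    }

    // A record from the second stream: match against the left buffer, else wait in the right one.
    void onRight(String key, String record) {
        matchOrBuffer(key, record, leftBuffer, rightBuffer);
    }

    private void matchOrBuffer(String key, String record,
                               Map<String, Deque<String>> other,
                               Map<String, Deque<String>> own) {
        Deque<String> waiting = other.get(key);
        if (waiting != null && !waiting.isEmpty()) {
            // Consume one waiting record from the other side and emit the pair.
            output.add(record + " join " + waiting.poll());
        } else {
            // Nothing to pair with yet: remember the record on its own side.
            own.computeIfAbsent(key, k -> new ArrayDeque<>()).add(record);
        }
    }

    List<String> output() {
        return output;
    }
}
```

Each buffered record is consumed by at most one match here, which mirrors the remove-on-match behaviour of the example. A real job would also need a timer or state TTL so that records that never find a partner do not accumulate forever.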
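join usage. The JoinFunction interface handles one record from each of the two streams per call. Within a window, the records of both sides that share a key are paired as a Cartesian product, and each pair is passed to this interface.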
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
WatermarkStrategy<String> ws = WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L)).withTimestampAssigner((String data , long ts )->{
return Long.parseLong(data.split(",")[2]);
});
KeySelector<String, String> keySelector = new KeySelector<String, String>() {
@Override
public String getKey(String s) throws Exception {
return s.split(",")[0];
}
};
SingleOutputStreamOperator<String> src1 = env.socketTextStream("127.0.0.1", 6666).assignTimestampsAndWatermarks(ws);
SingleOutputStreamOperator<String> src2 = env.socketTextStream("127.0.0.1", 8888).assignTimestampsAndWatermarks(ws);
src1.join(src2)
.where(keySelector)
.equalTo(keySelector)
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.apply(new JoinFunction<String, String, String>() {
@Override
public String join(String s, String s2) throws Exception {
return s.concat(":").concat(s2);
}
}).print("----");
env.execute();
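The pairing rule of the window join can be sketched without Flink. The snippet below is an illustration only: `TumblingJoinSketch` is a hypothetical name, the records are assumed to use the same "key,value,timestamp" layout as the example (key at index 0, timestamp in milliseconds at index 2), and the nested loop over two complete lists replaces the incremental, per-window buffering that Flink actually performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch re-creation of "same key + same tumbling window => pair".
final class TumblingJoinSketch {

    // Start of the tumbling window that contains ts (assumes ts >= 0 and no window offset).
    static long windowStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    // Pairs every left record with every right record of the same key and window.
    static List<String> join(List<String> left, List<String> right, long sizeMs) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            String[] lf = l.split(",");
            for (String r : right) {
                String[] rf = r.split(",");
                boolean sameKey = lf[0].equals(rf[0]);
                boolean sameWindow = windowStart(Long.parseLong(lf[2]), sizeMs)
                        == windowStart(Long.parseLong(rf[2]), sizeMs);
                if (sameKey && sameWindow) {
                    out.add(l + ":" + r); // same output shape as the JoinFunction above
                }
            }
        }
        return out;
    }
}
```

With a 2000 ms window, a left record `a,x,1000` pairs with a right record `a,p,1900` (both fall in the window starting at 0) but not with `a,q,2100`, which belongs to the next window.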
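interval join usage. When you send test data, results appear earlier than with the window join above, because an interval join does not wait for a window to fire: a pair is emitted as soon as both matching records have arrived. I set the interval to 1 second on either side, and since both bounds are declared exclusive, the condition is strictly A.timestamp - 1s < B.timestamp < A.timestamp + 1s.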
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
WatermarkStrategy<String> ws = WatermarkStrategy
.<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
.withTimestampAssigner((String data ,long ts )->{
return Long.parseLong(data.split(",")[2]);
});
KeyedStream<String, String> src1 = env.socketTextStream("127.0.0.1", 6666)
.assignTimestampsAndWatermarks(ws)
.keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String value) throws Exception {
return value.split(",")[0];
}
});
KeyedStream<String, String> src2 = env.
socketTextStream("127.0.0.1", 8888)
.assignTimestampsAndWatermarks(ws).keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String value) throws Exception {
return value.split(",")[0];
}
});
src1.intervalJoin(src2)
.between(Time.seconds(-1) , Time.seconds(1))
.upperBoundExclusive()
.lowerBoundExclusive()
.process(new ProcessJoinFunction<String, String, String>() {
@Override
public void processElement(String left, String right, Context ctx, Collector<String> out) throws Exception {
out.collect(left + "-->" + right);
}
}).print();
env.execute("test-interval-join");
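The matching condition of the interval join, including the effect of the exclusive bounds, can be written down as a small predicate. This is a plain-Java sketch, not Flink code: `IntervalJoinCondition` and its parameter names are made up for illustration, and timestamps are assumed to be in milliseconds.

```java
// Hypothetical helper that spells out when a left record and a right record are paired.
final class IntervalJoinCondition {

    // lowerMs and upperMs are relative to the left timestamp, e.g. -1000 and 1000
    // for between(Time.seconds(-1), Time.seconds(1)).
    static boolean matches(long leftTs, long rightTs, long lowerMs, long upperMs,
                           boolean lowerInclusive, boolean upperInclusive) {
        long diff = rightTs - leftTs;
        boolean aboveLower = lowerInclusive ? diff >= lowerMs : diff > lowerMs;
        boolean belowUpper = upperInclusive ? diff <= upperMs : diff < upperMs;
        return aboveLower && belowUpper;
    }
}
```

For a left record at 5000 ms and a right record at 6000 ms, the default inclusive bounds produce a match, while the exclusive bounds used in the example above do not, because the difference is exactly 1000 ms.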
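coGroup usage. I use a tumbling time window on event time. The window fires once the watermark reaches the window's maximum timestamp, and the computation in the RichCoGroupFunction interface then runs. Here I use the forBoundedOutOfOrderness watermark strategy with a bound of 1 second, so the watermark trails the largest timestamp seen by 1 second and the window fires 1 second (in event time) later than it otherwise would.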
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
WatermarkStrategy<String> ws = WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L)).withTimestampAssigner((String data , long ts )->{
return Long.parseLong(data.split(",")[1]);
});
SingleOutputStreamOperator<String> src1 = env.socketTextStream("127.0.0.1", 6666).assignTimestampsAndWatermarks(ws);
SingleOutputStreamOperator<String> src2 = env.socketTextStream("127.0.0.1", 8888).assignTimestampsAndWatermarks(ws);
src1.keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String s) throws Exception {
return s.split(",")[0] ;
}
}).coGroup(src2.keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String s) throws Exception {
return s.split(",")[0] ;
}
})).where(new KeySelector<String, String>() {
@Override
public String getKey(String src1Data) throws Exception {
return src1Data.split(",")[0] ;
}
}).equalTo(new KeySelector<String, String>() {
@Override
public String getKey(String src2Data) throws Exception {
return src2Data.split(",")[0] ;
}
}).window(TumblingEventTimeWindows.of(Time.seconds(2)))
.apply(new RichCoGroupFunction<String, String, String>() {
@Override
public void coGroup(Iterable<String> first, Iterable<String> second, Collector<String> collector) throws Exception {
String a = "" ;
String b = "" ;
for(String e : first){
a+=e;
}
for(String e : second){
b+=e;
}
collector.collect(a + ":" + b);
}
})
.print("-----");
env.execute();
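The one-second delay can be checked with a little arithmetic. The sketch below is plain Java with hypothetical names (`WatermarkFiringSketch`, `watermark`, `fires`). It assumes millisecond timestamps, that a bounded-out-of-orderness watermark equals the largest timestamp seen minus the bound minus 1 ms, and that an event-time window [start, end) fires once the watermark reaches end - 1.

```java
// Hypothetical arithmetic model of when an event-time window fires.
final class WatermarkFiringSketch {

    // Watermark derived from the largest timestamp seen so far.
    static long watermark(long maxSeenTs, long outOfOrdernessMs) {
        return maxSeenTs - outOfOrdernessMs - 1;
    }

    // True once the watermark has reached the last timestamp that belongs to the window.
    static boolean fires(long windowEndExclusive, long watermark) {
        return watermark >= windowEndExclusive - 1;
    }
}
```

For the window [0, 2000) and a 1000 ms bound, a record with timestamp 2999 is not enough, while a record with timestamp 3000 makes the window fire. Two things are worth keeping in mind when testing with sockets: watermarks are emitted periodically rather than per record, and a two-input operator advances only to the smaller of its two input watermarks, so both sockets need to receive newer data before anything is printed.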
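Broadcast stream demo. The example below follows the one in the official documentation and is deliberately simple. Dimension-table lookups come with a preloading problem: records of the main stream may arrive before the dimension data does. One option is to load the dimension data into a local field of the function class. Another is to tag, in the broadcast processing step, the records that found no match in the dimension table, and then have a downstream operator send the tagged records to a side output for separate handling.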
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
DataStreamSource<String> stringDataStreamSource = env.fromElements("green,good", "blue,excellant", "purple,2", "red,4");
MapStateDescriptor<String,String> mapDesc = new MapStateDescriptor<String,String>("rule" ,String.class,String.class);
BroadcastStream<String> broadcast = stringDataStreamSource.broadcast(mapDesc);
src1.connect(broadcast)
.process(new BroadcastProcessFunction<String,String,String>(){
private final MapStateDescriptor<String,String> mapRule = new MapStateDescriptor<String, String>("rule",String.class , String.class);
@Override
public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
String s = ctx.getBroadcastState(mapRule).get(value);
out.collect("out:"+s);
}
@Override
public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
ctx.getBroadcastState(mapRule).put(value.split(",")[0],value.split(",")[1]);
}
})
.print("------");
env.execute();
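The tagging idea for unmatched records can be sketched in plain Java. Everything here is illustrative: `DimensionLookupSketch` is a hypothetical class, a HashMap stands in for the broadcast state, and the `unmatched` list plays the role that a side output would play in a real job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a dimension lookup that separates records without a match.
final class DimensionLookupSketch {
    private final Map<String, String> rules = new HashMap<>();
    final List<String> matched = new ArrayList<>();
    final List<String> unmatched = new ArrayList<>();

    // Mirrors processBroadcastElement: parse "key,value" and store it.
    void onRule(String rule) {
        String[] parts = rule.split(",", 2);
        if (parts.length == 2) { // ignore malformed rules instead of failing
            rules.put(parts[0], parts[1]);
        }
    }

    // Mirrors processElement: look the record up, or set it aside if nothing is known yet.
    void onRecord(String record) {
        String value = rules.get(record);
        if (value == null) {
            unmatched.add(record); // a real job would route this to a side output
        } else {
            matched.add("out:" + value);
        }
    }
}
```

A record that arrives before its rule ends up in `unmatched` rather than being printed as `out:null`, which is what the example above would do for a key that is not yet in the broadcast state.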
That's a wrap.