Flink Operators (Part 2)

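The previous post covered transformation, grouping, and aggregation. This post continues with joining streams. The join-type operators are: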

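  • union: merges streams of the same data type into one stream.
  • connect: combines two streams, whose data types may differ, into one connected stream.
  • cogroup: combines two streams, whose data types may differ, and buffers their elements in a window.
  • join: combines two streams, whose data types may differ, and buffers their elements in a window. When the window fires, the matching elements from the two sides are paired up as a Cartesian product.
  • interval join: works much like join, except that elements are matched over a time range around each element. For example, if A is an element of stream1 and B is an element of stream2, B matches A when A.timestamp - interval <= B.timestamp <= A.timestamp + interval.
  • broadcast: a broadcast stream sends all of its elements to every parallel partition of the other stream, where the processing logic is then applied. This makes it possible to join against a dimension table. Another approach to dimension-table joins is asynchronous I/O, which I will cover in a separate post.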
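
The interval join condition in the list above can be checked in plain Java, without any Flink dependency. This is only a sketch of the matching rule: the names IntervalMatch, matches, and pairs are made up for illustration and are not part of Flink's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: illustrates the interval join condition only.
final class IntervalMatch {

    // True when b lies in [a - interval, a + interval], both bounds inclusive.
    static boolean matches(long aTimestamp, long bTimestamp, long interval) {
        return aTimestamp - interval <= bTimestamp && bTimestamp <= aTimestamp + interval;
    }

    // Returns every (a, b) timestamp pair that satisfies the condition.
    static List<long[]> pairs(List<Long> stream1, List<Long> stream2, long interval) {
        List<long[]> result = new ArrayList<>();
        for (long a : stream1) {
            for (long b : stream2) {
                if (matches(a, b, interval)) {
                    result.add(new long[] {a, b});
                }
            }
        }
        return result;
    }
}
```

With an interval of 1000 ms, for instance, an element of stream1 stamped 5000 matches elements of stream2 stamped anywhere from 4000 through 6000.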

The examples below demonstrate each of these join-type operators.

Using union:

       StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
       DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
       DataStreamSource<String> src2 = env.socketTextStream("127.0.0.1", 8888);
       DataStream<String> union = src1.union(src2);
       union.print("------");
       env.execute("test-union");

Both sources read String records from a socket. The two streams are then merged with union, and the merged stream has the same element type as its inputs.

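Using connect: when a computation needs data from two topics before it can run, use connect to bring the data from both topics together. The example below simulates an inner join, that is, it keeps only the intersection. It uses keyed state (a ValueState holding a list per stream) to store records that have already arrived, and emits a result downstream when the matching record arrives on the other stream.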

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
        DataStreamSource<String> src2 = env.socketTextStream("127.0.0.1", 8888);
        KeyedStream<Integer, String> intSrc = src1.map(new RichMapFunction<String, Integer>() {
            @Override
            public Integer map(String record) throws Exception {
                return Integer.parseInt(record);
            }
        }).keyBy(new KeySelector<Integer, String>() {
            @Override
            public String getKey(Integer integer) throws Exception {
                return integer.toString();
            }
        });
        KeyedStream<String, String> keyedSrc2 = src2.keyBy(x -> x);
        /**
         * Simulates inner join logic: keeps only the intersection
         * */
        intSrc.connect(keyedSrc2).process(new CoProcessFunction<Integer, String, Tuple2<String,Integer>>() {
            private ValueState<List<String>> stream1Buffer = null ;
            private ValueState<List<String>> stream2Buffer = null ;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                ValueStateDescriptor<List<String>> stream1BufferDesc = new ValueStateDescriptor<List<String>>("Stream1Buffer"
                        , TypeInformation.of(new TypeHint<List<String>>() {})
                );

                ValueStateDescriptor<List<String>> stream2BufferDesc = new ValueStateDescriptor<List<String>>("Stream2Buffer"
                        , TypeInformation.of(new TypeHint<List<String>>() {})
                );
                stream1Buffer = getRuntimeContext().getState(stream1BufferDesc);
                stream2Buffer = getRuntimeContext().getState(stream2BufferDesc);
            }

            @Override
            public void processElement1(Integer record, CoProcessFunction<Integer, String, Tuple2<String, Integer>>.Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
                join(record.toString() , collector , stream2Buffer , stream1Buffer);
            }

            @Override
            public void processElement2(String record, CoProcessFunction<Integer, String, Tuple2<String, Integer>>.Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
                join(record , collector , stream1Buffer, stream2Buffer);
            }
            // Looks for an equal record buffered by the other stream; if none exists, buffers the record on its own side.
            private void join(String record , Collector<Tuple2<String, Integer>> collector , ValueState<List<String>> otherBuffer , ValueState<List<String>> ownBuffer) throws IOException {
                List<String> other = otherBuffer.value();
                // List.remove(Object) removes the first equal element and reports whether one was found
                if(other != null && other.remove(record)){
                    // write the shrunken list back so the matched record is not joined twice
                    otherBuffer.update(other);
                    collector.collect(new Tuple2<String,Integer>(record+" join " + record , 1));
                }else{
                    List<String> own = ownBuffer.value();
                    if(Objects.isNull(own)){
                        own = new ArrayList<>();
                    }
                    own.add(record);
                    ownBuffer.update(own);
                }
            }
        }).print("------");
        env.execute();
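
The buffering logic of the connect example can be tried out without a cluster. The sketch below replays the same idea for a single key in plain Java: each side keeps a buffer, an arriving record first looks for an equal record in the other side's buffer, and it is buffered on its own side if none is found. BufferedInnerJoin and its methods are hypothetical names; in the Flink job the two buffers live in keyed state rather than in fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the two-buffer matching logic (no Flink state involved).
final class BufferedInnerJoin {
    private final List<String> leftBuffer = new ArrayList<>();
    private final List<String> rightBuffer = new ArrayList<>();
    private final List<String> output = new ArrayList<>();

    // Called for a record arriving on the first stream.
    void onLeft(String record) {
        matchOrBuffer(record, rightBuffer, leftBuffer);
    }

    // Called for a record arriving on the second stream.
    void onRight(String record) {
        matchOrBuffer(record, leftBuffer, rightBuffer);
    }

    private void matchOrBuffer(String record, List<String> otherBuffer, List<String> ownBuffer) {
        // remove(Object) deletes the first equal element and reports whether it found one.
        if (otherBuffer.remove(record)) {
            output.add(record + " join " + record);
        } else {
            ownBuffer.add(record);
        }
    }

    // Results emitted so far, in arrival order.
    List<String> output() {
        return output;
    }
}
```

Feeding "1" and "2" on the left and then "2" on the right emits a single result for "2", while "1" stays buffered until a matching record shows up on the right.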

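Using join: the JoinFunction interface receives one element from each of the two streams per call, and within a window it is invoked for every pair of elements with the same key, that is, as a Cartesian product. In this example each input line is comma-separated, with the key in the first field and the event timestamp in the third.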

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        WatermarkStrategy<String> ws = WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L)).withTimestampAssigner((String data , long ts )->{
            return Long.parseLong(data.split(",")[2]);
        });
        KeySelector<String, String> keySelector = new KeySelector<String, String>() {
            @Override
            public String getKey(String s) throws Exception {
                return s.split(",")[0];
            }
        };
        SingleOutputStreamOperator<String> src1 = env.socketTextStream("127.0.0.1", 6666).assignTimestampsAndWatermarks(ws);
        SingleOutputStreamOperator<String> src2 = env.socketTextStream("127.0.0.1", 8888).assignTimestampsAndWatermarks(ws);
        src1.join(src2)
                .where(keySelector)
                .equalTo(keySelector)
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new JoinFunction<String, String, String>() {
                    @Override
                    public String join(String s, String s2) throws Exception {
                        return s.concat(":").concat(s2);
                    }
                }).print("----");
        env.execute();
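
To make "Cartesian product per window" concrete, here is a plain-Java sketch with no Flink dependency. It assumes records shaped like key,value,timestamp (matching the key and timestamp fields the job above reads), assigns each record to a tumbling window by rounding its timestamp down to a multiple of the window size, and pairs every left record with every right record that shares its key and window. WindowJoinSketch and its methods are hypothetical names; the real operator also deals with watermarks, triggers, and state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a tumbling-window join; not Flink code.
final class WindowJoinSketch {

    // Start of the tumbling window that contains a non-negative timestamp.
    static long windowStart(long timestamp, long windowSizeMs) {
        return timestamp - (timestamp % windowSizeMs);
    }

    // Pairs every left record with every right record in the same key and window.
    static List<String> join(List<String> left, List<String> right, long windowSizeMs) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            String[] lf = l.split(",");
            for (String r : right) {
                String[] rf = r.split(",");
                boolean sameKey = lf[0].equals(rf[0]);
                boolean sameWindow = windowStart(Long.parseLong(lf[2]), windowSizeMs)
                        == windowStart(Long.parseLong(rf[2]), windowSizeMs);
                if (sameKey && sameWindow) {
                    out.add(l + ":" + r);
                }
            }
        }
        return out;
    }
}
```

Two left records and one right record with the same key in the same 2-second window produce two joined results, one per pair. A right record that falls into the next window produces none.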

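Using interval join: when I sent test data, results were emitted 1 second earlier than with the join above, because I set the interval to 1 second. Note that the example calls upperBoundExclusive() and lowerBoundExclusive(), so both ends of the range are excluded.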

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        WatermarkStrategy<String> ws = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
                .withTimestampAssigner((String data ,long ts )->{
            return Long.parseLong(data.split(",")[2]);
        });
        KeyedStream<String, String> src1 = env.socketTextStream("127.0.0.1", 6666)
                .assignTimestampsAndWatermarks(ws)
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) throws Exception {
                        return value.split(",")[0];
                    }
                });

        KeyedStream<String, String> src2 = env.
                socketTextStream("127.0.0.1", 8888)
                .assignTimestampsAndWatermarks(ws).keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) throws Exception {
                        return value.split(",")[0];
                    }
                });

        src1.intervalJoin(src2)
                .between(Time.seconds(-1) , Time.seconds(1))
                .upperBoundExclusive()
                .lowerBoundExclusive()
                .process(new ProcessJoinFunction<String, String, String>() {
                    @Override
                    public void processElement(String left, String right, Context ctx, Collector<String> out) throws Exception {
                        out.collect(left + "-->" + right);
                    }
                }).print();

        env.execute("test-interval-join");

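Using the coGroup operator: I use a tumbling time window with event time. The window fires once the watermark reaches the window's maximum timestamp, and the computation in the RichCoGroupFunction interface then runs. The watermark strategy here is forBoundedOutOfOrderness with a parameter of 1 second, so the watermark lags the largest timestamp seen by 1 second and the window fires 1 second later than it otherwise would.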

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        WatermarkStrategy<String> ws = WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L)).withTimestampAssigner((String data , long ts )->{
            return Long.parseLong(data.split(",")[1]);
        });
        SingleOutputStreamOperator<String> src1 = env.socketTextStream("127.0.0.1", 6666).assignTimestampsAndWatermarks(ws);
        SingleOutputStreamOperator<String> src2 = env.socketTextStream("127.0.0.1", 8888).assignTimestampsAndWatermarks(ws);
        src1.keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String s) throws Exception {
                        return s.split(",")[0] ;
                    }
                }).coGroup(src2.keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String s) throws Exception {
                        return s.split(",")[0] ;
                    }
                })).where(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String src1Data) throws Exception {
                        return src1Data.split(",")[0] ;
                    }
                }).equalTo(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String src2Data) throws Exception {
                        return src2Data.split(",")[0] ;
                    }
                }).window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new RichCoGroupFunction<String, String, String>() {
                    @Override
                    public void coGroup(Iterable<String> first, Iterable<String> second, Collector<String> collector) throws Exception {
                        String a = "" ;
                        String b = "" ;
                        for(String e : first){
                            a+=e;
                        }
                        for(String e : second){
                            b+=e;
                        }
                        collector.collect(a + ":" + b);
                    }
                })
                .print("-----");
        env.execute();
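
The firing delay described before the coGroup example can be worked out by hand. To my understanding, Flink's bounded-out-of-orderness generator emits a watermark equal to the largest timestamp seen, minus the out-of-orderness bound, minus 1 ms, and an event-time window [start, end) fires once the watermark reaches end - 1 ms. The plain-Java sketch below encodes only that arithmetic; WatermarkMath and its methods are hypothetical names, not Flink classes.

```java
// Hypothetical helper that reproduces the watermark arithmetic; not Flink code.
final class WatermarkMath {

    // Watermark produced after seeing maxTimestampMs with the given out-of-orderness bound.
    static long watermark(long maxTimestampMs, long outOfOrdernessMs) {
        return maxTimestampMs - outOfOrdernessMs - 1;
    }

    // An event-time window [start, end) fires once the watermark reaches end - 1.
    static boolean windowFires(long windowEndMs, long maxTimestampMs, long outOfOrdernessMs) {
        return watermark(maxTimestampMs, outOfOrdernessMs) >= windowEndMs - 1;
    }
}
```

For the 2-second window ending at 2000 ms with a 1-second bound, a record stamped 2999 does not fire the window, while a record stamped 3000 does. In the running job, watermarks are emitted periodically and a two-input operator follows the smaller of its two input watermarks, so both sockets have to receive sufficiently late records before output appears.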

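Demonstrating the broadcast stream: the example below is a very simple one taken from the official documentation. Dimension-table joins have a data preloading problem. You can load the dimension-table data into a local field of the function class, or you can tag the records that found no match in the broadcast state and have a downstream operator send the tagged records to a side output for separate handling.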

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
        DataStreamSource<String> stringDataStreamSource = env.fromElements("green,good", "blue,excellant", "purple,2", "red,4");
        MapStateDescriptor<String,String> mapDesc = new MapStateDescriptor<String,String>("rule" ,String.class,String.class);
        BroadcastStream<String> broadcast = stringDataStreamSource.broadcast(mapDesc);
        src1.connect(broadcast)
                .process(new BroadcastProcessFunction<String,String,String>(){
                    private final MapStateDescriptor<String,String> mapRule =  new MapStateDescriptor<String, String>("rule",String.class , String.class);

                    @Override
                    public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                        String s = ctx.getBroadcastState(mapRule).get(value);
                        out.collect("out:"+s);
                    }

                    @Override
                    public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
                        ctx.getBroadcastState(mapRule).put(value.split(",")[0],value.split(",")[1]);
                    }
                })
                .print("------");
        env.execute();
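
The idea of diverting unmatched records, mentioned before the broadcast example, can be modeled in plain Java. This sketch is not Flink code: a HashMap stands in for the broadcast state, and two lists stand in for the main output and the side output. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a dimension lookup with a side channel for misses.
final class DimensionLookupSketch {
    private final Map<String, String> rules = new HashMap<>();
    final List<String> mainOutput = new ArrayList<>();
    final List<String> sideOutput = new ArrayList<>();

    // Mirrors processBroadcastElement: store a "key,value" rule in the map.
    void onRule(String rule) {
        String[] parts = rule.split(",");
        rules.put(parts[0], parts[1]);
    }

    // Mirrors processElement: emit on a hit, divert the raw record on a miss.
    void onRecord(String value) {
        String matched = rules.get(value);
        if (matched != null) {
            mainOutput.add("out:" + matched);
        } else {
            sideOutput.add(value);
        }
    }
}
```

In an actual job the diverted records would more likely go through an OutputTag and the context's output method instead of a list, and they could be retried once the dimension data has arrived.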

That's a wrap.
