Flink Operators: CoGroup, Join, and Connect

vincent_hahaha

Article link: https://blog.csdn.net/vincent_duan/article/details/102149294

Flink offers three operations that turn two data streams into one: cogroup, join, and coflatmap. This post compares what each of the three does and how to use it.

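Join: outputs only the element pairs that satisfy the match condition.
CoGroup: outputs the matched element pairs and also the elements that found no match.
CoFlatMap: has no match condition and does no matching; it processes the elements of each stream separately. Join and cogroup behaviour can be built on top of it, so it is more flexible to use than either of them.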

For join and cogroup, the code is structured roughly as follows:

val stream1 = ...
val stream2 = ...

stream1.join(stream2)
    .where(_._1).equalTo(_._1) // join condition: a field of stream1 equals a field of stream2
    .window(...) // assign a window; elements of stream1 and stream2 fall into it, and only elements inside the same window are joined
    .apply((t1, t2, out: Collector[String]) => {
      out.collect(...) // t1 and t2 are a matched pair; assemble the output record here
    })
    .print()

The CoGroup operation

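This operation groups two data streams (or datasets) by key and then processes the records that share a key. It differs slightly from join: a record in one stream or dataset that has no match in the other is still output.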
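The grouping behaviour described above can be sketched in plain Java. This is only an illustration of the semantics, not Flink code: the class name `CoGroupSketch`, the `String[]{key, value}` record layout, and the sample data are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java illustration of coGroup semantics. Hypothetical class, NOT Flink API.
class CoGroupSketch {

    // Groups both inputs by key and emits one line per key,
    // even when one of the two sides has no record for that key.
    // Each record is a String[]{key, value}.
    static List<String> coGroup(List<String[]> left, List<String[]> right) {
        // TreeMap/TreeSet keep keys sorted so the output order is deterministic.
        Map<String, List<String>> leftByKey = new TreeMap<>();
        Map<String, List<String>> rightByKey = new TreeMap<>();
        for (String[] record : left) {
            leftByKey.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        for (String[] record : right) {
            rightByKey.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        // Union of the keys seen on either side.
        Set<String> keys = new TreeSet<>(leftByKey.keySet());
        keys.addAll(rightByKey.keySet());

        List<String> out = new ArrayList<>();
        for (String key : keys) {
            List<String> l = leftByKey.getOrDefault(key, Collections.emptyList());
            List<String> r = rightByKey.getOrDefault(key, Collections.emptyList());
            out.add(key + ": left=" + l + " right=" + r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
                new String[]{"1", "tom"}, new String[]{"2", "jerry"});
        List<String[]> right = Arrays.asList(
                new String[]{"2", "jack"}, new String[]{"1", "rose"}, new String[]{"3", "sofia"});
        coGroup(left, right).forEach(System.out::println);
    }
}
```

Key 3 appears only on the right side, yet it still produces a line (`3: left=[] right=[sofia]`). An inner join over the same data would emit nothing for that key.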

In DataStream

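The emphasis is on grouping: it operates on the two collections that belong to the same key.
If an element in one stream finds no matching data in the other stream's window, a result is still output; it simply contains data from one stream only.
It can only be used inside a window.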

Here is a simple example. It reads from two different ports to simulate two streams, processes them with CoGroup, and lets us observe the output:

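import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.util.Collector;

import java.util.Iterator;

public class CogroupFunction {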
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source1 = env.socketTextStream("localhost", 9091);
        DataStreamSource<String> source2 = env.socketTextStream("localhost", 9092);

        SingleOutputStreamOperator<Tuple2<String, String>> input1 = source1.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) throws Exception {
                return Tuple2.of(value.split(" ")[0], value.split(" ")[1]);
            }
        });

        SingleOutputStreamOperator<Tuple2<String, String>> input2 = source2.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) throws Exception {
                return Tuple2.of(value.split(" ")[0], value.split(" ")[1]);
            }
        });

        // Key both streams by their first field (f0) so records with the same key land in the same group.
        DataStream<String> apply = input1.coGroup(input2).where(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) throws Exception {
                return value.f0;
            }
        }).equalTo(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) throws Exception {
                return value.f0;
            }
        }).window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
                .trigger(CountTrigger.of(1)).apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public void coGroup(Iterable<Tuple2<String, String>> first, Iterable<Tuple2<String, String>> second, Collector<String> out) throws Exception {
                StringBuilder buffer = new StringBuilder();
                buffer.append("input1:");
                Iterator<Tuple2<String, String>> iterator1 = first.iterator();
                while (iterator1.hasNext()) {
                    Tuple2<String, String> next = iterator1.next();
                    buffer.append(next.f0 + "=>" + next.f1);
                }
                buffer.append("input2:");
                Iterator<Tuple2<String, String>> iterator2 = second.iterator();
                while (iterator2.hasNext()) {
                    Tuple2<String, String> next = iterator2.next();
                    buffer.append(next.f0 + "=>" + next.f1);
                }
                out.collect(buffer.toString());
            }
        });
        apply.print();
        env.execute("CogroupFunction");

    }
}

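In DataSet

In the example below, the key is a student's class ID and the value is the student's name. The cogroup operation merges the records that share a key across the two datasets: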

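import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

public class CogroupFunction {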
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env=ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Long, String>> source1=env.fromElements(
                Tuple2.of(1L,"tom"),
                Tuple2.of(2L,"jerry"));

        DataSet<Tuple2<Long, String>> source2=env.fromElements(
                Tuple2.of(2L,"jack"),
                Tuple2.of(1L,"rose"),
                Tuple2.of(3L,"sofia"));

        source1.coGroup(source2)
                .where(0).equalTo(0)
                .with(new CoGroupFunction<Tuple2<Long,String>, Tuple2<Long,String>, Object>() {

                    @Override
                    public void coGroup(Iterable<Tuple2<Long, String>> iterable,
                                        Iterable<Tuple2<Long, String>> iterable1, Collector<Object> collector) throws Exception {
                        Map<Long,String> map=new HashMap<Long,String>();
                        for(Tuple2<Long,String> tuple:iterable){
                            String str=map.get(tuple.f0);
                            if(str==null){
                                map.put(tuple.f0,tuple.f1);
                            }else{
                                if(!str.equals(tuple.f1))
                                    map.put(tuple.f0,str+" "+tuple.f1);
                            }
                        }

                        for(Tuple2<Long,String> tuple:iterable1){
                            String str=map.get(tuple.f0);
                            if(str==null){
                                map.put(tuple.f0,tuple.f1);
                            }else{
                                if(!str.equals(tuple.f1))
                                    map.put(tuple.f0,str+" "+tuple.f1);
                            }
                        }
                        collector.collect(map);
                    }
                }).print();

    }
}

The output is as follows:

{3=sofia}
{1=tom rose}
{2=jerry jack}

CoGroup does much the same job as join, with one difference: if a newly arrived element finds no matching data from the other stream in the window, it is still output.

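The Join operation

Join in Flink is similar to a join in SQL: it picks the matching elements from the two streams according to a condition and passes them downstream.
Example code: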

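import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class JoinFunctionDemo {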
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source1 = env.socketTextStream("localhost", 9001);
        DataStreamSource<String> source2 = env.socketTextStream("localhost", 9002);
        DataStream<String> apply = source1.join(source2).where(new KeySelector<String, String>() {
            @Override
            public String getKey(String value) throws Exception {
                return value.split(" ")[0];
            }
        }).equalTo(new KeySelector<String, String>() {
            @Override
            public String getKey(String value) throws Exception {
                return value.split(" ")[0];
            }
        }).window(ProcessingTimeSessionWindows.withGap(Time.seconds(30))).trigger(CountTrigger.of(1))
                .apply(new JoinFunction<String, String, String>() {
                    @Override
                    public String join(String first, String second) throws Exception {
                        return first.split(" ")[1] + "<=>" + second.split(" ")[1];
                    }
                });
        apply.print();
        env.execute("JoinFunctionDemo");


    }
}

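For ease of testing, a session window is used here. Two elements are matched only if they arrive no more than 30 seconds apart. (A session window has no fixed start or end time: as long as the gap between two elements does not exceed the configured value they are assigned to the same window; otherwise the later element goes into a new window.)
The window's default trigger is replaced with a count trigger. Here that means the computation fires immediately every time an element arrives.
The two matched elements are then processed. For example, if the arriving data is (1, "a") and (1, "b"), the output sent downstream is "a<=>b".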
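The gap rule for session windows can be illustrated with a small plain-Java sketch. It is not Flink's window assigner: the class name `SessionGapSketch`, the millisecond timestamps, and the restriction to arrivals for a single key in ascending order are assumptions made for the example. It follows the rule as stated in the text, where a gap of at most the configured value keeps elements in the same session.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of gap-based session grouping. Hypothetical class, NOT Flink API.
class SessionGapSketch {

    // Splits ascending arrival times (in milliseconds) into sessions.
    // A new session starts whenever the gap to the previous element exceeds gapMillis.
    static List<List<Long>> sessions(long[] arrivalMillis, long gapMillis) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long t : arrivalMillis) {
            if (current == null || t - previous > gapMillis) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(t);
            previous = t; // the session is extended by every element that joins it
        }
        return result;
    }

    public static void main(String[] args) {
        // Arrivals at 0 s, 10 s, 25 s and 70 s, with a 30 s gap.
        long[] arrivals = {0L, 10_000L, 25_000L, 70_000L};
        System.out.println(sessions(arrivals, 30_000L)); // [[0, 10000, 25000], [70000]]
    }
}
```

The element at 70 s arrives 45 s after the previous one, so it opens a new session and can no longer be paired with the first three.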

Now let's test the program.
Open two terminals and run nc -lk 127.0.0.1 9001 in one and nc -lk 127.0.0.1 9002 in the other, matching the two ports the program reads from.

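In terminal 1, enter 1 a, then in terminal 2 enter 2 b. Watch the program console: nothing is printed. These two records do not satisfy the match condition, so there is no output.

Within 30 seconds, enter 1 c in terminal 2: the console prints the result a<=>c. Then enter 1 d: the console prints two results, a<=>c and a<=>d.

Wait 30 seconds, then enter 1 e in terminal 2: the console prints nothing. Because of the session window, this record is not in the same window as the earlier data from stream1, so there is no match and nothing is printed.
To sum up, we can conclude: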

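join returns only matched pairs. If the window contains no data that matches an element, nothing is output for it.
join outputs every matched pair in the window.
Data outside the window is never matched.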
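These conclusions can be checked against a plain-Java sketch of what a join computes over the contents of one window. It is not Flink code: the class name `WindowJoinSketch` is hypothetical, and records are assumed to be "key value" strings like the socket input used in the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of an inner join over one window's contents. Hypothetical class, NOT Flink API.
class WindowJoinSketch {

    // Emits "leftValue<=>rightValue" for every left/right pair that shares a key.
    // Records without a partner on the other side produce no output.
    static List<String> join(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            String[] lp = l.split(" ");
            for (String r : right) {
                String[] rp = r.split(" ");
                if (lp[0].equals(rp[0])) {
                    out.add(lp[1] + "<=>" + rp[1]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Window contents after "1 a" on the first stream and "2 b", "1 c", "1 d" on the second.
        List<String> left = Arrays.asList("1 a");
        List<String> right = Arrays.asList("2 b", "1 c", "1 d");
        System.out.println(join(left, right)); // [a<=>c, a<=>d]
    }
}
```

The record 2 b has no partner and is dropped, while key 1 yields every combination of left and right values. Since the count trigger fires on each arriving element and the whole window is evaluated again each time, the pair a<=>c shows up a second time once 1 d arrives.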

The CoFlatMap operation

By comparison, CoFlatMap is much simpler than the two operations above. The work is done mainly in a CoFlatMapFunction.
Here is the code of CoFlatMapFunction:

public interface CoFlatMapFunction<IN1, IN2, OUT> extends Function, Serializable {

    /**
     * This method is called for each element in the first of the connected streams.
     *
     * @param value The stream element
     * @param out The collector to emit resulting elements to
     * @throws Exception The function may throw exceptions which cause the streaming program
     *                   to fail and go into recovery.
     */
    void flatMap1(IN1 value, Collector<OUT> out) throws Exception;

    /**
     * This method is called for each element in the second of the connected streams.
     *
     * @param value The stream element
     * @param out The collector to emit resulting elements to
     * @throws Exception The function may throw exceptions which cause the streaming program
     *                   to fail and go into recovery.
     */
    void flatMap2(IN2 value, Collector<OUT> out) throws Exception;
}

Put simply, flatMap1 is called when data arrives on stream1, and flatMap2 is called when stream2 receives data.
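To show how two per-stream callbacks are enough to build a join by hand, here is a plain-Java sketch. It is not Flink code: `TwoInputJoinSketch` is a hypothetical class that only mirrors the flatMap1/flatMap2 shape, a `List<String>` stands in for the `Collector`, and the in-memory maps are an assumption of the sketch that is never cleaned up. A real job would more likely keep this data in Flink's managed state and expire it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogue of a two-input function. Hypothetical class, NOT Flink API.
class TwoInputJoinSketch {
    // Values seen so far on each input, grouped by key. Grows without bound in this sketch.
    private final Map<String, List<String>> seen1 = new HashMap<>();
    private final Map<String, List<String>> seen2 = new HashMap<>();

    // Handles one "key value" record from the first input.
    void flatMap1(String record, List<String> out) {
        String[] parts = record.split(" ");
        seen1.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        // Pair the new value with everything already seen on the second input.
        for (String other : seen2.getOrDefault(parts[0], Collections.emptyList())) {
            out.add(parts[1] + "<=>" + other);
        }
    }

    // Handles one "key value" record from the second input.
    void flatMap2(String record, List<String> out) {
        String[] parts = record.split(" ");
        seen2.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        // Pair the new value with everything already seen on the first input.
        for (String other : seen1.getOrDefault(parts[0], Collections.emptyList())) {
            out.add(other + "<=>" + parts[1]);
        }
    }

    public static void main(String[] args) {
        TwoInputJoinSketch fn = new TwoInputJoinSketch();
        List<String> out = new ArrayList<>();
        fn.flatMap1("1 a", out);
        fn.flatMap2("2 b", out);
        fn.flatMap2("1 c", out);
        fn.flatMap2("1 d", out);
        System.out.println(out); // [a<=>c, a<=>d]
    }
}
```

Because each callback decides for itself what to remember and what to emit, the same structure could just as well emit unmatched records, which is what makes this style more flexible than a fixed join.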

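The Connect operation

Connect applies only to DataStream. It does no matching of its own: the two streams are connected and each one is processed separately by its own method, as with CoFlatMap above, which makes it more flexible than join and CoGroup.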

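Summary

Join, CoGroup, and CoFlatMap can all turn two data streams into a single one. Join and CoGroup pair up data according to a specified condition; the difference is that Join outputs only successfully matched pairs, while CoGroup produces output whether or not there is a match. CoFlatMap does no matching and simply receives the input of each stream separately. Choose among these two-stream operations according to your business requirements.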
