[Plain-Talk Flink Advanced Theory] A Summary of the Join Operations in Flink (Flink 1.12)

橙心橙意橙续缘

Category: 白话Flink基础理论 (Plain-Talk Flink Basic Theory). Tags: java, big data, flink

Article link: https://blog.csdn.net/qq_18506419/article/details/111694129

Written by 橙心橙意橙续缘

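Preface

The Plain-Talk Series
------------------------
When writing this series I deliberately ignore the usual constraints of formal writing. I put what I have learned, together with the reasoning and thinking behind it, into plain language, so that as much information as possible gets out of my head and into the heads of the readers who need it. PS: the more technical parts are still described in proper technical terms, so rest assured!

The Plain-Talk Flink Series
------------------------
This series mainly records what I (a graduate student at a Chinese "985" university) learned while studying Flink's basic theory and, more importantly, what I thought about along the way. The videos, blogs, and literature I consulted are all listed at the end of the article. My research group and research direction are closely tied to Flink, and learning Flink is also a real plus for anyone aiming at a big tech company, so I will keep revising and improving this series as my study deepens. Criticism, corrections, and suggestions are all welcome.

Before We Start
------------------------
Join is one of the most common operations in SQL, and Flink implements it in several of its APIs. The Join in the Table API resembles the SQL Join, but many of the others, such as the Join on Windows, are considerably more complex. This article therefore collects all of the Join variants in Flink in one place, using Flink 1.12, the latest version at the time of writing.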

Join in the DataSet API (Inner Join)

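The Join transformation in the DataSet API joins two DataSets into one. The elements of both DataSets are joined on one or more keys, which can be specified in any of the following ways:

a key expression
a key-selector function
one or more field position keys (Tuple DataSet only)
case class fields (Scala)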

Default Join

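The default Join transformation produces a new Tuple DataSet with two fields. Each tuple holds a joined element of the first input DataSet in its first field and the matching element of the second input DataSet in its second field.
Operators involved: .join().where().equalTo()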

public static class User {
    public String name;
    public int zip;
}
public static class Store {
    public Manager mgr;
    public int zip;
}
DataSet<User> input1 = // [...]
DataSet<Store> input2 = // [...]
// result dataset is typed as Tuple2
DataSet<Tuple2<User, Store>>
            result = input1.join(input2)
                           .where("zip")       // key of the first input (users)
                           .equalTo("zip");    // key of the second input (stores)
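
To make the shape of the default join result concrete, here is a minimal plain-Java sketch that mimics the semantics on in-memory lists, without using Flink at all. The class DefaultJoinSketch, the Pair holder (standing in for Tuple2), and the join helper are hypothetical names introduced only for illustration; Flink's real execution is distributed and chooses its own join strategy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: an equi inner join on in-memory lists.
class DefaultJoinSketch {

    // Stand-in for Tuple2: f0 holds the left element, f1 the matching right element.
    static final class Pair<L, R> {
        final L f0;
        final R f1;

        Pair(L f0, R f1) {
            this.f0 = f0;
            this.f1 = f1;
        }
    }

    // Pairs every left element with every right element that has an equal key.
    static <L, R, K> List<Pair<L, R>> join(List<L> left, List<R> right,
                                           Function<L, K> leftKey,
                                           Function<R, K> rightKey) {
        // Build phase: index the right input by its key.
        Map<K, List<R>> index = new HashMap<>();
        for (R r : right) {
            index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Probe phase: emit one pair per match; unmatched elements are dropped.
        List<Pair<L, R>> joined = new ArrayList<>();
        for (L l : left) {
            List<R> matches = index.get(leftKey.apply(l));
            if (matches == null) {
                continue;
            }
            for (R r : matches) {
                joined.add(new Pair<>(l, r));
            }
        }
        return joined;
    }
}
```

Calling DefaultJoinSketch.join(users, stores, u -> u.zip, s -> s.zip) on two lists would return one Pair per (user, store) combination that shares a zip code. A user with no matching store does not appear in the output, which is what makes this an inner join.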

Join with Join Function

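A Join transformation can also call a user-defined join function to process each joined pair. The join function receives one element of the first input DataSet and one element of the second input DataSet, and returns exactly one element.
Operators involved: .with(new JoinFunction())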

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

// some POJO
public class Rating {
  public String name;
  public String category;
  public int points;
}

// Join function that joins a custom POJO with a Tuple
public class PointWeighter
         implements JoinFunction<Rating, Tuple2<String, Double>, Tuple2<String, Double>> {

  @Override
  public Tuple2<String, Double> join(Rating rating, Tuple2<String, Double> weight) {
    // multiply the points by the weight and construct a new output tuple
    return new Tuple2<String, Double>(rating.name, rating.points * weight.f1);
  }
}

DataSet<Rating> ratings = // [...]
DataSet<Tuple2<String, Double>> weights = // [...]
DataSet<Tuple2<String, Double>>
            weightedRatings =
            ratings.join(weights)

                   // key of the first input
                   .where("category")  // field of the POJO

                   // key of the second input
                   .equalTo("f0")      // field f0 (position 0) of the Tuple

                   // applying the JoinFunction on joining pairs
                   .with(new PointWeighter());
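
The "exactly one output per matching pair" contract can be sketched in plain Java as follows. This is an in-memory illustration, not Flink code: JoinFunctionSketch, joinWith, and the keysMatch predicate are hypothetical names, and a simple nested loop stands in for whatever join strategy Flink actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BiPredicate;

// Hypothetical illustration only: a join whose result is shaped by a user function.
class JoinFunctionSketch {

    static <L, R, O> List<O> joinWith(List<L> left, List<R> right,
                                      BiPredicate<L, R> keysMatch,
                                      BiFunction<L, R, O> fn) {
        List<O> joined = new ArrayList<>();
        for (L l : left) {
            for (R r : right) {
                if (keysMatch.test(l, r)) {
                    // The function is called once per matching pair
                    // and contributes exactly one output element.
                    joined.add(fn.apply(l, r));
                }
            }
        }
        return joined;
    }
}
```

The BiFunction plays the role of the PointWeighter above: it sees one element from each side and decides what the joined record looks like, but it cannot drop a pair or emit more than one record for it.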

Join with Flat-Join Function

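Analogous to Map and FlatMap, a FlatJoin behaves like a Join, but instead of returning exactly one element it can emit zero, one, or more elements through a Collector.
Operators involved: .with(new FlatJoinFunction())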

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class PointWeighter
         implements FlatJoinFunction<Rating, Tuple2<String, Double>, Tuple2<String, Double>> {

  @Override
  public void join(Rating rating, Tuple2<String, Double> weight,
                   Collector<Tuple2<String, Double>> out) {
    // emit a result only for weights above the threshold; other pairs produce nothing
    if (weight.f1 > 0.1) {
      out.collect(new Tuple2<String, Double>(rating.name, rating.points * weight.f1));
    }
  }
}

DataSet<Tuple2<String, Double>>
            weightedRatings =
            ratings.join(weights)
            // key of the first input
            .where("category")  // field of the POJO

            // key of the second input
            .equalTo("f0")      // field f0 (position 0) of the Tuple

            // applying the FlatJoinFunction on joining pairs
            .with(new PointWeighter());
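
The difference from a plain join function is only in how results leave the function. A minimal plain-Java sketch, assuming an in-memory nested-loop join and using hypothetical names throughout (FlatJoinSketch, FlatJoiner, flatJoin, and a java.util.function.Consumer standing in for Flink's Collector):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Consumer;

// Hypothetical illustration only: in-memory analogue of a flat join.
class FlatJoinSketch {

    // Analogue of a flat-join function: results are emitted through the
    // collector rather than returned, so a matching pair may yield 0..n elements.
    interface FlatJoiner<L, R, O> {
        void join(L left, R right, Consumer<O> collector);
    }

    static <L, R, O> List<O> flatJoin(List<L> left, List<R> right,
                                      BiPredicate<L, R> keysMatch,
                                      FlatJoiner<L, R, O> fn) {
        List<O> emitted = new ArrayList<>();
        for (L l : left) {
            for (R r : right) {
                if (keysMatch.test(l, r)) {
                    // Everything the function hands to the collector ends up in the output.
                    fn.join(l, r, emitted::add);
                }
            }
        }
        return emitted;
    }
}
```

A function that calls the collector only when a condition holds acts as a combined join and filter, which is what the weight.f1 > 0.1 check in the example above does.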

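Join with Projection (Java Only)

Join with Projection selects which fields of the joined elements go into the new DataSet, and in what order.

Operators involved: .projectFirst(), .projectSecond()
The example below uses Tuple DataSets.
Join projection also works for non-Tuple DataSets. In that case, projectFirst() or projectSecond() must be called without arguments to add the joined element as a whole to the output tuple.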

DataSet<Tuple3<Integer, Byte, String>> input1 = // [...]
DataSet<Tuple2<Integer, Double>> input2 = // [...]
DataSet<Tuple4<Integer, String, Double, Byte>>
            result =
            input1.join(input2)
                  // key definition on first DataSet using a field position key
                  .where(0)
                  // key definition of second DataSet using a field position key
                  .equalTo(0)
                  // select and reorder fields of matching tuples
                  .projectFirst(0,2).projectSecond(1).projectFirst(1);

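projectFirst(int...) and projectSecond(int...) select the fields of the first and second joined input that are assembled into the output tuple. The order of the indexes defines the order of the fields in the output tuple. In the example above, the output is (input1.f0, input1.f2, input2.f1, input1.f1), which matches the declared Tuple4<Integer, String, Double, Byte>.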
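
The ordering rule can be checked with a small plain-Java sketch. It is not Flink code: ProjectionSketch is a hypothetical class that merely records the projection steps in call order and applies them to two Object[] rows standing in for the joined tuples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: records projection steps in call order.
class ProjectionSketch {
    private final List<Boolean> fromFirst = new ArrayList<>();
    private final List<Integer> fieldIndex = new ArrayList<>();

    // Appends the given fields of the first input to the output layout.
    ProjectionSketch projectFirst(int... indexes) {
        for (int i : indexes) {
            fromFirst.add(true);
            fieldIndex.add(i);
        }
        return this;
    }

    // Appends the given fields of the second input to the output layout.
    ProjectionSketch projectSecond(int... indexes) {
        for (int i : indexes) {
            fromFirst.add(false);
            fieldIndex.add(i);
        }
        return this;
    }

    // Builds one output row from a joined pair of input rows.
    Object[] apply(Object[] first, Object[] second) {
        Object[] projected = new Object[fieldIndex.size()];
        for (int i = 0; i < projected.length; i++) {
            Object[] source = fromFirst.get(i) ? first : second;
            projected[i] = source[fieldIndex.get(i)];
        }
        return projected;
    }
}
```

With the same call chain as in the example, new ProjectionSketch().projectFirst(0, 2).projectSecond(1).projectFirst(1), a first row {1, (byte) 7, "x"} and a second row {1, 2.5} are combined into {1, "x", 2.5, (byte) 7}, that is, Integer, String, Double, Byte.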

Join with DataSet Size Hint
