【flink番外篇】1、flink的23种常用算子介绍及详细示例（3）-window、distinct、join等

原创已于 2024-04-10 12:28:29 修改 · 6.1w 阅读

24 ·

CC 4.0 BY-SA版权

文章标签：

#flink #数据库 #android #flink 流批一体化 #flink hive #flink kafka #flink operator

于 2023-12-05 10:43:48 首次发布

flink 示例专栏专栏收录该内容

75 篇文章

订阅专栏

本文是Flink系列文章，介绍了Flink专栏和示例专栏内容。重点讲解了Flink的10种常用算子，如first、distinct、join、Window等，并给出具体可运行示例。还提及可在Flink专栏了解更系统内容，本专题分为五篇。

Flink 系列文章

一、Flink 专栏

Flink 专栏系统介绍某一知识点，并辅以具体的示例进行说明。

1、Flink 部署系列
本部分介绍Flink的部署、配置相关基础内容。
2、Flink基础系列
本部分介绍Flink 的基础部分，比如术语、架构、编程模型、编程指南、基本的datastream api用法、四大基石等内容。
3、Flik Table API和SQL基础系列
本部分介绍Flink Table Api和SQL的基本用法，比如Table API和SQL创建库、表用法、查询、窗口函数、catalog等等内容。
4、Flik Table API和SQL提高与应用系列
本部分是table api 和sql的应用部分，和实际的生产应用联系更为密切，以及有一定开发难度的内容。
5、Flink 监控系列
本部分和实际的运维、监控工作相关。

二、Flink 示例专栏

Flink 示例专栏是 Flink 专栏的辅助说明，一般不会介绍知识点的信息，更多的是提供一个一个可以具体使用的示例。本专栏不再分目录，通过链接即可看出介绍的内容。

两专栏的所有文章入口点击：Flink 系列文章汇总索引

本文主要介绍Flink 的10种常用的operator（window、distinct、join等）及以具体可运行示例进行说明.
如果需要了解更多内容，可以在本人Flink 专栏中了解更新系统的内容。
本文除了maven依赖外，没有其他依赖。

本专题分为五篇，即：
【flink番外篇】1、flink的23种常用算子介绍及详细示例（1）- map、flatmap和filter
【flink番外篇】1、flink的23种常用算子介绍及详细示例（2）- keyby、reduce和Aggregations
【flink番外篇】1、flink的23种常用算子介绍及详细示例（3）-window、distinct、join等
 【flink番外篇】1、flink的23种常用算子介绍及详细示例（4）- union、window join、connect、outputtag、cache、iterator、project
【flink番外篇】1、flink的23种常用算子介绍及详细示例（完整版）

一、Flink的23种算子说明及示例

本文示例中使用的maven依赖和java bean 参考本专题的第一篇中的maven和java bean。

9、first、distinct、join、outjoin、cross

具体事例详见例子及结果。

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.datastreamapi.User;


/**
 * @author alanchan
 *
 */
public class TestFirst_Join_Distinct_OutJoin_CrossDemo {
	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		joinFunction(env);
		env.execute();

	}

	public static void unionFunction(StreamExecutionEnvironment env) throws Exception {
		List<String> info1 = new ArrayList<>();
		info1.add("team A");
		info1.add("team B");

		List<String> info2 = new ArrayList<>();
		info2.add("team C");
		info2.add("team D");

		List<String> info3 = new ArrayList<>();
		info3.add("team E");
		info3.add("team F");

		List<String> info4 = new ArrayList<>();
		info4.add("team G");
		info4.add("team H");

		DataStream<String> source1 = env.fromCollection(info1);
		DataStream<String> source2 = env.fromCollection(info2);
		DataStream<String> source3 = env.fromCollection(info3);
		DataStream<String> source4 = env.fromCollection(info4);

		source1.union(source2).union(source3).union(source4).print();
//        team A
//        team C
//        team E
//        team G
//        team B
//        team D
//        team F
//        team H
	}

	public static void crossFunction(ExecutionEnvironment env) throws Exception {
		// cross,求两个集合的笛卡尔积,得到的结果数为：集合1的条数 乘以 集合2的条数
		List<String> info1 = new ArrayList<>();
		info1.add("team A");
		info1.add("team B");

		List<Tuple2<String, Integer>> info2 = new ArrayList<>();
		info2.add(new Tuple2("W", 3));
		info2.add(new Tuple2("D", 1));
		info2.add(new Tuple2("L", 0));

		DataSource<String> data1 = env.fromCollection(info1);
		DataSource<Tuple2<String, Integer>> data2 = env.fromCollection(info2);

		data1.cross(data2).print();
//        (team A,(W,3))
//        (team A,(D,1))
//        (team A,(L,0))
//        (team B,(W,3))
//        (team B,(D,1))
//        (team B,(L,0))
	}

	public static void outerJoinFunction(ExecutionEnvironment env) throws Exception {
		// Outjoin,跟sql语句中的left join,right join,full join意思一样
		// leftOuterJoin,跟join一样，但是左边集合的没有关联上的结果也会取出来,没关联上的右边为null
		// rightOuterJoin,跟join一样,但是右边集合的没有关联上的结果也会取出来,没关联上的左边为null
		// fullOuterJoin,跟join一样,但是两个集合没有关联上的结果也会取出来,没关联上的一边为null
		List<Tuple2<Integer, String>> info1 = new ArrayList<>();
		info1.add(new Tuple2<>(1, "shenzhen"));
		info1.add(new Tuple2<>(2, "guangzhou"));
		info1.add(new Tuple2<>(3, "shanghai"));
		info1.add(new Tuple2<>(4, "chengdu"));

		List<Tuple2<Integer, String>> info2 = new ArrayList<>();
		info2.add(new Tuple2<>(1, "深圳"));
		info2.add(new Tuple2<>(2, "广州"));
		info2.add(new Tuple2<>(3, "上海"));
		info2.add(new Tuple2<>(5, "杭州"));

		DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
		DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
		// left join
//        eft join:7> (1,shenzhen,深圳)
//        left join:2> (3,shanghai,上海)
//        left join:8> (4,chengdu,未知)
//        left join:16> (2,guangzhou,广州)
		data1.leftOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {

			@Override
			public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
				Tuple3<Integer, String, String> tuple = new Tuple3();
				if (second == null) {
					tuple.setField(first.f0, 0);
					tuple.setField(first.f1, 1);
					tuple.setField("未知", 2);
				} else {
					// 另外一种赋值方式，和直接用构造函数赋值相同
					tuple.setField(first.f0, 0);
					tuple.setField(first.f1, 1);
					tuple.setField(second.f1, 2);
				}
				return tuple;
			}
		}).print("left join");

		// right join
//        right join:2> (3,shanghai,上海)
//        right join:7> (1,shenzhen,深圳)
//        right join:15> (5,--,杭州)
//        right join:16> (2,guangzhou,广州)
		data1.rightOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {

			@Override
			public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
				Tuple3<Integer, String, String> tuple = new Tuple3();
				if (first == null) {
					tuple.setField(second.f0, 0);
					tuple.setField("--", 1);
					tuple.setField(second.f1, 2);
				} else {
					// 另外一种赋值方式，和直接用构造函数赋值相同
					tuple.setField(first.f0, 0);
					tuple.setField(first.f1, 1);
					tuple.setField(second.f1, 2);
				}
				return tuple;
			}
		}).print("right join");

		// fullOuterJoin
//        fullOuterJoin:2> (3,shanghai,上海)
//        fullOuterJoin:8> (4,chengdu,--)
//        fullOuterJoin:15> (5,--,杭州)
//        fullOuterJoin:16> (2,guangzhou,广州)
//        fullOuterJoin:7> (1,shenzhen,深圳)
		data1.fullOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {

			@Override
			public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
				Tuple3<Integer, String, String> tuple = new Tuple3();
				if (second == null) {
					tuple.setField(first.f0, 0);
					tuple.setField(first.f1, 1);
					tuple.setField("--", 2);
				} else if (first == null) {
					tuple.setField(second.f0, 0);
					tuple.setField("--", 1);
					tuple.setField(second.f1, 2);
				} else {
					// 另外一种赋值方式，和直接用构造函数赋值相同
					tuple.setField(first.f0, 0);
					tuple.setField(first.f1, 1);
					tuple.setField(second.f1, 2);
				}
				return tuple;
			}
		}).print("fullOuterJoin");
	}

	public static void joinFunction(ExecutionEnvironment env) throws Exception {
		List<Tuple2<Integer, String>> info1 = new ArrayList<>();
		info1.add(new Tuple2<>(1, "shenzhen"));
		info1.add(new Tuple2<>(2, "guangzhou"));
		info1.add(new Tuple2<>(3, "shanghai"));
		info1.add(new Tuple2<>(4, "chengdu"));

		List<Tuple2<Integer, String>> info2 = new ArrayList<>();
		info2.add(new Tuple2<>(1, "深圳"));
		info2.add(new Tuple2<>(2, "广州"));
		info2.add(new Tuple2<>(3, "上海"));
		info2.add(new Tuple2<>(5, "杭州"));

		DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
		DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);

		//

//        join:2> ((3,shanghai),(3,上海))
//        join:16> ((2,guangzhou),(2,广州))
//        join:7> ((1,shenzhen),(1,深圳))
		data1.join(data2).where(0).equalTo(0).print("join");

//        join2:2> (3,上海,shanghai)
//        join2:7> (1,深圳,shenzhen)
//        join2:16> (2,广州,guangzhou)
		DataSet<Tuple3<Integer, String, String>> data3 = data1.join(data2).where(0).equalTo(0)
				.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {

					@Override
					public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
						return new Tuple3<Integer, String, String>(first.f0, second.f1, first.f1);
					}
				});
		data3.print("join2");

	}

	public static void firstFunction(ExecutionEnvironment env) throws Exception {
		List<Tuple2<Integer, String>> info = new ArrayList<>();
		info.add(new Tuple2(1, "Hadoop"));
		info.add(new Tuple2(1, "Spark"));
		info.add(new Tuple2(1, "Flink"));
		info.add(new Tuple2(2, "Scala"));
		info.add(new Tuple2(2, "Java"));
		info.add(new Tuple2(2, "Python"));
		info.add(new Tuple2(3, "Linux"));
		info.add(new Tuple2(3, "Window"));
		info.add(new Tuple2(3, "MacOS"));

		DataSet<Tuple2<Integer, String>> dataSet = env.fromCollection(info);
		// 前几个
//	        dataSet.first(4).print();
//	        (1,Hadoop)
//	        (1,Spark)
//	        (1,Flink)
//	        (2,Scala)

		// 按照tuple2的第一个元素进行分组，查出每组的前2个
//	        dataSet.groupBy(0).first(2).print();
//	        (3,Linux)
//	        (3,Window)
//	        (1,Hadoop)
//	        (1,Spark)
//	        (2,Scala)
//	        (2,Java)

		// 按照tpule2的第一个元素进行分组，并按照倒序排列，查出每组的前2个
		dataSet.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
//	        (3,Window)
//	        (3,MacOS)
//	        (1,Spark)
//	        (1,Hadoop)
//	        (2,Scala)
//	        (2,Python)
	}

	public static void distinctFunction(ExecutionEnvironment env) throws Exception {
		List list = new ArrayList<Tuple3<Integer, Integer, Integer>>();
		list.add(new Tuple3<>(0, 3, 6));
		list.add(new Tuple3<>(0, 2, 5));
		list.add(new Tuple3<>(0, 3, 6));
		list.add(new Tuple3<>(1, 1, 9));
		list.add(new Tuple3<>(1, 2, 8));
		list.add(new Tuple3<>(1, 2, 8));
		list.add(new Tuple3<>(1, 3, 9));

		DataSet<Tuple3<Integer, Integer, Integer>> source = env.fromCollection(list);
		// 去除tuple3中元素完全一样的
		source.distinct().print();
//		(1,3,9)
//		(0,3,6)
//		(1,1,9)
//		(1,2,8)
//		(0,2,5)
		// 去除tuple3中第一个元素一样的，只保留第一个
		// source.distinct(0).print();
//		(1,1,9)
//		(0,3,6)
		// 去除tuple3中第一个和第三个相同的元素，只保留第一个
		// source.distinct(0,2).print();
//		(0,3,6)
//		(1,1,9)
//		(1,2,8)
//		(0,2,5)
	}

	public static void distinctFunction2(ExecutionEnvironment env) throws Exception {
		DataSet<User> source = env.fromCollection(Arrays.asList(new User(1, "alan1", "1", "1@1.com", 18, 3000), new User(2, "alan2", "2", "2@2.com", 19, 200),
				new User(3, "alan1", "3", "3@3.com", 18, 1000), new User(5, "alan1", "5", "5@5.com", 28, 1500), new User(4, "alan2", "4", "4@4.com", 20, 300)));

//		source.distinct("name").print();
//		User(id=2, name=alan2, pwd=2, email=2@2.com, age=19, balance=200.0)
//		User(id=1, name=alan1, pwd=1, email=1@1.com, age=18, balance=3000.0)

		source.distinct("name", "age").print();
//		User(id=1, name=alan1, pwd=1, email=1@1.com, age=18, balance=3000.0)
//		User(id=2, name=alan2, pwd=2, email=2@2.com, age=19, balance=200.0)
//		User(id=5, name=alan1, pwd=5, email=5@5.com, age=28, balance=1500.0)
//		User(id=4, name=alan2, pwd=4, email=4@4.com, age=20, balance=300.0)
	}

	public static void distinctFunction3(ExecutionEnvironment env) throws Exception {
		DataSet<User> source = env.fromCollection(Arrays.asList(new User(1, "alan1", "1", "1@1.com", 18, -1000), new User(2, "alan2", "2", "2@2.com", 19, 200),
				new User(3, "alan1", "3", "3@3.com", 18, -1000), new User(5, "alan1", "5", "5@5.com", 28, 1500), new User(4, "alan2", "4", "4@4.com", 20, -300)));
		// 针对balance增加绝对值去重
		source.distinct(new KeySelector<User, Double>() {
			@Override
			public Double getKey(User value) throws Exception {
				return Math.abs(value.getBalance());
			}
		}).print();
//		User(id=5, name=alan1, pwd=5, email=5@5.com, age=28, balance=1500.0)
//		User(id=2, name=alan2, pwd=2, email=2@2.com, age=19, balance=200.0)
//		User(id=1, name=alan1, pwd=1, email=1@1.com, age=18, balance=-1000.0)
//		User(id=4, name=alan2, pwd=4, email=4@4.com, age=20, balance=-300.0)
	}

	public static void distinctFunction4(ExecutionEnvironment env) throws Exception {
		List<String> info = new ArrayList<>();
		info.add("Hadoop,Spark");
		info.add("Spark,Flink");
		info.add("Hadoop,Flink");
		info.add("Hadoop,Flink");

		DataSet<String> source = env.fromCollection(info);
		source.flatMap(new FlatMapFunction<String, String>() {

			@Override
			public void flatMap(String value, Collector<String> out) throws Exception {
				System.err.print("come in ");
				for (String token : value.split(",")) {
					out.collect(token);
				}
			}
		});
		source.distinct().print();
	}

}

10、Window

KeyedStream → WindowedStream
Window 函数允许按时间或其他条件对现有 KeyedStream 进行分组。以下是以 10 秒的时间窗口聚合：

inputStream.keyBy(0).window(Time.seconds(10));

Flink 定义数据片段以便（可能）处理无限数据流。这些切片称为窗口。此切片有助于通过应用转换处理数据块。要对流进行窗口化，需要分配一个可以进行分发的键和一个描述要对窗口化流执行哪些转换的函数。要将流切片到窗口，可以使用 Flink 自带的窗口分配器。我们有选项，如 tumbling windows, sliding windows, global 和 session windows。
具体参考系列文章
6、Flink四大基石之Window详解与详细示例（一）
6、Flink四大基石之Window详解与详细示例（二）
7、Flink四大基石之Time和WaterMaker详解与详细示例（watermaker基本使用、kafka作为数据源的watermaker使用示例以及超出最大允许延迟数据的接收实现）

11、WindowAll

DataStream → AllWindowedStream
windowAll 函数允许对常规数据流进行分组。通常，这是非并行数据转换，因为它在非分区数据流上运行。
与常规数据流功能类似，也有窗口数据流功能。唯一的区别是它们处理窗口数据流。所以窗口缩小就像 Reduce 函数一样，Window fold 就像 Fold 函数一样，并且还有聚合。

dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data

这适用于非并行转换的大多数场景。所有记录都将收集到 windowAll 算子对应的一个任务中。

具体参考系列文章
6、Flink四大基石之Window详解与详细示例（一）
6、Flink四大基石之Window详解与详细示例（二）
7、Flink四大基石之Time和WaterMaker详解与详细示例（watermaker基本使用、kafka作为数据源的watermaker使用示例以及超出最大允许延迟数据的接收实现）

12、Window Apply

WindowedStream → DataStream
AllWindowedStream → DataStream
将通用 function 应用于整个窗口。下面是一个手动对窗口内元素求和的 function。

如果你使用 windowAll 转换，则需要改用 AllWindowFunction。

windowedStream.apply(new WindowFunction<Tuple2<String,Integer>, Integer, Tuple, Window>() {
    public void apply (Tuple tuple,
            Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (value t: values) {
            sum += t.f1;
        }
        out.collect (new Integer(sum));
    }
});

// 在 non-keyed 窗口流上应用 AllWindowFunction
allWindowedStream.apply (new AllWindowFunction<Tuple2<String,Integer>, Integer, Window>() {
    public void apply (Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (value t: values) {
            sum += t.f1;
        }
        out.collect (new Integer(sum));
    }
});

13、Window Reduce

WindowedStream → DataStream
对窗口应用 reduce function 并返回 reduce 后的值。

windowedStream.reduce (new ReduceFunction<Tuple2<String,Integer>>() {
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<String,Integer>(value1.f0, value1.f1 + value2.f1);
    }
});

14、Aggregations on windows

WindowedStream → DataStream
聚合窗口的内容。min和minBy之间的区别在于，min返回最小值，而minBy返回该字段中具有最小值的元素（max和maxBy相同）。

windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");