Flink Table API Programming

马本不想再等了

Tags: flink

Article link: https://blog.csdn.net/qq_42180284/article/details/104235468

Contents

1. What Is the Table API
- 1.1 Flink API Overview
- 1.2 Features of the Table API
2. Table API Programming
3. Table API Developments

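1. What Is the Table API

1.1 Flink API Overview

[Figure: Flink API overview]

1.2 Features of the Table API

[Figure: Table API & SQL]
Taking WordCount as an example, the Table API compares with SQL as follows:
High performance: a groupBy aggregation is computed only once; if several later select calls build on it, they reuse the result of that aggregation.
Unified stream and batch: the Table API exposes a single set of APIs for both stream and batch processing, which simplifies development.

[Figure: characteristics of the Table API and its relationship to SQL]

How to read this: the Table API makes data processing that involves multiple statements easier to write.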

// One filter per target: each result is inserted into a different table
table.filter("a < 10").insertInto("table1");
table.filter("a > 10").insertInto("table2");

In a case like this, the Table API is much simpler to use than SQL.
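The routing behaviour of the two statements can be checked without a Flink cluster. The following is a minimal plain-Java sketch, not Flink code: lists stand in for tables, and `FilterRoutingSketch` and its `filter` helper are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-Java stand-in for the two filter/insertInto statements; no Flink involved.
final class FilterRoutingSketch {

    // Hypothetical helper: copies every value that passes the predicate into a new "table".
    static List<Integer> filter(List<Integer> source, IntPredicate predicate) {
        List<Integer> sink = new ArrayList<>();
        for (int a : source) {
            if (predicate.test(a)) {
                sink.add(a);
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(3, 10, 25, 7, 42);
        // The same source feeds both statements, mirroring the two filter calls.
        List<Integer> table1 = filter(source, a -> a < 10);
        List<Integer> table2 = filter(source, a -> a > 10);
        System.out.println("table1 = " + table1); // prints "table1 = [3, 7]"
        System.out.println("table2 = " + table2); // prints "table2 = [25, 42]"
    }
}
```

Note that a row with a equal to 10 satisfies neither predicate, so it reaches neither target table.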

Overall, the Table API can be viewed as a superset of SQL: because it is Flink's own API, it offers improvements in usability, functionality, and extensibility.

2. Table API Programming

2.1 WordCount Example

https://github.com/hequn8128/TableApiDemo

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.OldCsv;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;

public class JavaBatchWordCount {

	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

		String path = JavaBatchWordCount.class.getClassLoader().getResource("words.txt").getPath();
		// Read the file
		tEnv.connect(new FileSystem().path(path))
			// Specify the format (CSV, with a newline as the line delimiter)
			.withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
			// Specify the schema
			.withSchema(new Schema().field("word", Types.STRING))
			// Register this source file in the table environment
			.registerTableSource("fileSource");
		// Scan the source to obtain a Table, then program against it with the Table API
		Table result = tEnv.scan("fileSource")
			.groupBy("word")
			.select("word, count(1) as count");

		tEnv.toDataSet(result, Row.class).print();
	}
}

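Note: when using a TableEnvironment, import the one from the package that matches your needs.
There are currently the following 8 TableEnvironments:
[Figure: the 8 TableEnvironment variants]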

2.2 Table API Operations

How to get a table?

A table can be thought of as something that is registered in the environment and then scanned back out of it.

tEnv.
	...
	...
	.registerTableSource("Mytable");
Table myTable = tEnv.scan("Mytable");

There are three ways to register a table:

Table descriptor
Specify a file system, a format, and a schema.

tEnv
	.connect(new FileSystem().path(path))
	.withFormat(new OldCsv()
		.field("word", Types.STRING)
		.lineDelimiter("\n"))
	.withSchema(new Schema().field("word", Types.STRING))
	.registerTableSource("sourceTable");

User defined table source
Implement a custom table source against the TableSource interface, then register it with the environment.

TableSource csvSource = new CsvTableSource(
path,
new String[]{"word"},
new TypeInformation[]{Types.STRING});
tEnv.registerTableSource("sourceTable2", csvSource);

Register a DataStream

DataStream<String> stream = ...
// register the DataStream as table "myTable3" with
// field "word"
tableEnv
	.registerDataStream("myTable3", stream, "word");

With these three registration methods, a table can be registered in the environment and then scanned back out for Table API programming.

How to emit a table?

resultTable is a result of type Table; insertInto writes it to a target table.

resultTable.insertInto("TargetTable");

Likewise, there are three ways to emit a table:

Table descriptor

tEnv
	.connect(new FileSystem().path(path))
	.withFormat(new OldCsv()
		.field("word", Types.STRING)
		.lineDelimiter("\n"))
	.withSchema(new Schema().field("word", Types.STRING))
	.registerTableSink("targetTable");

User defined table sink

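// CsvTableSink takes the path and field delimiter; the field names and
// types are supplied when the sink is registered.
TableSink csvSink = new CsvTableSink(path, ",");
tEnv.registerTableSink(
	"sinkTable2",
	new String[]{"word"},
	new TypeInformation[]{Types.STRING},
	csvSink);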

emit to a DataStream

// emit the result table to a DataStream
DataStream<Tuple2<Boolean, Row>> stream =
tableEnv
.toRetractStream(resultTable, Row.class);

How to query a table?

[Figure: querying a table]

Categories of Table API operations

[Figure: Table API categories]

Columns Operation & Function

Columns Operation (usability)

// AddColumns: add a column
Table orders = tableEnv.scan("Orders");
Table result = orders.addColumns("concat(c, 'sunny') as desc");

// AddOrReplaceColumns: add a column, replacing an existing column of the same name
Table orders = tableEnv.scan("Orders");
Table result = orders.addOrReplaceColumns("concat(c, 'sunny') as desc");

// DropColumns: drop columns
Table orders = tableEnv.scan("Orders");
Table result = orders.dropColumns("b, c");

// RenameColumns: rename columns
Table orders = tableEnv.scan("Orders");
Table result = orders.renameColumns("b as b2, c as c2");

Columns Function (usability)

// Select the given columns: columns 2 through 4
select("withColumns(2 to 4)")
// Select the complement: every column except columns 2 through 4
select("withoutColumns(2 to 4)")

Arguments to column functions
Column references, index positions, column names, and so on can be passed in.
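To make the range semantics concrete, here is a small plain-Java sketch of what `withColumns(2 to 4)` and `withoutColumns(2 to 4)` select. It is not Flink code: `ColumnRangeSketch` and its two methods are hypothetical helpers, and the sketch assumes that positions are 1-based and that the range is inclusive at both ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of index-range column selection; not the Flink API.
final class ColumnRangeSketch {

    // Keeps columns at 1-based positions from..to (inclusive), as assumed in the lead-in.
    static List<String> withColumns(List<String> columns, int from, int to) {
        return new ArrayList<>(columns.subList(from - 1, to));
    }

    // Keeps every column except those at 1-based positions from..to (inclusive).
    static List<String> withoutColumns(List<String> columns, int from, int to) {
        List<String> kept = new ArrayList<>(columns.subList(0, from - 1));
        kept.addAll(columns.subList(to, columns.size()));
        return kept;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(withColumns(columns, 2, 4));    // prints "[b, c, d]"
        System.out.println(withoutColumns(columns, 2, 4)); // prints "[a, e]"
    }
}
```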
[Figure: summary of Columns Operation & Function]

Row-based Operation

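map Operation (usability)
map takes a ScalarFunction, which is applied to every row and can transform its columns in one call.
When a table has many columns and a single select would have to apply a UDF to each of them, map can handle them all in one operation.
flatMap Operation (usability)
One input row produces multiple output rows. flatMap takes a TableFunction.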
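The one-row-in, many-rows-out shape of flatMap can be illustrated in plain Java. This is only a sketch of the semantics, not the Flink API: `FlatMapSketch`, `splitWords`, and the use of `Consumer<String>` as a collector are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java illustration of flatMap semantics; not the Flink API.
final class FlatMapSketch {

    // Hypothetical stand-in for a table function: one input row, zero or more collected rows.
    static void splitWords(String line, Consumer<String> collector) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                collector.accept(word);
            }
        }
    }

    // Applies splitWords to every input row and gathers everything that was collected.
    static List<String> flatMap(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            splitWords(row, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("hello flink", "table api sql", "");
        System.out.println(flatMap(rows)); // prints "[hello, flink, table, api, sql]"
    }
}
```

The empty input row contributes no output rows, which is the case a plain map cannot express.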

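aggregate Operation (usability)
Multiple input rows produce one output row. aggregate takes an AggregateFunction. Taking Count as an example: first define a CountAccumulator, then write the aggregation logic, and finally return the result from getValue.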
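The accumulator life cycle described for Count can be sketched in plain Java. The method names createAccumulator, accumulate, and getValue follow the names used in the text, but this is a stand-alone illustration rather than a class extending Flink's AggregateFunction; `CountAggregateSketch` and its `count` driver are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the aggregate life cycle; not the Flink API.
final class CountAggregateSketch {

    // Intermediate state kept while rows are consumed.
    static final class CountAccumulator {
        long count = 0L;
    }

    static CountAccumulator createAccumulator() {
        return new CountAccumulator();
    }

    // Called once per input row; the row content is irrelevant for a count.
    static void accumulate(CountAccumulator acc, String row) {
        acc.count += 1;
    }

    // Called once at the end: many input rows collapse into a single value.
    static long getValue(CountAccumulator acc) {
        return acc.count;
    }

    // Hypothetical driver that plays the role of the runtime.
    static long count(List<String> rows) {
        CountAccumulator acc = createAccumulator();
        for (String row : rows) {
            accumulate(acc, row);
        }
        return getValue(acc);
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a", "b", "c"))); // prints "3"
    }
}
```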

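flatAggregate Operation (functionality, a new capability)
Multiple input rows produce multiple output rows, for example a top-N operation. flatAggregate takes a TableAggregateFunction: first define a TopNAcc accumulator, then implement accumulate; emitValue receives a Collector, so it can emit results multiple times.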

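Aggregate versus TableAggregate
As the comparison shows, in step 2 (accumulating intermediate results) max keeps a single maximum value while the top-2 logic keeps two values. At the end, getValue emits only once, whereas emitValue can emit twice, which completes the top-2 logic.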
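The difference between the two styles can be sketched side by side in plain Java. This is an illustration of the semantics only, not code against Flink's AggregateFunction or TableAggregateFunction: `Top2Sketch`, both accumulator classes, and the use of `Consumer<Integer>` in place of a Collector are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java comparison of max (aggregate) and top-2 (table aggregate); not the Flink API.
final class Top2Sketch {

    // max keeps one intermediate value.
    static final class MaxAccumulator {
        int max = Integer.MIN_VALUE;
    }

    // top-2 keeps two intermediate values.
    static final class Top2Accumulator {
        int first = Integer.MIN_VALUE;
        int second = Integer.MIN_VALUE;
    }

    static void accumulate(MaxAccumulator acc, int value) {
        if (value > acc.max) {
            acc.max = value;
        }
    }

    // getValue returns exactly one result.
    static int getValue(MaxAccumulator acc) {
        return acc.max;
    }

    static void accumulate(Top2Accumulator acc, int value) {
        if (value > acc.first) {
            acc.second = acc.first;
            acc.first = value;
        } else if (value > acc.second) {
            acc.second = value;
        }
    }

    // emitValue may hand several results to the collector.
    static void emitValue(Top2Accumulator acc, Consumer<Integer> out) {
        if (acc.first != Integer.MIN_VALUE) {
            out.accept(acc.first);
        }
        if (acc.second != Integer.MIN_VALUE) {
            out.accept(acc.second);
        }
    }

    static int max(int[] values) {
        MaxAccumulator acc = new MaxAccumulator();
        for (int v : values) {
            accumulate(acc, v);
        }
        return getValue(acc);
    }

    static List<Integer> top2(int[] values) {
        Top2Accumulator acc = new Top2Accumulator();
        for (int v : values) {
            accumulate(acc, v);
        }
        List<Integer> out = new ArrayList<>();
        emitValue(acc, out::add);
        return out;
    }

    public static void main(String[] args) {
        int[] values = {3, 9, 5, 1};
        System.out.println(max(values));  // prints "9"
        System.out.println(top2(values)); // prints "[9, 5]"
    }
}
```

The sketch uses Integer.MIN_VALUE as an "empty" marker, so it cannot distinguish that value from a missing one; a production accumulator would need an explicit flag or a nullable field.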
[Figure: summary of row-based operations]

3. Table API Developments

3.1 FLIP-29
https://issues.apache.org/jira/browse/FLINK-11199
3.2 Python Table API
https://issues.apache.org/jira/browse/FLINK-10972
3.3 Interactive Programming
https://issues.apache.org/jira/browse/FLINK-12308
3.4 Iterative Processing
https://issues.apache.org/jira/browse/FLINK-11199

Most of the content above comes from public talks given by the Ververica Flink community.
https://github.com/flink-china/flink-training-course#19-flink-sql-%E7%BC%96%E7%A8%8B
