（十七）Flink Table API & SQL 编程指南概念与通用api

最新推荐文章于 2023-04-04 23:53:11 发布

京河小蚁

最新推荐文章于 2023-04-04 23:53:11 发布

阅读量508

点赞数

分类专栏： flink 文章标签： sql flink

本文链接：https://blog.csdn.net/u010772882/article/details/125187015

版权

flink 专栏收录该内容

86 篇文章 33 订阅

订阅专栏

文章目录

Table API 和 SQL 程序的结构
创建一个 TableEnvironment
Create Tables in the Catalog
查询表
发出一个表
翻译并执行查询
解释表

Table API & SQL 集成在同一套 API 中。这套 API 的核心概念是Table，用作查询的输入和输出。本文介绍 Table API & SQL 查询程序的通用结构、如何注册 Table 、如何查询 Table 以及如何输出 Table 。

源码地址 https://github.com/DeveloperZJQ/learning-flink/tree/master/flink-learning-base/src/main/java/com/flink/tableandsql

Table API 和 SQL 程序的结构

所有用于批处理和流处理的 Table API 和 SQL 程序都遵循相同的模式。下面的代码示例展示了 Table API 和 SQL 程序的通用结构（本代码伪代码，不能运行，只为提供学习代码结构，文章末尾会提供能运行的代码）。

import org.apache.flink.connector.datagen.table.DataGenConnectorOptions;
import org.apache.flink.table.api.*;

/**
 * @author happy
 * @since 2022/6/10
 */
public class TableAndSQLFirstProgrammer {
    public static void main(String[] args) {
        final EnvironmentSettings settings =
                EnvironmentSettings.newInstance().inStreamingMode().build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

// Create a source table
        tableEnv.createTemporaryTable("SourceTable", TableDescriptor.forConnector("datagen")
                .schema(Schema.newBuilder()
                        .column("f0", DataTypes.STRING())
                        .build())
                .option(DataGenConnectorOptions.ROWS_PER_SECOND, 100L)
                .build());

// Create a sink table (using SQL DDL)
        tableEnv.executeSql("CREATE TEMPORARY TABLE SinkTable WITH ('connector' = 'blackhole') LIKE SourceTable (EXCLUDING ALL)");

// Create a Table object from a Table API query
        Table table2 = tableEnv.from("SourceTable");

// Create a Table object from a SQL query
        tableEnv.sqlQuery("SELECT * FROM SourceTable").execute().print();

        tableEnv.createTemporaryView("TEMP_TABLE_A", table2);
        tableEnv.sqlQuery("SELECT * FROM TEMP_TABLE_A").execute().print();
// Emit a Table API result Table to a TableSink, same for SQL result
        table2.insertInto("SinkTable").execute();
    }
}

Table API 和 SQL 查询可以轻松地与 DataStream 程序集成并嵌入到其中。查看DataStream API 集成页面，了解如何将 DataStreams 转换为表，反之亦然。

创建一个 TableEnvironment

是 Table API 和 SQL 集成的TableEnvironment入口点，负责：

Table在内部catalog中注册
注册catalog
加载可插拔模块
执行 SQL 查询
注册用户定义的（标量、表或聚合）函数
DataStream在和之间转换Table（在的情况下StreamTableEnvironment）

Table总是绑定到一个特定的TableEnvironment. 不可能在同一个查询中组合不同 TableEnvironments 的表，例如，加入或联合它们。ATableEnvironment是通过调用静态TableEnvironment.create()方法创建的。

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

EnvironmentSettings settings = EnvironmentSettings
    .newInstance()
    .inStreamingMode()
    //.inBatchMode()
    .build();

TableEnvironment tEnv = TableEnvironment.create(settings);

或者，用户可以StreamTableEnvironment从现有的 API 创建一个StreamExecutionEnvironment 与 API 进行互操作DataStream。

Create Tables in the Catalog

TableEnvironment 维护使用标识符创建的表目录的映射。每个标识符由 3 部分组成：目录名称、数据库名称和对象名称。如果未指定目录或数据库，则将使用当前默认值（请参阅表标识符扩展部分中的示例）。

表可以是虚拟的 ( VIEWS) 或常规的 ( TABLES)。VIEWS可以从现有Table对象创建，通常是 Table API 或 SQL 查询的结果。TABLES描述外部数据，例如文件、数据库表或消息队列。

临时表与永久表

表可以是临时的，并且与单个 Flink 会话的生命周期相关联，也可以是永久的，并且在多个 Flink 会话和集群中可见。

永久表需要一个目录（例如 Hive Metastore）来维护有关表的元数据。一旦创建了永久表，它对连接到目录的任何 Flink 会话都是可见的，并且将继续存在，直到表被显式删除。

另一方面，临时表始终存储在内存中，并且仅在它们创建的 Flink 会话期间存在。这些表对其他会话不可见。它们不绑定到任何目录或数据库，但可以在其中一个的命名空间中创建。如果删除了相应的数据库，则不会删除临时表。

影子

可以使用与现有永久表相同的标识符注册临时表。临时表会影响永久表，只要临时表存在，就无法访问永久表。所有具有该标识符的查询都将针对临时表执行。

这可能对实验有用。它允许首先对一个临时表运行完全相同的查询，例如，该表只有一个数据子集，或者数据被混淆了。一旦验证查询是正确的，它就可以针对真实的生产表运行。

创建表

虚拟表

TableAPI 对象对应于 SQL 术语中的（VIEW虚拟表）。它封装了一个逻辑查询计划。它可以在目录中创建，如下所示：

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// table is the result of a simple projection query 
Table projTable = tableEnv.from("X").select(...);

// register the Table projTable as table "projectedTable"
tableEnv.createTemporaryView("projectedTable", projTable);

注意： Table对象与VIEW关系数据库系统中的 's 类似，即定义的查询Table未优化，但当另一个查询引用已注册的时将被内联Table。如果多个查询引用同一个已注册的Table，则会为每个引用的查询内联并执行多次，即注册的结果Table不会被共享。

连接器表

也可以TABLE从连接器声明中创建关系数据库中已知的。连接器描述了存储表数据的外部系统。可以在此处声明存储系统，例如 Apache Kafka 或常规文件系统。

此类表可以直接使用 Table API 创建，也可以通过切换到 SQL DDL 创建。

// 方法一  Using table descriptors
final TableDescriptor sourceDescriptor = TableDescriptor.forConnector("datagen")
    .schema(Schema.newBuilder()
    .column("f0", DataTypes.STRING())
    .build())
    .option(DataGenOptions.ROWS_PER_SECOND, 100)
    .build();

tableEnv.createTable("SourceTableA", sourceDescriptor);
tableEnv.createTemporaryTable("SourceTableB", sourceDescriptor);

// 方法二   Using SQL DDL
tableEnv.executeSql("CREATE [TEMPORARY] TABLE MyTable (...) WITH (...)");

扩展表标识符

表始终使用由目录、数据库和表名组成的 3 部分标识符进行注册。

用户可以将其中的一个目录和一个数据库设置为“当前目录”和“当前数据库”。有了它们，上面提到的 3 部分标识符中的前两部分可以是可选的 - 如果未提供它们，则将引用当前目录和当前数据库。用户可以通过 Table API 或 SQL 切换当前目录和当前数据库。

标识符遵循 SQL 要求，这意味着它们可以使用反引号字符 ( `) 进行转义。

TableEnvironment tEnv = ...;
tEnv.useCatalog("custom_catalog");
tEnv.useDatabase("custom_database");

Table table = ...;

// register the view named 'exampleView' in the catalog named 'custom_catalog'
// in the database named 'custom_database' 
tableEnv.createTemporaryView("exampleView", table);

// register the view named 'exampleView' in the catalog named 'custom_catalog'
// in the database named 'other_database' 
tableEnv.createTemporaryView("other_database.exampleView", table);

// register the view named 'example.View' in the catalog named 'custom_catalog'
// in the database named 'custom_database' 
tableEnv.createTemporaryView("`example.View`", table);

// register the view named 'exampleView' in the catalog named 'other_catalog'
// in the database named 'other_database' 
tableEnv.createTemporaryView("other_catalog.other_database.exampleView", table);

查询表

TABLE API

Table API 是用于 Scala 和 Java 的语言集成查询 API。与 SQL 相比，查询不指定为字符串，而是使用宿主语言逐步组成。

API 基于Table表示表（流式或批处理）的类，并提供应用关系操作的方法。这些方法返回一个新Table对象，它表示对输入应用关系操作的结果Table。一些关系操作由多个方法调用组成，例如table.groupBy(…).select()，其中groupBy(…)指定的分组table，以及select(…)对的分组的投影table。

Table API文档描述了流和批处理表支持的所有 Table API 操作。

以下示例显示了一个简单的 Table API 聚合查询：

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// register Orders table

// scan registered Orders table
Table orders = tableEnv.from("Orders");
// compute revenue for all customers from France
Table revenue = orders
  .filter($("cCountry").isEqual("FRANCE"))
  .groupBy($("cID"), $("cName"))
  .select($("cID"), $("cName"), $("revenue").sum().as("revSum"));

// emit or convert Table
// execute query

SQL

Flink 的 SQL 集成基于Apache Calcite，它实现了 SQL 标准。SQL 查询被指定为常规字符串。

SQL文档描述了 Flink 对流表和批处理表的 SQL 支持。

以下示例显示如何指定查询并将结果作为Table.

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// register Orders table

// compute revenue for all customers from France
val revenue = tableEnv.sqlQuery("""
  |SELECT cID, cName, SUM(revenue) AS revSum
  |FROM Orders
  |WHERE cCountry = 'FRANCE'
  |GROUP BY cID, cName
  """.stripMargin)

// emit or convert Table
// execute query

以下示例显示如何指定将其结果插入已注册表的更新查询。

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// register "Orders" table
// register "RevenueFrance" output table

// compute revenue for all customers from France and emit to "RevenueFrance"
tableEnv.executeSql(
    "INSERT INTO RevenueFrance " +
    "SELECT cID, cName, SUM(revenue) AS revSum " +
    "FROM Orders " +
    "WHERE cCountry = 'FRANCE' " +
    "GROUP BY cID, cName"
  );

混合表 API 和 SQL

表 API 和 SQL 查询很容易混合，因为两者都返回Table对象：

可以在TableSQL 查询返回的对象上定义 Table API 查询。
通过在 SQL 查询的子句中注册结果表并在SQL 查询TableEnvironment的子句中引用它，可以在Table API 查询的结果上定义FROMSQL 查询。

发出一个表

Table通过将其写入 a 来发出TableSink。TableSink是一个通用接口，支持多种文件格式（例如 CSV、Apache Parquet、Apache Avro）、存储系统（例如 JDBC、Apache HBase、Apache Cassandra、Elasticsearch）或消息系统（例如 Apache Kafka、RabbitMQ ）。

批处理Table只能写入 BatchTableSink，而流式传输Table需要AppendStreamTableSink 、 RetractStreamTableSink或UpsertStreamTableSink.

有关可用接收器的详细信息以及如何实现自定义的说明，请参阅有关表源和接收器DynamicTableSink的文档。

该Table.insertInto(String tableName)方法定义了一个完整的端到端管道，将源表发送到已注册的接收器表。该方法通过名称从catalog中查找表接收器，并验证接收器的结构Table是否与接收器的结构相同。可以通过TablePipeline.explain()调用来解释和执行管道TablePipeline.execute()。

以下示例显示了如何发出Table:

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// create an output Table
final Schema schema = Schema.newBuilder()
    .column("a", DataTypes.INT())
    .column("b", DataTypes.STRING())
    .column("c", DataTypes.BIGINT())
    .build();

tableEnv.createTemporaryTable("CsvSinkTable", TableDescriptor.forConnector("filesystem")
    .schema(schema)
    .option("path", "/path/to/file")
    .format(FormatDescriptor.forFormat("csv")
        .option("field-delimiter", "|")
        .build())
    .build());

// compute a result Table using Table API operators and/or SQL queries
Table result = ...;

// Prepare the insert into pipeline
TablePipeline pipeline = result.insertInto("CsvSinkTable");

// Print explain details
pipeline.printExplain();

// emit the result Table to the registered TableSink
pipeline.execute();

翻译并执行查询

表 API 和 SQL 查询被转换为DataStream程序，无论它们的输入是流式还是批处理。查询在内部表示为逻辑查询计划，并分两个阶段进行转换：

优化逻辑计划，
翻译成 DataStream 程序。

在以下情况下会翻译 Table API 或 SQL 查询：

TableEnvironment.executeSql()。该方法用于执行给定的语句，一旦调用该方法，就会立即翻译 sql 查询。
TablePipeline.execute()。该方法用于执行 source-to-sink 管道，调用该方法后立即翻译 Table API 程序。
Table.execute()。该方法用于将表格内容采集到本地客户端，调用该方法后立即翻译表格API。
StatementSet.execute()。TablePipeline（通过发送到接收器StatementSet.add()）或 INSERT 语句（通过指定StatementSet.addInsertSql()）将StatementSet首先被缓冲。StatementSet.execute()一旦被调用，它们就会被转换。所有接收器都将优化为一个 DAG。
Table在转换为DataStream（请参阅与 DataStream 集成）。翻译后，它是一个常规的 DataStream 程序，并在StreamExecutionEnvironment.execute()被调用时执行。

解释表

Table API 提供了一种机制来解释逻辑和优化的查询计划以计算Table. 这是通过Table.explain()方法或StatementSet.explain()方法完成的。Table.explain()返回一个计划Table。StatementSet.explain()返回多个接收器的计划。它返回一个描述三个计划的字符串：

关系查询的抽象语法树，即未优化的逻辑查询计划，
优化的逻辑查询计划
物理执行计划。

TableEnvironment.explainSql()和TableEnvironment.executeSql()支持执行EXPLAIN语句来获取计划。

以下代码显示了给定TableusingTable.explain()方法的示例和相应输出：

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

DataStream<Tuple2<Integer, String>> stream1 = env.fromElements(new Tuple2<>(1, "hello"));
DataStream<Tuple2<Integer, String>> stream2 = env.fromElements(new Tuple2<>(1, "hello"));

// explain Table API
Table table1 = tEnv.fromDataStream(stream1, $("count"), $("word"));
Table table2 = tEnv.fromDataStream(stream2, $("count"), $("word"));
Table table = table1
  .where($("word").like("F%"))
  .unionAll(table2);

System.out.println(table.explain());

上面例子的结果是

== Abstract Syntax Tree ==
LogicalUnion(all=[true])
:- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
:  +- LogicalTableScan(table=[[Unregistered_DataStream_1]])
+- LogicalTableScan(table=[[Unregistered_DataStream_2]])

== Optimized Physical Plan ==
Union(all=[true], union=[count, word])
:- Calc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])
:  +- DataStreamScan(table=[[Unregistered_DataStream_1]], fields=[count, word])
+- DataStreamScan(table=[[Unregistered_DataStream_2]], fields=[count, word])

== Optimized Execution Plan ==
Union(all=[true], union=[count, word])
:- Calc(select=[count, word], where=[LIKE(word, 'F%')])
:  +- DataStreamScan(table=[[Unregistered_DataStream_1]], fields=[count, word])
+- DataStreamScan(table=[[Unregistered_DataStream_2]], fields=[count, word])

以下代码显示了多接收器计划使用StatementSet.explain()方法的示例和相应输出：

EnvironmentSettings settings = EnvironmentSettings.inStreamingMode();
TableEnvironment tEnv = TableEnvironment.create(settings);

final Schema schema = Schema.newBuilder()
    .column("count", DataTypes.INT())
    .column("word", DataTypes.STRING())
    .build();

tEnv.createTemporaryTable("MySource1", TableDescriptor.forConnector("filesystem")
    .schema(schema)
    .option("path", "/source/path1")
    .format("csv")
    .build());
tEnv.createTemporaryTable("MySource2", TableDescriptor.forConnector("filesystem")
    .schema(schema)
    .option("path", "/source/path2")
    .format("csv")
    .build());
tEnv.createTemporaryTable("MySink1", TableDescriptor.forConnector("filesystem")
    .schema(schema)
    .option("path", "/sink/path1")
    .format("csv")
    .build());
tEnv.createTemporaryTable("MySink2", TableDescriptor.forConnector("filesystem")
    .schema(schema)
    .option("path", "/sink/path2")
    .format("csv")
    .build());

StatementSet stmtSet = tEnv.createStatementSet();

Table table1 = tEnv.from("MySource1").where($("word").like("F%"));
stmtSet.add(table1.insertInto("MySink1"));

Table table2 = table1.unionAll(tEnv.from("MySource2"));
stmtSet.add(table2.insertInto("MySink2"));

String explanation = stmtSet.explain();
System.out.println(explanation);

多汇计划的结果是

== Abstract Syntax Tree ==
LogicalLegacySink(name=[`default_catalog`.`default_database`.`MySink1`], fields=[count, word])
+- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
   +- LogicalTableScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]])

LogicalLegacySink(name=[`default_catalog`.`default_database`.`MySink2`], fields=[count, word])
+- LogicalUnion(all=[true])
   :- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
   :  +- LogicalTableScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]])
   +- LogicalTableScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]])

== Optimized Physical Plan ==
LegacySink(name=[`default_catalog`.`default_database`.`MySink1`], fields=[count, word])
+- Calc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])
   +- LegacyTableSourceScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])

LegacySink(name=[`default_catalog`.`default_database`.`MySink2`], fields=[count, word])
+- Union(all=[true], union=[count, word])
   :- Calc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])
   :  +- LegacyTableSourceScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])
   +- LegacyTableSourceScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])

== Optimized Execution Plan ==
Calc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])(reuse_id=[1])
+- LegacyTableSourceScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])

LegacySink(name=[`default_catalog`.`default_database`.`MySink1`], fields=[count, word])
+- Reused(reference_id=[1])

LegacySink(name=[`default_catalog`.`default_database`.`MySink2`], fields=[count, word])
+- Union(all=[true], union=[count, word])
   :- Reused(reference_id=[1])
   +- LegacyTableSourceScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])

京河小蚁

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
打赏
0
评论
（十七）Flink Table API & SQL 编程指南概念与通用api

Table API & SQL 集成在同一套 API 中。这套 API 的核心概念是Table，用作查询的输入和输出。本文介绍 Table API & SQL 查询程序的通用结构、如何注册 Table 、如何查询 Table 以及如何输出 Table 。所有用于批处理和流处理的 Table API 和 SQL 程序都遵循相同的模式。下面的代码示例展示了 Table API 和 SQL 程序的通用结构（本代码伪代码，不能运行，只为提供学习代码结构，文章末尾会提供能运行的代码）。Table API 和 S
复制链接

扫一扫