Table API & SQL Programming

Moca·

Category: Real-Time Computing with Flink. Tags: big data, java, scala, flink

Article link: https://blog.csdn.net/qq_32165517/article/details/108228475

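This article describes Apache Flink's relational APIs, the Table API and SQL: setting up the programming environment, concepts and the common API, the differences between the two planners, the structure of Table API and SQL programs, creating a TableEnvironment, creating and querying tables, data type mapping, and integration with the DataStream and DataSet APIs. It moves from the basics to more advanced topics and aims to help developers understand and use Flink's unified stream and batch processing.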

What Is Flink's Relational API

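DataSet & DataStream API
1) Using them requires familiarity with two sets of APIs, DataSet and DataStream, each in Java or Scala. Other engines offer SQL on top of their native APIs, and Flink does the same:
MapReduce ==> Hive SQL
Spark ==> Spark SQL
Flink ==> SQL
2) Flink supports both batch and stream processing, so how can the two be unified at the API level?

==> The Table & SQL API, Flink's relational API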

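Table API & SQL Development Overview

Apache Flink offers two relational APIs for unified stream and batch processing: the Table API and SQL. The Table API is a language-integrated query API for Scala and Java that lets you compose queries from relational operators such as selection, filter, and join in an intuitive way. Flink SQL is standard SQL implemented on top of Apache Calcite. Queries written in either API have the same semantics and produce the same result whether the input is a batch input (DataSet) or a stream input (DataStream).

The Table API and SQL are tightly integrated with each other and with the DataStream and DataSet APIs. You can switch easily between all of these APIs and the libraries built on them. For example, you can use CEP to extract patterns from a DataStream and then analyze the matches with the Table API; or you can use SQL to scan, filter, and aggregate a batch table and then run a Gelly graph algorithm on the preprocessed data.

Note: The Table API and SQL are still under active development and do not yet implement every feature. Not every combination of [Table API, SQL] and [stream, batch] is supported.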
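
To make the "same semantics" claim concrete, here is a minimal sketch that runs one query twice, once through the Table API and once through SQL. It assumes Flink 1.11 with the Blink planner in batch mode and the planner and client modules on the classpath; the class name, the `Orders` table, and its rows are hypothetical and only serve the illustration.

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.util.CloseableIterator;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class RelationalApiSketch {

    /** Creates a batch TableEnvironment and registers a small in-memory table "Orders" (hypothetical data). */
    public static TableEnvironment createEnvWithOrders() {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inBatchMode()
            .build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        Table orders = tableEnv.fromValues(
            DataTypes.ROW(
                DataTypes.FIELD("product", DataTypes.STRING()),
                DataTypes.FIELD("amount", DataTypes.INT())),
            row("apple", 12),
            row("pear", 3),
            row("melon", 25));
        tableEnv.createTemporaryView("Orders", orders);
        return tableEnv;
    }

    /** Executes the query and returns the first column of every row, sorted for a stable comparison. */
    public static List<String> collectProducts(Table table) throws Exception {
        List<String> products = new ArrayList<>();
        try (CloseableIterator<Row> it = table.execute().collect()) {
            while (it.hasNext()) {
                products.add(String.valueOf(it.next().getField(0)));
            }
        }
        Collections.sort(products);
        return products;
    }

    /** The query expressed with Table API operators. */
    public static List<String> viaTableApi() throws Exception {
        TableEnvironment tableEnv = createEnvWithOrders();
        Table result = tableEnv.from("Orders")
            .filter($("amount").isGreater(10))
            .select($("product"));
        return collectProducts(result);
    }

    /** The same query expressed as SQL. */
    public static List<String> viaSql() throws Exception {
        TableEnvironment tableEnv = createEnvWithOrders();
        Table result = tableEnv.sqlQuery("SELECT product FROM Orders WHERE amount > 10");
        return collectProducts(result);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Table API: " + viaTableApi());
        System.out.println("SQL:       " + viaSql());
    }
}
```

Both methods return a `Table`, so the two styles can also be mixed within one program, for example by applying Table API operators to the result of `sqlQuery`.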

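Setting Up the Table API & SQL Programming Environment

Dependency Structure

Since version 1.9, Flink has provided two table planner implementations for executing Table API and SQL programs: the Blink planner and the old planner, which already existed before 1.9. A planner translates relational operations into an executable, optimized Flink job. The two planners use different optimization rules and runtime classes, and they also differ somewhat in the features they support.

Note: For production use, we recommend the Blink planner, which has been the default planner since version 1.11.

All Table API and SQL code is bundled in the flink-table or flink-table-blink Maven artifacts.

The individual dependencies are: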

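flink-table-common: A common module that user-defined functions, formats, and similar extensions depend on.
flink-table-api-java: The Table & SQL API for pure table programs, written in Java (still in an early development stage, not recommended).
flink-table-api-scala: The Table & SQL API for pure table programs, written in Scala (still in an early development stage, not recommended).
flink-table-api-java-bridge: The Table & SQL API combined with the DataStream/DataSet API, for Java.
flink-table-api-scala-bridge: The Table & SQL API combined with the DataStream/DataSet API, for Scala.
flink-table-planner: The table planner and runtime. This was Flink's only planner before 1.9; it is no longer recommended as of version 1.11.
flink-table-planner-blink: The new Blink planner, the default planner since version 1.11.
flink-table-runtime-blink: The new Blink runtime.
flink-table-uber: Bundles the API modules above together with the old planner and covers most Table & SQL API use cases. The resulting jar file flink-table-*.jar is placed in the /lib directory of a Flink distribution by default.
flink-table-uber-blink: Bundles the API modules above together with the Blink planner and covers most Table & SQL API use cases. The resulting jar file flink-table-blink-*.jar is placed in the /lib directory of a Flink distribution by default.

For how to use the old planner and the Blink planner, see the Concepts & Common API section.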

Table Program Dependencies

Depending on your programming language, choose the Java or the Scala API to build your Table API and SQL programs:

<!-- Either... -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-api-java-bridge_2.11</artifactId>
  <version>1.11.0</version>
  <scope>provided</scope>
</dependency>
<!-- or... -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
  <version>1.11.0</version>
  <scope>provided</scope>
</dependency>

In addition, if you want to run your program locally in an IDE, add one of the following modules, depending on which planner you use:

<!-- Either... (for the old planner that was available before Flink 1.9) -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner_2.11</artifactId>
  <version>1.11.0</version>
  <scope>provided</scope>
</dependency>
<!-- or.. (for the new Blink planner) -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.11</artifactId>
  <version>1.11.0</version>
  <scope>provided</scope>
</dependency>

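Internally, parts of the table code are implemented in Scala. The following dependency therefore also has to be added to your program, for batch and streaming applications alike: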

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.11</artifactId>
  <version>1.11.0</version>
  <scope>provided</scope>
</dependency>

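Extension Dependencies

If you want to implement a custom format for parsing Kafka data, or a set of user-defined functions, the following dependency is sufficient. The resulting jar file can be used directly with the SQL Client: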

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-common</artifactId>
  <version>1.11.0</version>
  <scope>provided</scope>
</dependency>

Currently, this module contains the following extension points:

SerializationSchemaFactory
DeserializationSchemaFactory
ScalarFunction
TableFunction
AggregateFunction
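
As an illustration of one of these extension points, the sketch below implements a scalar function that needs nothing beyond flink-table-common at compile time. The class name and its masking logic are hypothetical; the only Flink convention relied on is that a ScalarFunction exposes its logic through public `eval` methods.

```java
import org.apache.flink.table.functions.ScalarFunction;

/**
 * Hypothetical scalar UDF: keeps the first `visible` characters of a string
 * and replaces every remaining character with '*'.
 */
public class MaskSuffixFunction extends ScalarFunction {

    // Flink locates the evaluation logic through public methods named "eval".
    public String eval(String value, Integer visible) {
        if (value == null || visible == null) {
            return null;
        }
        // Clamp the number of visible characters to the range [0, value.length()].
        int keep = Math.max(0, Math.min(visible, value.length()));
        StringBuilder masked = new StringBuilder(value.substring(0, keep));
        for (int i = keep; i < value.length(); i++) {
            masked.append('*');
        }
        return masked.toString();
    }
}
```

In a Flink 1.11 table program, such a class could then be registered with `tableEnv.createTemporarySystemFunction("MaskSuffix", MaskSuffixFunction.class)` and called from SQL, for example `SELECT MaskSuffix(name, 2) FROM table1`, where the `name` column is assumed for the example.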

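Concepts & Common API

The Table API and SQL are integrated in a single API. The central concept of this API is the Table, which serves as the input and output of queries. This section covers the common structure of Table API and SQL programs, how to register a Table, how to query a Table, and how to emit a Table.

Main Differences Between the Two Planners

Blink treats batch jobs as a special case of streaming. Consequently, conversion between Table and DataSet is not supported, and batch jobs are not translated into DataSet programs but into DataStream programs, just like streaming jobs.
The Blink planner does not support BatchTableSource; use a bounded StreamTableSource instead.
The implementations of FilterableTableSource for the old planner and the Blink planner are incompatible. The old planner pushes PlannerExpression down into FilterableTableSource, whereas the Blink planner pushes Expression down.
String-based key-value configuration options are used only by the Blink planner. (See the configuration documentation for details.)
The implementation of PlannerConfig (CalciteConfig) differs between the two planners.
The Blink planner optimizes multiple sinks into a single directed acyclic graph (DAG); both TableEnvironment and StreamTableEnvironment support this. The old planner always optimizes each sink into a new DAG, and all of these graphs are independent of each other.
The old planner currently does not support catalog statistics, while the Blink planner does.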
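
In code, the planner and the execution mode are chosen through EnvironmentSettings. The sketch below assumes Flink 1.11 and only builds the settings objects; the class and method names are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;

public class PlannerSettingsSketch {

    /** Blink planner, batch mode: the batch job is translated into a DataStream program. */
    public static EnvironmentSettings blinkBatch() {
        return EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inBatchMode()
            .build();
    }

    /** Blink planner, streaming mode. */
    public static EnvironmentSettings blinkStreaming() {
        return EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inStreamingMode()
            .build();
    }

    /** Old planner, streaming mode. */
    public static EnvironmentSettings oldStreaming() {
        return EnvironmentSettings.newInstance()
            .useOldPlanner()
            .inStreamingMode()
            .build();
    }

    public static void main(String[] args) {
        System.out.println("blinkBatch streaming?     " + blinkBatch().isStreamingMode());
        System.out.println("blinkStreaming streaming? " + blinkStreaming().isStreamingMode());
        System.out.println("oldStreaming streaming?   " + oldStreaming().isStreamingMode());
    }
}
```

The Blink settings can be passed to `TableEnvironment.create(settings)`. To my understanding, batch programs on the old planner are instead created through `BatchTableEnvironment` on top of a DataSet `ExecutionEnvironment`, which matches the first difference listed above.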

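Structure of Table API and SQL Programs

All Table API and SQL programs, for batch and for streaming, follow the same pattern. The following code examples show the common structure of Table API and SQL programs.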

Java

// create a TableEnvironment for specific planner batch or streaming
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// create a Table
tableEnv.connect(...).createTemporaryTable("table1");
// register an output Table
tableEnv.connect(...).createTemporaryTable("outputTable");

// create a Table object from a Table API query
Table tapiResult = tableEnv.from("table1").select(...);
// create a Table object from a SQL query
Table sqlResult  = tableEnv.sqlQuery("SELECT ... FROM table1 ... ");

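// emit a Table API result Table to a TableSink, same for SQL result
// (executeInsert() submits the job right away)
TableResult tableResult = tapiResult.executeInsert("outputTable");
tableResult...

// no tableEnv.execute() call here: it only runs statements buffered via
// insertInto()/sqlUpdate() and fails when there is nothing left to execute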

Scala

// create a TableEnvironment for specific planner batch or streaming
val tableEnv = ... // see "Create a TableEnvironment" section

// create a Table
tableEnv.connect(...).createTemporaryTable("table1")
// register an output Table
tableEnv.connect(...).createTemporaryTable("outputTable")

// create a Table from a Table API query
val tapiResult = tableEnv.from("table1").select(...)
// create a Table from a SQL query
val sqlResult  = tableEnv.sqlQuery("SELECT ... FROM table1 ...")

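// emit a Table API result Table to a TableSink, same for SQL result
// (executeInsert() submits the job right away)
val tableResult = tapiResult.executeInsert("outputTable")
tableResult...

// no tableEnv.execute() call here: it only runs statements buffered via
// insertInto()/sqlUpdate() and fails when there is nothing left to execute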

Python

# create a TableEnvironment for specific planner batch or streaming
table_env = ... # see "Create a TableEnvironment" section

# register a Table
table_env.connect(...).create_temporary_table("table1")

# register an output Table
table_env.connect(...).create_temporary_table("outputTable")

# create a Table from a Table API query
tapi_result = table_env.from_path("table1").select(...)
# create a Table from a SQL query
sql_result  = table_env.sql_query("SELECT ... FROM table1 ...")

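# emit a Table API result Table to a TableSink, same for SQL result
# (execute_insert() submits the job right away)
table_result = tapi_result.execute_insert("outputTable")
table_result...

# no table_env.execute() call here: it only runs statements buffered via
# insert_into()/sql_update() and fails when there is nothing left to execute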