Concepts & Common API
The Table API and SQL are integrated in a joint API. The central concept of this API is a Table, which serves as the input and output of queries. This document shows the common structure of programs with Table API and SQL queries, how to register a Table, how to query a Table, and how to emit a Table.
Main Differences Between the Two Planners
- Blink treats batch jobs as a special case of streaming. Therefore, conversion between Table and DataSet is not supported, and batch jobs are not translated into DataSet programs but into DataStream programs, the same as streaming jobs.
- The Blink planner does not support BatchTableSource; use a bounded StreamTableSource instead.
- The Blink planner only supports the brand-new Catalog and does not support ExternalCatalog, which is deprecated.
- The implementations of FilterableTableSource for the old planner and the Blink planner are incompatible. The old planner pushes PlannerExpressions down into the FilterableTableSource, whereas the Blink planner pushes Expressions down.
- String-based key-value configuration options (see the documentation about configuration for details) are only used by the Blink planner.
- The PlannerConfig implementations (CalciteConfig) of the two planners are different.
- The Blink planner optimizes multiple sinks into one DAG (supported only on TableEnvironment, not on StreamTableEnvironment). The old planner always optimizes each sink into a new DAG, and all DAGs are independent of each other.
- The old planner does not support catalog statistics, while the Blink planner does.
Structure of Table API and SQL Programs
All Table API and SQL programs for batch and streaming follow the same pattern. The following code example shows the common structure of Table API and SQL programs.
// create a TableEnvironment for specific planner batch or streaming
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// create a Table
tableEnv.connect(...).createTemporaryTable("table1");
// register an output Table
tableEnv.connect(...).createTemporaryTable("outputTable");
// create a Table object from a Table API query
Table tapiResult = tableEnv.from("table1").select(...);
// create a Table object from a SQL query
Table sqlResult = tableEnv.sqlQuery("SELECT ... FROM table1 ... ");
// emit a Table API result Table to a TableSink, same for SQL result
tapiResult.insertInto("outputTable");
// execute
tableEnv.execute("java_job");
// create a TableEnvironment for specific planner batch or streaming
val tableEnv = ... // see "Create a TableEnvironment" section
// create a Table
tableEnv.connect(...).createTemporaryTable("table1")
// register an output Table
tableEnv.connect(...).createTemporaryTable("outputTable")
// create a Table from a Table API query
val tapiResult = tableEnv.from("table1").select(...)
// create a Table from a SQL query
val sqlResult = tableEnv.sqlQuery("SELECT ... FROM table1 ...")
// emit a Table API result Table to a TableSink, same for SQL result
tapiResult.insertInto("outputTable")
// execute
tableEnv.execute("scala_job")
# create a TableEnvironment for specific planner batch or streaming
table_env = ... # see "Create a TableEnvironment" section
# register a Table
table_env.connect(...).create_temporary_table("table1")
# register an output Table
table_env.connect(...).create_temporary_table("outputTable")
# create a Table from a Table API query
tapi_result = table_env.from_path("table1").select(...)
# create a Table from a SQL query
sql_result = table_env.sql_query("SELECT ... FROM table1 ...")
# emit a Table API result Table to a TableSink, same for SQL result
tapi_result.insert_into("outputTable")
# execute
table_env.execute("python_job")
Note: Table API and SQL queries can be easily integrated with and embedded into DataStream or DataSet programs. Have a look at the Integration with DataStream and DataSet API section to learn how DataStreams and DataSets can be converted into Tables and vice versa.
Create a TableEnvironment
The TableEnvironment is a central concept of the Table API and SQL integration. It is responsible for:
- registering a Table in the internal catalog
- registering catalogs
- loading pluggable modules
- executing SQL queries
- registering a user-defined (scalar, table, or aggregation) function
- converting a DataStream or DataSet into a Table
- holding a reference to an ExecutionEnvironment or StreamExecutionEnvironment
A Table is always bound to a specific TableEnvironment. It is not possible to combine tables of different TableEnvironments in the same query, e.g., to join or union them.
A TableEnvironment is created by calling the static BatchTableEnvironment.create() or StreamTableEnvironment.create() method with a StreamExecutionEnvironment or an ExecutionEnvironment and an optional TableConfig. The TableConfig can be used to configure the TableEnvironment or to customize the query optimization and translation process (see Query Optimization).
Make sure to choose the BatchTableEnvironment/StreamTableEnvironment that matches both your planner and your programming language.
If both planner jars are on the classpath (the default behavior), you should explicitly specify which planner to use in the current program.
Java
// **********************
// FLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.java.StreamTableEnvironment;
EnvironmentSettings fsSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build();
StreamExecutionEnvironment fsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings);
// or TableEnvironment fsTableEnv = TableEnvironment.create(fsSettings);
// ******************
// FLINK BATCH QUERY
// ******************
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
ExecutionEnvironment fbEnv = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment fbTableEnv = BatchTableEnvironment.create(fbEnv);
// **********************
// BLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.java.StreamTableEnvironment;
StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
// or TableEnvironment bsTableEnv = TableEnvironment.create(bsSettings);
// ******************
// BLINK BATCH QUERY
// ******************
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
EnvironmentSettings bbSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);
Scala
// **********************
// FLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala.StreamTableEnvironment
val fsSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings)
// or val fsTableEnv = TableEnvironment.create(fsSettings)
// ******************
// FLINK BATCH QUERY
// ******************
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.scala.BatchTableEnvironment
val fbEnv = ExecutionEnvironment.getExecutionEnvironment
val fbTableEnv = BatchTableEnvironment.create(fbEnv)
// **********************
// BLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala.StreamTableEnvironment
val bsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings)
// or val bsTableEnv = TableEnvironment.create(bsSettings)
// ******************
// BLINK BATCH QUERY
// ******************
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
val bbSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val bbTableEnv = TableEnvironment.create(bbSettings)
Python
# **********************
# FLINK STREAMING QUERY
# **********************
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings
f_s_env = StreamExecutionEnvironment.get_execution_environment()
f_s_settings = EnvironmentSettings.new_instance().use_old_planner().in_streaming_mode().build()
f_s_t_env = StreamTableEnvironment.create(f_s_env, environment_settings=f_s_settings)
# ******************
# FLINK BATCH QUERY
# ******************
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import BatchTableEnvironment
f_b_env = ExecutionEnvironment.get_execution_environment()
f_b_t_env = BatchTableEnvironment.create(f_b_env)
# **********************
# BLINK STREAMING QUERY
# **********************
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings
b_s_env = StreamExecutionEnvironment.get_execution_environment()
b_s_settings = EnvironmentSettings.new_instance().use_blink_planner().in_streaming_mode().build()
b_s_t_env = StreamTableEnvironment.create(b_s_env, environment_settings=b_s_settings)
# ******************
# BLINK BATCH QUERY
# ******************
from pyflink.table import EnvironmentSettings, BatchTableEnvironment
b_b_settings = EnvironmentSettings.new_instance().use_blink_planner().in_batch_mode().build()
b_b_t_env = BatchTableEnvironment.create(environment_settings=b_b_settings)
Note: If there is only one planner jar in the /lib directory, you can use useAnyPlanner (use_any_planner for Python) to create the EnvironmentSettings, as in the sketch below.
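For example, a minimal sketch of such planner-agnostic settings (Java; variable names are illustrative):
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
// let the framework pick whichever planner jar is on the classpath
EnvironmentSettings anySettings = EnvironmentSettings.newInstance().useAnyPlanner().inStreamingMode().build();
TableEnvironment anyTableEnv = TableEnvironment.create(anySettings);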
Create Tables in the Catalog
A TableEnvironment maintains a map of catalogs of tables which are created with an identifier. Each identifier consists of three parts: catalog name, database name, and object name. If a catalog or database is not specified, the current default values are used (see the examples in the Expanding Table Identifiers section).
Tables can be either virtual (VIEWS) or regular (TABLES). VIEWS can be created from an existing Table object, usually the result of a Table API or SQL query. TABLES describe external data, such as a file, a database table, or a message queue.
Temporary vs Permanent Tables
Tables may either be temporary, tied to the lifecycle of a single Flink session, or permanent, visible across multiple Flink sessions and clusters.
Permanent tables require a catalog (such as the Hive Metastore) to maintain metadata about the table. Once a permanent table is created, it is visible to any Flink session that is connected to the catalog and continues to exist until it is explicitly dropped.
Temporary tables, on the other hand, are always stored in memory and exist only for the duration of the Flink session in which they are created. They are not visible to other sessions. They are not bound to any catalog or database, but they can be created in the namespace of one. Temporary tables are not dropped if their corresponding database is removed.
Shadowing
It is possible to register a temporary table with the same identifier as an existing permanent table. The temporary table shadows the permanent one and makes the permanent table inaccessible for as long as the temporary table exists. All queries with that identifier are executed against the temporary table.
This can be useful for experimentation: it allows running exactly the same query first against a temporary table that, for example, contains only a subset of the data or obfuscated data. Once the query is verified to be correct, it can be run against the real production table. The sketch below illustrates this.
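A minimal sketch of shadowing (Java; the identifier 'Orders' and the queries are hypothetical):
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// assume a permanent table 'Orders' already exists in the current catalog/database
// registering a temporary view under the same identifier shadows the permanent table
tableEnv.createTemporaryView("Orders", tableEnv.sqlQuery("SELECT ... "));
// this query now runs against the temporary view, not the permanent table
Table result = tableEnv.sqlQuery("SELECT ... FROM Orders");
// dropping the temporary view makes the permanent table accessible again
tableEnv.dropTemporaryView("Orders");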
Create a Table
Virtual Tables
A Table API object corresponds to a VIEW (virtual table) in SQL terms. It encapsulates a logical query plan and can be created in a catalog as follows:
Java
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// table is the result of a simple projection query
Table projTable = tableEnv.from("X").select(...);
// register the Table projTable as table "projectedTable"
tableEnv.createTemporaryView("projectedTable", projTable);
Scala
// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section
// table is the result of a simple projection query
val projTable: Table = tableEnv.from("X").select(...)
// register the Table projTable as table "projectedTable"
tableEnv.createTemporaryView("projectedTable", projTable)
Python
# get a TableEnvironment
table_env = ... # see "Create a TableEnvironment" section
# table is the result of a simple projection query
proj_table = table_env.from_path("X").select(...)
# register the Table projTable as table "projectedTable"
table_env.register_table("projectedTable", proj_table)
**Note:** Table objects are similar to VIEWs in relational database systems: the query that defines the Table is not optimized eagerly but is inlined when another query references the registered Table. If multiple queries reference the same registered Table, it is inlined for each referencing query and executed multiple times, i.e., the result of the registered Table is not shared.
Connector Tables
It is also possible to create a TABLE, as known from relational databases, from a connector declaration. The connector describes the external system that stores the data of the table. Storage systems such as Apache Kafka or a regular file system can be declared here.
Java
tableEnvironment
.connect(...)
.withFormat(...)
.withSchema(...)
.inAppendMode()
.createTemporaryTable("MyTable")
Scala
tableEnvironment
.connect(...)
.withFormat(...)
.withSchema(...)
.inAppendMode()
.createTemporaryTable("MyTable")
Python
table_environment \
.connect(...) \
.with_format(...) \
.with_schema(...) \
.in_append_mode() \
.create_temporary_table("MyTable")
DDL
tableEnvironment.sqlUpdate("CREATE [TEMPORARY] TABLE MyTable (...) WITH (...)")
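As a concrete, hypothetical version of the DDL above (the connector properties 'connector.type', 'connector.path', and 'format.type' are illustrative and depend on the connectors available on the classpath):
tableEnvironment.sqlUpdate(
  "CREATE TABLE MyTable (" +
  "  user_id BIGINT," +
  "  item_id BIGINT" +
  ") WITH (" +
  "  'connector.type' = 'filesystem'," +
  "  'connector.path' = '/path/to/file'," +
  "  'format.type' = 'csv'" +
  ")");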
Expanding Table Identifiers
Tables are always registered with a three-part identifier consisting of catalog name, database name, and table name.
Users can set one catalog and one database inside it as the "current catalog" and "current database". With them, the first two parts of the three-part identifier are optional: if they are not provided, the current catalog and current database are used. Users can switch the current catalog and current database via the Table API or SQL.
Identifiers follow SQL requirements, which means they can be escaped with a backtick character (`). Additionally, all SQL reserved keywords must be escaped.
Java
TableEnvironment tEnv = ...;
tEnv.useCatalog("custom_catalog");
tEnv.useDatabase("custom_database");
Table table = ...;
// register the view named 'exampleView' in the catalog named 'custom_catalog'
// in the database named 'custom_database'
tEnv.createTemporaryView("exampleView", table);
// register the view named 'exampleView' in the catalog named 'custom_catalog'
// in the database named 'other_database'
tEnv.createTemporaryView("other_database.exampleView", table);
// register the view named 'View' in the catalog named 'custom_catalog' in the
// database named 'custom_database'. 'View' is a reserved keyword and must be escaped.
tEnv.createTemporaryView("`View`", table);
// register the view named 'example.View' in the catalog named 'custom_catalog'
// in the database named 'custom_database'
tEnv.createTemporaryView("`example.View`", table);
// register the view named 'exampleView' in the catalog named 'other_catalog'
// in the database named 'other_database'
tEnv.createTemporaryView("other_catalog.other_database.exampleView", table);
Scala
// get a TableEnvironment
val tEnv: TableEnvironment = ...
tEnv.useCatalog("custom_catalog")
tEnv.useDatabase("custom_database")
val table: Table = ...
// register the view named 'exampleView' in the catalog named 'custom_catalog'
// in the database named 'custom_database'
tEnv.createTemporaryView("exampleView", table)
// register the view named 'exampleView' in the catalog named 'custom_catalog'
// in the database named 'other_database'
tEnv.createTemporaryView("other_database.exampleView", table)
// register the view named 'View' in the catalog named 'custom_catalog' in the
// database named 'custom_database'. 'View' is a reserved keyword and must be escaped.
tEnv.createTemporaryView("`View`", table)
// register the view named 'example.View' in the catalog named 'custom_catalog'
// in the database named 'custom_database'
tEnv.createTemporaryView("`example.View`", table)
// register the view named 'exampleView' in the catalog named 'other_catalog'
// in the database named 'other_database'
tEnv.createTemporaryView("other_catalog.other_database.exampleView", table)
Query a Table
Table API
The Table API is a language-integrated query API for Scala and Java. In contrast to SQL, queries are not specified as Strings but are composed step by step in the host language.
The API is based on the Table class, which represents a table (streaming or batch) and offers methods to apply relational operations. These methods return a new Table object, which represents the result of applying the relational operation to the input Table. Some relational operations are composed of multiple method calls, such as table.groupBy(...).select(), where groupBy(...) specifies a grouping of table and select(...) the projection on the grouping.
The Table API document describes all Table API operations supported on streaming and batch tables.
The following example shows a simple Table API aggregation query:
Java
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// register Orders table
// scan registered Orders table
Table orders = tableEnv.from("Orders");
// compute revenue for all customers from France
Table revenue = orders
.filter("cCountry === 'FRANCE'")
.groupBy("cID, cName")
.select("cID, cName, revenue.sum AS revSum");
// emit or convert Table
// execute query
Scala
// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section
// register Orders table
// scan registered Orders table
val orders = tableEnv.from("Orders")
// compute revenue for all customers from France
val revenue = orders
.filter('cCountry === "FRANCE")
.groupBy('cID, 'cName)
.select('cID, 'cName, 'revenue.sum AS 'revSum)
// emit or convert Table
// execute query
Python
# get a TableEnvironment
table_env = ...  # see "Create a TableEnvironment" section
# register Orders table
# scan registered Orders table
orders = table_env.from_path("Orders")
# compute revenue for all customers from France
revenue = orders \
.filter("cCountry === 'FRANCE'") \
.group_by("cID, cName") \
.select("cID, cName, revenue.sum AS revSum")
# emit or convert Table
# execute query
SQL
Flink's SQL integration is based on Apache Calcite, which implements the SQL standard. SQL queries are specified as regular Strings.
The SQL document describes Flink's SQL support for streaming and batch tables.
The following example shows how to specify a query and return the result as a Table.
Java
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// register Orders table
// compute revenue for all customers from France
Table revenue = tableEnv.sqlQuery(
"SELECT cID, cName, SUM(revenue) AS revSum " +
"FROM Orders " +
"WHERE cCountry = 'FRANCE' " +
"GROUP BY cID, cName"
);
// emit or convert Table
// execute query
Scala
// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section
// register Orders table
// compute revenue for all customers from France
val revenue = tableEnv.sqlQuery("""
|SELECT cID, cName, SUM(revenue) AS revSum
|FROM Orders
|WHERE cCountry = 'FRANCE'
|GROUP BY cID, cName
""".stripMargin)
// emit or convert Table
// execute query
Python
# get a TableEnvironment
table_env = ... # see "Create a TableEnvironment" section
# register Orders table
# compute revenue for all customers from France
revenue = table_env.sql_query(
"SELECT cID, cName, SUM(revenue) AS revSum "
"FROM Orders "
"WHERE cCountry = 'FRANCE' "
"GROUP BY cID, cName"
)
# emit or convert Table
# execute query
The following example shows how to specify an update query that inserts its result into a registered table.
Java
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// register "Orders" table
// register "RevenueFrance" output table
// compute revenue for all customers from France and emit to "RevenueFrance"
tableEnv.sqlUpdate(
"INSERT INTO RevenueFrance " +
"SELECT cID, cName, SUM(revenue) AS revSum " +
"FROM Orders " +
"WHERE cCountry = 'FRANCE' " +
"GROUP BY cID, cName"
);
// execute query
Scala
// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section
// register "Orders" table
// register "RevenueFrance" output table
// compute revenue for all customers from France and emit to "RevenueFrance"
tableEnv.sqlUpdate("""
|INSERT INTO RevenueFrance
|SELECT cID, cName, SUM(revenue) AS revSum
|FROM Orders
|WHERE cCountry = 'FRANCE'
|GROUP BY cID, cName
""".stripMargin)
// execute query
Python
# get a TableEnvironment
table_env = ... # see "Create a TableEnvironment" section
# register "Orders" table
# register "RevenueFrance" output table
# compute revenue for all customers from France and emit to "RevenueFrance"
table_env.sql_update(
"INSERT INTO RevenueFrance "
"SELECT cID, cName, SUM(revenue) AS revSum "
"FROM Orders "
"WHERE cCountry = 'FRANCE' "
"GROUP BY cID, cName"
)
# execute query
Mixing Table API and SQL
Table API and SQL queries can be easily mixed because both return Table objects:
- A Table API query can be defined on the Table object returned by a SQL query.
- A SQL query can be defined on the result of a Table API query by registering the resulting Table in the TableEnvironment and referencing it in the FROM clause of the SQL query (see the sketch below).
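A minimal sketch of both directions (Java; the registered table name 'Orders' and the view name 'TapiResult' are assumptions):
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// define a Table API query and register its result ...
Table tapiResult = tableEnv.from("Orders").select(...);
tableEnv.createTemporaryView("TapiResult", tapiResult);
// ... so that a SQL query can reference it in its FROM clause
Table sqlResult = tableEnv.sqlQuery("SELECT ... FROM TapiResult");
// conversely, Table API operations can continue on the SQL result
Table mixed = sqlResult.select(...);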
Emit a Table
A Table is emitted by writing it to a TableSink. A TableSink is a generic interface that supports a wide variety of file formats (e.g. CSV, Apache Parquet, Apache Avro), storage systems (e.g. JDBC, Apache HBase, Apache Cassandra, Elasticsearch), and messaging systems (e.g. Apache Kafka, RabbitMQ).
A batch Table can only be written to a BatchTableSink, while a streaming Table requires an AppendStreamTableSink, a RetractStreamTableSink, or an UpsertStreamTableSink.
Please see the documentation about Table Sources & Sinks for details about available sinks and instructions on how to implement a custom TableSink.
The Table.insertInto(String tableName) method emits the Table to a registered TableSink. The method looks up the TableSink in the catalog by name and validates that the schema of the Table is identical to the schema of the TableSink.
Java
// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// create an output Table
final Schema schema = new Schema()
.field("a", DataTypes.INT())
.field("b", DataTypes.STRING())
.field("c", DataTypes.LONG());
tableEnv.connect(new FileSystem("/path/to/file"))
.withFormat(new Csv().fieldDelimiter('|').deriveSchema())
.withSchema(schema)
.createTemporaryTable("CsvSinkTable");
// compute a result Table using Table API operators and/or SQL queries
Table result = ...
// emit the result Table to the registered TableSink
result.insertInto("CsvSinkTable");
// execute the program
Scala
// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section
// create an output Table
val schema = new Schema()
.field("a", DataTypes.INT())
.field("b", DataTypes.STRING())
.field("c", DataTypes.LONG())
tableEnv.connect(new FileSystem("/path/to/file"))
.withFormat(new Csv().fieldDelimiter('|').deriveSchema())
.withSchema(schema)
.createTemporaryTable("CsvSinkTable")
// compute a result Table using Table API operators and/or SQL queries
val result: Table = ...
// emit the result Table to the registered TableSink
result.insertInto("CsvSinkTable")
// execute the program
Python
# get a TableEnvironment
table_env = ... # see "Create a TableEnvironment" section
# create an output Table
table_env.connect(FileSystem().path("/path/to/file")) \
    .with_format(Csv()
                 .field_delimiter(',')
                 .derive_schema()) \
    .with_schema(Schema()
                 .field("a", DataTypes.INT())
                 .field("b", DataTypes.STRING())
                 .field("c", DataTypes.BIGINT())) \
    .create_temporary_table("CsvSinkTable")
# compute a result Table using Table API operators and/or SQL queries
result = ...
# emit the result Table to the registered TableSink
result.insert_into("CsvSinkTable")
# execute the program
Translate and Execute a Query
The behavior of translating and executing a query differs between the two planners.
Old planner
Table API and SQL queries are translated into DataStream or DataSet programs, depending on whether their input is a streaming or a batch input. A query is internally represented as a logical query plan and is translated in two phases:
- optimization of the logical plan,
- translation into a DataStream or DataSet program.
A Table API or SQL query is translated when:
- a Table is emitted to a TableSink, i.e., when Table.insertInto() is called,
- a SQL update query is specified, i.e., when TableEnvironment.sqlUpdate() is called,
- a Table is converted into a DataStream or DataSet (see Integration with DataStream and DataSet API).
Once translated, a Table API or SQL query is handled like a regular DataStream or DataSet program and is executed when StreamExecutionEnvironment.execute() or ExecutionEnvironment.execute() is called.
Blink planner
Table API and SQL queries are translated into DataStream programs, whether their input is streaming or batch. A query is internally represented as a logical query plan and is translated in two phases:
- optimization of the logical plan,
- translation into a DataStream program.
The behavior of translating a query differs between TableEnvironment and StreamTableEnvironment. For a TableEnvironment, a Table API or SQL query is translated when TableEnvironment.execute() is called, because a TableEnvironment optimizes multiple sinks into one DAG. For a StreamTableEnvironment, a query is translated when a Table is emitted to a TableSink, a SQL update query is specified, or a Table is converted into a DataStream.
Integration with DataStream and DataSet API
Both planners can integrate with the DataStream API for streaming. Only the old planner can integrate with the DataSet API; the Blink planner on batch cannot be combined with either. Note: the DataSet API discussed below is only relevant for the old planner on batch.
Table API and SQL queries can be easily integrated with and embedded into DataStream and DataSet programs. For instance, it is possible to query an external table (for example from an RDBMS), do some pre-processing such as filtering, projecting, aggregating, or joining with metadata, and then further process the data with the DataStream or DataSet API (and any of the libraries built on top of these APIs, such as CEP or Gelly). Conversely, a Table API or SQL query can also be applied to the result of a DataStream or DataSet program.
This interaction can be achieved by converting a DataStream or DataSet into a Table and vice versa. This section describes how these conversions are done.
Implicit Conversion for Scala
The Scala Table API features implicit conversions for the DataSet, DataStream, and Table classes. These conversions are enabled by importing the package org.apache.flink.table.api.scala._ in addition to org.apache.flink.api.scala._ for the Scala DataStream API.
Create a View from a DataStream or DataSet
A DataStream or DataSet can be registered in a TableEnvironment as a view. The schema of the resulting view depends on the data type of the registered DataStream or DataSet. Please check the section about mapping of data types to table schema for details.
Note: Views created from a DataStream or DataSet can only be registered as temporary views.
Java
// get StreamTableEnvironment
// registration of a DataSet in a BatchTableEnvironment is equivalent
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
DataStream<Tuple2<Long, String>> stream = ...
// register the DataStream as View "myTable" with fields "f0", "f1"
tableEnv.createTemporaryView("myTable", stream);
// register the DataStream as View "myTable2" with fields "myLong", "myString"
tableEnv.createTemporaryView("myTable2", stream, "myLong, myString");
Scala
// get TableEnvironment
// registration of a DataSet is equivalent
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
val stream: DataStream[(Long, String)] = ...
// register the DataStream as View "myTable" with fields "f0", "f1"
tableEnv.createTemporaryView("myTable", stream)
// register the DataStream as View "myTable2" with fields "myLong", "myString"
tableEnv.createTemporaryView("myTable2", stream, 'myLong, 'myString)
Convert a DataStream or DataSet into a Table
Instead of registering a DataStream or DataSet in a TableEnvironment, it can also be directly converted into a Table. This is convenient if you want to use the Table in a Table API query.
Java
// get StreamTableEnvironment
// registration of a DataSet in a BatchTableEnvironment is equivalent
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
DataStream<Tuple2<Long, String>> stream = ...
// Convert the DataStream into a Table with default fields "f0", "f1"
Table table1 = tableEnv.fromDataStream(stream);
// Convert the DataStream into a Table with fields "myLong", "myString"
Table table2 = tableEnv.fromDataStream(stream, "myLong, myString");
Scala
// get TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = ... // see "Create a TableEnvironment" section
val stream: DataStream[(Long, String)] = ...
// convert the DataStream into a Table with default fields '_1, '_2
val table1: Table = tableEnv.fromDataStream(stream)
// convert the DataStream into a Table with fields 'myLong, 'myString
val table2: Table = tableEnv.fromDataStream(stream, 'myLong, 'myString)
Convert a Table into a DataStream or DataSet
A Table can be converted into a DataStream or DataSet. This way, custom DataStream or DataSet programs can be run on the result of a Table API or SQL query.
When converting a Table into a DataStream or DataSet, you need to specify the data type of the resulting DataStream or DataSet, i.e., the data type into which the rows of the Table are to be converted. Often the most convenient conversion type is Row. The following list gives an overview of the features of the different options, and a conversion sketch follows the list:
- Row: fields are mapped by position, arbitrary number of fields, support for null values, no type-safe access.
- POJO: fields are mapped by name (POJO fields must be named like the Table fields), arbitrary number of fields, support for null values, type-safe access.
- Case Class: fields are mapped by position, no support for null values, type-safe access.
- Tuple: fields are mapped by position, limited to 22 (Scala) or 25 (Java) fields, no support for null values, type-safe access.
- Atomic Type: the Table must have a single field, no support for null values, type-safe access.
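For instance, a minimal sketch of a name-based, type-safe conversion (Java; the Person POJO is hypothetical and tableEnv is assumed to be a StreamTableEnvironment):
// a hypothetical POJO whose public fields match the table columns by name
public static class Person {
  public String name;
  public Integer age;
}
// convert the Table into an append DataStream of Person objects
DataStream<Person> persons = tableEnv.toAppendStream(table, Person.class);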
Convert a Table into a DataStream
The result Table of a streaming query is updated dynamically, i.e., it changes as new records arrive on the query's input streams. Hence, the DataStream into which such a dynamic query is converted needs to encode the updates of the table.
There are two modes to convert a Table into a DataStream:
- Append Mode: this mode can only be used if the dynamic Table is modified by INSERT changes only, i.e., it is append-only and previously emitted results are never updated.
- Retract Mode: this mode can always be used. It encodes INSERT and DELETE changes with a boolean flag.
Java
// get StreamTableEnvironment.
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);
// convert the Table into an append DataStream of Tuple2<String, Integer>
// via a TypeInformation
TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
Types.STRING(),
Types.INT());
DataStream<Tuple2<String, Integer>> dsTuple =
tableEnv.toAppendStream(table, tupleType);
// convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream<Tuple2<Boolean, X>>.
// The boolean field indicates the type of the change.
// True is INSERT, false is DELETE.
DataStream<Tuple2<Boolean, Row>> retractStream =
tableEnv.toRetractStream(table, Row.class);
Scala
// get TableEnvironment.
// registration of a DataSet is equivalent
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
// Table with two fields (String name, Integer age)
val table: Table = ...
// convert the Table into an append DataStream of Row
val dsRow: DataStream[Row] = tableEnv.toAppendStream[Row](table)
// convert the Table into an append DataStream of Tuple2[String, Int]
val dsTuple: DataStream[(String, Int)] =
tableEnv.toAppendStream[(String, Int)](table)
// convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream[(Boolean, X)].
// The boolean field indicates the type of the change.
// True is INSERT, false is DELETE.
val retractStream: DataStream[(Boolean, Row)] = tableEnv.toRetractStream[Row](table)
Convert a Table into a DataSet
A Table is converted into a DataSet as follows:
Java
// get BatchTableEnvironment
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into a DataSet of Row by specifying a class
DataSet<Row> dsRow = tableEnv.toDataSet(table, Row.class);
// convert the Table into a DataSet of Tuple2<String, Integer> via a TypeInformation
TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
Types.STRING(),
Types.INT());
DataSet<Tuple2<String, Integer>> dsTuple =
tableEnv.toDataSet(table, tupleType);
Scala
// get TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = BatchTableEnvironment.create(env)
// Table with two fields (String name, Integer age)
val table: Table = ...
// convert the Table into a DataSet of Row
val dsRow: DataSet[Row] = tableEnv.toDataSet[Row](table)
// convert the Table into a DataSet of Tuple2[String, Int]
val dsTuple: DataSet[(String, Int)] = tableEnv.toDataSet[(String, Int)](table)
Mapping of Data Types to Table Schema
Flink's DataStream and DataSet APIs support very diverse types.
Composite types such as Tuples (built-in Scala and Flink Java tuples), POJOs, Scala case classes, and Flink's Row type allow for nested data structures with multiple fields that can be accessed in table expressions. Other types are treated as atomic types. In the following, we describe how the Table API converts these types into an internal row representation and show examples of converting a DataStream into a Table.
The mapping of a data type to a table schema can happen in two ways: based on field positions or based on field names.
Position-based Mapping
Position-based mapping can be used to give fields a more meaningful name while keeping the field order. This mapping is available for composite data types with a defined field order as well as for atomic types. Composite data types such as tuples, rows, and case classes have such a field order; the fields of a POJO, however, must be mapped based on field names (see the next section). Fields can be projected out, but cannot be renamed using an alias (as).
When defining a position-based mapping, the specified names must not exist in the input data type, otherwise the API assumes that the mapping should happen based on field names. If no field names are specified, the default field names and field order of the composite type are used (or f0 for atomic types).
Java
// get a StreamTableEnvironment, works for BatchTableEnvironment equivalently
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section;
DataStream<Tuple2<Long, Integer>> stream = ...
// convert DataStream into Table with default field names "f0" and "f1"
Table table = tableEnv.fromDataStream(stream);
// convert DataStream into Table with field "myLong" only
Table table = tableEnv.fromDataStream(stream, "myLong");
// convert DataStream into Table with field names "myLong" and "myInt"
Table table = tableEnv.fromDataStream(stream, "myLong, myInt");
Scala
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
val stream: DataStream[(Long, Int)] = ...
// convert DataStream into Table with default field names "_1" and "_2"
val table: Table = tableEnv.fromDataStream(stream)
// convert DataStream into Table with field "myLong" only
val table: Table = tableEnv.fromDataStream(stream, 'myLong)
// convert DataStream into Table with field names "myLong" and "myInt"
val table: Table = tableEnv.fromDataStream(stream, 'myLong, 'myInt)
Name-based Mapping
Name-based mapping can be used for any data type, including POJOs. It is the most flexible way of defining a table schema mapping. All fields in the mapping are referenced by name and can be renamed using an alias (as). Fields can be reordered and projected out.
If no field names are specified, the default field names and field order of the composite type are used (or f0 for atomic types).
Java
// get a StreamTableEnvironment, works for BatchTableEnvironment equivalently
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
DataStream<Tuple2<Long, Integer>> stream = ...
// convert DataStream into Table with default field names "f0" and "f1"
Table table = tableEnv.fromDataStream(stream);
// convert DataStream into Table with field "f1" only
Table table = tableEnv.fromDataStream(stream, "f1");
// convert DataStream into Table with swapped fields
Table table = tableEnv.fromDataStream(stream, "f1, f0");
// convert DataStream into Table with swapped fields and field names "myInt" and "myLong"
Table table = tableEnv.fromDataStream(stream, "f1 as myInt, f0 as myLong");
Scala
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
val stream: DataStream[(Long, Int)] = ...
// convert DataStream into Table with default field names "_1" and "_2"
val table: Table = tableEnv.fromDataStream(stream)
// convert DataStream into Table with field "_2" only
val table: Table = tableEnv.fromDataStream(stream, '_2)
// convert DataStream into Table with swapped fields
val table: Table = tableEnv.fromDataStream(stream, '_2, '_1)
// convert DataStream into Table with swapped fields and field names "myInt" and "myLong"
val table: Table = tableEnv.fromDataStream(stream, '_2 as 'myInt, '_1 as 'myLong)
Atomic Types
Flink treats primitives (Integer, Double, String) and generic types (types that cannot be analyzed and decomposed) as atomic types. A DataStream or DataSet of an atomic type is converted into a Table with a single attribute. The type of the attribute is inferred from the atomic type, and the name of the attribute can be specified.
Java
// get a StreamTableEnvironment, works for BatchTableEnvironment equivalently
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
DataStream<Long> stream = ...
// convert DataStream into Table with default field name "f0"
Table table = tableEnv.fromDataStream(stream);
// convert DataStream into Table with field name "myLong"
Table table = tableEnv.fromDataStream(stream, "myLong");
Scala
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
val stream: DataStream[Long] = ...
// convert DataStream into Table with default field name "f0"
val table: Table = tableEnv.fromDataStream(stream)
// convert DataStream into Table with field name "myLong"
val table: Table = tableEnv.fromDataStream(stream, 'myLong)
Tuples (Scala and Java) and Case Classes (Scala only)
Flink supports Scala's built-in tuples and provides its own tuple classes for Java. DataStreams and DataSets of both kinds of tuples can be converted into tables. Fields can be renamed by providing names for all fields (position-based mapping). If no field names are specified, the default field names are used. If the original field names (f0, f1, ... for Flink Tuples and _1, _2, ... for Scala tuples) are referenced, the API assumes the mapping is name-based instead of position-based. Name-based mapping allows reordering fields and projection with aliases (as).
Java
// get a StreamTableEnvironment, works for BatchTableEnvironment equivalently
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
DataStream<Tuple2<Long, String>> stream = ...
// convert DataStream into Table with default field names "f0", "f1"
Table table = tableEnv.fromDataStream(stream);
// convert DataStream into Table with renamed field names "myLong", "myString" (position-based)
Table table = tableEnv.fromDataStream(stream, "myLong, myString");
// convert DataStream into Table with reordered fields "f1", "f0" (name-based)
Table table = tableEnv.fromDataStream(stream, "f1, f0");
// convert DataStream into Table with projected field "f1" (name-based)
Table table = tableEnv.fromDataStream(stream, "f1");
// convert DataStream into Table with reordered and aliased fields "myString", "myLong" (name-based)
Table table = tableEnv.fromDataStream(stream, "f1 as 'myString', f0 as 'myLong'");
Scala
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
val stream: DataStream[(Long, String)] = ...
// convert DataStream into Table with renamed default field names '_1, '_2
val table: Table = tableEnv.fromDataStream(stream)
// convert DataStream into Table with field names "myLong", "myString" (position-based)
val table: Table = tableEnv.fromDataStream(stream, 'myLong, 'myString)
// convert DataStream into Table with reordered fields "_2", "_1" (name-based)
val table: Table = tableEnv.fromDataStream(stream, '_2, '_1)
// convert DataStream into Table with projected field "_2" (name-based)
val table: Table = tableEnv.fromDataStream(stream, '_2)
// convert DataStream into Table with reordered and aliased fields "myString", "myLong" (name-based)
val table: Table = tableEnv.fromDataStream(stream, '_2 as 'myString, '_1 as 'myLong)
// define case class
case class Person(name: String, age: Int)
val streamCC: DataStream[Person] = ...
// convert DataStream into Table with default field names 'name, 'age
val table = tableEnv.fromDataStream(streamCC)
// convert DataStream into Table with field names 'myName, 'myAge (position-based)
val table = tableEnv.fromDataStream(streamCC, 'myName, 'myAge)
// convert DataStream into Table with reordered and aliased fields "myAge", "myName" (name-based)
val table: Table = tableEnv.fromDataStream(streamCC, 'age as 'myAge, 'name as 'myName)
POJO (Java and Scala)
Flink supports POJOs as composite types. The rules that determine a POJO are documented here.
When converting a POJO DataStream or DataSet into a Table without specifying field names, the names of the original POJO fields are used. The name mapping requires the original names and cannot be done by position. Fields can be renamed using an alias (with the as keyword), reordered, and projected.
Java
// get a StreamTableEnvironment, works for BatchTableEnvironment equivalently
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// Person is a POJO with fields "name" and "age"
DataStream<Person> stream = ...
// convert DataStream into Table with default field names "age", "name" (fields are ordered by name!)
Table table = tableEnv.fromDataStream(stream);
// convert DataStream into Table with renamed fields "myAge", "myName" (name-based)
Table table = tableEnv.fromDataStream(stream, "age as myAge, name as myName");
// convert DataStream into Table with projected field "name" (name-based)
Table table = tableEnv.fromDataStream(stream, "name");
// convert DataStream into Table with projected and renamed field "myName" (name-based)
Table table = tableEnv.fromDataStream(stream, "name as myName");
Scala
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
// Person is a POJO with field names "name" and "age"
val stream: DataStream[Person] = ...
// convert DataStream into Table with default field names "age", "name" (fields are ordered by name!)
val table: Table = tableEnv.fromDataStream(stream)
// convert DataStream into Table with renamed fields "myAge", "myName" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'age as 'myAge, 'name as 'myName)
// convert DataStream into Table with projected field "name" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name)
// convert DataStream into Table with projected and renamed field "myName" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name as 'myName)
Row
The Row data type supports an arbitrary number of fields and fields with null values. Field names can be specified via a RowTypeInfo or when converting a Row DataStream or DataSet into a Table. The Row type supports mapping fields by position and by name. Fields can be renamed by providing names for all fields (position-based mapping) or selected individually for projection/ordering/renaming (name-based mapping).
Java
// get a StreamTableEnvironment, works for BatchTableEnvironment equivalently
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// DataStream of Row with two fields "name" and "age" specified in `RowTypeInfo`
DataStream<Row> stream = ...
// convert DataStream into Table with default field names "name", "age"
Table table = tableEnv.fromDataStream(stream);
// convert DataStream into Table with renamed field names "myName", "myAge" (position-based)
Table table = tableEnv.fromDataStream(stream, "myName, myAge");
// convert DataStream into Table with renamed fields "myName", "myAge" (name-based)
Table table = tableEnv.fromDataStream(stream, "name as myName, age as myAge");
// convert DataStream into Table with projected field "name" (name-based)
Table table = tableEnv.fromDataStream(stream, "name");
// convert DataStream into Table with projected and renamed field "myName" (name-based)
Table table = tableEnv.fromDataStream(stream, "name as myName");
Scala
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
// DataStream of Row with two fields "name" and "age" specified in `RowTypeInfo`
val stream: DataStream[Row] = ...
// convert DataStream into Table with default field names "name", "age"
val table: Table = tableEnv.fromDataStream(stream)
// convert DataStream into Table with renamed field names "myName", "myAge" (position-based)
val table: Table = tableEnv.fromDataStream(stream, 'myName, 'myAge)
// convert DataStream into Table with renamed fields "myName", "myAge" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name as 'myName, 'age as 'myAge)
// convert DataStream into Table with projected field "name" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name)
// convert DataStream into Table with projected and renamed field "myName" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name as 'myName)
Query Optimization
Old Planner
Apache Flink leverages Apache Calcite to optimize and translate queries. The optimizations currently performed include projection and filter push-down, subquery decorrelation, and other kinds of query rewriting. The old planner does not yet optimize the order of joins; instead, it executes them in the order defined in the query (order of tables in the FROM clause and/or order of join predicates in the WHERE clause).
The set of optimization rules applied in the different phases can be adjusted by providing a CalciteConfig object. It is created via a builder by calling CalciteConfig.createBuilder() and handed to the TableEnvironment by calling tableEnv.getConfig.setPlannerConfig(calciteConfig).
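A minimal sketch of wiring this up (Java, old planner); the import path of CalciteConfig is an assumption and depends on the planner version:
import org.apache.flink.table.calcite.CalciteConfig; // assumed old-planner package
// build a (here empty) CalciteConfig; its builder also exposes methods for
// adding or replacing optimization rule sets for the different phases
CalciteConfig calciteConfig = CalciteConfig.createBuilder().build();
// hand it to the TableEnvironment via its TableConfig
tableEnv.getConfig().setPlannerConfig(calciteConfig);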
Explaining a Table
The Table API provides a mechanism to explain the logical and optimized query plans used to compute a Table. This is done through the TableEnvironment.explain(table) and TableEnvironment.explain() methods. explain(table) returns the plan of a given Table. explain() returns the result of a multiple-sinks plan and is mainly used for the Blink planner. The returned String describes three plans:
- the Abstract Syntax Tree of the relational query, i.e., the unoptimized logical query plan,
- the optimized logical query plan, and
- the physical execution plan.
The following code shows an example and the corresponding output obtained with explain(table) for a given Table:
Java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
DataStream<Tuple2<Integer, String>> stream1 = env.fromElements(new Tuple2<>(1, "hello"));
DataStream<Tuple2<Integer, String>> stream2 = env.fromElements(new Tuple2<>(1, "hello"));
Table table1 = tEnv.fromDataStream(stream1, "count, word");
Table table2 = tEnv.fromDataStream(stream2, "count, word");
Table table = table1
.where("LIKE(word, 'F%')")
.unionAll(table2);
String explanation = tEnv.explain(table);
System.out.println(explanation);
Scala
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)
val table1 = env.fromElements((1, "hello")).toTable(tEnv, 'count, 'word)
val table2 = env.fromElements((1, "hello")).toTable(tEnv, 'count, 'word)
val table = table1
.where('word.like("F%"))
.unionAll(table2)
val explanation: String = tEnv.explain(table)
println(explanation)
Python
env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
table1 = t_env.from_elements([(1, "hello")], ["count", "word"])
table2 = t_env.from_elements([(1, "hello")], ["count", "word"])
table = table1 \
.where("LIKE(word, 'F%')") \
.union_all(table2)
explanation = t_env.explain(table)
print(explanation)
Java/Scala
== Abstract Syntax Tree ==
LogicalUnion(all=[true])
LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
FlinkLogicalDataStreamScan(id=[1], fields=[count, word])
FlinkLogicalDataStreamScan(id=[2], fields=[count, word])
== Optimized Logical Plan ==
DataStreamUnion(all=[true], union all=[count, word])
DataStreamCalc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])
DataStreamScan(id=[1], fields=[count, word])
DataStreamScan(id=[2], fields=[count, word])
== Physical Execution Plan ==
Stage 1 : Data Source
content : collect elements with CollectionInputFormat
Stage 2 : Data Source
content : collect elements with CollectionInputFormat
Stage 3 : Operator
content : from: (count, word)
ship_strategy : REBALANCE
Stage 4 : Operator
content : where: (LIKE(word, _UTF-16LE'F%')), select: (count, word)
ship_strategy : FORWARD
Stage 5 : Operator
content : from: (count, word)
ship_strategy : REBALANCE
Python
== Abstract Syntax Tree ==
LogicalUnion(all=[true])
LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
FlinkLogicalDataStreamScan(id=[3], fields=[count, word])
FlinkLogicalDataStreamScan(id=[6], fields=[count, word])
== Optimized Logical Plan ==
DataStreamUnion(all=[true], union all=[count, word])
DataStreamCalc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])
DataStreamScan(id=[3], fields=[count, word])
DataStreamScan(id=[6], fields=[count, word])
== Physical Execution Plan ==
Stage 1 : Data Source
content : collect elements with CollectionInputFormat
Stage 2 : Operator
content : Flat Map
ship_strategy : FORWARD
Stage 3 : Operator
content : Map
ship_strategy : FORWARD
Stage 4 : Data Source
content : collect elements with CollectionInputFormat
Stage 5 : Operator
content : Flat Map
ship_strategy : FORWARD
Stage 6 : Operator
content : Map
ship_strategy : FORWARD
Stage 7 : Operator
content : Map
ship_strategy : FORWARD
Stage 8 : Operator
content : where: (LIKE(word, _UTF-16LE'F%')), select: (count, word)
ship_strategy : FORWARD
Stage 9 : Operator
content : Map
ship_strategy : FORWARD
The following code shows an example and the corresponding output obtained with explain() for a multiple-sinks plan:
Java
EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
TableEnvironment tEnv = TableEnvironment.create(settings);
final Schema schema = new Schema()
.field("count", DataTypes.INT())
.field("word", DataTypes.STRING());
tEnv.connect(new FileSystem("/source/path1"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySource1");
tEnv.connect(new FileSystem("/source/path2"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySource2");
tEnv.connect(new FileSystem("/sink/path1"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySink1");
tEnv.connect(new FileSystem("/sink/path2"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySink2");
Table table1 = tEnv.from("MySource1").where("LIKE(word, 'F%')");
table1.insertInto("MySink1");
Table table2 = table1.unionAll(tEnv.from("MySource2"));
table2.insertInto("MySink2");
String explanation = tEnv.explain(false);
System.out.println(explanation);
Scala
val settings = EnvironmentSettings.newInstance.useBlinkPlanner.inStreamingMode.build
val tEnv = TableEnvironment.create(settings)
val schema = new Schema()
.field("count", DataTypes.INT())
.field("word", DataTypes.STRING())
tEnv.connect(new FileSystem("/source/path1"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySource1")
tEnv.connect(new FileSystem("/source/path2"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySource2")
tEnv.connect(new FileSystem("/sink/path1"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySink1")
tEnv.connect(new FileSystem("/sink/path2"))
.withFormat(new Csv().deriveSchema())
.withSchema(schema)
.createTemporaryTable("MySink2")
val table1 = tEnv.from("MySource1").where("LIKE(word, 'F%')")
table1.insertInto("MySink1")
val table2 = table1.unionAll(tEnv.from("MySource2"))
table2.insertInto("MySink2")
val explanation = tEnv.explain(false)
println(explanation)
Python
settings = EnvironmentSettings.new_instance().use_blink_planner().in_streaming_mode().build()
t_env = TableEnvironment.create(environment_settings=settings)
schema = Schema() \
    .field("count", DataTypes.INT()) \
    .field("word", DataTypes.STRING())
t_env.connect(FileSystem().path("/source/path1")) \
    .with_format(Csv().derive_schema()) \
    .with_schema(schema) \
    .create_temporary_table("MySource1")
t_env.connect(FileSystem().path("/source/path2")) \
    .with_format(Csv().derive_schema()) \
    .with_schema(schema) \
    .create_temporary_table("MySource2")
t_env.connect(FileSystem().path("/sink/path1")) \
    .with_format(Csv().derive_schema()) \
    .with_schema(schema) \
    .create_temporary_table("MySink1")
t_env.connect(FileSystem().path("/sink/path2")) \
    .with_format(Csv().derive_schema()) \
    .with_schema(schema) \
    .create_temporary_table("MySink2")
table1 = t_env.from_path("MySource1").where("LIKE(word, 'F%')")
table1.insert_into("MySink1")
table2 = table1.union_all(t_env.from_path("MySource2"))
table2.insert_into("MySink2")
explanation = t_env.explain()
print(explanation)
The result of the multiple-sinks plan is:
== Abstract Syntax Tree ==
LogicalSink(name=[MySink1], fields=[count, word])
+- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
+- LogicalTableScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]])
LogicalSink(name=[MySink2], fields=[count, word])
+- LogicalUnion(all=[true])
:- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
: +- LogicalTableScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]])
+- LogicalTableScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]])
== Optimized Logical Plan ==
Calc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')], reuse_id=[1])
+- TableSourceScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])
Sink(name=[MySink1], fields=[count, word])
+- Reused(reference_id=[1])
Sink(name=[MySink2], fields=[count, word])
+- Union(all=[true], union=[count, word])
:- Reused(reference_id=[1])
+- TableSourceScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])
== Physical Execution Plan ==
Stage 1 : Data Source
content : collect elements with CollectionInputFormat
Stage 2 : Operator
content : CsvTableSource(read fields: count, word)
ship_strategy : REBALANCE
Stage 3 : Operator
content : SourceConversion(table:Buffer(default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]), fields:(count, word))
ship_strategy : FORWARD
Stage 4 : Operator
content : Calc(where: (word LIKE _UTF-16LE'F%'), select: (count, word))
ship_strategy : FORWARD
Stage 5 : Operator
content : SinkConversionToRow
ship_strategy : FORWARD
Stage 6 : Operator
content : Map
ship_strategy : FORWARD
Stage 8 : Data Source
content : collect elements with CollectionInputFormat
Stage 9 : Operator
content : CsvTableSource(read fields: count, word)
ship_strategy : REBALANCE
Stage 10 : Operator
content : SourceConversion(table:Buffer(default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]), fields:(count, word))
ship_strategy : FORWARD
Stage 12 : Operator
content : SinkConversionToRow
ship_strategy : FORWARD
Stage 13 : Operator
content : Map
ship_strategy : FORWARD
Stage 7 : Data Sink
content : Sink: CsvTableSink(count, word)
ship_strategy : FORWARD
Stage 14 : Data Sink
content : Sink: CsvTableSink(count, word)
ship_strategy : FORWARD