Flink Table API and SQL

This article will be continuously updated...

The main differences between the two planners:

For details, see the official documentation: Planner Comparison

  1. Blink treats batch jobs as a special case of stream processing. Strictly speaking, Table and DataSet are not interconvertible, and batch jobs are not translated into DataSet programs but into DataStream programs, just like streaming jobs.
  2. The Blink planner does not support BatchTableSource; it uses a bounded StreamTableSource instead.
  3. The implementations of FilterableTableSource in the old planner and the Blink planner are incompatible. The old planner pushes PlannerExpressions down into the FilterableTableSource, whereas the Blink planner pushes down Expressions.
  4. String-based key-value configuration options are used by the Blink planner only. (See Configuration for details.)
  5. The implementation of PlannerConfig (CalciteConfig) differs between the two planners.
  6. The Blink planner optimizes multiple sinks (multiple-sinks) into a single directed acyclic graph (DAG); both TableEnvironment and StreamTableEnvironment support this feature. The old planner always optimizes each sink into a separate DAG, with all graphs independent of each other (see the StatementSet sketch after this list).
  7. The old planner currently does not support catalog statistics, while the Blink planner does.
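
Regarding point 6: with the Blink planner, submitting several INSERT statements through one StatementSet lets the planner compile them into a single optimized DAG. A minimal sketch (the tables source_table, sink_a, and sink_b are hypothetical):

import org.apache.flink.table.api.StatementSet;

StatementSet stmtSet = tableEnv.createStatementSet();
// both inserts become part of the same job graph
stmtSet.addInsertSql("INSERT INTO sink_a SELECT id, name FROM source_table");
stmtSet.addInsertSql("INSERT INTO sink_b SELECT name, COUNT(*) FROM source_table GROUP BY name");
// submits everything as one job with a single optimized DAG
stmtSet.execute();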

Structure of a Table API and SQL Program

  • Official example:
// create a TableEnvironment for specific planner batch or streaming
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// create an input Table
tableEnv.executeSql("CREATE TEMPORARY TABLE table1 ... WITH ( 'connector' = ... )");
// register an output Table
tableEnv.executeSql("CREATE TEMPORARY TABLE outputTable ... WITH ( 'connector' = ... )");

// create a Table object from a Table API query
Table table2 = tableEnv.from("table1").select(...);
// create a Table object from a SQL query
Table table3 = tableEnv.sqlQuery("SELECT ... FROM table1 ... ");

// emit a Table API result Table to a TableSink, same for SQL result
TableResult tableResult = table2.executeInsert("outputTable");
tableResult...

Creating a TableEnvironment

The TableEnvironment is the core concept of the Table API and SQL integration. It is responsible for:

  • Registering Tables in the internal catalog
  • Registering external catalogs
  • Loading pluggable modules
  • Executing SQL queries
  • Registering user-defined functions (scalar, table, or aggregation; see the UDF sketch below)
  • Converting a DataStream or DataSet into a Table
  • Holding a reference to an ExecutionEnvironment or StreamExecutionEnvironment

A Table is always bound to a specific TableEnvironment. Tables from different TableEnvironments cannot be combined in the same query, e.g., joined or unioned.
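
As an example of the UDF registration mentioned above, here is a minimal sketch of a custom scalar function (the class, the function name, and the table source_table with its string column name are illustrative assumptions):

import org.apache.flink.table.functions.ScalarFunction;

// a scalar function that upper-cases a string
public static class UpperCaseFunction extends ScalarFunction {
    public String eval(String s) {
        return s == null ? null : s.toUpperCase();
    }
}

// register it under the name "MyUpper", then call it from SQL
tableEnv.createTemporarySystemFunction("MyUpper", UpperCaseFunction.class);
Table result = tableEnv.sqlQuery("SELECT MyUpper(name) FROM source_table");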

A TableEnvironment can be created from a StreamExecutionEnvironment or an ExecutionEnvironment by calling the static methods StreamTableEnvironment.create() or BatchTableEnvironment.create(), with an optional TableConfig. The TableConfig can be used to configure the TableEnvironment or to customize the query optimization and translation process (see Query Optimization).

// **********************
// FLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

EnvironmentSettings fsSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build();
StreamExecutionEnvironment fsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings);
// or TableEnvironment fsTableEnv = TableEnvironment.create(fsSettings);

// ******************
// FLINK BATCH QUERY
// ******************
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.BatchTableEnvironment;

ExecutionEnvironment fbEnv = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment fbTableEnv = BatchTableEnvironment.create(fbEnv);

// **********************
// BLINK STREAMING QUERY -- the most common way to create a streaming environment
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
// or TableEnvironment bsTableEnv = TableEnvironment.create(bsSettings);

// ******************
// BLINK BATCH QUERY
// ******************
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

EnvironmentSettings bbSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);

Creating Tables

Temporary Tables

A temporary table is tied to the lifecycle of a single Flink session: it is typically held in memory and exists only for the duration of the Flink session that created it. Temporary tables are not visible to other sessions. They are not bound to any catalog or database, but they can be created within a namespace. A temporary table is not dropped even if its corresponding database is removed.

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// Table resulting from a simple projection query
Table projTable = tableEnv.from("X").select(...);
// register the Table projTable as a temporary view "projectedTable"
tableEnv.createTemporaryView("projectedTable", projTable);

Permanent Tables

Creating a permanent table requires a catalog (such as a Hive Metastore) to maintain the table's metadata. Once created, a permanent table is visible to every Flink session that connects to the catalog, and it persists until it is explicitly dropped.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogExample {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .inStreamingMode()
                .useBlinkPlanner()
                .build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        String name = "myhive";
        String defaultDatabase = "default";
        String hiveConfDir = "/xxx/hive-conf"; // path to the Hive configuration files

        // create the HiveCatalog
        HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
        // register the HiveCatalog
        tableEnv.registerCatalog("myhive", hive);
        // use the HiveCatalog
        tableEnv.useCatalog("myhive");
        // switch to the target database
        tableEnv.useDatabase(defaultDatabase);
        // choose a SQL dialect: HIVE for Hive SQL syntax ...
        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // ... or DEFAULT for Flink SQL syntax (the later call wins, so pick one)
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        // create a table
        tableEnv.executeSql("create table table_name (name string, age int) with('connector' = 'print')");
    }
}
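
Note: to actually run this HiveCatalog example, the Hive connector and Hive itself must also be on the classpath. A sketch of the extra POM entries, assuming Flink 1.13.5 with Scala 2.11 (matching the POM later in this post) and Hive 2.3.x:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hive_2.11</artifactId>
    <version>1.13.5</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.3.6</version>
</dependency>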

Flink Table API (see the official Table API documentation)

// scan registered Orders table
Table orders = tableEnv.from("Orders");
// compute revenue for all customers from France
Table revenue = orders
  .filter($("cCountry").isEqual("FRANCE"))
  .groupBy($("cID"), $("cName"))
  .select($("cID"), $("cName"), $("revenue").sum().as("revSum"));

The same query expressed as a SQL statement:
Table revenue = tableEnv.sqlQuery(
    "SELECT cID, cName, SUM(revenue) AS revSum " +
    "FROM Orders " +
    "WHERE cCountry = 'FRANCE' " +
    "GROUP BY cID, cName"
  );

Emitting Tables

A Table is emitted by writing it to a TableSink. A TableSink is a generic interface that supports a wide variety of file formats (e.g., CSV, Apache Parquet, Apache Avro), storage systems (e.g., JDBC, Apache HBase, Apache Cassandra, Elasticsearch), and messaging systems (e.g., Apache Kafka, RabbitMQ).

A batch Table can only be written to a BatchTableSink, while a streaming Table must be written to an AppendStreamTableSink, a RetractStreamTableSink, or an UpsertStreamTableSink.

The method Table.executeInsert(String tableName) emits the Table to a registered TableSink. The method looks up the TableSink in the catalog by its name and validates that the schema of the Table is identical to the schema of the TableSink.

// get a TableEnvironment
TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// create an output Table
final Schema schema = new Schema()
    .field("a", DataTypes.INT())
    .field("b", DataTypes.STRING())
    .field("c", DataTypes.BIGINT());

tableEnv.connect(new FileSystem().path("/path/to/file"))
    .withFormat(new Csv().fieldDelimiter('|').deriveSchema())
    .withSchema(schema)
    .createTemporaryTable("CsvSinkTable");

// compute a result Table using Table API operators and/or SQL queries
Table result = ...
// emit the result Table to the registered TableSink
result.executeInsert("CsvSinkTable");
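
Note that the connect() descriptor API used above is deprecated as of Flink 1.13; the same sink can be declared with SQL DDL instead. A rough equivalent, sketched with the filesystem connector and the csv format:

tableEnv.executeSql(
    "CREATE TEMPORARY TABLE CsvSinkTable (" +
    "  a INT," +
    "  b STRING," +
    "  c BIGINT" +
    ") WITH (" +
    "  'connector' = 'filesystem'," +
    "  'path' = '/path/to/file'," +
    "  'format' = 'csv'," +
    "  'csv.field-delimiter' = '|'" +
    ")");
result.executeInsert("CsvSinkTable");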

---------------- The following covers conversion between the Table API / SQL and DataStream ----------------

POM dependencies

<!-- Flink core dependencies -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.13.5</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
    <version>1.13.5</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_${scala.binary.version}</artifactId>
    <version>1.13.5</version>
</dependency>

<!-- Flink Table API dependencies -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge_${scala.binary.version}</artifactId>
    <version>1.13.5</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
    <version>1.13.5</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
    <version>1.13.5</version>
</dependency>

<!-- other dependencies -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.79</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.11.0</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.14.1</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.14.1</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.14.1</version>
    <scope>runtime</scope>
</dependency>

Example:

  • FlinkTableApiDemo001.java
package com.ali.flink.demo.driver;

import com.ali.flink.demo.bean.Event;
import com.ali.flink.demo.utils.DataGeneratorImpl003;
import com.ali.flink.demo.utils.FlinkEnv;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.datagen.DataGeneratorSource;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

public class FlinkTableApiDemo001 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = FlinkEnv.FlinkDataStreamRunEnv();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = FlinkEnv.getStreamTableEnv(env);

        DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(new DataGeneratorImpl003());

        DataStream<String> sourceStream = env.addSource(dataGeneratorSource).returns(String.class);

//        sourceStream.print("source");

        DataStream<Event> mapStream = sourceStream.map(new MapFunction<String, Event>() {
            @Override
            public Event map(String s) throws Exception {
                JSONObject jsonObject = JSON.parseObject(s);
                String name = jsonObject.getString("name");
                JSONObject title = jsonObject.getJSONObject("title");
                String title_name = title.getString("title_name");
                int title_number = title.getIntValue("title_number");
                JSONArray user_info = jsonObject.getJSONArray("user_info");
                String address = user_info.getJSONObject(0).getString("address");
                JSONObject time_info = jsonObject.getJSONObject("time_info");
                long timestamp = time_info.getLongValue("timestamp");
                return new Event(name, title.toJSONString(), title_name, title_number, user_info.toJSONString(), address, time_info.toJSONString(), timestamp);
            }
        }).returns(Event.class);

        mapStream.print("map source");

        // convert the DataStream into a Table (renaming name to username)
        Table sourceTable = tableEnv.fromDataStream(mapStream, $("name").as("username"), $("title"), $("title_name"), $("title_number")
                , $("user_info"), $("address"), $("time_info"), $("timestamp"));
        // register the Table as a temporary view
        tableEnv.createTemporaryView("source_table", sourceTable);
        // the SQL statement
        String sql = "select\n" +
                "username\n" +
                ", title\n" +
                ", title_name\n" +
                ", title_number\n" +
                ", user_info\n" +
                ", address\n" +
                ", time_info\n" +
                ", `timestamp`\n" +
                "from source_table";

        // run the SQL query
        Table result = tableEnv.sqlQuery(sql);

        // convert the result Table back into a DataStream and print it
        tableEnv.toDataStream(result).print("out");

        env.execute("job start");
    }
}
  • Event.java
package com.ali.flink.demo.bean;

public class Event {
    // if the fields are private, getters and setters must be added; otherwise Flink fails with: Too many fields referenced from an atomic type
    public String name;
    public String title;
    public String title_name;
    public int title_number;
    public String user_info;
    public String address;
    public String time_info;
    public Long timestamp;

    // a no-arg constructor is required; otherwise Flink fails with: Too many fields referenced from an atomic type
    public Event() {
    }

    public Event(String name, String title, String title_name, int title_number, String user_info, String address, String time_info, Long timestamp) {
        this.name = name;
        this.title = title;
        this.title_name = title_name;
        this.title_number = title_number;
        this.user_info = user_info;
        this.address = address;
        this.time_info = time_info;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "Event{" +
                "name='" + name + '\'' +
                ", title=" + title +
                ", title_name='" + title_name + '\'' +
                ", title_number=" + title_number +
                ", user_info=" + user_info +
                ", address='" + address + '\'' +
                ", time_info=" + time_info +
                ", timestamp='" + timestamp + '\'' +
                '}';
    }
}

Result:

map source> Event{name='Tom3', title={"title_number":3,"title_name":"表情包"}, title_name='表情包', title_number=3, user_info=[{"address":"北京市","city":"beijing"},{"address":"上海市","city":"shanghai"}], address='北京市', time_info={"timestamp":1657332118000}, timestamp='1657332118000'}
out> +I[Tom3, {"title_number":3,"title_name":"表情包"}, 表情包, 3, [{"address":"北京市","city":"beijing"},{"address":"上海市","city":"shanghai"}], 北京市, {"timestamp":1657332118000}, 1657332118000]
map source> Event{name='Tom4', title={"title_number":3,"title_name":"表情包"}, title_name='表情包', title_number=3, user_info=[{"address":"北京市","city":"beijing"},{"address":"上海市","city":"shanghai"}], address='北京市', time_info={"timestamp":1657332118000}, timestamp='1657332118000'}
out> +I[Tom4, {"title_number":3,"title_name":"表情包"}, 表情包, 3, [{"address":"北京市","city":"beijing"},{"address":"上海市","city":"shanghai"}], 北京市, {"timestamp":1657332118000}, 1657332118000]
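
The +I prefix in the output marks each row as an insert. The query above is a plain projection, so toDataStream is sufficient; a query that produces updates (e.g., a GROUP BY aggregation) must be converted with toChangelogStream instead. A minimal sketch:

// an aggregation emits updates (+I, -U, +U), not just inserts
Table agg = tableEnv.sqlQuery(
        "select username, count(*) as cnt from source_table group by username");
// toDataStream would fail for an updating query; toChangelogStream handles it
tableEnv.toChangelogStream(agg).print("agg");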

Error log:

Exception in thread "main" org.apache.flink.table.api.ValidationException: Field reference expression expected.
	at org.apache.flink.table.typeutils.FieldInfoUtils.extractFieldInfoFromAtomicType(FieldInfoUtils.java:487)
	at org.apache.flink.table.typeutils.FieldInfoUtils.extractFieldInformation(FieldInfoUtils.java:296)
	at org.apache.flink.table.typeutils.FieldInfoUtils.getFieldsInfo(FieldInfoUtils.java:260)
	at org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.lambda$asQueryOperation$1(StreamTableEnvironmentImpl.java:596)
	at java.util.Optional.map(Optional.java:215)
	at org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.asQueryOperation(StreamTableEnvironmentImpl.java:593)
	at org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.fromDataStream(StreamTableEnvironmentImpl.java:456)
	at com.ali.flink.demo.driver.FlinkTableApiDemo001.main(FlinkTableApiDemo001.java:50)
  • Fix:
    1) Check that the class has a no-arg constructor; add one if it is missing.
    2) Check whether the fields are private; if so, either make them public or add getter and setter methods.

This article will be continuously updated...
