Flink Table: Connecting to External Systems (Part 8)

springk

Article link: https://blog.csdn.net/springk/article/details/104424679

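Connecting to External Systems
Flink's Table API and SQL programs can be connected to other external systems for reading and writing both batch and streaming tables. A table source provides access to data stored in external systems (such as a database, key-value store, message queue, or file system). A table sink emits a table to an external storage system. Depending on the type of source and sink, they support different formats such as CSV, Parquet, or ORC.
This page describes how to declare built-in table sources and/or table sinks and register them in Flink. After a source or sink has been registered, it can be accessed by Table API & SQL statements.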

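I. Dependencies
The following tables list all available connectors and formats. Their mutual compatibility is tagged in the corresponding sections for table connectors and table formats. The tables provide dependency information both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.

1. Connectors
2. Formats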

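II. Overview
Starting with Flink 1.6, the declaration of a connection to an external system is separated from the actual implementation.
Connections can be specified in two ways:

1. Programmatically, using a Descriptor under org.apache.flink.table.descriptors for the Table & SQL API.
2. Declaratively, via YAML configuration files for the SQL Client.

This not only allows for better unification of the APIs and the SQL Client but also for better extensibility in case of custom implementations, without changing the actual declaration.
Every declaration is similar to a SQL CREATE TABLE statement. The name of the table, the schema of the table, a connector, and a data format for connecting to the external system can be defined upfront.
The connector describes the external system that stores the data of a table. Storage systems such as Apache Kafka or a regular file system can be declared here. The connector might already provide a fixed format with fields and schema.
Some systems support different data formats. For example, a table that is stored in Kafka or in files can encode its rows with CSV, JSON, or Avro. A database connector might need the table schema here. Whether or not a storage system requires the definition of a format is documented for every connector. Different systems also require different types of formats (e.g., column-oriented formats vs. row-oriented formats). The documentation states which format types and connectors are compatible.
The table schema defines the schema of a table that is exposed to SQL queries. It describes how a source maps the data format to the table schema, and how a sink maps the table schema back to the data format. The schema has access to fields defined by the connector or format. It can use one or more fields for extracting or inserting time attributes. If input fields have no deterministic field order, the schema clearly defines column names, their order, and their origin.
The subsequent sections cover each definition part (connector, format, and schema) in more detail. The following example shows how to pass them: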

tableEnvironment
  .connect(...)
  .withFormat(...)
  .withSchema(...)
  .inAppendMode()
  .registerTableSource("MyTable")

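The table's type (source, sink, or both) determines how a table is registered. If the table type is both, a table source and a table sink are registered under the same name. Logically, this means that we can both read from and write to such a table, similarly to a table in a regular DBMS.
For streaming queries, an update mode declares how to communicate between a dynamic table and the storage system for continuous queries.
The following code shows a full example of how to connect to Kafka for reading Avro records.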

tableEnvironment
  // declare the external system to connect to
  .connect(
    new Kafka()
      .version("0.10")
      .topic("test-input")
      .startFromEarliest()
      .property("zookeeper.connect", "localhost:2181")
      .property("bootstrap.servers", "localhost:9092")
  )

  // declare a format for this system
  .withFormat(
    new Avro()
      .avroSchema(
        "{" +
        "  \"namespace\": \"org.myorganization\"," +
        "  \"type\": \"record\"," +
        "  \"name\": \"UserMessage\"," +
        "    \"fields\": [" +
        "      {\"name\": \"timestamp\", \"type\": \"string\"}," +
        "      {\"name\": \"user\", \"type\": \"long\"}," +
        "      {\"name\": \"message\", \"type\": [\"string\", \"null\"]}" +
        "    ]" +
        "}"
      )
  )

  // declare the schema of the table
  .withSchema(
    new Schema()
      .field("rowtime", Types.SQL_TIMESTAMP)
        .rowtime(new Rowtime()
          .timestampsFromField("timestamp")
          .watermarksPeriodicBounded(60000)
        )
      .field("user", Types.LONG)
      .field("message", Types.STRING)
  )

  // specify the update-mode for streaming tables
  .inAppendMode()

  // register as source, sink, or both and under a name
  .registerTableSource("MyUserTable");

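With both approaches (descriptors and YAML), the desired connection properties are converted into normalized, string-based key-value pairs. So-called table factories create configured table sources, table sinks, and corresponding formats from the key-value pairs. All table factories that can be found via Java's Service Provider Interfaces (SPI) are taken into account when searching for exactly one matching table factory.
If no factory can be found, or multiple factories match the given properties, an exception is thrown with additional information about the considered factories and the supported properties.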
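The matching rule described above (exactly one factory must match the normalized properties) can be sketched in plain Java. This is a minimal illustration, not Flink's implementation: the classes SketchFactory and FactoryLookupSketch are hypothetical, and the property keys in the usage note are only examples of string-based keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a table factory: it only declares the property
// values that must be present for it to be considered a match.
final class SketchFactory {
    final String name;
    final Map<String, String> requiredContext;

    SketchFactory(String name, Map<String, String> requiredContext) {
        this.name = name;
        this.requiredContext = requiredContext;
    }

    // A factory matches when every required key is present with the same value.
    boolean matches(Map<String, String> properties) {
        for (Map.Entry<String, String> e : requiredContext.entrySet()) {
            if (!e.getValue().equals(properties.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}

final class FactoryLookupSketch {
    // Returns the single matching factory; fails if none or several match.
    static SketchFactory find(List<SketchFactory> candidates,
                              Map<String, String> properties) {
        List<SketchFactory> matching = new ArrayList<>();
        for (SketchFactory factory : candidates) {
            if (factory.matches(properties)) {
                matching.add(factory);
            }
        }
        if (matching.size() != 1) {
            throw new IllegalStateException(
                "Expected exactly one matching factory but found "
                    + matching.size() + " for properties " + properties);
        }
        return matching.get(0);
    }
}
```

For example, with one candidate requiring connector.type = kafka and another requiring connector.type = filesystem, a property map containing connector.type = kafka selects the first, while an unknown connector type raises the exception.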

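1. Table Schema
The table schema defines the names and types of columns, similar to the column definitions of a SQL CREATE TABLE statement. In addition, one can specify how columns are mapped from and to fields of the format in which the table data is encoded. The origin of a field might be important if the name of the column should differ from the input/output format. For example, a column user_name should reference the field $$-user-name from a JSON format. Additionally, the schema is needed to map types from the external system to Flink's representation. For a table sink, it ensures that only data with a valid schema is written to the external system.
The following example shows a simple schema without time attributes and with a one-to-one field mapping of input/output to table columns.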

.withSchema(
  new Schema()
    .field("MyField1", Types.INT)     // required: specify the fields of the table (in this order)
    .field("MyField2", Types.STRING)
    .field("MyField3", Types.BOOLEAN)
)

For each field, the following properties can be declared in addition to the column's name and type:

.withSchema(
  new Schema()
    .field("MyField1", Types.SQL_TIMESTAMP)
      .proctime()      // optional: declares this field as a processing-time attribute
    .field("MyField2", Types.SQL_TIMESTAMP)
      .rowtime(...)    // optional: declares this field as an event-time attribute
    .field("MyField3", Types.BOOLEAN)
      .from("mf3")     // optional: original field in the input that is referenced/aliased by this field
)

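Time attributes are essential when working with unbounded streaming tables. Therefore, both processing-time and event-time (also known as "rowtime") attributes can be defined as part of the schema.
For more information about time handling in Flink, and especially event time, we recommend the general event-time section.
(1) Rowtime Attributes
In order to control the event-time behavior of tables, Flink provides predefined timestamp extractors and watermark strategies.
The following timestamp extractors are supported: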

// Converts an existing LONG or SQL_TIMESTAMP field in the input into the rowtime attribute.
.rowtime(
  new Rowtime()
    .timestampsFromField("ts_field")    // required: original field name in the input
)

// Converts the assigned timestamps from a DataStream API record into the rowtime attribute
// and thus preserves the assigned timestamps from the source.
// This requires a source that assigns timestamps (e.g., Kafka 0.10+).
.rowtime(
  new Rowtime()
    .timestampsFromSource()
)

// Sets a custom timestamp extractor to be used for the rowtime attribute.
// The extractor must extend `org.apache.flink.table.sources.tsextractors.TimestampExtractor`.
.rowtime(
  new Rowtime()
    .timestampsFromExtractor(...)
)

The following watermark strategies are supported:

// Sets a watermark strategy for ascending rowtime attributes. Emits a watermark of the maximum
// observed timestamp so far minus 1. Rows that have a timestamp equal to the max timestamp
// are not late.
.rowtime(
  new Rowtime()
    .watermarksPeriodicAscending()
)

// Sets a built-in watermark strategy for rowtime attributes which are out-of-order by a bounded time interval.
// Emits watermarks which are the maximum observed timestamp minus the specified delay.
.rowtime(
  new Rowtime()
    .watermarksPeriodicBounded(2000)    // delay in milliseconds
)

// Sets a built-in watermark strategy which indicates the watermarks should be preserved from the
// underlying DataStream API and thus preserves the assigned watermarks from the source.
.rowtime(
  new Rowtime()
    .watermarksFromSource()
)

Make sure to always declare both timestamps and watermarks. Watermarks are required for triggering time-based operations.
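The arithmetic behind the two periodic watermark strategies can be sketched as follows. This is a minimal sketch based only on the comments in the snippets above (maximum observed timestamp minus 1, and maximum observed timestamp minus the delay); the class WatermarkSketch is hypothetical and is not part of Flink.

```java
// Hypothetical helper that tracks the maximum observed timestamp and derives
// the watermark of the ascending and the bounded-delay strategy from it.
final class WatermarkSketch {
    private long maxTimestamp = Long.MIN_VALUE;

    // Record the timestamp (in milliseconds) of a newly observed row.
    void observe(long timestampMillis) {
        maxTimestamp = Math.max(maxTimestamp, timestampMillis);
    }

    // Ascending strategy: maximum observed timestamp minus 1.
    long ascendingWatermark() {
        return maxTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : maxTimestamp - 1;
    }

    // Bounded strategy: maximum observed timestamp minus the given delay.
    long boundedWatermark(long delayMillis) {
        return maxTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : maxTimestamp - delayMillis;
    }

    // A row counts as late when its timestamp is not greater than the watermark.
    static boolean isLate(long timestampMillis, long watermark) {
        return timestampMillis <= watermark;
    }
}
```

With the ascending strategy, a row whose timestamp equals the current maximum is not late, because the watermark stays one millisecond behind that maximum.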
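(2) Type Strings
Because type information is only available in a programming language, the following type strings are supported for definitions in a YAML file: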

VARCHAR
BOOLEAN
TINYINT
SMALLINT
INT
BIGINT
FLOAT
DOUBLE
DECIMAL
DATE
TIME
TIMESTAMP
MAP<fieldtype, fieldtype>        # generic map; e.g. MAP<VARCHAR, INT> that is mapped to Flink's MapTypeInfo
MULTISET<fieldtype>              # multiset; e.g. MULTISET<VARCHAR> that is mapped to Flink's MultisetTypeInfo
PRIMITIVE_ARRAY<fieldtype>       # primitive array; e.g. PRIMITIVE_ARRAY<INT> that is mapped to Flink's PrimitiveArrayTypeInfo
OBJECT_ARRAY<fieldtype>          # object array; e.g. OBJECT_ARRAY<POJO(org.mycompany.MyPojoClass)> that is mapped to
                                 #   Flink's ObjectArrayTypeInfo
ROW<fieldtype, ...>              # unnamed row; e.g. ROW<VARCHAR, INT> that is mapped to Flink's RowTypeInfo
                                 #   with indexed fields names f0, f1, ...
ROW<fieldname fieldtype, ...>    # named row; e.g., ROW<myField VARCHAR, myOtherField INT> that
                                 #   is mapped to Flink's RowTypeInfo
POJO<class>                      # e.g., POJO<org.mycompany.MyPojoClass> that is mapped to Flink's PojoTypeInfo
ANY<class>                       # e.g., ANY<org.mycompany.MyClass> that is mapped to Flink's GenericTypeInfo
ANY<class, serialized>           # used for type information that is not supported by Flink's Table & SQL API

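2. Update Modes
For streaming queries, it is required to declare how to perform the conversion between a dynamic table and an external connector. The update mode specifies which kind of messages should be exchanged with the external system:
Append Mode: In append mode, a dynamic table and an external connector only exchange INSERT messages.
Retract Mode: In retract mode, a dynamic table and an external connector exchange ADD and RETRACT messages. An INSERT change is encoded as an ADD message, a DELETE change as a RETRACT message, and an UPDATE change as a RETRACT message for the updated (previous) row and an ADD message for the updating (new) row. In this mode, a key must not be defined, as opposed to upsert mode. However, every update consists of two messages, which is less efficient.
Upsert Mode: In upsert mode, a dynamic table and an external connector exchange UPSERT and DELETE messages. This mode requires a (possibly composite) unique key by which updates can be propagated. The external connector needs to be aware of the unique key attribute in order to apply messages correctly. INSERT and UPDATE changes are encoded as UPSERT messages, DELETE changes as DELETE messages. The main difference to a retract stream is that UPDATE changes are encoded with a single message and are therefore more efficient.
Attention: The documentation of each connector states which update modes are supported.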

.connect(...)
  .inAppendMode()    // otherwise: inUpsertMode() or inRetractMode()

See also the general streaming concepts documentation for more information.
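To make the difference between retract mode and upsert mode concrete, the following sketch encodes the same sequence of changes in both modes. It is only an illustration of the message types described above: the class UpdateModeSketch, its nested types, and the textual message format are hypothetical and do not correspond to Flink classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these types are hypothetical, not Flink classes.
final class UpdateModeSketch {
    enum ChangeKind { INSERT, UPDATE, DELETE }

    static final class Change {
        final ChangeKind kind;
        final String key;     // unique key of the row (only used in upsert mode)
        final String oldRow;  // row before the change, null for INSERT
        final String newRow;  // row after the change, null for DELETE

        Change(ChangeKind kind, String key, String oldRow, String newRow) {
            this.kind = kind;
            this.key = key;
            this.oldRow = oldRow;
            this.newRow = newRow;
        }
    }

    // Retract mode: an update becomes a RETRACT of the old row plus an ADD of the new row.
    static List<String> encodeRetract(List<Change> changes) {
        List<String> messages = new ArrayList<>();
        for (Change c : changes) {
            switch (c.kind) {
                case INSERT:
                    messages.add("ADD " + c.newRow);
                    break;
                case DELETE:
                    messages.add("RETRACT " + c.oldRow);
                    break;
                case UPDATE:
                    messages.add("RETRACT " + c.oldRow);
                    messages.add("ADD " + c.newRow);
                    break;
            }
        }
        return messages;
    }

    // Upsert mode: inserts and updates become a single UPSERT message per key.
    static List<String> encodeUpsert(List<Change> changes) {
        List<String> messages = new ArrayList<>();
        for (Change c : changes) {
            if (c.kind == ChangeKind.DELETE) {
                messages.add("DELETE " + c.key);
            } else {
                messages.add("UPSERT " + c.key + " -> " + c.newRow);
            }
        }
        return messages;
    }
}
```

For an insert, an update, and a delete of the same row, the retract encoding produces four messages while the upsert encoding produces three, which is the efficiency difference mentioned above.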

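3. Table Connectors
Flink provides a set of connectors for connecting to external systems.
Please note that not all connectors are available in both batch and streaming yet. Furthermore, not every streaming connector supports every streaming mode. Therefore, each connector is tagged accordingly. A format tag indicates that the connector requires a certain type of format.
(1) File System Connector
Source: Batch | Source: Streaming Append Mode | Sink: Batch | Sink: Streaming Append Mode | Format: CSV-only
The file system connector allows for reading from and writing to a local or distributed file system. A file system can be defined as: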

 .connect(
  new FileSystem()
    .path("file:///path/to/whatever")    // required: path to a file or directory
)

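The file system connector itself is included in Flink and does not require an additional dependency. A corresponding format needs to be specified for reading and writing rows from and to a file system.

Attention: Make sure to include the Flink file-system-specific dependencies.

Attention: File system sources and sinks for streaming are only experimental. In the future, actual streaming use cases will be supported, i.e., directory monitoring and bucket output.

(2) Kafka Connector
Source: Streaming Append Mode | Sink: Streaming Append Mode | Format: Serialization Schema | Format: Deserialization Schema
The Kafka connector allows for reading from and writing to an Apache Kafka topic. It can be defined as follows: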

.connect(
  new Kafka()
    .version("0.11")    // required: valid connector versions are
                        //   "0.8", "0.9", "0.10", "0.11", and "universal"
    .topic("...")       // required: topic name from which the table is read

    // optional: connector specific properties
    .property("zookeeper.connect", "localhost:2181")
    .property("bootstrap.servers", "localhost:9092")
    .property("group.id", "testGroup")

    // optional: select a startup mode for Kafka offsets (choose one of the following)
    .startFromEarliest()
    .startFromLatest()
    .startFromSpecificOffsets(...)

    // optional: output partitioning from Flink's partitions into Kafka's partitions (choose one of the following)
    .sinkPartitionerFixed()         // each Flink partition ends up in at-most one Kafka partition (default)
    .sinkPartitionerRoundRobin()    // a Flink partition is distributed to Kafka partitions round-robin
    .sinkPartitionerCustom(MyCustom.class)    // use a custom FlinkKafkaPartitioner subclass
)

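Specify the start reading position: By default, the Kafka source starts reading data from the committed group offsets in Zookeeper or the Kafka brokers. You can specify other start positions, which correspond to the configurations in the section Kafka Consumers Start Position Configuration.
Flink-Kafka sink partitioning: By default, a Kafka sink writes to at most as many partitions as its own parallelism (each parallel instance of the sink writes to exactly one partition). In order to distribute the writes to more partitions or to control the routing of rows into partitions, a custom sink partitioner can be provided. The round-robin partitioner is useful to avoid unbalanced partitioning. However, it causes a lot of network connections between all the Flink instances and all the Kafka brokers.
Consistency guarantees: By default, a Kafka sink ingests data with at-least-once guarantees into a Kafka topic if the query is executed with checkpointing enabled.
Kafka 0.10+ timestamps: Since Kafka 0.10, Kafka messages carry a timestamp as metadata that specifies when the record was written into the Kafka topic. These timestamps can be used for a rowtime attribute by selecting timestamps: from-source in YAML and timestampsFromSource() in Java/Scala, respectively.
Kafka 0.11+ versioning: Since Flink 1.7, the Kafka connector definition should be independent of a hard-coded Kafka version. Use the connector version universal as a wildcard for Flink's Kafka connector that is compatible with all Kafka versions starting from 0.11.
Make sure to add the version-specific Kafka dependency. In addition, a corresponding format needs to be specified for reading and writing rows from and to Kafka.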
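The trade-off between the fixed and the round-robin sink partitioner can be illustrated with a small sketch. The class SinkPartitionerSketch is hypothetical, and the modulo assignment used for the fixed case is an assumption made for illustration; the text above only states that each parallel sink instance ends up in at most one Kafka partition.

```java
// Hypothetical sketch of the two built-in partitioning ideas; not Flink code.
final class SinkPartitionerSketch {
    private int next = 0;

    // Fixed: a parallel sink instance always writes to the same partition
    // (assumed here to be the instance index modulo the partition count).
    static int fixedPartition(int sinkSubtaskIndex, int numKafkaPartitions) {
        return sinkSubtaskIndex % numKafkaPartitions;
    }

    // Round-robin: successive records of one sink instance rotate over all
    // partitions, so every instance talks to every partition over time.
    int roundRobinPartition(int numKafkaPartitions) {
        int partition = next % numKafkaPartitions;
        next = (next + 1) % numKafkaPartitions;
        return partition;
    }
}
```

Under this assumption, a sink with parallelism 2 writing to a topic with 4 partitions only ever reaches partitions 0 and 1 with the fixed strategy, whereas round-robin spreads rows over all 4 at the cost of more connections.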

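(3) Elasticsearch Connector
Sink: Streaming Append Mode | Sink: Streaming Upsert Mode | Format: JSON-only
The Elasticsearch connector allows for writing into an index of the Elasticsearch search engine.
The connector can operate in upsert mode for exchanging UPSERT/DELETE messages with the external system using a key defined by the query.
For append-only queries, the connector can also operate in append mode for exchanging only INSERT messages with the external system. If no key is defined by the query, a key is automatically generated by Elasticsearch.
The connector can be defined as follows: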

.connect(
  new Elasticsearch()
    .version("6")                      // required: valid connector versions are "6"
    .host("localhost", 9200, "http")   // required: one or more Elasticsearch hosts to connect to
    .index("MyUsers")                  // required: Elasticsearch index
    .documentType("user")              // required: Elasticsearch document type

    .keyDelimiter("$")        // optional: delimiter for composite keys ("_" by default)
                              //   e.g., "$" would result in IDs "KEY1$KEY2$KEY3"
    .keyNullLiteral("n/a")    // optional: representation for null fields in keys ("null" by default)

    // optional: failure handling strategy in case a request to Elasticsearch fails (fail by default)
    .failureHandlerFail()          // optional: throws an exception if a request fails and causes a job failure
    .failureHandlerIgnore()        //   or ignores failures and drops the request
    .failureHandlerRetryRejected() //   or re-adds requests that have failed due to queue capacity saturation
    .failureHandlerCustom(...)     //   or custom failure handling with a ActionRequestFailureHandler subclass

    // optional: configure how to buffer elements before sending them in bulk to the cluster for efficiency
    .disableFlushOnCheckpoint()    // optional: disables flushing on checkpoint (see notes below!)
    .bulkFlushMaxActions(42)       // optional: maximum number of actions to buffer for each bulk request
    .bulkFlushMaxSize("42 mb")     // optional: maximum size of buffered actions in bytes per bulk request
                                   //   (only MB granularity is supported)
    .bulkFlushInterval(60000L)     // optional: bulk flush interval (in milliseconds)

    .bulkFlushBackoffConstant()    // optional: use a constant backoff type
    .bulkFlushBackoffExponential() //   or use an exponential backoff type
    .bulkFlushBackoffMaxRetries(3) // optional: maximum number of retries
    .bulkFlushBackoffDelay(30000L) // optional: delay between each backoff attempt (in milliseconds)

    // optional: connection properties to be used during REST communication to Elasticsearch
    .connectionMaxRetryTimeout(3)  // optional: maximum timeout (in milliseconds) between retries
    .connectionPathPrefix("/v1")   // optional: prefix string to be added to every REST communication
)

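Bulk flushing: For more information about the characteristics of the optional flushing parameters, see the corresponding low-level documentation.
Disabling flushing on checkpoint: When disabled, a sink does not wait for all pending action requests to be acknowledged by Elasticsearch on checkpoints. Thus, a sink does NOT provide any strong guarantees for at-least-once delivery of action requests.
Key extraction: Flink automatically extracts valid keys from a query. For example, the query SELECT a, b, c FROM t GROUP BY a, b defines a composite key of the fields a and b. The Elasticsearch connector generates a document ID string for every row by concatenating all key fields, in the order defined in the query, using a key delimiter. A custom representation of null literals for key fields can be defined.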
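The document ID construction described above (key fields joined by the key delimiter, with a configurable literal for null fields) can be sketched as follows. The class DocumentIdSketch and its method are hypothetical names used for illustration; only the behavior of keyDelimiter and keyNullLiteral is taken from the connector options shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating how a composite key could be turned into
// a document ID string; not the connector's actual implementation.
final class DocumentIdSketch {
    static String documentId(List<Object> keyValues,
                             String keyDelimiter,
                             String keyNullLiteral) {
        List<String> parts = new ArrayList<>();
        for (Object value : keyValues) {
            // Null key fields are replaced by the configured null literal.
            parts.add(value == null ? keyNullLiteral : value.toString());
        }
        // Key fields are concatenated in the order given by the query.
        return String.join(keyDelimiter, parts);
    }
}
```

With the delimiter "$" and the key values KEY1, KEY2, and KEY3, the sketch yields the ID "KEY1$KEY2$KEY3", matching the example in the option comments above.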

Attention: A JSON format defines how to encode documents for the external system; therefore, it must be added as a dependency.

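(4) HBase Connector
Source: Batch | Sink: Batch | Sink: Streaming Append Mode | Sink: Streaming Upsert Mode | Temporal Join: Sync Mode
The HBase connector allows for reading from and writing to an HBase cluster.
The connector can operate in upsert mode for exchanging UPSERT/DELETE messages with the external system using a key defined by the query.
For append-only queries, the connector can also operate in append mode for exchanging only INSERT messages with the external system.
The connector can be defined as follows (DDL):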

CREATE TABLE MyUserTable (
  hbase_rowkey_name rowkey_type,
  hbase_column_family_name1 ROW<...>,
  hbase_column_family_name2 ROW<...>
) WITH (
  'connector.type' = 'hbase', -- required: specify this table type is hbase
  
  'connector.version' = '1.4.3',          -- required: valid connector versions are "1.4.3"
  
  'connector.table-name' = 'hbase_table_name',  -- required: hbase table name
  
  'connector.zookeeper.quorum' = 'localhost:2181', -- required: HBase Zookeeper quorum configuration
  'connector.zookeeper.znode.parent' = '/test',    -- optional: the root dir in Zookeeper for HBase cluster.
                                                   -- The default value is "/hbase".

  'connector.write.buffer-flush.max-size' = '10mb', -- optional: writing option, the maximum size in memory of buffered
                                                    -- rows to insert per round trip. This can help performance when
                                                    -- writing to HBase. The default value is "2mb".

  'connector.write.buffer-flush.max-rows' = '1000', -- optional: writing option, the maximum number of rows to insert per round trip.
                                                    -- This can help performance when writing to HBase. No default value,
                                                    -- i.e. by default flushing does not depend on the number of buffered rows.

  'connector.write.buffer-flush.interval' = '2s'    -- optional: writing option, the interval at which buffered requests are
                                                    -- flushed. The default value is "0s", which means
                                                    -- no asynchronous flush thread will be scheduled.
)

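Columns: All column families of an HBase table must be declared as ROW type; the field name maps to the column family name, and the nested field names map to the column qualifier names. There is no need to declare all families and qualifiers in the schema; users can declare only what is necessary. Apart from the ROW type fields, the single field of an atomic type (e.g., STRING, BIGINT) is recognized as the row key of the table. There are no constraints on the name of the row key field.
Temporal join: Lookup joins against HBase do not use any caching; data is always queried directly through the HBase client.
Java/Scala/Python API: The Java/Scala/Python APIs are not supported yet.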
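(5) JDBC Connector
Source: Batch | Sink: Batch | Sink: Streaming Append Mode | Sink: Streaming Upsert Mode | Temporal Join: Sync Mode
The JDBC connector allows for reading from and writing into a JDBC client.
The connector can operate in upsert mode for exchanging UPSERT/DELETE messages with the external system using a key defined by the query.
For append-only queries, the connector can also operate in append mode for exchanging only INSERT messages with the external system.
To use the JDBC connector, you need to choose an actual driver to use. The currently supported drivers are listed in the following table:
[Table: supported JDBC drivers (image not reproduced)]
The connector can be defined as follows (DDL):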

CREATE TABLE MyUserTable (
  ...
) WITH (
  'connector.type' = 'jdbc', -- required: specify this table type is jdbc
  
  'connector.url' = 'jdbc:mysql://localhost:3306/flink-test', -- required: JDBC DB url
  
  'connector.table' = 'jdbc_table_name',  -- required: jdbc table name
  
  'connector.driver' = 'com.mysql.jdbc.Driver', -- optional: the class name of the JDBC driver to use to connect to this URL. 
                                                -- If not set, it will automatically be derived from the URL.

  'connector.username' = 'name', -- optional: jdbc user name and password
  'connector.password' = 'password',
  
  -- scan options, optional, used when reading from table

  -- These options must all be specified if any of them is specified. In addition, partition.num must be specified. They
  -- describe how to partition the table when reading in parallel from multiple tasks. partition.column must be a numeric,
  -- date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide
  -- the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned.
  -- This option applies only to reading.
  'connector.read.partition.column' = 'column_name', -- optional, name of the column used for partitioning the input.
  'connector.read.partition.num' = '50', -- optional, the number of partitions.
  'connector.read.partition.lower-bound' = '500', -- optional, the smallest value of the first partition.
  'connector.read.partition.upper-bound' = '1000', -- optional, the largest value of the last partition.
  
  'connector.read.fetch-size' = '100', -- optional, Gives the reader a hint as to the number of rows that should be fetched
                                       -- from the database when reading per round trip. If the value specified is zero, then
                                       -- the hint is ignored. The default value is zero.

  -- lookup options, optional, used in temporal joins
  'connector.lookup.cache.max-rows' = '5000', -- optional, max number of rows of lookup cache, over this value, the oldest rows will
                                              -- be eliminated. "cache.max-rows" and "cache.ttl" options must all be specified if any
                                              -- of them is specified. Cache is not enabled as default.
  'connector.lookup.cache.ttl' = '10s', -- optional, the max time to live for each rows in lookup cache, over this time, the oldest rows
                                        -- will be expired. "cache.max-rows" and "cache.ttl" options must all be specified if any of
                                        -- them is specified. Cache is not enabled as default.
  'connector.lookup.max-retries' = '3', -- optional, max retry times if lookup database failed

  -- sink options, optional, used when writing into table
  'connector.write.flush.max-rows' = '5000', -- optional, flush max size (includes all append, upsert and delete records), 
                                             -- over this number of records, will flush data. The default value is "5000".
  'connector.write.flush.interval' = '2s', -- optional, flush interval mills, over this time, asynchronous threads will flush data.
                                           -- The default value is "0s", which means no asynchronous flush thread will be scheduled. 
  'connector.write.max-retries' = '3' -- optional, max retry times if writing records to database failed
)

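Upsert sink: Flink automatically extracts valid keys from a query. For example, the query SELECT a, b, c FROM t GROUP BY a, b defines a composite key of the fields a and b. If a JDBC table is used as an upsert sink, make sure that the key of the query is one of the unique key sets or the primary key of the underlying database. This guarantees that the output result is as expected.
Temporal join: The JDBC connector can be used in a temporal join as a lookup source. Currently, only the sync lookup mode is supported. The lookup cache options (connector.lookup.cache.max-rows and connector.lookup.cache.ttl) must all be specified if any of them is specified. The lookup cache improves the performance of temporal joins against the JDBC connector by querying the cache first instead of sending all requests to the remote database. However, a value returned from the cache might not be the latest one, so this is a trade-off between throughput and correctness.
Writing: By default, connector.write.flush.interval is 0s and connector.write.flush.max-rows is 5000, which means that for low-traffic queries, the buffered output rows may not be flushed to the database for a long time. It is therefore recommended to set the interval option.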
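The throughput versus correctness trade-off of the lookup cache can be made concrete with a small sketch of a cache that is bounded by a maximum number of rows and a time-to-live. This is a hedged illustration of the two options connector.lookup.cache.max-rows and connector.lookup.cache.ttl only; the class LookupCacheSketch is hypothetical, the clock is passed in explicitly to keep the example deterministic, and the real connector may be implemented differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup cache bounded by row count and time-to-live.
final class LookupCacheSketch {
    private static final class CachedRow {
        final String row;
        final long loadedAtMillis;

        CachedRow(String row, long loadedAtMillis) {
            this.row = row;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final long ttlMillis;
    private final LinkedHashMap<String, CachedRow> cache;

    LookupCacheSketch(final int maxRows, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // Insertion-ordered map that drops its oldest entry once maxRows is exceeded.
        this.cache = new LinkedHashMap<String, CachedRow>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedRow> eldest) {
                return size() > maxRows;
            }
        };
    }

    // Stores a row that was just fetched from the database.
    void put(String key, String row, long nowMillis) {
        cache.remove(key); // re-inserting makes the entry the newest one
        cache.put(key, new CachedRow(row, nowMillis));
    }

    // Returns the cached row, or null if it is absent or older than the TTL.
    // A null result means the caller has to query the database.
    String get(String key, long nowMillis) {
        CachedRow cached = cache.get(key);
        if (cached == null) {
            return null;
        }
        if (nowMillis - cached.loadedAtMillis >= ttlMillis) {
            cache.remove(key);
            return null;
        }
        return cached.row;
    }

    int size() {
        return cache.size();
    }
}
```

Until an entry expires or is evicted, lookups are answered from memory, which is where both the speed-up and the possible staleness come from.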

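4. Table Formats
Flink provides a set of table formats that can be used with table connectors.
A format tag indicates the format type for matching with a connector.
(1) CSV Format
Format: Serialization Schema | Format: Deserialization Schema
The CSV format aims to comply with RFC-4180 ("Common Format and MIME Type for Comma-Separated Values (CSV) Files") proposed by the Internet Engineering Task Force (IETF).
The format allows for reading and writing CSV data that corresponds to a given format schema. The format schema can be defined either as a Flink type or derived from the desired table schema.
If the format schema is equal to the table schema, the schema can also be derived automatically. This allows for defining schema information only once. The names, types, and field order of the format are determined by the table's schema. Time attributes are ignored if their origin is not a field. A from definition in the table schema is interpreted as a field renaming in the format.
The CSV format can be used as follows: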

.withFormat(
  new Csv()

    // required: define the schema either by using type information
    .schema(Types.ROW(...))

    // or use the table's schema
    .deriveSchema()

    .fieldDelimiter(';')         // optional: field delimiter character (',' by default)
    .lineDelimiter("\r\n")       // optional: line delimiter ("\n" by default;
                                 //   otherwise "\r" or "\r\n" are allowed)
    .quoteCharacter('\'')        // optional: quote character for enclosing field values ('"' by default)
    .allowComments()             // optional: ignores comment lines that start with '#' (disabled by default);
                                 //   if enabled, make sure to also ignore parse errors to allow empty rows
    .ignoreParseErrors()         // optional: skip fields and rows with parse errors instead of failing;
                                 //   fields are set to null in case of errors
    .arrayElementDelimiter("|")  // optional: the array element delimiter string for separating
                                 //   array and row element values (";" by default)
    .escapeCharacter('\\')       // optional: escape character for escaping values (disabled by default)
    .nullLiteral("n/a")          // optional: null literal string that is interpreted as a
                                 //   null value (disabled by default)
)

The following list shows the supported types that can be read and written:

ROW
VARCHAR
ARRAY[_]
INT
BIGINT
FLOAT
DOUBLE
BOOLEAN
DATE
TIME
TIMESTAMP
DECIMAL
NULL (unsupported yet)

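Numeric types: The value should be a number, but the literal "null" is also understood. An empty string is considered null. Values are trimmed (leading/trailing white space). Numbers are parsed using Java's valueOf semantics. Other non-numeric strings may cause a parsing exception.
String and time types: The value is not trimmed. The literal "null" is also understood. Time types must be formatted according to the Java SQL time format with millisecond precision. For example: 2018-01-01 for date, 20:43:59 for time, and 2018-01-01 20:43:59.999 for timestamp.
Boolean type: The value is expected to be a boolean ("true", "false") string or "null". Empty strings are interpreted as false. Values are trimmed (leading/trailing white space). Other values result in an exception.
Nested types: Array and row types are supported for one level of nesting using the array element delimiter.
Primitive byte arrays: Primitive byte arrays are handled in Base64-encoded representation.
Line endings: Line endings need to be considered even for row-based connectors (such as Kafka), where they are ignored for unquoted string fields at the end of a row.
Escaping and quoting: The following table shows examples of how escaping and quoting affect the parsing of a string, using * for escaping and ' for quoting:
[Table: escaping and quoting examples (image not reproduced)]
Make sure to add the CSV format as a dependency.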
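The per-type parsing rules listed above for numeric, string, and boolean values can be summarized in a small sketch. The class CsvValueSketch is hypothetical and only mirrors the rules as stated in this section; it is not the parser used by the CSV format, and treating "true", "false", and "null" case-sensitively is an assumption of the sketch.

```java
// Hypothetical helpers mirroring the CSV value rules described above.
final class CsvValueSketch {
    // Boolean: trimmed; "null" becomes null; an empty string becomes false;
    // anything other than "true"/"false" results in an exception.
    static Boolean parseBoolean(String raw) {
        String value = raw.trim();
        if (value.equals("null")) {
            return null;
        }
        if (value.isEmpty() || value.equals("false")) {
            return Boolean.FALSE;
        }
        if (value.equals("true")) {
            return Boolean.TRUE;
        }
        throw new IllegalArgumentException("Not a boolean value: " + raw);
    }

    // Numeric (BIGINT as an example): trimmed; "null" and the empty string
    // become null; everything else is parsed with valueOf semantics.
    static Long parseBigInt(String raw) {
        String value = raw.trim();
        if (value.isEmpty() || value.equals("null")) {
            return null;
        }
        return Long.valueOf(value); // NumberFormatException for non-numeric text
    }

    // String: not trimmed; only the literal "null" becomes null.
    static String parseString(String raw) {
        return raw.equals("null") ? null : raw;
    }
}
```

Note the asymmetry the rules imply: an empty field is null for a numeric column but false for a boolean column, and surrounding white space is preserved only for strings.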

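(2) JSON Format
Format: Serialization Schema | Format: Deserialization Schema
The JSON format allows for reading and writing JSON data that corresponds to a given format schema. The format schema can be defined either as a Flink type, as a JSON schema, or derived from the desired table schema. A Flink type enables a more SQL-like definition and mapping to the corresponding SQL data types. The JSON schema allows for more complex and nested structures.
If the format schema is equal to the table schema, the schema can also be derived automatically. This allows for defining schema information only once. The names, types, and field order of the format are determined by the table's schema. Time attributes are ignored if their origin is not a field. A from definition in the table schema is interpreted as a field renaming in the format.
The JSON format can be used as follows: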

.withFormat(
  new Json()
    .failOnMissingField(true)   // optional: flag whether to fail if a field is missing or not, false by default

    // required: define the schema either by using type information which parses numbers to corresponding types
    .schema(Types.ROW(...))

    // or by using a JSON schema which parses to DECIMAL and TIMESTAMP
    .jsonSchema(
      "{" +
      "  type: 'object'," +
      "  properties: {" +
      "    lon: {" +
      "      type: 'number'" +
      "    }," +
      "    rideTime: {" +
      "      type: 'string'," +
      "      format: 'date-time'" +
      "    }" +
      "  }" +
      "}"
    )

    // or use the table's schema
    .deriveSchema()
)

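The following table shows the mapping of JSON schema types to Flink SQL types:
[Table: mapping of JSON schema types to Flink SQL types (image not reproduced)]
Currently, Flink supports only a subset of the JSON schema specification draft-07. Union types (as well as allOf, anyOf, and not) are not supported yet. oneOf and arrays of types are only supported for specifying nullability.
Simple references that link to a common definition in the document are supported, as shown in the more complex example below: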

{
  "definitions": {
    "address": {
      "type": "object",
      "properties": {
        "street_address": {
          "type": "string"
        },
        "city": {
          "type": "string"
        },
        "state": {
          "type": "string"
        }
      },
      "required": [
        "street_address",
        "city",
        "state"
      ]
    }
  },
  "type": "object",
  "properties": {
    "billing_address": {
      "$ref": "#/definitions/address"
    },
    "shipping_address": {
      "$ref": "#/definitions/address"
    },
    "optional_address": {
      "oneOf": [
        {
          "type": "null"
        },
        {
          "$ref": "#/definitions/address"
        }
      ]
    }
  }
}

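Missing field handling: By default, a missing JSON field is set to null. You can enable strict JSON parsing, which cancels the source (and the query) if a field is missing.
Make sure to add the JSON format as a dependency.

(3) Apache Avro Format
Format: Serialization Schema | Format: Deserialization Schema
The Apache Avro format allows for reading and writing Avro data that corresponds to a given format schema. The format schema can be defined either as the fully qualified class name of an Avro specific record or as an Avro schema string. If a class name is used, the class must be available on the classpath at runtime.
The Avro format can be used as follows: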

.withFormat(
  new Avro()

    // required: define the schema either by using an Avro specific record class
    .recordClass(User.class)

    // or by using an Avro schema
    .avroSchema(
      "{" +
      "  \"type\": \"record\"," +
      "  \"name\": \"test\"," +
      "  \"fields\" : [" +
      "    {\"name\": \"a\", \"type\": \"long\"}," +
      "    {\"name\": \"b\", \"type\": \"string\"}" +
      "  ]" +
      "}"
    )
)

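Avro types are mapped to the corresponding SQL data types. Union types are only supported for specifying nullability; otherwise they are converted to an ANY type. The following table shows the mapping:
[Table: mapping of Avro types to Flink SQL types (image not reproduced)]
Avro uses Joda-Time for representing logical date and time types in specific record classes. The Joda-Time dependency is not part of Flink's distribution. Therefore, make sure that Joda-Time is on the classpath together with your specific record class at runtime. Avro formats specified via a schema string do not require Joda-Time.
Make sure to add the Apache Avro dependency.
(4) Old CSV Format

Attention: For prototyping purposes only!

The old CSV format allows for reading and writing comma-separated rows using the file system connector.
This format describes Flink's non-standard CSV table source/sink. In the future, the format will be replaced by a proper RFC-compliant version. Use the RFC-compliant CSV format when writing to Kafka. For now, use the old format for stream/batch file system operations.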

.withFormat(
  new OldCsv()
    .field("field1", Types.STRING)    // required: ordered format fields
    .field("field2", Types.SQL_TIMESTAMP)
    .fieldDelimiter(",")              // optional: string delimiter "," by default
    .lineDelimiter("\n")              // optional: string delimiter "\n" by default
    .quoteCharacter('"')              // optional: single character for string values, empty by default
    .commentPrefix("#")               // optional: string to indicate comments, empty by default
    .ignoreFirstLine()                // optional: ignore the first line, by default it is not skipped
    .ignoreParseErrors()              // optional: skip records with parse error instead of failing by default
)

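The old CSV format is included in Flink and does not require additional dependencies.
Attention: The old CSV format for writing rows is limited at the moment. Only a custom field delimiter is supported as an optional parameter.

5. Further TableSources and TableSinks
The following table sources and sinks have not yet been migrated (or have not been migrated entirely) to the new unified interfaces.
The additional TableSource provided with Flink is the OrcTableSource.

The additional TableSinks provided with Flink are the CsvTableSink, the JDBCAppendTableSink, and the CassandraAppendTableSink.

(1) OrcTableSource
The OrcTableSource reads ORC files. ORC is a file format for structured data that stores the data in a compressed, columnar representation. ORC is very storage efficient and supports projection and filter push-down.
An OrcTableSource is created as shown below: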

// create Hadoop Configuration
Configuration config = new Configuration();

OrcTableSource orcTableSource = OrcTableSource.builder()
  // path to ORC file(s). NOTE: By default, directories are recursively scanned.
  .path("file:///path/to/data")
  // schema of ORC files
  .forOrcSchema("struct<name:string,addresses:array<struct<street:string,zip:smallint>>>")
  // Hadoop configuration
  .withConfiguration(config)
  // build OrcTableSource
  .build();

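Note: The OrcTableSource does not support ORC's Union type yet.

(2) CsvTableSink
The CsvTableSink emits a Table to one or more CSV files.
The sink only supports append-only streaming tables. It cannot be used to emit a Table that is continuously updated. See the documentation on Table to Stream conversions for details. When emitting a streaming table, rows are written at least once (if checkpointing is enabled), and the CsvTableSink does not split output files into bucket files but continuously writes to the same files.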

CsvTableSink sink = new CsvTableSink(
    path,                  // output path
    "|",                   // optional: delimit files by '|'
    1,                     // optional: write to a single file
    WriteMode.OVERWRITE);  // optional: override existing files

tableEnv.registerTableSink(
  "csvOutputTable",
  // specify table schema
  new String[]{"f0", "f1"},
  new TypeInformation[]{Types.STRING, Types.INT},
  sink);

Table table = ...
table.insertInto("csvOutputTable");

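(3) JDBCAppendTableSink
The JDBCAppendTableSink emits a Table to a JDBC connection. The sink only supports append-only streaming tables. It cannot be used to emit a Table that is continuously updated. See the documentation on Table to Stream conversions for details.
The JDBCAppendTableSink inserts each Table row at least once into the database table (if checkpointing is enabled). However, you can specify the insertion query using REPLACE or INSERT OVERWRITE to perform upsert writes to the database.
To use the JDBC sink, you have to add the JDBC connector dependency (flink-jdbc) to your project. Then you can create the sink using JDBCAppendSinkBuilder: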

JDBCAppendTableSink sink = JDBCAppendTableSink.builder()
  .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
  .setDBUrl("jdbc:derby:memory:ebookshop")
  .setQuery("INSERT INTO books (id) VALUES (?)")
  .setParameterTypes(INT_TYPE_INFO)
  .build();

tableEnv.registerTableSink(
  "jdbcOutputTable",
  // specify table schema
  new String[]{"id"},
  new TypeInformation[]{Types.INT},
  sink);

Table table = ...
table.insertInto("jdbcOutputTable");

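Similar to using JDBCOutputFormat, you have to explicitly specify the name of the JDBC driver, the JDBC URL, the query to be executed, and the field types of the JDBC table.

(4) CassandraAppendTableSink
The CassandraAppendTableSink emits a Table to a Cassandra table. The sink only supports append-only streaming tables. It cannot be used to emit a Table that is continuously updated. See the documentation on Table to Stream conversions for details.
The CassandraAppendTableSink inserts all rows at least once into the Cassandra table if checkpointing is enabled. However, you can specify the query as an upsert query.
To use the CassandraAppendTableSink, you have to add the Cassandra connector dependency (flink-connector-cassandra) to your project. The example below shows how to use the CassandraAppendTableSink.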

ClusterBuilder builder = ... // configure Cassandra cluster connection

CassandraAppendTableSink sink = new CassandraAppendTableSink(
  builder,
  // the query must match the schema of the table
  "INSERT INTO flink.myTable (id, name, value) VALUES (?, ?, ?)");

tableEnv.registerTableSink(
  "cassandraOutputTable",
  // specify table schema
  new String[]{"id", "name", "value"},
  new TypeInformation[]{Types.INT, Types.STRING, Types.DOUBLE},
  sink);

Table table = ...
table.insertInto("cassandraOutputTable");
