Integrating Flink SQL with Hive (flink-1.13.6)

宝哥大数据

Last modified: 2022-11-09 21:44:52

First published: 2022-11-07 23:15:47

Article link: https://blog.csdn.net/wuxintdrh/article/details/127741170

Contents

1. Overview
2. Hive Dialect
- 2.1 Using the Hive Dialect
- 2.2 Example
3. Hive Read & Write
4. Hive Functions

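1. Overview

Apache Hive has become a core part of the data warehouse ecosystem. It is not only a SQL engine for big data analytics and ETL, but also a data management platform on which data can be discovered, defined, and evolved.

Flink integrates with Hive at two levels.

First, it uses the Hive Metastore as a persistent catalog: with HiveCatalog, users can store Flink metadata in the Hive Metastore so that it survives across sessions. For example, a user can store the definitions of Kafka or Elasticsearch tables in the Hive Metastore through HiveCatalog and reuse them later in SQL queries.
Second, Flink can read and write Hive tables.

HiveCatalog is designed to be compatible with Hive "out of the box", so users can access an existing Hive data warehouse directly. There is no need to modify the existing Hive Metastore or to change the data location or partitioning of tables.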

Configure flink-conf.yaml

classloader.check-leaked-classloader: false

Create the HiveCatalog

CREATE CATALOG myhive WITH (
    'type' = 'hive',
    'default-database' = 'mydatabase',
    'hive-conf-dir' = '/opt/hive-conf'
);
-- set the HiveCatalog as the current catalog of the session
USE CATALOG myhive;

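Note: Flink 1.13 removed the sql-client-defaults.yaml configuration file, so catalogs can no longer be configured there. Put the corresponding statements in an initialization SQL file such as sql-init.sql instead, and pass it to the SQL Client with the -i option.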

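2. Hive Dialect

Since 1.11.0, Flink lets users write SQL statements in Hive syntax when the Hive dialect is enabled. By providing compatibility with Hive syntax, Flink aims to improve interoperability with Hive and to reduce the need to switch between Flink and Hive to run different statements.

2.1 Using the Hive Dialect

Flink currently supports two SQL dialects: default and hive. You must switch to the Hive dialect before writing statements in Hive syntax.

The dialect can be set after the SQL Client has started.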


Flink SQL> set table.sql-dialect=hive; -- to use hive dialect
[INFO] Session property has been set.

Flink SQL> set table.sql-dialect=default; -- to use default dialect
[INFO] Session property has been set.

2.2 Example

Flink SQL> use catalog myhive;
[INFO] Execute statement succeed.
Flink SQL> load module hive;
[INFO] Execute statement succeed.
Flink SQL> use modules hive,core;
[INFO] Execute statement succeed.
Flink SQL> set execution.type=batch;
[WARNING] The specified key 'execution.type' is deprecated. Please use 'execution.runtime-mode' instead.
[INFO] Session property has been set.
Flink SQL> SET sql-client.execution.result-mode=TABLEAU;
[INFO] Session property has been set.

Flink SQL> set table.sql-dialect=hive;          -- switch to the Hive dialect
[INFO] Session property has been set.

Flink SQL> select explode(array(1,2,3));       -- call a Hive function
Hive Session ID = f01725e8-895d-430c-957a-367729e466ca
+-----+
| col |
+-----+
|   1 |
|   2 |
|   3 |
+-----+
3 rows in set
Flink SQL> drop table if exists tbl;
Hive Session ID = 0731f471-d579-45f1-8eb9-e38a8d68f8e8
[INFO] Execute statement succeed.
Flink SQL> create table tbl (key int,value string);
Hive Session ID = bcd020ed-55aa-4859-be07-a134330c53a3
[INFO] Execute statement succeed.
Flink SQL> insert overwrite table tbl values (5,'e'),(1,'a'),(1,'a'),(3,'c'),(2,'b'),(3,'c'),(3,'c'),(4,'d');
Hive Session ID = c932ebe6-55b0-4310-9558-59eb6e500862
[INFO] Submitting SQL update statement to the cluster...
2022-11-07 23:42:11,433 WARN  org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory      [] - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: b7c6c304ac487c919243bf71eff0b1a8

Flink SQL> select * from tbl cluster by key;  -- run cluster by
Hive Session ID = 1b8d0c3b-5a95-45cf-93d7-02a57297df99
2022-11-07 23:42:16,265 INFO  org.apache.hadoop.conf.Configuration.deprecation             [] - mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
+-----+-------+
| key | value |
+-----+-------+
|   1 |     a |
|   1 |     a |
|   2 |     b |
|   3 |     c |
|   3 |     c |
|   3 |     c |
|   4 |     d |
|   5 |     e |
+-----+-------+
8 rows in set

Shutting down the session...
done.

Error 1: Caused by: java.lang.ClassNotFoundException: org.antlr.runtime.tree.CommonTree. The following dependency jar is missing:


       // add antlr-runtime if you need to use hive dialect
       antlr-runtime-3.5.2.jar

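Error 2: INSERT OVERWRITE is not supported in streaming mode, so the session must be switched to batch mode before running it: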

[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalStateException: Streaming mode not support overwrite.

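3. Hive Read & Write

With HiveCatalog, Apache Flink can process Apache Hive tables in both BATCH and STREAM modes. Flink can therefore serve as a better-performing alternative to Hive's batch engine, or read from and write to Hive tables continuously to power real-time data warehouse applications.

3.1 Writing

Flink can write to Hive tables in both batch and streaming modes. In batch mode, the written data becomes visible only when the write job finishes. Batch writes support both append and overwrite.

3.1.1 Writing in Batch Mode

Writing to a non-partitioned table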


Flink SQL> SET table.sql-dialect=hive;  -- Hive dialect
Flink SQL> set execution.runtime-mode=batch; -- batch mode
Flink SQL> create table `users` (id int, name string);  -- create a non-partitioned Hive table from Flink SQL
Flink SQL> INSERT INTO users SELECT 2,'tom';

Writing to a partitioned table

Flink SQL> SET table.sql-dialect=hive;  -- Hive dialect
Flink SQL> set execution.runtime-mode=batch; -- batch mode

Flink SQL> create table `users_p` (id int, name string) partitioned by (create_day string);
-- static partition: the partition value is given in the PARTITION clause
Flink SQL> INSERT OVERWRITE users_p PARTITION (create_day='2022-11-08') SELECT 1, 'tom';
-- dynamic partition: the partition value comes from the last column of the SELECT
Flink SQL> INSERT OVERWRITE users_p SELECT 1, 'tom', '2022-11-08';
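
The two INSERT OVERWRITE statements above differ only in where the partition value comes from. The following Python sketch models that difference and is not Flink code: the helper name partition_path is hypothetical, and it assumes that dynamic partition values are the trailing columns of each row and that partition directories follow Hive's column=value naming.

```python
def partition_path(partition_columns, row, static_spec=None):
    """Split a row into data columns and the Hive-style partition directory.

    static_spec holds partition values fixed in the PARTITION clause; any
    remaining partition columns are taken from the trailing columns of the row.
    """
    static_spec = static_spec or {}
    dynamic_cols = [c for c in partition_columns if c not in static_spec]
    n = len(dynamic_cols)
    data = row[:len(row) - n] if n else row
    dynamic_vals = dict(zip(dynamic_cols, row[len(row) - n:])) if n else {}
    spec = {**static_spec, **dynamic_vals}
    path = "/".join(f"{c}={spec[c]}" for c in partition_columns)
    return data, path


# Static partition: value supplied by the PARTITION clause
static_result = partition_path(["create_day"], (1, "tom"), {"create_day": "2022-11-08"})
# Dynamic partition: value taken from the last column of the row
dynamic_result = partition_path(["create_day"], (1, "tom", "2022-11-08"))

print(static_result)   # ((1, 'tom'), 'create_day=2022-11-08')
print(dynamic_result)  # ((1, 'tom'), 'create_day=2022-11-08')
```

Both calls resolve to the same partition directory, which is why the two statements in the example overwrite the same partition.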

3.1.2 Writing in Streaming Mode

Streaming writes to Hive tables do not support INSERT OVERWRITE; attempting it fails with the following error:

[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalStateException: Streaming mode not support overwrite.

Example:

Flink SQL> set execution.runtime-mode=streaming; 

-- create the Hive table
SET table.sql-dialect=hive; -- Hive dialect
CREATE TABLE hive_table (
    user_id string,
    item_id string,
    category_id string,
    behavior string
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='1 h',
  'sink.partition-commit.policy.kind'='metastore,success-file'
);

-- create the Kafka table
SET table.sql-dialect=default; -- default dialect, i.e. Flink SQL
CREATE TABLE user_behavior (
    user_id VARCHAR,
    item_id VARCHAR,
    category_id VARCHAR,
    behavior VARCHAR,
    `ts` timestamp(3),
    `proctime` as PROCTIME(),   -- processing-time column
    WATERMARK FOR ts as ts - INTERVAL '5' SECOND  -- define a watermark on ts, making ts the event-time column
) WITH (
    'connector' = 'kafka', -- use the Kafka connector
    'topic' = 'user_behavior',  -- Kafka topic
    'scan.startup.mode' = 'latest-offset', -- start reading from the latest offset
    'properties.bootstrap.servers' = 'chb1:9092',
    'properties.group.id' = 'testGroup',
    'format' = 'csv'
);


-- streaming sql, insert into hive table
INSERT INTO hive_table 
SELECT user_id, item_id,category_id,behavior, DATE_FORMAT(`ts`, 'yyyy-MM-dd'), DATE_FORMAT(`ts`, 'HH') 
FROM user_behavior;

-- batch SQL: query one partition of the Hive table
SELECT * FROM hive_table WHERE dt='2021-01-04' AND  hr='16';
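
The hive_table properties in the example above combine a timestamp pattern with a partition-time commit trigger and a one-hour delay. The following Python sketch models how those three settings interact; it is not Flink code, and the function names extract_partition_time and is_committable are hypothetical. It assumes the rule that a partition is committed once the watermark passes the partition time plus the delay.

```python
from datetime import datetime, timedelta


def extract_partition_time(pattern: str, partition: dict) -> datetime:
    """Fill the '$name' placeholders with partition values, then parse the result."""
    text = pattern
    # Replace longer names first so one name never clobbers a longer one sharing its prefix.
    for name in sorted(partition, key=len, reverse=True):
        text = text.replace("$" + name, partition[name])
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def is_committable(partition_time: datetime, delay: timedelta, watermark: datetime) -> bool:
    """Partition-time trigger: commit once the watermark passes partition time + delay."""
    return watermark > partition_time + delay


pattern = "$dt $hr:00:00"                      # same pattern as in the table properties
partition = {"dt": "2021-01-04", "hr": "16"}
delay = timedelta(hours=1)                     # 'sink.partition-commit.delay'='1 h'

pt = extract_partition_time(pattern, partition)
print(pt)                                                        # 2021-01-04 16:00:00
print(is_committable(pt, delay, datetime(2021, 1, 4, 16, 59)))   # False
print(is_committable(pt, delay, datetime(2021, 1, 4, 17, 1)))    # True
```

With a one-hour delay, the partition dt=2021-01-04/hr=16 only becomes visible to downstream readers after the watermark has moved past 17:00, that is, once the hour is complete.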

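Important:

1. Flink reads Hive tables in batch mode by default. Reading a Hive table in streaming mode requires some additional options, described below.
2. Files move from the in-progress state to the finished state only after a checkpoint completes, and that is also when the _SUCCESS file is generated. Streaming writes to Hive therefore require checkpointing to be enabled and configured. For the Flink SQL Client, enable checkpointing in flink-conf.yaml with the following settings:
- state.backend: filesystem
- execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
- execution.checkpointing.interval: 60s
- execution.checkpointing.mode: EXACTLY_ONCE
- state.savepoints.dir: hdfs://kms-1:8020/flink-savepoints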

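3.2 Reading

Flink can read Hive tables in both batch and streaming modes. A batch read works like a query in Hive itself: the table is read once, as of the time the query is submitted. A streaming read monitors the Hive table continuously and fetches new data incrementally. By default, Flink reads Hive tables in batch mode.

Streaming reads support both partitioned and non-partitioned tables. For a partitioned table, Flink monitors newly created partitions and reads them incrementally. For a non-partitioned table, Flink monitors the table's storage directory for new files and reads them incrementally.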
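
The idea of incremental monitoring can be pictured with a small Python sketch. This is a conceptual model only, not how Flink implements its Hive source: the class NewItemMonitor is hypothetical, and it simply assumes that each scan lists everything currently present and that only items not seen before are handed on for reading.

```python
class NewItemMonitor:
    """Remembers what has been seen and reports only unseen items on each scan."""

    def __init__(self):
        self._seen = set()

    def poll(self, current_items):
        # Sort so the result is deterministic regardless of listing order.
        new_items = sorted(set(current_items) - self._seen)
        self._seen.update(new_items)
        return new_items


monitor = NewItemMonitor()
first = monitor.poll(["dt=2021-01-03", "dt=2021-01-04"])
second = monitor.poll(["dt=2021-01-03", "dt=2021-01-04", "dt=2021-01-05"])
third = monitor.poll(["dt=2021-01-03", "dt=2021-01-04", "dt=2021-01-05"])

print(first)   # ['dt=2021-01-03', 'dt=2021-01-04']
print(second)  # ['dt=2021-01-05']
print(third)   # []
```

Each scan still lists every partition or file, which is why a very large number of partitions makes monitoring expensive.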

In the SQL Client, SQL hints must be enabled explicitly:

Flink SQL> set table.dynamic-table-options.enabled= true;

Query a Hive table in streaming mode with a SQL hint:

SELECT * FROM hive_table/*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2021-01-03') */;

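The following options can be configured when Flink reads a Hive table:

Option	Default	Description
streaming-source.enable	false	Whether to read the Hive table as a stream. Disabled by default.
streaming-source.partition.include	all	Which partitions to read: all or latest. all reads the data of every partition; latest reads only the newest partition. Note that latest only works when streaming read is enabled and the table is used as the dimension table of a temporal join.
streaming-source.monitor-interval	None	Interval at which the table's partitions or files are monitored. For a streaming read the effective default is 1m (1 minute); for a temporal join it is 60m (1 hour). Avoid short intervals for temporal joins (an hour or more is advisable): in the current implementation every task queries the metastore, and frequent queries can put heavy load on it.
streaming-source.partition-order	partition-name	Order in which the streaming source consumes partitions. The default, partition-name, loads the latest partition by partition-name order and is the recommended choice. The alternatives are create-time, which orders by the creation time of the partition files, and partition-time, which orders by the time extracted from the partition. Note that for non-partitioned tables the value is create-time.
streaming-source.consume-start-offset	None	Start offset for the streaming read of the Hive table.
partition.time-extractor.kind	default	Type of the extractor that derives a time from partition values: default or custom. With default, the timestamp pattern is configured through partition.time-extractor.timestamp-pattern.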

Notes:

In Flink 1.10, Hive data could only be read in batch mode; streaming reads of Hive data are available since 1.11.
The monitor strategy is to scan all directories/files currently in the location path, so a large number of partitions may degrade performance.
Streaming reads of non-partitioned tables require each file to be written atomically into the target directory.
Streaming reads of partitioned tables require each partition to be added atomically from the Hive Metastore's point of view. If not, new data added to an existing partition will be consumed.
Streaming reads do not support the watermark grammar in Flink DDL, so these tables cannot be used with window operators.

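3.3 Temporal Table Join

Since Flink 1.12, the latest partition of a Hive table can be used as a temporal table. A stream can be joined with the latest partition of a partitioned Hive table directly in SQL; Flink watches for new partitions automatically, and when one appears it replaces the dimension data in full.

Flink supports processing-time temporal joins against Hive tables, meaning the join always uses the latest version of the temporal table. Both partitioned and non-partitioned tables are supported; for a partitioned table, Flink tracks the table's latest partition. Note that event-time temporal joins against Hive tables are not supported yet.

3.3.1 Temporal Join The Latest Partition

For a partitioned Hive table that changes over time, Flink can read the table as an unbounded stream. If every partition contains the full data set as of some point in time, each partition acts as one version of the temporal table, and the latest partition serves as the complete dimension table.

Note: this feature is only supported in Flink's STREAMING mode.

Before using the latest Hive partition as a temporal table, two options must be set: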

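-- read the Hive table as a stream
'streaming-source.enable' = 'true',
-- read only the latest partition; requires streaming read to be enabled
'streaming-source.partition.include' = 'latest'

These options can be set either in the table definition when the Hive dimension table is created, or dynamically through SQL hints. A template for creating a Hive dimension table: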

-- use the Hive SQL dialect
SET table.sql-dialect=hive;
CREATE TABLE dimension_table (
  product_id STRING,
  product_name STRING,
  unit_price DECIMAL(10, 4),
  pv_count BIGINT,
  like_count BIGINT,
  comment_count BIGINT,
  update_time TIMESTAMP(3),
  update_user STRING,
  ...
) PARTITIONED BY (pt_year STRING, pt_month STRING, pt_day STRING) TBLPROPERTIES (
  -- Option 1 (recommended): identify the latest partition by partition-name order
  'streaming-source.enable' = 'true', -- enable the streaming source
  'streaming-source.partition.include' = 'latest', -- use only the latest partition
  'streaming-source.monitor-interval' = '12 h', -- reload the latest partition every 12 hours
  'streaming-source.partition-order' = 'partition-name'  -- order by partition name

  -- Option 2: identify the latest partition by the creation time of the partition files
  -- (use this group of properties in place of option 1)
  -- 'streaming-source.enable' = 'true',
  -- 'streaming-source.partition.include' = 'latest',
  -- 'streaming-source.partition-order' = 'create-time', -- order by file creation time
  -- 'streaming-source.monitor-interval' = '12 h'

  -- Option 3: identify the latest partition by partition time
  -- (use this group of properties in place of option 1)
  -- 'streaming-source.enable' = 'true',
  -- 'streaming-source.partition.include' = 'latest',
  -- 'streaming-source.monitor-interval' = '12 h',
  -- 'streaming-source.partition-order' = 'partition-time', -- order by partition time
  -- 'partition.time-extractor.kind' = 'default',
  -- 'partition.time-extractor.timestamp-pattern' = '$pt_year-$pt_month-$pt_day 00:00:00'
);
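
Which partition counts as "latest" depends on the chosen partition order. The Python sketch below contrasts partition-name order with partition-time order on two hypothetical partitions; the helper names are illustrative, not Flink APIs. It assumes that name order is a plain string comparison and that partition time is parsed from the pattern used in the template.

```python
from datetime import datetime


def latest_by_name(partitions):
    """partition-name order: compare partition name strings lexicographically."""
    return max(partitions, key=lambda p: p["name"])


def latest_by_partition_time(partitions, pattern):
    """partition-time order: compare the time extracted from the partition values."""
    def extract(p):
        text = pattern
        for key in sorted(p["values"], key=len, reverse=True):
            text = text.replace("$" + key, p["values"][key])
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return max(partitions, key=extract)


# Hypothetical partitions; note that the month "9" is not zero-padded.
parts = [
    {"name": "pt_year=2021/pt_month=9/pt_day=30",
     "values": {"pt_year": "2021", "pt_month": "9", "pt_day": "30"}},
    {"name": "pt_year=2021/pt_month=10/pt_day=01",
     "values": {"pt_year": "2021", "pt_month": "10", "pt_day": "01"}},
]
pattern = "$pt_year-$pt_month-$pt_day 00:00:00"

print(latest_by_name(parts)["name"])                     # pt_year=2021/pt_month=9/pt_day=30
print(latest_by_partition_time(parts, pattern)["name"])  # pt_year=2021/pt_month=10/pt_day=01
```

In this model the two orders disagree because "9" sorts after "10" as a string, so partition-name order is only reliable when partition values are zero-padded to a fixed width.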

With this Hive dimension table in place, it can be joined with real-time Kafka stream data to produce a widened table:

-- use the default SQL dialect
SET table.sql-dialect=default;
-- Kafka real-time stream table
CREATE TABLE orders_table (
  order_id STRING,
  order_amount DOUBLE,
  product_id STRING,
  log_ts TIMESTAMP(3),
  proctime as PROCTIME()
) WITH (...);

-- join the stream table with the latest Hive partition
SELECT *
FROM orders_table AS orders
JOIN dimension_table FOR SYSTEM_TIME AS OF orders.proctime AS dim 
ON orders.product_id = dim.product_id;

Instead of specifying the options when defining the Hive dimension table, they can also be set dynamically with a SQL hint:

SELECT *
FROM orders_table AS orders
JOIN dimension_table
/*+ OPTIONS('streaming-source.enable'='true',             
    'streaming-source.partition.include' = 'latest',
    'streaming-source.monitor-interval' = '1 h',
    'streaming-source.partition-order' = 'partition-name') */
FOR SYSTEM_TIME AS OF orders.proctime AS dim -- temporal (dimension) table
ON orders.product_id = dim.product_id;

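3.3.2 Temporal Join The Latest Table

For a non-partitioned Hive table used in a temporal join, the whole table is cached in slot memory, and each record from the stream is matched against it by key. Joining against the latest Hive table needs no extra configuration apart from a TTL for the cached table, lookup.join.cache.ttl: when the cache expires, the Hive table is scanned again and the latest data is loaded.

Because all data of the dimension table is cached in TaskManager (TM) memory, a large dimension table can easily cause an OOM. The TTL should not be too short either, since frequent reloads hurt performance.

Note: with this approach the Hive table must be a bounded lookup table, i.e. a temporal table that is not a streaming source. In other words, the table property streaming-source.enable must be false.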

-- the Hive dimension table is loaded daily by a batch job
SET table.sql-dialect=hive;
CREATE TABLE dimension_table (
  product_id STRING,
  product_name STRING,
  unit_price DECIMAL(10, 4),
  pv_count BIGINT,
  like_count BIGINT,
  comment_count BIGINT,
  update_time TIMESTAMP(3),
  update_user STRING,
  ...
) TBLPROPERTIES (
  'streaming-source.enable' = 'false', -- disable the streaming source
  'streaming-source.partition.include' = 'all',  -- read all data
  'lookup.join.cache.ttl' = '12 h' -- cache TTL
);
-- Kafka fact table
SET table.sql-dialect=default;
CREATE TABLE orders_table (
  order_id STRING,
  order_amount DOUBLE,
  product_id STRING,
  log_ts TIMESTAMP(3),
  proctime as PROCTIME()
) WITH (...);

-- join against the Hive dimension table; Flink loads the whole table into memory
SELECT *
FROM orders_table AS orders
JOIN dimension_table FOR SYSTEM_TIME AS OF orders.proctime AS dim
ON orders.product_id = dim.product_id;

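Important:

1. Every subtask caches its own full copy of the dimension table, so make sure the TM task slots have enough memory to hold it.
2. Set streaming-source.monitor-interval and lookup.join.cache.ttl to fairly large values, because frequent refreshes and reloads hurt performance.
3. When the cached dimension data needs refreshing, the whole table is currently reloaded, so new data cannot be distinguished from old data.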
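
The caching behaviour behind lookup.join.cache.ttl can be modelled in a few lines of Python. This is an illustrative sketch, not Flink's implementation: the class TtlTableCache, the in-memory table, and the fake clock are all hypothetical. It assumes that the whole table is reloaded on the first lookup after the TTL has elapsed.

```python
class TtlTableCache:
    """Caches a full table snapshot and reloads all of it once the TTL has expired."""

    def __init__(self, load_all, ttl_seconds, clock):
        self._load_all = load_all      # callable returning {key: row} for the whole table
        self._ttl = ttl_seconds
        self._clock = clock            # callable returning the current time in seconds
        self._rows = None
        self._loaded_at = None
        self.reload_count = 0

    def lookup(self, key):
        now = self._clock()
        if self._rows is None or now - self._loaded_at >= self._ttl:
            self._rows = dict(self._load_all())   # full reload, no incremental update
            self._loaded_at = now
            self.reload_count += 1
        return self._rows.get(key)


# Hypothetical in-memory stand-in for the Hive table, plus a fake clock
table = {"p1": "apple"}
now = [0]
cache = TtlTableCache(lambda: table, ttl_seconds=12 * 3600, clock=lambda: now[0])

r1 = cache.lookup("p1")
table["p2"] = "pear"          # the table changes after the snapshot was taken
r2 = cache.lookup("p2")       # still within the TTL, so the stale snapshot is used
now[0] = 12 * 3600            # the TTL has elapsed
r3 = cache.lookup("p2")       # triggers a reload of the whole table

print(r1)                  # apple
print(r2)                  # None
print(r3)                  # pear
print(cache.reload_count)  # 2
```

The second lookup shows the trade-off described in the tips: a long TTL keeps reloads rare, but rows added to the table stay invisible until the next full reload.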

4. Hive Functions

References:
https://nightlies.apache.org/flink/flink-docs-release-1.13/zh/docs/connectors/table/hive/hive_read_write/#writing
https://zhuanlan.zhihu.com/p/434562109
http://www.manongjc.com/detail/22-ygevltlxxowgkno.html