Using the Hive Connector in Presto/Trino (Internal Tables, External Tables, Partitioned Tables)

This document describes in detail how to configure the Hive connector in Trino, including creating schemas, managed (internal) tables, external tables, and partitioned tables, as well as working with data in external partitioned tables. It also walks through inserting data into tables and deleting data from partitioned tables. Trino obtains metadata through the Hive Metastore service and reads and writes data directly on HDFS; SQL queries are not pushed down to Hive.

The Hive connector does not use Hive's execution engine. Instead, it retrieves metadata from the Hive Metastore service and then reads and writes data directly on HDFS through an HDFS client. Consequently, SQL queries cannot be pushed down to Hive.

1. Configuring the Hive connector

Create a new catalog configuration file on all nodes:

[root@trino1 catalog]# pwd
/root/trino-server-367/etc/catalog
[root@trino1 catalog]#
[root@trino1 catalog]# cat hive.properties 
connector.name=hive
hive.metastore.uri=thrift://192.168.23.91:9083
# Whether to allow writes to Hive non-managed (external) tables; the default is false
hive.non-managed-table-writes-enabled=true

[root@trino1 catalog]# 

For an HA (high-availability) HDFS cluster, the Hadoop configuration files core-site.xml and hdfs-site.xml must be copied to every Trino node.

First create a Hadoop configuration directory on all Trino nodes:

[root@trino1 ~]# mkdir trino-server-367/etc/hadoop
[root@trino1 ~]#

Then copy the Hadoop configuration files to all Trino nodes:

[root@hive1 ~]# 
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/core-site.xml root@192.168.23.81:/root/trino-server-367/etc/hadoop 
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/core-site.xml root@192.168.23.82:/root/trino-server-367/etc/hadoop     
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/core-site.xml root@192.168.23.83:/root/trino-server-367/etc/hadoop 
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/hdfs-site.xml root@192.168.23.81:/root/trino-server-367/etc/hadoop   
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/hdfs-site.xml root@192.168.23.82:/root/trino-server-367/etc/hadoop    
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/hdfs-site.xml root@192.168.23.83:/root/trino-server-367/etc/hadoop 
[root@hive1 ~]# 

Add address mappings for the Hadoop servers to /etc/hosts on all Trino nodes:

192.168.23.91 hive1
192.168.23.92 hive2
192.168.23.93 hive3

Then add the following setting to etc/catalog/hive.properties on all Trino nodes:

hive.config.resources=/root/trino-server-367/etc/hadoop/core-site.xml,/root/trino-server-367/etc/hadoop/hdfs-site.xml
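
For reference, the HA-related entries that Trino reads from hdfs-site.xml look roughly like the sketch below, matching the nnha nameservice used in this article; the NameNode identifiers and ports are assumptions for illustration:

<property>
    <name>dfs.nameservices</name>
    <value>nnha</value>
</property>
<property>
    <name>dfs.ha.namenodes.nnha</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.nnha.nn1</name>
    <value>hive1:8020</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.nnha.nn2</name>
    <value>hive2:8020</value>
</property>
<property>
    <name>dfs.client.failover.proxy.provider.nnha</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>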

Restart Trino:

[root@trino1 catalog]# 
[root@trino1 catalog]# cd /root/trino-server-367
[root@trino1 trino-server-367]# 
[root@trino1 trino-server-367]# bin/launcher stop
Not running
[root@trino1 trino-server-367]# 
[root@trino1 trino-server-367]# bin/launcher start
Started as 46215
[root@trino1 trino-server-367]#
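
Once Trino is back up, you can confirm from the CLI that the new catalog and its schemas are visible (standard Trino statements; output omitted here):

trino> show catalogs;
trino> show schemas from hive;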

2. Creating a schema through the Hive connector

trino> create schema hive.test_db with (location = 'hdfs://nnha/user/hive/warehouse/test_db.db');
CREATE SCHEMA
trino>

The location property is optional; if it is omitted, the metastore places the schema under the default Hive warehouse directory.
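
For example, a schema created without location (test_db2 is a hypothetical name) simply lands in the default warehouse directory:

trino> create schema hive.test_db2;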

3. Creating internal and external tables through the Hive connector

The data file format (the format table property) defaults to ORC; the default can be changed with the hive.storage-format property in the catalog properties file.

Data compression defaults to GZIP; the default can be changed with the hive.compression-codec property in the catalog properties file.
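
As an illustration, switching the defaults to Parquet files with Snappy compression would look like this in hive.properties (both are documented values for these two properties):

hive.storage-format=PARQUET
hive.compression-codec=SNAPPY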

When creating an external table, the directory that external_location points to must already exist, as shown below:

trino> 
trino> create table hive.test_db.test_external_tb(
    -> user_id bigint,
    -> user_name varchar,
    -> birthday date,
    -> country varchar
    -> ) with (external_location = 'hdfs://nnha/user/hive/warehouse/test_db.db/test_external_tb');
CREATE TABLE
trino>

If the external_location property is omitted, a managed (internal) table is created instead.
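
For comparison, a minimal sketch of the managed-table variant, with the same columns but no external_location (test_managed_tb is a hypothetical name):

trino> create table hive.test_db.test_managed_tb(
    -> user_id bigint,
    -> user_name varchar,
    -> birthday date,
    -> country varchar
    -> );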

4. Creating a partitioned table through the Hive connector

Partition columns must come last in the table's column list.

trino> 
trino> create table hive.test_db.test_partition_tb(
    -> user_id bigint,
    -> user_name varchar,
    -> birthday date,
    -> country varchar
    -> ) with (partitioned_by = array['birthday', 'country']);
CREATE TABLE
trino> 

Filtering on the partition columns this way reduces the range of data scanned, as shown below:

trino> select distinct user_id from hive.test_db.test_partition_tb where birthday = date '2022-02-09' and country = 'china';
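
To inspect which partitions a table currently has, the Hive connector exposes a hidden $partitions table for every partitioned table:

trino> select * from hive.test_db."test_partition_tb$partitions";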

5. An external partitioned table created through the Hive connector cannot see existing partition data

The HDFS directory structure is as follows:

/user/hive/warehouse/test_db.db/external_partition_tb/birthday=2018-08-16/country=china/test2018.txt
/user/hive/warehouse/test_db.db/external_partition_tb/birthday=2019-08-16/country=japan/test2019.txt

The content of the birthday=2018-08-16/country=china partition is as follows:

1,zhang_san,2018-08-16,china

The content of the birthday=2019-08-16/country=japan partition is as follows:

2,li_si,2019-08-16,japan
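
Files like these can be staged with the standard HDFS shell, for example (the paths match the layout above; the local file name is an assumption):

[root@hive1 ~]# hdfs dfs -mkdir -p /user/hive/warehouse/test_db.db/external_partition_tb/birthday=2018-08-16/country=china
[root@hive1 ~]# hdfs dfs -put test2018.txt /user/hive/warehouse/test_db.db/external_partition_tb/birthday=2018-08-16/country=china/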

Create the external partitioned table external_partition_tb in Trino:

trino> 
trino> create table hive.test_db.external_partition_tb(
    -> user_id varchar,
    -> user_name varchar,
    -> birthday varchar,
    -> country varchar
    -> ) with (
    -> format = 'CSV',
    -> csv_escape = '\',
    -> csv_quote = '"',  
    -> csv_separator = ',',
    -> external_location = 'hdfs://nnha/user/hive/warehouse/test_db.db/external_partition_tb',
    -> partitioned_by = array['birthday', 'country']
    -> );
CREATE TABLE
trino> 
trino> select * from hive.test_db.external_partition_tb;
 user_id | user_name | birthday | country 
---------+-----------+----------+---------
(0 rows)

Query 20220209_113936_00038_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.30 [0 rows, 0B] [0 rows/s, 0B/s]

trino> 

When creating a CSV-format table with the Hive connector, the columns can only be of type varchar.
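
Because of this, cast the columns at query time when typed values are needed; a minimal sketch:

trino> select cast(user_id as bigint) as user_id, user_name,
    -> cast(birthday as date) as birthday, country
    -> from hive.test_db.external_partition_tb;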

The query against external_partition_tb returns no rows. This is because the Hive Metastore (HMS) does not recognize the partitions; we need to add the partitions manually or run the automatic partition-discovery procedure.

  1. Manually add a partition
trino> 
trino> use hive.test_db;
USE
trino:test_db> 
trino:test_db> call system.create_empty_partition(
            -> schema_name => 'test_db',
            -> table_name => 'external_partition_tb',
            -> partition_columns => array['birthday', 'country'],
            -> partition_values => array['2018-08-16', 'china']
            -> );
CALL
trino:test_db> 
trino:test_db> select * from hive.test_db.external_partition_tb;
 user_id | user_name |  birthday  | country 
---------+-----------+------------+---------
 1       | zhang_san | 2018-08-16 | china   
(1 row)

Query 20220209_114018_00044_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.54 [1 rows, 32B] [1 rows/s, 60B/s]

trino:test_db> 
  2. Automatically discover all partitions
trino:test_db> 
trino:test_db> call system.sync_partition_metadata(
            -> schema_name => 'test_db', 
            -> table_name => 'external_partition_tb', 
            -> mode => 'FULL', 
            -> case_sensitive => true
            -> );
CALL
trino:test_db> 
trino:test_db> select * from hive.test_db.external_partition_tb;
 user_id | user_name |  birthday  | country 
---------+-----------+------------+---------
 1       | zhang_san | 2018-08-16 | china   
 2       | li_si     | 2019-08-16 | japan   
(2 rows)

Query 20220209_114728_00048_jx84g, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.50 [2 rows, 57B] [3 rows/s, 114B/s]

trino:test_db> 
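
The inverse metastore operation also exists: system.unregister_partition removes a partition's metadata without deleting the underlying files. A hedged sketch, following the procedure signature documented for the Trino Hive connector:

trino:test_db> call system.unregister_partition(
            -> schema_name => 'test_db',
            -> table_name => 'external_partition_tb',
            -> partition_columns => array['birthday', 'country'],
            -> partition_values => array['2019-08-16', 'japan']
            -> );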

6. Inserting data through the Hive connector

trino> 
trino> insert into hive.test_db.test_external_tb(user_id, user_name, birthday, country) values(1, 'zhang_san', date '2018-08-16', 'china'), (2, 'li_si', date '2019-08-16', 'japan');
INSERT: 2 rows

Query 20220209_095722_00002_jx84g, FINISHED, 2 nodes
Splits: 5 total, 5 done (100.00%)
17.84 [0 rows, 0B] [0 rows/s, 0B/s]

trino>
trino> insert into hive.test_db.test_partition_tb select * from hive.test_db.test_external_tb;
INSERT: 2 rows

Query 20220209_095933_00003_jx84g, FINISHED, 2 nodes
Splits: 5 total, 5 done (100.00%)
6.99 [2 rows, 589B] [0 rows/s, 84B/s]

trino> 
trino> 
trino> create table hive.test_db.create_table_insert ( 
    -> user_id,                           
    -> user_name,                        
    -> birthday,                            
    -> country                           
    -> ) as select * from hive.test_db.test_partition_tb;
CREATE TABLE: 2 rows

Query 20220209_103421_00011_jx84g, FINISHED, 2 nodes
Splits: 6 total, 6 done (100.00%)
2.42 [2 rows, 732B] [0 rows/s, 302B/s]

trino>
  • When inserting data into a partitioned table through the Hive connector, any partition directory that does not exist is created automatically
  • A table created with create table table_name(......) as select ...... copies the column structure but not the table properties; the new table can, however, declare its own properties, as shown in the sketch below
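
For example, to carry the partitioning over to a new table, declare it in the WITH clause of the CTAS statement (create_table_insert_p is a hypothetical name; the partition columns must still come last in the selected column order):

trino> create table hive.test_db.create_table_insert_p
    -> with (partitioned_by = array['birthday', 'country'])
    -> as select * from hive.test_db.test_partition_tb;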

7. Deleting data from a partitioned table

Note that for non-transactional Hive tables, DELETE is supported only when the WHERE clause filters on partition columns, so data is removed a whole partition at a time.

trino> 
trino> select * from hive.test_db.test_partition_tb;
 user_id | user_name |  birthday  | country 
---------+-----------+------------+---------
       1 | zhang_san | 2018-08-16 | china   
       2 | li_si     | 2019-08-16 | japan   
(2 rows)

Query 20220209_115202_00053_jx84g, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.44 [2 rows, 732B] [4 rows/s, 1.62KB/s]

trino> 
trino> delete from hive.test_db.test_partition_tb where birthday = date '2019-08-16';
DELETE

Query 20220209_115234_00054_jx84g, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.91 [0 rows, 0B] [0 rows/s, 0B/s]

trino> 
trino> select * from hive.test_db.test_partition_tb;
 user_id | user_name |  birthday  | country 
---------+-----------+------------+---------
       1 | zhang_san | 2018-08-16 | china   
(1 row)

Query 20220209_115259_00055_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.49 [1 rows, 374B] [2 rows/s, 759B/s]

trino> 
trino> select * from hive.test_db.external_partition_tb;
 user_id | user_name |  birthday  | country 
---------+-----------+------------+---------
 2       | li_si     | 2019-08-16 | japan   
 1       | zhang_san | 2018-08-16 | china   
(2 rows)

Query 20220209_115403_00056_jx84g, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.39 [2 rows, 57B] [5 rows/s, 147B/s]

trino> 
trino> delete from hive.test_db.external_partition_tb where birthday = '2019-08-16';
DELETE

Query 20220209_115453_00059_jx84g, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.42 [0 rows, 0B] [0 rows/s, 0B/s]

trino> 
trino> 
trino> select * from hive.test_db.external_partition_tb;
 user_id | user_name |  birthday  | country 
---------+-----------+------------+---------
 1       | zhang_san | 2018-08-16 | china   
(1 row)

Query 20220209_115649_00060_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.58 [1 rows, 32B] [1 rows/s, 55B/s]

trino> 
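
As a final cleanup note, dropping an external table removes only the metastore entry; the files under external_location stay on HDFS (standard Hive external-table semantics):

trino> drop table hive.test_db.external_partition_tb;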