Hive08: Partitioned Tables

程序喵猴

Category: hive. Tags: hive, hadoop, data warehouse

Article link: https://blog.csdn.net/tonyshi1989/article/details/135354180

I. Partitioned Tables

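1 Concept

A partition of a Hive table corresponds to a separate directory on HDFS, and that directory holds all of the partition's data files. In other words, partitioning in Hive means splitting a table into subdirectories, so that a large dataset is divided into smaller ones according to business needs. When the WHERE clause of a query restricts the partition column, Hive reads only the matching partitions, which is usually much faster than scanning the whole table.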
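To make "a partition is a directory" concrete, here is a minimal Python sketch that maps a partition spec to a directory path. The warehouse root `/user/hive/warehouse` and the helper name `partition_path` are assumptions chosen for illustration; the sketch models the directory layout only and is not Hive's own code.

```python
WAREHOUSE_ROOT = "/user/hive/warehouse"  # assumed warehouse location


def partition_path(table, spec):
    """Build a partition directory: one key=value level per partition column."""
    levels = ["{}={}".format(column, value) for column, value in spec.items()]
    return "/".join([WAREHOUSE_ROOT, table] + levels)


# One partition column gives one directory level under the table directory.
print(partition_path("dept_partition", {"day": "20220401"}))
```

Each additional partition column would simply add one more directory level below the first.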

2 Worked example

1 Syntax for creating a partitioned table

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Note: the partition column must not duplicate a column already defined in the table. You can think of the partition column as a pseudo-column of the table.

2 Loading data into the partitioned table

(1) Prepare the data

dept_20220401.log

10	ACCOUNTING	1700
20	RESEARCH	1800

dept_20220402.log

30	SALES	1900
40	OPERATIONS	1700

dept_20220403.log

50	TEST	2000
60	DEV	1900

(2) Load the data

hive (default)> load data local inpath
'/usr/soft/datas/dept_20220401.log' into table dept_partition
partition(day='20220401');

hive (default)> load data local inpath
'/usr/soft/datas/dept_20220402.log' into table dept_partition
partition(day='20220402');

hive (default)> load data local inpath
'/usr/soft/datas/dept_20220403.log' into table dept_partition
partition(day='20220403');

Note: when loading data into a partitioned table this way, you must specify the target partition.
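The pseudo-column idea can be sketched in Python: the data file stores only the three table columns, and the partition value is supplied separately by the partition clause. The function name `read_partition` is hypothetical, and this is only a model of the behaviour, not how Hive reads files internally.

```python
def read_partition(lines, day):
    """Parse tab-separated rows and append the partition value as a trailing column."""
    rows = []
    for line in lines:
        deptno, dname, loc = line.rstrip("\n").split("\t")
        rows.append((int(deptno), dname, loc, day))
    return rows


# Contents of dept_20220401.log: three fields per line, no day field.
file_20220401 = ["10\tACCOUNTING\t1700\n", "20\tRESEARCH\t1800\n"]

for row in read_partition(file_20220401, "20220401"):
    print(row)
```

The day value never appears in the file itself, which is why the load statement has to name the partition.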

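3 Querying data in a partitioned table

Single-partition query

hive (default)> select * from dept_partition where day='20220401';

or

hive (default)> select * from dept_partition where deptno=10 or deptno=20;

With the sample data both queries return the same rows, but the second one filters on an ordinary column and therefore scans the whole table.

The first one filters on the partition column, so only the day=20220401 partition is read.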
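The difference between the two queries can be modelled with a small Python sketch. The dictionary `PARTITIONS` is a hypothetical in-memory stand-in for the three sample files, and `query` is an illustrative helper that counts how many partitions it has to read; it is a simplification, not Hive's planner.

```python
# Hypothetical in-memory stand-in for the three sample partitions.
PARTITIONS = {
    "20220401": [(10, "ACCOUNTING", "1700"), (20, "RESEARCH", "1800")],
    "20220402": [(30, "SALES", "1900"), (40, "OPERATIONS", "1700")],
    "20220403": [(50, "TEST", "2000"), (60, "DEV", "1900")],
}


def query(day=None, row_filter=None):
    """Return (matching rows, number of partitions read)."""
    # A filter on the partition column narrows the set of partitions up front.
    days = [day] if day is not None else list(PARTITIONS)
    scanned = 0
    result = []
    for d in days:
        rows = PARTITIONS.get(d)
        if rows is None:
            continue
        scanned += 1
        for row in rows:
            if row_filter is None or row_filter(row):
                result.append(row + (d,))
    return result, scanned


by_day = query(day="20220401")
by_deptno = query(row_filter=lambda row: row[0] in (10, 20))
print("partitions read:", by_day[1], "vs", by_deptno[1])
```

Both calls return the same two rows, but only the first one can skip the other partitions.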

Multi-partition query

hive (default)> select * from dept_partition where day='20220401'
 union
 select * from dept_partition where day='20220402'
 union
 select * from dept_partition where day='20220403';

or

hive (default)> select * from dept_partition where day='20220401' or day='20220402' or day='20220403';

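4 Adding partitions

Add a single partition

hive (default)> alter table dept_partition add partition(day='20220404');

Add several partitions at once (the partition clauses are separated by spaces)

hive (default)> alter table dept_partition add partition(day='20220405')
partition(day='20220406');

5 Dropping partitions

Drop a single partition

hive (default)> alter table dept_partition drop partition (day='20220406');

Drop several partitions at once (the partition clauses are separated by commas)

hive (default)> alter table dept_partition drop partition (day='20220404'), partition(day='20220405');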

6 Listing the partitions of a partitioned table

hive> show partitions dept_partition;

7 Viewing the structure of a partitioned table

hive> desc formatted dept_partition;

II. Two-Level Partitions

Question: if even a single day of log data is very large, how can the data be split further?

1 Create a two-level partitioned table

hive (default)> create table dept_partition2(
 deptno int, dname string, loc string
 )
 partitioned by (day string, hour string)
 row format delimited fields terminated by '\t';

2 Loading data the normal way

(1) Load data into the two-level partitioned table

hive (default)> load data local inpath
'/usr/soft/hive/datas/dept_20220401.log' into table
dept_partition2 partition(day='20220401', hour='12');

(2) Query the partition

hive (default)> select * from dept_partition2 where day='20220401' and hour='12';

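3 Three ways to upload data directly into a partition directory and associate it with the partitioned table

(1) Method one: upload the data, then repair the table

Upload the data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/dept_partition2/day=20220401/hour=13;

hive (default)> dfs -put /usr/soft/datas/dept_20220401.log
/user/hive/warehouse/dept_partition2/day=20220401/hour=13;

Query the data (the newly uploaded rows are not returned, because the metastore does not know about the new partition yet)

hive (default)> select * from dept_partition2 where day='20220401' and hour='13';

Run the repair command

hive> msck repair table dept_partition2;

The command reports the partitions it found on HDFS that were missing from the metastore:

Partitions not in metastore:

Query the data again

hive (default)> select * from dept_partition2 where day='20220401' and hour='13';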
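What the repair step does can be sketched as a comparison between the directories on the file system and the partitions recorded in the metastore. The function `find_missing_partitions` and the sample directory names are hypothetical; this is a simplified model, not the actual implementation of msck repair table.

```python
def find_missing_partitions(fs_dirs, metastore_partitions):
    """Return partition specs that exist as directories but are unknown to the metastore."""
    missing = []
    for path in fs_dirs:
        # "day=20220401/hour=13" -> (("day", "20220401"), ("hour", "13"))
        spec = tuple(tuple(level.split("=", 1)) for level in path.strip("/").split("/"))
        if spec not in metastore_partitions:
            missing.append(spec)
    return missing


# Illustrative state: one partition was loaded normally, one was only uploaded.
fs_dirs = ["day=20220401/hour=12", "day=20220401/hour=13"]
metastore = {(("day", "20220401"), ("hour", "12"))}

missing = find_missing_partitions(fs_dirs, metastore)
metastore.update(missing)  # the "repair": register what was missing
print(missing)
```

Until the partition is registered, queries cannot see the uploaded files even though they sit in the right directory.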

(2) Method two: upload the data, then add the partition

Upload the data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/dept_partition2/day=20220401/hour=14;

hive (default)> dfs -put /usr/soft/datas/dept_20220401.log /user/hive/warehouse/dept_partition2/day=20220401/hour=14;

Add the partition

hive (default)> alter table dept_partition2 add partition(day='20220401',hour='14');

Query the data

hive (default)> select * from dept_partition2 where day='20220401' and hour='14';

(3) Method three: create the directory, then load data into the partition

Create the directory

hive (default)> dfs -mkdir -p
/user/hive/warehouse/dept_partition2/day=20220401/hour=15;

Load the data

hive (default)> load data local inpath
'/usr/soft/hive/datas/dept_20220401.log' into table
dept_partition2 partition(day='20220401',hour='15');

Query the data

hive (default)> select * from dept_partition2 where day='20220401' and
hour='15';

III. Dynamic Partitions

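1 Create the table

create table dept_no_par(
dname string, loc string
)
partitioned by (deptno int)
row format delimited fields terminated by '\t';

2 Query the dept table and insert the result into dept_no_par

insert into table dept_no_par partition(deptno) select dname, loc, deptno from dept;

With dynamic partitioning, the partition clause names the partition column without giving it a value, and the last column of the select list supplies the partition value for each row.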
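The routing rule of the dynamic-partition insert above can be sketched in Python: the last selected column picks the partition, and the remaining columns are stored as data. The helper `dynamic_insert` and the sample rows are hypothetical and stand in for the real dept table.

```python
from collections import defaultdict


def dynamic_insert(rows):
    """Route each row to the partition named by its last column; store the rest as data."""
    partitions = defaultdict(list)
    for row in rows:
        *data, part_value = row
        partitions["deptno={}".format(part_value)].append(tuple(data))
    return dict(partitions)


# Hypothetical contents of dept, already projected as (dname, loc, deptno).
selected = [
    ("ACCOUNTING", "1700", 10),
    ("RESEARCH", "1800", 20),
    ("SALES", "1900", 30),
]

for name, rows in dynamic_insert(selected).items():
    print(name, rows)
```

This is also why the column order in the select list matters: putting deptno anywhere but last would partition the data by the wrong column.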

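With the default settings this statement fails immediately. The dynamic partition mode defaults to strict, which requires at least one partition column to be given a static value; nonstrict mode allows every partition column to be dynamic. Switch to nonstrict mode and run the insert again:

hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;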
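The rule that separates the two modes can be written down as a small check on the partition clause. This is a sketch of the rule as described here, with a hypothetical function name and input format; it does not model any other validation Hive performs.

```python
def partition_spec_allowed(spec, mode="strict"):
    """spec maps each partition column to a static value, or to None if it is dynamic."""
    if mode not in ("strict", "nonstrict"):
        raise ValueError("unknown mode: {}".format(mode))
    has_static = any(value is not None for value in spec.values())
    # strict mode insists on at least one static partition column
    return has_static or mode == "nonstrict"


# partition(deptno): fully dynamic, rejected in strict mode
print(partition_spec_allowed({"deptno": None}, mode="strict"))
# the same clause after switching to nonstrict mode
print(partition_spec_allowed({"deptno": None}, mode="nonstrict"))
# partition(day='20220401', hour): one static column, fine in strict mode
print(partition_spec_allowed({"day": "20220401", "hour": None}, mode="strict"))
```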

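3 Other settings

Dynamic partition parameters

(1) Enable dynamic partitioning (default true, enabled)

hive.exec.dynamic.partition=true

(2) The maximum total number of dynamic partitions that can be created across all nodes running the MR job. Default 1000

hive.exec.max.dynamic.partitions=1000

(3) The maximum number of dynamic partitions that can be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter must be set to more than 365; with the default of 100 the job fails.

hive.exec.max.dynamic.partitions.pernode=100

(4) The maximum number of HDFS files that can be created by the whole MR job. Default 100000

hive.exec.max.created.files=100000

(5) Whether to throw an exception when an empty partition is generated. Usually this does not need to be changed. Default false

hive.error.on.empty.partition=false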
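The interaction between the total limit and the per-node limit can be illustrated with a short Python sketch. The function `check_limits` and its input format (one set of partition values per node) are assumptions made for illustration; the defaults mirror the values listed above.

```python
def check_limits(partitions_per_node, max_total=1000, max_per_node=100):
    """partitions_per_node: one set of partition values per node. Returns a list of problems."""
    problems = []
    for node, parts in enumerate(partitions_per_node):
        if len(parts) > max_per_node:
            problems.append(
                "node {}: {} partitions exceeds per-node limit {}".format(node, len(parts), max_per_node)
            )
    # Different nodes may write to the same partition, so count distinct values.
    total = len(set().union(*partitions_per_node))
    if total > max_total:
        problems.append("{} partitions exceeds total limit {}".format(total, max_total))
    return problems


# A single node that sees one year of data: 365 distinct day values.
year = {"day{:03d}".format(i) for i in range(365)}

print(check_limits([year]))                    # per-node limit of 100 is exceeded
print(check_limits([year], max_per_node=400))  # no problems after raising it
```

In this example the total limit of 1000 is never the issue; it is the per-node default that has to be raised.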

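Hands-on example

Requirement: insert the rows of the dept table into the matching partitions of the target table dept_partition_dy, partitioned by region (the loc column).

(1) Create the target partitioned table

hive (default)> create table dept_partition_dy(id int, name string)
partitioned by (loc int) row format delimited fields terminated by '\t';

(2) Enable nonstrict mode and run the dynamic-partition insert

hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select
deptno, dname, loc from dept;

(3) Check the partitions of the target table

hive (default)> show partitions dept_partition_dy;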
