09 - Partitioned Tables in Hive

xixihaha_coder

Article link: https://blog.csdn.net/xixihaha_coder/article/details/121229539

Partitioned Tables in Hive

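Why partition

By default, a Hive select scans the entire table. As the system keeps running and the table grows, these full-table scans take longer and longer and query efficiency drops. Often, though, the data we need lives in only a small part of the table. For this reason Hive lets you declare partitions when you create a table: the table's data is stored in separate subdirectories, one per partition. A query that specifies the partition reads only the matching subdirectories and avoids a full-table scan, which makes it faster.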

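How to partition

It depends on the business requirements, but tables are usually partitioned by year, month, day, hour, region, and so on.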

Partition syntax

create table tableName(
......
......
)
partitioned by (colName colType [comment '...'], ...)
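As a concrete instance of the template above, the sketch below declares a partitioned table with a commented partition column. The table name sales_log and all of its columns are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: the names are illustrative, not from the original article.
-- The partition column (dt) is declared in "partitioned by", not in the column list.
create table if not exists sales_log(
  order_id int,
  amount double
)
partitioned by (dt string comment 'order date, e.g. 2020-05-05')
row format delimited
fields terminated by '\t';
```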

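Notes on partitions

- Hive partition names are case-insensitive. Avoid Chinese characters in them.
- A partition column is a pseudo-column, but it can still be used in queries and other operations.
- A table can have one or more partitions, and a partition can in turn contain one or more sub-partitions.
- A partition exists as a column in the table schema, and the describe command (describe tableName) shows it, but the column does not store any actual data; it only identifies the partition.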
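The points above can be checked with a short sketch. It assumes a throwaway table named t_demo, which is hypothetical and exists only for this illustration.

```sql
-- Hypothetical table used only to illustrate the notes above.
create table if not exists t_demo(id int, name string)
partitioned by (dt string);

-- The partition column is listed after the regular columns,
-- and again under the "# Partition Information" section.
describe t_demo;

-- The partition column can be selected, filtered, and grouped like an ordinary column.
select dt, count(*) as cnt
from t_demo
where dt >= '2020-05-01'
group by dt;
```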

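Purpose of partitioning

Partitioning narrows the range of data scanned when users compute statistics. A select statement can specify which partition to read.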

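What a partition really is

A partition is a subdirectory created under the table directory (or under a parent partition's directory). The directory is named field=value, for example dt=2020-05-05.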
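One way to see this directory structure is to list the table's location from the Hive CLI or Beeline. This is a sketch under two assumptions: the table layout_demo is hypothetical, and the warehouse uses the default location /user/hive/warehouse with the default database.

```sql
-- Hypothetical table used only for this illustration.
create table if not exists layout_demo(id int)
partitioned by (year string, month string);

alter table layout_demo add if not exists partition (year='2020', month='05');

-- Assumes hive.metastore.warehouse.dir is left at /user/hive/warehouse.
dfs -ls -R /user/hive/warehouse/layout_demo;
-- Expected layout, one directory level per partition column:
--   /user/hive/warehouse/layout_demo/year=2020
--   /user/hive/warehouse/layout_demo/year=2020/month=05
```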

Using partitions

Using single-level partitions

1) Create the table

create table if not exists part1(
id int,
name string,
age int
)
partitioned by (dt string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';

2) Load data

user1.txt (user2.txt has the same format)

1 user1  1
2 user2  2
3 user1  1
4 user2  2
5 user2  2
6 user1  1
7 user2  1
8 user1  2
9 user2  2

load data local inpath './root/user1.txt' into table part1 partition(dt='2020-05-05');

select * from part1;
1 user1  1  2020-05-05
2 user2  2  2020-05-05
3 user1  1  2020-05-05
4 user2  2  2020-05-05
5 user2  2  2020-05-05
6 user1  1  2020-05-05
7 user2  1  2020-05-05
8 user1  2  2020-05-05
9 user2  2  2020-05-05

load data local inpath './root/user2.txt' into table part1 partition(dt='2020-05-06');

-- Only the dt='2020-05-05' partition is read
select * from part1 where dt='2020-05-05';
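To confirm which partition each row comes from, the queries below may help. They are a sketch that reuses the part1 table from this section; INPUT__FILE__NAME is a Hive virtual column that holds the path of the file a row was read from.

```sql
-- Row count per partition; each dt value corresponds to one directory.
select dt, count(*) as cnt from part1 group by dt;

-- For rows of this partition, the file path should contain dt=2020-05-05.
select id, name, INPUT__FILE__NAME from part1 where dt='2020-05-05';
```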

Using two-level partitions

1) Create the table

create table if not exists part2(
id int,
name string,
age int
)
partitioned by (year string,month string)
row format delimited
fields terminated by '\t';

2) Load data

load data local inpath './root/user1.txt' into table part2 partition(year='2020',month='03');
load data local inpath './root/user1.txt' into table part2 partition(year='2020',month='04');
load data local inpath './root/user1.txt' into table part2 partition(year='2020',month='05');

select * from part2 where year='2020' and month='04';
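With multi-level partitions, a filter does not have to name every level. The sketch below reuses the part2 table from this section and filters on the first level only.

```sql
-- Filter on year only: every month directory under year=2020 is read.
select * from part2 where year='2020';

-- List only the partitions that match a partial specification.
show partitions part2 partition(year='2020');
```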

Using three-level partitions

1) Create the table

create table if not exists part3(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by '\t';

2) Load data

load data local inpath './root/user1.txt' into table part3 partition(year='2020',month='03',day='01');
load data local inpath './root/user1.txt' into table part3 partition(year='2020',month='04',day='02');
load data local inpath './root/user1.txt' into table part3 partition(year='2020',month='05',day='03');

select * from part3 where year='2020' and month='04' and day='02';

In Hive, partition column names are case-insensitive, but partition values are case-sensitive.

Listing partitions

show partitions tableName;
-- Example
show partitions part1;

Dropping partitions

alter table part3 drop partition(year='2020',month='05',day='03');

-- Drop several partitions at once, separated by commas
alter table part3 drop
partition(year='2020',month='05',day='03'),
partition(year='2020',month='04',day='02');

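Conclusion: dropping a partition deletes its directory (the innermost level). A parent directory is also deleted if no files remain under it; if files remain, it is kept.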
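A self-contained sketch of the drop variants follows. The table drop_demo is hypothetical and is created here only so that the statements can be tried without touching real data.

```sql
-- Hypothetical table for illustration only.
create table if not exists drop_demo(id int)
partitioned by (year string, month string, day string);

alter table drop_demo add if not exists partition (year='2020', month='05', day='03');
alter table drop_demo add if not exists partition (year='2020', month='05', day='04');

-- "if exists" makes the statement a no-op when the partition is absent.
alter table drop_demo drop if exists partition (year='2020', month='05', day='03');

-- A partial specification drops every partition under it, here all remaining days of 2020-05.
alter table drop_demo drop if exists partition (year='2020', month='05');

show partitions drop_demo;
```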


Hive partition types in detail


Dynamic partition example

1) Create the dynamic-partition table

create table if not exists dy_part1(
id int,
name string,
gender string,
age int,
academy string
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
;

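2) Load data with dynamic partitioning

Do not use the following statement: it names the partition value explicitly, so it is a static load, not a dynamic one.

load data local inpath './root/user1.txt' into table dy_part1 partition(dt='2020-05-06');

The correct approach is to load the data from another table.

Step 1: create a staging table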

create table if not exists temp_part1(
id int,
name string,
gender string,
age int,
academy string
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
;

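Note

The staging table must include the partition column of the dynamic-partition table (dt here), so that the select can supply its value.

Step 2: load data into the staging table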
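A minimal sketch of this step, assuming a tab-delimited local file with five fields per line (id, name, gender, age, academy). The file name student.txt and its path are hypothetical.

```sql
-- Hypothetical file path. temp_part1 is itself partitioned by dt,
-- so each load names a static partition value.
load data local inpath '/root/student.txt' into table temp_part1 partition(dt='2020-05-06');

-- Loading under a second dt value gives the staging table more than one partition.
load data local inpath '/root/student.txt' into table temp_part1 partition(dt='2020-05-07');
```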


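Step 3: load into the target table dynamically

insert into dy_part1 partition(dt) select id,name,gender,age,academy,dt from temp_part1;

Note: in strict mode, at least one partition column must be given a static value when loading data into a dynamic-partition table. In non-strict mode, no static value is required.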
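Because dy_part1 has a single partition column, a fully dynamic insert into it needs non-strict mode. A sketch of the session settings that would typically precede the insert:

```sql
-- Enable dynamic partitioning (already on by default in recent Hive releases).
set hive.exec.dynamic.partition=true;
-- Allow all partition columns to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column must be the last column in the select list.
insert into table dy_part1 partition(dt)
select id, name, gender, age, academy, dt from temp_part1;

show partitions dy_part1;
```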

Mixed partition example
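A mixed insert gives some partition columns a static value and leaves the rest dynamic. The sketch below is only an illustration of that idea; the tables mix_src and mix_part are hypothetical.

```sql
-- Hypothetical tables for illustration only.
create table if not exists mix_src(id int, name string, month string);
create table if not exists mix_part(id int, name string)
partitioned by (year string, month string);

set hive.exec.dynamic.partition=true;

-- year is static (fixed value); month is dynamic (taken from the last select column).
-- Static partition columns must come before dynamic ones in the partition clause.
insert into table mix_part partition(year='2020', month)
select id, name, month from mix_src;
```

Since one partition column is static, this form also satisfies strict mode.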


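Notes on partitioned tables

1. Hive partition columns are declared outside the table's regular columns. A partition column is a pseudo-column, but it can be used to filter queries.
2. Avoid Chinese characters in partition columns.
3. Dynamic partitioning is generally not recommended, because it runs MapReduce jobs to query the data; with too many partitions, the NameNode and ResourceManager can become performance bottlenecks. Estimate the number of partitions as well as you can before using dynamic partitioning.
4. Changes to partition properties can modify both the metadata and the data on HDFS.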
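For point 3, Hive provides limits that cap how many partitions a single dynamic insert may create. The sketch below shows them at their default values, together with a query that estimates the partition count in advance; it assumes the temp_part1 staging table from the dynamic partition example.

```sql
-- Upper bound on dynamic partitions created by one statement (default 1000).
set hive.exec.max.dynamic.partitions=1000;
-- Upper bound per mapper or reducer node (default 100).
set hive.exec.max.dynamic.partitions.pernode=100;

-- Estimate how many partitions a dynamic insert from temp_part1 would create.
select count(distinct dt) from temp_part1;
```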

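Differences between Hive partitions and MySQL partitions

MySQL partitions on a column that is part of the table, whereas Hive partitions on a column declared outside the table's regular columns.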
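The difference shows up directly in the DDL. Both tables below are hypothetical; the first statement is MySQL syntax and the second is HiveQL.

```sql
-- MySQL: the partitioning column (order_year) is an ordinary column of the table.
create table orders_mysql (
  id int,
  order_year int
)
partition by range (order_year) (
  partition p2020 values less than (2021),
  partition pmax values less than maxvalue
);

-- Hive: the partition column is declared only in "partitioned by".
-- Listing it in the column list as well is rejected as a repeated column.
create table orders_hive (
  id int
)
partitioned by (order_year int);
```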
