Hive Partitioning and Bucketing

xiaoxiao______

Last modified 2022-11-10 16:10:02

First published 2020-10-14 21:38:32

Article link: https://blog.csdn.net/xiaoxiao______/article/details/109079002

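Partitioning

Introduction to partitions

Why partition

A Hive SELECT query normally scans the entire table. As the system keeps running and the table grows, full-table scans take more and more time and queries become inefficient. Often the data we need lives in only a part of the table. For this reason Hive lets you declare partitions when creating a table: the table's data is stored in separate subdirectories, and each subdirectory corresponds to one partition. A query can then name the partitions it needs, which avoids a full-table scan and speeds up the query.

How to partition

It depends on the business requirements, but tables are usually partitioned by year, month, day, hour, region, and so on.

Partition syntax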

create table tableName(
.......
.......
)
partitioned by (colName colType [comment '...'],...)

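Notes on partitions

Hive partition names are case-insensitive and do not support Chinese characters.
A Hive partition column is a pseudo-column, but it can be used like a normal column when writing HQL.
A table can have multiple levels of partitions.
A partition appears in the table schema as a column, and the describe command shows it, but the column does not hold actual data in the data files; it only identifies the partition.

Purpose of partitioning

Partitioning lets users narrow the range of data scanned when computing statistics. A select can name the partition to work on, which speeds up the query and saves resources.

How partitions are stored

A partition is a directory created under the table directory (or under the directory of a parent partition). The directory name has the form column=value (for example, year=2020).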
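The naming rule above can be sketched in a few lines of Python. This is only an illustration of the directory layout, not Hive's implementation: the function name partition_path and the table directory in the example are assumptions, and real Hive additionally escapes special characters in values.

```python
def partition_path(table_dir, spec):
    """Build the directory for one partition.

    table_dir: directory of the table (a hypothetical path in the example).
    spec: list of (column, value) pairs, outermost partition level first.
    """
    # Column names are case-insensitive, so normalise them to lower case;
    # values are case-sensitive and are kept exactly as given.
    levels = [f"{column.lower()}={value}" for column, value in spec]
    return "/".join([table_dir.rstrip("/")] + levels)


if __name__ == "__main__":
    # Assumed table directory, for illustration only.
    print(partition_path("/user/hive/warehouse/mydb.db/part3",
                         [("year", "2020"), ("month", "05"), ("day", "01")]))
    # prints /user/hive/warehouse/mydb.db/part3/year=2020/month=05/day=01
```

Each extra partition level simply adds one more nested directory, which is why a query that filters on the outer levels can skip whole subtrees.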

Partition examples

Using a single-level partition

1) Create the table

create table if not exists part1(
id int,
name string,
age int
)
partitioned by (dt string)
row format delimited 
fields terminated by '\t'
lines terminated by '\n';

2) Load data

load data local inpath './data/user1.txt' into table part1 partition(dt='2020-05-05');
load data local inpath './data/user2.txt' into table part1 partition(dt='2020-05-06');

Using a two-level partition

1) Create the table

create table if not exists part2(
id int,
name string,
age int
)
partitioned by (year string,month string)
row format delimited fields terminated by '\t';

2) Load data

load data local inpath './data/user1.txt' into table part2 partition(year='2020',month='03');
-- The value may also be written unquoted or in double quotes; quoting is the safer habit for values with a leading zero.
load data local inpath './data/user2.txt' into table part2 partition(year='2020',month=04);
load data local inpath './data/user2.txt' into table part2 partition(year='2020',month="05");

Using a three-level partition

1) Create the table

create table if not exists part3(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited 
fields terminated by '\t';

2) Load data

load data local inpath './data/user1.txt' into table part3 partition(year='2020',month='05',day='01');

load data local inpath './data/user2.txt' into table part3 partition(year='2019',month='12',day='31');

Testing case sensitivity

In Hive, partition column names are case-insensitive, but partition values are case-sensitive. We can test this.

1) Create the table

create table if not exists part4(
id int,
name string
)
partitioned by (year string,month string,DAY string)
row format delimited fields terminated by ','
;

2) Load data

-- Test the case of the column name: it is not case-sensitive (DAY, DAy, and day are the same column).

load data local inpath './data/user1.txt' into table part4 partition(year='2018',month='03',DAy='21');

load data local inpath './data/user2.txt' into table part4 partition(year='2018',month='03',day='AA');
load data local inpath './data/user2.txt' into table part4 partition(year='2018',month='03',day='aa');

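-- Test the case of the value: it is case-sensitive ('AA' and 'aa' are two different partitions).

Viewing partitions

Syntax:
	show partitions tableName
Example:
	show partitions part4;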

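Modifying partitions

Modifying a partition means changing the location that the partition values map to. Note that the HDFS path after location must be written as a full path.

-- Incorrect usage (the location is not a full path)
alter table part3 partition(year='2019',month='10',day='23') set location 
'/user/hive/warehouse/mydb1.db/part1/dt=2018-03-21';

-- Correct usage: a full path that includes the hdfs:// scheme and the NameNode address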

alter table part3 partition(year='2020',month='05',day='01') set location 
'hdfs://bd001:8020/user/hive/warehouse/mydb.db/part1/dt=2020-05-05';

Adding partitions

1) Add empty partitions

-- Add one partition
alter table part3 add partition(year='2020',month='05',day='02');
-- Add several partitions
alter table part3 add partition(year='2020',month='05',day='03') partition(year='2020',month='05',day='04') .....;

2) Add a partition with data

alter table part3 add partition(year='2020',month='05',day='05') location '/user/hive/warehouse/mydb.db/part1/dt=2020-05-06';

3) Add several partitions with data

alter table part3 add 
partition(year='2020',month='05',day='06') location '/user/hive/warehouse/mydb.db/part1/dt=2020-05-05'
partition(year='2020',month='05',day='07') location '/user/hive/warehouse/mydb.db/part1/dt=2020-05-06';

Dropping partitions

1) Drop a single partition

alter table part3 drop partition(year='2020',month='05',day='07');

2) Drop several partitions

alter table part3 drop partition(year='2020',month='05',day='05'),partition(year='2020',month='05',day='06');

Test: what happens when all partitions of a partitioned table are dropped

create table if not exists part10(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited 
fields terminated by '\t';

load data local inpath './data/user1.txt' overwrite into table part10
partition(year='2020',month='05',day='06');
load data local inpath './data/user2.txt' overwrite into table part10
partition(year='2020',month='05',day='07');

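-- Drop the partitions:
alter table part10 drop
partition(year='2020',month='05',day='06'),
partition(year='2020',month='05',day='07');

Note: for a partitioned table created with the defaults, dropping all of its partitions does not delete the table directory.

Test 2: use the location keyword to specify where each partition is stored

alter table part10 add partition(year='2020',month='05',day='08') location '/test/a/b';
alter table part10 add partition(year='2020',month='05',day='09') location '/test/a/c';

alter table part10 drop
partition(year='2020',month='05',day='08'),
partition(year='2020',month='05',day='09');

Conclusion: on drop, the partition's own (innermost) directory is deleted. A parent directory is also deleted if it contains no files; if it still contains files, it is kept.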

Practice

-- Single-level partition
create table part1(
id int,
name string,
age int
)
partitioned by (datee string)
row format delimited 
fields terminated by '\t'
;

select * from part1; -- the table has no data yet

-- Load data
load data local inpath '/root/hivedata/user1' into table part1 partition (datee="2020-10-12");

select * from part1; -- the table now has data in one partition

-- Load into a second partition
load data local inpath '/root/hivedata/user2' into table part1 partition (datee="2020-10-13");

select * from part1; -- part1 now has two partitions

-- Query the data of a single partition
select * from part1 where datee="2020-10-13";

-- View partitions
show partitions part1;

-- Two-level partition
-- Create the table
create table if not exists part2(
id int,
name string,
age int
)
partitioned by (year string,month string)
row format delimited 
fields terminated by '\t'
;

-- Load data
load data local inpath '/root/hivedata/user2' into table part2 partition (year="2020",month="05");

select * from part2; -- returns data

select * from part2 where year="2020";

-- Three-level partition
create table part3(
id int ,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited 
fields terminated by '\t'
;

-- Load data
load data local inpath '/root/hivedata/user2' into table part3 partition (year="2020",month="09",day="22");

-- Query
select * from part3 where month="09";

show partitions part3;

-- Modify a partition (this changes the location the partition maps to, that is, which files back the partition)

alter table part3 partition (year="2020",month="09",day="22") set location 'hdfs://qianfeng01:8020/user/hive/warehouse/myhive.db/part1/datee=2020-10-12';

-- Add a partition
alter table part3 add partition(year="2019",month="10",day="08");

-- Add several partitions (without data)
alter table part3 add partition(year="2019",month="04",day="08") partition (year="2018",month="10",day="08");

-- Add a partition (with data)
alter table myhive.part3 add partition(year="2017",month="05",day="25") location 'hdfs://qianfeng01:8020/user/hive/warehouse/myhive.db/part1/datee=2020-10-12';

show partitions part3;

-- Drop a partition
alter table myhive.part3 drop partition (year="2017",month="05",day="25");

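Partition types in detail

Kinds of partitions

Static partition: a data file is loaded directly into a partition that is named explicitly.
Dynamic partition: the partitions are not known in advance; they are created from the values found in the data (the partition directories are not specified by hand but are assigned automatically from the data values).
Mixed partition: some partition columns are static and some are dynamic.

Partition property settings

hive.exec.dynamic.partition=true: whether dynamic partitioning is enabled
hive.exec.dynamic.partition.mode=strict/nonstrict: strict mode or non-strict mode
hive.exec.max.dynamic.partitions=1000: maximum total number of dynamic partitions that may be created
hive.exec.max.dynamic.partitions.pernode=100: maximum number of dynamic partitions that may be created on each node that runs the MR job

Dynamic partition example

1) Create the dynamic partition table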

create table dy_part1(
sid int,
name string,
gender string,
age int,
academy string
)
partitioned by (dt string)
row format delimited fields terminated by ','
;

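The create table statement for a dynamic partition table is the same as for a static one.

2) Load data with dynamic partitioning

Do not use the following form: it names the partition explicitly, so it is a static load, not a dynamic one.

load data local inpath '/hivedata/user.txt' into table dy_part1 partition(dt=
'2020-05-06');

Correct approach: load the data from another table.

Step 1: create a staging table: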

create table temp_part1(
sid int,
name string,
gender string,
age int,
academy string,
dt string
)
row format delimited 
fields terminated by ','
;

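Note: the staging table must contain the partition column of the dynamic partition table.

Step 2: load the data into the staging table (the contents of the data file are shown first):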

95001,李勇,男,20,CS,2017-8-31
95002,刘晨,女,19,IS,2017-8-31
95003,王敏,女,22,MA,2017-8-31
95004,张立,男,19,IS,2017-8-31
95005,刘刚,男,18,MA,2018-8-31
95006,孙庆,男,23,CS,2018-8-31
95007,易思玲,女,19,MA,2018-8-31
95008,李娜,女,18,CS,2018-8-31
95009,梦圆圆,女,18,MA,2018-8-31
95010,孔小涛,男,19,CS,2017-8-31
95011,包小柏,男,18,MA,2019-8-31
95012,孙花,女,20,CS,2017-8-31
95013,冯伟,男,21,CS,2019-8-31
95014,王小丽,女,19,CS,2017-8-31
95015,王君,男,18,MA,2019-8-31
95016,钱国,男,21,MA,2019-8-31
95017,王风娟,女,18,IS,2019-8-31
95018,王一,女,19,IS,2019-8-31
95019,邢小丽,女,19,IS,2018-8-31
95020,赵钱,男,21,IS,2019-8-31
95021,周二,男,17,MA,2018-8-31
95022,郑明,男,20,MA,2018-8-31

load data local inpath './data/student2.txt' into table temp_part1;

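Step 3: load into the partitioned table dynamically

-- dt is fully dynamic here, so non-strict mode is required
set hive.exec.dynamic.partition.mode=nonstrict;
insert into dy_part1 partition(dt) select sid,name,gender,age,academy,dt from 
temp_part1;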

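Note: in strict mode, at least one partition column must be given a static value when inserting into a dynamic partition table. In non-strict mode, no static value is required. The dynamic partition columns are taken from the last columns of the select list (here, the last column of the staging table); with multiple partition levels they are matched by position, in order.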
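As a rough model of the positional matching described in the note, the Python sketch below groups selected rows by the partition directory they would be written to. It is an illustration under simplifying assumptions, not Hive's code: route_rows is a hypothetical name, and the rows are a few records taken from the sample data above.

```python
from collections import defaultdict


def route_rows(rows, partition_columns):
    """Group select-list rows by the partition directory they would land in.

    rows: tuples whose trailing values are the dynamic partition values.
    partition_columns: partition column names, in declaration order.
    """
    n = len(partition_columns)
    partitions = defaultdict(list)
    for row in rows:
        data, values = row[:len(row) - n], row[len(row) - n:]
        # Matching is by position: the i-th trailing value goes to the
        # i-th partition column, whatever the select-list column is called.
        name = "/".join(f"{c}={v}" for c, v in zip(partition_columns, values))
        partitions[name].append(data)
    return dict(partitions)


if __name__ == "__main__":
    sample = [
        (95001, "李勇", "男", 20, "CS", "2017-8-31"),
        (95005, "刘刚", "男", 18, "MA", "2018-8-31"),
        (95011, "包小柏", "男", 18, "MA", "2019-8-31"),
        (95002, "刘晨", "女", 19, "IS", "2017-8-31"),
    ]
    for name, data in sorted(route_rows(sample, ["dt"]).items()):
        print(name, len(data))
```

Every distinct value of the trailing column produces its own directory, which is why the number of partitions created by a dynamic insert depends entirely on the data.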

Mixed partition example

1) Create a partitioned table:

create table dy_part2(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ','
;

2) Create a staging table

create table temp_part2(
id int,
name string,
year string,
month string,
day string
)
row format delimited fields terminated by ','
;

The data is as follows:
1,廉德枫,2019,06,25
2,刘浩(小),2019,06,25
3,王鑫,2019,06,25
5,张三,2019,06,26
6,张小三,2019,06,26
7,王小四,2019,06,27
8,夏流,2019,06,27

load data local inpath './data/temp_part2.txt' into table temp_part2;

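3) Load data into the partitioned table

- Incorrect usage (select * also returns year, so the select list has one column too many):
	insert into dy_part2 partition (year='2019',month,day) 
	select * from temp_part2;

- Correct usage (select only the data columns, followed by the dynamic partition columns):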
insert into dy_part2 partition (year='2020',month,day) 
select id,name,month,day from temp_part2;
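To see why the first form fails and the second works, here is a small Python sketch of how a mixed partition spec could be resolved for one selected row. It is a simplified model, not Hive's implementation; resolve_insert and its parameters are hypothetical names introduced for illustration.

```python
def resolve_insert(partition_spec, table_width, select_row, strict=True):
    """Return (data_columns, partition_directory) for one selected row.

    partition_spec: list of (column, static_value_or_None), in declaration order.
    table_width: number of non-partition columns in the target table.
    """
    dynamic = [col for col, value in partition_spec if value is None]
    if strict and len(dynamic) == len(partition_spec):
        raise ValueError("strict mode needs at least one static partition value")
    # Only the dynamic partition columns are read from the end of the row;
    # static ones come from the partition clause itself.
    split = len(select_row) - len(dynamic)
    data, dynamic_values = select_row[:split], select_row[split:]
    if len(data) != table_width:
        raise ValueError(f"expected {table_width} data columns, got {len(data)}")
    values = iter(dynamic_values)
    levels = [f"{col}={static if static is not None else next(values)}"
              for col, static in partition_spec]
    return data, "/".join(levels)


if __name__ == "__main__":
    spec = [("year", "2020"), ("month", None), ("day", None)]
    # Mirrors: select id,name,month,day from temp_part2
    print(resolve_insert(spec, 2, (1, "廉德枫", "06", "25")))
    # Mirrors: select * from temp_part2 (year is selected as well)
    try:
        resolve_insert(spec, 2, (1, "廉德枫", "2019", "06", "25"))
    except ValueError as err:
        print("rejected:", err)
```

Note that the static value in the partition clause decides the year directory, so the rows end up under year=2020 even though the staging data says 2019.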

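4) Notes on partitioned tables

Hive partitions use columns that are outside the table's data columns. A partition column is a pseudo-column, but it can be used to filter queries.
Chinese characters are not recommended in partition columns.
Dynamic partitioning is generally not recommended, because it runs a MapReduce job to query the data, and too many partitions can turn the NameNode and the ResourceManager into performance bottlenecks. Try to know the number of partitions in advance before using dynamic partitioning.
Changes to partition properties can affect both the metadata and the data on HDFS.

5) Difference between Hive partitions and MySQL partitions

MySQL partitions on columns that are part of the table's data; Hive partitions on columns outside the table's data.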

Bucketing

Overview of bucketing

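Why bucket

Partitioning can leave some partitions with far too much data and others with very little. Bucketing is another technique for splitting a data set into several parts (data files).
Partitioning and bucketing are both ways to manage data at a finer granularity. When a single partition or table keeps growing and partitioning can no longer divide the data finely enough, bucketing is used to divide and manage it further.
Bucketing is declared in the create table statement with the clustered by clause: CLUSTERED BY (col_name, col_name, …)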

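How bucketing works

The principle is the same as HashPartitioner in MapReduce.

MapReduce: takes the hash of the key modulo the number of reducers.
Hive: takes the hash of the bucketing column modulo the number of buckets. The table is bucketed on one column: for each record, the hash of its bucketing column value modulo the bucket count decides which bucket it goes into. For example, with 5 buckets, a record whose hash modulo 5 equals 1 is stored in the file that corresponds to remainder 1.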
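The hash-then-modulo rule can be sketched as follows. This is a minimal model that assumes an integer bucketing column whose hash code is the integer itself, and a non-negative remainder obtained by masking the sign bit, as HashPartitioner does; newer Hive versions may use a different hash function, so treat the exact bucket numbers as illustrative. The names bucket_for and assign_buckets are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def bucket_for(key_hash, num_buckets):
    """Map a 32-bit hash code to a bucket index in [0, num_buckets)."""
    # Masking with INT_MAX clears the sign bit so the remainder is never negative.
    return (key_hash & INT_MAX) % num_buckets


def assign_buckets(keys, num_buckets):
    """Group integer keys by bucket, assuming hash(int) is the int itself."""
    buckets = {b: [] for b in range(num_buckets)}
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return buckets


if __name__ == "__main__":
    # Student numbers in the style of the sample data, spread over 4 buckets.
    for bucket, keys in assign_buckets(range(95001, 95009), 4).items():
        print(bucket, keys)
```

Because the bucket depends only on the key's hash, rows with the same key always land in the same bucket file, which is what makes sampling and bucket-wise joins possible.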

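Purpose of bucketing

Bucketing preserves a hash-based layout of the data (the rows have already been hashed on the bucketing column), which queries can take advantage of.
Bucketed tables are well suited to sampling.
Sampling is more efficient. With large data sets it is very convenient to be able to test a query on a portion of the data.
Joins can run more efficiently in MR.
When two tables are bucketed on the same column, joining them on that column can be done efficiently on the map side. Because equal values land in corresponding buckets, Hive only needs to join the buckets that hold the same column values, which is very efficient.

How buckets are stored

A bucket is a file on HDFS, whereas a partition is a directory on HDFS. Bucketing divides the data at a finer granularity.

Creating a bucketed table

Example

Step 1: create the table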

drop table if exists student;
create table student(
sno int,
name string,
sex string,
age int,
academy string
)
clustered by (sno) sorted by (age desc) into 4 buckets
row format delimited 
fields terminated by ','
;

-- The bucketing column and the sort column do not have to be the same.

Step 2: prepare the data (create a staging table)

create table temp_student(
sno int,
name string,
sex string,
age int,
academy string
)
row format delimited 
fields terminated by ','
;

load data local inpath './data/students.txt' into table temp_student;

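Step 3: query the staging table and insert into the bucketed table

insert into table student
select * from temp_student
distribute by(sno) 
sort by (age desc)
;
-- or
insert overwrite table student
select * from temp_student
distribute by(sno) 
sort by (age desc)
;

Note: never load data into a bucketed table with load or by uploading files directly; the data will not be bucketed.

Notes

Version 2.1.1 enforces bucketing, so changing the number of reducers by hand does not change the final number of files (the file count is determined by the bucket count).
1. In version 2.1.1, bucketing and sorting are enforced internally.
   That is, the proper form includes distribute by (bucketing column) [sort by sort column], but the data is bucketed and sorted even without them.
2. With insert into, the table keyword is optional. With insert overwrite, the table keyword is required.
3. Because bucketing is enforced, changing mapreduce.job.reduces does not affect the bucket files, but it does affect the number of reduce tasks actually run: the actual number is the factor closest to mapreduce.job.reduces, and for a prime number it is the number itself.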

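On older versions, such as 1.2.1, the following properties can be set:

1. Set the number of reducers equal to the number of buckets:
set mapreduce.job.reduces=4;
2. If the data volume is small, MR local mode can be used:
set hive.exec.mode.local.auto=true;
3. Enforce bucketing (default is false):
set hive.enforce.bucketing=true;
4. Enforce sorting:
set hive.enforce.sorting=true;

Querying a bucketed table

Syntax:

tablesample(bucket x out of y on sno)
x: the bucket to query (sampling starts from this bucket); x must not be greater than y
y: the total number of buckets the data is divided into for sampling. From version 2.1.1 on, y can be chosen freely; it does not have to equal the bucket count given when the table was created.

In older versions, such as 1.2.1, y must be a factor or a multiple of the table's bucket count.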

Querying everything

select * from student;
select * from student tablesample(bucket 1 out of 1);

Querying specific buckets

-- Example 1: query the whole table
select * from student1;
select * from student1 tablesample(bucket 1 out of 1 on sno);
-- Example 2: query bucket 3 of 8
select * from student1 tablesample(bucket 3 out of 8 on sno);
-- Example 3: query buckets 3 and 5 of 8
select * from student1 tablesample(bucket 3 out of 8 on sno)
union
select * from student1 tablesample(bucket 5 out of 8 on sno);
-- Example 4: query buckets 2 and 6 of 8
select * from student1 tablesample(bucket 2 out of 4 on sno);

For example, if a table has 32 buckets in total, tablesample(bucket 3 out of 16) reads 32/16 = 2 buckets: bucket 3 and bucket 3+16 = 19.

-- Example 5: query bucket 2 of 7
select * from student1 tablesample(bucket 2 out of 7 on sno);
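The arithmetic in the 32-bucket example can be checked with a short Python sketch. It covers only the case where y is a factor of the table's bucket count, and it adds a row-level test for arbitrary y that assumes sampling selects rows whose masked hash modulo y equals x-1. Both function names are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def sampled_buckets(total_buckets, x, y):
    """List the 1-based buckets read by tablesample(bucket x out of y)."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if total_buckets % y != 0:
        raise ValueError("this sketch only covers y being a factor of the bucket count")
    # Start at bucket x and step by y until the last bucket is passed.
    return list(range(x, total_buckets + 1, y))


def row_selected(key_hash, x, y):
    """Row-level view of the same rule, usable for any y (assumed semantics)."""
    return (key_hash & INT_MAX) % y == x - 1


if __name__ == "__main__":
    print(sampled_buckets(32, 3, 16))  # prints [3, 19]
    print(sampled_buckets(8, 2, 4))    # prints [2, 6]
```

The second call matches example 4: on a table with 8 buckets, bucket 2 out of 4 reads buckets 2 and 6.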

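Other queries

Return three rows
    select * from student limit 3;
    select * from student tablesample(3 rows);
Return a percentage of the data (measured by data size, up to the row where that percentage of the size falls)
    select * from student tablesample(13 percent);

Return a fixed size of data (up to the row where that size falls); unit suffixes such as B, K, M, G
    select * from student tablesample(68b);
Pick three random rows
    select * from student order by rand() limit 3;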

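Summary:

Table definition

clustered by (id): bucket the table on a column of the table.
sorted by (id asc|desc): declare the sort order, that is, the order the data is expected to be in.

Loading data

cluster by (id)   
-- hash on this column in getPartition, and also sort by the same column in ascending order
-- equivalent to distribute by (id) sort by (id)

distribute by (id)    -- the column that getPartition hashes on
sort by (name asc | desc) -- the sort column

-- Difference: with distribute by, the getPartition column and the sort column can be specified separately

When loading data:
insert overwrite table buc3
select id,name,age from temp_buc1
distribute by (id) sort by (id asc)
;
has the same effect as the following statement: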
insert overwrite table buc4
select id,name,age from temp_buc1
cluster by (id)
;
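The equivalence between the two statements can be modelled in Python. This is a toy model that assumes integer keys and hash-modulo routing to reducers; distribute_and_sort and cluster are hypothetical names, and the point is only that sorting happens inside each reducer rather than globally.

```python
def distribute_and_sort(rows, dist_key, sort_key, num_reducers, descending=False):
    """Model of: distribute by dist_key sort by sort_key [desc]."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # distribute by: rows with the same key always reach the same reducer.
        reducers[(row[dist_key] & 0x7FFFFFFF) % num_reducers].append(row)
    # sort by: each reducer sorts only its own rows; there is no global order.
    return [sorted(r, key=lambda row: row[sort_key], reverse=descending)
            for r in reducers]


def cluster(rows, key, num_reducers):
    """Model of: cluster by key (same column for both steps, ascending only)."""
    return distribute_and_sort(rows, key, key, num_reducers)


if __name__ == "__main__":
    rows = [{"id": 5}, {"id": 2}, {"id": 3}, {"id": 1}]
    print([[r["id"] for r in part] for part in cluster(rows, "id", 2)])
    # prints [[2], [1, 3, 5]]
```

When the sort column differs from the distribution column, or a descending order is needed, only the distribute by ... sort by ... form can express it.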

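1. When creating a table, the partitioning and bucketing keywords both end in "ed":
(1) partitioning uses partitioned by
(2) bucketing uses clustered by ... sorted by ...
2. When loading data dynamically:
(1) partitioning uses partition(colname='static value',colname)
(2) bucketing uses distribute by(colname) sort by(colname)
(3) when the bucketing column and the sort column are the same and the order is ascending, cluster by(colname) can be used
3. Essential differences:
(1) partitioning manages table data by splitting it into subdirectories
(2) bucketing manages data by splitting the large file in a directory into several smaller files (bucket files)
(3) both are optimization techniques, but bucketing is finer-grained than partitioning
(4) partitions use columns outside the table's data columns; buckets use columns of the table

Notes

Partitioning uses columns outside the table's data columns; bucketing uses the table's own columns.
Bucketing manages data at a finer granularity and is used mostly for sampling and joins.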
