Hive: Partitioned Tables and Bucketed Tables

Tonystark_sunshine

Last modified: 2022-09-13 20:47:10

Category: Hive. Tags: hive, hadoop, big data

First published: 2022-09-06 20:31:06

Article link: https://blog.csdn.net/Tonystark_lz/article/details/126732714

Partitioned tables

Basic operations

Creating a partitioned table

hive (default)> 
create table dept_partition(
    deptno int,    -- department number
    dname string,  -- department name
    loc string     -- department location
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Loading data into the partitioned table

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition 
partition(day='20220401');
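
Once the file is loaded, a query that filters on the partition column only has to read that partition's directory. A minimal sketch, assuming the load above succeeded:

```sql
-- Read only the day=20220401 partition of dept_partition.
select deptno, dname, loc, day
from dept_partition
where day = '20220401';
```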

Listing the partitions of the table

show partitions dept_partition;

Adding partitions

(1) Add a single partition

hive (default)> 
alter table dept_partition 
add partition(day='20220404');

(2) Add several partitions at once (no comma between the partition specs)

hive (default)> 
alter table dept_partition 
add partition(day='20220405') partition(day='20220406');

Dropping partitions

(1) Drop a single partition

hive (default)> 
alter table dept_partition 
drop partition (day='20220406');

(2) Drop several partitions at once (the partition specs must be separated by commas)

hive (default)> 
alter table dept_partition 
drop partition (day='20220404'), partition(day='20220405');
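
Adding a partition that is already registered raises an error, which is awkward in scripts that may be re-run. As a sketch, the optional `if not exists` and `if exists` clauses make the add and drop statements safe to repeat:

```sql
-- Add the partition only when it is not already registered.
alter table dept_partition
add if not exists partition(day='20220404');

-- Drop the partition only when it is present.
alter table dept_partition
drop if exists partition(day='20220404');
```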

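Two-level partitioning

If the log data for a single day is also very large, how can it be split further?

Creating a table with two-level partitions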

create table dept_partition2(
    deptno int,    -- department number
    dname string,  -- department name
    loc string     -- department location
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

Loading data into the two-level partitioned table

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition2 
partition(day='20220401', hour='12');
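
With two partition columns, a filter can narrow the read down to a single hour of a single day. A small sketch, assuming the load above has run:

```sql
-- Reads only the day=20220401/hour=12 directory.
select deptno, dname, loc
from dept_partition2
where day = '20220401' and hour = '12';

-- Lists the registered day/hour combinations.
show partitions dept_partition2;
```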

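Repairing partition metadata

If files exist under the matching HDFS directory (for example, uploaded with dfs -put) but the metastore has no metadata for that partition, queries will not return the data. Either repair the metadata or add the partition explicitly.

-- Repair the metadata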
msck repair table dept_partition2;

-- Add the partition explicitly
alter table dept_partition2 add partition(day='20220401',hour='14');
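
The situation described above can be reproduced from the Hive CLI roughly as follows. This is a sketch under assumptions: the warehouse path `/user/hive/warehouse/dept_partition2` is the default location and may differ on your cluster, and `hour=13` is just an unused value picked for illustration.

```sql
-- Assumption: the table lives under the default warehouse directory.
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20220401/hour=13;
dfs -put /opt/module/hive/datas/dept_20220401.log /user/hive/warehouse/dept_partition2/day=20220401/hour=13;

-- The files exist, but the metastore does not know the partition yet,
-- so this query returns no rows.
select * from dept_partition2 where day='20220401' and hour='13';

-- Scan the table directory and register partition directories
-- that are missing from the metastore.
msck repair table dept_partition2;

-- The same query should now return the uploaded rows.
select * from dept_partition2 where day='20220401' and hour='13';
```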

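Dynamic partitioning

-- Switch to non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
-- Dynamic partitioning: partition(day) names the partition column, and its value is taken from the last column of the select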
insert into dept_partition partition(day)
select 10,'abc','efs','20220608'
union
select 10,'abc','efs','20220609';

-- The equivalent static-partition insert
insert into dept_partition partition(day='20220607')
select 10,'abc','efs';
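
To check what the inserts above produced, one option is to list the partitions and read the new ones back. A minimal sketch, assuming both inserts completed:

```sql
-- The dynamic insert should have created day=20220608 and day=20220609,
-- and the static insert day=20220607.
show partitions dept_partition;

select deptno, dname, loc, day
from dept_partition
where day in ('20220607', '20220608', '20220609');
```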

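Related parameters

-- Enable dynamic partitioning (default: true). Without "=value", set only prints the current setting.
set hive.exec.dynamic.partition;
-- Dynamic partition mode. The default, strict, requires at least one partition column to be static; nonstrict allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
-- Maximum number of dynamic partitions that can be created in total across all nodes running the MapReduce job. Default: 1000.
set hive.exec.max.dynamic.partitions;
-- Maximum number of dynamic partitions that can be created on each node running the MapReduce job. Set it according to the actual data.
-- For example, if the source data covers a full year, the day column has 365 distinct values, so this parameter must be set above 365; with the default of 100 the job fails.
set hive.exec.max.dynamic.partitions.pernode;
-- Maximum number of HDFS files the whole MapReduce job can create. Default: 100000.
set hive.exec.max.created.files;
-- Whether to throw an exception when an empty partition is produced. Usually left unchanged. Default: false.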
set hive.error.on.empty.partition=false;
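
As an illustration of these settings, the sketch below raises the per-node limit for the 365-day case mentioned in the comments, then runs an insert that strict mode accepts because the leading partition column is static. The values 400 and '15' are arbitrary choices for the example, not recommendations.

```sql
-- Allow more than 365 dynamic partitions per node (still below the total limit of 1000).
set hive.exec.max.dynamic.partitions.pernode=400;

-- Strict mode: at least one partition column must be static.
set hive.exec.dynamic.partition.mode=strict;

-- day is static and hour is dynamic, so this passes the strict-mode check.
-- The hour value comes from the last column of the select.
insert into dept_partition2 partition(day='20220401', hour)
select deptno, dname, loc, '15'
from dept_partition
where day='20220401';
```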

Bucketed tables

Creating a plain bucketed table (rows within a bucket are not sorted)

create table stu_buck(
    id int, 
    name string
)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

Loading data

load data local inpath '/opt/module/hive/datas/student.txt' 
into table stu_buck;

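Bucketing rule:
As the resulting bucket files show, Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets to decide which bucket each record is stored in.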
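
The rule can be approximated in a query with the built-in `hash()` and `pmod()` functions. This is only a sketch: whether `hash()` matches the hash Hive uses internally when writing buckets depends on the Hive version and the table's bucketing version, so treat `expected_bucket` as illustrative.

```sql
-- Hash of the bucketing column modulo the number of buckets (4 for stu_buck).
select id, name, pmod(hash(id), 4) as expected_bucket
from stu_buck
order by expected_bucket, id;

-- Read the rows of one bucket out of the four, sampled on the bucketing column.
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```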

Creating a sorted bucketed table (rows within a bucket are sorted)

create table stu_buck_sort(
    id int, 
    name string
)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by '\t';
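
One way to populate the sorted table is an insert from the plain bucketed table, which lets Hive distribute and sort the rows while writing. A minimal sketch, assuming stu_buck has already been loaded as shown earlier:

```sql
-- Copy the rows; Hive writes them into 4 buckets by id, sorted by id within each bucket.
insert into table stu_buck_sort
select id, name
from stu_buck;

-- Sample one bucket to inspect the result.
select * from stu_buck_sort tablesample(bucket 1 out of 4 on id);
```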