Hive Partitions

甲家家

Article link: https://blog.csdn.net/qq_29329981/article/details/89365060

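1. Why partition
As a system keeps running, its data volume keeps growing. By default a Hive query scans the whole table, so much of the data it reads is irrelevant to the query, and queries become slow.
Partitioning avoids full table scans and improves query efficiency.
A partition splits a table's data into multiple subdirectories, each named after the partition column and its value (for example province=beijing).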
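2. How to partition
Tables are usually partitioned by year, month, day, region, and so on; the right choice depends on the business.
Hive partitioning differs from MySQL partitioning:

Hive partitions on a column that is not one of the table's regular columns
MySQL partitions on a column that is part of the table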

[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]

create table stu1(
name string,
age int
)
partitioned by (province string)
;

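-- Fails: stu1 is a partitioned table, so the target partition must be specified when loading data
load data local inpath '/etc/profile' into table stu1;

-- Load data into a specific partition of the partitioned table
load data local inpath '/etc/profile' into table stu1 partition (province='beijing');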

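3. Partition details
1. Hive partition names are case-insensitive
2. A partition is essentially a subdirectory created under the table's directory; the partition column is a pseudo-column and is not stored in the data files
3. A table can have one or more partitions, and a partition can itself contain one or more sub-partitions (multi-level partitioning)
4. Partition information can be queried, but the partition columns exist only in the metadata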
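
The points above can be checked from the Hive shell. The following is a minimal sketch, assuming the stu1 table and its province='beijing' partition from the earlier example already exist; the exact output depends on the installation.

```sql
-- List the partitions registered in the metastore for stu1
show partitions stu1;

-- Show the metadata of one partition, including its Location
-- (expected to be a subdirectory of the table directory)
describe formatted stu1 partition (province='beijing');

-- The partition column can be selected like a regular column,
-- even though it is not stored in the data files
select name, age, province from stu1 where province='beijing';
```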

4. Partition operations
Create a table with a single-level partition
create table if not exists day_part(
uid int,
uname string
)
partitioned by (year int)
row format delimited
fields terminated by '\t'
;

load data local inpath '/root/day_part.txt' overwrite into table day_part partition (year=2017);
load data local inpath '/root/day_part.txt' into table day_part partition (year=2018);

List partitions
show partitions day_part;

-- Specify the partition in the query so that a full table scan is not needed
select * from day_part where year=2017;

alter table day_part change column uname uname string after uid;

Create a table with two-level partitions

create table if not exists day_part1(
uid int,
uname string
)
partitioned by (year int,month int)
row format delimited
fields terminated by '\t'
;

load data local inpath '/root/day_part.txt' overwrite into table day_part1 partition (year=2019,month=04);
load data local inpath '/root/day_part.txt' overwrite into table day_part1 partition (year=2019,month=03);

select * from day_part1 where year=2019 and month=04;

Create a table with three-level partitions
create table if not exists day_part2(
uid int,
uname string
)
partitioned by (year int,month int,day int)
row format delimited
fields terminated by '\t'
;
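
The three-level table is created above but never loaded. As a minimal sketch, the same sample file can be loaded into one partition and queried back; the date values year=2019, month=4, day=19 are assumptions chosen for illustration.

```sql
-- Load the sample file into one (year, month, day) partition of day_part2
load data local inpath '/root/day_part.txt'
into table day_part2 partition (year=2019, month=4, day=19);

-- Filter on all three partition columns so that only that directory is read
select * from day_part2 where year=2019 and month=4 and day=19;

-- Each partition is listed as a path of column=value levels
show partitions day_part2;
```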

Operating on partitions:
Show partitions
show partitions day_part1;

Add partitions
alter table day_part1 add partition(year=2017,month=1);
alter table day_part1 add partition(year=2017,month=2) partition(year=2017,month=3);
Add a partition and load its data from an existing location
alter table day_part1 add partition(year=2017,month=10) location
'/user/hive/warehouse/buc1';
Change the path a partition points to
-- The path must be absolute, starting from hdfs://mini1:9000
alter table day_part1 partition(year=2017,month=10) set location
'hdfs://mini1:9000/user/hive/warehouse/log_1';

Drop a partition
alter table day_part1 drop partition (year=2017,month=1);
show partitions day_part1;

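Partition types:
Static, dynamic, and mixed partitioning
Static partitioning: the partition name is specified when adding a partition or loading data into it
Dynamic partitioning: the partition name is not known in advance when adding a partition or loading data; it is taken from the data itself
Mixed partitioning: static and dynamic partition columns are used together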

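Dynamic partitioning example:
Data in table A
uid uname year month day
1 zhangsan 2019 4 19
2 lissi 2019 4 18

Table B is a partitioned table, partitioned by year, month, and day

Query the data from table A and insert it into table B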
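
The A-to-B example can be sketched roughly as follows. The table names a and b and their column types are assumptions made for illustration; the original only describes the tables in words.

```sql
-- Hypothetical source table a (not partitioned), matching the sample rows above
create table if not exists a(
uid int,
uname string,
year int,
month int,
day int
)
row format delimited
fields terminated by '\t'
;

-- Hypothetical target table b, partitioned by year, month and day
create table if not exists b(
uid int,
uname string
)
partitioned by (year int, month int, day int)
;

-- Allow fully dynamic partitions for this session
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns come last in the select list, in partition order;
-- Hive creates one partition per distinct (year, month, day) combination
insert into table b partition (year, month, day)
select uid, uname, year, month, day
from a
;
```

With the two sample rows, this would be expected to produce two partitions, one for day 19 and one for day 18.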

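Dynamic partitioning properties:
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- Partition mode: strict or nonstrict
set hive.exec.dynamic.partition.mode=strict;
Strict mode: at least one static partition column is required
Nonstrict mode: all partition columns may be dynamic
-- Maximum total number of dynamic partitions allowed
set hive.exec.max.dynamic.partitions;
-- Maximum number of dynamic partitions that each mapper or reducer node may create
set hive.exec.max.dynamic.partitions.pernode;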

Create a temporary (staging) table
create table if not exists tmp(
uid int,
commentid bigint,
recommentid bigint,
year int,
month int,
day int
)
row format delimited
fields terminated by '\t'
;

load data local inpath '/tmp.txt' overwrite into table tmp;

Create the dynamically partitioned table
create table if not exists dyp1(
uid int,
commentid int,
recommentid int
)
partitioned by(year int,month int,day int)
row format delimited
fields terminated by '\t'
;

Load data using mixed partitioning (year is static; month and day are dynamic)
insert into table dyp1 partition(year=2011,month,day)
select uid,commentid,recommentid,month,day
from
tmp
;

Create the table
create table if not exists dyp2(
uid int,
commentid int,
recommentid int
)
partitioned by(year int,month int,day int)
row format delimited
fields terminated by '\t'
;

Load data using fully dynamic partitioning
insert into table dyp2 partition(year,month,day)
select uid,commentid,recommentid,year,month,day
from
tmp
;

Error
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

Fix
set hive.exec.dynamic.partition.mode=nonstrict;
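
Putting the fix together, a minimal sketch of the retry looks like this. Note that the property value is spelled nonstrict, as in the error message; the tables dyp2 and tmp are the ones created above.

```sql
-- Allow all partition columns to be dynamic for this session
set hive.exec.dynamic.partition.mode=nonstrict;

-- Retry the fully dynamic insert that failed in strict mode
insert into table dyp2 partition(year,month,day)
select uid,commentid,recommentid,year,month,day
from tmp
;

-- Check which partitions were created from the data
show partitions dyp2;
```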

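Hive provides a strict mode, which blocks three kinds of queries:
1. Queries on a partitioned table whose where clause does not filter on a partition column (to avoid full table scans)
-- Not allowed in strict mode
select uid,commentid from dyp2 where recommentid >1000;
2. Cartesian-product joins, that is, a join with neither an on condition nor a where condition; the following query is not allowed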
select
u1.uid,
u1.uname,
l.logintime
from
user u1
join
login l
;

The following query is allowed
select
u1.uid,
u1.uname,
l.logintime
from
user u1
join
login l
where u1.uid = l.uid
;

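3. order by queries without a limit clause; the following query is allowed because it has a limit (it also filters on a partition column, as rule 1 requires)
select
dyp2.uid,
dyp2.commentid
from dyp2
where dyp2.year=2011
order by
dyp2.uid
limit 5
;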
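
For completeness, here is a rough sketch of turning strict mode on and rewriting the first rejected query so that it passes. It assumes the classic hive.mapred.mode property; newer Hive versions may control these checks through different settings, so treat the property name as an assumption for your version. The value year=2011 is only an illustrative partition filter.

```sql
-- Enable strict mode for the current session (classic property name)
set hive.mapred.mode=strict;

-- Rule 1: add a filter on a partition column so the query
-- no longer needs to scan every partition of dyp2
select uid,commentid
from dyp2
where year=2011 and recommentid>1000
;
```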