Hive分区分桶详解

一、分区概念:分区最主要的目的:实现快速查询
为什么要创建分区:
单个表数据量越来越大的时候,在Hive select查询中一般会扫描整个表内容(暴力扫描),会消耗很多时间做没必要的工作。有时候只需要扫描表中关心的一部分数据,因此建表时引入了partition概念。
(1)、Hive的分区和mysql的分区差异:
mysql分区是将表中的字段拿来直接作为分区字段,而hive的分区则是分区字段不在表中。
(2)、怎么分区:
根据业务分区,(完全看业务场景)选取id、年、月、日、男女性别、年龄段或者是能平均将数据分到不同文件中最好,分区不好将直接导致查询结果延迟。
(3)、分区细节:
1、一个表可以拥有一个或者多个分区,每个分区以文件夹的形式单独存在表文件夹的目录下。
2、表和列名不区分大小写。
3、分区是以字段的形式在表结构中存在,通过describe table命令可以查看到字段存在(算是一个伪列),但是该字段不存放实际的数据内容,仅仅是分区的表示。
4、分区有一级、二级、三级和多级分区:
5、创建动态分区、静态分区、混合分区:
动态分区:可以动态加载数据或分区个数不固定
静态分区:可以静态加入数据或分区个数确定
混合分区:动态和静态结合加入数据,主分区必须为静态分区,副分区可以为动态分区

example:
一、通过load的方式加载静态分区数据
##看一下我们的数据
[hadoop@hadoop001 partitionData]$ cat pt_stu_nv.txt 
095041211011,zhangyan,21
095041211034,xiaoli,18
095041211056,xiaoxin,13
[hadoop@hadoop001 partitionData]$ cat pt_stu_nan.txt 
095041211001,xiaozhang,21
095041211029,xiaopeng,18
095041211089,xiaohong,11
##创建分区表,需要注意分区字段不在表字段中
hive (hwzhdb)> create external table static_partition_stu(
             > id int,
             > name string
             > )
             > partitioned by (age int)
             > row format delimited fields terminated by ','
             > stored as textfile;
OK
Time taken: 0.308 seconds
##加载分区数据,必须在表名后面加上分区字段和值。此时我们是加载的本地数据,生产来说的话是需要加载hdfs上的数据,因为数据做清洗操作后是存放在hdfs上面的。
hive (hwzhdb)> load data local inpath '/home/hadoop/data/partitionData/pt_stu_nan.txt' into table static_partition_stu partition(gender='nan');
Loading data to table hwzhdb.static_partition_stu partition (gender=nan)
Partition hwzhdb.static_partition_stu{gender=nan} stats: [numFiles=1, numRows=0, totalSize=76, rawDataSize=0]
OK
Time taken: 1.525 seconds
hive (hwzhdb)> load data local inpath '/home/hadoop/data/partitionData/pt_stu_nv.txt' into table static_partition_stu partition(gender='nv');
Loading data to table hwzhdb.static_partition_stu partition (gender=nv)
Partition hwzhdb.static_partition_stu{gender=nv} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]
OK
Time taken: 0.757 seconds

hive (hwzhdb)> select * from static_partition_stu;
OK
static_partition_stu.id static_partition_stu.name       static_partition_stu.age       static_partition_stu.gender
NULL    xiaozhang       21      nan
NULL    xiaopeng        18      nan
NULL    xiaohong        11      nan
NULL    zhangyan        21      nv
NULL    xiaoli  18      nv
NULL    xiaoxin 13      nv
Time taken: 0.554 seconds, Fetched: 6 row(s)
hive (hwzhdb)> select * from static_partition_stu where gender='nan';
OK
static_partition_stu.id static_partition_stu.name       static_partition_stu.age       static_partition_stu.gender
NULL    xiaozhang       21      nan
NULL    xiaopeng        18      nan
NULL    xiaohong        11      nan
Time taken: 0.541 seconds, Fetched: 3 row(s)
hive (hwzhdb)> select * from static_partition_stu where gender='nv';
OK
static_partition_stu.id static_partition_stu.name       static_partition_stu.age       static_partition_stu.gender
NULL    zhangyan        21      nv
NULL    xiaoli  18      nv
NULL    xiaoxin 13      nv
Time taken: 0.143 seconds, Fetched: 3 row(s)
二、通过insert的方式加载静态分区数据
##查看数据
hive (hwzhdb)> select * from emp;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm       emp.deptno
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  3
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
##创建分区表,将部门编号作为分区字段
hive (hwzhdb)> create external table static_partition_insert_emp(
             > empno int,
             > ename string,
             > job string,
             > mgr int,
             > hiredate string,
             > sal double,
             > comm double
             > )
             > partitioned by (deptno int)
             > row format delimited fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 0.154 seconds
##通过insert的方式从emp中向static_partition_insert_emp中加载数据,需要注意,因为此时emp表中的字段数量跟static_partition_insert_emp的字段数量是不对应的,将emp中的deptno字段作为static_partition_insert_emp表中的分区字段了,所以从emp中查出来的字段要一个个对应static_partition_insert_emp中的字段

insert into table static_partition_insert_emp partition(deptno=10)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;

insert into table static_partition_insert_emp partition(deptno=20)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=20;

insert into table static_partition_insert_emp partition(deptno=30)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=30;
hive (hwzhdb)> select * from static_partition_insert_emp;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
Time taken: 0.124 seconds, Fetched: 14 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=10;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
Time taken: 0.221 seconds, Fetched: 3 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=20;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
Time taken: 0.113 seconds, Fetched: 5 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=30;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
Time taken: 0.119 seconds, Fetched: 6 row(s)
三、多级静态分区,因为一个分区对应的是hdfs上面的一个目录,多级分区则相当于hdfs上的多级目录
比如我们还是从emp表中拿数据:通过 job和deptno进行多级分区
hive (hwzhdb)> select * from emp;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm       emp.deptno
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
Time taken: 0.083 seconds, Fetched: 14 row(s)
##插入分区数据
insert into table static_mut_partition_insert_emp partition(job='MANAGER',deptno=10)
select empno,ename,mgr,hiredate,sal,comm from emp where job='MANAGER' and deptno=10;
##通过指定分区查出数据
hive (hwzhdb)> select * from static_mut_partition_insert_emp where job='MANAGER' and deptno=10;
OK
static_mut_partition_insert_emp.empno   static_mut_partition_insert_emp.ename   static_mut_partition_insert_emp.mgr     static_mut_partition_insert_emp.hiredate        static_mut_partition_insert_emp.sal     static_mut_partition_insert_emp.comm    static_mut_partition_insert_emp.job     static_mut_partition_insert_emp.deptno
7782    CLARK   7839    1981-06-09    00:00:00  2450.0  NULL    MANAGER 10
Time taken: 0.149 seconds, Fetched: 1 row(s)
四、动态分区,当分区个数不确定或者分区个数过多的时候可以使用动态分区,动态分区只能使用insert的方式插入数据,并且使用动态分区需要先把动态分区的设置改为非严格模式,默认为严格模式(set hive.exec.dynamic.partition.mode=nonstrict; 如果动态分区为关闭状态的话还需要打开动态分区的设置set hive.exec.dynamic.partition=true;)
##看一下我们的数据,我们以年龄作为分区字段
hive (hwzhdb)> select * from stus;
OK
stus.id stus.name       stus.sge
NULL    zhangyan        21
NULL    xiaoli  18
NULL    xiaoxin 13
NULL    xiaozhang       21
NULL    xiaopeng        18
NULL    xiaohong        1
Time taken: 0.087 seconds, Fetched: 6 row(s)
##创建动态分区表
hive (hwzhdb)> create table dynamic_insert_stu(
             > id int,
             > name string
             > )
             > partitioned by (age int)
             > row format delimited fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 0.069 seconds
##从stus表中向dynamic_insert_stu中插入数据
hive (hwzhdb)> set hive.exec.dynamic.partition.mode=nonstrict;

insert into table dynamic_insert_stu partition(age)
select id,name,age from stus;
五、混合分区(待续...)

总结: 静态分区表和动态分区表的区别?
1.建表无区别。
2.静态分区能通过load和insert加载数据,不仅要指定分区字段,也要指定对应的值
load data local inpath ‘/home/hadoop/data/partitionData/pt_stu_nan.txt’ into table static_partition_stu partition(gender=‘nan’);
insert into table static_partition_insert_emp partition(deptno=10)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;
动态分区只能通过insert的方式加载数据,只需要指定分区字段就可以,但是必须要通过设置打开动态分区,并且动态分区的字段只能放在select的最后一个字段,如果是多个动态分区字段,那对应放在select的最后几个就行。
insert into table dynamic_insert_stu partition(age) select id,name,age
from stus;

Attention:
1、动态分区与静态分区还有一个细微的差别是,静态分区一定会创建分区,不管SELECT语句的结果有没有数据。而动态分区,只有在SELECT结果的记录数>0的时候,才会创建分区。因此在不同的业务场景下,可能会选择不同的方案。
2、另外使用动态分区时需要注意的比较重要的一点是,动态分区会为每一个分区分配reduce数。比如说你在脚本上面写了:set
mapred.reduce.tasks=100;
这在分区值很多的情况下,会成为一个灾难,容易直接把namenode给搞挂掉(因为namenode大量操作文件),是非常危险的。因此使用动态分区时,一定要清楚地知道产生的动态分区值,并且合理地设置reduce数量。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值