20190321 - Partitioning and Bucketing

Dynamic Partitioned Tables

1 A table containing the dynamic partition fields

  • Prepare the data
5	男	16	北京	13754554587	2015-03-24
6	女	17	北京	13872374170	2017-02-20
2	女	16	北京	17938844884	2015-05-26
23	男	15	北京	13700000033	2016-07-25
4	男	17	北京	15257575773	2017-08-11
3	女	15	北京	15885888889	2018-05-03
10	男	16	北京	14098593989	2018-04-06
15	女	14	北京	14938983000	2019-06-24
[yao@master ~]$ cd data/
[yao@master data]$ vim stu_message.txt
  • Create the table
hive> create table stu_message(
    > stu_id int,
    > sex string,
    > age int,
    > address string,
    > tel_num string,
    > ts date)
    > row format delimited fields terminated by '\t'
    > stored as textfile
    > location '/yr/stu_message';
OK
Time taken: 0.12 seconds
  • Load the data
hive> load data local inpath '/home/yao/data/stu_message.txt' into table stu_message;
Loading data to table yr.stu_message
Table yr.stu_message stats: [numFiles=0, totalSize=0]
OK
Time taken: 0.932 seconds
hive> select * from stu_message;
OK
5	男	16	北京	13754554587	2015-03-24
6	女	17	北京	13872374170	2017-02-20
2	女	16	北京	17938844884	2015-05-26
23	男	15	北京	13700000033	2016-07-25
4	男	17	北京	15257575773	2017-08-11
3	女	15	北京	15885888889	2018-05-03
10	男	16	北京	14098593989	2018-04-06
15	女	14	北京	14938983000	2019-06-24
Time taken: 0.078 seconds, Fetched: 8 row(s)
  • Create a table with all of the fields
hive> create table stu_mess(
    > stu_id int,
    > sex string,
    > age int,
    > address string,
    > tel_num string,
    > ts date,
    > y int,
    > m int)
    > row format delimited fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.111 seconds
  • Load the data
hive> insert into table stu_mess select stu_id,sex,age,address,tel_num,ts,year(ts) as y,month(ts) as m from stu_message;					//the year and month columns are derived with the year() and month() functions
Query ID = yao_20190321150133_0ab47c53-1f4e-4edb-bc5e-6073cf526ed4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1553134178907_0006, Tracking URL = http://master:18088/proxy/application_1553134178907_0006/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job  -kill job_1553134178907_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-03-21 15:01:50,482 Stage-1 map = 0%,  reduce = 0%
2019-03-21 15:02:10,344 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.53 sec
MapReduce Total cumulative CPU time: 3 seconds 530 msec
Ended Job = job_1553134178907_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/yr.db/stu_mess/.hive-staging_hive_2019-03-21_15-01-33_891_108911504095429350-1/-ext-10000
Loading data to table yr.stu_mess
Table yr.stu_mess stats: [numFiles=1, numRows=8, totalSize=371, rawDataSize=363]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 4.3 sec   HDFS Read: 4519 HDFS Write: 439 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 300 msec
OK
Time taken: 40.172 seconds
hive> select * from stu_mess;
OK
5	男	16	北京	13754554587	2015-03-24	2015	3
6	女	17	北京	13872374170	2017-02-20	2017	2
2	女	16	北京	17938844884	2015-05-26	2015	5
23	男	15	北京	13700000033	2016-07-25	2016	7
4	男	17	北京	15257575773	2017-08-11	2017	8
3	女	15	北京	15885888889	2018-05-03	2018	5
10	男	16	北京	14098593989	2018-04-06	2018	4
15	女	14	北京	14938983000	2019-06-24	2019	6
Time taken: 0.084 seconds, Fetched: 8 row(s)

2 Disable strict partition mode

  • Partition mode; it defaults to strict, so switch it to nonstrict for fully dynamic partitions
hive> set hive.exec.dynamic.partition.mode=nonstrict;
  • Enable dynamic partitioning (default: true)
hive> set hive.exec.dynamic.partition=true;
  • Maximum number of dynamic partitions (default: 1000)
hive> set hive.exec.dynamic.partitions=1000;

The maximum only needs to be raised when the data volume is large; the experimental data here is tiny, so it was left at its default.
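
If you want these settings to persist across sessions instead of re-typing them, they can also go into hive-site.xml; a minimal sketch, using the same property names set above:

<property>
   <name>hive.exec.dynamic.partition.mode</name>
   <value>nonstrict</value>
</property>
<property>
   <name>hive.exec.dynamic.partition</name>
   <value>true</value>
</property>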

3 Create a partitioned table

  • Create the table
hive> create table stu_mess_part(
    > stu_id int,
    > sex string,
    > age int,
    > address string,
    > tel_num string,
    > ts date)
    > partitioned by(y int,m int)
    > row format delimited fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.131 seconds
  • Load the data
hive> insert into table stu_mess_part partition(y,m) select * from stu_mess;
Query ID = yao_20190321151105_548604f0-746b-425b-8c07-6d78426de226
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1553134178907_0007, Tracking URL = http://master:18088/proxy/application_1553134178907_0007/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job  -kill job_1553134178907_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-03-21 15:11:21,325 Stage-1 map = 0%,  reduce = 0%
2019-03-21 15:11:33,308 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.71 sec
MapReduce Total cumulative CPU time: 3 seconds 710 msec
Ended Job = job_1553134178907_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/yr.db/stu_mess_part/.hive-staging_hive_2019-03-21_15-11-05_342_1048170690574644764-1/-ext-10000
Loading data to table yr.stu_mess_part partition (y=null, m=null)
	 Time taken for load dynamic partitions : 1957
	Loading partition {y=2015, m=5}
	Loading partition {y=2019, m=6}
	Loading partition {y=2015, m=3}
	Loading partition {y=2016, m=7}
	Loading partition {y=2017, m=8}
	Loading partition {y=2018, m=5}
	Loading partition {y=2017, m=2}
	Loading partition {y=2018, m=4}
	 Time taken for adding to write entity : 3
Partition yr.stu_mess_part{y=2015, m=3} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2015, m=5} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2016, m=7} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part{y=2017, m=2} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2017, m=8} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2018, m=4} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part{y=2018, m=5} stats: [numFiles=1, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part{y=2019, m=6} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.71 sec   HDFS Read: 4815 HDFS Write: 720 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 710 msec
OK
Time taken: 33.911 seconds
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part;
Found 5 items
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2015
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2016
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2018
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2019
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part/y=2016;
Found 1 items
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2016/m=7
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part/y=2017;
Found 2 items
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2
drwxr-xr-x   - yao supergroup          0 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=8
hive> dfs -ls /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2;
Found 1 items
-rwxr-xr-x   1 yao supergroup         39 2019-03-21 15:11 /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2/000000_0
hive> dfs -cat /user/hive/warehouse/yr.db/stu_mess_part/y=2017/m=2/000000_0;
6	女	17	北京	13872374170	2017-02-20
hive> show partitions stu_mess_part;
OK
y=2015/m=3
y=2015/m=5
y=2016/m=7
y=2017/m=2
y=2017/m=8
y=2018/m=4
y=2018/m=5
y=2019/m=6
Time taken: 0.159 seconds, Fetched: 8 row(s)
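
If strict mode had to stay enabled, a partial alternative is to mix static and dynamic partition columns; a minimal sketch (strict mode only requires that at least the leading partition column be given a constant):

hive> insert into table stu_mess_part partition(y=2017, m)
    > select stu_id,sex,age,address,tel_num,ts,m from stu_mess where y=2017;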

In practice, time-based partitioning is the most common (e.g. for e-commerce data).


Bucketing

  • Why do bucketed tables exist?

    Partitioning provides a convenient way to isolate data and speed up queries, but not every data set can be partitioned sensibly; in particular it is hard to pick a partitioning scheme that yields reasonably sized partitions (a poor choice can leave some partitions overloaded with data while others are nearly empty). Bucketing is another technique for breaking a data set into pieces that are easier to manage.

  • What is a bucketed table?

    For every table or partition, Hive can organize the data further into buckets; a bucket is a finer-grained division of the data range.
    Hive buckets the data on a single chosen column.

  • How is the data bucketed?

    Hive hashes the value of the bucketing column and takes the remainder modulo the number of buckets to decide which bucket a record is stored in (see the sketch after this list).

  • What are the benefits?

1. More efficient query processing. Buckets add extra structure to the table, and Hive can exploit that structure for some queries. In particular, when two tables are both bucketed on the join column, a JOIN only has to match up the buckets holding the same column values instead of comparing everything against everything, which greatly reduces the amount of data involved in the JOIN.
2. More efficient sampling. When working with very large data sets, being able to try a query on a small slice of the data while developing and refining it is a great convenience.
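
A rough way to preview which bucket each key would land in is to compute the hash modulo the bucket count yourself; a small sketch using Hive's built-in hash() and pmod() functions against the students table defined in the example below (the exact hash the bucketing machinery applies can differ by column type and Hive version):

hive> select sno, pmod(hash(sno), 3) as bucket from students;

With 3 buckets, sno values 95001, 95002 and 95003 map to buckets 0, 1 and 2 respectively, which matches the three bucket files shown further below.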

1 Example

  • data
[yao@master data]$ vim student.txt
[yao@master data]$ cat student.txt 
95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
95007,易思玲,女,19,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95010,孔小涛,男,19,CS
95011,包小柏,男,18,MA
95012,孙花,女,20,CS
95013,冯伟,男,21,CS
95014,王小丽,女,19,CS
95015,王君,男,18,MA
95016,钱国,男,21,MA
95017,王风娟,女,18,IS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS
95020,赵钱,男,21,IS
95021,周二,男,17,MA
95022,郑明,男,20,MA

2 A first (failed) attempt at bucketing

  • Create the table
hive> create table students(
    > sno int,
    > sname string,
    > sex string,
    > sage int,
    > sdept string)
    > clustered by(sno) into 3 buckets
    > row format delimited fields terminated by ','
    > stored as textfile;
OK
Time taken: 0.162 seconds
  • Enable bucketing (enforce bucketing)
hive> set hive.enforce.bucketing=true;
  • Load the data
hive> load data local inpath '/home/yao/data/student.txt' overwrite into table students;
Loading data to table yr.students
Table yr.students stats: [numFiles=1, numRows=0, totalSize=527, rawDataSize=0]
OK
Time taken: 0.752 seconds

Check the result

hive> dfs -ls /user/hive/warehouse/yr.db/students;
Found 1 items
-rwxr-xr-x   1 yao supergroup        527 2019-03-21 15:55 /user/hive/warehouse/yr.db/students/student.txt

Although enforce-bucketing was set, the bucketing did not actually happen.
Under the students table there is still just a single file, student.txt. Each bucket is materialized as one file, so a properly bucketed table should contain as many files as buckets; a plain LOAD only copies the file into the table directory, so it does not bucket the data.
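
One way to confirm how a table was declared is to inspect its metadata; a small sketch (describe formatted lists the declared bucket count in a "Num Buckets" field among the storage information):

hive> describe formatted students;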

3 A second attempt at bucketing

  • Create the table
hive> create table stud_buck(
    > sno int,
    > sname string,
    > sex string,
    > sage int,
    > sdept string)
    > clustered by(sno) 
    > sorted by(sno desc)
    > into 3 buckets
    > row format delimited fields terminated by ',';
OK
Time taken: 0.12 seconds
  • Enable bucketing and set the number of reducers
hive> set hive.enforce.bucketing=true;
hive> set mapreduce.job.reduces=3;

Set these variables: turn bucketing on, and set the number of reduce tasks equal to the number of buckets.

  • Load the data

Now insert data into the bucketed table (the data being inserted must already be distributed into buckets and sorted, hence the distribute by / sort by below).

hive> insert into table stud_buck select * from students distribute by(sno) sort by (sno);
Query ID = yao_20190321160721_d26f0b8f-319f-4378-aec3-f6146b59ba08
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1553134178907_0008, Tracking URL = http://master:18088/proxy/application_1553134178907_008/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job  -kill job_1553134178907_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-21 16:07:40,026 Stage-1 map = 0%,  reduce = 0%
2019-03-21 16:07:53,630 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.22 sec
2019-03-21 16:08:09,486 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.31 sec
MapReduce Total cumulative CPU time: 6 seconds 310 msec
Ended Job = job_1553134178907_0008
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1553134178907_0009, Tracking URL = http://master:18088/proxy/application_1553134178907_009/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job  -kill job_1553134178907_0009
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 3
2019-03-21 16:08:29,636 Stage-2 map = 0%,  reduce = 0%
2019-03-21 16:08:36,029 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.67 sec
2019-03-21 16:09:11,172 Stage-2 map = 100%,  reduce = 44%, Cumulative CPU 4.37 sec
2019-03-21 16:09:26,828 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 13.03 sec
2019-03-21 16:09:27,866 Stage-2 map = 100%,  reduce = 89%, Cumulative CPU 13.17 sec
2019-03-21 16:09:29,991 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 15.41 sec
MapReduce Total cumulative CPU time: 15 seconds 410 msec
Ended Job = job_1553134178907_0009
Loading data to table yr.stud_buck
Table yr.stud_buck stats: [numFiles=3, numRows=22, totalSize=527, rawDataSize=505]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.31 sec   HDFS Read: 6819 HDFS Write: 909 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 3   Cumulative CPU: 15.41 sec   HDFS Read: 12694 HDFS Write: 734 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 720 msec
OK
Time taken: 130.884 seconds

Check the result

hive> dfs -ls /user/hive/warehouse/yr.db/stud_buck;
Found 3 items
-rwxr-xr-x   1 yao supergroup        193 2019-03-21 16:09 /user/hive/warehouse/yr.db/stud_buck/000000_0
-rwxr-xr-x   1 yao supergroup        170 2019-03-21 16:09 /user/hive/warehouse/yr.db/stud_buck/000001_0
-rwxr-xr-x   1 yao supergroup        164 2019-03-21 16:09 /user/hive/warehouse/yr.db/stud_buck/000002_0
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000000_0;
95001,李勇,男,20,CS
95004,张立,男,19,IS
95007,易思玲,女,19,MA
95010,孔小涛,男,19,CS
95013,冯伟,男,21,CS
95016,钱国,男,21,MA
95019,邢小丽,女,19,IS
95022,郑明,男,20,MA
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000000_1;
cat: `/user/hive/warehouse/yr.db/stud_buck/000000_1': No such file or directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000001_0;
95002,刘晨,女,19,IS
95005,刘刚,男,18,MA
95008,李娜,女,18,CS
95011,包小柏,男,18,MA
95014,王小丽,女,19,CS
95017,王风娟,女,18,IS
95020,赵钱,男,21,IS
hive> dfs -cat /user/hive/warehouse/yr.db/stud_buck/000002_0;
95003,王敏,女,22,MA
95006,孙庆,男,23,CS
95009,梦圆圆,女,18,MA
95012,孙花,女,20,CS
95015,王君,男,18,MA
95018,王一,女,19,IS
95021,周二,男,17,MA

#You can use distribute by(sno) sort by(sno asc); when the bucketing column and the sort column are the same, cluster by(column) can be used instead.
#Note that cluster by is equivalent to bucketing (distribute by) plus sorting (sort by, ascending only).

4 Key points

SORTED BY additionally sorts the rows inside each bucket on one or more columns.
distribute by <column> controls how map output is distributed: rows with the same value of that column are sent to the same reduce task.
CLUSTER BY is equivalent to combining distribute by and sort by, and only ascending order is supported; in the DDL, CLUSTERED BY specifies the bucketing column and the number of buckets, and Hive takes the hash of the key modulo the bucket count so that the data is spread evenly and randomly across all buckets.
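
Because the bucketing column and the sort column are both sno here, the insert above could also be written with cluster by; a sketch against the same students and stud_buck tables:

hive> insert overwrite table stud_buck select * from students cluster by (sno);

overwrite is used so that re-running the statement replaces, rather than duplicates, the rows loaded earlier.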

5 Summary

A bucket is essentially the same concept as a MapReduce partition: physically, each bucket is one file in the table directory, and the number of buckets (output files) produced by a job equals the number of reduce tasks.
A Hive table partition is a different concept: a partition is a directory in the warehouse, and each directory can hold any number of data files. The directory is how you reach the files inside it, but the directory name itself is metadata and says nothing about the file contents.
A bucket, by contrast, is derived from the data itself: a large file is hashed on some column value into a set of small files, and each small file can be sorted on its own. If another table is split into small files by the same rule, a join between the two tables no longer needs to scan everything; it only has to match up the corresponding buckets, which improves efficiency considerably.
Likewise, sampling does not have to scan the whole file; it is enough to pull a slice from each bucket by the same rule.

6 Querying bucketed tables

  • Bucket-based sampling
hive> select * from stu_mess_part_buck tablesample(bucket 1 out of 2 on sex);				//takes 2 of the 4 buckets: bucket 1 and bucket 3
OK
5	男	16	北京	13754554587	2015-03-24	2015	3
23	男	15	北京	13700000033	2016-07-25	2016	7
4	男	17	北京	15257575773	2017-08-11	2017	8
10	男	16	北京	14098593989	2018-04-06	2018	4
Time taken: 1.621 seconds, Fetched: 4 row(s)

tablesample is the sampling clause; the syntax is TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total bucket count; Hive decides the sampling fraction from y.
For example, if the table is split into 64 buckets, then y=32 extracts (64/32=) 2 buckets of data,
and y=128 extracts (64/128=) 1/2 of a bucket.
x indicates which bucket to start sampling from.
For example, with 32 buckets in total, tablesample(bucket 3 out of 16)
extracts (32/16=) 2 buckets in total,
namely the 3rd bucket and the (3+16=) 19th bucket.
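
Applied to the 3-bucket stud_buck table above, sampling on its bucketing column lets Hive read just one bucket file instead of scanning the whole table; a sketch (with y equal to the total bucket count, exactly one bucket is returned):

hive> select * from stud_buck tablesample(bucket 1 out of 3 on sno);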

  • Percentage-based sampling

Hive also supports sampling by a percentage of the input; this form samples a percentage of the data blocks under the input path rather than a percentage of rows. The smallest unit of sampling is one HDFS block, so if the table is smaller than the standard block size of 128 MB, every row is returned.
Percentage-based sampling exposes one property that controls the seed used when choosing blocks:

<property>
   <name>hive.sample.seednumber</name>
   <value>0</value>
</property>
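
For completeness, a percentage sample looks like this; a sketch (on the tiny tables in this post it simply returns every row, because the data is far smaller than one HDFS block):

hive> select * from stud_buck tablesample(10 percent);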

7 Some drawbacks of bucketing

Getting data into a bucketed table cannot be done with a plain LOAD of a data file; it requires an insert ... select, which means an extra MapReduce job.
In real production work bucketing is used fairly rarely; partitioning is far more common.


Partitioning and bucketing combined

  • Create the table
hive> create table stu_mess_part_buck(
    > stu_id int,
    > sex string,
    > age int,
    > address string,
    > tel_num string,
    > ts date)
    > partitioned by (y int,m int)
    > clustered by(stu_id) into 4 buckets
    > row format delimited fields terminated by "\t"
    > stored as textfile
    > location '/yr/stu_mess_part_buck';
OK
Time taken: 0.095 seconds

Load the data

hive> insert into table stu_mess_part_buck partition(y,m) select * from stu_mess;
Query ID = yao_20190321162521_c78329dd-e687-4839-81fc-26d706f8631e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1553134178907_0010, Tracking URL = http://master:18088/proxy/application_1553134178907_010/
Kill Command = /home/yao/hadoop-2.7.7/bin/hadoop job  -kill job_1553134178907_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-03-21 16:25:38,087 Stage-1 map = 0%,  reduce = 0%
2019-03-21 16:25:45,844 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.53 sec
2019-03-21 16:26:03,068 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 4.31 sec
2019-03-21 16:26:10,634 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 8.12 sec
2019-03-21 16:26:13,881 Stage-1 map = 100%,  reduce = 42%, Cumulative CPU 9.05 sec
2019-03-21 16:26:16,029 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 12.12 sec
2019-03-21 16:26:17,080 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.26 sec
MapReduce Total cumulative CPU time: 18 seconds 260 msec
Ended Job = job_1553134178907_0010
Loading data to table yr.stu_mess_part_buck partition (y=null, m=null)
	 Time taken for load dynamic partitions : 1278
	Loading partition {y=2018, m=5}
	Loading partition {y=2016, m=7}
	Loading partition {y=2017, m=2}
	Loading partition {y=2019, m=6}
	Loading partition {y=2015, m=3}
	Loading partition {y=2018, m=4}
	Loading partition {y=2015, m=5}
	Loading partition {y=2017, m=8}
	 Time taken for adding to write entity : 4
Partition yr.stu_mess_part_buck{y=2015, m=3} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2015, m=5} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2016, m=7} stats: [numFiles=4, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part_buck{y=2017, m=2} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2017, m=8} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2018, m=4} stats: [numFiles=4, numRows=1, totalSize=40, rawDataSize=39]
Partition yr.stu_mess_part_buck{y=2018, m=5} stats: [numFiles=4, numRows=1, totalSize=39, rawDataSize=38]
Partition yr.stu_mess_part_buck{y=2019, m=6} stats: [numFiles=4, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 18.26 sec   HDFS Read: 21761 HDFS Write: 871 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 260 msec
OK
Time taken: 60.378 seconds

Check the result

hive> show partitions stu_mess_part_buck;
OK
y=2015/m=3
y=2015/m=5
y=2016/m=7
y=2017/m=2
y=2017/m=8
y=2018/m=4
y=2018/m=5
y=2019/m=6
Time taken: 0.109 seconds, Fetched: 8 row(s)
hive> dfs -ls /yr/stu_mess_part_buck;
Found 5 items
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2016
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2017
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2018
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2019
hive> dfs -ls /yr/stu_mess_part_buck/y=2015;
Found 2 items
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3
drwxr-xr-x   - yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=5
hive> dfs -ls /yr/stu_mess_part_buck/y=2015/m=3;
Found 4 items
-rwxr-xr-x   1 yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000000_0
-rwxr-xr-x   1 yao supergroup         39 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000001_0
-rwxr-xr-x   1 yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000002_0
-rwxr-xr-x   1 yao supergroup          0 2019-03-21 16:26 /yr/stu_mess_part_buck/y=2015/m=3/000003_0
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000000_0;
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000001_0;
5	男	16	北京	13754554587	2015-03-24
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000002_0;
hive> dfs -cat /yr/stu_mess_part_buck/y=2015/m=3/000003_0;
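
Finally, a query that filters on the partition columns only has to read the matching directory; a sketch against the stu_mess_part_buck table just created:

hive> select * from stu_mess_part_buck where y=2015 and m=3;

Hive prunes the scan to /yr/stu_mess_part_buck/y=2015/m=3 and skips the other partition directories entirely.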