Hive can further organize any table or partition into buckets, a finer-grained division of the data. Bucketing is done on a chosen column: Hive hashes the column value and takes the result modulo the number of buckets to decide which bucket a record belongs to.
There are two reasons to organize a table (or partition) into buckets:
- More efficient processing of some queries. Bucketing adds extra structure to the table, and Hive can exploit that structure. In particular, joining two tables that are both bucketed on the join column can be implemented efficiently as a map-side join: rows with equal join-column values are guaranteed to sit in buckets with matching indices, so only matching buckets need to be joined with each other, which can greatly reduce the amount of data the JOIN has to process.
- More efficient sampling. When working with large datasets, it is very convenient during query development and debugging to try a query against a small slice of the data.
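The bucket-wise join idea above can be sketched in Python. This is only an illustration: Python's built-in `hash` stands in for Hive's hash function, and the rows and the helper `bucketize` are made up for the example.

```python
from collections import defaultdict

def bucketize(rows, key_idx, num_buckets):
    """Assign each row to a bucket by hashing the bucketing column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key_idx]) % num_buckets].append(row)
    return buckets

# Two tables bucketed on the join column with the SAME bucket count:
# rows with equal keys always land in buckets with the same index,
# so each bucket of `a` only needs to meet the matching bucket of `b`.
a = bucketize([(1, "x"), (2, "y"), (3, "z")], key_idx=0, num_buckets=4)
b = bucketize([(1, "u"), (3, "v")], key_idx=0, num_buckets=4)

joined = [(ra, rb)
          for i in a                 # bucket by bucket, not table by table
          for ra in a[i]
          for rb in b.get(i, [])
          if ra[0] == rb[0]]
```

Only matching bucket pairs are scanned; buckets of `a` with no counterpart in `b` contribute nothing.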
1. Create a bucketed table
create table t_buk(id string, name string)
clustered by (id)   -- bucket by id
sorted by (id)      -- sort by id within each bucket
into 4 buckets      -- use 4 buckets
row format delimited fields terminated by ',';
2. Inspect the table structure
0: jdbc:hive2://localhost:10000> desc formatted t_buk;
The output shows the table metadata, including the bucket columns and the bucket count (`Num Buckets: 4`).
3. Load data
0: jdbc:hive2://localhost:10000> load data local inpath '/home/hadoop/data.txt' into table t_buk;
The result:
INFO : Loading data to table shizhan03.t_buk from file:/home/hadoop/data.txt
INFO : Table shizhan03.t_buk stats: [numFiles=1, totalSize=46]
No rows affected (0.737 seconds)
Note: data loaded this way is NOT automatically bucketed; `load data` just copies the file. Truncate the table and re-insert the data instead.
4. Truncate the table
truncate table t_buk;
5. Insert data
-- this statement runs a MapReduce job
insert into table t_buk select id,name from t_sz03;
The result:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1688102775_0001
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:04:05,158 Stage-1 map = 0%, reduce = 0%
INFO : 2018-09-06 10:04:06,164 Stage-1 map = 100%, reduce = 0%
INFO : Ended Job = job_local1688102775_0001
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-04-02_708_5179772915570508915-1/-ext-10000 from hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-04-02_708_5179772915570508915-1/-ext-10002
INFO : Loading data to table shizhan03.t_buk from hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-04-02_708_5179772915570508915-1/-ext-10000
INFO : Table shizhan03.t_buk stats: [numFiles=1, numRows=5, totalSize=46, rawDataSize=41]
No rows affected (3.81 seconds)
Checking the result, the data is still in a single file: the path /user/hive/warehouse/shizhan03.db/t_buk contains only one file. How do we fix this?
6. Set the bucketing parameters
-- set the parameter:
set hive.enforce.bucketing = true;
Check the parameter:
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing;
+------------------------------+--+
| set |
+------------------------------+--+
| hive.enforce.bucketing=true |
+------------------------------+--+
1 row selected (0.006 seconds)
set mapreduce.job.reduces=4; -- set the number of reduce tasks
0: jdbc:hive2://localhost:10000> set mapreduce.job.reduces;
+--------------------------+--+
| set |
+--------------------------+--+
| mapreduce.job.reduces=4 |
+--------------------------+--+
1 row selected (0.007 seconds)
Check the sort result:
0: jdbc:hive2://localhost:10000> select * from t_sz01 sort by (id);
INFO : Number of reduce tasks not specified. Defaulting to jobconf value of: 4
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1059625551_0002
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:26:11,257 Stage-1 map = 100%, reduce = 0%
INFO : 2018-09-06 10:26:12,271 Stage-1 map = 100%, reduce = 100%
INFO : Ended Job = job_local1059625551_0002
+------------+--------------+--+
| t_sz01.id | t_sz01.name |
+------------+--------------+--+
| 2 | lisi |
| 4 | zhaoliu |
| 1 | zhangsan |
| 3 | wangwu |
| 5 | zhouqi |
+------------+--------------+--+
5 rows selected (2.81 seconds)
The number of reduce tasks is 4. With `sort by`, each reducer sorts only the rows distributed to it, which is why the output above is sorted within each reducer's slice but not globally.
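What `sort by` with 4 reducers does can be sketched as a minimal Python model. Integer keys are used here so that Python's `hash` is deterministic; it stands in for Hive's partitioning hash, so the exact slice boundaries differ from Hive's.

```python
# `sort by` sorts within each reducer, not globally.
rows = [1, 2, 3, 4, 5]
num_reducers = 4

# distribute rows to reducers by hash
reducers = [[] for _ in range(num_reducers)]
for r in rows:
    reducers[hash(r) % num_reducers].append(r)

# each reducer sorts only its own slice; slices are then concatenated
output = [r for part in reducers for r in sorted(part)]

print(output)  # → [4, 1, 5, 2, 3]: sorted per slice, not end to end
```

An `order by`, by contrast, funnels everything through one reducer and yields a single globally sorted slice, as step 8 below demonstrates.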
7. Insert data with `distribute by` and `sort by`
insert into table t_buk select id,name from t_sz01 distribute by(id) sort by(id);
The execution log:
0: jdbc:hive2://localhost:10000> insert into table t_buk
0: jdbc:hive2://localhost:10000> select id,name from t_sz01 distribute by(id) sort by(id);
INFO : Number of reduce tasks not specified. Defaulting to jobconf value of: 4
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1435934157_0003
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:42:08,780 Stage-1 map = 100%, reduce = 100%
INFO : Ended Job = job_local1435934157_0003
INFO : Number of reduce tasks determined at compile time: 4
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local1436192478_0004
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:42:10,446 Stage-2 map = 100%, reduce = 0%
INFO : 2018-09-06 10:42:11,458 Stage-2 map = 100%, reduce = 50%
INFO : 2018-09-06 10:42:12,474 Stage-2 map = 100%, reduce = 75%
INFO : 2018-09-06 10:42:13,479 Stage-2 map = 100%, reduce = 100%
INFO : Ended Job = job_local1436192478_0004
INFO : Loading data to table shizhan03.t_buk from hdfs://langzi01:9000/user/hive/warehouse/shizhan03.db/t_buk/.hive-staging_hive_2018-09-06_10-42-07_217_7738831819154647152-1/-ext-10000
INFO : Table shizhan03.t_buk stats: [numFiles=4, numRows=16, totalSize=189, rawDataSize=173]
No rows affected (6.763 seconds)
Browsing the file system shows that the path /user/hive/warehouse/shizhan03.db/t_buk now contains 4 files, named:
000000_0, 000001_0, 000002_0, 000003_0
Inspect the contents of each bucket file:
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000000_0;
The contents:
+--------------+--+
|  DFS Output  |
+--------------+--+
| 15,spring    |
| 4,一元go     |
| 8,丢件       |
| 11,圣诞浏览  |
+--------------+--+
4 rows selected (0.027 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000001_0;
+--------------+--+
|  DFS Output  |
+--------------+--+
| 1,11.11      |
| 16,browse    |
| 12,圣诞注册  |
| 9,延时       |
| 5,排行榜     |
+--------------+--+
5 rows selected (0.017 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000002_0;
+-----------------+--+
|   DFS Output    |
+-----------------+--+
| 6,邀请          |
| 2,更高补贴      |
| 13,圣诞领取按钮 |
+-----------------+--+
3 rows selected (0.019 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000003_0;
+--------------------+--+
|     DFS Output     |
+--------------------+--+
| 14,springFestival  |
| 10,剁手            |
| 7,货损             |
| 3,+5               |
+--------------------+--+
4 rows selected (0.022 seconds)
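The bucket each row landed in can be reproduced outside Hive. For the string ids in this example, the bucket index matches Java's `String.hashCode` of the id, with the sign bit cleared, modulo the bucket count. A sketch (the helper names `java_string_hashcode` and `bucket_for` are my own):

```python
def java_string_hashcode(s: str) -> int:
    """Java String.hashCode: h = 31*h + char, with 32-bit signed overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(key: str, num_buckets: int) -> int:
    # clear the sign bit before the modulus so the bucket index is non-negative
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_buckets

# Reproduce the file layout observed above: ids 15, 4, 8, 11 in 000000_0
# and ids 1, 16, 12, 9, 5 in 000001_0.
assert [bucket_for(k, 4) for k in ["15", "4", "8", "11"]] == [0, 0, 0, 0]
assert [bucket_for(k, 4) for k in ["1", "16", "12", "9", "5"]] == [1, 1, 1, 1, 1]
```

The same check works for 000002_0 (ids 6, 2, 13 → bucket 2) and 000003_0 (ids 14, 10, 7, 3 → bucket 3).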
8. Insert data with `cluster by`
-- truncate the table first
truncate table t_buk;
-- cluster by(id) is equivalent to distribute by(id) sort by(id)
insert into table t_buk select id,name from t_sz01 cluster by(id);
Check the bucketed data:
0: jdbc:hive2://localhost:10000> dfs -cat /user/hive/warehouse/shizhan03.db/t_buk/000000_0;
Question: with the number of reduce tasks set to 4, does a query against a non-bucketed table still return all the data? Yes:
0: jdbc:hive2://localhost:10000> select * from t_sz01 order by name;
INFO : Number of reduce tasks determined at compile time: 1
-- the number of reduce tasks is forced to 1: `order by` needs a single reducer to produce a global order
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local157551508_0007
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2018-09-06 10:59:40,673 Stage-1 map = 100%, reduce = 100%
INFO : Ended Job = job_local157551508_0007
+------------+-----------------+--+
| t_sz01.id | t_sz01.name |
+------------+-----------------+--+
| 3 | +5 |
| 1 | 11.11 |
| 16 | browse |
| 15 | spring |
| 14 | springFestival |
| 4 | 一元go |
| 8 | 丢件 |
| 10 | 剁手 |
| 12 | 圣诞注册 |
| 11 | 圣诞浏览 |
| 13 | 圣诞领取按钮 |
| 9 | 延时 |
| 5 | 排行榜 |
| 2 | 更高补贴 |
| 7 | 货损 |
| 6 | 邀请 |
+------------+-----------------+--+
16 rows selected (1.404 seconds)
9. Ways to save the result of a select query:
-- create a new table from a query result
a. create table t_tmp as select * from t_sz01;
-- insert the result into an existing table
b. insert into table t_tmp select * from t_sz01;
-- write the result to a directory (either local or HDFS)
c. insert overwrite local directory '/home/hadoop/test' select * from t_sz01;
Done!