hive中distribute by、sort by、cluster by的区别
1.distribute by + sort by 组合使用
实验表
hive> select * from test1;
OK
19 1 FLUSH
18 1 CACHE
17 1 PROCEDURE
16 1 ANALYSE
15 1 HELP
14 1 VALUE
13 1 SERIAL
12 1 DEFAULT
11 1 HELP_VERSION
10 1 HELP_DATE
9 0 FLUSH
8 0 CACHE
7 0 PROCEDURE
6 0 ANALYSE
5 0 HELP
4 0 VALUE
3 0 SERIAL
2 0 DEFAULT
1 0 HELP_VERSION
0 0 HELP_DATE
Time taken: 0.123 seconds, Fetched: 20 row(s)
hive> desc test1;
OK
id int
type int
name varchar(100)
Time taken: 0.064 seconds, Fetched: 3 row(s)
执行:
> select * from test1 distribute by type sort by id;
Query ID = root_20200110153747_48771756-ed5a-4e36-ae3c-2611cb64f63c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1578636756028_0021, Tracking URL = http://linux-one:8088/proxy/application_1578636756028_0021/
Kill Command = /usr/java/hadoop/bin/mapred job -kill job_1578636756028_0021
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-01-10 15:38:03,353 Stage-1 map = 0%, reduce = 0%
2020-01-10 15:38:11,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.62 sec
2020-01-10 15:38:18,649 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.18 sec
MapReduce Total cumulative CPU time: 5 seconds 180 msec
Ended Job = job_1578636756028_0021
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.18 sec HDFS Read: 11083 HDFS Write: 575 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 180 msec
OK
0 0 HELP_DATE
1 0 HELP_VERSION
2 0 DEFAULT
3 0 SERIAL
4 0 VALUE
5 0 HELP
6 0 ANALYSE
7 0 PROCEDURE
8 0 CACHE
9 0 FLUSH
10 1 HELP_DATE
11 1 HELP_VERSION
12 1 DEFAULT
13 1 SERIAL
14 1 VALUE
15 1 HELP
16 1 ANALYSE
17 1 PROCEDURE
18 1 CACHE
19 1 FLUSH
Time taken: 33.305 seconds, Fetched: 20 row(s)
hive>
解析:
distribute by type 将map端输出的文件按type值分组,将同一个type值的输出文件输入到同一个reduce上处理,sort by id 为每个reduce的输出文件按ID排序。
2.cluster by
当cluster by后的字段和distribute by 和sort by字段一样时,相当于 distribute by 和sort by 的结合,默认只能是升序
以下两种写法是一样的
select * from test1 cluster by id;
=
select * from test1 distribute by type sort by id asc;