Background
Almost every Hive interview includes a question about the differences between ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY, so here is a brief walkthrough.
ORDER BY
order by produces a globally sorted result set. For a large dataset this can take a very long time, because the entire output has to pass through one sort stage; the upside is that the whole result is ordered. Use it with caution on big data.
Let's demonstrate with Hive SQL by querying the student table ordered globally by student number. To make the point about global ordering, we deliberately request two reducers. Under Hive's default hash partitioning, rows would normally be routed to different reducers by hashing the student number modulo the reducer count; but because the query uses order by, Hive guarantees a globally sorted final output (in practice it runs the order by stage in a single reducer, regardless of the setting). The Hive SQL is as follows:
set mapred.reduce.tasks = 2; -- request two reducers
select
t1.sno,t1.sname
from dw.ods_rs_basic_tbd_student t1
where t1.event_day='20200618'
order by t1.sno
;
The result (Figure 1) is indeed globally ordered.
SORT BY
sort by sorts the data within each reducer only, i.e. it performs a local sort: each reducer's output is ordered, but the overall output is not globally ordered.
To demonstrate, we again request two reducers. If the understanding above is correct, the output should be locally ordered, showing two independently sorted segments. The Hive SQL is as follows:
set mapred.reduce.tasks = 2; -- request two reducers
select
t1.sno,t1.sname
from dw.ods_rs_basic_tbd_student t1
where t1.event_day='20200618'
sort by t1.sno
;
The result (Figure 2) shows two independently sorted segments.
In case you suspect a coincidence, let's bump the reducer count to 3:
set mapred.reduce.tasks = 3;
select
t1.sno,t1.sname
from dw.ods_rs_basic_tbd_student t1
where t1.event_day='20200618'
sort by t1.sno
;
The result (Figure 3) shows three independently sorted segments.
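Those independently sorted segments fall straight out of the MapReduce model: rows are hash-partitioned across reducers, and each reducer sorts only its own slice. Here is a minimal Python simulation of that behavior (the student numbers are made up; real Hive hashes the serialized key, but the mod-by-reducer-count step is the same shape):

```python
# Simulate Hive's "sort by" with 3 reducers: hash-partition rows on sno,
# then sort each partition independently -- local order only.
rows = [(3, "s3"), (8, "s8"), (1, "s1"), (10, "s10"),
        (6, "s6"), (4, "s4"), (9, "s9"), (2, "s2")]
num_reducers = 3

# Default partitioner shape: hash(key) modulo reducer count.
# (Python's hash(int) is the int itself, which keeps this demo deterministic.)
partitions = [[] for _ in range(num_reducers)]
for row in rows:
    partitions[hash(row[0]) % num_reducers].append(row)

# Each reducer sorts only the rows it received.
for p in partitions:
    p.sort(key=lambda r: r[0])

# Every partition is sorted, but their concatenation is not globally sorted.
merged = [r for p in partitions for r in p]
print(partitions)
print(merged)
```

Concatenating the partitions reproduces exactly what Figure 3 shows: several sorted runs, no global order.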
DISTRIBUTE BY
distribute by controls how map output is divided among reducers. Suppose we want all rows with the same gender processed together: distribute by guarantees that rows with the same key are routed to the same reducer. distribute by can also be followed by sort by to order the rows within each reducer.
First, a standalone distribute by exercise:
-- create a student info table in the dwd layer
CREATE TABLE `dwd_rs_tbb_studentinfo`(
`sno` int COMMENT 'student number',
`sname` string COMMENT 'name',
`ssex` string COMMENT 'gender',
`sage` string COMMENT 'age',
`classid` string COMMENT 'class id',
`classname` string COMMENT 'class name')
COMMENT 'dwd student info table'
PARTITIONED BY (
`event_week` int,
`event_day` string,
`event_hour` string)
row format delimited fields terminated by ','
lines terminated by '\n'
LOCATION
'/hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo'
TBLPROPERTIES ('parquet.compression'='snappy')
;
-- request two reducers and send students of the same gender to the same reducer
set mapred.reduce.tasks = 2;
insert overwrite table dwd_rs_tbb_studentinfo partition(event_week='25',event_day='20200618',event_hour='00')
select
t1.sno
,t1.sname
,t1.ssex
,t1.sage
,t2.classid
,t2.classname
from dw.ods_rs_basic_tbd_student t1
inner join dw.ods_rs_basic_tbd_class t2
on t1.classid=t2.classid
and t2.event_day='20200618'
where t1.event_day='20200618'
distribute by t1.ssex
;
The table dwd_rs_tbb_studentinfo ends up with two HDFS files, one per reducer. cat-ing the two files shows that each contains a single gender, which confirms the idea: distribute by sends rows with the same key to the same reducer. The shell session and results follow.
Note: there is a bit of luck involved here. Hive's default hash partitioning happened to put the male and female students on different reducers. If the distribute by key values were unlucky, all rows could land on one reducer while the other produced 0 records; that is the classic data-skew situation. Even then, the guarantee still holds: rows with the same distribute by key always end up on the same reducer. Keep that distinction in mind.
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -ls /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/
Found 2 items
-rwxr-xr-x 3 hadoop supergroup 94 2020-07-03 15:13 /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000000_0
-rwxr-xr-x 3 hadoop supergroup 62 2020-07-03 15:13 /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000001_0
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -cat /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000001_0
3,小丽,女,14,7,初一(6)班
8,大瑶,女,14,7,初一(6)班
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -cat /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000000_0
1,小明,男,15,6,初一(5)班
10,大佬,男,15,4,初一(4)班
6,大林,男,14,3,初一(1)班
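The "coincidence" note is worth making concrete: Hive's default partitioner is essentially reducer = hash(key) mod numReducers. Which bucket a given key lands in is arbitrary, so two keys may share one reducer (skew) or split across two, but a single key can never span reducers. A hypothetical Python sketch of that mechanism (Python's string hash stands in for Hive's key hash and is randomized per process, which mirrors the arbitrariness):

```python
from collections import defaultdict

def reducer_for(key, num_reducers):
    # Shape of Hive's default partitioner: hash the key, mod reducer count.
    return hash(key) % num_reducers

rows = [("male", 1), ("female", 3), ("male", 6), ("female", 8), ("male", 10)]
num_reducers = 2

buckets = defaultdict(list)
for gender, sno in rows:
    buckets[reducer_for(gender, num_reducers)].append((gender, sno))

# A key may share a reducer with other keys (possible skew), but a single
# key is never split across reducers.
key_to_buckets = defaultdict(set)
for bucket_id, bucket_rows in buckets.items():
    for gender, _ in bucket_rows:
        key_to_buckets[gender].add(bucket_id)
print(dict(key_to_buckets))
```

Run this a few times and the bucket assignments move around, but each gender always maps to exactly one bucket.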
As mentioned earlier, distribute by can be followed by sort by to order the rows inside each reducer. Suppose that, on top of processing each gender on its own reducer, we also want each reducer's rows sorted by age: just append sort by t1.sage to the previous Hive SQL, like so:
set mapred.reduce.tasks = 2;
insert overwrite table dwd_rs_tbb_studentinfo partition(event_week='25',event_day='20200618',event_hour='00')
select
t1.sno
,t1.sname
,t1.ssex
,t1.sage
,t2.classid
,t2.classname
from dw.ods_rs_basic_tbd_student t1
inner join dw.ods_rs_basic_tbd_class t2
on t1.classid=t2.classid
and t2.event_day='20200618'
where t1.event_day='20200618'
distribute by t1.ssex
sort by t1.sage
;
Querying the table's files again: there are still two, one per reducer. cat-ing them shows each file holds a single gender, and this time the rows are sorted by age. The shell session and results follow:
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -ls /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/
Found 2 items
-rwxr-xr-x 3 hadoop supergroup 94 2020-07-03 16:16 /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000000_0
-rwxr-xr-x 3 hadoop supergroup 62 2020-07-03 16:16 /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000001_0
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -cat /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000000_0
6,大林,男,14,3,初一(1)班
10,大佬,男,15,4,初一(4)班
1,小明,男,15,6,初一(5)班
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -cat /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000001_0
8,大瑶,女,14,7,初一(6)班
3,小丽,女,14,7,初一(6)班
A handy trick with distribute by
Adding distribute by rand() to a SQL statement can mitigate data skew. The Hive SQL looks like this:
select
t1.sno
,t1.sname
,t1.ssex
,t1.sage
,t2.classid
,t2.classname
from dw.ods_rs_basic_tbd_student t1
inner join dw.ods_rs_basic_tbd_class t2
on t1.classid=t2.classid
and t2.event_day='20200618'
where t1.event_day='20200618'
distribute by rand()
sort by t1.sage
;
The same trick applies to Spark SQL:
val dataDF = sqlContext.sql(
"select A, B from your_table distribute by rand()"
)
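Why rand() helps with skew: the partition key becomes a fresh random number per row, so rows spread roughly evenly across reducers no matter how skewed the original key is. A small Python sketch of the idea (the key names and row counts are invented for illustration, not taken from the tables above):

```python
import random

random.seed(42)  # fixed seed so the demo is repeatable

# A heavily skewed key: 999 rows share one value, 1 row has another.
rows = ["hot_key"] * 999 + ["rare_key"]
num_reducers = 4

# distribute by <key>: every "hot_key" row piles onto the same reducer.
by_key = [0] * num_reducers
for k in rows:
    by_key[hash(k) % num_reducers] += 1

# distribute by rand(): each row gets an independent random partition.
by_rand = [0] * num_reducers
for _ in rows:
    by_rand[random.randrange(num_reducers)] += 1

print("by key :", by_key)   # one reducer does almost all the work
print("by rand:", by_rand)  # work is spread roughly evenly
```

The trade-off: after distribute by rand(), rows with the same key no longer land on the same reducer, so this only fits stages that don't need per-key grouping.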
CLUSTER BY
If the columns in distribute by and sort by are exactly the same and the sort is ascending, the combination is equivalent to cluster by. The Hive SQL below sends rows with the same class id to the same reducer and sorts each reducer's output by class id:
set mapred.reduce.tasks = 2;
insert overwrite table dwd_rs_tbb_studentinfo partition(event_week='25',event_day='20200618',event_hour='00')
select
t1.sno
,t1.sname
,t1.ssex
,t1.sage
,t2.classid
,t2.classname
from dw.ods_rs_basic_tbd_student t1
inner join dw.ods_rs_basic_tbd_class t2
on t1.classid=t2.classid
and t2.event_day='20200618'
where t1.event_day='20200618'
cluster by t2.classid
;
Querying the table's files: again two, one per reducer. cat-ing them shows that each class id is confined to a single file and each file is sorted by class id. The shell session and results follow:
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -ls /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/
Found 2 items
-rwxr-xr-x 3 hadoop supergroup 63 2020-07-03 17:43 /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000000_0
-rwxr-xr-x 3 hadoop supergroup 93 2020-07-03 17:43 /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000001_0
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -cat /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000001_0
6,大林,男,14,3,初一(1)班
8,大瑶,女,14,7,初一(6)班
3,小丽,女,14,7,初一(6)班
(base) [hadoop@dw-test-cluster-007 src]$ hadoop fs -cat /hive/warehouse/dwd/rs/dwd_rs_tbb_studentinfo/event_week=25/event_day=20200618/event_hour=00/000000_0
10,大佬,男,15,4,初一(4)班
1,小明,男,15,6,初一(5)班
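The two properties visible in that output, each classid confined to one file and each file sorted by classid, are exactly the cluster by contract (distribute by k plus sort by k ascending). An illustrative Python check, using a toy shuffle rather than Hive internals:

```python
def distribute_and_sort(rows, key, num_reducers):
    """Partition rows by hash(key), then sort inside each partition --
    the shape of "distribute by k sort by k", i.e. "cluster by k"."""
    parts = [[] for _ in range(num_reducers)]
    for r in rows:
        parts[hash(key(r)) % num_reducers].append(r)
    for p in parts:
        p.sort(key=key)
    return parts

# (sno, classid) pairs loosely modeled on the sample data above.
rows = [(3, 7), (8, 7), (1, 6), (10, 4), (6, 3)]
classid = lambda r: r[1]
parts = distribute_and_sort(rows, classid, 2)

# Within each reducer, rows come out sorted by classid...
for p in parts:
    keys = [classid(r) for r in p]
    assert keys == sorted(keys)

# ...and a given classid never spans two partitions.
owner = {}
for i, p in enumerate(parts):
    for r in p:
        assert owner.setdefault(classid(r), i) == i
print(parts)
```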
That covers the differences between ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY; take your time digesting them.