Hive的order by、sort by、distribute by和cluster by
文章目录
作用
order by
只产生一个reduce,在reduce中全局排序。
数据量大的时候集中在一个reduce就会运行的很慢。
如果指定了严格模式hive.mapred.mode=strict;
(默认是nonstrict),这时就必须指定limit来限制输出条数。
sort by
sort by是分组排序,在有多个reduce输出的情况下,它保证每个reduce的输出结果有序。
只有一个reduce时,和order by作用一样。
设置reduce数量为3:set mapreduce.job.reduces=3;
数据量大的时候使用sort by将数据分散到多个reduce中排序后,再进行order by全局排序会提高计算速度。
distribute by
distribute by会根据指定列的字段的hash值,在map端将数据分散到不同的reduce中,相同hash值的在一个组中,常与sort by 一起使用。
需要设置reduce数量。因为sort by保证每个reduce输出结果有序,默认一个reduce。
distribute by + sort by的结果是分组内有序而全局无序的。
distribute by和group by区别?
有人可能觉得distribute by和group by很像,distribute by
是根据hash值分组,把数据分散给不同的reduce,与sort by一起使用;group by
是根据实际的值,要与sum等聚合函数一起使用。
Distribute by和sort by的使用场景
1.Map输出的文件大小不均。
2.Reduce输出文件大小不均。
3.小文件过多。
4.文件超大。
distribute by rand() 可以防止数据倾斜
当 distribute by 后边跟的列是:rand() 随机数时,即保证每个分区的数据量基本一致。rand()生成随机数的函数。
按照随机数分组可以将数据随机分配到不同的reduce中,同时保证每个分区的数据量基本一致。需要设置reduce数量。
cluster by
把有相同值的数据分散到一组,并排序,效果等同于distribute by + sort by 。
即 cluster by col 等同于distribute by col sort by col
但是排序只能是默认升序排序,不能指定排序规则为asc 或者desc。
示例
准备测试数据
创建一张order表,字段分别为name,orderdate,orderprice
create table ordtable
(
name string,
orderdate date,
orderprice double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE ;
插入测试数据到表中
insert overwrite table ordtable
select 'Jack','2021-01-01',10
union all
select 'ROSE','2021-02-01',15
union all
select 'ROSE','2021-03-01',20
union all
select 'Bob','2021-04-01',25
union all
select 'Bob','2021-03-11',30
union all
select 'Bob','2021-03-21',20
union all
select 'Adam','2021-04-11',15
union all
select 'Adam','2021-03-11',25
union all
select 'Adam','2021-04-21',10
union all
select 'Adam','2021-04-25',5;
测试数据如下
+----------------+---------------------+----------------------+--+
| ordtable.name | ordtable.orderdate | ordtable.orderprice |
+----------------+---------------------+----------------------+--+
| Jack | 2021-01-01 | 10.0 |
| ROSE | 2021-02-01 | 15.0 |
| ROSE | 2021-03-01 | 20.0 |
| Bob | 2021-04-01 | 25.0 |
| Bob | 2021-03-11 | 30.0 |
| Bob | 2021-03-21 | 20.0 |
| Adam | 2021-04-11 | 15.0 |
| Adam | 2021-03-11 | 25.0 |
| Adam | 2021-04-21 | 10.0 |
| Adam | 2021-04-25 | 5.0 |
+----------------+---------------------+----------------------+--+
order by
一个reduce,全局排序
select * from ordtable order by orderprice;
+----------------+---------------------+----------------------+--+
| ordtable.name | ordtable.orderdate | ordtable.orderprice |
+----------------+---------------------+----------------------+--+
| Adam | 2021-04-25 | 5.0 |
| Adam | 2021-04-21 | 10.0 |
| Jack | 2021-01-01 | 10.0 |
| Adam | 2021-04-11 | 15.0 |
| ROSE | 2021-02-01 | 15.0 |
| Bob | 2021-03-21 | 20.0 |
| ROSE | 2021-03-01 | 20.0 |
| Adam | 2021-03-11 | 25.0 |
| Bob | 2021-04-01 | 25.0 |
| Bob | 2021-03-11 | 30.0 |
+----------------+---------------------+----------------------+--+
sort by
先设置reduce数为2个,默认1个看不出效果。2个reduce,每个reduce组内排序。
set mapreduce.job.reduces=2;
select * from ordtable sort by orderprice;
+----------------+---------------------+----------------------+--+
| ordtable.name | ordtable.orderdate | ordtable.orderprice |
+----------------+---------------------+----------------------+--+
| Adam | 2021-04-25 | 5.0 |
| Jack | 2021-01-01 | 10.0 |
| ROSE | 2021-03-01 | 20.0 |
| Adam | 2021-03-11 | 25.0 |
| Bob | 2021-04-01 | 25.0 |
| Bob | 2021-03-11 | 30.0 |
| Adam | 2021-04-21 | 10.0 |
| Adam | 2021-04-11 | 15.0 |
| ROSE | 2021-02-01 | 15.0 |
| Bob | 2021-03-21 | 20.0 |
+----------------+---------------------+----------------------+--+
distribute by
求按照月份分组,求出了每个月的订单价格升序排序?
按照月份在map端分组,共有4个月份,设置了4个reduce任务,每个reduce组内按照价格升序序排序。默认一个reduce。
set mapreduce.job.reduces=4;
select * from ordtable distribute by month(orderdate) sort by orderprice asc;
0: jdbc:hive2://hadoop91:10000> select * from ordtable distribute by month(orderdate) sort by orderprice asc;
+----------------+---------------------+----------------------+--+
| ordtable.name | ordtable.orderdate | ordtable.orderprice |
+----------------+---------------------+----------------------+--+
| Adam | 2021-04-25 | 5.0 |
| Adam | 2021-04-21 | 10.0 |
| Adam | 2021-04-11 | 15.0 |
| Bob | 2021-04-01 | 25.0 |
| Jack | 2021-01-01 | 10.0 |
| ROSE | 2021-02-01 | 15.0 |
| Bob | 2021-03-21 | 20.0 |
| ROSE | 2021-03-01 | 20.0 |
| Adam | 2021-03-11 | 25.0 |
| Bob | 2021-03-11 | 30.0 |
+----------------+---------------------+----------------------+--+
reduce为默认1个时,就相当于全局排序了,就只会按照价格排序,结果如下
set mapreduce.job.reduces=1;
select * from ordtable distribute by month(orderdate) sort by orderprice asc;
0: jdbc:hive2://hadoop91:10000> select * from ordtable distribute by month(orderdate) sort by orderprice asc;
+----------------+---------------------+----------------------+--+
| ordtable.name | ordtable.orderdate | ordtable.orderprice |
+----------------+---------------------+----------------------+--+
| Adam | 2021-04-25 | 5.0 |
| Adam | 2021-04-21 | 10.0 |
| Jack | 2021-01-01 | 10.0 |
| Adam | 2021-04-11 | 15.0 |
| ROSE | 2021-02-01 | 15.0 |
| Bob | 2021-03-21 | 20.0 |
| ROSE | 2021-03-01 | 20.0 |
| Adam | 2021-03-11 | 25.0 |
| Bob | 2021-04-01 | 25.0 |
| Bob | 2021-03-11 | 30.0 |
+----------------+---------------------+----------------------+--+
cluster by
select * from ordtable distribute by orderprice sort by orderprice asc;
等同于
select * from ordtable cluster by orderprice ;