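Sorting comes up constantly in day-to-day Hive data warehouse development. Hive supports four sorting-related clauses; this article walks through how each is used and how they differ, using a concrete example: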
- order by
- sort by
- distribute by
- cluster by
Preparation:
Create a test table employInfo:
create table employInfo(deptID int,employID int,employName string,employSalary double)
row format delimited fields terminated by ',';
Load the test data into the test table:
load data local inpath '/home/hadoop/datas/employInfo.txt' into table employInfo;
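One thing to watch for: the data file listed below begins with a column-name line (deptID,employID,...). With the table definition above, that line would be loaded as an ordinary row whose int and double columns come out as NULL. A minimal sketch of one way to handle it, assuming a Hive version that supports the skip.header.line.count table property on text tables:

```sql
-- Assumption: the first line of employInfo.txt is a header, not data.
-- Tell Hive to skip one leading line per file when reading this table.
alter table employInfo set tblproperties ("skip.header.line.count"="1");

-- Quick sanity check: the first rows should now be real employees.
select * from employInfo limit 5;
```

Deleting the header line from the file before loading it would work just as well.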
The test data is shown below:
[hadoop@weekend110 datas]$ cat employInfo.txt
deptID,employID,employName,employSalary
1,1001,Jack01,5000
1,1002,Jack02,5001
1,1003,Jack03,5002
1,1004,Jack04,5003
1,1005,Jack05,5004
1,1006,Jack06,5005
1,1007,Jack07,5006
1,1008,Jack08,5007
1,1009,Jack09,5008
1,1010,Jack10,5009
1,1011,Jack11,5010
1,1012,Jack12,5011
2,1013,Maria01,7500
2,1014,Maria02,7501
2,1015,Maria03,7502
2,1016,Maria04,7503
2,1017,Maria05,7504
2,1018,Maria06,7505
2,1019,Maria07,7506
2,1020,Maria08,7507
2,1021,Maria09,7508
3,1022,Lucy01,8540
3,1023,Lucy02,8541
3,1024,Lucy03,8542
3,1025,Lucy04,8543
3,1026,Lucy05,8544
3,1027,Lucy06,8545
3,1028,Lucy07,8546
3,1029,Lucy08,8547
3,1030,Lucy09,8548
3,1031,Lucy10,8549
3,1032,Lucy11,8550
3,1033,Lucy12,8551
4,1034,Jimmy01,10000
4,1035,Jimmy02,10001
4,1036,Jimmy03,10002
4,1037,Jimmy04,10003
4,1038,Jimmy05,10004
4,1039,Jimmy06,10005
4

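This article covers the four sorting-related clauses in Hive: order by (global sort), sort by (sort within each reducer), distribute by (partitioning of rows across reducers; it does not sort by itself), and cluster by (partitioning plus per-reducer sort). order by guarantees a total order but runs the final sort on a single reducer, so it is best paired with limit; sort by sorts within each reducer and does not guarantee a global order; distribute by controls which reducer each row is sent to and is usually combined with sort by; cluster by is equivalent to distribute by plus sort by on the same column(s), but the sort can only be ascending.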
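The four clauses can be compared side by side on the employInfo table. The queries below are a sketch rather than the author's original examples; the reducer count of 3 and the choice of sort columns are assumptions made for illustration.

```sql
-- order by: one reducer produces a total order, so pair it with limit.
select * from employInfo order by employSalary desc limit 5;

-- sort by: each reducer sorts only its own rows.
-- Assumption: force 3 reducers so the effect is visible.
set mapreduce.job.reduces=3;
select * from employInfo sort by employSalary desc;

-- distribute by + sort by: rows with the same deptID go to the same
-- reducer, then each reducer sorts its rows by salary, highest first.
select * from employInfo distribute by deptID sort by employSalary desc;

-- cluster by: shorthand for "distribute by deptID sort by deptID";
-- the sort is always ascending.
select * from employInfo cluster by deptID;
```

With more than one reducer, only the order by query returns rows in a single global order; the other three are ordered only within each reducer's output.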



