一、导言
order by和sort by都是hive中的排序方式,区别在于order by只有一个reduce对全表数据进行排序,sort by会按照key值分区在不同的reduce内部排序。那么对排序结果取limit的话,Hive会怎么处理呢?
二、执行计划比较
数据表myhive.score
select * from score;
score.s_id score.c_id score.s_score score.month
06 04 80 201806
06 05 80 201806
06 06 80 201806
04 05 80 201806
06 03 34 201806
07 02 89 201806
07 03 98 201806
1、order by limit
在日志中,该任务只有一个job,且Hive在编译时即已确认reduce数量为1,所以limit对reduce中所有数据排序结果取topN。
hive (myhive)> select * from score order by s_id limit 4;
2021-11-26 13:52:27,404 INFO [ac29ef6d-3004-4b12-8f18-e7f01a936167 main] ql.Driver (SessionState.java:printInfo(1227)) - Total jobs = 1
2021-11-26 09:39:49,015 INFO [ac29ef6d-3004-4b12-8f18-e7f01a936167 main] ql.Driver (Driver.java:launchTask(2662)) - Starting task [Stage-1:MAPRED] in serial mode
Number of reduce tasks determined at compile time: 1
2021-11-26 09:39:49,029 INFO [ac29ef6d-3004-4b12-8f18-e7f01a936167 main] exec.Task (SessionState.java:printInfo(1227)) - Number of reduce tasks determined at compile time: 1
#编译时确定的reduce任务数:1
执行计划中,只对reduce结果进行limit操作
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: score
Statistics: Num rows: 4 Data size: 2340 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: s_id (type: string), c_id (type: string), s_score (type: int), month (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4 Data size: 2340 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (<

最低0.47元/天 解锁文章
823

被折叠的 条评论
为什么被折叠?



