[hive] sort by limit and order by limit

胖胖学编程

Last modified: 2023-03-13 10:59:14

First published: 2023-03-09 10:29:16

Article link: https://blog.csdn.net/qq_35896718/article/details/129417515

Reference: https://blog.csdn.net/sinat_30371347/article/details/121558221

I. Problem

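Before looking into this, I had assumed that sort by limit only produces ordering within each reducer and cannot give a globally sorted result.

That is true of sort by on its own: it only guarantees order within each individual reducer.

sort by limit, however, returns a global top N. It runs two MR jobs: in the first, every reducer sorts its own rows and keeps its top N; in the second, the per-reducer top-N results are merged and sorted, and the top N is taken once more over the whole set.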
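The two-job behaviour described above can be imitated outside Hive with a minimal Python sketch. It is only a simulation, not Hive code: the function name `sort_by_limit` is hypothetical, and the round-robin split of rows across reducers is an assumption standing in for however Hive actually distributes them.

```python
import heapq

def sort_by_limit(rows, num_reducers, n):
    """Simulate `SORT BY key LIMIT n` as two stages (illustrative only)."""
    # Stage 1: spread rows over the reducers (round-robin here, an assumption),
    # then each reducer sorts its own rows and keeps only its first n.
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]
    local_top_n = [sorted(part)[:n] for part in partitions]
    # Stage 2: a single reducer merges the sorted partial results
    # and keeps the first n of the merged stream.
    return list(heapq.merge(*local_top_n))[:n]

rows = [9, 2, 7, 4, 1, 8, 3, 6, 5, 0]
print(sort_by_limit(rows, num_reducers=3, n=3))  # [0, 1, 2]
```

The result matches the global top N because every row in the global top N is also among the top N of whichever reducer received it, so none of them is dropped in the first stage.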

order by limit computes the global top N in a single reducer.

II. Analysis

Note: how to view the execution plan:

https://blog.csdn.net/qq_35896718/article/details/129418741?spm=1001.2014.3001.5501

1. sort by limit

explain
create table test.k2 as
select
id,
tm_time,
tm_diqu,
tm_wan_zheng_indicator,
tm_biao_zhun_value,
tm_biao_zhun_unit,
di_yu_dai_ma,
str
from
(
    select
    id,
    tm_time,
    tm_diqu,
    tm_wan_zheng_indicator,
    tm_biao_zhun_value,
    tm_biao_zhun_unit,
    di_yu_dai_ma,
    substr(tm_time,1,4) str
    from
    ods.ods_tablemeta_year
    where cast(substr(tm_time,1,4) as int) is not null -- keep rows whose first 4 characters of tm_time are numeric
)t
sort by tm_time asc
limit 1000
;

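The plan contains two MR stages.

It is best not to set the number of reducers manually. With the setting below, the job runs with 20 reducers in total:

set mapred.reduce.tasks = 20;

The default is -1, which lets Hive choose the number itself; here it picked more than 700 reducers.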
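To see what the reducer count means for the second stage, here is a small back-of-the-envelope sketch. It assumes, as described for sort by limit, that each first-stage reducer emits at most N rows; the helper name is hypothetical and 700 is used only as a round stand-in for the "more than 700" reducers observed.

```python
def max_rows_into_second_stage(num_reducers, limit):
    # Each first-stage reducer emits at most `limit` rows,
    # so this is an upper bound, not an exact count.
    return num_reducers * limit

print(max_rows_into_second_stage(20, 1000))   # 20000
print(max_rows_into_second_stage(700, 1000))  # 700000
```

Either way, the second stage only has to sort the per-reducer top-N rows rather than the whole table.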

2. order by limit

explain
create table test.k2 as
select
id,
tm_time,
tm_diqu,
tm_wan_zheng_indicator,
tm_biao_zhun_value,
tm_biao_zhun_unit,
di_yu_dai_ma,
str
from
(
    select
    id,
    tm_time,
    tm_diqu,
    tm_wan_zheng_indicator,
    tm_biao_zhun_value,
    tm_biao_zhun_unit,
    di_yu_dai_ma,
    substr(tm_time,1,4) str
    from
    ods.ods_tablemeta_year
    where cast(substr(tm_time,1,4) as int) is not null -- keep rows whose first 4 characters of tm_time are numeric
)t
order by tm_time asc
limit 1000
;

The plan contains only one MR stage.
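For comparison, order by limit can be imitated in Python as a single reducer that sees every row. This is a simplified sketch with a hypothetical function name, not a description of Hive's internals.

```python
def order_by_limit(rows, n):
    """Simulate `ORDER BY key LIMIT n` with one reducer (illustrative only)."""
    # All rows arrive at the same reducer, which sorts them
    # and keeps the first n.
    return sorted(rows)[:n]

print(order_by_limit([9, 2, 7, 4, 1, 8, 3, 6, 5, 0], n=3))  # [0, 1, 2]
```

There is no per-reducer pre-selection step here, which is consistent with the plan showing a single MR stage.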

3. sort by

explain
create table test.k2 as
select
id,
tm_time,
tm_diqu,
tm_wan_zheng_indicator,
tm_biao_zhun_value,
tm_biao_zhun_unit,
di_yu_dai_ma,
str
from
(
    select
    id,
    tm_time,
    tm_diqu,
    tm_wan_zheng_indicator,
    tm_biao_zhun_value,
    tm_biao_zhun_unit,
    di_yu_dai_ma,
    substr(tm_time,1,4) str
    from
    ods.ods_tablemeta_year
    where cast(substr(tm_time,1,4) as int) is not null -- keep rows whose first 4 characters of tm_time are numeric
)t
sort by tm_time asc
;

The plan contains only one MR stage.
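Why sort by without limit is not globally ordered can be shown with a small Python simulation. The function name `sort_by` is hypothetical, and the round-robin assignment of rows to reducers is an assumption made so the example is deterministic.

```python
def sort_by(rows, num_reducers):
    """Simulate `SORT BY key`: each reducer sorts only its own rows."""
    # Round-robin assignment of rows to reducers (an assumption).
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(part) for part in partitions]

outputs = sort_by([9, 2, 7, 4, 1, 8, 3, 6, 5, 0], num_reducers=3)
print(outputs)  # [[0, 3, 4, 9], [1, 2, 6], [5, 7, 8]]

# Reading the reducer outputs one after another is not a global order.
concatenated = [row for part in outputs for row in part]
print(concatenated == sorted(concatenated))  # False
```

Each reducer's output is sorted, but nothing merges them, so one MR stage is enough and the combined result has no global order.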