row_number结合case when选择性排序

不会Hive的啊扬

已于 2023-11-16 00:31:24 修改

阅读量429

点赞数 1

文章标签： sql hive spark

于 2023-11-16 00:29:55 首次发布

本文链接：https://blog.csdn.net/m0_70949976/article/details/134430050

版权

1、应用场景

在BI中我们经常遇到一个问题，那就是有时候业务端会让我们展示前十名，有时候又要前二十名，甚至有时候又要有条件的筛选后的排名，而且多个报表，要不同维度的前多少名，这时如果在BI端一个个进行where过滤再开窗排序，查询响应会有一定延迟，要是能在sql中处理就好了

2、建表语句

比如有这么一个需求，需要将上月整月收礼排名，总收入环比增长排名，总收入环比下降排名，每日收礼排名分别做四个图表，且环比那两个图表又要求只要上月总收入在100万以上的工会的环比前十名，首先不管那么多，我用一个表出，分别给四个排序字段，这样BI端只需要过滤出想要的排名就可以了

 CREATE TABLE `al_north_star_de_id_unions_diamond_summary`(
     `union_id` int COMMENT '工会id',
     `union_name` string COMMENT '工会名称',
     `total_this_day` bigint COMMENT '当日总收入',
     `total_last_month_whole` bigint COMMENT '上月整月总收入',
     `total_this_month` bigint COMMENT '当月总收入',
     `total_last_month` bigint COMMENT '上月总收入',
     `total_inc` float COMMENT '总收入环比',
     `total_dif` bigint COMMENT '总收入差值',
     `gift_this_month` bigint COMMENT '当月语音房礼物收入',
     `gift_last_month` bigint COMMENT '上月语音房礼物收入',
     `gift_inc` float COMMENT '语音房礼物收入环比',
     `gift_dif` bigint COMMENT '语音房礼物收入差值',
     `exchange_this_month` bigint COMMENT '当月兑换消耗',
     `exchange_last_month` bigint COMMENT '上月兑换消耗',
     `exchange_inc` float COMMENT '兑换消耗环比',
     `exchange_dif` bigint COMMENT '兑换消耗差值',
     `balance_this_month` bigint COMMENT '当月钻石余额',
     `balance_last_month` bigint COMMENT '上月钻石余额',
     `balance_inc` float COMMENT '钻石余额环比',
     `balance_dif` bigint COMMENT '钻石余额差值',
     `total_last_month_whole_rn` int COMMENT '上月整月收礼排名',
     `total_inc_add_rn` int COMMENT '环比增长排名',
     `total_inc_sub_rn` int COMMENT '环比下降排名',
     `total_this_day_rn` int COMMENT '每日收礼排名')
 COMMENT '各工会主播报表汇总数据'
 PARTITIONED BY (`dt` string)
 row format delimited fields terminated by ','
 STORED AS PARQUET
 TBLPROPERTIES('parquet.compression'='SNAPPY', 'author' = 'ayang');

3、查询语句

首先有关收入的，而且需要频繁用到本月收入，上月收入的，我们应当在下层做临时表提前聚合一下，每天都统计一份本月1号截止当前日期，上月1号截止到上月本天，以及上月整月，这样我们就不必再每个报表层逐一重复计算了，如下所示tl表就是做的这个事情

insert overwrite table al_north_star_de_id_unions_diamond_summary partition (dt = '${hiveconf:part}')
select
    union_id,
    union_name,
    total_this_day,
    total_last_month_whole,
    total_this_month,
    total_last_month,
    total_inc,
    total_dif,
    gift_this_month,
    gift_last_month,
    gift_inc,
    gift_dif,
    exchange_this_month,
    exchange_last_month,
    exchange_inc,
    exchange_dif,
    balance_this_month,
    balance_last_month,
    balance_inc,
    balance_dif,
    case
        when total_last_month_whole_rn is not null
            then row_number() over(order by total_last_month_whole_rn desc)
        else null
    end as total_last_month_whole_rn,
    case
        when total_inc_add_rn is not null
            then row_number() over(order by total_inc_add_rn desc)
        else null
    end as total_inc_add_rn,
    case
        when total_inc_sub_rn is not null
            then row_number() over(order by total_inc_sub_rn desc)
        else null
    end as total_inc_sub_rn,
    case
        when total_this_day_rn is not null
            then row_number() over(order by total_this_day_rn desc)
        else null
    end as total_this_day_rn
from
    (
    select
        union_id,
        union_name,
        total_this_day,
        total_last_month_whole,
        total_this_month,
        total_last_month,
        round(total_this_month / total_last_month - 1, 2) as total_inc,
        total_this_month - total_last_month as total_dif,
        gift_this_month,
        gift_last_month,
        round(gift_this_month / gift_last_month - 1, 2) as gift_inc,
        gift_this_month - gift_last_month as gift_dif,
        exchange_this_month,
        exchange_last_month,
        round(exchange_this_month / exchange_last_month - 1, 2) as exchange_inc,
        exchange_this_month - exchange_last_month as exchange_dif,
        balance_this_month,
        balance_last_month,
        round(balance_this_month / balance_last_month - 1, 2) as balance_inc,
        balance_this_month - balance_last_month as balance_dif,
        case
            when union_id != 'All'
                then row_number() over(order by total_last_month_whole)
            else null
        end as total_last_month_whole_rn,
        case
            when union_id != 'All' and total_last_month_whole >= 1000000 and total_this_month / total_last_month - 1 >= 0
                then row_number() over(order by total_this_month / total_last_month - 1)
            else null
        end as total_inc_add_rn,
        case
            when union_id != 'All' and total_last_month_whole >= 1000000 and total_this_month / total_last_month - 1 < 0
                then row_number() over(order by total_this_month / total_last_month - 1 desc)
            else null
        end as total_inc_sub_rn,
        case
            when union_id != 'All'
                then row_number() over(order by total_this_day)
            else null
        end as total_this_day_rn
    from
        (
            select
                nvl(union_id, 'All') as union_id,
                nvl(union_name, 'All') as union_name,
                sum(if(dt = '${hiveconf:part}', salary_wage_day, 0)) as total_this_day,
                sum(if(dt = getlastmonthpart('${hiveconf:part}', 3), salary_wage, 0)) as total_last_month_whole,
                sum(if(dt = '${hiveconf:part}', salary_wage, 0)) as total_this_month,
                sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), salary_wage, 0)) as total_last_month,
                sum(if(dt = '${hiveconf:part}', audio_gift, 0)) as gift_this_month,
                sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), audio_gift, 0)) as gift_last_month,
                sum(if(dt = '${hiveconf:part}', diamond_exchange, 0)) as exchange_this_month,
                sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), diamond_exchange, 0)) as exchange_last_month,
                sum(if(dt = '${hiveconf:part}', diamond_balance, 0)) as balance_this_month,
                sum(if(dt = getlastmonthpart('${hiveconf:part}', 2), diamond_balance, 0)) as balance_last_month
            from
                tl_north_star_de_vj_live_diamond_summary_month
            where dt in ('${hiveconf:part}', getlastmonthpart('${hiveconf:part}', 2), getlastmonthpart('${hiveconf:part}', 3)) and country = 'ID'
            group by union_id, union_name
                grouping sets((), (union_id, union_name))
        )t
    )tt

1、最里层t表：

1、getlastmonthpart()函数为自定义udf函数，接收两个参数，第一个参数为日期，第二个参数为2时为上月同期，为3时为上月整月

2、where完成对三个分区的过滤，因为要进行环比差值计算，之后sumif对数据进行行转列

3、grouping sets对不同维度做聚合，此时我们无需单独的union_id或union_name维度，因为这两个其实是同一个维度同一回事，因此不使用with cube

4、无论是grouping sets还是with cube后必定会产生Null值，我们在select中使用nvl用ALL替换Null值

2、次里层tt表：

1、这层我们要完成row_number排序，但由于有特殊情况，某些排序只排上月总收入在100万以上的工会，这种定制化排序，并不是所有row_number都要进行的，因此我们不能对所有数据进行过滤，而应该进行有条件的排序，也就是case when结合row_number

2、由于数据都带有一个ALL值，但这个ALL值我们是肯定不希望让其参与排序的，故对每个排序都进行case when处理，然后那两个特殊的排序再额外加其他条件

3、但要知道就算我用case when过滤了干扰行，但干扰行仍然会进入排序，该排第几排第几，只是最后rn被强行置为null而已,此时排序队列为散列，例如(null,2,null,null,5,null,7)，这里的257并非聚合的数据257，只是排序放入的对应位置即rn，目标是外层再将散列rn重新排列即(2->1, 5->2, 7->3, null->null)，也就是说把原本第2个位置的放在第1位，但由于row_number函数会认为null最小，结果为(null,null,null,null,2,5,7),无法取出真正前三257，此时的257实际rn为567，因此order by中与实际排序逻辑相反，外层再取反（负负得正，结果仍为实际的排序逻辑）即可解决，即(7,null,5,null,null,2,null)，这里的7为原本应该rn为7的，其实际在这里rn为1，外层再从大到小倒序排(2,5,7,null,null,null,null),这样null就会被扔在最后，给真正的前三打上排序标记123

3、最外层：

还是老规矩，不想让内层rn标为null的参与排序，直接对其使用case when强行赋null，由于order by都是从大到小排，这里的Null就不会对实际排名产生干扰了，BI端也不选null，不必做理会

4、优化思路

其实这里面是case when后在order by排序，其实的排序仍然该排第几排第几，只是最后rn被强行置为null而已，而且最后为了处理这个null值可谓是，非常麻烦且难以理解，那么有没有什么更好的思路去处理呢？

我们不妨在partion by中也做case when过滤，此时虽然也产生了null值，但null值并不会对数据的rn造成干扰，因为放入了不同的组，即原本的(null,2,null,null,5,null,7)现在为(null,1,null,null,2,null,3)，BI端直接过滤我需要的排名即可，而且我也不需要先倒排再从大到小再取rn，我直接少去了最外一层的rn排序，此时再回过头来看看刚刚的屎山代码，简直，遥遥领先！但是没有前面的第一种思路做铺垫，第二种思路也是很难想出来的

case
    when union_id != 'All' then row_number() over(partition by
        case
            when union_id != 'All' then 1 
            else 2
        end
        order by total_last_month_whole desc)
    else null
end as total_last_month_whole_rn,