HiveSQL Tip of the Day: When Cleaning Data, How Do You Fill a Sparse Field Within a Group Using the Rows That Have a Value?

0 Requirement

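1 Requirement Analysis

The requirement: within each group, take the second-to-last value in the score ranking as a new field; when a group has no second-to-last row, keep the current value.

If the task were only to find the second-to-last value after sorting within a group, it would be simple: row_number() does the job. The catch is that the current value must be kept when there is no second-to-last row. How can that be done elegantly?

Start by numbering the rows with row_number():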


with data as
         (select 111 as stu_id, 'class1' as class_name, 69 as score
          union all
          select 113 as stu_id, 'class1' as class_name, 74 as score
          union all
          select 112 as stu_id, 'class1' as class_name, 80 as score
          union all
          select 115 as stu_id, 'class1' as class_name, 93 as score
          union all
          select 114 as stu_id, 'class1' as class_name, 94 as score
          union all
          select 124 as stu_id, 'class2' as class_name, 70 as score
          union all
          select 121 as stu_id, 'class2' as class_name, 74 as score
          union all
          select 123 as stu_id, 'class2' as class_name, 78 as score
          union all
          select 122 as stu_id, 'class2' as class_name, 86 as score
          union all
          select 9999 as stu_id, 'class3' as class_name, 99 as score
         )
select stu_id
     , class_name
     , score
     , row_number() over (partition by class_name order by score desc) as rn1
from data

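Given that result, how do we pick out the second-to-last value? Wrap the query in an outer layer that uses case when rn1 = 2 then score end, and look at the effect: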


with data as
         (select 111 as stu_id, 'class1' as class_name, 69 as score
          union all
          select 113 as stu_id, 'class1' as class_name, 74 as score
          union all
          select 112 as stu_id, 'class1' as class_name, 80 as score
          union all
          select 115 as stu_id, 'class1' as class_name, 93 as score
          union all
          select 114 as stu_id, 'class1' as class_name, 94 as score
          union all
          select 124 as stu_id, 'class2' as class_name, 70 as score
          union all
          select 121 as stu_id, 'class2' as class_name, 74 as score
          union all
          select 123 as stu_id, 'class2' as class_name, 78 as score
          union all
          select 122 as stu_id, 'class2' as class_name, 86 as score
          union all
          select 9999 as stu_id, 'class3' as class_name, 99 as score
         )
select stu_id
     , class_name
     , score
     , case when rn1 = 2 then score end as res
from (
         select stu_id
              , class_name
              , score
              , row_number() over (partition by class_name order by score desc ) rn1
              --, row_number() over (partition by class_name order by score  ) rn2
         from data
     ) t

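The second-to-last value has now been extracted, but the requirement is still not met: the new field must carry that value on every row of the group. How? Here is a small trick that is also a data-cleaning technique for filling a group's NULLs with the value that exists in that group: open a window with max(), that is, max() over (partition by <group column>). Because max() ignores NULLs, every row in the group, including the rows that were NULL, receives the same value. The SQL is as follows: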



with data as
         (select 111 as stu_id, 'class1' as class_name, 69 as score
          union all
          select 113 as stu_id, 'class1' as class_name, 74 as score
          union all
          select 112 as stu_id, 'class1' as class_name, 80 as score
          union all
          select 115 as stu_id, 'class1' as class_name, 93 as score
          union all
          select 114 as stu_id, 'class1' as class_name, 94 as score
          union all
          select 124 as stu_id, 'class2' as class_name, 70 as score
          union all
          select 121 as stu_id, 'class2' as class_name, 74 as score
          union all
          select 123 as stu_id, 'class2' as class_name, 78 as score
          union all
          select 122 as stu_id, 'class2' as class_name, 86 as score
          union all
          select 9999 as stu_id, 'class3' as class_name, 99 as score
         )
select stu_id
     , class_name
     , score
     , max(case when rn1 = 2 then score end) over (partition by class_name) as res
from (
         select stu_id
              , class_name
              , score
              , row_number() over (partition by class_name order by score desc ) rn1
              --, row_number() over (partition by class_name order by score  ) rn2
         from data
     ) t
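
The same pattern works for any sparse column, not only for a value derived from a rank. The following is a minimal sketch under assumed data: the `user_id` and `city` columns and their values are hypothetical and are not part of the example above. It relies on max() skipping NULLs, so the single known value in each group is carried to every row of that group.

```sql
-- Hypothetical sample: city is populated on only one row per user_id.
with sparse as
         (select 1 as user_id, 'beijing' as city
          union all
          select 1 as user_id, cast(null as string) as city
          union all
          select 1 as user_id, cast(null as string) as city
          union all
          select 2 as user_id, cast(null as string) as city
          union all
          select 2 as user_id, 'shanghai' as city
         )
select user_id
     , city
     -- max() ignores NULLs, so the one known value per group reaches every row
     , max(city) over (partition by user_id) as city_filled
from sparse
```

If a group can hold more than one distinct non-NULL value, max() keeps only the largest of them, so this fill is safe only when each group has at most one distinct value to propagate.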

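The result is getting closer to what we want, but what about a group that contains only one value? In class3 no row has rn1 = 2, so res comes out as NULL. We need a helper check here. Options include testing min() = max(), testing percent_rank() = 0, and so on. The idea used here is the min() = max() check: if the largest value equals the smallest, the group holds a single distinct value. The final SQL expresses that check through two dense_rank() columns, one descending and one ascending; in a single-value group both ranks are 1 on every row. The final SQL is as follows: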


with data as
         (select 111 as stu_id, 'class1' as class_name, 69 as score
          union all
          select 113 as stu_id, 'class1' as class_name, 74 as score
          union all
          select 112 as stu_id, 'class1' as class_name, 80 as score
          union all
          select 115 as stu_id, 'class1' as class_name, 93 as score
          union all
          select 114 as stu_id, 'class1' as class_name, 94 as score
          union all
          select 124 as stu_id, 'class2' as class_name, 70 as score
          union all
          select 121 as stu_id, 'class2' as class_name, 74 as score
          union all
          select 123 as stu_id, 'class2' as class_name, 78 as score
          union all
          select 122 as stu_id, 'class2' as class_name, 86 as score
          union all
          select 9999 as stu_id, 'class3' as class_name, 99 as score
         )
select stu_id
     , class_name
     , score
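     , max(case
               when rn1 != rn2 and rn1 = 2 -- descending and ascending ranks differ: take the second-to-last value (the rn1 = 2 row)
                   then score
               when rn1 = rn2 then score   -- descending and ascending ranks are equal: keep the current value
                                           -- (a middle row can also match here; max() still returns the rn1 = 2 score, which is never smaller)
           end) over (partition by class_name) res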
from (
         select stu_id
              , class_name
              , score
              , dense_rank() over (partition by class_name order by score desc) rn1
              , dense_rank() over (partition by class_name order by score) rn2 -- helper column for the check
             -- , percent_rank() over (partition by class_name order by score) pr -- percent_rank() could also serve as the helper (when pr = 0)
         from data
     ) t
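
The min() = max() check can also be written out literally. The following is a sketch, not the author's final query: the column names `min_score`, `max_score`, and `second_score` are introduced here for illustration, and the sample data is shortened to one multi-row group and one single-row group.

```sql
with data as
         (select 115 as stu_id, 'class1' as class_name, 93 as score
          union all
          select 114 as stu_id, 'class1' as class_name, 94 as score
          union all
          select 112 as stu_id, 'class1' as class_name, 80 as score
          union all
          select 9999 as stu_id, 'class3' as class_name, 99 as score
         )
select stu_id
     , class_name
     , score
     , case
           when min_score = max_score then score -- single distinct value in the group: keep the current value
           else second_score                     -- otherwise: the group's rn1 = 2 value
       end as res
from (
         select stu_id
              , class_name
              , score
              , min(score) over (partition by class_name) as min_score
              , max(score) over (partition by class_name) as max_score
              -- broadcast the rn1 = 2 score to every row of the group
              , max(case when rn1 = 2 then score end) over (partition by class_name) as second_score
         from (
                  select stu_id
                       , class_name
                       , score
                       , dense_rank() over (partition by class_name order by score desc) as rn1
                  from data
              ) t
     ) u
```

With this sample, every class1 row should get res = 93 and the single class3 row should get res = 99. Computing the rank in the innermost query keeps one window function from being nested inside another.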

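2 Summary

Using a real-world case, this article showed how to fill in the NULLs within a group by opening a window with min()/max() over (partition by <group column>). The point to watch is the case when expression inside max(): write it, or build whatever condition is needed, to fit the actual requirement. This trick comes up often in data cleaning.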
