目录
0 需求
表名:t
表字段及内容:
date_id a b c
2014 AB 12 bc
2015 23
2016 d
2017 BC
问题:如何使用最新数据补全表格
输出结果如下所示:
date_id a b c
2014 AB 12 bc
2015 AB 23 bc
2016 AB 23 d
2017 BC 23 d
应用场景:补全稀疏表格,获取用户点击事件时所对应的当前页面【URL】。
1 数据准备
create table t as
select '2014' as date_id,'AB' as a,'12' as b,'bc' as c
UNION ALL
select '2015' as date_id,null as a,'23' as b,null as c
UNION ALL
select '2016' as date_id,null as a,null as b,'d' as c
UNION ALL
select '2017' as date_id,'BC' as a,null as b,null as c
2 问题分析
该问题的本质还是利用随时间变化区分可变和不可变数据的技巧。
sum() over(order by date_id)
SQL如下:
select date_id, max(a) over(partition by grp1) as a, max(b) over(partition by grp2) as b, max(c) over(partition by grp3) as c from ( select date_id, a, b, c, count(a) over(order by date_id) as grp1, count(b) over(order by date_id) as grp2, count(c) over(order by date_id) as grp3 from t )t;
结果如下:
+----------+-----+-----+-----+
| date_id | a | b | c |
+----------+-----+-----+-----+
| 2014 | AB | 12 | bc |
| 2015 | AB | 23 | bc |
| 2016 | AB | 23 | d |
| 2017 | BC | 23 | d |
+----------+-----+-----+-----+
3 小结
本文给出了一种利用当前最新数据补全稀疏表格的方法,该方法常常用于数据清洗当中,比如用户一个session中发生浏览点击事件时候,url往往在浏览事件时候给出而点击事件中往往没有给出,这种表格往往是稀疏的,如果此时想知道用户点击了某个按钮后当前页面是哪个,那么我们往往就需要利用这种方法补全数据来获取当前点击事件所对应的页面。