Hive经典案例之蚂蚁森林(二)

需求

2、蚂蚁森林低碳用户排名分析
问题:查询user_low_carbon表中每日流水记录,条件为:
用户在2017年,连续三天(或以上)的天数里,
每天减少碳排放(low_carbon)都超过100g的用户低碳流水。
需要查询返回满足以上条件的user_low_carbon表中的记录流水。
例如用户u_002符合条件的记录如下,因为2017/1/2~2017/1/5连续四天的碳排放量之和都大于等于100g:

分析:首先根据用户和天数进行分组,过滤掉总能量小于100的用户。
接下来就是比较难理解的一点,就是连续三天,我们怎么做到这个连续三天呢?
我也是在看到第一步的结果之后突然发现的,
假如是连续的天数,那么他们的天数成等差数列,差值为1,
假如我们再按照userid进行开创,排名。用时间减去他们的排名就会得到一个固定的值。
假如他们的日期不是连续的,那么开窗之后相减的差值就不是一个固定的值。
基于以上思路,我们做一个简单的测试。
代码是没有进行优化的,我会把自己的思路一步一步理出来。
1.:和第一问类似,先进行时间格式的优化
 SELECT 
     user_id,
     DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
     low_carbon
 FROM user_low_carbon   t1

+----------+-------------+-------------+
| user_id  |   data_dt   | low_carbon  |
+----------+-------------+-------------+
| u_001    | 2017-01-01  | 10          |
| u_001    | 2017-01-02  | 150         |
| u_001    | 2017-01-02  | 110         |
| u_001    | 2017-01-02  | 10          |
| u_001    | 2017-01-04  | 50          |
| u_001    | 2017-01-04  | 10          |
| u_001    | 2017-01-06  | 45          |
| u_001    | 2017-01-06  | 90          |
| u_002    | 2017-01-01  | 10          |
| u_002    | 2017-01-02  | 150         |
| u_002    | 2017-01-02  | 70          |
| u_002    | 2017-01-03  | 30          |
| u_002    | 2017-01-03  | 80          |
| u_002    | 2017-01-04  | 150         |

2. 按照用户和时间进行分组,过滤掉当天总能量小于100的用户
 SELECT
    t1.user_id,
    t1.data_dt,
    SUM(t1.low_carbon) day_sum_low_carbon
 FROM 
 t1
 GROUP BY 
   t1.user_id,
   t1.data_dt
HAVING day_sum_low_carbon>100    t2


组合:
 SELECT
    t1.user_id,
    t1.data_dt,
    SUM(t1.low_carbon) day_sum_low_carbon
 FROM
 (
  SELECT 
     user_id,
     DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
     low_carbon
 FROM user_low_carbon
 ) 
 t1
 GROUP BY 
   t1.user_id,
   t1.data_dt
HAVING day_sum_low_carbon>100    ====t2
 SELECT
    t1.user_id,
    t1.data_dt,
    SUM(t1.low_carbon) day_sum_low_carbon
 FROM
 (
  SELECT 
     user_id,
     DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
     low_carbon
 FROM user_low_carbon
 ) 
 t1
 GROUP BY 
   t1.user_id,
   t1.data_dt
HAVING day_sum_low_carbon>100
3.根据用户id进行开窗,然后进行排序
SELECT 
   t2.user_id,
   t2.data_dt,
   t2.day_sum_low_carbon,
   row_number()over(PARTITION BY t2.user_id ORDER BY t2.data_dt)
FROM t2

组合:
SELECT 
   t2.user_id,
   t2.data_dt,
   t2.day_sum_low_carbon,
   rank() over(PARTITION BY t2.user_id ORDER BY t2.data_dt) rn
FROM 
(
     SELECT
    t1.user_id,
    t1.data_dt,
    SUM(t1.low_carbon) day_sum_low_carbon
 FROM
 (
  SELECT 
     user_id,
     DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
     low_carbon
 FROM user_low_carbon
 ) 
 t1
 GROUP BY 
   t1.user_id,
   t1.data_dt
HAVING day_sum_low_carbon>100
)
t2
t2.user_id	t2.data_dt	t2.day_sum_low_carbon	rn
u_001	2017-01-02	270	 			1
u_001	2017-01-06	135				2
u_002	2017-01-02	220				1
u_002	2017-01-03	110				2
u_002	2017-01-04	150				3
u_002	2017-01-05	101				4
u_003	2017-01-02	160				1
u_003	2017-01-03	160				2
u_003	2017-01-05	120				3
u_003	2017-01-07	120				4
u_004	2017-01-01	110				1

4.根据上表我们可以发现,如果是连续三天,那么我们用日期减去排名,所得的数值是相同的。
那么 我们就做data——dt和rn的差值

SELECT 
    t3.user_id,
    t3.data_dt,
    t3.day_sum_low_carbon,
    DATE_SUB(t3.data_dt,t3.rn)  data_dt_rn_diff
FROM
  t3

组合:
SELECT 
    t3.user_id,
    t3.data_dt,
    t3.day_sum_low_carbon,
    DATE_SUB(t3.data_dt,t3.rn)  data_dt_rn_diff
FROM
(
SELECT 
   t2.user_id,
   t2.data_dt,
   t2.day_sum_low_carbon,
   rank() over(PARTITION BY t2.user_id ORDER BY t2.data_dt) rn
FROM 
(
     SELECT
    t1.user_id,
    t1.data_dt,
    SUM(t1.low_carbon) day_sum_low_carbon
 FROM
 (
  SELECT 
     user_id,
     DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
     low_carbon
 FROM user_low_carbon
 ) 
 t1
 GROUP BY 
   t1.user_id,
   t1.data_dt
HAVING day_sum_low_carbon>100
)
t2
)
  t3
 
 t3.user_id	t3.data_dt	t3.day_sum_low_carbon	data_dt_rn_diff
u_001	2017-01-02	270	2017-01-01
u_001	2017-01-06	135	2017-01-04
u_002	2017-01-02	220	2017-01-01
u_002	2017-01-03	110	2017-01-01
u_002	2017-01-04	150	2017-01-01
u_002	2017-01-05	101	2017-01-01
u_003	2017-01-02	160	2017-01-01
u_003	2017-01-03	160	2017-01-01
u_003	2017-01-05	120	2017-01-02

5. 按照user_id和data_dt_rn_diff 进行分区 求出个数

SELECT
   t4.user_id,
   t4.data_dt,
   t4.day_sum_low_carbon,
   COUNT(*) over(PARTITION BY t4.user_id,t4.data_dt_rn_diff)
FROM t4
t4.user_id	t4.data_dt	t4.day_sum_low_carbon	days
u_001	2017-01-02	270	1
u_001	2017-01-06	135	1
u_002	2017-01-02	220	4
u_002	2017-01-03	110	4
u_002	2017-01-04	150	4
u_002	2017-01-05	101	4
u_003	2017-01-02	160	2
u_003	2017-01-03	160	2
u_003	2017-01-05	120	1
u_003	2017-01-07	120	1
6.查询个数大于等于3的,最后通过join,得出。
所有语句如下

SELECT 
  t8.user_id,
  t8.data_dt,
  t8.low_carbon
FROM 
(
SELECT t5.user_id,t5.data_dt,t5.day_sum_low_carbon,t5.days
FROM(
SELECT
   t4.user_id,
   t4.data_dt,
   t4.day_sum_low_carbon,
   COUNT(*) over(PARTITION BY t4.user_id,t4.data_dt_rn_diff)  days
FROM 
(
SELECT 
    t3.user_id,
    t3.data_dt,
    t3.day_sum_low_carbon,
    DATE_SUB(t3.data_dt,t3.rn)  data_dt_rn_diff
FROM
(
SELECT 
   t2.user_id,
   t2.data_dt,
   t2.day_sum_low_carbon,
   rank() over(PARTITION BY t2.user_id ORDER BY t2.data_dt) rn
FROM 
(
     SELECT
    t1.user_id,
    t1.data_dt,
    SUM(t1.low_carbon) day_sum_low_carbon
 FROM
 (
  SELECT 
     user_id,
     DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
     low_carbon
 FROM user_low_carbon
 ) 
 t1
 GROUP BY 
   t1.user_id,
   t1.data_dt
HAVING day_sum_low_carbon>100
)
t2
)
  t3
)
t4
) t5
WHERE t5.days>=3 
)
t7 JOIN 
(
           SELECT
           user_id,
            DATE_FORMAT (regexp_replace(data_dt,'/','-'),'yyyy-MM-dd')  data_dt,
            low_carbon
           FROM user_low_carbon
)t8
ON t7.user_id=t8.user_id AND t7.data_dt=t8.data_dt

t8.user_id	t8.data_dt	t8.low_carbon
u_002	2017-01-02	150
u_002	2017-01-02	70
u_002	2017-01-03	30
u_002	2017-01-03	80
u_002	2017-01-04	150
u_002	2017-01-05	101
u_005	2017-01-03	180
u_005	2017-01-02	50
u_005	2017-01-02	80
u_005	2017-01-04	180
u_005	2017-01-04	10
u_008	2017-01-04	260
u_008	2017-01-06	160
u_008	2017-01-05	360
u_008	2017-01-07	60
u_008	2017-01-07	60
u_009	2017-01-04	270

  • 1
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值