需求
2、蚂蚁森林低碳用户排名分析
问题:查询user_low_carbon表中每日流水记录,条件为:
用户在2017年,连续三天(或以上)的天数里,
每天减少碳排放(low_carbon)都超过100g的用户低碳流水。
需要查询返回满足以上条件的user_low_carbon表中的记录流水。
例如用户u_002符合条件的记录如下,因为2017/1/2~2017/1/5连续四天的碳排放量之和都大于等于100g:
分析:首先根据用户和天数进行分组,过滤掉总能量小于100的用户。
接下来就是比较难理解的一点,就是连续三天,我们怎么做到这个连续三天呢?
我也是在看到第一步的结果之后突然发现的,
假如是连续的天数,那么他们的天数成等差数列,差值为1,
假如我们再按照userid进行开创,排名。用时间减去他们的排名就会得到一个固定的值。
假如他们的日期不是连续的,那么开窗之后相减的差值就不是一个固定的值。
基于以上思路,我们做一个简单的测试。
代码是没有进行优化的,我会把自己的思路一步一步理出来。
1.:和第一问类似,先进行时间格式的优化
SELECT
user_id,
DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon t1
+----------+-------------+-------------+
| user_id | data_dt | low_carbon |
+----------+-------------+-------------+
| u_001 | 2017-01-01 | 10 |
| u_001 | 2017-01-02 | 150 |
| u_001 | 2017-01-02 | 110 |
| u_001 | 2017-01-02 | 10 |
| u_001 | 2017-01-04 | 50 |
| u_001 | 2017-01-04 | 10 |
| u_001 | 2017-01-06 | 45 |
| u_001 | 2017-01-06 | 90 |
| u_002 | 2017-01-01 | 10 |
| u_002 | 2017-01-02 | 150 |
| u_002 | 2017-01-02 | 70 |
| u_002 | 2017-01-03 | 30 |
| u_002 | 2017-01-03 | 80 |
| u_002 | 2017-01-04 | 150 |
2. 按照用户和时间进行分组,过滤掉当天总能量小于100的用户
SELECT
t1.user_id,
t1.data_dt,
SUM(t1.low_carbon) day_sum_low_carbon
FROM
t1
GROUP BY
t1.user_id,
t1.data_dt
HAVING day_sum_low_carbon>100 t2
组合:
SELECT
t1.user_id,
t1.data_dt,
SUM(t1.low_carbon) day_sum_low_carbon
FROM
(
SELECT
user_id,
DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon
)
t1
GROUP BY
t1.user_id,
t1.data_dt
HAVING day_sum_low_carbon>100 ====t2
SELECT
t1.user_id,
t1.data_dt,
SUM(t1.low_carbon) day_sum_low_carbon
FROM
(
SELECT
user_id,
DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon
)
t1
GROUP BY
t1.user_id,
t1.data_dt
HAVING day_sum_low_carbon>100
3.根据用户id进行开窗,然后进行排序
SELECT
t2.user_id,
t2.data_dt,
t2.day_sum_low_carbon,
row_number()over(PARTITION BY t2.user_id ORDER BY t2.data_dt)
FROM t2
组合:
SELECT
t2.user_id,
t2.data_dt,
t2.day_sum_low_carbon,
rank() over(PARTITION BY t2.user_id ORDER BY t2.data_dt) rn
FROM
(
SELECT
t1.user_id,
t1.data_dt,
SUM(t1.low_carbon) day_sum_low_carbon
FROM
(
SELECT
user_id,
DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon
)
t1
GROUP BY
t1.user_id,
t1.data_dt
HAVING day_sum_low_carbon>100
)
t2
t2.user_id t2.data_dt t2.day_sum_low_carbon rn
u_001 2017-01-02 270 1
u_001 2017-01-06 135 2
u_002 2017-01-02 220 1
u_002 2017-01-03 110 2
u_002 2017-01-04 150 3
u_002 2017-01-05 101 4
u_003 2017-01-02 160 1
u_003 2017-01-03 160 2
u_003 2017-01-05 120 3
u_003 2017-01-07 120 4
u_004 2017-01-01 110 1
4.根据上表我们可以发现,如果是连续三天,那么我们用日期减去排名,所得的数值是相同的。
那么 我们就做data——dt和rn的差值
SELECT
t3.user_id,
t3.data_dt,
t3.day_sum_low_carbon,
DATE_SUB(t3.data_dt,t3.rn) data_dt_rn_diff
FROM
t3
组合:
SELECT
t3.user_id,
t3.data_dt,
t3.day_sum_low_carbon,
DATE_SUB(t3.data_dt,t3.rn) data_dt_rn_diff
FROM
(
SELECT
t2.user_id,
t2.data_dt,
t2.day_sum_low_carbon,
rank() over(PARTITION BY t2.user_id ORDER BY t2.data_dt) rn
FROM
(
SELECT
t1.user_id,
t1.data_dt,
SUM(t1.low_carbon) day_sum_low_carbon
FROM
(
SELECT
user_id,
DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon
)
t1
GROUP BY
t1.user_id,
t1.data_dt
HAVING day_sum_low_carbon>100
)
t2
)
t3
t3.user_id t3.data_dt t3.day_sum_low_carbon data_dt_rn_diff
u_001 2017-01-02 270 2017-01-01
u_001 2017-01-06 135 2017-01-04
u_002 2017-01-02 220 2017-01-01
u_002 2017-01-03 110 2017-01-01
u_002 2017-01-04 150 2017-01-01
u_002 2017-01-05 101 2017-01-01
u_003 2017-01-02 160 2017-01-01
u_003 2017-01-03 160 2017-01-01
u_003 2017-01-05 120 2017-01-02
5. 按照user_id和data_dt_rn_diff 进行分区 求出个数
SELECT
t4.user_id,
t4.data_dt,
t4.day_sum_low_carbon,
COUNT(*) over(PARTITION BY t4.user_id,t4.data_dt_rn_diff)
FROM t4
t4.user_id t4.data_dt t4.day_sum_low_carbon days
u_001 2017-01-02 270 1
u_001 2017-01-06 135 1
u_002 2017-01-02 220 4
u_002 2017-01-03 110 4
u_002 2017-01-04 150 4
u_002 2017-01-05 101 4
u_003 2017-01-02 160 2
u_003 2017-01-03 160 2
u_003 2017-01-05 120 1
u_003 2017-01-07 120 1
6.查询个数大于等于3的,最后通过join,得出。
所有语句如下
SELECT
t8.user_id,
t8.data_dt,
t8.low_carbon
FROM
(
SELECT t5.user_id,t5.data_dt,t5.day_sum_low_carbon,t5.days
FROM(
SELECT
t4.user_id,
t4.data_dt,
t4.day_sum_low_carbon,
COUNT(*) over(PARTITION BY t4.user_id,t4.data_dt_rn_diff) days
FROM
(
SELECT
t3.user_id,
t3.data_dt,
t3.day_sum_low_carbon,
DATE_SUB(t3.data_dt,t3.rn) data_dt_rn_diff
FROM
(
SELECT
t2.user_id,
t2.data_dt,
t2.day_sum_low_carbon,
rank() over(PARTITION BY t2.user_id ORDER BY t2.data_dt) rn
FROM
(
SELECT
t1.user_id,
t1.data_dt,
SUM(t1.low_carbon) day_sum_low_carbon
FROM
(
SELECT
user_id,
DATE_FORMAT(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon
)
t1
GROUP BY
t1.user_id,
t1.data_dt
HAVING day_sum_low_carbon>100
)
t2
)
t3
)
t4
) t5
WHERE t5.days>=3
)
t7 JOIN
(
SELECT
user_id,
DATE_FORMAT (regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
FROM user_low_carbon
)t8
ON t7.user_id=t8.user_id AND t7.data_dt=t8.data_dt
t8.user_id t8.data_dt t8.low_carbon
u_002 2017-01-02 150
u_002 2017-01-02 70
u_002 2017-01-03 30
u_002 2017-01-03 80
u_002 2017-01-04 150
u_002 2017-01-05 101
u_005 2017-01-03 180
u_005 2017-01-02 50
u_005 2017-01-02 80
u_005 2017-01-04 180
u_005 2017-01-04 10
u_008 2017-01-04 260
u_008 2017-01-06 160
u_008 2017-01-05 360
u_008 2017-01-07 60
u_008 2017-01-07 60
u_009 2017-01-04 270