一、数据
user_low_carbon(user_id String,data_dt String,low_carbon int)
u_001 2017/1/1 10
plant_carbon(plant_id string,plant_name String,low_carbon int)
二、需求一
1.蚂蚁森林植物申领统计
问题:假设2017年1月1日开始记录低碳数据(user_low_carbon),
假设2017年10月1日之前满足申领条件的用户都申领了一颗p004-胡杨,
剩余的能量全部用来领取“p002-沙柳”?。
统计在10月1日累计申领“p002-沙柳”?排名前10的用户信息;以及他比后一名多领了几颗沙柳。
得到的统计结果如下表样式:
user_id??plant_count?less_count(比后一名多领了几颗沙柳)
①求用户从2017年1月1日开始,到2017年10月1日一共收集了多少了碳
select user_id,sum(low_carbon) sumCarbon
from user_low_carbon
where year(regexp_replace(data_dt,'/','-'))=2017 and month(regexp_replace(data_dt,'/','-')) between 1 and 9
group by user_id //t1
②求胡杨的单价
select low_carbon hycarbon from plant_carbon where plant_id='p004' //t2
③求沙柳的单价
select low_carbon slcarbon from plant_carbon where plant_id='p002' //t3
④求 (用户的碳总量-1颗胡杨的单价)/沙柳单价,之后向下取整,这是每个用户供领取多少课沙柳
按照沙柳的总量降序排序,取前11名的信息
select user_id,sumCarbon,floor((sumCarbon-hycarbon)/slcarbon) slcount
from
t1 join t2 join t3
where sumCarbon>=hycarbon
order by slcount desc
limit 11 //t4
⑤再从top11中取前10名的信息,再计算每一名比后一名多领了多少颗
select t4.*,slcount-lead(slcount,1,0) over(order by slcount desc) diffcount
from t4
limit 10
组合:
select t4.*,slcount-lead(slcount,1,0) over(order by slcount desc) diffcount
from
(select user_id,sumCarbon,floor((sumCarbon-hycarbon)/slcarbon) slcount
from
(select user_id,sum(low_carbon) sumCarbon
from user_low_carbon
where year(regexp_replace(data_dt,'/','-'))='2017' and month(regexp_replace(data_dt,'/','-')) between 1 and 9
group by user_id )t1
join
(select low_carbon hycarbon from plant_carbon where plant_id='p004')t2
join
(select low_carbon slcarbon from plant_carbon where plant_id='p002') t3
order by slcount desc
limit 11 )t4
limit 10
三、需求二
1、数据
user_low_carbon(user_id String,data_dt String,low_carbon int)
u_001 2017/1/1 10
plant_carbon(plant_id string,plant_name String,low_carbon int)
2.问题
问题:查询user_low_carbon表中每日流水记录,条件为:
用户在2017年,连续三天(或以上)的天数里,
每天减少碳排放(low_carbon)都超过100g的用户低碳流水。
需要查询返回满足以上条件的user_low_carbon表中的记录流水。
例如用户u_002符合条件的记录如下,因为2017/1/2~2017/1/5连续四天的碳排放量之和都大于等于100g:
条件1: 每天都得收集超过100g
且
条件2: 连续3天或以上达标
①求每个用户每天一共收集了多少碳
②对用户收集的总的碳量进行过滤,只剩用户收集的超过100g的日期
select user_id,regexp_replace(data_dt,'/','-') mydate,sum(low_carbon) sumCarbon
from user_low_carbon
where year(regexp_replace(data_dt,'/','-'))=2017
group by user_id,data_dt
having sumCarbon>100 //t1
③再判断这些日期是否满足连续3天这个条件
何谓连续3天?
如何判断当前记录属于连续3天?
如果当前这条记录的日期符合连续3天,那么这条日期应该属于以下情况中的任意一种:
当前此条日期位于连续3天中的第一天,用当前日期减去后一条的日期=-1,用当前日期减去后两条的日期=-2;
当前此条日期位于连续3天中的第二天,用当前日期减去后一条的日期=-1,用当前日期减去前一条的日期=1;
当前此条日期位于连续3天中的第三天,用当前日期减去前一条的日期=1,用当前日期减去前两条的日期=22;
select user_id,mydate,
lag(mydate,1,'1970-1-1') over(partition by user_id order by mydate) pre1day,
lag(mydate,2,'1970-1-1') over(partition by user_id order by mydate) pre2day,
lead(mydate,1,'1970-1-1') over(partition by user_id order by mydate) after1day,
lead(mydate,2,'1970-1-1') over(partition by user_id order by mydate) after2day
from t1 //t2
做差判断
select user_id,regexp_replace(mydate,'-','/') data_dt
from t2
where
(datediff(mydate,after1day)=-1 and datediff(mydate,after2day)=-2)
or
(datediff(mydate,pre1day)=1 and datediff(mydate,after1day)=-1)
or
(datediff(mydate,pre1day)=1 and datediff(mydate,pre2day)=2) //t3
④将复合条件的日期,再和原表进行关联,求出每日的具体流水
select ul.*
from t3 join user_low_carbon ul
on t3.user_id=ul.user_id and t3.data_dt=ul.data_dt
总的
select ul.*
from
(select user_id,regexp_replace(mydate,'-','/') data_dt
from
(select user_id,mydate,
lag(mydate,1,'1970-1-1') over(partition by user_id order by mydate) pre1day,
lag(mydate,2,'1970-1-1') over(partition by user_id order by mydate) pre2day,
lead(mydate,1,'1970-1-1') over(partition by user_id order by mydate) after1day,
lead(mydate,2,'1970-1-1') over(partition by user_id order by mydate) after2day
from
(select user_id,regexp_replace(data_dt,'/','-') mydate,sum(low_carbon) sumCarbon
from user_low_carbon
where year(regexp_replace(data_dt,'/','-'))=2017
group by user_id,data_dt
having sumCarbon>100 ) t1)t2
where
(datediff(mydate,after1day)=-1 and datediff(mydate,after2day)=-2)
or
(datediff(mydate,pre1day)=1 and datediff(mydate,after1day)=-1)
or
(datediff(mydate,pre1day)=1 and datediff(mydate,pre2day)=2))t3 join user_low_carbon ul
on t3.user_id=ul.user_id and t3.data_dt=ul.data_dt
3.解法2
思路二: 在3天连续上进行改进求法
3天连续=连续+3天
总结规律:
假设A列,起始值为a,每次递增x
假设B列,起始值为b,每次递增y
如果A列和B列都连续递增,那么A列和B列,每两行之间的差值,都以 y-x 递增
A B 做差
a b b-a
a+x b+y (b-a)+(y-x)
a+2x b+2y (b-a)+2(y-x)
①如何判断日期是连续的
日期如果是连续的,那么排序后,紧邻的两行日期做差差值一定为1
借助一个参考列,参考列一定是连续的,且以1为单位递增
举例: col1 差值
u_002 2017-1-2 1 2017-1-1
u_002 2017-1-3 2 2017-1-1
u_002 2017-1-4 3 2017-1-1
u_002 2017-1-6 4 2017-1-2
u_002 2017-1-7 5 2017-1-2
u_002 2017-1-9 6 2017-1-3
u_002 2017-1-10 7 2017-1-3
u_002 2017-1-11 8 2017-1-3
u_005 2017-1-2 1
u_005 2017-1-3 2
②如何判断此日期已经3天连续
做差后的数据,以差值和用户为单位分组,如果组内count(*)>=3,3天连续
N天连续:
做差后的数据,以差值和用户为单位分组,如果组内count(*)>=N,N天连续
select user_id,regexp_replace(data_dt,'/','-') mydate,sum(low_carbon) sumCarbon
from user_low_carbon
where year(regexp_replace(data_dt,'/','-'))=2017
group by user_id,data_dt
having sumCarbon>100 //t1
以用户为单位分区,求出参考列和差值
select user_id,mydate,row_number() over(partition by user_id) rn,
date_sub(mydate,row_number() over(partition by user_id)) diff
from t1 //t2
判断3天连续,以差值和用户为单位分组,如果组内count(*)>=3,3天连续
group by 无法取到日期数据,使用窗口函数取
select user_id,mydate,count(*) over(partition by user_id,diff) mycount
from t2 //t3
过滤出3天连续的数据
select user_id,regexp_replace(mydate,'-','/') data_dt
from t3
where mycount>=3 //t4
将复合条件的日期,再和原表进行关联,求出每日的具体流水
select ul.*
from t4 join user_low_carbon ul
on t4.user_id=ul.user_id and t4.data_dt=ul.data_dt
总的
select ul.*
from
(select user_id,regexp_replace(mydate,'-','/') data_dt
from
(select user_id,mydate,count(*) over(partition by user_id,diff) mycount
from (select user_id,mydate,row_number() over(partition by user_id) rn,
date_sub(mydate,row_number() over(partition by user_id)) diff
from (select user_id,regexp_replace(data_dt,'/','-') mydate,sum(low_carbon) sumCarbon
from user_low_carbon
where year(regexp_replace(data_dt,'/','-'))=2017
group by user_id,data_dt
having sumCarbon>100)t1)t2)t3
where mycount>=3)t4 join user_low_carbon ul
on t4.user_id=ul.user_id and t4.data_dt=ul.data_dt