背景说明:
以下表记录了用户每天的蚂蚁森林低碳生活领取的记录流水。
table_name:user_low_carbon
seq(key) user_id data_dt low_carbon
流水号 用户 日期 减少碳排放(g)
xxxxx01 u_001 2017/1/1 10
xxxxx02 u_001 2017/1/2 150
xxxxx03 u_001 2017/1/2 110
xxxxx04 u_001 2017/1/2 10
xxxxx05 u_001 2017/1/4 50
xxxxx06 u_001 2017/1/4 10
xxxxx07 u_001 2017/1/6 45
xxxxx08 u_001 2017/1/6 90
xxxxx09 u_002 2017/1/1 10
xxxxx10 u_002 2017/1/2 150
xxxxx11 u_002 2017/1/2 70
xxxxx12 u_002 2017/1/3 30
xxxxx13 u_002 2017/1/3 80
xxxxx14 u_002 2017/1/4 150
xxxxx14 u_002 2017/1/5 101
xxxxx15 u_002 2017/1/6 68
xxxxx16 u_002 2017/1/6 120
蚂蚁森林植物换购表,用于记录申领环保植物所需要减少的碳排放量
table_name: plant_carbon
plant_id plant_name low_carbon
植物编号 植物名 换购植物所需要的碳
p001 梭梭树 17900
p002 沙柳 19680
p003 樟子树 146210
p004 胡杨 215680
----题目
1、蚂蚁森林植物申领统计
问题:假设2017年1月1日开始记录低碳数据(user_low_carbon),假设2017年10月1日之前满足申领条件的用户都申领了一颗p004-胡杨,
剩余的能量全部用来领取“p002-沙柳” 。
统计在10月1日累计申领“p002-沙柳” 排名前10的用户信息;以及他比后一名多领了几颗沙柳。
得到的统计结果如下表样式:
user_id plant_count less_count(比后一名多领了几颗沙柳)
u_101 1000 100
u_088 900 400
u_103 500 …
2、蚂蚁森林低碳用户排名分析
问题:查询user_low_carbon表中每日流水记录,条件为:
用户在2017年,连续三天(或以上)的天数里,
每天减少碳排放(low_carbon)都超过100g的用户低碳流水。
需要查询返回满足以上条件的user_low_carbon表中的记录流水。
例如用户u_002符合条件的记录如下,因为2017/1/2~2017/1/5连续四天的碳排放量之和都大于等于100g:
seq(key) user_id data_dt low_carbon
xxxxx10 u_002 2017/1/2 150
xxxxx11 u_002 2017/1/2 70
xxxxx12 u_002 2017/1/3 30
xxxxx13 u_002 2017/1/3 80
xxxxx14 u_002 2017/1/4 150
xxxxx14 u_002 2017/1/5 101
备注:统计方法不限于sql、procedure、python,java等
数据准备
plant_carbon
p001 梭梭树 17
p002 沙柳 19
p003 樟子树 146
p004 胡杨 215
user_low_carbon 简单处理 把流水号去掉了
u_001 2017/1/1 10
u_001 2017/1/2 150
u_001 2017/1/2 110
u_001 2017/1/2 10
u_001 2017/1/4 50
u_001 2017/1/4 10
u_001 2017/1/6 45
u_001 2017/1/6 90
u_002 2017/1/1 10
u_002 2017/1/2 150
u_002 2017/1/2 70
u_002 2017/1/3 30
u_002 2017/1/3 80
u_002 2017/1/4 150
u_002 2017/1/5 101
u_002 2017/1/6 68
u_003 2017/1/1 20
u_003 2017/1/2 10
u_003 2017/1/2 150
u_003 2017/1/3 160
u_003 2017/1/4 20
u_003 2017/1/5 120
u_003 2017/1/6 20
u_003 2017/1/7 10
u_003 2017/1/7 110
u_004 2017/1/1 110
u_004 2017/1/2 20
u_004 2017/1/2 50
u_004 2017/1/3 120
u_004 2017/1/4 30
u_004 2017/1/5 60
u_004 2017/1/6 120
u_004 2017/1/7 10
u_004 2017/1/7 120
u_005 2017/1/1 80
u_005 2017/1/2 50
u_005 2017/1/2 80
u_005 2017/1/3 180
u_005 2017/1/4 180
u_005 2017/1/4 10
u_005 2017/1/5 80
u_005 2017/1/6 280
u_005 2017/1/7 80
u_005 2017/1/7 80
u_006 2017/1/1 40
u_006 2017/1/2 40
u_006 2017/1/2 140
u_006 2017/1/3 210
u_006 2017/1/3 10
u_006 2017/1/4 40
u_006 2017/1/5 40
u_006 2017/1/6 20
u_006 2017/1/7 50
u_006 2017/1/7 240
u_007 2017/1/1 130
u_007 2017/1/2 30
u_007 2017/1/2 330
u_007 2017/1/3 30
u_007 2017/1/4 530
u_007 2017/1/5 30
u_007 2017/1/6 230
u_007 2017/1/7 130
u_007 2017/1/7 30
u_008 2017/1/1 160
u_008 2017/1/2 60
u_008 2017/1/2 60
u_008 2017/1/3 60
u_008 2017/1/4 260
u_008 2017/1/5 360
u_008 2017/1/6 160
u_008 2017/1/7 60
u_008 2017/1/7 60
u_009 2017/1/1 70
u_009 2017/1/2 70
u_009 2017/1/2 70
u_009 2017/1/3 170
u_009 2017/1/4 270
u_009 2017/1/5 70
u_009 2017/1/6 70
u_009 2017/1/7 70
u_009 2017/1/7 70
u_010 2017/1/1 90
u_010 2017/1/2 90
u_010 2017/1/2 90
u_010 2017/1/3 90
u_010 2017/1/4 90
u_010 2017/1/4 80
u_010 2017/1/5 90
u_010 2017/1/5 90
u_010 2017/1/6 190
u_010 2017/1/7 90
u_010 2017/1/7 90
u_011 2017/1/1 110
u_011 2017/1/2 100
u_011 2017/1/2 100
u_011 2017/1/3 120
u_011 2017/1/4 100
u_011 2017/1/5 100
u_011 2017/1/6 100
u_011 2017/1/7 130
u_011 2017/1/7 100
u_012 2017/1/1 10
u_012 2017/1/2 120
u_012 2017/1/2 10
u_012 2017/1/3 10
u_012 2017/1/4 50
u_012 2017/1/5 10
u_012 2017/1/6 20
u_012 2017/1/7 10
u_012 2017/1/7 10
u_013 2017/1/1 50
u_013 2017/1/2 150
u_013 2017/1/2 50
u_013 2017/1/3 150
u_013 2017/1/4 550
u_013 2017/1/5 350
u_013 2017/1/6 50
u_013 2017/1/7 20
u_013 2017/1/7 60
u_014 2017/1/1 220
u_014 2017/1/2 120
u_014 2017/1/2 20
u_014 2017/1/3 20
u_014 2017/1/4 20
u_014 2017/1/5 250
u_014 2017/1/6 120
u_014 2017/1/7 270
u_014 2017/1/7 20
u_015 2017/1/1 10
u_015 2017/1/2 20
u_015 2017/1/2 10
u_015 2017/1/3 10
u_015 2017/1/4 20
u_015 2017/1/5 70
u_015 2017/1/6 10
u_015 2017/1/7 80
u_015 2017/1/7 60
第一题
/*
regexp_replace 日期的格式 2017/1/4 是不能用日期函数的 要是 2017-1-4 regexp_replace(data_dt,'/','-')
datediff 日期取差值
date_sub 日期与数字相减
date_add 日期与数字相加
floor数学函数 floor() 向下取整
*/
1、先查出所有用户在10月1日前的能量总和
select
user_id,
sum(low_carbon) sum_carbon
from user_low_carbon
where datediff(regexp_replace(data_dt,'/','-'),"2017-10-1")<0
group by user_id ; t1
2、查出胡杨 ,沙柳各需要的
select
low_carbon
from plant_carbon
where plant_id = 'p004' ; t2
select
low_carbon
from plant_carbon
where plant_id = 'p002' ;t3
3、查出 沙柳数的前11名
select
user_id,
floor((sum_carbon-t2.low_carbon)/(t3.low_carbon)) treecount
from
t1,t2,t3
order by treecount desc
limit 11 ;t4
4、查出 前一名比后一名多多少棵,取前10名
select
user_id,
treecount,
(treecount - lead(treecount)over(order by treecount desc))
from
t4
limit 10
//结合
select
user_id,
treecount,
(treecount - lead(treecount)over(order by treecount desc))
from
(
select
user_id,
floor((sum_carbon-t2.low_carbon)/(t3.low_carbon)) treecount
from
(
select
user_id,
sum(low_carbon) sum_carbon
from user_low_carbon
where datediff(regexp_replace(data_dt,'/','-'),"2017-10-1")<0
group by user_id
)t1,
(
select
low_carbon
from plant_carbon
where plant_id = 'p004'
)t2,
(
select
low_carbon
from plant_carbon
where plant_id = 'p002'
)t3
order by treecount desc
limit 11
)t4
limit 10
第二题
1、首先查出每个用户满足 减少碳排放大于 100g的 user_id data_dt
2、找出满足连续三天的 user_id 和 data_dt(难点)
3、有了满足条件的user_id 和data_dt就能够查到流水了
第一步
select
user_id,
data_dt,
sum(low_carbon) day_sum_carbon
from
user_low_carbon
group by user_id,data_dt
having day_sum_carbon >=100
order by user_id ,data_dt ;t1
第二步
方法一,
思路:每个用户,每条记录,和先2条的差要是 2,先一条的是1,后一条的是-1,后两条的是-2
这个方法比较麻烦
select
user_id,
data_dt,
day_sum_carbon,
datediff(regexp_replace(data_dt,'/','-'),regexp_replace(lag(data_dt,2,"1970-1-1")over(partition by user_id order by data_dt),'/','-')) lag2,
datediff(regexp_replace(data_dt,'/','-'),regexp_replace(lag(data_dt,1,"1970-1-1")over(partition by user_id order by data_dt),'/','-')) lag1,
datediff(regexp_replace(data_dt,'/','-'),regexp_replace(lead(data_dt,1,"1970-1-1")over(partition by user_id order by data_dt),'/','-')) lead1,
datediff(regexp_replace(data_dt,'/','-'),regexp_replace(lead(data_dt,2,"1970-1-1")over(partition by user_id order by data_dt),'/','-')) lead2
from
t1 ;t2
select
user_id,
data_dt,
day_sum_carbon
from
t2
where
lag2 =2 and lag1= 1
or
lag1 =1 and lead1 = -1
or
lead1=-1 and lead2=-2 ;t3
结果
user_id data_dt day_sum_carbon
u_002 2017/1/2 220
u_002 2017/1/3 110
u_002 2017/1/4 150
u_002 2017/1/5 101
u_005 2017/1/2 130
u_005 2017/1/3 180
u_005 2017/1/4 190
u_008 2017/1/4 260
u_008 2017/1/5 360
u_008 2017/1/6 160
u_008 2017/1/7 120
u_009 2017/1/2 140
u_009 2017/1/3 170
u_009 2017/1/4 270
u_010 2017/1/4 170
u_010 2017/1/5 180
u_010 2017/1/6 190
u_010 2017/1/7 180
u_011 2017/1/1 110
u_011 2017/1/2 200
u_011 2017/1/3 120
u_011 2017/1/4 100
u_011 2017/1/5 100
u_011 2017/1/6 100
u_011 2017/1/7 230
u_013 2017/1/2 200
u_013 2017/1/3 150
u_013 2017/1/4 550
u_013 2017/1/5 350
u_014 2017/1/5 250
u_014 2017/1/6 120
u_014 2017/1/7 290
方法二:
上一种的方法,如果要求连续10天的,那判断要10个,越多越麻烦
举例
1 1 0
3 2 1
4 3 1
5 4 1
8 5 3
9 6 3
10 7 3
假如第一列是 日期 ,第二列是一个递增的 数列,后面是差
那么差相等的证明日期是连续的
我们只需要在每一条记录 开窗一样的值有几个 判断大于等于3 就好了
select
user_id,
data_dt,
row_number()over(partition by user_id order by data_dt) no,
sum(low_carbon) day_sum_carbon
from
user_low_carbon
group by user_id,data_dt
having day_sum_carbon >=100 ;t1
select
user_id,
data_dt,
day_sum_carbon,
date_sub(regexp_replace(data_dt,'/','-'),no) sub
from
t1 ;t2
select
user_id,
data_dt,
day_sum_carbon,
count(*) over(partition by user_id ,sub) continue_Days
from t2 ;t3
select
user_id,
data_dt,
day_sum_carbon
from t3
where continue_Days >=3
order by user_id,data_dt
这种如果要求连续的日期多,并不会造成判断条件多,简洁点,但是执行的效率并没有前一个高,他3个job,而前面一个2个job
方法三:
写mr来实现,只写第二步,怎么从 已经满足100克以上提取出连续3天的userId和data_dt
public class MayiInfo implements WritableComparable<MayiInfo> {
String id;
String date;
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(id);
out.writeUTF(date);
}
@Override
public void readFields(DataInput in) throws IOException {
id = in.readUTF();
date = in.readUTF();
}
@Override
public String toString() {
return id + "\t" + date;
}
//map 排序
@Override
public int compareTo(MayiInfo mayiInfo) {
int result = 0;
result = this.id.compareTo(mayiInfo.id);
if(result == 0) {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
try {
Date d1 = sdf.parse(this.date);
Date d2 = sdf.parse(mayiInfo.date);
result = d1.compareTo(d2);
} catch (ParseException e) {
e.printStackTrace();
}
}
return result;
}
}
public class MyGroup extends WritableComparator {
public MyGroup() {
super(MayiInfo.class,true);
}
//自定义分组,按user_id
@Override
public int compare(WritableComparable a, WritableComparable b) {
MayiInfo m1 = (MayiInfo) a;
MayiInfo m2 = (MayiInfo) b;
int result = m1.getId().compareTo(m2.getId());
if(result > 0){
result = 1;
}else if(result<0) {
result = -1;
}else {
result = 0 ;
}
return result;
}
}
public class Stage1Reducer extends Reducer<MayiInfo, MayiInfo, MayiInfo, NullWritable> {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
@Override
protected void reduce(MayiInfo key, Iterable<MayiInfo> values, Context context) throws IOException, InterruptedException {
ArrayList<MayiInfo> list = new ArrayList<MayiInfo>();
Date date = null;
try {
for (MayiInfo value : values) {
MayiInfo mayiInfo = new MayiInfo();
BeanUtils.copyProperties(mayiInfo,value);
Date d1 = sdf.parse(mayiInfo.getDate());
if (date == null) {
date = d1;
list.add(mayiInfo);
continue;
}
if(myDateDiff(d1,date) == 1){
date = d1;
list.add(mayiInfo);
}else{
date = d1;
if(list.size() >=3){
for (MayiInfo m : list) {
context.write(m,NullWritable.get());
}
}
list.clear();
}
}
if(list.size() >=3){
for (MayiInfo m1 : list) {
context.write(m1,NullWritable.get());
}
}
list.clear();
}catch (Exception e){
e.printStackTrace();
}
}
public int myDateDiff(Date d1,Date d2){
int result = 0 ;
result = d1.getYear() - d2.getYear();
result = result == 0 ? d1.getMonth() - d2.getMonth() : result;
result = result ==0?d1.getDate() -d2.getDate() : result;
return result;
}
}