1. Interview question
We have the following user-visit data.
Use SQL to produce, for each user and month, the monthly visit subtotal and the running cumulative visit count.
Data:
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
u02 2017/1/23 6
u01 2017/2/22 4
Create the table:
create table action
(userId string,
visitDate string,
visitCount int)
row format delimited fields terminated by "\t";
Answer:
SELECT t.userId user_id,
       t.visitdt visit_month,
       t.sum_cnt subtotal,
       sum(t.sum_cnt) over(partition by t.userId ORDER BY t.visitdt
           rows BETWEEN unbounded preceding AND current row) cumulative
FROM (
    -- monthly subtotal; substring(visitDate, 1, 6) keeps "yyyy/M", which is
    -- only safe for the single-digit months in the sample data
    SELECT userId,
           regexp_replace(substring(visitDate, 1, 6), '/', '-') visitdt,
           sum(visitCount) sum_cnt
    FROM action
    GROUP BY userId, regexp_replace(substring(visitDate, 1, 6), '/', '-')
) t;
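The two levels of the query (monthly GROUP BY, then a running SUM() window) can be sanity-checked with a small Python sketch over the sample rows; this is not part of the original answer, just a simulation of the same logic.

```python
from collections import defaultdict
from itertools import accumulate

# Sample rows from the problem: (userId, visitDate, visitCount).
rows = [
    ("u01", "2017/1/21", 5), ("u02", "2017/1/23", 6), ("u03", "2017/1/22", 8),
    ("u04", "2017/1/20", 3), ("u01", "2017/1/23", 6), ("u01", "2017/2/21", 8),
    ("u02", "2017/1/23", 6), ("u01", "2017/2/22", 4),
]

# Step 1: monthly subtotal per user -- mirrors the inner GROUP BY.
monthly = defaultdict(int)
for user, date, cnt in rows:
    month = date.rsplit("/", 1)[0].replace("/", "-")  # "2017/1/21" -> "2017-1"
    monthly[(user, month)] += cnt

# Step 2: running total per user in month order -- mirrors the SUM() OVER window.
result = []
for user in sorted({u for u, _ in monthly}):
    months = sorted(m for u, m in monthly if u == user)
    subtotals = [monthly[(user, m)] for m in months]
    for m, sub, cum in zip(months, subtotals, accumulate(subtotals)):
        result.append((user, m, sub, cum))
```

For u01 this yields a January subtotal of 11 (cumulative 11) and a February subtotal of 12 (cumulative 23), which is what the window query should return.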
2. Interview question
There are 500,000 shops. Each time a customer views any product in any shop, one row is written to an access log table named visit, with the visitor's id as user_id and the shop name as shop. Compute:
1) Each shop's UV (number of distinct visitors).
2) For each shop, the top-3 visitors by visit count, outputting shop name, visitor id, and visit count.
Data:
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
Create the table:
create table visit(user_id string,shop string)
row format delimited fields terminated by '\t';
Answer:
-- 1) UV (distinct visitors) per shop
-- When the data volume is small:
select shop, count(distinct user_id) uv from visit group by shop;
-- When the data volume is large (pre-aggregate to avoid a single-reducer distinct):
select t.shop, count(t.user_id) uv
from (
    select shop, user_id from visit group by shop, user_id
) t group by t.shop;
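A quick Python check of the UV numbers on the sample log (not part of the original answer); a set per shop deduplicates exactly like the inner GROUP BY shop, user_id does.

```python
from collections import defaultdict

# Visit log rows (user_id, shop) from the sample data.
visits = [
    ("u1", "a"), ("u2", "b"), ("u1", "b"), ("u1", "a"), ("u3", "c"),
    ("u4", "b"), ("u1", "a"), ("u2", "c"), ("u5", "b"), ("u4", "b"),
    ("u6", "c"), ("u2", "c"), ("u1", "b"), ("u2", "a"), ("u2", "a"),
    ("u3", "a"), ("u5", "a"), ("u5", "a"), ("u5", "a"),
]

# UV = number of distinct visitors per shop.
shop_users = defaultdict(set)
for user, shop in visits:
    shop_users[shop].add(user)
uv = {shop: len(users) for shop, users in shop_users.items()}
```

On this data shop a and b each have 4 distinct visitors and shop c has 3.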
-- 2) Top-3 visitors per shop: shop name, visitor id, visit count
SELECT * FROM (
    SELECT t.*,
           row_number() over(partition by t.shop ORDER BY t.cnt desc) rn
    FROM (
        SELECT shop, user_id, count(user_id) cnt
        FROM visit GROUP BY shop, user_id
    ) t
) final
WHERE final.rn <= 3;
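The top-3 ranking can be simulated in Python as well (again just a sketch, not the original answer). Note that row_number() breaks ties arbitrarily; here ties are broken by user_id so the result is deterministic.

```python
from collections import Counter

# Visit log rows (user_id, shop) from the sample data.
visits = [
    ("u1", "a"), ("u2", "b"), ("u1", "b"), ("u1", "a"), ("u3", "c"),
    ("u4", "b"), ("u1", "a"), ("u2", "c"), ("u5", "b"), ("u4", "b"),
    ("u6", "c"), ("u2", "c"), ("u1", "b"), ("u2", "a"), ("u2", "a"),
    ("u3", "a"), ("u5", "a"), ("u5", "a"), ("u5", "a"),
]

# Count visits per (shop, user) -- the inner GROUP BY -- then rank within
# each shop by count desc and keep the top 3 -- the rn <= 3 filter.
counts = Counter((shop, user) for user, shop in visits)
per_shop = {}
for (shop, user), cnt in counts.items():
    per_shop.setdefault(shop, []).append((user, cnt))
top3 = {
    shop: sorted(pairs, key=lambda p: (-p[1], p[0]))[:3]
    for shop, pairs in per_shop.items()
}
```

For shop a this gives u1 (3 visits), u5 (3), u2 (2); u3's single visit is cut.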
3. Interview question
Background:
The following table records each user's daily Ant Forest low-carbon credit entries.
table_name: user_low_carbon
user_id | data_dt | low_carbon |
---|---|---|
user | date | carbon saved (g) |
The Ant Forest plant-redemption table records the carbon savings required to redeem each plant:
table_name: plant_carbon
plant_id | plant_name | low_carbon |
---|---|---|
plant id | plant name | carbon required to redeem |
Data:
plant_carbon.txt
p001 梭梭树 17
p002 沙柳 19
p003 樟子树 146
p004 胡杨 215
user_low_carbon.txt
u_001 2017/1/1 10
u_001 2017/1/2 150
u_001 2017/1/2 110
u_001 2017/1/2 10
u_001 2017/1/4 50
u_001 2017/1/4 10
u_001 2017/1/6 45
u_001 2017/1/6 90
u_002 2017/1/1 10
u_002 2017/1/2 150
u_002 2017/1/2 70
u_002 2017/1/3 30
u_002 2017/1/3 80
u_002 2017/1/4 150
u_002 2017/1/5 101
u_002 2017/1/6 68
u_003 2017/1/1 20
u_003 2017/1/2 10
u_003 2017/1/2 150
u_003 2017/1/3 160
u_003 2017/1/4 20
u_003 2017/1/5 120
u_003 2017/1/6 20
u_003 2017/1/7 10
u_003 2017/1/7 110
u_004 2017/1/1 110
u_004 2017/1/2 20
u_004 2017/1/2 50
u_004 2017/1/3 120
u_004 2017/1/4 30
u_004 2017/1/5 60
u_004 2017/1/6 120
u_004 2017/1/7 10
u_004 2017/1/7 120
u_005 2017/1/1 80
u_005 2017/1/2 50
u_005 2017/1/2 80
u_005 2017/1/3 180
u_005 2017/1/4 180
u_005 2017/1/4 10
u_005 2017/1/5 80
u_005 2017/1/6 280
u_005 2017/1/7 80
u_005 2017/1/7 80
u_006 2017/1/1 40
u_006 2017/1/2 40
u_006 2017/1/2 140
u_006 2017/1/3 210
u_006 2017/1/3 10
u_006 2017/1/4 40
u_006 2017/1/5 40
u_006 2017/1/6 20
u_006 2017/1/7 50
u_006 2017/1/7 240
u_007 2017/1/1 130
u_007 2017/1/2 30
u_007 2017/1/2 330
u_007 2017/1/3 30
u_007 2017/1/4 530
u_007 2017/1/5 30
u_007 2017/1/6 230
u_007 2017/1/7 130
u_007 2017/1/7 30
u_008 2017/1/1 160
u_008 2017/1/2 60
u_008 2017/1/2 60
u_008 2017/1/3 60
u_008 2017/1/4 260
u_008 2017/1/5 360
u_008 2017/1/6 160
u_008 2017/1/7 60
u_008 2017/1/7 60
u_009 2017/1/1 70
u_009 2017/1/2 70
u_009 2017/1/2 70
u_009 2017/1/3 170
u_009 2017/1/4 270
u_009 2017/1/5 70
u_009 2017/1/6 70
u_009 2017/1/7 70
u_009 2017/1/7 70
u_010 2017/1/1 90
u_010 2017/1/2 90
u_010 2017/1/2 90
u_010 2017/1/3 90
u_010 2017/1/4 90
u_010 2017/1/4 80
u_010 2017/1/5 90
u_010 2017/1/5 90
u_010 2017/1/6 190
u_010 2017/1/7 90
u_010 2017/1/7 90
u_011 2017/1/1 110
u_011 2017/1/2 100
u_011 2017/1/2 100
u_011 2017/1/3 120
u_011 2017/1/4 100
u_011 2017/1/5 100
u_011 2017/1/6 100
u_011 2017/1/7 130
u_011 2017/1/7 100
u_012 2017/1/1 10
u_012 2017/1/2 120
u_012 2017/1/2 10
u_012 2017/1/3 10
u_012 2017/1/4 50
u_012 2017/1/5 10
u_012 2017/1/6 20
u_012 2017/1/7 10
u_012 2017/1/7 10
u_013 2017/1/1 50
u_013 2017/1/2 150
u_013 2017/1/2 50
u_013 2017/1/3 150
u_013 2017/1/4 550
u_013 2017/1/5 350
u_013 2017/1/6 50
u_013 2017/1/7 20
u_013 2017/1/7 60
u_014 2017/1/1 220
u_014 2017/1/2 120
u_014 2017/1/2 20
u_014 2017/1/3 20
u_014 2017/1/4 20
u_014 2017/1/5 250
u_014 2017/1/6 120
u_014 2017/1/7 270
u_014 2017/1/7 20
u_015 2017/1/1 10
u_015 2017/1/2 20
u_015 2017/1/2 10
u_015 2017/1/3 10
u_015 2017/1/4 20
u_015 2017/1/5 70
u_015 2017/1/6 10
u_015 2017/1/7 80
u_015 2017/1/7 60
1. Ant Forest plant-redemption statistics
Problem: assume low-carbon records (user_low_carbon) begin on 2017-01-01, and that before 2017-10-01 every user who had saved enough carbon redeemed one "p004" (胡杨), with all remaining carbon going toward "p002" (沙柳).
Report the top-10 users by cumulative 沙柳 redeemed as of 2017-10-01, together with how many more 沙柳 each redeemed than the next-ranked user.
The expected output looks like:
user_id | plant_count | less_count (how many more 沙柳 than the next rank) |
---|---|---|
u_101 | 1000 | 100 |
u_088 | 900 | 400 |
u_103 | 500 | … |
Answer:
- Create the tables
create table user_low_carbon(user_id String,data_dt String,low_carbon int) row format delimited fields terminated by '\t';
create table plant_carbon(plant_id string,plant_name String,low_carbon int) row format delimited fields terminated by '\t';
- Load the data
load data local inpath "/opt/module/data/user_low_carbon.txt" into table user_low_carbon;
load data local inpath "/opt/module/data/plant_carbon.txt" into table plant_carbon;
- Computation
(1) Total carbon per user before Oct 1; keep 11 rows so that the 10th-ranked user still has a successor to compare against: t1
-- t1
select user_id, sum(low_carbon) sum_carbon from user_low_carbon
where datediff(regexp_replace(data_dt, '/', '-'), '2017-10-01') < 0
group by user_id order by sum_carbon desc limit 11;
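Since every sample row falls in January 2017 (before the cutoff), t1 reduces to a plain per-user sum. A quick Python check (not part of the original answer; the amount lists below are transcribed per user from user_low_carbon.txt):

```python
# low_carbon amounts per user, transcribed from user_low_carbon.txt.
amounts = {
    "u_001": [10, 150, 110, 10, 50, 10, 45, 90],
    "u_002": [10, 150, 70, 30, 80, 150, 101, 68],
    "u_003": [20, 10, 150, 160, 20, 120, 20, 10, 110],
    "u_004": [110, 20, 50, 120, 30, 60, 120, 10, 120],
    "u_005": [80, 50, 80, 180, 180, 10, 80, 280, 80, 80],
    "u_006": [40, 40, 140, 210, 10, 40, 40, 20, 50, 240],
    "u_007": [130, 30, 330, 30, 530, 30, 230, 130, 30],
    "u_008": [160, 60, 60, 60, 260, 360, 160, 60, 60],
    "u_009": [70, 70, 70, 170, 270, 70, 70, 70, 70],
    "u_010": [90, 90, 90, 90, 90, 80, 90, 90, 190, 90, 90],
    "u_011": [110, 100, 100, 120, 100, 100, 100, 130, 100],
    "u_012": [10, 120, 10, 10, 50, 10, 20, 10, 10],
    "u_013": [50, 150, 50, 150, 550, 350, 50, 20, 60],
    "u_014": [220, 120, 20, 20, 20, 250, 120, 270, 20],
    "u_015": [10, 20, 10, 10, 20, 70, 10, 80, 60],
}
totals = {u: sum(v) for u, v in amounts.items()}
# order by sum_carbon desc limit 11
top11 = sorted(totals.items(), key=lambda kv: -kv[1])[:11]
```

The top total is u_007 with 1470g; the 11th row kept is u_004 with 640g.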
(2) Carbon cost of p004 and p002: t2 and t3
-- t2
select low_carbon from plant_carbon where plant_id = 'p004';
-- t3
select low_carbon from plant_carbon where plant_id = 'p002';
(3) Number of 沙柳 each user can redeem (one p004 first, then the floor of the remainder divided by the p002 cost): t4
-- t4
select t1.user_id, floor((t1.sum_carbon - t2.low_carbon)/t3.low_carbon) plant_count
from t1, t2, t3;
(4) How many more than the next-ranked user
select user_id, plant_count,
plant_count - lead(plant_count, 1, null) over(order by plant_count desc) less_count
from t4 limit 10;
(5) Putting it all together:
select user_id, plant_count,
plant_count - lead(plant_count, 1, null) over(order by plant_count desc) less_count
from (
select t1.user_id, floor((t1.sum_carbon - t2.low_carbon)/t3.low_carbon) plant_count
from (
select user_id, sum(low_carbon) sum_carbon from user_low_carbon
where datediff(regexp_replace(data_dt, '/', '-'), '2017-10-01') < 0
group by user_id order by sum_carbon desc limit 11
) t1,
(select low_carbon from plant_carbon where plant_id = 'p004') t2,
(select low_carbon from plant_carbon where plant_id = 'p002') t3
) t4 limit 10;
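The merged query can be checked end to end in Python on the sample data (a sketch, not the original answer): the amount lists are transcribed per user from user_low_carbon.txt, and 215/19 are the p004/p002 costs from plant_carbon.txt.

```python
# low_carbon amounts per user, transcribed from user_low_carbon.txt.
amounts = {
    "u_001": [10, 150, 110, 10, 50, 10, 45, 90],
    "u_002": [10, 150, 70, 30, 80, 150, 101, 68],
    "u_003": [20, 10, 150, 160, 20, 120, 20, 10, 110],
    "u_004": [110, 20, 50, 120, 30, 60, 120, 10, 120],
    "u_005": [80, 50, 80, 180, 180, 10, 80, 280, 80, 80],
    "u_006": [40, 40, 140, 210, 10, 40, 40, 20, 50, 240],
    "u_007": [130, 30, 330, 30, 530, 30, 230, 130, 30],
    "u_008": [160, 60, 60, 60, 260, 360, 160, 60, 60],
    "u_009": [70, 70, 70, 170, 270, 70, 70, 70, 70],
    "u_010": [90, 90, 90, 90, 90, 80, 90, 90, 190, 90, 90],
    "u_011": [110, 100, 100, 120, 100, 100, 100, 130, 100],
    "u_012": [10, 120, 10, 10, 50, 10, 20, 10, 10],
    "u_013": [50, 150, 50, 150, 550, 350, 50, 20, 60],
    "u_014": [220, 120, 20, 20, 20, 250, 120, 270, 20],
    "u_015": [10, 20, 10, 10, 20, 70, 10, 80, 60],
}
POPLAR, WILLOW = 215, 19  # low_carbon of p004 and p002 (plant_carbon.txt)

totals = {u: sum(v) for u, v in amounts.items()}
# limit 11: keep one extra row so the 10th place still has a successor.
top11 = sorted(totals.items(), key=lambda kv: -kv[1])[:11]
# t4: one poplar first, then floor-divide the remainder by the willow cost.
plants = [(u, (t - POPLAR) // WILLOW) for u, t in top11]
# lead(plant_count): difference with the next-ranked user, then limit 10.
result = [(u, c, c - plants[i + 1][1] if i + 1 < len(plants) else None)
          for i, (u, c) in enumerate(plants)][:10]
```

The top row comes out as u_007 with 66 沙柳, 3 more than second-placed u_013.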
2. Ant Forest low-carbon user ranking analysis
Problem: query the daily records in user_low_carbon for users who,
in 2017, on three or more consecutive days,
saved more than 100g of carbon (low_carbon) per day.
Return the matching raw rows from user_low_carbon.
For example, u_002 qualifies with the rows below, because on the four consecutive days 2017/1/2 through 2017/1/5 each daily total exceeds 100g:
user_id | data_dt | low_carbon |
---|---|---|
u_002 | 2017/1/2 | 150 |
u_002 | 2017/1/2 | 70 |
u_002 | 2017/1/3 | 30 |
u_002 | 2017/1/3 | 80 |
u_002 | 2017/1/4 | 150 |
u_002 | 2017/1/5 | 101 |
Solution 1 (lag/lead over neighbouring days):
(1) Aggregate by user_id and day, keeping days over 100g: t1
-- t1
select user_id, regexp_replace(data_dt, '/', '-') dt, sum(low_carbon) sum_carbon
from user_low_carbon
where substring(data_dt, 1, 4) = '2017'
group by user_id, regexp_replace(data_dt, '/', '-') having sum_carbon > 100;
(2) Day gaps to the two previous and two next qualifying days: t2
select user_id, dt,
datediff(dt, lag(dt, 2, null) over(partition by user_id order by dt)) lag2,
datediff(dt, lag(dt, 1, null) over(partition by user_id order by dt)) lag1,
datediff(lead(dt, 1, null) over(partition by user_id order by dt), dt) lead1,
datediff(lead(dt, 2, null) over(partition by user_id order by dt), dt) lead2
from t1;
(3) Keep days that sit inside a 3-day run (a run starts at, ends at, or surrounds the day), then users with at least three such days: t3
select user_id from t2 where (lead1 = 1 and lead2 = 2)
or (lag1 = 1 and lag2 = 2)
or (lag1 = 1 and lead1 = 1)
group by user_id having count(1) > 2;
(4) Join back to the base table to get the raw rows
select t4.* from user_low_carbon t4 join t3 on t3.user_id = t4.user_id;
(5) Putting it all together:
select t4.* from user_low_carbon t4 join (
select user_id from (
select user_id, dt,
datediff(dt, lag(dt, 2, null) over(partition by user_id order by dt)) lag2,
datediff(dt, lag(dt, 1, null) over(partition by user_id order by dt)) lag1,
datediff(lead(dt, 1, null) over(partition by user_id order by dt), dt) lead1,
datediff(lead(dt, 2, null) over(partition by user_id order by dt), dt) lead2
from (
select user_id, regexp_replace(data_dt, '/', '-') dt, sum(low_carbon) sum_carbon
from user_low_carbon
where substring(data_dt, 1, 4) = '2017'
group by user_id, regexp_replace(data_dt, '/', '-') having sum_carbon > 100
) t1
) t2 where (lead1 = 1 and lead2 = 2)
or (lag1 = 1 and lag2 = 2)
or (lag1 = 1 and lead1 = 1)
group by user_id having count(1) > 2
) t3 on t3.user_id = t4.user_id;
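The lag/lead filter can be replayed in Python on the sample data (a sketch, not the original answer). The daily dict below holds day-of-January to grams, aggregated per user from user_low_carbon.txt, i.e. t1 before the >100g filter.

```python
# Daily totals per user (day of January 2017 -> grams), aggregated from
# user_low_carbon.txt.
daily = {
    "u_001": {1: 10, 2: 270, 4: 60, 6: 135},
    "u_002": {1: 10, 2: 220, 3: 110, 4: 150, 5: 101, 6: 68},
    "u_003": {1: 20, 2: 160, 3: 160, 4: 20, 5: 120, 6: 20, 7: 120},
    "u_004": {1: 110, 2: 70, 3: 120, 4: 30, 5: 60, 6: 120, 7: 130},
    "u_005": {1: 80, 2: 130, 3: 180, 4: 190, 5: 80, 6: 280, 7: 160},
    "u_006": {1: 40, 2: 180, 3: 220, 4: 40, 5: 40, 6: 20, 7: 290},
    "u_007": {1: 130, 2: 360, 3: 30, 4: 530, 5: 30, 6: 230, 7: 160},
    "u_008": {1: 160, 2: 120, 3: 60, 4: 260, 5: 360, 6: 160, 7: 120},
    "u_009": {1: 70, 2: 140, 3: 170, 4: 270, 5: 70, 6: 70, 7: 140},
    "u_010": {1: 90, 2: 180, 3: 90, 4: 170, 5: 180, 6: 190, 7: 180},
    "u_011": {1: 110, 2: 200, 3: 120, 4: 100, 5: 100, 6: 100, 7: 230},
    "u_012": {1: 10, 2: 130, 3: 10, 4: 50, 5: 10, 6: 20, 7: 20},
    "u_013": {1: 50, 2: 200, 3: 150, 4: 550, 5: 350, 6: 50, 7: 80},
    "u_014": {1: 220, 2: 140, 3: 20, 4: 20, 5: 250, 6: 120, 7: 290},
    "u_015": {1: 10, 2: 30, 3: 10, 4: 20, 5: 70, 6: 10, 7: 140},
}

qualified = set()
for user, days in daily.items():
    hot = sorted(d for d, g in days.items() if g > 100)  # days over 100g
    inside = 0  # qualifying days that sit inside a 3-day run
    for i, d in enumerate(hot):
        lag1 = d - hot[i - 1] if i >= 1 else None
        lag2 = d - hot[i - 2] if i >= 2 else None
        lead1 = hot[i + 1] - d if i + 1 < len(hot) else None
        lead2 = hot[i + 2] - d if i + 2 < len(hot) else None
        # the three WHERE branches: a run starts, ends, or surrounds this day
        if ((lead1 == 1 and lead2 == 2) or (lag1 == 1 and lag2 == 2)
                or (lag1 == 1 and lead1 == 1)):
            inside += 1
    if inside > 2:  # having count(1) > 2
        qualified.add(user)
```

This flags u_002, u_005, u_008, u_009, u_010, u_011, u_013 and u_014; the final join then pulls their raw rows.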
Solution 2 (gaps-and-islands with a row number):
(1) Aggregate by user_id and day into t1, as in step (1) of solution 1.
(2) Number each user's qualifying days in date order:
select user_id, dt, rank() over(partition by user_id order by dt) rn from t1;
(3) Within a consecutive run, the date minus its row number is constant, so group by that difference and keep groups spanning at least 3 days (dedupe user_id before joining, since a user may have more than one run):
select user_id from t2 group by user_id, date_sub(dt, rn) having count(1) >= 3;
(4) Join back to the base table for the raw rows, as in step (4) of solution 1.
(5) Putting it all together:
select user_id from (
    select user_id, dt, rank() over(partition by user_id order by dt) rn
    from (
        select user_id, regexp_replace(data_dt, '/', '-') dt, sum(low_carbon) sum_carbon
        from user_low_carbon
        where substring(data_dt, 1, 4) = '2017'
        group by user_id, regexp_replace(data_dt, '/', '-') having sum_carbon > 100
    ) t1
) t2
group by user_id, date_sub(dt, rn) having count(1) >= 3;
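The gaps-and-islands trick can be checked the same way (a sketch, not the original answer): subtracting each qualifying day's 0-based position from the day itself gives a value that is constant across a consecutive run, so counting each value and keeping groups of 3 or more finds the runs. It selects the same eight users as solution 1.

```python
from collections import Counter

# Daily totals per user (day of January 2017 -> grams), aggregated from
# user_low_carbon.txt.
daily = {
    "u_001": {1: 10, 2: 270, 4: 60, 6: 135},
    "u_002": {1: 10, 2: 220, 3: 110, 4: 150, 5: 101, 6: 68},
    "u_003": {1: 20, 2: 160, 3: 160, 4: 20, 5: 120, 6: 20, 7: 120},
    "u_004": {1: 110, 2: 70, 3: 120, 4: 30, 5: 60, 6: 120, 7: 130},
    "u_005": {1: 80, 2: 130, 3: 180, 4: 190, 5: 80, 6: 280, 7: 160},
    "u_006": {1: 40, 2: 180, 3: 220, 4: 40, 5: 40, 6: 20, 7: 290},
    "u_007": {1: 130, 2: 360, 3: 30, 4: 530, 5: 30, 6: 230, 7: 160},
    "u_008": {1: 160, 2: 120, 3: 60, 4: 260, 5: 360, 6: 160, 7: 120},
    "u_009": {1: 70, 2: 140, 3: 170, 4: 270, 5: 70, 6: 70, 7: 140},
    "u_010": {1: 90, 2: 180, 3: 90, 4: 170, 5: 180, 6: 190, 7: 180},
    "u_011": {1: 110, 2: 200, 3: 120, 4: 100, 5: 100, 6: 100, 7: 230},
    "u_012": {1: 10, 2: 130, 3: 10, 4: 50, 5: 10, 6: 20, 7: 20},
    "u_013": {1: 50, 2: 200, 3: 150, 4: 550, 5: 350, 6: 50, 7: 80},
    "u_014": {1: 220, 2: 140, 3: 20, 4: 20, 5: 250, 6: 120, 7: 290},
    "u_015": {1: 10, 2: 30, 3: 10, 4: 20, 5: 70, 6: 10, 7: 140},
}

qualified = set()
for user, days in daily.items():
    hot = sorted(d for d, g in days.items() if g > 100)  # days over 100g
    # day minus position is constant within a consecutive run -- the same
    # idea as grouping by date_sub(dt, rn) in the SQL.
    runs = Counter(d - i for i, d in enumerate(hot))
    if any(n >= 3 for n in runs.values()):  # having count(1) >= 3
        qualified.add(user)
```

Both solutions agree on the sample data, which is a good sign the window-based and rank-difference formulations are equivalent here.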