以下SQL为本人手写,有些题目描述简单可能存在歧义,欢迎探讨交流
一.case专题-商品订单数据
数据表:
订单表orders,大概字段有(order_id’订单号,'user_id‘用户编号’, order_pay‘订单金额’ ,
order_time‘下单时间’,‘商品一级类目commodity_level_1’,‘商品二级类目commodity_level_2’)
1. 求最近7天内每一个一级类目下成交总额排名前3的二级类目:
select commodity_level_1,commodity_level_2,total,r
from(
select *,row_number() over(partition by commodity_level_1 order by total desc) r
from(
select commodity_level_1,commodity_level_2,sum(order_by) total
from orders
where timestampdiff(day,order_time,now()) <= 7
group by commodity_level_1,commodity_level_2
) a
)b
where r<=3;
2.提取8.1-8.10每一天消费金额排名在101-195的user_id
select order_date,user_id,total
from(
select *,row_number() over(partition by order_date,user_id order by total) r
from(
select convert(order_time,date) order_date,user_id,sum(order_by) total
from orders
where convert(order_time,date) between '20190801' and '20190810'
group by convert(order_time,date),user_id
) a
) b
where r between 101 and 195
二.case专题-活动运营数据分析
数据表
表1——订单表orders,大概字段有(user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’)
表2——活动报名表act_apply,大概字段有(act_id‘活动编号’,user_id‘报名用户’,act_time‘报名时间’)
1.活动运营数据分析-统计每个活动对应所有用户在报名后产生的总订单金额,总订单数
select a.user_id,sum(order_pay) total,count(*) num
from orders o
join act_apply a
on o.user_id = a.user_id
where o.order_time >= a.act_time
group by a.act_id
2.统计每个活动从开始后到当天(考试日)平均每天产生的订单数,活动开始时间定义为最早有用户报名的时间。(涉及到时间的数据类型均为:datetime)
select a.act_id,count(*)/timestampdiff(day,min(a.act_time),now())
from orders o
join act_apply a
on o.user_id = a.user_id
where o.order_time >= a.act_time
group by a.act_id
三.case专题-用户行为路径分析
表1——用户行为表tracking_log,大概字段有(user_id‘用户编号’,opr_id‘操作编号’,log_time‘操作时间’)
1.统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻
解题思路:运用窗口函数在每行数据生成一列数据:下次操作编号
注意要把datetime格式转换成date格式来表示天
select date,count(*)
from(
select user_id
from(
select user_id,convert(log_time,date) date,opr_id f,lag(opr_id,1) over(partition by user_id,convert(log_time,date) order by log_time) l
from tracking_log
) a
where f='A' and l='B'
) b
group by date;
2.统计用户行为序列为A-B-D的用户数
其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)
select count(*)
from(
select user_id,group_concat(opr_id) ubp
from tracking_log
group by user_id
) a
where ubp like '%A%B%D%' and ubp not like '%A%B%C%D%'
四.case专题-用户留存分析
表1——用户登陆表user_log,大概字段有(user_id‘用户编号’,log_date‘登陆时间’)
1.求每天新增用户数,以及他们第2天、30天的留存率
select count(distinct user_id),
round(count(distinct case when datediff(log_date,fd)=1 then a.user_id else null end)/
count(distinct user_id),2),
round(count(distinct case when datediff(log_date,fd)=29 then a.user_id else null end)/
count(distinct user_id),2)
from user_log u
left join (
select user_id,min(log_date) fd
from user_log
group by user_id
) a
on u.user_id = a.user_id
group by fd
2.找近90天,30天,7天的登录人数
select count(distinct case when datediff(curdate(),log_date)<=90 then user_id else null end),
count(distinct case when datediff(curdate(),log_date)<=30 then user_id else null end),
count(distinct case when datediff(curdate(),log_date)<=7 then user_id else null end)
from user_log
3.求用户近一个月平均登录时间间隔(按天)
select user_id,sum(datediff(log_date,l))/30
from(
select user_id,log_date,lag(log_date,1) over(partition by user_id order by log_date) l
from user_log
where datediff(curdate(),log_date) <= 30
group by user_id
) a
group by user_id
五.case专题-统计特征(中位数,众数,四分位数)
字段:店铺id(shop_id),销量(sale),商品id(commodity_id)
1.求每个店铺商品销量排名的中位数
select shop_id,avg(sale)
from(
select shop_id,commodity_id,sale,
count(1) over(partition by shop_id) total,
row_number() over(partition by shop_id order by sale desc) r
from orders
) a
where abs(r-(total+1)/2)<1
group by shop_id
abs(rn - (cnt+1)/2) < 1
解释下上面的公式:
rn是给定长度为cnt的数列的序号排序,
eg:对于1,2,3,4,5,它的中位数所在序号是3,3-(5+1)/2 = 0
对于1,2,3,4,它的中位数所在序号是2,3
2 - (4+1)/2 = -0.5
3-(4+1)/2 = 0.5
可见(cnt+1)/2是一个数列的中间位置,如果是奇数数列,这个位置刚好是中位数所在
2.求每个店铺订购商品的众数
select shop_id,commodity_id,mt
from(
select shop_id,commodity_id,total,max(total) over(partition by shop_id) mt
from(
select shop_id,commodity_id,count(*) as total
from orders
group by shop_id,commodity_id
) a
) b
where total = mt
3.求四分位数
三个表
T1:good_id,cate_id(分类)
T2:mall_id(店铺), good_id
T3:mall_id, credit_score(信用分)
问,在不同分类中,店铺的信用分前100和求top25%
select cate_id,mall_id,credit_score
from(
select cate_id,mall_id,credit_score,
row_number() over(partition by cate_id order by credit_score desc) r,
count(1) over(partition by cate_id) total
from(
select t2.mall_id,t2.good_id,t3.credit_score,t1.cate_id
from t1
join t2
on t1.good_id=t2.good_id
join t3
join t2.mall_id = t3.mall_id
) a
)b
where r<=total*0.25
六.case专题-GMV周同比统计
字段:时间(sale_date),店铺类别(cate_id),店铺数量(mall_num),gmv
1.拼多多618前后一周内各店铺类别gmv的日均提升幅度和比例
注:以下解法只适用于数据连续情况,如果数据不连续,例如对于上一周没有星期六,星期日的数据,lead(gmv,7)这样规定移动窗口必然会出现错误。
select cate_id,avg(gmv-l),avg((gmv-l)/gmv)
from(
select sale_date,cate_id,mall_num,gmv,
lag(gmv,7) over(partition by cate_id order by sale_date) l
from T
) a
where sale_date between '20190611' and '20190625'
group by cate_id
2.在618前一周gmv top20%,20-40%等这5类商铺在618后一周内gmv日均提升幅度和比例
注:写的过程,分步一点点写
select t3.cate_id,avg(l-gmv),avg((l-gmv)/gmv)
from(
select sale_date,cate_id,mall_num,gmv,
lead(gmv,7) over(partition by cate_id order by sale_date) l
from T
where sale_date between '20190611' and '20190625'
) t3
join(
select cate_id,
case
when r <= num*0.2 then 'top20%'
when r <= num*0.4 and r > num*0.2 then '20%-40%'
when r <= num*0.6 and r > num*0.4 then '40%-60%'
when r <= num*0.8 and r > num*0.6 then '60%-80%'
when r > num*0.8 then '80%以上'
else NULL
end as gmv_quantile
from(
select cate_id,total,row_number() over(order by total desc) r,
count(1) over() num
from(
select cate_id,sum(gmv) total
from T
where sale_date between '20190611' and '20190617'
group by cate_id
) t1
) t2
) t4
on t3.cate_id = t4.cate_id
where t3.sale_date between '20190611' and '20190617'
group by t3.cate_id
七.case专题-连续区间问题
表:Logs
±--------------±--------+
| Column Name | Type |
±--------------±--------+
| log_id | int |
±--------------±--------+
id 是上表的主键。
上表的每一行包含日志表中的一个 ID。
后来一些 ID 从 Logs 表中删除。
编写一个 SQL 查询得到 Logs 表中的连续区间的开始数字和结束数字。
将查询表按照 start_id 排序。
查询结果格式如下面的例子:
Logs 表:
±-----------+
| log_id |
±-----------+
| 1 |
| 2 |
| 3 |
| 7 |
| 8 |
| 10 |
±-----------+
结果表:
±-----------±-------------+
| start_id | end_id |
±-----------±-------------+
| 1 | 3 |
| 7 | 8 |
| 10 | 10 |
±-----------±-------------+
结果表应包含 Logs 表中的所有区间。
(1)使用两个变量,一个@id用来记录logid,可以比较当下log_id与之前log_id的差值,判断是否连续。一个@num用来储存连续状态。
set @id=1,@num=0;
select log_id,
case when @id = log_id -1 then @num := @num
else @num := @num + 1
end as n,
@id := log_id
from logs;
(2)得到上述结果后,用num字段分组,最小log_id为start_id,最大log_id为end_id。
最终答案
set @id=1,@num=0;
select min(log_id) strat_id,max(log_id) end_id
from(
select log_id,
case when @id = log_id -1 then @num := @num
else @num := @num + 1
end as n,
@id := log_id
from logs
) t
group by n;
八.case专题-学生成绩分析
表:Enrollments
±--------------±--------+
| Column Name | Type |
±--------------±--------+
| student_id | int |
| course_id | int |
| grade | int |
±--------------±--------+
(student_id, course_id) 是该表的主键。
1.查询每位学生获得的最高成绩和它所对应的科目,若科目成绩并列,取 course_id 最小的一门。查询结果需按 student_id 增序进行排序。
解法一:窗口函数
select *
from(
select student_id,course_id,grade,row_number() over(partition by student_id order by grade desc,course_id) r
from Enrollments
) a
where r = 1
order by student_id;
解法二: in解法(更简单,快速)
select student_id,min(course_id)
from Enrollments
where (student_id,grade) in (
select student_id,max(grade)
from Enrollments
group by student_id)
group by student_id
order by student_id;
2.查询每一科目成绩最高和最低分数的学生,输出courseid,studentid,score
解法一:Union
预备知识:union和union all
union对两个结果集进行并集操作,两个联合的字段必须一样
两者区别是union要进行重复值扫描,不包括重复行,同时进行默认规则的排序,效率低
union all包括重复行,不进行排序
select c_id,s_id
from(
select *,row_number() over(partition by c_id order by s_score desc) r
from score
) a
where r=1
union
select c_id,s_id
from(
select *,row_number() over(partition by c_id order by s_score) r
from score
) a
where r=1;
解法二:case-when
select c_id,
max(case when r1=1 then s_id else null end) '最高分学生',
max(case when r2=1 then s_id else null end) '最低分学生'
from(
select *,row_number() over(partition by c_id order by s_score desc) r1,
row_number() over(partition by c_id order by s_score) r2
from score
) a
group by c_id;
九.case专题-学生做题情况分析
表t:做题日期(time),学生id(stu_id),题目id(exer_id)
1.统计10.1-10.10每天做新题的人的数量,重点在每天
select md,count(distinct t.stu_id)
from t
join(
select stu_id,exer_id,min(covert(time,date)) md
from t
where covert(time,date) between '20191001' and '20191010'
group by stu_id,exer_id
) a
on t.stu_id = a.stu_id and t.exer_id = a.exer_id
group by md
参考: 数分面试-SQL篇.