文章目录
1.找出某特征连续出现n次的所有记录
写在前面:对于求连续记录的题,套路都是:求哪个变量连续就用这个变量-row_number()
1.1 例题1
题目要求:
id和people都是int类型,visit_date是日期类型,id是主键
- 方法一:使用rank()
思路分析:
将people>=100的所有id进行排序(由于本例中id是主键,所以肯定不会有重复,因此直接使用rank就可以);然后用id值减去rank值,结果记为c,如果id是连续的,那么连续id对应c应该是相同的;c出现了几次,那么id就连续了c条记录。
with t1 as(
select id,visit_date,people,
id - rank() over(order by id) as c
from stadium
where people>=100
)
select id,visit_date,people from t1 where c in (
select c
from t1
group by c
having count(id)>=3
);
- 方法二:不使用rank()
如果mysql版本不支持窗口函数,则可使用该方法。
思路分析:
从所有people>=100的id(假设叫id_list)中,找出id-1、id+1以及自己本身都在id_list中的id(假设叫center_id);最后从最原始记录中找出center_id-1,center_id,center_id+1的记录;
with t1 as(
select id from stadium where
id in (select id from stadium where people>=100) and
id-1 in (select id from stadium where people>=100) and
id+1 in (select id from stadium where people>=100)
)
select distinct s.* from stadium s join t1
on s.id=t1.id-1 or s.id=t1.id or s.id=t1.id+1;
1.2 例题2
题目要求:
- 方法一:使用rank()
with t1 as(select seat_id,seat_id-rank() over(order by seat_id asc) as gap
from cinema where free=1)
select seat_id from t1 where gap in
(select gap from t1 group by gap having count(seat_id)>=2);
- 方法二:不使用rank()
select distinct a.seat_id from cinema a
join cinema b
on abs(a.seat_id-b.seat_id) = 1
where a.free = 1 and b.free = 1 order by a.seat_id;
在本例中,在free为1的记录中,满足abs(a.seat_id-b.seat_id) = 1 条件的有
a.seat_id | b.seat_id |
---|---|
3 | 4 |
4 | 3 |
4 | 5 |
5 | 4 |
所以上述a.seat_id去重后结果就为3、4、5.
方法二的思想很简单易懂,但是只适用于连续2次及以上,如果多次的话,代码修改起来就会很麻烦,而方法一只需要修改一个参数就行。
1.3 拓展题
- step1:先把指定日期内的失败记录和成功记录连接起来:
select 0 as 'state',fail_date as date from failed where fail_date>='2019-01-01' and fail_date<='2019-12-31'
union all
select 1 as 'state',success_date as date from succeeded where success_date>='2019-01-01' and success_date<='2019-12-31'
order by date
设该表为t1.
- step2:把连续的记录分割成组
从下图可以看出,只要从每组中取出最小的日期和最大的日期,就能完成题目的要求。
那么这些组该如何形成呢?
state为0和1的记录可以通过group by state区分开,那同一个state里的记录又如何划分组呢?
这个是不是就和上面那个电影座位问题一样?date就相当于seat_id,state就相当于free;只不过上例中只需要将free为1的记录分组,本例中,state为0和为1的记录都需要分组。
方法一:state为1和为0的分别在内部再次分组,然后拼接起来:
对于state为1的:
select state,date,subdate(date,row_number() over(order by date asc)) as gap
from t1
where state=1;
state | date | gap |
---|---|---|
1 | 2019-01-01 | 2018-12-31 |
1 | 2019-01-02 | 2018-12-31 |
1 | 2019-01-03 | 2018-12-31 |
1 | 2019-01-06 | 2019-01-02 |
可以看到,通过gap就可以把state都为1的记录分开;
对于state为0的也是相同的操作:
select state,date,subdate(date,row_number() over(order by date asc)) as gap
from t1
where state=2;
state | date | gap |
---|---|---|
0 | 2019-01-04 | 2019-01-03 |
0 | 2019-01-04 | 2019-01-03 |
- step3:选出每组中的最小日期作为start_date,最大日期作为end_date.
对于state为1的:
select state,min(date) as start_date,max(date) as end_date
from (
select state,date,subdate(date,row_number() over(order by date asc))
from t1
where state=1
) t2
group by gap
state | start_date | end_date |
---|---|---|
1 | 2019-01-01 | 2019-01-03 |
1 | 2019-01-06 | 2019-01-06 |
对于state为0的:
select state,min(date) as start_date,max(date) as end_date
from (
select state,date,subdate(date,row_number() over(order by date asc))
from t1
where state=0
) t3
group by gap
state | start_date | end_date |
---|---|---|
0 | 2019-01-04 | 2019-01-05 |
将上面两表通过union结合起来,然后按照start_date升序排列,就是正确答案了。
当然,我们也可以不用把state取值为0、1的情况分开,我们使用partition by 和 group by 也可以完成step2、3
select state,date,subdate(date,row_number() over(partition by state order by date asc)) as gap from t1
state | date | gap |
---|---|---|
1 | 2019-01-01 | 2018-12-31 |
1 | 2019-01-02 | 2018-12-31 |
1 | 2019-01-03 | 2018-12-31 |
0 | 2019-01-04 | 2019-01-03 |
0 | 2019-01-04 | 2019-01-03 |
1 | 2019-01-06 | 2019-01-02 |
ps:此时的row_number 为:
state | date | row_number |
---|---|---|
1 | 2019-01-01 | 1 |
1 | 2019-01-02 | 2 |
1 | 2019-01-03 | 3 |
0 | 2019-01-04 | 1 |
0 | 2019-01-042 | |
1 | 2019-01-06 | 4 |
最后通过state和gap进行分组,选出每组最小日期和最大日期即可。
完整代码:
select state as period_state,min(date) as start_date,max(date) as end_date
from
(
select state,date,subdate(date,row_number() over(partition by state order by date asc)) as gap from
(
select 'failed' as 'state',fail_date as date from failed
union all
select 'succeeded' as 'state',success_date as date from succeeded
order by date
) t1
where date>='2019-01-01' and date<='2019-12-31'
) t2
group by state,gap
order by start_date;
2.交换座位
题目要求:
2.1 交换id
2.1.1 使用case
-解题思路:当id为奇数且不是最后一个时,id变为id+1;当id为奇数,但是是最后一个时,id不变;当id为偶数时,id变为id-1.
with n as(select max(id) as max_id from seat)
select
(case
when id%2=1 and id!=n.max_id then id+1
when id%2=1 and id=n.max_id then id
else id-1
end) as id,student
from seat,n
order by id asc;
另一种写法:
select
if(id%2=0,
id-1,
if(id=(select count(distinct id) from seat),
id,
id+1))
as id,student
from seat
order by id;
2.1.2 使用位异或和coalesce
不了解位异或的,请戳sql位运算符
coalesce(exp1,exp2,exp3,…)返回第一个不为null的值,如果全为null就返回null。
- 首先使用(id+1)^1-1来达到交换id的目的:
SELECT id, (id+1)^1-1, student FROM seat;
2. 连接原始表和更新id后的表,连接条件是ori.id=update.id
但是更新后的表中,没有id为5的记录,所以直接连接的话,结果是null,因此使用coalesce来解决
select id,coalesce(t1.student,s.student) as student
from seat s
left join (select (id+1)^1-1 as update_id,student from seat) t1
on s.id = t1.update_id;
2.2 上移记录、下移记录(使用开窗函数)
select id, if(t.id%2=1,t.back,t.ahead) as 'student' from
(select id,student,
LAG(student, 1,student) over() ahead,
LEAD(student, 1,student) over() back
from seat) t
如果不能使用开窗函数,就连接两张表(一张是上移之后的表,一张是下移在之后的表):
with t1 as(select id-1 as sub_id,student from seat),
t2 as(select id+1 as add_id,student from seat)
select id,ifnull(if(id%2=1,t1.student,t2.student),s.student) as student from seat s
left join t1 on s.id=t1.sub_id
left join t2 on s.id=t2.add_id;
3.avg与is not null连用
如果某个变量(设为var)的值只分为null和非null(设为a),那么可以用avg(var is not null)求a的个数占总体个数的比例
3.1 例题1
- step1:先找出每个人登录日期
select player_id,min(event_date) as login
from activity
group by player_id;
- step2:将上表与原表连接起来
select a.player_id,a.event_date,t1.login from
(select player_id,min(event_date) as login
from activity
group by player_id) t1
left join activity a
on a.player_id=t1.player_id and date_sub(a.event_date,interval 1 day)=login;
因为t1作为主表,所以连接查询一共会返回三行,每行代表一个玩家,若前面两个为null,说明该玩家在login的一天之后并没有再次登录
- step3:使用avg算fraction
select round(avg(a.player_id is not null),2) as fraction
from (select player_id,min(event_date) as login
from activity
group by player_id) t1
left join activity a
on a.player_id=t1.player_id and date_sub(a.event_date,interval 1 day)=login;
可以看到,这段代码和上一段基本一致,只是最后的查询结果变了。
a.player_id is not null
:如果是null 就返回0,如果不是null就返回1,所以从step2的结果可以看出,每行返回的值分别是1、0、0;那么再avg一下,结果就是0.33
3.2 例题2
select question_id as survey_log
from survey_log
group by question_id
order by avg(answer_id is not null) desc
limit 1;
4.求中位数
4.1 例题1
首先说一下,我感觉它这里的中位数不太合适,按理来说如果总数是偶数个的话,中位数应该是中间两个数取平均,但是这里还是先按照它的方法来。
思路一:
先把所有数从小到大排列起来并标号从1到n (因此这里不能使用rank(),而应该使用 row_number()),中位数与(n/2)的差在[0,1]之间,以此来选择出中位数。
- step1:通过窗口函数求出每个人在该公司的排名(工资升序排列)以及该公司的总人数。
SELECT id, company, salary,
ROW_NUMBER() OVER (PARTITION BY company
ORDER BY Salary ASC, id ASC) AS row_num,
COUNT(Id) OVER (PARTITION BY company) AS count_id
FROM Employee;
- step2:根据思路中的那个条件选出中位数
select Id,Company,Salary from(
SELECT id, company, salary,
ROW_NUMBER() OVER (PARTITION BY company
ORDER BY Salary ASC, id ASC) AS row_num,
COUNT(Id) OVER (PARTITION BY company) AS count_id
FROM Employee) t1
where row_num-count_id/2 between 0 and 1;
思路二:
也就是说,只有中位数才能保证 升序排名>=(总数/2) and 降序排名>=(总数/2)
select id,company,salary
from(
select id,company,salary,
row_number() over(partition by company order by salary asc,id asc) as asc_rank,
row_number() over(partition by company order by salary desc,id asc) as desc_rank,
count(id) over(partition by company) as total
from employee) t1
where asc_rank>=total/2 and desc_rank>=total/2;
4.2 例题2
思路一:
就和上例的第二个思路差不多
select avg(number) as median from(
select number,
sum(frequency) over(order by number) asc_accumu,
sum(frequency) over(order by number desc) desc_accumu,
sum(frequency) over() total
from numbers) t1
where asc_accumu>=total/2 and desc_accumu>=total/2;
思路二:
- 设总的个数是n。
- 若n为奇数,则n/2必落在某个区间内部(即不会在边界)假设是[f1,f1+f2],那么该区间对应的数字,即b,就是中位数。
- 若n为偶数,则n/2可能落在某个区间内部,假设是[0,f1],那么该区间对应的数字a,就是中位数;
- -n/2也可能落在某两个区间的边界处,比如n/2=f1+f2+f3,那么[f1+f2,fi+f2+f3]和[f1+f2+f3,f1+f2+f3+f4]对应的数:c、d,这两数的平均数就是中位数。
select avg(Number) as median from (
select *,sum(Frequency) over(order by Number rows between
unbounded preceding and unbounded following) as total,
sum(Frequency) over(order by Number rows between
unbounded preceding and current row) as accumulation
from Numbers) t1
where total/2 between accumulation -Frequency and accumulation;
(还是第一种思路简单…)
5. n preceding
select Id,Month,Salary from(
select Id,Month,
sum(salary) over(partition by id
order by month asc rows 2 PRECEDING) as Salary,
rank() over(partition by id
order by month desc) as month_rank
from employee) t1
where month_rank>1
order by 1,2 desc;
- month_rank:把月份从大到小排序,然后取出排名大于1的记录就可以将最后的一个月去除;
- rows 2 preceding:其实写全应该是 rows between 2 preceding and current row ;
- order by 1,2 :1代表id,2代表month
6.其实是一个简单的问题,我自己把它做复杂了…
正常方法:
select sum(tiv_2016) as TIV_2016 from insurance
where tiv_2015 in
(select tiv_2015 from insurance
group by tiv_2015 having count(1) > 1)
and (LAT,LON) in
(select LAT,LON from insurance
group by LAT,LON having count(1) = 1);
我的方法:
- step1:求出经纬度不完全相同的PID
select pid from insurance
group by concat(lat,lon)
having count(pid)=1;
- step2:求出有保额相同的pid
select pid from insurance where tiv_2015 in (
select tiv_2015
from insurance
group by tiv_2015
having count(pid)>1)
- step3:求两者交集
select t1.pid from
(select pid from insurance
group by concat(lat,lon)
having count(pid)=1) t1 join
(select pid from insurance where tiv_2015 in (
(select tiv_2015
from insurance
group by tiv_2015
having count(pid)>1)))t2
on t1.pid = t2.pid;
- step4:得到最后答案
select sum(tiv_2016) as tiv_2016 from insurance where pid in(
select t1.pid from
(select pid from insurance
group by concat(lat,lon)
having count(pid)=1) t1 join
(select pid from insurance where tiv_2015 in (
(select tiv_2015
from insurance
group by tiv_2015
having count(pid)>1)))t2
on t1.pid = t2.pid);
其实思路是一样的,就是我把它写复杂了,,,嗐
7.巧用union创造行
7.1 例题1
如果没有记录为0的那一行,那么这道题就并不是很难:
SELECT spend_date,platform,SUM(amount) AS total_amount,COUNT(user_id) AS total_users FROM(
SELECT spend_date,user_id, IF(COUNT(DISTINCT platform) = 2, 'both', platform) 'platform',
SUM(amount) 'amount' FROM spending GROUP BY spend_date,user_id) t1
GROUP BY spend_date,platform;
因此,本题的关键就是怎么搞出记录为0的那一行。
解题步骤:
- step1:找出每天、每个用户的支付平台类型:
SELECT spend_date,user_id, IF(COUNT(DISTINCT platform) = 2, 'both', platform) 'platform',
SUM(amount) 'amount' FROM spending
GROUP BY spend_date,user_id;
- step2:创建一个表,该表只有一列(‘platform’)、三行(‘mobile’、‘desktop’、‘both’)
这一步是本题关键点所在,也就是如何凭空给没有数据的记录(即total_amount、total_users)搞出一行来
SELECT 'desktop' AS 'platform' UNION
SELECT 'mobile' AS 'platform' UNION
SELECT 'both' AS 'platform';
- step3:把上述两表连接起来(笛卡尔乘积)
SELECT * FROM
(SELECT spend_date,user_id, IF(COUNT(DISTINCT platform) = 2, 'both', platform) 'platform',
SUM(amount) 'amount' FROM spending
GROUP BY spend_date,user_id) a,
(SELECT 'desktop' AS 'platform' UNION
SELECT 'mobile' AS 'platform' UNION
SELECT 'both' AS 'platform') b;
原本a表有5条记录,b表有3条记录,笛卡尔乘积之后,变成15条记录
- step4:
SELECT spend_date,b.platform,
SUM(IF(a.platform = b.platform,amount,0)) AS total_amount,
COUNT(IF(a.platform = b.platform,user_id,NULL)) AS total_users
FROM
(SELECT spend_date,user_id, IF(COUNT(DISTINCT platform) = 2, 'both', platform) 'platform',
SUM(amount) 'amount' FROM spending
GROUP BY spend_date,user_id) a,
(SELECT 'desktop' AS 'platform' UNION
SELECT 'mobile' AS 'platform' UNION
SELECT 'both' AS 'platform') b
GROUP BY spend_date,b.platform;
ps:
1 b.platform才能保证每一天都有‘both’、‘desktop’、‘mobile’(其实这也是这道题的难点所在)
2 count if 那儿,不满足if条件的,一定要写成null,因为null不会被count计算。
7.2 例题2
matches表数据如下:
players表数据如下:
要求:查询每组中总score最高的player_id,若成绩相同,则取id较小的那个。查询结果应如下图所示:
- 思路:先用union all把每个player作为first_player和second_player的得分拼接起来,之后就可以计算每个player的总得分;然后按照总得分(降序)和player_id(升序)排列,最后按照group_id分组。(这里用到了group by的一个小技巧:group by之后如果没有使用聚合函数,那么返回的就是该组中的第一条记录!!!)
- step1:union all
SELECT first_player AS player_id,first_score AS score FROM matches
UNION ALL
SELECT second_player AS player_id,second_score AS score FROM matches
player_id | score |
---|---|
15 | 3 |
30 | 1 |
30 | 2 |
40 | 5 |
35 | 1 |
45 | 0 |
25 | 2 |
15 | 0 |
20 | 2 |
50 | 1 |
- step2:计算每个player的总得分,并按照总得分和player_id排好序
SELECT t1.player_id,SUM(score) AS score,group_id FROM(
SELECT first_player AS player_id,first_score AS score FROM matches
UNION ALL
SELECT second_player AS player_id,second_score AS score FROM matches) t1
LEFT JOIN players p ON t1.player_id = p.`player_id`
GROUP BY t1.player_id
ORDER BY score DESC,player_id ASC
player_id | score | group_id |
---|---|---|
40 | 5 | 3 |
15 | 3 | 1 |
30 | 3 | 1 |
20 | 2 | 3 |
25 | 2 | 1 |
35 | 1 | 2 |
50 | 1 | 2 |
45 | 0 | 1 |
- step3: 使用group by选出每组中的第一条记录
SELECT group_id,player_id FROM(
SELECT t1.player_id,SUM(score) AS score,group_id FROM(
SELECT first_player AS player_id,first_score AS score FROM matches
UNION ALL
SELECT second_player AS player_id,second_score AS score FROM matches) t1
LEFT JOIN players p ON t1.player_id = p.`player_id`
GROUP BY t1.player_id
ORDER BY score DESC,player_id ASC) t2
GROUP BY group_id;
group_id | player_id |
---|---|
1 | 15 |
2 | 35 |
3 | 40 |
8.with recursive
8.1 创造连续数列
- step1:使用with recursive创建一个从1到max(customer_id)的数列
with recursive full_id(n) as (
select 1 as id
union all
select n+1 from full_id where n<(select max(customer_id) from customers)
)
select * from full_id
- step2:找出遗失的ID
with recursive full_id(n) as (
select 1 as id
union all
select n+1 from full_id where n<(select max(customer_id) from customers)
)
select n as ids from full_id where n not in (select customer_id from customers)
8.2
要求:
查询2020年每月截止到月底时已有的司机人数、每月被接受的订单量
# 创建一个月份表 month_table
with recursive month_table(m) as(
select 1 as m
union all
select m+1 from month_table where m<=11
)
select month,active_drivers,ifnull(accepted_rides,0) as accepted_rides from
# t1计算每个月的已有注册司机人数
(select mt.m as month ,count(driver_id) active_drivers
from month_table mt left join drivers on 202000+mt.m>=date_format(join_date,'%Y%m')
group by mt.m) t1
left join
# t2计算每月被接受的订单量
(select month(requested_at) as order_month,count(a.ride_id) as accepted_rides
from rides r join acceptedrides a on r.ride_id=a.ride_id
where substr(requested_at,1,4)='2020' group by order_month
) t2
on t1.month=t2.order_month
8.3
原始数据就是8.2的数据,查询2020年每3个月的平均ride_distance和平均ride_duration
with recursive month_table(m) as(
select 1 as m
union all
select m+1 from month_table where m<=11
)
select month,
round(avg(month_distance) over(rows between current row and 2 following),2) as average_ride_distance,
round(avg(month_duration) over(rows between current row and 2 following),2) as average_ride_duration
from (
select mt.m as month
,ifnull(sum(ride_distance),0) as month_distance,ifnull(sum(ride_duration),0) as month_duration
from month_table as mt left join rides r on mt.m = month(requested_at) and substr(requested_at,1,4)='2020'
# 注意,这里2020年这个限制条件不能放到where里面,因为where的执行是在全连接之后、on之前
# 如果使用where,就会首先在全连接的结果中把所有不是2020的全部剔除,只剩下年份是2020的记录
# 然后on才作用于这些剩下的结果,因此就不能保证1~12月份都有记录
left join acceptedrides a on r.ride_id=a.ride_id
group by mt.m
) t1
limit 10 #只显示前10条记录
9. greatest、least选择多列中的最大值、最小值
9.1
select least(from_id,to_id) as person1,greatest(from_id,to_id) as person2,count(*) as call_count,sum(duration) as total_duration
from calls
group by person1,person2
10.窗口函数和group by连用
窗口函数作用的范围是group by之后的结果。什么叫group by之后的结果?就比如本例中:group by machine_id,process_id。那么group by之后的结果中,每一条记录就代表一组不同的(machine_id,process_id) (当然后面还可以有其他聚合字段)。那么窗口函数就是在这基础上进行的操作。
# 方法一:常规方法
select machine_id,round(avg(processing_time),3) as processing_time from(
select machine_id,process_id,sum(if(activity_type='start',-1,1)*timestamp) as processing_time
from activity group by machine_id,process_id) t1 group by machine_id
# 方法二:使用窗口函数(窗口函数作用的范围是group by之后的结果)
select * from(
select machine_id
,round(avg(sum(if(activity_type='start',-1,1)*timestamp)) over(partition by machine_id),3) as processing_time
from activity group by machine_id,process_id) t1 group by machine_id