1. 第二快/慢用时之差试卷时长一半的试卷
现有试卷信息表examination_info(exam_id试卷ID, tag试卷类别, difficulty试卷难度, duration考试时长, release_time发布时间):
id | exam_id | tag | difficulty | duration | release_time |
---|---|---|---|---|---|
1 | 9001 | SQL | hard | 60 | 2021-09-01 06:00:00 |
2 | 9002 | C++ | hard | 60 | 2021-09-01 06:00:00 |
3 | 9003 | 算法 | medium | 80 | 2021-09-01 10:00:00 |
试卷作答记录表exam_record(uid用户ID, exam_id试卷ID, start_time开始作答时间, submit_time交卷时间, score得分):
id | uid | exam_id | start_time | submit_time | score |
---|---|---|---|---|---|
1 | 1001 | 9001 | 2021-09-01 09:01:01 | 2021-09-01 09:51:01 | 78 |
2 | 1001 | 9002 | 2021-09-01 09:01:01 | 2021-09-01 09:31:00 | 81 |
3 | 1002 | 9002 | 2021-09-01 12:01:01 | 2021-09-01 12:31:01 | 81 |
4 | 1003 | 9001 | 2021-09-01 19:01:01 | 2021-09-01 19:59:01 | 86 |
5 | 1003 | 9002 | 2021-09-01 12:01:01 | 2021-09-01 12:31:51 | 89 |
6 | 1004 | 9002 | 2021-09-01 19:01:01 | 2021-09-01 19:30:01 | 85 |
7 | 1005 | 9001 | 2021-09-01 12:01:01 | 2021-09-01 12:31:02 | 85 |
8 | 1006 | 9001 | 2021-09-07 10:01:01 | 2021-09-07 10:21:01 | 84 |
9 | 1003 | 9001 | 2021-09-08 12:01:01 | 2021-09-08 12:11:01 | 40 |
10 | 1003 | 9002 | 2021-09-01 14:01:01 | (NULL) | (NULL) |
11 | 1005 | 9001 | 2021-09-01 14:01:01 | (NULL) | (NULL) |
12 | 1003 | 9003 | 2021-09-08 15:01:01 | (NULL) | (NULL) |
找到第二快和第二慢用时之差大于试卷时长的一半的试卷信息,按试卷ID降序排序。由示例数据结果输出如下:
exam_id | duration | release_time |
---|---|---|
9001 | 60 | 2021-09-01 06:00:00 |
解释:试卷9001被作答用时有50分钟、50分钟、30分1秒、11分钟、10分钟,第二快和第二慢用时之差为50分钟-11分钟=39分钟,试卷时长为60分钟,因此满足大于试卷时长一半的条件,输出试卷ID、时长、发布时间。
示例1
drop table if exists examination_info,exam_record;
CREATE TABLE examination_info (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
exam_id int UNIQUE NOT NULL COMMENT '试卷ID',
tag varchar(32) COMMENT '类别标签',
difficulty varchar(8) COMMENT '难度',
duration int NOT NULL COMMENT '时长',
release_time datetime COMMENT '发布时间'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE exam_record (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int NOT NULL COMMENT '用户ID',
exam_id int NOT NULL COMMENT '试卷ID',
start_time datetime NOT NULL COMMENT '开始时间',
submit_time datetime COMMENT '提交时间',
score tinyint COMMENT '得分'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO examination_info(exam_id,tag,difficulty,duration,release_time) VALUES
(9001, 'SQL', 'hard', 60, '2021-09-01 06:00:00'),
(9002, 'C++', 'hard', 60, '2021-09-01 06:00:00'),
(9003, '算法', 'medium', 80, '2021-09-01 10:00:00');
INSERT INTO exam_record(uid,exam_id,start_time,submit_time,score) VALUES
(1001, 9001, '2021-09-01 09:01:01', '2021-09-01 09:51:01', 78),
(1001, 9002, '2021-09-01 09:01:01', '2021-09-01 09:31:00', 81),
(1002, 9002, '2021-09-01 12:01:01', '2021-09-01 12:31:01', 81),
(1003, 9001, '2021-09-01 19:01:01', '2021-09-01 19:59:01', 86),
(1003, 9002, '2021-09-01 12:01:01', '2021-09-01 12:31:51', 89),
(1004, 9002, '2021-09-01 19:01:01', '2021-09-01 19:30:01', 85),
(1005, 9001, '2021-09-01 12:01:01', '2021-09-01 12:31:02', 85),
(1006, 9001, '2021-09-07 10:01:01', '2021-09-07 10:12:01', 84),
(1003, 9001, '2021-09-08 12:01:01', '2021-09-08 12:11:01', 40),
(1003, 9002, '2021-09-01 14:01:01', null, null),
(1005, 9001, '2021-09-01 14:01:01', null, null),
(1003, 9003, '2021-09-08 15:01:01', null, null);
输出
9001|60|2021-09-01 06:00:00
思路
第一步:
1.找到每份试卷的ID、限制时间、发布时间以及每份试卷被完成的耗时、耗时降序排名和增序排名。这里的不用去重,也不用分组,因为每一份都要计算耗时:
2.试卷信息与做题信息分布在两个表中,因此要将两个表以exam_id连接。知识点:join...on...
3.只查询有提交时间的时间,筛掉没做完的试卷。知识点:where
4.试卷ID、限制时间、发布时间直接获取,
完成的耗时使用timestampdiff函数根据开始时间和提交时间计算分钟数差值。
知识点:timestampdiff()
5.利用分组聚合排名对每一种试卷的完成耗时分别进行增序排名和降序排名。
知识点:row_number() over partition by
第二步:
1.根据上面筛选出来的信息查询每份试卷的限制时间、发布时间及第二快与第二慢的差值:
2.每张不同的试卷都会有一个数据,因此以试卷ID作为分组。知识点:group by
3.每组试卷的ID、限制时间、发布时间都可以由table1直接查询到。
4.将每组试卷的完成时间累加,只有当最慢排名为2时才加正值,最快排名为2时加负值,其余情况加0.这样刚好是用时第二多减去用时第二少。
5.sum(case when rank1=2 then time
when rank2=2 then -time
else 0
end
) as sub
知识点:sum()、case when...then...when...then...else...end
题解
select
exam_id,duration,release_time
from (
select exam_id,duration,release_time,
sum(case when rank1=2 then time
when rank2=2 then -time
else 0
end
) as sub
from
(
select
info.exam_id,info.duration,info.release_time,
row_number()over(partition by record.exam_id order by timestampdiff(minute,start_time,submit_time)desc) as rank1,
row_number()over(partition by record.exam_id order by timestampdiff(minute,start_time,submit_time)) as rank2,
timestampdiff(minute,record.start_time,record.submit_time) as time
from examination_info info
inner join exam_record record
on info.exam_id=record.exam_id
where record.submit_time is not null
)t1
group by exam_id
)t2
where sub>=duration/2
order by exam_id desc
2.连续两次作答试卷的最大时间窗
现有试卷作答记录表exam_record(uid用户ID, exam_id试卷ID, start_time开始作答时间, submit_time交卷时间, score得分):
id | uid | exam_id | start_time | submit_time | score |
---|---|---|---|---|---|
1 | 1006 | 9003 | 2021-09-07 10:01:01 | 2021-09-07 10:21:02 | 84 |
2 | 1006 | 9001 | 2021-09-01 12:11:01 | 2021-09-01 12:31:01 | 89 |
3 | 1006 | 9002 | 2021-09-06 10:01:01 | 2021-09-06 10:21:01 | 81 |
4 | 1005 | 9002 | 2021-09-05 10:01:01 | 2021-09-05 10:21:01 | 81 |
5 | 1005 | 9001 | 2021-09-05 10:31:01 | 2021-09-05 10:51:01 | 81 |
请计算在2021年至少有两天作答过试卷的人中,计算该年连续两次作答试卷的最大时间窗days_window,那么根据该年的历史规律他在days_window天里平均会做多少套试卷,按最大时间窗和平均做答试卷套数倒序排序。由示例数据结果输出如下:
uid | days_window | avg_exam_cnt |
---|---|---|
1006 | 6 | 2.57 |
解释:用户1006分别在20210901、20210906、20210907作答过3次试卷,连续两次作答最大时间窗为6天(1号到6号),他1号到7号这7天里共做了3张试卷,平均每天3/7=0.428571张,那么6天里平均会做0.428571*6=2.57张试卷(保留两位小数);
用户1005在20210905做了两张试卷,但是只有一天的作答记录,过滤掉。
示例1
drop table if exists exam_record;
CREATE TABLE exam_record (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int NOT NULL COMMENT '用户ID',
exam_id int NOT NULL COMMENT '试卷ID',
start_time datetime NOT NULL COMMENT '开始时间',
submit_time datetime COMMENT '提交时间',
score tinyint COMMENT '得分'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO exam_record(uid,exam_id,start_time,submit_time,score) VALUES
(1006, 9003, '2021-09-07 10:01:01', '2021-09-07 10:21:02', 84),
(1006, 9001, '2021-09-01 12:11:01', '2021-09-01 12:31:01', 89),
(1006, 9002, '2021-09-06 10:01:01', '2021-09-06 10:21:01', 81),
(1005, 9002, '2021-09-05 10:01:01', '2021-09-05 10:21:01', 81),
(1005, 9001, '2021-09-05 10:31:01', '2021-09-05 10:51:01', 81),
(1001, 9003, '2021-09-01 09:01:01', '2021-09-01 09:51:11', 78),
(1001, 9002, '2021-09-01 09:01:01', '2021-09-01 09:31:00', 81),
(1002, 9002, '2021-09-01 12:01:01', '2021-09-01 12:31:01', 81),
(1003, 9003, '2021-09-01 19:01:01', '2021-09-01 19:59:01', 86),
(1004, 9002, '2021-09-01 19:01:01', '2021-09-01 19:10:01', 85),
(1005, 9003, '2021-09-01 12:01:01', '2021-09-01 12:31:02', 85),
(1003, 9003, '2021-09-07 12:01:01', '2021-09-07 12:11:01', 40),
(1003, 9003, '2021-09-08 15:01:01', null, null),
(1003, 9002, '2020-09-01 13:01:01', '2020-09-01 13:58:01', 81);
输出
1003 | 7 | 2.63 |
---|---|---|
1006 | 6 | 2.57 |
1005 | 5 | 3.00 |
思路
1.统计2021年每人总作答数、最早最晚相隔天数、最大连续作答间隔,生成子表
2.生成2021年每次作答试卷的下次作答时间,
筛选2021年的作答记录:WHERE YEAR(start_time)=2021
生成下次作答时间,按用户分区按作答时间升序:
lead(start_time) over(PARTITION BY uid ORDER BY start_time) as next_start_time
按用户分组:GROUP BY uid
统计此人作答的总试卷数:count(start_time) as exam_cnt
统计最早一次作答和最晚一次作答的相差天数:DATEDIFF(max(start_time), min(start_time))+1 as diff_days
统计两次作答的最大时间窗:max(DATEDIFF(next_start_time, start_time))+1 as days_window
筛选最早最晚相差天数大于1,即至少活跃两天的记录:WHERE diff_days > 1
计算平均能做多少套试卷:ROUND(days_window * exam_cnt / diff_days, 2) as avg_exam_cnt
了解lead()函数,以及实战(计算用户的平均次日留存率,拓展)
https://blog.csdn.net/weixin_44464850/article/details/124334316?spm=1001.2014.3001.5502
题解
select uid,days_window,round((days_count*days_window/avg_exam_cnt),2) avg_exam_cnt
from(
select
uid,
max(datediff(days2,days1)+1) as days_window ,
count(days1)as days_count,
datediff(max(days1),min(days1))+1 as avg_exam_cnt
from
(
select
uid,
exam_id,
date_format(start_time,"%Y-%m-%d") as days1,
lead(start_time,1)over (partition by uid order by start_time) as days2
from exam_record
where year(start_time)='2021'
)t
group by uid
)t1
ORDER BY days_window DESC, avg_exam_cnt DESC;
3.近三个月未完成试卷数为0的用户完成情况
现有试卷作答记录表exam_record(uid:用户ID, exam_id:试卷ID, start_time:开始作答时间, submit_time:交卷时间,为空的话则代表未完成, score:得分):
id | uid | exam_id | start_time | submit_time | score |
---|---|---|---|---|---|
1 | 1006 | 9003 | 2021-09-06 10:01:01 | 2021-09-06 10:21:02 | 84 |
2 | 1006 | 9001 | 2021-08-02 12:11:01 | 2021-08-02 12:31:01 | 89 |
3 | 1006 | 9002 | 2021-06-06 10:01:01 | 2021-06-06 10:21:01 | 81 |
4 | 1006 | 9002 | 2021-05-06 10:01:01 | 2021-05-06 10:21:01 | 81 |
5 | 1006 | 9001 | 2021-05-01 12:01:01 | (NULL) | (NULL) |
6 | 1001 | 9001 | 2021-09-05 10:31:01 | 2021-09-05 10:51:01 | 81 |
7 | 1001 | 9003 | 2021-08-01 09:01:01 | 2021-08-01 09:51:11 | 78 |
8 | 1001 | 9002 | 2021-07-01 09:01:01 | 2021-07-01 09:31:00 | 81 |
9 | 1001 | 9002 | 2021-07-01 12:01:01 | 2021-07-01 12:31:01 | 81 |
10 | 1001 | 9002 | 2021-07-01 12:01:01 | (NULL) | (NULL) |
找到每个人近三个有试卷作答记录的月份中没有试卷是未完成状态的用户的试卷作答完成数,按试卷完成数和用户ID降序排名。由示例数据结果输出如下:
uid | exam_complete_cnt |
---|---|
1006 | 3 |
解释:用户1006近三个有作答试卷的月份为202109、202108、202106,作答试卷数为3,全部完成;用户1001近三个有作答试卷的月份为202109、202108、202107,作答试卷数为5,完成试卷数为4,因为有未完成试卷,故过滤掉。
示例1
drop table if exists exam_record;
CREATE TABLE exam_record (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int NOT NULL COMMENT '用户ID',
exam_id int NOT NULL COMMENT '试卷ID',
start_time datetime NOT NULL COMMENT '开始时间',
submit_time datetime COMMENT '提交时间',
score tinyint COMMENT '得分'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO exam_record(uid,exam_id,start_time,submit_time,score) VALUES
(1006, 9003, '2021-09-06 10:01:01', '2021-09-06 10:21:02', 84),
(1006, 9001, '2021-08-02 12:11:01', '2021-08-02 12:31:01', 89),
(1006, 9002, '2021-06-06 10:01:01', '2021-06-06 10:21:01', 81),
(1006, 9002, '2021-05-06 10:01:01', '2021-05-06 10:21:01', 81),
(1006, 9001, '2021-05-01 12:01:01', null, null),
(1001, 9001, '2021-09-05 10:31:01', '2021-09-05 10:51:01', 81),
(1001, 9003, '2021-08-01 09:01:01', '2021-08-01 09:51:11', 78),
(1001, 9002, '2021-07-01 09:01:01', '2021-07-01 09:31:00', 81),
(1001, 9002, '2021-07-01 12:01:01', '2021-07-01 12:31:01', 81),
(1001, 9002, '2021-07-01 12:01:01', null, null);
输出
1006|3
备注
按试卷完成数和用户ID降序排名
思路
1.统计每人每次试卷作答记录的月份排名
2.生成当次作答的月份排名,按用户ID分区按年月排序进行窗口连续排名:
DENSE_RANK() OVER
(PARTITION BY uid ORDER BY DATE_FORMAT(start_time, "%Y%m") DESC) as month_rank
3.筛选月份排名小于等于3的记录(每个人近三个月的):WHERE start_month_rank <= 3
4.按用户ID分组:GROUP BY uid
5.筛选都完成了的分组(无未完成试卷):(count(1)不会除去为空的成绩,而是统计共有多少数据)
HAVING COUNT(1) = COUNT(score)
题解
select uid,count(score)as exam_complete_cnt
from
(
select
uid,
score,
start_time,
submit_time,
dense_rank()
over (partition by uid
order by date_format(start_time,"%Y%m")desc
)recent_month
from exam_record
)t1
where recent_month<=3
group by uid
having count(1)=count(score)
order by exam_complete_cnt desc,uid desc
4. 未完成率较高的50%用户近三个月答卷情况 (重点)
现有用户信息表user_info(uid用户ID,nick_name昵称, achievement成就值, level等级, job职业方向, register_time注册时间):
id | uid | nick_name | achievement | level | job | register_time |
---|---|---|---|---|---|---|
1 | 1001 | 牛客1号 | 3200 | 7 | 算法 | 2020-01-01 10:00:00 |
2 | 1002 | 牛客2号 | 2500 | 6 | 算法 | 2020-01-01 10:00:00 |
3 | 1003 | 牛客3号♂ | 2200 | 5 | 算法 | 2020-01-01 10:00:00 |
试卷信息表:
id | exam_id | tag | difficulty | duration | release_time |
---|---|---|---|---|---|
1 | 9001 | SQL | hard | 60 | 2020-01-01 10:00:00 |
2 | 9002 | SQL | hard | 80 | 2020-01-01 10:00:00 |
3 | 9003 | 算法 | hard | 80 | 2020-01-01 10:00:00 |
4 | 9004 | PYTHON | medium | 70 | 2020-01-01 10:00:00 |
试卷作答记录表:
id | uid | exam_id | start_time | submit_time | score |
---|---|---|---|---|---|
1 | 1001 | 9001 | 2020-01-01 09:01:01 | 2020-01-01 09:21:59 | 90 |
15 | 1002 | 9001 | 2020-01-01 18:01:01 | 2020-01-01 18:59:02 | 90 |
13 | 1001 | 9001 | 2020-01-02 10:01:01 | 2020-01-02 10:31:01 | 89 |
2 | 1002 | 9001 | 2020-01-20 10:01:01 | ||
3 | 1002 | 9001 | 2020-02-01 12:11:01 | ||
5 | 1001 | 9001 | 2020-03-01 12:01:01 | ||
6 | 1002 | 9001 | 2020-03-01 12:01:01 | 2020-03-01 12:41:01 | 90 |
4 | 1003 | 9001 | 2020-03-01 19:01:01 | ||
7 | 1002 | 9001 | 2020-05-02 19:01:01 | 2020-05-02 19:32:00 | 90 |
14 | 1001 | 9002 | 2020-01-01 12:11:01 | ||
8 | 1001 | 9002 | 2020-01-02 19:01:01 | 2020-01-02 19:59:01 | 69 |
9 | 1001 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:20:01 | 99 |
10 | 1002 | 9002 | 2020-02-02 12:01:01 | ||
11 | 1002 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:43:01 | 81 |
12 | 1002 | 9002 | 2020-03-02 12:11:01 | ||
17 | 1001 | 9002 | 2020-05-05 18:01:01 | ||
16 | 1002 | 9003 | 2020-05-06 12:01:01 |
请统计SQL试卷上未完成率较高的50%用户中,6级和7级用户在有试卷作答记录的近三个月中,
由示例数据结果输出如下:
uid | start_month | total_cnt | complete_cnt |
---|---|---|---|
1002 | 202002 | 3 | 1 |
1002 | 202003 | 2 | 1 |
1002 | 202005 | 2 | 1 |
解释:各个用户对SQL试卷的未完成数、作答总数、未完成率如下:
uid | incomplete_cnt | total_cnt | incomplete_rate |
---|---|---|---|
1001 | 3 | 7 | 0.4286 |
1002 | 4 | 8 | 0.5000 |
1003 | 1 | 1 | 1.0000 |
1001、1002、1003分别排在1.0、0.5、0.0的位置,因此较高的50%用户(排位<=0.5)为1002、1003;
1003不是6级或7级;
有试卷作答记录的近三个月为202005、202003、202002;
这三个月里1002的作答题数分别为3、2、2,完成数目分别为1、1、1。
示例1
drop table if exists examination_info,user_info,exam_record;
CREATE TABLE examination_info (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
exam_id int UNIQUE NOT NULL COMMENT '试卷ID',
tag varchar(32) COMMENT '类别标签',
difficulty varchar(8) COMMENT '难度',
duration int NOT NULL COMMENT '时长',
release_time datetime COMMENT '发布时间'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE user_info (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int UNIQUE NOT NULL COMMENT '用户ID',
`nick_name` varchar(64) COMMENT '昵称',
achievement int COMMENT '成就值',
level int COMMENT '用户等级',
job varchar(32) COMMENT '职业方向',
register_time datetime COMMENT '注册时间'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE exam_record (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int NOT NULL COMMENT '用户ID',
exam_id int NOT NULL COMMENT '试卷ID',
start_time datetime NOT NULL COMMENT '开始时间',
submit_time datetime COMMENT '提交时间',
score tinyint COMMENT '得分'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO user_info(uid,`nick_name`,achievement,`level`,job,register_time) VALUES
(1001, '牛客1', 3200, 7, '算法', '2020-01-01 10:00:00'),
(1002, '牛客2号', 2500, 6, '算法', '2020-01-01 10:00:00'),
(1003, '牛客3号♂', 2200, 5, '算法', '2020-01-01 10:00:00');
INSERT INTO examination_info(exam_id,tag,difficulty,duration,release_time) VALUES
(9001, 'SQL', 'hard', 60, '2020-01-01 10:00:00'),
(9002, 'SQL', 'hard', 80, '2020-01-01 10:00:00'),
(9003, '算法', 'hard', 80, '2020-01-01 10:00:00'),
(9004, 'PYTHON', 'medium', 70, '2020-01-01 10:00:00');
INSERT INTO exam_record(uid,exam_id,start_time,submit_time,score) VALUES
(1001, 9001, '2020-01-01 09:01:01', '2020-01-01 09:21:59', 90),
(1002, 9001, '2020-01-20 10:01:01', null, null),
(1002, 9001, '2020-02-01 12:11:01', null, null),
(1003, 9001, '2020-03-01 19:01:01', null, null),
(1001, 9001, '2020-03-01 12:01:01', null, null),
(1002, 9001, '2020-03-01 12:01:01', '2020-03-01 12:41:01', 90),
(1002, 9001, '2020-05-02 19:01:01', '2020-05-02 19:32:00', 90),
(1001, 9002, '2020-01-02 19:01:01', '2020-01-02 19:59:01', 69),
(1001, 9002, '2020-02-02 12:01:01', '2020-02-02 12:20:01', 99),
(1002, 9002, '2020-02-02 12:01:01', null, null),
(1002, 9002, '2020-02-02 12:01:01', '2020-02-02 12:43:01', 81),
(1002, 9002, '2020-03-02 12:11:01', null, null),
(1001, 9001, '2020-01-02 10:01:01', '2020-01-02 10:31:01', 89),
(1001, 9002, '2020-01-01 12:11:01', null, null),
(1002, 9001, '2020-01-01 18:01:01', '2020-01-01 18:59:02', 90),
(1002, 9003, '2020-05-06 12:01:01', null, null),
(1001, 9002, '2020-05-05 18:01:01', null, null);
输出1
1002|202002|3|1
1002|202003|2|1
1002|202005|2|1
示例2
drop table if exists examination_info,user_info,exam_record;
CREATE TABLE examination_info (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
exam_id int UNIQUE NOT NULL COMMENT '试卷ID',
tag varchar(32) COMMENT '类别标签',
difficulty varchar(8) COMMENT '难度',
duration int NOT NULL COMMENT '时长',
release_time datetime COMMENT '发布时间'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE user_info (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int UNIQUE NOT NULL COMMENT '用户ID',
`nick_name` varchar(64) COMMENT '昵称',
achievement int COMMENT '成就值',
level int COMMENT '用户等级',
job varchar(32) COMMENT '职业方向',
register_time datetime COMMENT '注册时间'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE exam_record (
id int PRIMARY KEY AUTO_INCREMENT COMMENT '自增ID',
uid int NOT NULL COMMENT '用户ID',
exam_id int NOT NULL COMMENT '试卷ID',
start_time datetime NOT NULL COMMENT '开始时间',
submit_time datetime COMMENT '提交时间',
score tinyint COMMENT '得分'
)CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO user_info(uid,`nick_name`,achievement,`level`,job,register_time) VALUES
(1001, '牛客1', 3200, 7, '算法', '2020-01-01 10:00:00'),
(1002, '牛客2号', 2500, 6, '算法', '2020-01-01 10:00:00'),
(1003, '牛客3号', 2500, 6, '算法', '2020-01-01 10:00:00'),
(1004, '牛客4号', 2300, 5, '算法', '2020-01-01 10:00:00');
INSERT INTO examination_info(exam_id,tag,difficulty,duration,release_time) VALUES
(9001, 'SQL', 'hard', 60, '2020-01-01 10:00:00'),
(9002, 'SQL', 'hard', 80, '2020-01-01 10:00:00'),
(9003, '算法', 'hard', 80, '2020-01-01 10:00:00'),
(9004, 'PYTHON', 'medium', 70, '2020-01-01 10:00:00');
INSERT INTO exam_record(uid,exam_id,start_time,submit_time,score) VALUES
(1001, 9001, '2020-01-01 09:01:01', '2020-01-01 09:21:59', 90),
(1002, 9001, '2020-01-20 10:01:01', null, null),
(1002, 9001, '2020-02-01 12:11:01', null, null),
(1003, 9001, '2020-03-01 19:01:01', null, null),
(1004, 9001, '2020-03-01 19:01:01', null, null),
(1001, 9001, '2020-03-01 12:01:01', null, null),
(1002, 9001, '2020-03-01 12:01:01', '2020-03-01 12:41:01', 90),
(1002, 9001, '2020-05-02 19:01:01', '2020-05-02 19:32:00', 90),
(1001, 9002, '2020-01-02 19:01:01', '2020-01-02 19:59:01', 69),
(1001, 9002, '2020-02-02 12:01:01', '2020-02-02 12:20:01', 99),
(1002, 9002, '2020-02-02 12:01:01', null, null),
(1002, 9002, '2020-02-02 12:01:01', '2020-02-02 12:43:01', 81),
(1002, 9002, '2020-03-02 12:11:01', null, null),
(1001, 9001, '2020-01-02 10:01:01', '2020-01-02 10:31:01', 89),
(1001, 9002, '2020-01-01 12:11:01', null, null),
(1002, 9001, '2020-01-01 18:01:01', '2020-01-01 18:59:02', 90),
(1002, 9003, '2020-05-06 12:01:01', null, null),
(1001, 9002, '2020-05-05 18:01:01', null, null);
预期输出
1003| 202003 |1 | 0
思路
1.统计每次试卷作答记录的月份编号,按年月降序进行窗口连续排名:
dense_rank() over ( PARTITION BY uid ORDER BY DATE_FORMAT( start_time, '%Y%m' ) DESC ) months_rank
2.筛选月份排名小于等于3的记录(近三个月的):WHERE start_month_rank <= 3
3.筛选SQL试卷上未完成率较高的50%用户:
(1)计算每个用户在SQL试卷上的未完成率百分比排名:
(2)计算每个用户在SQL试卷上的未完成率:
(3)筛选SQL试卷
4.统计百分比排名,按未完成率降序uid升序进行百分比排名:
PERCENT_RANK() over(ORDER BY incomplete_rate DESC, uid) as incomp_rate_rank
5.筛选未完成率较高的50%用户(因为按未完成率倒序了,所以<=):WHERE incomp_rate_rank <= 0.5
6.筛选6、7级用户
7.按用户ID、月份分组:GROUP BY uid, start_month
解题流程
1.各个用户对SQL试卷的未完成数、作答总数、未完成率如下
SELECT
uid,
sum(IF(score IS NULL,1,0)) AS incomplete_cnt,
count( uid ) AS total_cnt,
sum(IF(score IS NULL,1,0))/ count( uid ) AS incomplete_rate
FROM examination_info
INNER JOIN exam_record USING ( exam_id )
WHERE tag = 'SQL'
GROUP BY uid
2.未完成率 和 筛选6、7级用户
SELECT
uid
FROM
user_info
INNER JOIN (
SELECT
uid,
PERCENT_RANK () OVER (
ORDER BY
incomplete_rate DESC,
uid
) AS rank1
FROM
(
SELECT
uid,
sum(IF(score IS NULL, 1, 0)) AS incomplete_cnt,
count(uid) AS total_cnt,
sum(IF(score IS NULL, 1, 0)) / count(uid) AS incomplete_rate
FROM
examination_info
INNER JOIN exam_record USING (exam_id)
WHERE
tag = 'SQL'
GROUP BY
uid
) t1
) t2 USING (uid)
WHERE
`level` >= 6
AND rank1 <= 0.5
3.试卷作答记录的近三个月
SELECT
*
FROM
(
SELECT
uid,
DATE_FORMAT( start_time, '%Y%m' ) start_month,
dense_rank()
over(PARTITION BY uid
ORDER BY DATE_FORMAT( start_time, '%Y%m') DESC
)months_rank,
start_time,
score
FROM
exam_record
) t_record
WHERE
months_rank <= 3
题解
方法一:
SELECT
uid,
start_month,
count(start_time) total_cnt,
count(score) complete_cnt
FROM
(
SELECT *
FROM
(
SELECT
uid,
DATE_FORMAT(start_time, '%Y%m') start_month,
dense_rank()over (PARTITION BY uid ORDER BY
DATE_FORMAT(start_time, '%Y%m') DESC) months_rank,
start_time,
score
FROM
exam_record
) t_record
WHERE
months_rank <= 3
AND uid IN (
SELECT
uid
FROM
user_info
INNER JOIN (
SELECT
uid,
PERCENT_RANK()OVER(ORDER BY incomplete_rate DESC,uid) AS rank1
FROM
(
SELECT
uid,
sum(IF(score IS NULL, 1, 0)) AS incomplete_cnt,
count(uid) AS total_cnt,
sum(IF(score IS NULL, 1, 0)) / count(uid) AS incomplete_rate
FROM
examination_info
INNER JOIN exam_record USING (exam_id)
WHERE tag = 'SQL'
GROUP BY uid
) t1
) t2 USING (uid)
WHERE`level` >= 6 AND rank1 <= 0.5
)
) t3
GROUP BY
uid,start_month
ORDER BY
uid,start_month
方式二:
SELECT
uid,
start_month,
COUNT(1) as exam_cnt,
COUNT(submit_time) as complete_cnt
FROM
(
SELECT
uid,
submit_time,
DATE_FORMAT(start_time, "%Y%m") as start_month,
DENSE_RANK()
over(ORDER BY DATE_FORMAT(start_time, "%Y%m")DESC) as start_month_rank -- 按作答月份降序编号
FROM exam_record
) as t_exam_record_month_rank
WHERE start_month_rank <= 3
AND uid IN (
SELECT uid
FROM (
SELECT uid,
PERCENT_RANK() over(order BY incomp_rate DESC, uid) as incomp_rate_rank
FROM
(
SELECT
uid,
1 - COUNT(submit_time)/COUNT(1) as incomp_rate -- 此人未完成率
FROM exam_record
WHERE exam_id in
(SELECT exam_id FROM examination_info WHERE tag='SQL')
GROUP BY uid
) as t_exam_incom_rate -- 在SQL试卷上的未完成率
)as t_exam_incom_rate_rank -- 在SQL试卷上的未完成率百分比排名
WHERE incomp_rate_rank <= 0.5
) -- 在SQL试卷上未完成率最高的50%用户
AND uid IN (SELECT uid FROM user_info WHERE `level`>=6)
GROUP BY uid, start_month
ORDER BY uid, start_month;
拓展
PERCENT_RANK()函数计算分区或结果集中行的百分位数排名
这PERCENT_RANK()是一个窗口函数,
用于计算分区或结果集中行的百分位数。
以下显示了PERCENT_RANK()函数的语法:
PERCENT_RANK()
OVER (
PARTITION BY expr,...
ORDER BY expr [ASC|DESC],...
)
PERCENT_RANK()函数返回一个从0到1的数字。
对于指定的行,PERCENT_RANK()计算行的等级减1,
除以评估的分区或查询结果集中的行数减1:
(rank - 1) / (total_rows - 1)
在此公式中,rank是指定行的等级,total_rows是要计算的行数。
PERCENT_RANK()对于分区或结果集中的第一行,
函数始终返回零。重复的列值将接收相同的PERCENT_RANK()值。
与其他窗口函数类似,PARTITION BY子句将行分配到分区中,
ORDER BY子句指定每个分区中行的逻辑顺序。
PERCENT_RANK()为每个有序分区独立计算函数。
两个PARTITION BY和ORDER BY子句都是可选项。
但是,它PERCENT_RANK()是一个顺序敏感函数,
因此,您应始终使用ORDER BY子句。
MySQL PERCENT_RANK() 函数示例
让我们创建一个名为新表productLineSales
基础上的orders
,orderDetails
以及products
从表中示例数据库:
CREATE TABLE productLineSales
SELECT
productLine,
YEAR(orderDate) orderYear,
quantityOrdered * priceEach orderValue
FROM
orderDetails
INNER JOIN
orders USING (orderNumber)
INNER JOIN
products USING (productCode)
GROUP BY
productLine ,
YEAR(orderDate);
productLineSales
表存储销售数据的摘要,包括产品系列,订单年份和订单值。
+------------------+-----------+------------+
| productLine | orderYear | orderValue |
+------------------+-----------+------------+
| Vintage Cars | 2013 | 4080.00 |
| Classic Cars | 2013 | 5571.80 |
| Trucks and Buses | 2013 | 3284.28 |
| Trains | 2013 | 2770.95 |
| Ships | 2013 | 5072.71 |
| Planes | 2013 | 4825.44 |
| Motorcycles | 2013 | 2440.50 |
| Classic Cars | 2014 | 8124.98 |
| Vintage Cars | 2014 | 2819.28 |
| Trains | 2014 | 4646.88 |
| Ships | 2014 | 4301.15 |
| Planes | 2014 | 2857.35 |
| Motorcycles | 2014 | 2598.77 |
| Trucks and Buses | 2014 | 4615.64 |
| Motorcycles | 2015 | 4004.88 |
| Classic Cars | 2015 | 5971.35 |
| Vintage Cars | 2015 | 5346.50 |
| Trucks and Buses | 2015 | 6295.03 |
| Trains | 2015 | 1603.20 |
| Ships | 2015 | 3774.00 |
| Planes | 2015 | 4018.00 |
+------------------+-----------+------------+
21 rows in set (0.02 sec)
PERCENT_RANK()
在查询结果集上使用MySQL
以下查询按订单值查找每个产品系列的百分位数排名:
WITH t AS (
SELECT
productLine,
SUM(orderValue) orderValue
FROM
productLineSales
GROUP BY
productLine
)
SELECT
productLine,
orderValue,
ROUND(
PERCENT_RANK() OVER (
ORDER BY orderValue
)
,2) percentile_rank
FROM
t;
在这个例子中:
- 首先,我们使用公用表表达式按产品线汇总订单值。
- 其次,我们用它
PERCENT_RANK()
来计算每种产品的订单价值的百分等级。此外,我们使用ROUND()
函数将值舍入为2十进制,以获得更好的表示。
这是输出:
+------------------+------------+-----------------+
| productLine | orderValue | percentile_rank |
+------------------+------------+-----------------+
| Trains | 9021.03 | 0.00 |
| Motorcycles | 9044.15 | 0.17 |
| Planes | 11700.79 | 0.33 |
| Vintage Cars | 12245.78 | 0.50 |
| Ships | 13147.86 | 0.67 |
| Trucks and Buses | 14194.95 | 0.83 |
| Classic Cars | 19668.13 | 1.00 |
+------------------+------------+-----------------+
7 rows in set (0.01 sec)
以下是输出中的一些分析:
- 订单价值
Trains
并不比任何其他产品线更好,后者用零表示。 Vintage Cars
表现优于50%的其他产品。Classic Cars
表现优于任何其他产品系列,因此其百分比等级为1或100%
percent_rank()
在分区上使用MySQL
以下语句按年度中的订单值返回产品系列的百分位数排名:
SELECT
productLine,
orderYear,
orderValue,
ROUND(
PERCENT_RANK()
OVER (
PARTITION BY orderYear
ORDER BY orderValue
),2) percentile_rank
FROM
productLineSales;
这是输出:
+------------------+-----------+------------+-----------------+
| productLine | orderYear | orderValue | percentile_rank |
+------------------+-----------+------------+-----------------+
| Motorcycles | 2013 | 2440.50 | 0.00 |
| Trains | 2013 | 2770.95 | 0.17 |
| Trucks and Buses | 2013 | 3284.28 | 0.33 |
| Vintage Cars | 2013 | 4080.00 | 0.50 |
| Planes | 2013 | 4825.44 | 0.67 |
| Ships | 2013 | 5072.71 | 0.83 |
| Classic Cars | 2013 | 5571.80 | 1.00 |
| Motorcycles | 2014 | 2598.77 | 0.00 |
| Vintage Cars | 2014 | 2819.28 | 0.17 |
| Planes | 2014 | 2857.35 | 0.33 |
| Ships | 2014 | 4301.15 | 0.50 |
| Trucks and Buses | 2014 | 4615.64 | 0.67 |
| Trains | 2014 | 4646.88 | 0.83 |
| Classic Cars | 2014 | 8124.98 | 1.00 |
| Trains | 2015 | 1603.20 | 0.00 |
| Ships | 2015 | 3774.00 | 0.17 |
| Motorcycles | 2015 | 4004.88 | 0.33 |
| Planes | 2015 | 4018.00 | 0.50 |
| Vintage Cars | 2015 | 5346.50 | 0.67 |
| Classic Cars | 2015 | 5971.35 | 0.83 |
| Trucks and Buses | 2015 | 6295.03 | 1.00 |
+------------------+-----------+------------+-----------------+
21 rows in set (0.01 sec)
在此示例中,我们按订单年划分产品线的订单值。将PERCENT_RANK()
然后被施加到每个分区。
例如,2013年Vintage Cars
表现优于其他产品线的50%,而在2014年,船舶表现优于其他产品的50%。
在本教程中,您学习了如何使用MySQL PERCENT_RANK()
函数计算分区或结果集中行的百分位数。