SQL 最大连续回答问题的天数大于等于n天的用户及其等级
题目描述
现有某乎问答创作者信息表author_tb如下(其中author_id表示创作者编号、author_level表示创作者级别,共1-6六个级别、sex表示创作者性别):
author_id | author_level | sex |
---|---|---|
101 | 6 | m |
102 | 1 | f |
103 | 1 | m |
104 | 3 | m |
105 | 4 | f |
106 | 2 | f |
107 | 2 | m |
108 | 5 | f |
109 | 6 | f |
110 | 5 | m |
创作者回答情况表answer_tb如下(其中answer_date表示创作日期、author_id指创作者编号、issue_id指回答问题编号、char_len表示回答字数):
answer_date | author_id | issue_id | char_len |
---|---|---|---|
2021-11-01 | 101 | E001 | 150 |
2021-11-01 | 101 | E002 | 200 |
2021-11-01 | 102 | C003 | 50 |
2021-11-01 | 103 | P001 | 35 |
2021-11-01 | 104 | C003 | 120 |
2021-11-01 | 105 | P001 | 125 |
2021-11-01 | 102 | P002 | 105 |
2021-11-02 | 101 | P001 | 201 |
2021-11-02 | 110 | C002 | 200 |
2021-11-02 | 110 | C001 | 225 |
2021-11-02 | 110 | C002 | 220 |
2021-11-03 | 101 | C002 | 180 |
2021-11-04 | 109 | E003 | 130 |
2021-11-04 | 109 | E001 | 123 |
2021-11-05 | 108 | C001 | 160 |
2021-11-05 | 108 | C002 | 120 |
2021-11-05 | 110 | P001 | 180 |
2021-11-05 | 106 | P002 | 45 |
2021-11-05 | 107 | E003 | 56 |
请你统计最大连续回答问题的天数大于等于3天的用户及其等级(若有多条符合条件的数据,按author_id升序排序),以上例子的输出结果如下:
author_id | author_level | days_cnt |
---|---|---|
101 | 6 | 3 |
解决
- 对每个用户的回答日期去重、连续排序
# drop table if exists tmp_days;
# create temporary table tmp_days
with tmp_days as(
select distinct a.author_id, a.answer_date,
dense_rank() over(partition by author_id order by answer_date) day_rank
from answer_tb a
)
author_id | answer_date | day_rank |
---|---|---|
101 | 2021-11-01 | 1 |
101 | 2021-11-02 | 2 |
101 | 2021-11-03 | 3 |
102 | 2021-11-01 | 1 |
103 | 2021-11-01 | 1 |
104 | 2021-11-01 | 1 |
105 | 2021-11-01 | 1 |
106 | 2021-11-05 | 1 |
107 | 2021-11-05 | 1 |
108 | 2021-11-05 | 1 |
109 | 2021-11-04 | 1 |
110 | 2021-11-02 | 1 |
110 | 2021-11-05 | 2 |
注意:
- 由于表变量实际上使用了临时表,增加了额外的I/O开销,造成性能的损失。因此不太适合数据量大且频繁查询的情况。为此,在SQL Server 2005中提供了WITH公用表表达式(CTE)。不需要显式定义删除,效率更高。
CTEs are similar to tables created with CREATE [TEMPORARY] TABLE but need not be defined or dropped explicitly. For a CTE, you need no privileges to create tables.- 上面用的排序函数为 dense_rank(),而不是row_number()。
这是因为使用 row_number() over (partition by * order by *) 时,order by 的字段最好是唯一的、不为空的,如果重复或为空,排序可能会失败,按照 sql 的默认排序去取值。
author_id | answer_date | day_rank |
---|---|---|
101 | 2021-11-01 | 1 |
101 | 2021-11-01 | 2 |
101 | 2021-11-02 | 3 |
101 | 2021-11-03 | 4 |
102 | 2021-11-01 | 1 |
102 | 2021-11-01 | 2 |
103 | 2021-11-01 | 1 |
104 | 2021-11-01 | 1 |
105 | 2021-11-01 | 1 |
106 | 2021-11-05 | 1 |
107 | 2021-11-05 | 1 |
108 | 2021-11-05 | 1 |
108 | 2021-11-05 | 2 |
109 | 2021-11-04 | 1 |
109 | 2021-11-04 | 2 |
110 | 2021-11-02 | 1 |
110 | 2021-11-02 | 2 |
110 | 2021-11-02 | 3 |
110 | 2021-11-05 | 4 |
如果要使用row_number(),应该对answer_date列去重后再使用,SQL如下。
select a.author_id, a.answer_date, row_number() over(partition by author_id order by answer_date asc) day_rank from ( select distinct a.author_id, a.answer_date from answer_tb a order by author_id asc, answer_date asc ) a
- 对回答日期与日期排序相减,如果日期连续差值应该相同(eg: [1,2,3]-[7,8,9]=[-6,-6,-6])
# drop table if exists tmp
# create temporary table tmp
with tmp as(
select author_id,answer_date, day_rank, date_sub(answer_date, interval day_rank day) p_day
from tmp_days
)
author_id | answer_date | day_rank | p_day |
---|---|---|---|
101 | 2021-11-01 | 1 | 2021-10-31 |
101 | 2021-11-02 | 2 | 2021-10-31 |
101 | 2021-11-03 | 3 | 2021-10-31 |
102 | 2021-11-01 | 1 | 2021-10-31 |
103 | 2021-11-01 | 1 | 2021-10-31 |
104 | 2021-11-01 | 1 | 2021-10-31 |
105 | 2021-11-01 | 1 | 2021-10-31 |
106 | 2021-11-05 | 1 | 2021-11-04 |
107 | 2021-11-05 | 1 | 2021-11-04 |
108 | 2021-11-05 | 1 | 2021-11-04 |
109 | 2021-11-04 | 1 | 2021-11-03 |
110 | 2021-11-02 | 1 | 2021-11-01 |
110 | 2021-11-05 | 2 | 2021-11-03 |
- 统计连续的天数
# drop table if exists t
# create temporary table t
with t as(
select author_id,p_day, count(p_day) days_cnt
from tmp
group by author_id,p_day
# order by days_cnt desc;
)
author_id | p_day | days_cnt |
---|---|---|
101 | 2021-10-31 | 3 |
eg*101 | 2021-12-01 | 5 |
102 | 2021-10-31 | 1 |
103 | 2021-10-31 | 1 |
104 | 2021-10-31 | 1 |
105 | 2021-10-31 | 1 |
106 | 2021-11-04 | 1 |
107 | 2021-11-04 | 1 |
eg*107 | 2021-12-01 | 3 |
108 | 2021-11-04 | 1 |
109 | 2021-11-03 | 1 |
110 | 2021-11-01 | 1 |
110 | 2021-11-03 | 1 |
- 统计最大连续天数大于3的用户及其等级
select distinct t.author_id, l.author_level,
max(t.days_cnt) over(partition by author_id) days_cnt
from t
left join author_tb l
on l.author_id = t.author_id
where days_cnt >= 3
多个连续天数举例
author_id | author_level | days_cnt |
---|---|---|
eg*101 | 6 | 5 |
eg*107 | 2 | 3 |
原答案
author_id | author_level | days_cnt |
---|---|---|
101 | 6 | 3 |
完整代码
select distinct t.author_id, l.author_level,
max(t.days_cnt) over(partition by author_id) as days_cnt
from (
select author_id,p_day, count(p_day) days_cnt
from (
select author_id,answer_date, day_rank,
date_sub(answer_date, interval day_rank day) p_day from (
select distinct a.author_id, a.answer_date,
dense_rank() over(partition by author_id order by answer_date) day_rank
from answer_tb a
) tmp_days
) tmp
group by author_id,p_day
) t
left join author_tb l
on l.author_id = t.author_id
where days_cnt >= 3
使用WITH,可读性更强
with tmp_days as (
select distinct a.author_id, a.answer_date,
dense_rank() over(partition by author_id order by answer_date) day_rank
from answer_tb a
),tmp as(
select author_id,answer_date, day_rank,
date_sub(answer_date, interval day_rank day) p_day
from tmp_days
),t as (
select author_id,p_day, count(p_day) days_cnt
from tmp
group by author_id,p_day
)
select distinct t.author_id, l.author_level, max(t.days_cnt) over(partition by author_id) days_cnt
from t
left join author_tb l
on l.author_id = t.author_id
where days_cnt >= 3
相关链接