SQL进阶day9————聚合与分组-CSDN博客

本文链接：https://blog.csdn.net/qq_47966193/article/details/139418763

1聚合函数

1.1SQL类别高难度试卷得分的截断平均值

1聚合函数

1.1SQL类别高难度试卷得分的截断平均值

我的错误代码：截断平均值是有专门的函数吗？

select tag,difficulty,avg(score) clip_avg_score
from examination_info ei join exam_record
using(id)
group by tag
where tag = 'SQL' and difficulty='hard'
and score not in (max(score),min(score))

我的思路改正：用（全部值 - 最大值 - 最小值） / (总数-2) ，但是缺点就是，如果最大值和最小值有多个，这个方法就很难筛选出来

SELECT ei.tag,ei.difficulty,
       ROUND((SUM(er.score)-MIN(er.score)-MAX(er.score)) / (COUNT(er.score)-2),1) AS clip_avg_score
FROM examination_info ei join exam_record er
on ei.exam_id = er.exam_id
where ei.tag = "SQL"
AND ei.difficulty = "hard";

标准正确代码：

使用in子句将最大值和最小值排除掉，再求平均值

懒人写法，可以用with...as句式将要多次使用的表只写1次即可（WITH AS 语法是MySQL中的一种临时结果集，它可以在SELECT、INSERT、UPDATE或DELETE语句中使用。通过使用WITH AS语句，可以将一个查询的结果存储在一个临时表中，然后在后续的查询中引用这个临时表。这样可以简化复杂的查询，提高代码的可读性和可维护性。但是不知道哪个MySQL版本开始支持with...as句式的，我的本地电脑里面是Navicat 15 for MySQL，不支持）
用union把max和min的结果集中在一行当中，这样形成一列多行的效果，不用多写一次代码

# t1筛选出SQL高难度的数据
WITH t1 as(
    SELECT er.*,ei.tag,ei.difficulty
    FROM exam_record er INNER JOIN
    examination_info ei
    ON er.exam_id = ei.exam_id
    WHERE tag = "SQL" and difficulty = "hard"
)

# 在t1的基础上计算均值
SELECT tag,difficulty,round(avg(score),1)
FROM t1

# 用in子句将最大值和最小值排除掉，再求平均值 not in
WHERE score not in (
    SELECT max(score)
    FROM t1
    UNION
    SELECT min(score)
    FROM t1
)

Q：为什么这里where换成and也不报错，因为前面有on？那么where和on有啥区别呢，可以只有一个吗？

（1）where和having是在临时表生产之后，对临时表中的数据进行过滤用的。

如SQL语句：select * form tab1 left join tab2 on (tab1.size = tab2.size) where tab2.name=’AAA’

（2） on是在生成中临时表之前就去作用的，它会在数据源那里就把不符合要求的数据给过滤掉，即是先把不符合条件的记录过滤后才进行统计，它就可以减少中间运算要处理的数据所以on运行的速度最快。

如SQL语句：select * form tab1 left join tab2 on (tab1.size = tab2.size and tab2.name=’AAA’)

（3）在两个表联接时才用on的，所以在一个表的时候，就剩下where跟having比较了。在这单表查询统计的情况下，如果要过滤的条件没有涉及到要计算字段，那它们的结果是一样的，但是where可以使用rushmore技术，而having就不能，在速度上后者要慢。

（4）如果要涉及到计算的字段，where的作用时间是在计算之前就完成的，而having就是在计算后才起作用的，所以在这种情况下，两者的结果会不同。

（5）在多表联接查询时，on比where更早起作用。系统首先由on根据各个表之间的联接条件，把多个表合成一个临时表后，再由where进行过滤，然后再计算，计算完后再由having进行过滤。

1.2统计作答次数

我的报错代码：求已完成的试卷数时应该要分组一下，exam_id

select count(er.id) total_pv,
count(er.submit_time) complete_pv,
count(t2.exam_id) complete_exam_cnt
from exam_record er,(select count(exam_id) from exam_record 
group by exam_id) t2

正确代码1：

主要在于已完成的试卷数的统计，因为这个带有条件，且需要统计聚合结果，可以使用 聚合函数与case when 结合。count中是可以加条件的

select 
count(*)  total_pv,
count(score)  complete_pv,
count(distinct case when score is null then null else exam_id end)  complete_exam_cnt
from exam_record

复习case when：

（1）case expr when v1 then r1 [when v2 then r2] ...[else rn] end

例如：case 2 when 1 then 'one' when 2 then 'two' else 'more' end 返回two

case后面的值为2，与第二条分支语句when后面的值相等相等，因此返回two

（2）case when v1 then r1 [when v2 then r2]...[else rn] end

例如：case when 1<0 then 'T' else 'F' end返回F

1<0的结果为false，因此函数返回值为else后面的F

正确代码2：

select count(*) as total_pv,
count(score) as complete_pv,
count(distinct exam_id,score IS NOT NULL or null) as complete_exam_cnt
# 是逗号，连接不是and连接
from exam_record

在select和count后面都可以加条件的，但是要明白内核：

count(distinct exam_id,score IS NOT NULL or null) as complete_exam_cnt （正确）

不能是

count(distinct exam_id and score IS NOT NULL or null) as complete_exam_cnt （错误，结果永远为2，这个题只是碰巧结果为2，改一个数据就不对了）

1 用and：

（1）一般在where后筛选过滤，还是得到的满足条件的score

（2）如果在select后直接加条件判断：这里的score is not null 是判断

符合条件的返回 true ,即为1
不符合的返回 false ,即为0

（3）加上exam_id，进行and逻辑运算

exam_id 本身为值，可以理解为真在 and 逻辑下，所以上一步的1，0并不会变化，后面加上or NULL，否则会把0也计算上。

（4）在上一步的基础上去重，则只会剩下1和0

经过and运算之后，只剩下一列数据，多行1和0
distinct 去重后，就只剩下两行数据 1 和 0

（5）所以这时候再进行count计算，结果恒为2 （两行数据）

2 正确答案的执行逻辑：

（1）用，连接（从之前的逻辑判断，变为多列组合）

（2）这时候distinct 去重后，就不是只剩下两行数据 1 和 0，而是会把score为null也会考虑进去。

（3）结果应该是3，如果没有or NULL，就是5行了（null为0会被计数）

1.3 得分不小于平均分的最低分

我的代码：where后面的条件错了，但是思路大概这样。

select score min_score_over_avg 
from exam_record er join examination_info ei
on er.id = ei.id
group by exam_id
where score>=avg(score) and ei.tag = 'SQL'
order by score asc
limit 1

修改我的代码：

select er.score min_score_over_avg 
from exam_record er 
left join examination_info ei
on er.exam_id = ei.exam_id # 不是按照id连接
where  ei.tag = 'SQL' 
and score>= (SELECT avg(er.score) from exam_record er
left join examination_info ei
on er.exam_id=ei.exam_id
where ei.tag='SQL')
order by score asc
limit 1

（1）表连接是按照exam_id ，不是按照id连接

（2）score>=某个值，这里不能直接score>=avg(score)，而是应该通过表查询返回得到avg(score)，然后在进行比较。

改进我的代码：

这里有表查询的部分重复了两次，可以用with...as句式将要多次使用的表命名，这样可以只写一次，多次调用。

此外，order by score asc limit 1 可以换为min函数。

with t as
(SELECT score from exam_record er
left join examination_info ei
on er.exam_id=ei.exam_id
where ei.tag='SQL')

select min(score) min_score_over_avg 
from t 
where score>= 
(SELECT avg(t.score) from t)

2 分组查询

2.1平均活跃天数和月活人数

我的代码：此处活跃指有交卷行为，用户平均月活跃天数avg_active_days啥意思？

with t as
(select * 
from exam_record
where year(start_time)=2021)

select month(start_time) 'month',
count(submit_time) mau
from t
group by month(submit_time)

正确代码：

select DATE_FORMAT(start_time,"%Y%m") as month,
round(count(distinct uid,date_format(start_time,"%Y%m%d"))/count(distinct uid),2) as avg_active_days,
count(distinct uid) as mau
from exam_record
where submit_time is not NULL
and YEAR(submit_time) =2021
group by month;

（1）202107是用date_format函数：DATE_FORMAT(start_time,"%Y%m") as month

（2）主要难的一点是天数的计算。

到底是count(distinct uid,date_format(start_time,"%Y%m%d"))

还是count(start_time)作为分子呢

关键是理解题目的意思是：天数。

假设一个uid 比如1001在2021-07-06这一天有二个记录，如果是count(start_time)那么就是天数是2，但是如果是count(distinct uid,date_format(start_time,"%Y%m%d"))天数就是1了

复盘探索：

（1）先找出2021年，活跃的用户ID和时间（具体到哪天）

（2）如果不考虑uid，直接按照活跃时间去重，那么不同用户在同一天活跃记录会被去重到只剩下1条，

（3）同理，如果只安装用户ID去重，那么同一用户在不同天的记录也会被去重到只有1条。这里查询的实际是月活跃的用户有哪些。

（4）所以要去重的目的是，同一个用户，在同一天，重复提交活跃多次的记录。（因为这里是按天算，同一天同一个用户只算一次。）

（5）用户平均月活跃天数=月活跃天数 /月活跃用户

：count(distinct uid,date_format(start_time,"%Y%m%d"))/count(distinct uid)

月活跃天数：

月活跃用户：

2.2 月总刷题数和日均刷题数

我的代码：分组好像报错，后面那个求总数我也不知道咋整

分组报错问题：MySQL提供了any_value()函数来抑制ONLY_FULL_GROUP_BY值被拒绝

select 
date_format(submit_time,'%y%m') submit_month,
count(score) month_q_cnt,
count(score)/day(month(submit_time)) avg_day_q_cnt
from practice_record
group by date_format(submit_time,'%y%m%d') 
having year(submit_time)=2021

（1）当月天数求错了，我是想先求出当前月，再求出当月天：这样操作结果是不对的。

应该用last_day函数求出最后一天，然后用day函数求出这个日期的天数。

复习【日期时间函数】

● year(date)——获取年的函数

● month(date)——获取月的函数

● day(date)——获取日的函数

● date_add(date,interval expr type)——对指定起始时间进行加操作

● date_sub(date,interval expr type)——对指定起始时间进行减操作

● datediff(date1,date2)——计算两个日期之间间隔的天数

● date_format(date,format)——将日期和时间格式化

代码改正：

select 
date_format(submit_time,'%y%m') submit_month,
any_value(count(score)) month_q_cnt,
any_value(round(count(score)/day(last_day(submit_time)),3)) avg_day_q_cnt
from practice_record
where year(submit_time)='2021'
# where date_format(submit_time,'%y')='2021'
group by submit_month

该年的总体情况，可以用union all来连接，完整代码：

select date_format(submit_time,'%Y%m') submit_month,
any_value(count(question_id)) month_q_cnt,
any_value(round(count(question_id)/day(LAST_DAY(submit_time)),3)) avg_day_q_cnt 
from practice_record
where date_format(submit_time,'%Y')='2021'
group by submit_month
union all
select '2021汇总' as submit_month,
count(question_id) month_q_cnt,
round(count(id)/31,3) avg_day_q_cnt
from practice_record
where date_format(submit_time,'%Y')='2021'
order by submit_month;

复习：

1、区别1：取结果的交集

1）union: 对两个结果集进行并集操作, 不包括重复行,相当于distinct, 同时进行默认规则的排序;

2）union all: 对两个结果集进行并集操作, 包括重复行, 即所有的结果全部显示, 不管是不是重复;

2、区别2：获取结果后的操作

1）union: 会对获取的结果进行排序操作

2）union all: 不会对获取的结果进行排序操作

3、总结

union all只是合并查询结果，并不会进行去重和排序操作，在没有去重的前提下，使用union all的执行效率要比union高。

2.3未完成试卷数大于1的有效用户

我的代码：罗里吧嗦答案还不对。。

with t as
(select uid,er.exam_id,start_time,submit_time,tag
from exam_record er , examination_info ei
where er.exam_id=ei.exam_id
and date_format(start_time,'%Y')='2021')

select uid,
(select count(submit_time) from t
where submit_time is NULL) incomplete_cnt,
(select count(submit_time) from t
where submit_time is not NULL) complete_cnt
from t
where (select count(submit_time) from t
where submit_time is NULL)<5 

and (select count(submit_time) from t
where submit_time is not NULL)>1
order by incomplete_cnt

我的代码改正：

select uid,
sum(case when submit_time is NULL then 1 else 0 end ) incomplete_cnt,
sum(case when submit_time is NULL then 0 else 1 end ) complete_cnt
from exam_record er join examination_info ei
on er.exam_id=ei.exam_id
where date_format(start_time,'%Y')='2021'
group by uid
having incomplete_cnt >1 and incomplete_cnt<5 and complete_cnt>1
order by incomplete_cnt

接下来是detail，作答过的试卷tag集合，是提交日期：类型；一直重复显示

我的完整代码：

select uid,
sum(case when submit_time is NULL then 1 else 0 end ) incomplete_cnt,
sum(case when submit_time is NULL then 0 else 1 end ) complete_cnt,
group_concat(DISTINCT concat_ws(':',date_format(start_time,"%Y-%m-%d"),tag) order by start_time Separator ';') detail

from exam_record er join examination_info ei
on er.exam_id=ei.exam_id
where date_format(start_time,'%Y')='2021'
group by uid
having incomplete_cnt >1 and incomplete_cnt<5 and complete_cnt>1
order by incomplete_cnt desc

注意：select后面的属性，不管计算了多长，每个之间都要有逗号！！！

大佬代码：

SELECT uid,
SUM(CASE WHEN submit_time IS NULL THEN 1 ELSE 0 END) "incomplete_cnt",
SUM(CASE WHEN submit_time IS NULL THEN 0 ELSE 1 END) "complete_cnt",
group_concat(distinct concat_ws(':',date(start_time),tag) 
order by start_time separator ';') as detail
FROM exam_record er INNER JOIN
examination_info ei 
ON er.exam_id = ei.exam_id
WHERE year(start_time) = 2021
GROUP BY uid
HAVING complete_cnt >= 1 AND incomplete_cnt > 1 AND
incomplete_cnt < 5
ORDER BY incomplete_cnt desc

其中

（1）用sum和case when函数来求完成和未完成的试卷数

（2）detail的实现是用concat_ws或者concat函数将submit_time和tag连接并且同时distinct：

函数group_concat([DISTINCT] 要连接的字段 [Order BY ASC/DESC 排序字段] [Separator'分隔符'])

concat()函数