目录
group by:
数据分析经常会使用到分组汇总的方法。通过汇总的方法可以更快速地看到数据的特征。
例如查看男女生的比例
spark/presto/hive/mysql:
select
student_sex,
count(distinct student_id) as stu_num
from student_info
group by student_sex
或者
spark/presto/hive/mysql:
select
student_sex,
count(distinct student_id) as stu_num
from student_info
group by 1
又如想查看男女生不同年龄阶段的学生数
spark/presto/hive/mysql:
select
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex,student_age
此时计算逻辑是想对学生性别进行分类,再对学生年龄分类
想查看男女生不同年龄阶段的年龄最大和最小分别是多少
spark/presto/hive/mysql:
select
student_sex,
max(student_age) as max_age,
min(student_age) as min_age
from student_info
group by student_sex
grouping sets():
grouping sets()分层聚合
如果你不仅想看各维度的分组求和,还想看总的求和,那么久可以用grouping sets()
spark:
select
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by
grouping sets((student_sex,student_age))
--相当于
/*
select
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex,student_age
union all
select
null as student_sex,
null as student_age,
count(distinct student_id) as stu_num
from student_info
*/
group by ...with rollup:
group by ...with rollup 不仅可以对字段汇总,并且还可以聚合
例如我们想看到学生的不同性别的人数以及不同性别不同年龄的
spark:
select
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex,student_age with rollup
/*相当于使用grouping by
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
grouping by ((student_sex,student_age),(student_sex))
*/
/* 也相当于我们这么写
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex,student_age
union all
select
student_sex,
null as student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex
union all
select
null student_sex,
null as student_age,
count(distinct student_id) as stu_num
from student_info
*/
with cube:
spark:
select
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex,student_age with cube
/*相当于
select
student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_sex,student_age with rollup
union all
select
null as student_sex,
student_age,
count(distinct student_id) as stu_num
from student_info
group by student_age
with cube 和 with rollup 的区别在于 with cube 比with rollup 在多个字段汇总时 数据条数更多;
因为with rollup 是跟接近的字段往后结和汇总的,但是with cube 相当于是排列组合的方式,将所有结果一一汇总出来。
rollup()
presto:
select
day,
count(distinct stu_id) as stu_num
group by
rollup (day)