场景举例
北京市学生成绩分析
成绩的数据格式
exercise5_1.txt 文件中的每一行就是一个学生的成绩信息。字段之间的分隔符是","
时间,学校,年纪,姓名,科目,成绩
样例数据
2013,北大,1,黄渤,语文,97 2013,北大,1,徐峥,语文,52 2013,北大,1,刘德华,语文,85 2012,清华,0,马云,英语,61 2015,北理工,3,李彦宏,物理,81 2016,北科,4,马化腾,化学,92 2014,北航,2,刘强东,数学,70 2012,清华,0,刘诗诗,英语,59 2014,北航,2,刘亦菲,数学,49 2014,北航,2,刘嘉玲,数学,77
建表导数据
-
create
database
if
not
exists exercise;
-
use exercise;
-
drop
table
if
exists exercise5_1;
-
create
table exercise5_1(
year
int, school
string, grade
int,
name
string, course
string, score
int)
row
format
delimited
fields
terminated
by
',';
-
load
data
local inpath
"/home/hadoop/exercise5_1.txt"
into
table exercise5_1;
-
select *
from exercise5_1;
-
desc exercise5_1;
需求
1、分组TopN,选出今年每个学校、每个年级、分数前三的科目
-
select t.*
-
from
-
(
select school, grade, course, score,
-
row_number()
over (
partition
by school, grade, course
order
by score
desc) rank_code
-
from exercise5_1
-
where
year =
"2017"
-
) t
-
where t.rank_code <=
3;
详解如下:
row_number函数:row_number() 按指定的列进行分组生成行序列,从 1 开始,如果两行记录的分组列相同,则行序列 +1。
over 函数:是一个窗口函数。
over (order by score) 按照 score 排序进行累计,order by 是个默认的开窗函数。
over (partition by grade) 按照班级分区。
over (partition by grade order by score) 按照班级分区,并按着分数排序。
over (order by score range between 2 preceding and 2 following) 窗口范围为当前行的数据幅度减2加2后的范围内的数据求和。
2、今年,北航,每个班级,每科的分数,及分数上下浮动2分的总和
-
select school, grade, course, score,
-
sum(score)
over (
order
by score
range
between
2
preceding
and
2
following) sscore
-
from exercise5_1
-
where
year =
"2017"
and school=
"北航";
3、where与having:今年,清华1年级,总成绩大于200分的学生以及学生数
-
select school, grade,
name,
sum(score)
as total_score,
-
count(
1)
over (
partition
by school, grade) nct
-
from exercise5_1
-
where
year =
"2017"
and school=
"清华"
and grade =
1
-
group
by school, grade,
name
-
having total_score >
200;
having 是分组(group by)后的筛选条件,分组后的数据组内再筛选,也就是说 HAVING 子句可以让我们筛选成组后的各组数据。
where 则是在分组,聚合前先筛选记录。也就是说作用在 GROUP BY 子句和 HAVING 子句前。