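Problem:
There is a Hive table named student_score with two columns: the student's name, name (type string), and the student's scores, score (type map<string,int>). In the score column each key is a course name, such as Chinese or math, and each value is the score for that course (0-100). Write a single SQL statement that returns, for each student, the best course and its score, the worst course and its score, and the average score.
1. Create mock data and save it in a file named text.txt
The data format is: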
name1 math:98,chinese:80,english:78
name2 math:71,chinese:60,english:100
....
Note: the name and the list of subject scores are separated by a tab character.
2. Create the corresponding table in Hive
create table student_score(
name string,
score map<string,int>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
Load the data into the table
load data local inpath '/root/project/text.txt' into table student_score;
select * from student_score;
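As a quick sanity check that the delimiters split each line as intended, a query along the following lines can help. It is only a sketch: it assumes the sample rows above were loaded, and it relies on the built-in size, map_keys, and map_values functions plus bracket lookup on the map column.

```sql
-- Sanity check: confirm that each line was parsed into a name plus a map.
select name,
       size(score)       as subject_count,  -- number of entries in the map
       map_keys(score)   as subjects,       -- array of course names
       map_values(score) as scores,         -- array of scores
       score['math']     as math_score      -- NULL if the key is missing
from student_score;
```

If the score column comes back NULL and the whole line ends up in name, the file was most likely not tab-separated.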
3. Run the SQL statement:
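-- For each student: best subject and score, worst subject and score, average score.
select name,
       -- the row ranked 1 holds the highest score; non-matching rows contribute '' and disappear in concat_ws
       concat_ws('',collect_set(if(score_order=1,concat(subject,'_',achievement),''))) as max_subject_and_score,
       -- the row whose rank equals the subject count holds the lowest score
       concat_ws('',collect_set(if(score_order=total_subject,concat(subject,'_',achievement),''))) as min_subject_and_score,
       round(avg(achievement),2) as subject_avg
from
(
    -- explode the map into one row per (name, subject, achievement)
    select name,achievement,subject,
           -- rank each student's subjects from highest to lowest score
           row_number() over(distribute by name sort by achievement desc) as score_order,
           -- number of subjects the student has
           count(1) over(partition by name) as total_subject
    from student_score lateral view explode(score) sc as subject,achievement
)t1
group by name;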
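Note: if the number of subjects per student is known in advance, the count(1) window function can be replaced with that constant, which reduces the number of jobs and speeds up the query.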
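A minimal sketch of that simplification, assuming every student has exactly 3 subjects (true for the sample data, but an assumption in general):

```sql
-- Variant that assumes exactly 3 subjects per student, so the
-- count(1) window function is dropped and the constant 3 is used instead.
select name,
       concat_ws('',collect_set(if(score_order=1,concat(subject,'_',achievement),''))) as max_subject_and_score,
       concat_ws('',collect_set(if(score_order=3,concat(subject,'_',achievement),''))) as min_subject_and_score,
       round(avg(achievement),2) as subject_avg
from
(
    select name,achievement,subject,
           row_number() over(distribute by name sort by achievement desc) as score_order
    from student_score lateral view explode(score) sc as subject,achievement
)t1
group by name;
```

If a student has fewer than 3 subjects, min_subject_and_score comes back empty, and with more than 3 it no longer points at the lowest score, so this form is only safe when the subject count really is fixed.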
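I tested this and it works. Before writing it I searched online for existing solutions; some of them miss certain cases, and others are overly complicated, with many join operations. In the end I worked this one out myself.
The Hive functions used above are all documented online, so I will not list them one by one.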