有一张hive表,表名stdent_score,包含两列,分别是学生姓名name(类型string),
学生成绩score(类型map<string,int>或者 array<string>) ,
成绩列中key是课程名称,例如语文、数学等,value是对应课程分数(0-100),
请用一个hql求一下每个学生成绩最好的课程及分数、最差的课程及分数、平均分数
表名称:student_score
字段:
name,score
数据格式:
huangbo yuwen:80,shuxue:89,yingyu:95
xuzheng yuwen:70,shuxue:65,yingyu:81
wangbaoqiang yuwen:75,shuxue:100,yingyu:76
map类型做法:
1)建表
create table student_score(name string,score map<string,int>)
row format delimited fields terminated by "\t"
collection items terminated by ","
map keys terminated by ":";
存储结果:
+---------------------+----------------------------------------+
| student_score.name | student_score.score |
+---------------------+----------------------------------------+
| huangbo | {"yuwen":80,"shuxue":89,"yingyu":95} |
| xuzheng | {"yuwen":70,"shuxue":65,"yingyu":81} |
| wangbaoqiang | {"yuwen":75,"shuxue":100,"yingyu":76} |
+---------------------+----------------------------------------+
2)使用explode炸裂为多行:
set hive.mapred.mode=nonstrict; 笛卡尔积
使用explode炸裂为多行
select name,sc_name,sc_value
from student_score lateral view explode(score) sc_map as sc_name,sc_value;
+---------------+----------+-----------+
| name | sc_name | sc_value |
+---------------+----------+-----------+
| huangbo | yuwen | 80 |
| huangbo | shuxue | 89 |
| huangbo | yingyu | 95 |
| xuzheng | yuwen | 70 |
| xuzheng | shuxue | 65 |
| xuzheng | yingyu | 81 |
| wangbaoqiang | yuwen | 75 |
| wangbaoqiang | shuxue | 100 |
| wangbaoqiang | yingyu | 76 |
+---------------+----------+-----------+
3)使用开窗函数求最大、最小、平均
select distinct(a.name),
first_value(a.sc_name) over(partition by name order by sc_value desc) as max_name,
max(a.sc_value) over(partition by name ) as max_v,
first_value(a.sc_name) over(partition by name order by sc_value) as min_name,
min(a.sc_value) over(partition by name ) as min_v,
avg(sc_value) over(partition by name) as avg_v
from(
select name,sc_name,sc_value
from student_score lateral view explode(score) sc_map as sc_name,sc_value
) a;
+---------------+-----------+--------+-----------+--------+--------------------+
| a.name | max_name | max_v | min_name | min_v | avg_v |
+---------------+-----------+--------+-----------+--------+--------------------+
| huangbo | yingyu | 95 | yuwen | 80 | 88.0 |
| wangbaoqiang | shuxue | 100 | yuwen | 75 | 83.66666666666667 |
| xuzheng | yingyu | 81 | shuxue | 65 | 72.0 |
+---------------+-----------+--------+-----------+--------+--------------------+
另外见到别人非常取巧的一种做法:
成绩与科目拼接,成绩在前,自然排序;但是成绩中存在100,就不灵了!
select
name,
split(max(score), '-')[1] max_name,
split(max(score), '-')[0] max_score,
split(min(score), '-')[1] min_name,
split(min(score), '-')[0] min_score,
avg(split(score, '-')[0]) avg
from (
select name,concat(sc_value, '-', sc_name) score
from student_score lateral view explode(score) sc_map as sc_name,sc_value
) a
group by name;
100被算为最小值,这是这种写法需要注意的地方。
+---------------+-----------+------------+-----------+------------+--------------------+
| name | max_name | max_score | min_name | min_score | avg |
+---------------+-----------+------------+-----------+------------+--------------------+
| huangbo | yingyu | 95 | yuwen | 80 | 88.0 |
| wangbaoqiang | yingyu | 76 | shuxue | 100 | 83.66666666666667 |
| xuzheng | yingyu | 81 | shuxue | 65 | 72.0 |
+---------------+-----------+------------+-----------+------------+--------------------+
array类型做法:
1)建表 (仓库文件路径location与上个表student_score保持一致,不用再导入数据)
create table student_score2(name string,score array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ","
location "/user/hive/warehouse/mianshi.db/student_score";
数据格式:
+----------------------+----------------------------------------+
| student_score2.name | student_score2.score |
+----------------------+----------------------------------------+
| huangbo | ["yuwen:80","shuxue:89","yingyu:95"] |
| xuzheng | ["yuwen:70","shuxue:65","yingyu:81"] |
| wangbaoqiang | ["yuwen:75","shuxue:100","yingyu:76"] |
+----------------------+----------------------------------------+
2)使用explode炸裂为多行:增加sc_value列,方便排序与统计。
set hive.mapred.mode=nonstrict; 笛卡尔积
select name, scores ,cast(split(scores,":")[1] as int) as sc_value
from student_score2 lateral view explode(score) sc_map as scores
效果:
+---------------+-------------+-----------+
| name | scores | sc_value |
+---------------+-------------+-----------+
| huangbo | yuwen:80 | 80 |
| huangbo | shuxue:89 | 89 |
| huangbo | yingyu:95 | 95 |
| xuzheng | yuwen:70 | 70 |
| xuzheng | shuxue:65 | 65 |
| xuzheng | yingyu:81 | 81 |
| wangbaoqiang | yuwen:75 | 75 |
| wangbaoqiang | shuxue:100 | 100 |
| wangbaoqiang | yingyu:76 | 76 |
+---------------+-------------+-----------+
3)使用开窗函数求最大、最小、平均
select distinct(a.name),
first_value(a.scores) over(partition by name order by sc_value desc) as max_v,
first_value(a.scores) over(partition by name order by sc_value) as min_v,
avg(sc_value) over(distribute by name) as avg_v
from(
select name, scores ,cast(split(scores,":")[1] as int) as sc_value
from student_score2 lateral view explode(score) sc_map as scores
) a;
结果:
+---------------+-------------+------------+--------------------+
| a.name | max_v | min_v | avg_v |
+---------------+-------------+------------+--------------------+
| huangbo | yingyu:95 | yuwen:80 | 88.0 |
| wangbaoqiang | shuxue:100 | yuwen:75 | 83.66666666666667 |
| xuzheng | yingyu:81 | shuxue:65 | 72.0 |
+---------------+-------------+------------+--------------------+