hive求成绩分析

最新推荐文章于 2024-01-05 17:51:52 发布

尬聊码农

最新推荐文章于 2024-01-05 17:51:52 发布

阅读量765

点赞数

分类专栏： hadoop离线数仓

本文链接：https://blog.csdn.net/weixin_43562705/article/details/97417699

版权

hadoop离线数仓专栏收录该内容

26 篇文章 0 订阅 ¥29.90 ¥99.00

订阅专栏

场景举例

北京市学生成绩分析

成绩的数据格式

exercise5_1.txt 文件中的每一行就是一个学生的成绩信息。字段之间的分隔符是","

时间,学校,年纪,姓名,科目,成绩

样例数据


 
 
   
   
    
    
   
   
   
   
    
    
     
     2013,北大,1,黄渤,语文,97
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2013,北大,1,徐峥,语文,52
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2013,北大,1,刘德华,语文,85
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2012,清华,0,马云,英语,61
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2015,北理工,3,李彦宏,物理,81
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2016,北科,4,马化腾,化学,92
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2014,北航,2,刘强东,数学,70
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2012,清华,0,刘诗诗,英语,59
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2014,北航,2,刘亦菲,数学,49
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     2014,北航,2,刘嘉玲,数学,77

建表导数据


 
 
   
   
    
    
   
   
   
   
    
    
     
     create 
     
     database 
     
     if 
     
     not 
     
     exists exercise;
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     use exercise;
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     drop 
     
     table 
     
     if 
     
     exists exercise5_1;
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     create 
     
     table exercise5_1(
     
     year 
     
     int, school 
     
     string, grade 
     
     int, 
     
     name 
     
     string, course 
     
     string, score 
     
     int) 
     
     row 
     
     format 
     
     delimited 
     
     fields 
     
     terminated 
     
     by 
     
     ',';
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     load 
     
     data 
     
     local inpath 
     
     "/home/hadoop/exercise5_1.txt" 
     
     into 
     
     table exercise5_1;
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     select * 
     
     from exercise5_1;
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     desc exercise5_1;

需求

1、分组TopN，选出今年每个学校、每个年级、分数前三的科目


 
 
   
   
    
    
   
   
   
   
    
    
     
     select t.* 
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     from 
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     (
     
     select school, grade, course, score,
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     row_number() 
     
     over (
     
     partition 
     
     by school, grade, course 
     
     order 
     
     by score 
     
     desc) rank_code 
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     from exercise5_1 
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     where 
     
     year = 
     
     "2017"
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     ) t
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     where t.rank_code <= 
     
     3;

详解如下：
row_number函数：row_number() 按指定的列进行分组生成行序列，从 1 开始，如果两行记录的分组列相同，则行序列 +1。
over 函数：是一个窗口函数。
over (order by score) 按照 score 排序进行累计，order by 是个默认的开窗函数。
over (partition by grade) 按照班级分区。
over (partition by grade order by score) 按照班级分区，并按着分数排序。
over (order by score range between 2 preceding and 2 following) 窗口范围为当前行的数据幅度减2加2后的范围内的数据求和。

2、今年，北航，每个班级，每科的分数，及分数上下浮动2分的总和


 
 
   
   
    
    
   
   
   
   
    
    
     
     select school, grade, course, score,
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     sum(score) 
     
     over (
     
     order 
     
     by score 
     
     range 
     
     between 
     
     2 
     
     preceding 
     
     and 
     
     2 
     
     following) sscore
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     from exercise5_1 
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     where 
     
     year = 
     
     "2017" 
     
     and school=
     
     "北航";

3、where与having：今年，清华1年级，总成绩大于200分的学生以及学生数


 
 
   
   
    
    
   
   
   
   
    
    
     
     select school, grade, 
     
     name, 
     
     sum(score) 
     
     as total_score,
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     count(
     
     1) 
     
     over (
     
     partition 
     
     by school, grade) nct
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     from exercise5_1
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     where 
     
     year = 
     
     "2017" 
     
     and school=
     
     "清华" 
     
     and grade = 
     
     1
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     group 
     
     by school, grade, 
     
     name
    
    
   
   

   
   
    
    
   
   
   
   
    
    
     
     having total_score > 
     
     200;