《Hive高级查询》作业答案
一、每个店铺top3(分组求topN)
有50w个京东店铺,每个顾客访问任何一个店铺的任何一个商品时,都会产生一条访问日志,访问日志存储的表名为visit,访客用户id为user_id,被访问的店铺名称为shop。请统计:
数据:
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 a
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
需求:
1.每个店铺UV(访客数)
2.每个店铺访问次数top3的访客信息。输出店铺名、访客id、访问次数。
建表:
create table visit(user_id string,shop string) row format delimited fields terminated by ' ';
加载数据:
load data local inpath '/opt/soft/files/jd.txt' into table visit;
需求1:
方法1:
select shop,count(distinct user_id) UV from visit group by shop;
方法2:
select shop,count(*) UV from (select shop,user_id from visit group by shop,user_id)t1 group by shop;
需求2:
select shop, user_id, ct from (select shop, user_id, ct, row_number() over(partition by shop order by ct desc) rk from (select shop, user_id, count(*) ct from visit group by shop,user_id)t1) t2 where rk<=3;
二、求月销售额和总销售额
1、数据说明
(1)数据格式
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
(2)字段含义
店铺,月份,金额
3、需求
编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额
2、数据准备
(1)创建数据库表t_store
create table t_store(
name string,
months int,money int
)
row format delimited fields terminated by ",";
(2)导入数据
load data local inpath "/home/hadoop/store.txt" into table t_store;
with
r1 as
(
select
name,
months,
sum(money) money
from t_store
group by name,months
)
select
name,
months,
sum(money) over(partition by name,months order by months),
sum(money) over(partition by name order by months rows between unbounded preceding and current row)
from r1;
三、自关联练习
name,month,pv
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11
数据的字段意义是:
用户,月份,访问量
需求:
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数。
参考答案:可以参考上一题的解题思考,使用窗口函数。也可以使用自连接。
具体的最终实现的SQL:
select a.name as aname, a.month as amonth, a.pv as apv,
max(b.pv) as maxpv, sum(b.pv) as sumpv
from
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) a
join
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) b
on a.name = b.name
where a.month >= b.month
group by a.name, a.month, a.pv;
实现思路:
第一步:由于每个用户在每个月份有多条数据访问记录,所以根据题意,首先得汇总每个用户在每个月份的总访问次数
SQL实现:
select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month;
结果数据:
A 2015-01 33
A 2015-02 10
A 2015-03 38
B 2015-01 30
B 2015-02 15
B 2015-03 44
第二步:由于要求得这种格式的数据:
用户 月份 当月访问次数 最大访问次数 总访问次数
A 2015-01 33 33 33
A 2015-02 10 33 43
A 2015-03 38 38 81
B 2015-01 30 30 30
B 2015-02 15 30 45
B 2015-03 44 44 89
需要如下这种格式的数据才能求出:
A 2015-01 33 A 2015-01 33
A 2015-02 10 A 2015-01 33
A 2015-02 10 A 2015-02 10
A 2015-03 38 A 2015-01 33
A 2015-03 38 A 2015-02 10
A 2015-03 38 A 2015-03 38
B 2015-01 30 B 2015-01 30
B 2015-02 15 B 2015-01 30
B 2015-02 15 B 2015-02 15
B 2015-03 44 B 2015-01 30
B 2015-03 44 B 2015-02 15
B 2015-03 44 B 2015-03 44
那如何得到这样的数据呢?执行如下的SQL:
select a.name as aname, a.month as amonth, a.pv as apv,
b.name as bname, b.month as bmonth, b.pv as bpv
from
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) a
join
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) b
on a.name = b.name
where a.month >= b.month;
第三步:在得到上述数据的基础之上,然后直接进行聚合即可
SQL实现:
select a.aname, a.amonth, a.apv, max(a.bpv) as maxpv, sum(a.bpv) as sumpv
from
(
select a.name as aname, a.month as amonth, a.pv as apv,
b.name as bname, b.month as bmonth, b.pv as bpv
from
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) a
join
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) b
on a.name = b.name
where a.month >= b.month
) a
group by a.aname, a.amonth, a.apv;
第四步:获得执行结果:
A 2015-01 33 33 33
A 2015-02 10 33 43
A 2015-03 38 38 81
B 2015-01 30 30 30
B 2015-02 15 30 45
B 2015-03 44 44 89
第五步:对于SQL语句进行一些优化得到最终SQL:
select a.name as aname, a.month as amonth, a.pv as apv,
max(b.pv) as maxpv, sum(b.pv) as sumpv
from
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) a
join
(select a.name, a.month, sum(a.pv) as pv from exercise01 a group by a.name, a.month) b
on a.name = b.name
where a.month >= b.month
group by a.name, a.month, a.pv;
四、综合练习
至少完成其中的十题。
1、查询"01"课程比"02"课程成绩高的学生的信息及课程分数:
select student.s_id,student.s_name,s1_score,s2_score from (
select s_id,s_score as s1_score from score where c_id = 01
)s1
join(
select s_id,s_score as s2_score from score where c_id = 02
)s2 on s1.s_id = s2.s_id
left join student on s1.s_id = student.s_id
where s1.s1_score > s2.s2_score
2、查询"01"课程比"02"课程成绩低的学生的信息及课程分数:
select student.s_id,student.s_name,s1_score,s2_score from (
select s_id,s_score as s1_score from score where c_id = 01
)s1
join(
select s_id,s_score as s2_score from score where c_id = 02
)s2 on s1.s_id = s2.s_id
left join student on s1.s_id = student.s_id
where s1.s1_score < s2.s2_score
3、查询平均成绩大于等于60分的同学的学生编号和学生姓名和平均成绩:
select score.s_id,student.s_name,avg(score.s_score) from score
left join student on score.s_id = student.s_id
group by score.s_id,student.s_name
having avg(score.s_score) >= 60
4、查询平均成绩小于60分的同学的学生编号和学生姓名和平均成绩: (包括有成绩的和无成绩的)
select * from (
select student.s_id,student.s_name,
case when avg(score.s_score) is null then 0
else avg(score.s_score)
end as avgscore
from student
left join score on student.s_id = score.s_id
group by student.s_id,student.s_name
)ss where ss.avgscore < 60
5、查询所有同学的学生编号、学生姓名、选课总数、所有课程的总成绩:
select student.s_id,student.s_name,
count(score.s_score) as countscore,
sum(score.s_score) as total from student
left join score on student.s_id = score.s_id
group by student.s_id,student.s_name
6、查询"李"姓老师的数量:
select count(1) from teacher where t_name like '李%'
7、查询学过"张三"老师授课的同学的信息:
select student.* from student
join score on student.s_id = score.s_id
join course on course.c_id = score.c_id
join teacher on teacher.t_id = course.t_id
where teacher.t_name = '张三';
8、查询没学过"张三"老师授课的同学的信息:
select * from student where s_id not in (
select student.s_id from student
join score on student.s_id = score.s_id
join course on course.c_id = score.c_id
join teacher on teacher.t_id = course.t_id
where teacher.t_name = '张三'
);
select student.* from student
join teacher on teacher.t_name = '张三'
join course on course.c_id = teacher.t_id
left join score on student.s_id = score.s_id and score.c_id = course.c_id
where 1=1
and score.s_score is null;
9、查询学过编号为"01"并且也学过编号为"02"的课程的同学的信息:
select student.* from (
select s_id,s_score as s1_score from score where c_id = 01
)s1
join(
select s_id,s_score as s2_score from score where c_id = 02
)s2 on s1.s_id = s2.s_id
left join student on s1.s_id = student.s_id
10、查询学过编号为"01"但是没有学过编号为"02"的课程的同学的信息:
select student.* from (
select s_id,s_score as s1_score from score where c_id = 01
)s1
left join(
select s_id,s_score as s2_score from score where c_id = 02
)s2 on s1.s_id = s2.s_id
left join student on s1.s_id = student.s_id
where s2.s_id is null
11、查询没有学全所有课程的同学的信息:
select distinct stu.*
from student stu
join course a
left join score b on b.s_id = stu.s_id and b.c_id = a.c_id
where b.s_score is null
;
12、查询至少有一门课与学号为"01"的同学所学相同的同学的信息:
select distinct stu.*
from student stu
join score a on a.s_id = stu.s_id
where stu.s_id <> '01' and a.c_id in (
select c_id from score where s_id = '01')
;
13、查询和"01"号的同学学习的课程完全相同的其他同学的信息:
select student.*,tmp1.course_id
from student
join (
select s_id ,concat_ws('|', collect_set(c_id)) course_id
from score
where s_id <> '01'
group by s_id)tmp1
on student.s_id = tmp1.s_id
join (
select concat_ws('|', collect_set(c_id)) course_id2
from score
where s_id='01' and c_id <> '01')tmp2
on tmp1.course_id = tmp2.course_id2;
14、查询没学过"张三"老师讲授的任一门课程的学生姓名:
select *
from student stu
where stu.s_id not in (
select c.s_id
from course a,teacher b,score c
where b.t_name = '张三' and a.t_id = b.t_id and c.s_id = stu.s_id and c.c_id = a.c_id
);
15、查询两门及其以上不及格课程的同学的学号,姓名及其平均成绩:
select
*
from student stu
join score a on a.s_id = stu.s_id
where a.s_score < 60;
16、检索"01"课程分数小于60,按分数降序排列的学生信息:
select stu.*
from student stu
join score a on a.s_id = stu.s_id
where a.c_id = '01' and a.s_score < 60
order by a.s_score desc;
17、按平均成绩从高到低显示所有学生的所有课程的成绩以及平均成绩:
select * from
score left join
(
select s_id,avg(score.s_score) as avgscore
from score
group by s_id
order by avgscore
)ss on ss.s_id = score.s_id;
18.查询各科成绩最高分、最低分和平均分:以如下形式显示:课程ID,课程name,最高分,最低分,平均分,及格率,中等率,优良率,优秀率:
select c_id,
course.c_course,
max(s_score),
min(s_score),
round(avg(s_score),2),
round(avg(jige),2),
round(avg(zhongdeng),2),
round(avg(youliang),2),
round(avg(youxiu),2)
from (
select score.*,
case when score.s_score >= 60 then 1 else 0 end as jige,
case when score.s_score >= 70 and score.s_score < 80 then 1 else 0 end as zhongdeng,
case when score.s_score >= 80 and score.s_score < 90 then 1 else 0 end as youliang,
case when score.s_score >= 90 then 1 else 0 end as youxiu
from score
)score
left join course on score.c_id = course.c_id
group by score.c_id,course.c_course
19、按各科成绩进行排序,并显示排名:– row_number() over()分组排序功能
select *,
row_number() over(distribute by c_id sort by s_score desc) as rn
from score ;
20、查询学生的总成绩并进行排名:
select s_id,sum(s_score) as sumScores
from score
group by s_id
order by sumScores desc;
21、查询不同老师所教不同课程平均分从高到低显示:
select t_id,a.c_id,round(avg(s_score),2) as avgscore
from score a
join course b on b.c_id = a.c_id
group by t_id,a.c_id
order by t_id,avgscore desc ;
22、查询所有课程的成绩第2名到第3名的学生信息及该课程成绩:
select student.,ss.s_score from (
select score.,
row_number() over(distribute by c_id sort by s_score desc) as rn
from score
)ss
left join student on ss.s_id = student.s_id
where ss.rn between 2 and 3;
23、统计各科成绩各分数段人数:课程编号,课程名称,[100-85],[85-70],[70-60],[0-60]及所占百分比
select c_id,
round(sum(case when s_score >= 85 then 1 else 0 end)/count(1),2) as 85score,
round(sum(case when s_score between 70 and 84 then 1 else 0 end)/count(1),2) as 70score,
round(sum(case when s_score between 60 and 69 then 1 else 0 end)/count(1),2) as 60score,
round(sum(case when s_score < 60 then 1 else 0 end)/count(1),2) as 59score,
count(1) as totalstu
from score
group by c_id
;
24、查询学生平均成绩及其名次:
select *,
row_number() over(sort by a.avgscore desc) as rm
from (
select
s_id,
round(avg(s_score),2) as avgscore
from score
group by s_id ) a;
25、查询各科成绩前三名的记录三个语句
select * from
(
select *,
row_number() over(distribute by c_id sort by s_score desc) as rm,
rank() over(distribute by c_id sort by s_score desc) as rk,
dense_rank() over(distribute by c_id sort by s_score desc) as drk
from score
) a
where a.rm < 4;
26、查询每门课程被选修的学生数:
select c_id,count(1) as totalstu
from score
group by c_id;
27、查询出只有两门课程的全部学生的学号和姓名:
select stu.s_id,stu.s_name
from score
left join student stu on stu.s_id = score.s_id
group by stu.s_id,stu.s_name
having count(1) =2;
28、查询男生、女生人数:
select s_sex,count(1) as totalstu
from student
group by s_sex
;
29、查询名字中含有"风"字的学生信息:
select *
from student
where s_name like ‘%风%’;
30、查询同名同性学生名单,并统计同名人数:
select
s_name,
s_sex,
count(1) as totalstu
from
student
group by s_name,s_sex
having totalstu >1;