题目和数据均来自于网络,本篇博客用来记录自己写的答案和解析
求单月访问次数和总访问次数
用户名,访问日期,访问次数
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,1
create external table test1(name string,month string,count int)
row format delimited fields terminated by ',';
load data local inpath '/usr/local/data/test1.txt' into table test1;
首先针对name和month进行去重,相同name和month的count进行求和
select name,
month,
sum(count)
from test1
group by name,month;
然后求出,每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
select name,
month,
max(sumCount) over (partition by name order by month) as maxMonth,
sum(sumCount) over (partition by name order by month) as sumMonth,
sumCount
from (select name,
month,
sum(count) as sumCount
from test1
group by name,month)t
order by name,month;
这里要注意,max,sum等聚合函数,加上order by后就是累计计算,不加order by就是在分区内全局计算。比如这里的sum(),如果后面是over (partition by name),没有order by,那么结果是按照name相同的一组sumCount进行全局求和
举一下例子
select name,
month,
max(sumCount) over (partition by name ) as maxMonth,
sum(sumCount) over (partition by name ) as sumMonth,
sumCount
from (select name,
month,
sum(count) as sumCount
from test1
group by name,month)t
order by name,month;
结果为
A 2015-01 38 81 33
A 2015-02 38 81 10
A 2015-03 38 81 38
B 2015-01 34 79 30
B 2015-02 34 79 15
B 2015-03 34 79 34
这是直接按照name分组后,对一组的count进行求和
正确的sql语句为
select name,
month,
max(sumCount) over (partition by name order by month) as maxMonth,
sum(sumCount) over (partition by name order by month) as sumMonth,
sumCount
from (select name,
month,
sum(count) as sumCount
from test1
group by name,month)t
order by name,month;
结果为
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 34 79 34
学生课程成绩
id,学号sid,课程名course,成绩score
1 1 yuwen 43
2 1 shuxue 55
3 2 yuwen 77
4 2 shuxue 88
5 3 yuwen 98
6 3 shuxue 65
求 所有数学课程成绩 大于 语文课程成绩的学生的学号
select a.sid
from (select sid,
course,
score
from course
where course='yuwen'
) a
left join
(select sid,
course,
score
from course
where course='shuxue') b
on a.sid = b.sid
where a.score < b.score;
结果为
1
2
每一年出现过的最大气温的日期+温度
2014010216
2014010410
2012010609
2012010812
2012011023
2001010212
2001010411
2013010619
2013010812
2013011023
2008010216
2008010414
2007010619
2007010812
2007011023
2010010216
2010010410
2015010649
2015010812
2015011023
数据含义: 数据为 2014010216,表示在2014年01月02日的气温为16度
建表语句和加载数据
hive> create external table test3(tem string)
> row format delimited fields terminated by ',';
hive> load data local inpath '/usr/local/test3.txt' into table test3;
首先按照年,日期,温度进行切分,切分成三列数据
select substr(tem,1,4) as year,
substr(tem,1,8) as d,
substr(tem,-2) as temp
from test3;
结果为
2014 20140102 16
2014 20140104 10
2012 20120106 09
2012 20120108 12
2012 20120110 23
2001 20010102 12
2001 20010104 11
2013 20130106 19
2013 20130108 12
2013 20130110 23
2008 20080102 16
2008 20080104 14
2007 20070106 19
2007 20070108 12
2007 20070110 23
2010 20100102 16
2010 20100104 10
2015 20150106 49
2015 20150108 12
2015 20150110 23
然后求取每一年的温度最大值,这里不要用group by,用开窗函数,否则取不到日期字段
select d,
temp,
max(temp) over (partition by year) as maxTem
from (
select substr(tem,1,4) as year,
substr(tem,1,8) as d,
substr(tem,-2) as temp
from test3
) t
结果为
20010102 12 12
20010104 11 12
20070106 19 23
20070110 23 23
20070108 12 23
20080104 14 16
20080102 16 16
20100104 10 16
20100102 16 16
20120106 09 23
20120110 23 23
20120108 12 23
20130110 23 23
20130108 12 23
20130106 19 23
20140102 16 16
20140104 10 16
20150108 12 49
20150110 23 49
20150106 49 49
然后判断每一天的温度,是否等于当年的最高温度,如果相等则输出
select d,maxTem
from (select d,
temp,
max(temp) over (partition by year) as maxTem
from (
select substr(tem,1,4) as year,
substr(tem,1,8) as d,
substr(tem,-2) as temp
from test3
) t
)t2
where temp = maxTem;
结果为
20010102 12
20070110 23
20080102 16
20100102 16
20120110 23
20130110 23
20140102 16
20150106 49
求学生选课情况
id,course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e
首先将去重后的课程列表添加在每一行数据的后面
create table stu_course as
select t1.id as id,
t1.course as cs,
t2.course courses
from (
select id as id,
collect_set(course) as course
from test4 group by id ) t1
join
(select sort_array(collect_set(course)) as course from test4) t2;
其中,t2 表的结果为 [“a”,“b”,“c”,“e”,“d”,“f”]
没有on语句,即没有连接字段的 join语句,作用是将第二张表的结果添加在第一张表每一行数据的后面,这里就是将 [“a”,“b”,“c”,“e”,“d”,“f”] 添加在 t1表 每一行数据的后面
collect_set(course) 的作用是将course的全部列值转变成一个没有重复元素的数组
但是collect_set转换的数组是无序的,所以需要用 sort_array 进行排序
结果为
1 ["a","b","c","e"] ["a","b","c","d","e","f"]
2 ["a","c","d","f"] ["a","b","c","d","e","f"]
3 ["a","b","c","e"] ["a","b","c","d","e","f"]
array_contains可以用于判断一张表中同一个id的多条记录中的同一字段是否包含指定的一个或多个值。需要注意字段类型保持一致,若不一致则需要进行强制类型转换
这里就是判断同一个id的cs字段,是否包含course(0),course(1)等值
case when 会返回多个列,每一个case when对应一个列
select id,
case when array_contains(cs,courses[0]) then 1 else 0 end as a,
case when array_contains(cs,courses[1]) then 1 else 0 end as b,
case when array_contains(cs,courses[2]) then 1 else 0 end as c,
case when array_contains(cs,courses[3]) then 1 else 0 end as d,
case when array_contains(cs,courses[4]) then 1 else 0 end as e,
case when array_contains(cs,courses[5]) then 1 else 0 end as f
from stu_course;
结果为
最左边的一列为id,后面的六列对应六门课程,a、b、c、d、e、f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
求月销售额和总销售额
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
create table test5(store string,month int,money int)
row format delimited fields terminated by ',';
需求是求出每个店铺的当月销售额和累计到当月的总销售额
select store,
month,
m1,
m2
from (select store,
month,
sum(money) over (partition by store,month) as m1,
sum(money) over (partition by store order by month) as m2,
row_number() over (partition by store,month) as rank
from test5) t
where rank = 1;
这里的两个sum开窗函数的用法在之前说过了,不加order by是分组内全局聚合,加上order by是累加
这里主要要注意的是,利用row_number()完成最终数据的去重
最后的结果为