hive面试题及解题思路

最新推荐文章于 2024-06-20 10:30:00 发布

autumnLemon

最新推荐文章于 2024-06-20 10:30:00 发布

阅读量647

点赞数

第一题：

现有这么一批数据，现要求出：  
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数  
  
三个字段的意思：  
用户名，月份，访问次数  
  
A,2015-01,5  
A,2015-01,15  
B,2015-01,5  
A,2015-01,8  
B,2015-01,25  
A,2015-01,5  
A,2015-02,4  
A,2015-02,6  
B,2015-02,10  
B,2015-02,5  
A,2015-03,16  
A,2015-03,22  
B,2015-03,23  
B,2015-03,10  
B,2015-03,11  
  
最后结果展示:  
  
用户  月份      最大访问次数  总访问次数       当月访问次数  
A   2015-01     33          33        33  
A   2015-02     33          43        10  
A   2015-03     38          81        38  
B   2015-01     30          30        30  
B   2015-02     30          45                15  
B   2015-03     44          89        44

准备数据：
 
data1.txt
 
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11
 
建表：
create database if not exists exercise;
use exercise;
drop table if exists exercise1;
create table if not exists exercise1(name string, month string, visit int) 
row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/data1.txt" into table exercise1;
select * from exercise1;

第二步：
 
为了统计出以下这样的数据：
A	2015-01	33
A	2015-02	10
A	2015-03	38
 
编写SQL ：
 
create table exercise1_step1 as select name,month, sum(visit) as visit from exercise1 group by name,month;
select * from exercise1_step1;   // 6 行 3 列

第三步：
create table exercise1_step2 as 
select  a.name as namea, a.month as montha, a.visit as visita,
b.name as nameb, b.month as monthb, b.visit as visitb 
from exercise1_step1 a join exercise1_step1 b on a.name = b.name;

select * from exercise1_step2;

最终的SQL ：create table exercise1_step3 as select nameb, monthb, visitb, max(visita) as max_visit, sum(visita) as sum_visit from exercise1_step2 where montha <= monthb group by nameb, monthb, visitb ;查询验证结果：select * from exercise1_step3;

第二题：

// 建表语句：  
CREATE TABLE `course` (  
  `id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,  
  `sid` int(11) DEFAULT NULL,  
  `course` varchar(255) DEFAULT NULL,  
  `score` int(11) DEFAULT NULL   
) ENGINE=InnoDB DEFAULT CHARSET=utf8;  
  
// 插入数据  
// 字段解释：id, 学号， 课程， 成绩  
INSERT INTO `course` VALUES (1, 1, 'yuwen', 43);  
INSERT INTO `course` VALUES (2, 1, 'shuxue', 55);  
INSERT INTO `course` VALUES (3, 2, 'yuwen', 77);  
INSERT INTO `course` VALUES (4, 2, 'shuxue', 88);  
INSERT INTO `course` VALUES (5, 3, 'yuwen', 98);  
INSERT INTO `course` VALUES (6, 3, 'shuxue', 65);  
  
求：所有数学课程成绩 大于 语文课程成绩的学生的学号

解题思路：
数据准备：
data2.txt
 
1	1	yuwen	43
2	1	shuxue	55
3	2	yuwen	77
4	2	shuxue	88
5	3	yuwen	98
6	3	shuxue	65
7	3	yingyu	88
 
建表：
 
create table if not exists course(id int, sid int, course string,score int) 
row format delimited fields terminated by "\t";
load data local inpath "/home/hadoop/data2.txt" into table course;
 
select * from course;

行转列：
select sid,
case course when "yuwen" then score else 0 end as yuwen, 
case course when "shuxue" then score else 0 end as shuxue,
case course when "yingyu" then score else 0 end as yingyu 
from course;

行转列只是思路。
最终SQL
select sid from (
select sid,
sum(case course when "yuwen" then score else 0 end) as yuwen, 
sum(case course when "shuxue" then score else 0 end) as shuxue,
sum(case course when "yingyu" then score else 0 end) as yingyu 
from course group by sid
) b where b.shuxue > b.yuwen;

第三题：


2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023
 
比如：2010012325表示在2010年01月23日的气温为25度。现在要求使用hive，计算每一年出现过的最大气温的日期+温度。
要计算出每一年的最大气温。我用
select substr(data,1,4),max(substr(data,9,2)) from table2 group by substr(data,1,4);
出来的是  年份  +  温度  这两列数据例如 2015     99

准备数据：
建表：
create database if not exists exercise;
use exercise;
drop table if exists exercise3;
create table exercise3(line string);
load data local inpath "/home/hadoop/data3.txt" into table exercise3;
select * from exercise3;

求出每一年中的最高温度：
select substring(line, 1, 4) as year,  max(substring(line, -2)) as max_temp 
from exercise3 group  by substring(line, 1, 4);

需求增强：  不仅要求出每一年的最高温度，还要求出这个最高温度出现在这一年中的哪一天？
这个是错的：select substring(line, 1, 8) as year,  max(substring(line, -2)) as max_temp 
from exercise3 group by substring(line, 1, 4);
 
所以要使用：
在一个具有group by关键字的SQL语句中；
如果某个字段出现在select的后面，
那么就一定要是以下这两种情况之一才可以：
1、要么是group by后面的一个字段
2、要么是聚合函数的结果字段
 
 
正确的SQL ：
 
提供两种方案：
 
1、使用TopN 的技巧可以解决
 
2、再次使用一次连接SQL既可以解决
 
 
最终的完整SQL ：
 
select substring(b.line, 1, 8) as max_temp_date, a.max_temp  
from exercise3 b join (
select substring(c.line, 1, 4) as year,  max(substring(c.line, -2)) as max_temp 
from exercise3 c group  by substring(c.line, 1, 4)
) a 
on a.year = substring(b.line, 1, 4)  and 
a.max_temp = substring(b.line, -2);

autumnLemon

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive面试题及解题思路

第一题：现有这么一批数据，现要求出：每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数三个字段的意思：用户名，月份，访问次数 A,2015-01,5 A,2015-01,15 B,2015-01,5 A,2015-01,8 B,2015-01,25 A,2015-01,5 A,2015-02,4 A,2015-02,...
复制链接

扫一扫