编程题(前6题每题10分,第7,8题每小题20分,共100分)
- 有三个Hive的表,分别是A、B、C,数据如下,求三个表中互不重复的数据
提示: 不重复的数据,即在三张表中所有的数据加载一起时,只出现了一次
A.txt B.txt C.txt
1 2 1
2 3 2
3 11 3
4 12 11
5 14 5
6 15 6
7 16 7
8 18 8
9 35 20
6 30
7 40 - 有数据如下,求英雄出场次数Top3的出场次数及出场率(出场次数/总场次)
id heros
1 廉颇,镜,沈梦溪,李元芳,太乙真人
2 关羽,兰陵王,嬴政,虞姬,鲁班大师
3 梦琪,盘古,周瑜,狄仁杰,大乔
4 廉颇,澜,上官婉儿,公孙离,盾山
5 吕布,娜可露露路,姜子牙,公孙离,张飞
6 马超,猪八戒,狄仁杰,沈梦溪,太乙真人
7 吕布,盘古,嫦娥,公孙离,张飞
8 廉颇,橘右京,西施,虞姬,大乔
9 关羽,镜,姜子牙,狄仁杰,鲁班大师
10 梦琪,阿古朵,周瑜,后羿,蔡文姬
11 夏侯惇,娜可露露,不知火舞,孙尚香,太乙真人
12 猪八戒,镜,嫦娥,伽罗,孙膑
13 廉颇,镜,上官婉儿,马可波罗,蔡文姬
14 梦琪,裴擒虎,沈梦溪,虞姬,鲁班大师
15 梦琪,盘古,不知火舞,成吉思汗,太乙真人
16 夏侯惇,澜,周瑜,马可波罗,张飞
17 猪八戒,露娜,周瑜,狄仁杰,盾山
18 吕布,橘右京,西施,蒙伢,蔡文姬
19 吕布,赵云,西施,公孙离,张飞
20 廉颇,兰陵王,沈梦溪,虞姬,大乔 - 有如下数据,表示1、2、3三名学生选修了a、b、c、d、e、f中的若干课程
id course
1 a
1 b
1 c
1 e
2 a
2 c
2 d
2 f
3 a
3 b
3 c
3 e
根据如上数据,查询出如下结果,其中1表示选修,0表示未选修
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0 - 有如下数据
PLATFORM USER_ID CLICK_TIME CLICK_URL
WEB 12332321 2013-03-21 13:48:31.324 /home/
WEB 12332321 2013-03-21 13:48:32.954 /selectcat/er/
WEB 12332321 2013-03-21 13:48:46.365 /er/viewad/12.html
WEB 12332321 2013-03-21 13:48:53.651 /er/viewad/13.html
WEB 12332321 2013-03-21 13:49:13.435 /er/viewad/24.html
WEB 12332321 2013-03-21 13:49:35.876 /selectcat/che/
WEB 12332321 2013-03-21 13:49:56.398 /che/viewad/93.html
WEB 12332321 2013-03-21 13:50:03.143 /che/viewad/10.html
WEB 12332321 2013-03-21 13:50:34.265 /home/
WAP 32483923 2013-03-21 23:58:41.123 /m/home/
WAP 32483923 2013-03-21 23:59:16.123 /m/selectcat/fang/
WAP 32483923 2013-03-21 23:59:45.123 /m/fang/33.html
WAP 32483923 2013-03-22 00:00:23.984 /m/fang/54.html
WAP 32483923 2013-03-22 00:00:54.043 /m/selectcat/er/
WAP 32483923 2013-03-22 00:01:16.576 /m/er/49.html
…… …… …… ……
现需将数据整理成如下数据结构
PLATFORM USER_ID SEQ FROM_URL TO_URL
WEB 12332321 1 NULL /home/
WEB 12332321 2 /home/ /selectcat/er/
WEB 12332321 3 /selectcat/er/ /er/viewad/12.html
WEB 12332321 4 /er/viewad/12.html /er/viewad/13.html
WEB 12332321 5 /er/viewad/13.html /er/viewad/24.html
WEB 12332321 6 /er/viewad/24.html /selectcat/che/
WEB 12332321 7 /selectcat/che/ /che/viewad/93.html
WEB 12332321 8 /che/viewad/93.html /che/viewad/10.html
WEB 12332321 9 /che/viewad/10.html /home/
WAP 32483923 1 NULL /m/home/
WAP 32483923 2 /m/home/ /m/selectcat/fang/
WAP 32483923 3 /m/selectcat/fang/ /m/fang/33.html
WAP 32483923 4 /m/fang/33.html /m/fang/54.html
WAP 32483923 5 /m/fang/54.html /m/selectcat/er/
WAP 32483923 6 /m/selectcat/er/ /m/er/49.html
…… …… …… …… ……
说明:PLATFORM和USER_ID还是代表平台和用户ID;
SEQ字段代表用户按时间排序后的访问顺序,
FROM_URL和TO_URL分别代表用户从哪一页跳转到哪一页。
对于某个平台上某个用户的第一条访问记录,其FROM_URL是NULL(空值)。 - Employee 表包含所有员工和他们的经理。每个员工都有一个 Id,并且还有一列是经理的 Id。
Id Name department managerId
101 john A NULL
102 dan A 101
103 james A 101
104 amy A 101
105 Anne A 101
…
请编写一个SQL查询来查找至少有4名直接下属的经理。 - 现在有一份数据mingxing_favor表:
Id Name Age Favor
1 Huangbo 33 A,B,C,D,E
2 Xuzheng 44 B,C
3 Wangbaoqiang 55 C,D,E
4 Fanbingbing 32 A,B,D
求出每种爱好中年龄最大的人,如果有相同的年龄,请并列显示
7、我们有如下的用户访问数据:
userId visitDate visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
U02 2017/1/23 6
U01 2017/2/22 4
要求使用SQL统计出每个用户的累积访问次数,如下表所示:
用户id 月份 小计 累积
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
8、有很多的天猫店铺,每个顾客访客访问任何一个店铺的任何一个商品时都会产生一条访问日志, 访问日志存储的表名为Visit,访客的用户id为user_id,被访问的店铺名称为shop,数据如下
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
(1)每个店铺的UV(访客数)
(2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
– 1
select *
from demo01;
select number
from demo01
group by number
having count(1) = 1;
– 2
select *
from (select name,
cishu,
dense_rank() over (order by cishu desc ) paiming,
cishu / (sum(cishu) over () / 5) bili
from (select name, count(1) cishu
from demo02 lateral view explode(heros) hero_table as name
group by name) t1) t2
where paiming <= 3
order by paiming;
select name, cishu, ratio
from (
select name,
cishu,
cast(cishu / sum(cishu) over ()) as ratio,
dense_rank() over (order by cishu desc) as dr
from (
select name, count(1) as cishu
from demo02 lateral view explode(heros) hero_table as name
group by name) t1) n
where dr <= 3;
– 3
select id,
sum(case when course = “a” then 1 else 0 end) a,
sum(case when course = “b” then 1 else 0 end) b,
sum(case when course = “c” then 1 else 0 end) c,
sum(case when course = “d” then 1 else 0 end) d,
sum(case when course = “e” then 1 else 0 end) e,
sum(case when course = “f” then 1 else 0 end) f
from demo03
group by id;
– 4
CREATE TABLE demo04
(
PLATFORM string,
USER_ID int,
VISITE_TIME string,
FROM_URL string
)
row format delimited
fields terminated by “=”;
load data local inpath ‘/root/data/demo04.txt’ into table demo04;
select *
from demo04;
select PLATFORM,
USER_ID,
row_number() over (partition by user_id order by VISITE_TIME ) SEQ,
lag(FROM_URL, 1, null) over (partition by USER_ID order by VISITE_TIME) from_url,
FROM_URL to_url
from demo04;
– 5
select id, name
from demo05
where id in
(select managerid from demo05 group by managerid having count(*) > 3);
– 6
select id, name, age, people_favor
from (select id,
name,
age,
people_favor,
dense_rank() over (partition by people_favor order by age desc ) paiming
from demo06 lateral view explode(favor) favortable as people_favor) t1
where paiming = 1;
– 7
create table demo07
(
id varchar(10),
visitDate string,
visitCount int
)
row format delimited
fields terminated by ‘:’;
load data local inpath ‘/root/data/demo07.txt’ overwrite into table demo07;
select id,
times,
tal,
sum(tal) over (partition by id order by times rows between unbounded preceding and current row )
from (select id,
times,
sum(visitCount) tal
from (select id, date_format(visitDate, ‘yyyy-MM’) times, visitCount from demo07) t1
group by id, times) t2;
– 8
create table demo08
(
user_id varchar(10),
shop_name varchar(10)
)
row format delimited
fields terminated by ‘:’;
load data local inpath ‘/root/data/demo08.txt’ into table demo08;
select *
from demo08;
– (1)
select shop_name, count(distinct user_id) number
from demo08
group by shop_name;
select shop_name, count(*)
from (select shop_name, user_id number
from demo08
group by shop_name, user_id) t1
group by shop_name;
– (2)
select shop_name, user_id, cishu,paiming
from
(select shop_name, user_id, cishu, rank() over (partition by shop_name order by cishu desc ) paiming
from (select shop_name,
user_id,
count(1) cishu
from demo08
group by shop_name, user_id) t1) t2 where paiming<=3;