这里的两份数据是NBA 2017-2018赛季 的比赛结果nba_game.csv和比赛详情nba_game_detail.csv
关于NBA,需要了解一个背景知识:
NBA总共有30支球队
NBA的每支球队在每个赛季的常规赛,都会有82场比赛
NBA的每支球队在赛季初都会有场次不等的季前赛,但是按照时间的先后顺序排序之后,最后的82场比赛是常规赛,其余的都是季前赛
NBA的每支球队,如果排名在所在分部的前8名次,则表示有资格参加季后赛
数据中包含了部分还没有开始比赛的场次的结果。注意区分
分部: NBA总共30支球队,每个分部有三个分区,每个分区有5支球队
东部:1表示
西部:0表示
分区:
["西南赛区","太平洋赛区","西北赛区","大西洋赛区","东南赛区","中部赛区"]
分别用数字1,2,3,4,5,6表示
NBA的每一场比赛都分为四节,每一节12分钟。如果四节结束,两支队伍的得分一样,那么进入加时赛,加时一次比赛5分钟,如果加时决不出胜负,则一直加时,直到决出胜负
第一份数据:nba_team.csv
数据意义:每一支球队的相关信息,主要是球队名称,属于哪个分部,哪个分区
数据样式:
1,0,西南赛区,1,https://nba.hupu.com/schedule/rockets,休斯顿火箭
表示“休斯顿火箭”球队是西部的西南赛区的球队
字段意义:
球队ID,分部,分赛区,分赛区ID,球队URL,球队名称
字段名称:
team_id, we_id, district, district_id, team_url, team_name
第二份数据:nba_game.csv
意义:每一条记录表示一场NBA结果
数据样式:
1,火箭,雷霆,104,27-32-31-14,97,25-27-21-24,胜,2017-10-04 08:00:00,暂无统计,17733,https://nba.hupu.com/games/boxscore/154277
字段意义:
比赛ID,客场队伍,主场队伍,客场队伍得分,课程队伍的每一节得分,主场队伍得分,主场队伍的每一节得分,比赛胜负结果,比赛开始时间,比赛详情URL
字段名称:
game_id, team_away, team_home, score_away, section_away, score_home, section_home, result, start_time, game_time, audience, game_url
第三份数据:nba_game_detail.csv
意义:每一位球员出场一场比赛的详情记录信息
数据样式:
1,1,0,1,詹姆斯-哈登,G,29,5-16,3-12,3-5,1,4,5,10,3,3,7,2,16,+2
字段意义:
比赛ID,球员所属队伍的ID(1-30),主客场(0客场球员,1主场球员),是否首发(1是,0不是),出场球员,球员角色,出场时间,投篮,三分投篮,罚球,前场篮板,后场篮板,总篮板,助攻,犯规,抢断,失误,封盖,得分,效率正负值
字段名称:
game_id, team_id, ishome, isfirst, name, role, time, shoot, shoot3, penalty_shoot, offensive_rebs, deffensive_rebs, total_rebs, assist, foul, steal, fault, block, score, value
需求:
1、创建表,导入数据,并查询验证
问题总结:
在把数据导入到Hive的时候,会出现一行数据,一行全是null的情况。 原因是在windows环境下生成的这种CSV格式的数据,在每一行的最后会出现一个控制字符:^M
42,火箭,活塞,101,37-21-19-24,108,31-26-31-20,负,2018-01-07 08:00:00,02:07,18046,https://nba.hupu.com/games/boxscore/153617^M
43,火箭,公牛,116,35-25-27-29,107,22-33-21-31,胜,2018-01-09 09:00:00,02:02,17462,https://nba.hupu.com/games/boxscore/153629^M
44,开拓者,火箭,112,19-26-32-35,121,25-30-31-35,胜,2018-01-11 09:00:00,02:22,18055,https://nba.hupu.com/games/boxscore/153643^M
45,火箭,太阳,112,27-30-33-22,95,23-21-24-27,胜,2018-01-13 11:30:00,02:03,18055,https://nba.hupu.com/games/boxscore/153661^M
46,火箭,快船,102,31-25-28-18,113,30-29-33-21,负,2018-01-16 11:30:00,02:20,17622,https://nba.hupu.com/games/boxscore/153683^M
这个控制字符 可以 通过 ctrl + m 键摁出来。 所以可以考虑使用sed命令进行替换。 以下是替换命令
sed -i 's/\x0d//g' nba_team.csv
sed -i 's/\x0d//g' nba_game.csv
sed -i 's/\x0d//g' nba_game_detail.csv
第一张表:
create database if not exists nba;
use nba;
drop table if exists nba_team;
create table if not exists nba_team (team_id int, we_id int, district string, district_id int, team_url string, team_name string)
row format delimited fields terminated by ",";
truncate table nba_team;
load data local inpath "/home/hadoop/nba/nba_team.csv" into table nba_team;
select * from nba_team limit 10;
第二张表:
create database if not exists nba;
use nba;
drop table if exists nba_game;
create table if not exists nba_game (game_id int, team_away_id int, team_home_id int, team_away string, team_home string, score_away int, section_away string, score_home int, section_home string,
result string, start_time string, game_time string, audience int, game_url string)
row format delimited fields terminated by ",";
truncate table nba_game;
load data local inpath "/home/hadoop/nba/nba_game.csv" into table nba_game;
select * from nba_game limit 10;
第三张表:
create database if not exists nba;
use nba;
drop table if exists game_detail;
create table if not exists game_detail (game_id int, team_id int, ishome int, isfirst int, name string, role string, time int, shoot string, shoot3 string, penalty_shoot string
, offensive_rebs int, deffensive_rebs int, total_rebs int, assist int, foul int, steal int, fault int, block int, score int, value string)
row format delimited fields terminated by ",";
truncate table game_detail;
load data local inpath "/home/hadoop/nba/nba_game_detail.csv" into table game_detail;
select * from game_detail limit 10;
2、 求出常规赛得分王,要求出场次数大于等于30(场均得分最高的球员)(姓名,得分,出场次数)
求出常规赛助攻王,要求出场次数大于等于30(场均助攻最高的球员)(姓名,助攻,出场次数)
求出常规赛篮板王,要求出场次数大于等于30(场均篮板最高的球员)(姓名,篮板,出场次数)
求出常规赛抢断王,要求出场次数大于等于30(场均抢断最高的球员)(姓名,抢断,出场次数)
求出常规赛盖帽王,要求出场次数大于等于30(场均盖帽最高的球员)(姓名,盖帽,出场次数)
第一步:查询出所有的常规赛(按照比赛日期进行过滤)
drop table if exists all_game;
create table all_game as
select * from nba_game where start_time >= "2017-10-17 00:00:00" and start_time <= "2018-04-14 00:00:00";
select * from all_game;
select * from all_game where team_home = "勇士" order by start_time;
第二步:在原有的常规比赛比赛基础之上,再复制一份,把复制出来的一份数据的主客场队伍两个字段的顺序对换一下
这样做的目的,就是保证不管按照team_away字段还是按照team_home字段,都能进行按照球队分组,得到每支球队的82场比赛为一组,进行战绩统计
drop table if exists all_changgui_game_error_result;
create table all_changgui_game_error_result as
select game_id,
team_away_id, team_home_id, team_away, team_home,
score_away, section_away, score_home, section_home,
result, start_time, game_time, audience, game_url from all_game
union all
select game_id ,
team_home_id as team_away_id, team_away_id as team_home_id , team_home as team_away, team_away as team_home,
score_home as score_away, section_home as section_away, score_away as score_home, section_away as section_home,
result, start_time, game_time, audience, game_url from all_game;
// 这样计算出来:队伍的胜负结果是错的。!! 所以必须重拍, 这个结果 胜负 是相对于 team_home 字段来说的。
// all_changgui_game 表中的结果result的结果,都是相对于team_home来说的
drop table if exists all_changgui_game;
create table all_changgui_game as
select game_id, team_away_id, team_home_id, team_away, team_home,
score_away, section_away, score_home, section_home,
if(score_away > score_home, "负","胜") as result, start_time, game_time, audience, game_url
from all_changgui_game_error_result;
select * from all_changgui_game limit 10;
select * from all_changgui_game where team_home = "火箭";
select * from all_changgui_game where team_home = "勇士" order by start_time;
解题:
得分王:
select bb.name as name, bb.avg_score as score, bb.count/2 as count from (
select a.name as name, avg(a.score) as avg_score, count(1) as count
from game_detail a join all_changgui_game b on a.game_id = b.game_id
where a.time != 0
group by a.name
having count >= 30
order by avg_score desc limit 1) bb;
结果:
詹姆斯-哈登 30.430555555555557 72.0
助攻王:
select bb.name as name, bb.avg_assist as assist, bb.count/2 as count from (
select a.name as name, avg(a.assist) as avg_assist, count(1) as count
from game_detail a join all_changgui_game b on a.game_id = b.game_id
where a.time != 0
group by a.name
having count >= 30
order by avg_assist desc limit 1) bb;
结果:
拉塞尔-威斯布鲁克 10.25 80.0
篮板王:
select bb.name as name, bb.avg_rebs as avg_rebs, bb.count/2 as count from (
select a.name as name, avg(a.total_rebs) as avg_rebs, count(1) as count
from game_detail a join all_changgui_game b on a.game_id = b.game_id
where a.time != 0
group by a.name
having count >= 30
order by avg_rebs desc limit 1) bb;
结果:
安德烈-德拉蒙德 16.0 78.0
抢断王:
select bb.name as name, bb.avg_steal as steal, bb.count/2 as count from (
select a.name as name, avg(a.steal) as avg_steal, count(1) as count
from game_detail a join all_changgui_game b on a.game_id = b.game_id
where a.time != 0
group by a.name
having count >= 30
order by avg_steal desc limit 1) bb;
结果:
维克托-奥拉迪波 2.3333333333333335 75.0
盖帽王:
select bb.name as name, bb.avg_block as block, bb.count/2 as count from (
select a.name as name, avg(a.block) as avg_block, count(1) as count
from game_detail a join all_changgui_game b on a.game_id = b.game_id
where a.time != 0
group by a.name
having count >= 30
order by avg_block desc limit 1) bb;
结果:
安东尼-戴维斯 2.56 75.0
3、 求出每个球队的得分王(参与比赛的场次 >= 30)(球队名称,球员姓名,得分, 场次)
求出每个球队的助攻王(参与比赛的场次 >= 30)(球队名称,球员姓名,助攻, 场次)
求出每个球队的篮板王(参与比赛的场次 >= 30)(球队名称,球员姓名,篮板, 场次)
求出每个球队的抢断王(参与比赛的场次 >= 30)(球队名称,球员姓名,抢断, 场次)
求出每个球队的盖帽王(参与比赛的场次 >= 30)(球队名称,球员姓名,盖帽, 场次)
得分王:
drop table if exists team_score;
create table team_score as
select a.team_id as team_id,
c.team_name as team_name,
a.name,
avg(a.score) as avg_score,
count(1) as count
from game_detail a join all_changgui_game b on a.game_id = b.game_id join nba_team c on b.team_home_id = c.team_id
where a.time != 0
group by a.team_id, c.team_name, a.name
having count >= 30;
select * from team_score;
结果:休斯顿火箭的所有出场超过30的球员得分信息
1 休斯顿火箭 PJ-塔克 6.121951219512195 82
1 休斯顿火箭 克林特-卡佩拉 13.864864864864865 74
1 休斯顿火箭 克里斯-保罗 18.637931034482758 58
1 休斯顿火箭 内内 6.538461538461538 52
1 休斯顿火箭 卢克-理查德-巴莫特 7.540983606557377 61
1 休斯顿火箭 埃里克-戈登 18.014492753623188 69
1 休斯顿火箭 塔里克-布莱克 3.5294117647058822 51
1 休斯顿火箭 杰拉德-格林 12.121951219512194 41
1 休斯顿火箭 特雷沃-阿里扎 11.671641791044776 67
1 休斯顿火箭 莱恩-安德森 9.348484848484848 66
1 休斯顿火箭 詹姆斯-哈登 30.430555555555557 72
select team_name, name, avg_score, count from (select a.*, row_number() over (distribute by a.team_id sort by avg_score desc) as index from team_score a) aa where aa.index = 1 order by avg_score desc;
结果:每一支球队的得分王
休斯顿火箭 詹姆斯-哈登 30.430555555555557 72
新奥尔良鹈鹕 安东尼-戴维斯 28.133333333333333 75
克利夫兰骑士 勒布朗-詹姆斯 27.451219512195124 82
波特兰开拓者 达米安-利拉德 26.876712328767123 73
密尔沃基雄鹿 扬尼斯-阿德托昆博 26.85333333333333 75
金州勇士 斯蒂芬-库里 26.392156862745097 51
俄克拉荷马城雷霆 拉塞尔-威斯布鲁克 25.35 80
菲尼克斯太阳 德文-布克 24.925925925925927 54
波士顿凯尔特人 凯里-欧文 24.433333333333334 60
圣安东尼奥马刺 拉马库斯-阿尔德里奇 23.133333333333333 75
印第安纳步行者 维克托-奥拉迪波 23.133333333333333 75
多伦多猛龙 德马尔-德罗赞 23.0 80
费城76人 乔尔-恩比德 22.936507936507937 63
纽约尼克斯 克里斯塔普斯-波尔津吉斯 22.666666666666668 48
华盛顿奇才 布拉德利-比尔 22.646341463414632 82
洛杉矶快船 布雷克-格里芬 22.606060606060606 33
明尼苏达森林狼 吉米-巴特勒 22.152542372881356 59
夏洛特黄蜂 肯巴-沃克 22.125 80
犹他爵士 多诺万-米切尔 20.455696202531644 79
孟菲斯灰熊 泰瑞克-埃文斯 19.423076923076923 52
亚特兰大老鹰 丹尼斯-施罗德 19.417910447761194 67
达拉斯独行侠 哈里森-巴恩斯 18.857142857142858 77
丹佛掘金 尼古拉-约基奇 18.466666666666665 75
底特律活塞 托拜厄斯-哈里斯 18.020833333333332 48
奥兰多魔术 埃文-富尼耶 17.771929824561404 57
迈阿密热火 戈兰-德拉季奇 17.28 75
洛杉矶湖人 朱利叶斯-兰德尔 16.134146341463413 82
布鲁克林篮网 丹吉洛-拉塞尔 15.479166666666666 48
芝加哥公牛 劳里-马尔卡宁 15.191176470588236 68
萨克拉门托国王 扎克-兰多夫 14.525423728813559 59
篮板。 助攻, 盖帽, 抢断等数据, 按照类似的逻辑, 可以计算出来。~~~~~~ 不再重复
4、求出每支球队的战绩(多少胜,多少负),并且按照战绩进行东西部排名
结果样式:
火箭 65-17 西部 1
解题SQL:
drop table if exists all_changgui_zhanji;
create table all_changgui_zhanji as
select a.*, row_number() over (distribute by a.we_id sort by win desc ) as index from
(select bb.we_id as we_id,bb.district_id as district_id, aa.team_home_id as home_id, aa.team_home as home_name, "胜-负" as result, concat_ws("-", collect_list(cast(aa.total as string))) as total,
split(concat_ws("-", collect_list(cast(aa.total as string))), "-")[0] as win, split(concat_ws("-", collect_list(cast(aa.total as string))), "-")[1] as lose
from
(select team_home_id, team_home, result, count(*) as total from all_changgui_game a group by team_home_id, team_home, result) aa join nba_team bb on aa.team_home_id = bb.team_id
group by bb.we_id, bb.district_id, aa.team_home_id, aa.team_home, "胜-负") a;
select * from all_changgui_zhanji;
结果:
0 1 火箭 胜-负 65-17 65 17 1
0 6 勇士 胜-负 58-24 58 24 2
0 12 开拓者 胜-负 49-33 49 33 3
0 2 鹈鹕 胜-负 48-34 48 34 4
0 11 爵士 胜-负 48-34 48 34 5
0 13 雷霆 胜-负 48-34 48 34 6
0 3 马刺 胜-负 47-35 47 35 7
0 14 森林狼 胜-负 47-35 47 35 8
0 15 掘金 胜-负 46-36 46 36 9
0 7 快船 胜-负 42-40 42 40 10
0 8 湖人 胜-负 35-47 35 47 11
0 9 国王 胜-负 27-55 27 55 12
0 4 独行侠 胜-负 24-58 24 58 13
0 5 灰熊 胜-负 22-60 22 60 14
0 10 太阳 胜-负 21-61 21 61 15
1 16 猛龙 胜-负 59-23 59 23 1
1 17 凯尔特人 胜-负 55-27 55 27 2
1 18 76人 胜-负 52-30 52 30 3
1 26 骑士 胜-负 50-32 50 32 4
1 27 步行者 胜-负 48-34 48 34 5
1 21 热火 胜-负 44-38 44 38 6
1 28 雄鹿 胜-负 44-38 44 38 7
1 22 奇才 胜-负 43-39 43 39 8
1 29 活塞 胜-负 39-43 39 43 9
1 23 黄蜂 胜-负 36-46 36 46 10
1 19 尼克斯 胜-负 29-53 29 53 11
1 20 篮网 胜-负 28-54 28 54 12
1 30 公牛 胜-负 27-55 27 55 13
1 24 魔术 胜-负 25-57 25 57 14
1 25 老鹰 胜-负 24-58 24 58 15
14、分别求出东部和西部的季后赛队伍(东西部战绩前八)
select * from all_changgui_zhanji where index <= 8;
5、求出每支球队的场均净胜分(球队名称,场均净胜分),按照场均净胜分降序排序
drop table if exists team_avg_win;
create table team_avg_win as
select aa.team_name as team_name, (aa.all_score_home - aa.all_score_away)/count as avg_win
from (
select a.team_home as team_name, sum(a.score_away) as all_score_away, sum(a.score_home) as all_score_home, count(a.team_away_id) as count
from all_changgui_game a
group by a.team_home) aa order by avg_win desc;
select * from team_avg_win;
结果:
火箭 8.475609756097562
猛龙 7.780487804878049
勇士 5.975609756097561
76人 4.5
爵士 4.304878048780488
凯尔特人 3.5853658536585367
雷霆 3.4146341463414633
马刺 2.8902439024390243
开拓者 2.597560975609756
森林狼 2.231707317073171
掘金 1.475609756097561
步行者 1.3780487804878048
鹈鹕 1.3048780487804879
骑士 0.9390243902439024
奇才 0.5853658536585366
热火 0.47560975609756095
黄蜂 0.25609756097560976
快船 0.036585365853658534
活塞 -0.14634146341463414
雄鹿 -0.3048780487804878
湖人 -1.548780487804878
独行侠 -3.0365853658536586
尼克斯 -3.5609756097560976
篮网 -3.7439024390243905
魔术 -4.817073170731708
老鹰 -5.451219512195122
灰熊 -6.2073170731707314
国王 -6.987804878048781
公牛 -7.036585365853658
太阳 -9.365853658536585
6、求出每个赛区的头号种子(每个赛区战绩最好的队伍)(赛区名称,球队名称,战绩)
观察all_changgui_zhanji这张表的结构:
desc all_changgui_zhanji;
we_id int
district_id int
home_id int
home_name string
result string
total string
win string
lose string
index int
直接写出SQL:
drop table if exists district_firt;
create table district_firt as
select aa.district_id as district_id, bb.district as district, aa.home_name as team_name, aa.total as total
from (
select a.*, row_number() over (distribute by a.district_id sort by a.win desc ) as district_index from all_changgui_zhanji a
) aa join nba_team bb on aa.home_id = bb.team_id
where aa.district_index = 1;
select * from district_firt;
结果:
1 西南赛区 火箭 65-17
2 太平洋赛区 勇士 58-24
3 西北赛区 开拓者 49-33
4 大西洋赛区 猛龙 59-23
5 东南赛区 热火 44-38
6 中部赛区 骑士 50-32
7、求出NBA常规赛总共有多少场带加时的比赛
select count(*)/2 from all_changgui_game a where size(split(a.section_home, "-")) > 4;
结果:122
7、求出NBA常规赛每支球队的加时赛总场次, 给出最多的单场加时次数, 并且按照总加时赛次数从高到低排序
drop table if exists team_jiashi;
create table team_jiashi as
select a.team_home as team_home, count(*) as total, max(size(split(a.section_home, "-")) - 4) as max
from all_changgui_game a
where size(split(a.section_home, "-")) > 4
group by a.team_home
order by total desc;
select * from team_jiashi;
结果:
热火 9 2
鹈鹕 9 2
湖人 8 2
掘金 7 2
雄鹿 6 1
奇才 6 2
篮网 6 2
猛龙 6 1
森林狼 5 1
活塞 5 1
独行侠 5 1
尼克斯 5 2
公牛 5 2
黄蜂 4 1
雷霆 4 3
凯尔特人 4 2
开拓者 3 1
步行者 3 1
爵士 3 1
国王 3 1
火箭 2 2
马刺 2 1
骑士 2 1
勇士 2 1
76人 2 3
魔术 2 1
老鹰 1 1
灰熊 1 1
太阳 1 1
快船 1 1
select * from all_changgui_game where team_home = "76人";
select * from all_changgui_game where team_home = "热火";
7、求出NBA常规赛每支球队的加时赛总场次, 给出最多的加时次数的比赛ID和加时次数, 并且按照加时赛次数从高到低排序
难点:在一个含有group by关键字的SQL语句中, 怎么让不是 group by的字段, 也不是聚合函数的字段出现在select的后面?
7、求出加时次数最多的10场比赛,给出必要信息(参赛双方队伍名称,双方常规得分,加时次数, 双方加时得分)
自定义一个内置函数:udf.jar
自定义函数的实现:
public class NBASectionUDF extends UDF{
/**
* 传入参数score: score就是section_away 和 section_home 的字段
*
* 传入参数flag: flag为真时,求常规得分。 为假的时候,求加时得分
*/
public int evaluate(String score, boolean flag){
String[] section_scores = score.trim().split("-");
int sum = 0;
if(flag){
for(int i=0; i<4; i++){
sum += Integer.parseInt(section_scores[i]);
}
}else{
for(int i=4; i<section_scores.length; i++){
sum += Integer.parseInt(section_scores[i]);
}
}
return sum;
}
}
add jar /home/hadoop/udf.jar;
create temporary function get_nba_score as "com.ghgj.hive.udf.NBASectionUDF";
drop table if exists jiashi_game_detail;
create table jiashi_game_detail as
select a.team_away, a.team_home, (size(split(a.section_home, "-")) - 4) as length,
get_nba_score(a.section_home, true) as changgui_home, get_nba_score(a.section_home, false) as jiashi_home,
get_nba_score(a.section_away, true) as changgui_away, get_nba_score(a.section_away, false) as jiashi_away
from nba_game a
where size(split(a.section_home, "-")) > 4 and start_time >= "2017-10-17 00:00:00" and start_time <= "2018-04-14 00:00:00"
order by length desc;
select * from jiashi_game_detail;
8、求出替补球员中的得分王(首发不超过10场就算是替补登场球员)(出场次数大于等于30)
注意事项:有些球员在某些比赛中是首发, 在某些比赛中是替补
表game_detail的样式:
1,13,1,1,史蒂文-亚当斯,C,19,1-1,0-0,2-2,0,4,4,1,4,0,2,1,4,-11
1,13,1,0,杰拉米-格兰特,F,31,3-8,0-2,4-4,3,3,6,1,3,0,0,1,10,+1
game_shoufa_tibu :
是否首发,球员姓名, 总得分, 总出场次数
is_first,name,total,games
第一个SQL
drop table if exists game_shoufa_tibu;
create table game_shoufa_tibu as
select a.isfirst as is_first, a.name as name , sum(a.score) as total, count(a.game_id) as games
from game_detail a join all_game b on a.game_id = b.game_id
where a.time != 0 and b.start_time >= "2017-10-17 00:00:00" and b.start_time <= "2018-04-14 00:00:00"
group by a.isfirst, a.name
order by name, is_first;
select * from game_shoufa_tibu;
结果:
数据样式:
1 马科-贝里内利 27 1
0 马科-贝里内利 574 52
这种数据是我们想要的,马科-贝里内利 这名球员 1场 首发, 52 场替补。 这是标准的替补球员身份。
1 安东尼-托利弗 208 14
0 安东尼-托利弗 495 65
按照要求,这种数据不是我们需要的数据。
第二个SQL
drop table if exists game_shoufa_tibu_detail;
create table game_shoufa_tibu_detail as
select a.name,
concat_ws("-", collect_list(cast (a.is_first as string))) as is_first,
concat_ws("-", collect_list(cast(a.total as string))) as total,
concat_ws("-", collect_list(cast(a.games as string))) as games
from game_shoufa_tibu a
group by a.name;
select * from game_shoufa_tibu_detail order by name;
game_shoufa_tibu_detail
team_id name is_first total games
结果数据样式:
杰夫-蒂格 1 994 70
杰拉德-格林 0-1 451-46 39-2
杰拉米-格兰特 0-1 672-10 80-1
杰森-史密斯 0-1 107-6 31-2
杰森-塔特姆 1 1112 80
杰森-特里 0-1 141-25 46-4
杰迈克尔-格林 0-1 8-561 1-54
杰里-贝勒斯 0-1 193-114 28-11
杰里安-格兰特 0-1 364-255 48-26
select * from game_shoufa_tibu_detail where is_first = "0" and games >= 30;
第三个SQL:求出了所有替补球员中出场数大于30场,并且首发不超过10场的球员
drop table if exists game_shoufa10_tibu30;
create table game_shoufa10_tibu30 as
select *
from game_shoufa_tibu_detail a
where if(a.is_first = "0", cast(a.games as int), 0) >= 30
or
(
if(a.is_first = "0-1", cast(split(a.games, "-")[0] as int) + cast(split(a.games, "-")[1] as int) >= 30, false)
and
if(a.is_first = "0-1", cast(split(a.games, "-")[1] as int) < 10, false)
);
select * from game_shoufa10_tibu30;
第四个SQL :求出替补球员中的 得分列表
drop table if exists tibu_score_desc;
create table tibu_score_desc as
select a.name, if(a.is_first = "0", cast(a.games as int), cast(split(a.games, "-")[0] as int) + cast(split(a.games, "-")[1] as int)) as total,
if(a.is_first = "0", cast(a.games as int), cast(split(a.games, "-")[0] as int)) as tibu,
if(a.is_first = "0", cast(a.total as int), cast(split(a.total, "-")[0] as int) + cast(split(a.total, "-")[1] as int))/if(a.is_first = "0", cast(a.games as int), cast(split(a.games, "-")[0] as int) + cast(split(a.games, "-")[1] as int)) as avg_score
from game_shoufa10_tibu30 a
order by avg_score desc;
select * from tibu_score_desc;
求出得分王:
select * from tibu_score_desc limit 1;
结果:
乔丹-克拉克森 13.876543209876543
博比-波蒂斯 13.205479452054794
特雷-伯克 12.805555555555555
贾巴里-帕克 12.612903225806452
马科-贝里内利 12.1375
杰拉德-格林 12.121951219512194
鲁迪-盖伊 11.491228070175438
德维恩-韦德 11.417910447761194
韦恩-艾灵顿 11.220779220779221
弗兰克-卡明斯基 11.050632911392405
9、求出这样的球员:一场比赛中得分最高并且所在球队获胜,且获胜场次最多的前5名球员(给出球员,获胜场次)
第一步SQL:求出每场比赛中,两边得分最高的球员
drop table if exists game_player_score_detail;
create table game_player_score_detail as
select b.game_id as game_id, b.ishome as is_home , max(b.score) as max_score, sum(b.score) as sum_score
from game_detail b join nba_game a on a.game_id = b.game_id
where b.time != 0 and a.start_time >= "2017-10-17 00:00:00" and a.start_time <= "2018-04-14 00:00:00"
group by b.game_id, b.ishome ;
select * from game_player_score_detail;
game_player_score_detail
game_id is_home max_score sum_score
第二个SQL:
drop table if exists game_player_score_detail2;
create table game_player_score_detail2 as
select a.game_id, a.is_home, a.max_score, a.sum_score, b.name
from game_player_score_detail a join game_detail b on a.game_id = b.game_id and a.is_home = b.ishome and a.max_score = b.score;
select * from game_player_score_detail2;
数据样式:
1249 0 24 119 弗兰克-卡明斯基
1249 1 13 93 达伦-科里森
1249 1 13 93 格伦-罗宾逊三世
1249 1 13 93 TJ-利夫
1251 0 23 114 尼古拉-武切维奇
1251 1 22 93 勒布朗-詹姆斯
第三个SQL:
drop table if exists game_player_score_detail2_index;
create table game_player_score_detail2_index as
select aa.name as name, count(distinct aa.game_id) as total
from (
select a.*, rank() over (distribute by a.game_id sort by a.sum_score desc) as index
from game_player_score_detail2 a
) aa
where aa.index = 1
group by aa.name
order by total desc;
select * from game_player_score_detail2_index limit 5;
select * from game_player_score_detail2_index where total >= 10;
结果样式展示:
詹姆斯-哈登 45
勒布朗-詹姆斯 39
拉马库斯-阿尔德里奇 38
德马尔-德罗赞 38
安东尼-戴维斯 32
凯里-欧文 30
维克托-奥拉迪波 29
达米安-利拉德 28
扬尼斯-阿德托昆博 28
凯文-杜兰特 27
多诺万-米切尔 25
乔尔-恩比德 24
布拉德利-比尔 24
拉塞尔-威斯布鲁克 24
10、求出每支球队替补得分高于首发得分并且比赛获胜的场次(球队,获胜场次)
drop table if exists game_shoufa_tibu_score;
create table game_shoufa_tibu_score as
select a.game_id, a.team_id, a.isfirst, sum(a.score) as total
from game_detail a join all_game b on a.game_id = b.game_id
where a.time != 0 and b.start_time >= "2017-10-17 00:00:00" and b.start_time <= "2018-04-14 00:00:00"
group by a.game_id, a.team_id, a.isfirst
order by a.game_id, a.team_id, a.isfirst;
select * from game_shoufa_tibu_score;
结果:
game_shoufa_tibu_score
game_id team_id isfirst total
1339 28 0 70
1339 28 1 48
1339 30 0 37
1339 30 1 68
1340 29 0 32
1340 29 1 73
1340 30 0 35
1340 30 1 72
drop table if exists tibu_gt_shoufa;
create table tibu_gt_shoufa as
select *
from (
select a.game_id, a.team_id, collect_list(a.total) as total
from game_shoufa_tibu_score a
group by a.game_id, a.team_id
) aa where aa.total[0] > aa.total[1];
select * from tibu_gt_shoufa;
另一种写法:基本一致
select *
from (
select a.game_id, a.team_id, concat_ws("-", collect_list(a.total)) as total
from game_shoufa_tibu_score a
group by a.game_id, a.team_id
) aa where slipt(aa.total, "-")[0] > slipt(aa.total, "-")[1];
drop table if exists tibu_gt_shoufa_every_team_total;
create table tibu_gt_shoufa_every_team_total as
select a.team_home, count(distinct a.game_id) as total
from all_changgui_game a join tibu_gt_shoufa b on a.game_id = b.game_id
where a.result = "胜"
group by a.team_home
order by total desc;
select * from tibu_gt_shoufa_every_team_total;
最终结果:
76人 13
公牛 6
凯尔特人 9
勇士 6
国王 12
太阳 3
奇才 9
尼克斯 10
开拓者 2
快船 11
掘金 12
森林狼 6
步行者 9
活塞 4
湖人 4
火箭 7
灰熊 5
热火 6
爵士 4
独行侠 3
猛龙 15
篮网 5
老鹰 4
雄鹿 4
雷霆 6
马刺 10
骑士 5
魔术 2
鹈鹕 9
黄蜂 7
11、求出NBA整个赛季中场均20+5+5的球员(场均分得分20以上,场均篮板5以上,场均助攻5以上)
drop table if exists score20_rebs5_assist5;
create table score20_rebs5_assist5 as
select b.name, sum(b.score) as total_score, sum(b.total_rebs) as total_rebs, sum(b.assist) as total_assist, count(b.game_id) as total
from all_game a join game_detail b on a.game_id = b.game_id
where b.time != 0 and a.start_time >= "2017-10-17 00:00:00" and a.start_time <= "2018-04-14 00:00:00"
group by b.name
having total_score/total >= 20 and total_rebs/total >= 5 and total_assist/total >= 5 and total >= 30;
select * from score20_rebs5_assist5;
结果:
凯文-杜兰特 1792 465 366 68
勒布朗-詹姆斯 2251 710 747 82
布雷克-格里芬 1242 429 334 58
德马库斯-考辛斯 1210 619 257 48
拉塞尔-威斯布鲁克 2028 805 820 80
斯蒂芬-库里 1346 261 310 51
詹姆斯-哈登 2191 387 630 72
12、求出NBA整个赛季场均20+10的球员(场均得分或者助攻或者篮板或者抢断满足有一个20以上,有一个10以上)
试试求 20 + 10 的球员:
drop table if exists score20_rebs_or_assist_10;
create table score20_rebs_or_assist_10 as
select b.name, sum(b.score) as total_score, sum(b.total_rebs) as total_rebs, sum(b.assist) as total_assist, count(b.game_id) as total
from all_game a join game_detail b on a.game_id = b.game_id
where b.time != 0 and a.start_time >= "2017-10-17 00:00:00" and a.start_time <= "2018-04-14 00:00:00"
group by b.name
having (total_score/total >= 20 and total_rebs/total >= 10) or (total_score/total >= 20 and total_assist/total >= 10);
select * from score20_rebs_or_assist_10;
最终结果:
乔尔-恩比德 22.9 10.9 3.2
卡尔-安东尼-唐斯 21.3 12.4 2.4
安东尼-戴维斯 28.1 11.1 2.3
德马库斯-考辛斯 25.2 12.9 5.4
扬尼斯-阿德托昆博 26.9 10.1 4.8
拉塞尔-威斯布鲁克 25.4 10.1 10.2
换一种写法:求平均的写法
drop table if exists score20_rebs_or_assist_10_copy;
create table score20_rebs_or_assist_10_copy as
select * from (
select aa.name, format_number(aa.total_score/aa.total,1) as avg_score, format_number(aa.total_rebs/aa.total,1) as avg_rebs, format_number(aa.total_assist/aa.total,1) as avg_assist
from (
select b.name as name, sum(b.score) as total_score, sum(b.total_rebs) as total_rebs, sum(b.assist) as total_assist, count(b.game_id) as total
from all_game a join game_detail b on a.game_id = b.game_id
where b.time != 0 and a.start_time >= "2017-10-17 00:00:00" and a.start_time <= "2018-04-14 00:00:00"
group by b.name
) aa
) bb where (bb.avg_score >= 20 and bb.avg_rebs >= 10) or (bb.avg_score >= 20 and bb.avg_assist >= 10);
select * from score20_rebs_or_assist_10_copy;
最终结果:
乔尔-恩比德 22.9 10.9 3.2
卡尔-安东尼-唐斯 21.3 12.4 2.4
安东尼-戴维斯 28.1 11.1 2.3
德马库斯-考辛斯 25.2 12.9 5.4
扬尼斯-阿德托昆博 26.9 10.1 4.8
拉塞尔-威斯布鲁克 25.4 10.1 10.2
16、求出本赛季的唯一一位20+10+10(场均得分至少20,场均篮板至少10,场均助攻至少10)的球员的名字
drop table if exists score20_rebs10_assist10;
create table score20_rebs10_assist10 as
select * from (
select aa.name, format_number(aa.total_score/aa.total,1) as avg_score, format_number(aa.total_rebs/aa.total,1) as avg_rebs, format_number(aa.total_assist/aa.total,1) as avg_assist
from (
select b.name as name, sum(b.score) as total_score, sum(b.total_rebs) as total_rebs, sum(b.assist) as total_assist, count(b.game_id) as total
from all_game a join game_detail b on a.game_id = b.game_id
where b.time != 0 and a.start_time >= "2017-10-17 00:00:00" and a.start_time <= "2018-04-14 00:00:00"
group by b.name
) aa
) bb where bb.avg_score >= 20 and bb.avg_rebs >= 10 and bb.avg_assist >= 10;
select * from score20_rebs10_assist10;
最终结果:
塞尔-威斯布鲁克 25.4 10.1 10.2
13、求出每支球队的得分王
得分信息:game_detail
drop table if exists game_player_score;
create table game_player_score as
select a.team_id, a.name, sum(a.score) as sum_score, count(a.game_id) as total
from game_detail a join nba_game b on a.game_id = b.game_id
where a.time != 0 and b.start_time >= "2017-10-17 00:00:00" and b.start_time <= "2018-04-14 00:00:00"
group by a.team_id, a.name ;
select * from game_player_score;
game_player_score
team_id name sum_score total
drop table if exists team_max_score_player;
create table team_max_score_player as
select bb.team_name as team_name, aa.name as player_name, format_number(aa.sum_score/aa.total,2) as avg_score from (
select a.*, row_number() over (distribute by a.team_id sort by a.sum_score/a.total desc) as index
from game_player_score a
) aa join nba_team bb on aa.team_id = bb.team_id
where aa.index = 1
order by avg_score desc;
select * from team_max_score_player;
最终结果:
休斯顿火箭 詹姆斯-哈登 30.43
新奥尔良鹈鹕 安东尼-戴维斯 28.13
克利夫兰骑士 勒布朗-詹姆斯 27.45
波特兰开拓者 达米安-利拉德 26.88
密尔沃基雄鹿 扬尼斯-阿德托昆博 26.85
金州勇士 斯蒂芬-库里 26.39
俄克拉荷马城雷霆 拉塞尔-威斯布鲁克 25.35
菲尼克斯太阳 德文-布克 24.93
波士顿凯尔特人 凯里-欧文 24.43
圣安东尼奥马刺 拉马库斯-阿尔德里奇 23.13
印第安纳步行者 维克托-奥拉迪波 23.13
多伦多猛龙 德马尔-德罗赞 23.00
费城76人 乔尔-恩比德 22.94
纽约尼克斯 克里斯塔普斯-波尔津吉斯 22.67
华盛顿奇才 布拉德利-比尔 22.65
洛杉矶快船 布雷克-格里芬 22.61
明尼苏达森林狼 吉米-巴特勒 22.15
夏洛特黄蜂 肯巴-沃克 22.12
犹他爵士 多诺万-米切尔 20.46
孟菲斯灰熊 马尚-布鲁克斯 20.14
底特律活塞 布雷克-格里芬 19.84
亚特兰大老鹰 丹尼斯-施罗德 19.42
达拉斯独行侠 哈里森-巴恩斯 18.86
丹佛掘金 尼古拉-约基奇 18.47
布鲁克林篮网 林书豪 18.00
奥兰多魔术 埃文-富尼耶 17.77
迈阿密热火 戈兰-德拉季奇 17.28
芝加哥公牛 尼古拉-米罗蒂奇 16.80
洛杉矶湖人 朱利叶斯-兰德尔 16.13
萨克拉门托国王 扎克-兰多夫 14.53
15、求出对战季后赛球队胜率最高的5支队伍,并给出他们的胜率 和 整赛季胜率
17、求出本赛季三双(一场比赛中得分,篮板,助攻都超过10)次数最多的10位球员
drop table if exists player_3shuang_desc;
create table player_3shuang_desc as
select a.name, count(a.game_id) as total
from game_detail a
where a.score >= 10 and a.assist >= 10 and a.total_rebs >= 10
group by a.name
order by total desc;
select * from player_3shuang_desc;
最终结果:
拉塞尔-威斯布鲁克 25
勒布朗-詹姆斯 19
本-西蒙斯 11
尼古拉-约基奇 10
詹姆斯-哈登 4
德雷蒙德-格林 3
布雷克-格里芬 3
凯尔-洛瑞 3
德马库斯-考辛斯 3
尼古拉斯-巴图姆 2
拉简-隆多 2
朗佐-鲍尔 2
埃尔弗里德-佩顿 2
凯文-杜兰特 2
马克尔-富尔茨 1
马克-加索尔 1
贾莱特-杰克 1
特里-罗齐尔 1
格雷格-门罗 1
朱利叶斯-兰德尔 1
朱-霍勒迪 1
扬尼斯-阿德托昆博 1
尼古拉-武切维奇 1
埃里克-布莱索 1
克里斯-米德尔顿 1
保罗-加索尔 1
丹尼斯-史密斯 1
丹吉洛-拉塞尔 1
TJ-麦康奈尔 1
18、求出第四节翻盘次数最多的队伍(翻盘:前三节得分总和A队伍大于B队伍,但是A队伍却输掉了比赛,所以B队伍第四节翻盘成功)
简单来说:就是前三节领先,但是最后却输掉了比赛
drop table if exists nba_game_changgui_result;
create table nba_game_changgui_result as
select a.game_id, a.team_away_id, a.team_home_id,
a.team_away, a.team_home, a.score_away,
a.section_away, a.score_home , a.section_home,
if(a.score_away > a.score_home, "负", "胜") as result, a.start_time, a.game_time, a.audience, a.game_url
from nba_game a
where a.start_time >= "2017-10-17 00:00:00" and a.start_time <= "2018-04-14 00:00:00" ;
select * from nba_game_changgui_result limit 10;
drop table if exists nba_game_fanpan;
create table nba_game_fanpan as
select a.team_home, count(game_id) as total
from nba_game_changgui_result a
where a.result = "胜" and
cast(split(a.section_away, "-")[0] as int) + cast(split(a.section_away, "-")[1] as int) + cast(split(a.section_away, "-")[2] as int) >
cast(split(a.section_home, "-")[0] as int) + cast(split(a.section_home, "-")[1] as int) + cast(split(a.section_home, "-")[2] as int)
group by a.team_home
order by total desc limit 5;
select * from nba_game_fanpan;
最总结果:
掘金 10
马刺 8
快船 7
热火 7
魔术 6
验证结果:
select a.team_away, a.team_home,
cast(split(a.section_away, "-")[0] as int) + cast(split(a.section_away, "-")[1] as int) + cast(split(a.section_away, "-")[2] as int) as team_away_score3 ,
cast(split(a.section_home, "-")[0] as int) + cast(split(a.section_home, "-")[1] as int) + cast(split(a.section_home, "-")[2] as int) as team_home_score3 ,
a.result
from nba_game_copy_result a where a.team_home = "魔术";
19、求出每支球队都有多少场比赛属于每一节都领先的
自定义一个函数:
public class NBA_Section_CompareAll_UDF extends UDF{
/**
* 日期: 2018年4月17日 上午10:13:42
*
* 描述:
* 第一个参数:str1 = score_away
* 第一个参数:str2 = score_home
*/
public boolean evaluate(String str1, String str2, String splitStr){
String[] split_array1 = str1.split(splitStr);
String[] split_array2 = str2.split(splitStr);
int size = split_array1.length;
boolean result = true; // 默认为true表示 第一个数组的 所有 元素都 小于 第二个 数组的 元素
for(int i=0; i<size; i++){
// 如果 对应位置上的 元素 , 第一个 大于 第二个
if(Integer.parseInt(split_array1[i]) > Integer.parseInt(split_array2[i])){
return false;
}
}
return result;
}
}
add jar /home/hadoop/udf.jar;
create temporary function compare_score as "com.ghgj.hive.udf.NBA_Section_CompareAll_UDF";
drop table if exists game_all_section_gt;
create table game_all_section_gt as
select a.team_home, count(*) as total
from nba_game_changgui_result a
where compare_score(a.score_away, a.score_home, "-")
group by a.team_home
order by total desc ;
select * from game_all_section_gt;
最终结果:
火箭 34
猛龙 34
马刺 33
掘金 31
森林狼 30
76人 30
骑士 29
勇士 29
开拓者 28
爵士 28
凯尔特人 27
雷霆 27
步行者 27
热火 26
雄鹿 25
活塞 25
鹈鹕 24
奇才 23
快船 22
黄蜂 21
湖人 20
尼克斯 19
魔术 17
公牛 17
灰熊 16
老鹰 16
独行侠 15
篮网 15
国王 14
太阳 10
20、求出助攻失误比最高的10名球员(助攻失误比 = 场均助攻 / 场均失误)
21、求背靠背比赛中胜率最高的前5支队伍(球队名称,背靠背场次,胜场次,背靠背比赛胜率)
22、求出NBA的每支队伍的最长连胜(队伍名称,最长连胜的场次)
23、求出最擅长打主场和客场的队伍(队伍名称,总主场数/总客场数,胜场数)
24、求出有某月全胜的队伍及月份(球队名称,某月,总比赛场次)
25、求出所有连续得分上双的场次超过50的球员(球员,场次)
26、求出所有的全勤球员,并且按照平均出勤时间排序(整个赛季的82场比赛都出场的球员)
27、求出每个位置上效率最高的球员组成一个最佳阵容(位置,球员姓名,效率)
28、