Download link for this dataset; readers can download it themselves.
WeChat official account (you can reach me there): 蓝皮怪的数据坊
Zhihu: Zhihu ID 蓝皮怪
CSDN: CSDN 蓝皮怪
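1. Project Background
As NBA competition intensifies and global attention grows, players' season statistics have become an important basis on which fans, analysts, and team decision-makers evaluate player ability and game trends. To better understand how players perform over a season and what they contribute to their teams, this project analyzes on-court data for NBA players. It aims to reveal patterns and trends in player performance, explore the characteristics of different player types, and provide data support for roster selection, tactical planning, and forecasts for future seasons.
2. Data Description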
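Field | Description |
---|---|
url | URL of the player's statistics page |
player_name | Player name |
player_games_played | Number of games the player appeared in |
player_games_started | Number of games the player started |
player_minutes_per_game | Average minutes played per game |
player_points_per_game | Average points per game |
player_offensive_rebounds__per_game | Average offensive rebounds per game (the column name in the file contains a double underscore) |
player_defensive_rebounds_per_game | Average defensive rebounds per game |
player_rebounds_per_game | Average total rebounds per game |
player_assists_per_game | Average assists per game |
player_steals_per_game | Average steals per game |
player_blocks_per_game | Average blocks per game |
player_turnovers_per_game | Average turnovers per game |
player_fouls_per_game | Average fouls per game |
player_assist_to_turnover_ratio | Assist-to-turnover ratio |
team | Team name |
season_type | Season type (e.g. regular season, postseason) |
season_year | Season year |
timestamp | Data timestamp (identifies when the data was recorded) |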
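Column names in raw exports are not always consistent. In this dataset, `data.info()` reports `player_offensive_rebounds__per_game` with a double underscore. The following is a minimal sketch, assuming you would like to normalize names before analysis. The helper `normalize_columns` and the tiny `demo` frame are hypothetical and not part of the original notebook, which keeps the raw names throughout.

```python
import re

import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-cased column names and repeated underscores collapsed."""
    cleaned = [re.sub(r"_+", "_", str(col).strip().lower()) for col in df.columns]
    return df.rename(columns=dict(zip(df.columns, cleaned)))


# Hypothetical one-row frame that mimics two naming quirks
demo = pd.DataFrame({"URL": ["x"], "player_offensive_rebounds__per_game": [0.3]})
normalized = normalize_columns(demo)
print(list(normalized.columns))
```

If you apply something like this to the real data, every later reference to the double-underscore name would need to be updated as well.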
3. Python Library Imports and Data Loading
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
data = pd.read_csv("/home/mw/input/11251956/NBA players' stats.csv")
4. Data Preview and Preprocessing
print('Data info:')
data.info()
Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 url 1000 non-null object
1 player_name 1000 non-null object
2 player_games_played 1000 non-null int64
3 player_games_started 1000 non-null int64
4 player_minutes_per_game 1000 non-null float64
5 player_points_per_game 1000 non-null float64
6 player_offensive_rebounds__per_game 1000 non-null float64
7 player_defensive_rebounds_per_game 1000 non-null float64
8 player_rebounds_per_game 1000 non-null float64
9 player_assists_per_game 1000 non-null float64
10 player_steals_per_game 1000 non-null float64
11 player_blocks_per_game 1000 non-null float64
12 player_turnovers_per_game 1000 non-null float64
13 player_fouls_per_game 1000 non-null float64
14 player_assist_to_turnover_ratio 1000 non-null float64
15 team 1000 non-null object
16 season_type 1000 non-null object
17 season_year 1000 non-null object
18 timestamp 1000 non-null int64
dtypes: float64(11), int64(3), object(5)
memory usage: 148.6+ KB
print(f'Number of duplicate rows: {data.duplicated().sum()}')
Number of duplicate rows: 0
# timestamp is included because the dataset author does not document it and it may be time-related.
characteristic = data.select_dtypes(include=['object']).columns.tolist() + ['timestamp']
print('Unique values for the selected variables:')
for i in characteristic:
    print(f'{i}:')
    print(f'Unique values: {len(data[i].unique())}')
    print('-'*50)
Unique values for the selected variables:
url:
Unique values: 1000
--------------------------------------------------
player_name:
Unique values: 754
--------------------------------------------------
team:
Unique values: 33
--------------------------------------------------
season_type:
Unique values: 3
--------------------------------------------------
season_year:
Unique values: 36
--------------------------------------------------
timestamp:
Unique values: 3
--------------------------------------------------
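The per-column loop above can also be written as a single call. This is a minimal sketch on a hypothetical three-row frame (`demo` is made up for illustration), using `DataFrame.nunique()`. One difference worth knowing: `nunique()` ignores missing values by default, whereas `len(col.unique())` counts `NaN` as a value. The two agree here because the dataset has no nulls.

```python
import pandas as pd

# Hypothetical miniature of the dataset
demo = pd.DataFrame({
    "player_name": ["A", "A", "B"],
    "team": ["CHI", "NY", "DAL"],
    "timestamp": [45606, 45606, 45611],
})

# Number of distinct values per column, smallest first
unique_counts = demo.nunique().sort_values()
print(unique_counts)
```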
data.head()
| | url | player_name | player_games_played | player_games_started | player_minutes_per_game | player_points_per_game | player_offensive_rebounds__per_game | player_defensive_rebounds_per_game | player_rebounds_per_game | player_assists_per_game | player_steals_per_game | player_blocks_per_game | player_turnovers_per_game | player_fouls_per_game | player_assist_to_turnover_ratio | team | season_type | season_year | timestamp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.espn.com/nba/player/_/id/2377/chri... | Chr███Duh███ | 82 | 73 | 26.5 | 5.9 | 0.3 | 2.3 | 2.6 | 4.9 | 1.0 | 0.0 | 1.5 | 2.5 | 3.3 | CHI | Not clear | 2004-05 | 45606 |
1 | https://www.espn.com/nba/player/_/id/3934672/j... | Jal███Bru███n | 73 | 38 | 21.8 | 9.3 | 0.3 | 2.0 | 2.3 | 3.2 | 0.5 | 0.1 | 1.2 | 1.7 | 2.7 | DAL | Not clear | 2018-19 | 45606 |
2 | https://www.espn.com/nba/player/_/id/2011/kyle... | Kyl███orv███ | 16 | 0 | 15.7 | 6.8 | 0.0 | 1.8 | 1.8 | 1.1 | 0.2 | 0.1 | 1.1 | 1.4 | 1.0 | CLE | Regular season | 2018-19 | 45606 |
3 | https://www.espn.com/nba/player/stats/_/id/456... | Fra███Wag███ | 79 | 79 | 30.7 | 15.2 | 1.1 | 3.4 | 4.5 | 2.9 | 0.9 | 0.4 | 1.5 | 2.1 | 1.9 | ORL | Not clear | 2021-22 | 45606 |
4 | https://www.espn.com/nba/player/_/id/167/austi... | Aus███ Cr███ere███ | 26 | 0 | 9.3 | 2.9 | 0.4 | 1.3 | 1.7 | 0.3 | 0.3 | 0.2 | 0.5 | 1.2 | 0.6 | IND | Not clear | 1997-98 | 45606 |
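Values in player_name repeat because the same player appears in several seasons. The column has also been anonymized, with most characters replaced by "█", so repeated values are expected. Each player's real name can be looked up on the official site through the corresponding URL.
That said, the anonymization is only partial. Some names can be inferred from the characters that remain visible, and every URL ends with the player's name. For example, https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY tells us directly that the player is Chris Duhon, so the real names can be recovered by extracting them from the URL.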
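# Define a function that extracts the player's name from the URL
def extract_player_name_from_url(url):
    # First try URLs where the name slug is followed by a query string or the end of the URL
    match = re.search(r'\/id\/\d+\/([a-z-]+)(?:\?|$)', url)
    if match:
        return match.group(1).replace('-', ' ').title()  # Replace hyphens with spaces and capitalize each word
    # Otherwise try a looser form, e.g. URLs like '/stats/_/id/...'
    match_stats = re.search(r'\/id\/\d+\/([a-z-]+)', url)
    if match_stats:
        return match_stats.group(1).replace('-', ' ').title()  # Replace hyphens with spaces and capitalize each word
    return None  # Return None if the URL contains no matching name

# Apply the function to extract names and overwrite the 'player_name' column
data['player_name'] = data['url'].apply(extract_player_name_from_url)
print(f"Rows whose processed name still contains a mask character: {data['player_name'].str.contains('█', na=False).sum()}")
print(f"Player names that are None or NaN: {data['player_name'].isna().sum()}")
Rows whose processed name still contains a mask character: 0
Player names that are None or NaN: 0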
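To see the extraction in isolation, here is a small self-contained sketch that runs a single pattern on the example URL quoted earlier. The names `SLUG_PATTERN` and `name_from_url` are hypothetical, and the second and third URLs are made-up test inputs, not rows from the dataset.

```python
import re
from typing import Optional

# The name slug is the path segment right after /id/<digits>/
SLUG_PATTERN = re.compile(r"/id/\d+/([a-z-]+)")


def name_from_url(url: str) -> Optional[str]:
    """Turn 'chris-duhon' style slugs into 'Chris Duhon'; return None if no slug is found."""
    match = SLUG_PATTERN.search(url)
    if match is None:
        return None
    return match.group(1).replace("-", " ").title()


examples = [
    "https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY",
    "https://www.espn.com/nba/player/stats/_/id/0000/some-player",  # hypothetical URL
    "https://www.espn.com/nba/",  # hypothetical URL without a slug
]
results = [name_from_url(u) for u in examples]
print(results)
```

One caveat: `str.title()` capitalizes only the first letter of each word, so internal capitals and punctuation that the slug does not carry cannot be restored. The result is a consistent label for grouping rather than an exact spelling of the name.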
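OK, the names now look fully restored and the real player names are shown. Next, look at the features that have very few distinct values to check whether they carry any useful information.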
data['season_type'].value_counts(normalize=True) * 100
season_type
Not clear 80.9
Regular season 11.2
Postseason 7.9
Name: proportion, dtype: float64
data['timestamp'].value_counts(normalize=True) * 100
timestamp
45606 96.7
45611 3.0
45607 0.3
Name: proportion, dtype: float64
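In the season_type column, 80.9% of the values are Not clear, meaning those records have no explicit season type. The timestamp column shows no apparent meaning, and a single value, 45606, accounts for 96.7% of it. Both columns are therefore dropped, along with url.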
data = data.drop(columns=['url','season_type','timestamp'])
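The decision above rests on how strongly one value dominates a column. The sketch below computes that share for every column at once, assuming that is a useful screening signal. `dominant_share` and the ten-row `demo` frame are hypothetical. A high share is only a hint; whether a column is really uninformative still needs a judgment call.

```python
import pandas as pd


def dominant_share(df: pd.DataFrame) -> pd.Series:
    """Share (between 0 and 1) of the most frequent value in each column."""
    # value_counts sorts by frequency in descending order, so the first entry is the largest
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])


# Hypothetical ten-row frame with two heavily skewed columns
demo = pd.DataFrame({
    "season_type": ["Not clear"] * 8 + ["Regular season", "Postseason"],
    "timestamp": [45606] * 9 + [45611],
    "team": ["CHI", "DAL", "CLE", "ORL", "IND", "NY", "LAL", "BOS", "MIA", "PHX"],
})
shares = dominant_share(demo)
print(shares)
```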
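season_year also needs processing. At first I assumed the format was year-month, but it is actually year-year: 2004-05, for example, denotes the season that began in 2004. So first check whether any season spans more than one year.
# Split the season into start and end years, stored as two columns
data[['season_start', 'season_end']] = data['season_year'].str.split('-', expand=True)
# Convert both parts to numbers
data['season_start'] = pd.to_numeric(data['season_start'])
data['season_end'] = pd.to_numeric(data['season_end'])
# Compute the span. Note: season_end is a two-digit year (5 for '05') while season_start has four digits,
# so this difference is negative for every row and the > 1 test below cannot flag anything.
data['season_span'] = data['season_end'] - data['season_start']
print(f"Seasons spanning more than 1 year: {len(data[data['season_span'] > 1])}")
Seasons spanning more than 1 year: 0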
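Because the end year is written with two digits, subtracting the four-digit start year from it always gives a large negative number, so a `> 1` test on that difference is vacuous. A stricter check is sketched below, assuming every season string has the form `YYYY-YY`. The `seasons` series is hypothetical, and its last entry is deliberately malformed to show what the check would catch.

```python
import pandas as pd

# Hypothetical sample; "2004-06" is intentionally wrong
seasons = pd.Series(["2004-05", "2018-19", "1999-00", "2004-06"])
parts = seasons.str.split("-", expand=True)
start = pd.to_numeric(parts[0])
end_two_digit = pd.to_numeric(parts[1])

# Naive difference (two-digit end minus four-digit start) is always far below zero
naive_span = end_two_digit - start

# Stricter check: the two-digit end must equal (start + 1) modulo 100,
# which also handles the century rollover in "1999-00"
is_consecutive = end_two_digit == (start + 1) % 100
bad_seasons = seasons[~is_consecutive].tolist()
print(bad_seasons)
```

Running the same comparison on the real `season_year` column would confirm, rather than assume, that every season covers exactly two consecutive years.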
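With that confirmed, keep only the start year.
# Replace season_year with season_start
data['season_year'] = data['season_start']
# Drop the season_span, season_start and season_end columns
data = data.drop(columns=['season_span','season_start', 'season_end'])
print(f'Duplicate rows after processing: {data.duplicated().sum()}')
Duplicate rows after processing: 167
These duplicates need to be removed. Every url was unique before processing, yet a large number of duplicate rows remain afterwards. That shows they are the same player's data for the same season with identical values for every metric, so they are dropped.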
# Drop the duplicate rows
data = data.drop_duplicates()
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 833 entries, 0 to 999
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 player_name 833 non-null object
1 player_games_played 833 non-null int64
2 player_games_started 833 non-null int64
3 player_minutes_per_game 833 non-null float64
4 player_points_per_game 833 non-null float64
5 player_offensive_rebounds__per_game 833 non-null float64
6 player_defensive_rebounds_per_game 833 non-null float64
7 player_rebounds_per_game 833 non-null float64
8 player_assists_per_game 833 non-null float64
9 player_steals_per_game 833 non-null float64
10 player_blocks_per_game 833 non-null float64
11 player_turnovers_per_game 833 non-null float64
12 player_fouls_per_game 833 non-null float64
13 player_assist_to_turnover_ratio 833 non-null float64
14 team 833 non-null object
15 season_year 833 non-null int64
dtypes: float64(11), int64(3), object(2)
memory usage: 110.6+ KB
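The deduplication step above removes rows that are identical in every remaining column. A related question is whether any rows still share the same player, team, and season but differ in their statistics. That can happen once season_type has been dropped, since regular-season and postseason rows would no longer be distinguishable. Below is a minimal sketch of that check. The `demo` frame and the choice of `key` columns are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Hypothetical rows: one exact duplicate and one same-key pair with different stats
demo = pd.DataFrame({
    "player_name": ["Chris Duhon", "Chris Duhon", "Kyle Korver", "Kyle Korver"],
    "team": ["CHI", "CHI", "CLE", "CLE"],
    "season_year": [2004, 2004, 2018, 2018],
    "player_points_per_game": [5.9, 5.9, 6.8, 7.1],
})
key = ["player_name", "team", "season_year"]

exact_dupes = int(demo.duplicated().sum())   # rows identical in every column
deduped = demo.drop_duplicates()

# Rows that share a key after exact duplicates are gone, i.e. conflicting records
conflicts = deduped[deduped.duplicated(subset=key, keep=False)]
print(exact_dupes, len(deduped), len(conflicts))
```

If the `conflicts` frame is non-empty on the real data, those rows would need a decision (keep both, aggregate, or pick one) before any per-player-season analysis.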
data.head()
| | player_name | player_games_played | player_games_started | player_minutes_per_game | player_points_per_game | player_offensive_rebounds__per_game | player_defensive_rebounds_per_game | player_rebounds_per_game | player_assists_per_game | player_steals_per_game | player_blocks_per_game | player_turnovers_per_game | player_fouls_per_game | player_assist_to_turnover_ratio | team | season_year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Chris Duhon | 82 | 73 | 26.5 | 5.9 | 0.3 | 2.3 | 2.6 | 4.9 | 1.0 | 0.0 | 1.5 | 2.5 | 3.3 | CHI | 2004 |
1 | Jalen Brunson | 73 | 38 | 21.8 | 9.3 | 0.3 | 2.0 | 2.3 | 3.2 | 0.5 | 0.1 | 1.2 | 1.7 | 2.7 | DAL | 2018 |
2 | Kyle Korver | 16 | 0 | 15.7 | 6.8 | 0.0 | 1.8 | 1.8 | 1.1 | 0.2 | 0.1 | 1.1 | 1.4 | 1.0 | CLE | 2018 |
3 | Franz Wagner | 79 | 79 | 30.7 | 15.2 | 1.1 | 3.4 | 4.5 | 2.9 | 0.9 | 0.4 | 1.5 | 2.1 | 1.9< |