NBA Player Scoring Prediction with Linear Regression, KNN Regression, Decision Tree Regression, and Random Forest Regression


Master_Wrong


Article link: https://blog.csdn.net/python_Daddy/article/details/138340259


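Preface

In the NBA, predicting each player's scoring matters a great deal in basketball analytics. Points scored is a key performance indicator: it lets coaches, analysts, and fans assess a player's scoring ability and overall contribution to the team's offense. Understanding a player's scoring potential supports in-game strategy, player selection, and talent scouting. This report explores basketball data analysis and uses machine learning techniques to predict each player's total points.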

Regression models used for prediction:

Linear regression
KNN regressor
Decision tree regressor
Random forest regressor

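Applying these regression models shows how well each one predicts player scoring and allows their predictive power to be compared. The comparison weighs the practical strengths and weaknesses of each model and identifies the one best suited to predicting player points on this particular dataset.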
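The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data generated with `make_regression`, not the NBA table, and the hyperparameters (`n_neighbors=5`, `n_estimators=100`, the 80/20 split) are assumptions chosen for the example rather than settings taken from the article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the player table: 300 rows, 5 numeric features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four model families named above, with illustrative hyperparameters.
models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Print the models from best to worst R^2.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair; the ranking on real player data may differ from what this synthetic, purely linear target produces.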

1. Dataset Overview

2023_nba_player_stats.csv
This dataset contains the statistics for all NBA players in 2023. The column abbreviations mean the following:

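Abbreviation	Meaning
PName	Player name
POS	Position
Team	Team
Age	Age
GP	Games played
W	Wins
L	Losses
Min	Minutes played
PTS	Total points
FGM	Field goals made
FGA	Field goals attempted
FG%	Field goal percentage
3PM	Three-pointers made
3PA	Three-pointers attempted
3P%	Three-point percentage
FTM	Free throws made
FTA	Free throws attempted
FT%	Free throw percentage
OREB	Offensive rebounds
DREB	Defensive rebounds
REB	Total rebounds
AST	Assists
TOV	Turnovers
STL	Steals
BLK	Blocks
PF	Personal fouls
FP	Fantasy points
DD2	Double-doubles
TD3	Triple-doubles
+/-	Plus-minus total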

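Here, fantasy points (FP) refers to the total regular-season points each player scores in simulated team games in NBA 2K23. The remaining columns are standard basketball terms and are not explained further.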

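2. Importing Libraries

For data analysis and processing, the libraries needed for data manipulation and visualization are imported in the PyCharm editor.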

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Multi-line imports need parentheses; a trailing comma alone is a syntax error.
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    cross_val_score,
)
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    r2_score,
)

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

import warnings

warnings.filterwarnings("ignore")

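Python has a rich library ecosystem. On the data side, this report imports pandas for data analysis and processing, numpy for numerical computation, matplotlib for data visualization, and plotly for interactive visualization. On the model side, it imports the regressors from linear_model, neighbors, tree, and ensemble, which implement linear regression, K-nearest-neighbors regression, decision tree regression, and random forest regression respectively. The warnings module is also imported to suppress warning messages in the console.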

3. Reading the Dataset

3.1 Reading the Data

pandas reads the CSV file and returns the data as a DataFrame.

# Read the NBA player stats CSV file into a DataFrame
df = pd.read_csv('E:\\数据文件\\2023_nba_player_stats.csv')
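As a small self-contained illustration of what `pd.read_csv` returns, the sketch below parses two made-up rows from an in-memory string instead of the file on disk. The player names, team codes, and numbers are invented for the example; only the column abbreviations come from the table above.

```python
import io

import pandas as pd

# Two made-up rows using a few of the column abbreviations described earlier.
csv_text = (
    "PName,POS,Team,Age,PTS\n"
    "Player A,SG,AAA,25,1200\n"
    "Player B,,BBB,31,640\n"
)

# read_csv accepts a path or any file-like object; here an in-memory buffer.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)
print(sample.dtypes)
print(sample)
```

The empty POS field in the second row is parsed as NaN, which is how missing positions show up in the real dataset as well.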

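3.2 Exploring the Dataset

Find the number of rows and columns
Rename the columns
Load the basic dataset information
Compute descriptive statistics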

# Number of rows and columns in the dataset
row, col = df.shape
print("This dataset has", row, "rows and", col, "columns.")
print("Number of duplicate rows:", df.duplicated().sum())

This dataset has 539 rows and 30 columns.
Number of duplicate rows: 0

The dataset has 539 rows and 30 columns, with 0 fully duplicated rows.

df.rename(columns={
    'PName': 'Player_Name',
    'POS': 'Position',
    'Team': 'Team_Abbreviation',
    'Age': 'Age',
    'GP': 'Games_Played',
    'W': 'Wins',
    'L': 'Losses',
    'Min': 'Minutes_Played',
    'PTS': 'Total_Points',
    'FGM': 'Field_Goals_Made',
    'FGA': 'Field_Goals_Attempted',
    'FG%': 'Field_Goal_Percentage',
    '3PM': 'Three_Point_FG_Made',
    '3PA': 'Three_Point_FG_Attempted',
    '3P%': 'Three_Point_FG_Percentage',
    'FTM': 'Free_Throws_Made',
    'FTA': 'Free_Throws_Attempted',
    'FT%': 'Free_Throw_Percentage',
    'OREB': 'Offensive_Rebounds',
    'DREB': 'Defensive_Rebounds',
    'REB': 'Total_Rebounds',
    'AST': 'Assists',
    'TOV': 'Turnovers',
    'STL': 'Steals',
    'BLK': 'Blocks',
    'PF': 'Personal_Fouls',
    'FP': 'NBA_Fantasy_Points',
    'DD2': 'Double_Doubles',
    'TD3': 'Triple_Doubles',
    '+/-': 'Plus_Minus'
}, inplace=True)

The abbreviated column names in the raw data are renamed to their full names.
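One thing worth knowing about `rename` is that a key which matches no column is skipped silently by default, so a case mismatch such as 'Pos' versus 'POS' would go unnoticed. The sketch below uses a made-up three-column frame to show the default behaviour and the `errors="raise"` option, which turns the mismatch into a `KeyError`.

```python
import pandas as pd

# Made-up frame with three of the abbreviated column names.
sample = pd.DataFrame({"PName": ["Player A"], "POS": ["SG"], "PTS": [1200]})

# 'Pos' matches no column, so by default it is ignored without any warning.
renamed = sample.rename(columns={"Pos": "Position", "PTS": "Total_Points"})
print(list(renamed.columns))

# With errors="raise", the same mismatch surfaces as a KeyError.
try:
    sample.rename(columns={"Pos": "Position"}, errors="raise")
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
print(mismatch_detected)
```

Passing `errors="raise"` during development is a cheap way to confirm that every key in a long mapping actually matches a column.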

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539 entries, 0 to 538
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Player_Name                539 non-null    object 
 1   Position                   534 non-null    object 
 2   Team_Abbreviation          539 non-null    object 
 3   Age                        539 non-null    int64  
 4   Games_Played               539 non-null    int64  
 5   Wins                       539 non-null    int64  
 6   Losses                     539 non-null    int64  
 7   Minutes_Played             539 non-null    float64
 8   Total_Points               539 non-null    int64  
 9   Field_Goals_Made           539 non-null    int64  
 10  Field_Goals_Attempted      539 non-null    int64  
 11  Field_Goal_Percentage      539 non-null    float64
 12  Three_Point_FG_Made        539 non-null    int64  
 13  Three_Point_FG_Attempted   539 non-null    int64  
 14  Three_Point_FG_Percentage  539 non-null    float64
 15  Free_Throws_Made           539 non-null    int64  
 16  Free_Throws_Attempted      539 non-null    int64  
 17  Free_Throw_Percentage      539 non-null    float64
 18  Offensive_Rebounds         539 non-null    int64  
 19  Defensive_Rebounds         539 non-null    int64  
 20  Total_Rebounds             539 non-null    int64  
 21  Assists                    539 non-null    int64  
 22  Turnovers                  539 non-null    int64  
 23  Steals                     539 non-null    int64  
 24  Blocks                     539 non-null    int64  
 25  Personal_Fouls             539 non-null    int64  
 26  NBA_Fantasy_Points         539 non-null    int64  
 27  Double_Doubles             539 non-null    int64  
 28  Triple_Doubles             539 non-null    int64  
 29  Plus_Minus                 539 non-null    int64  
dtypes: float64(4), int64(23), object(3)
memory usage: 126.5+ KB

The summary shows that every column is complete except Position, which has 534 non-null values out of 539 and therefore 5 missing entries.
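Missing values can also be counted directly rather than read off the `info()` table. The following sketch uses a small made-up frame; on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up frame: Position has two gaps, Age has none.
sample = pd.DataFrame(
    {
        "Position": ["SG", np.nan, "C", np.nan],
        "Age": [25, 31, 22, 28],
    }
)

# isna() marks each missing cell; sum() counts them per column.
missing = sample.isna().sum()

# Show only the columns that actually have missing values.
print(missing[missing > 0])
```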

print(df.describe(include=np.number))
print(df.describe(include='object'))

              Age  Games_Played  ...  Triple_Doubles  Plus_Minus
count  539.000000    539.000000  ...      539.000000  539.000000
mean    25.970315     48.040816  ...        0.220779    0.000000
std      4.315513     24.650686  ...        1.564432  148.223909
min     19.000000      1.000000  ...        0.000000 -642.000000
25%     23.000000     30.500000  ...        0.000000  -70.000000
50%     25.000000     54.000000  ...        0.000000   -7.000000
75%     29.000000     68.000000  ...        0.000000   57.000000
max     42.000000     83.000000  ...       29.000000  640.000000

[8 rows x 27 columns]
         Player_Name Position Team_Abbreviation
count            539      534               539
unique           539        7                30
top     Jayson Tatum       SG               DAL
freq               1       96                21

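The descriptive statistics show that the most frequent position is SG (shooting guard), which appears 96 times. There are 7 distinct positions: PG, SG, SF, PF, and C, plus the less specific G and F, where PG and SG fall under G (guard) and SF and PF fall under F (forward). The average player age is about 26, the youngest player is 19, and the oldest is 42.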

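3.3 Data Visualization

Exploring the dataset showed that the Position column contains missing values, and the descriptive statistics showed that SG is the most frequent value in that column. The missing entries are therefore filled with SG.

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Position'] = df['Position'].fillna('SG')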
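Rather than hard-coding 'SG', the most frequent value can be looked up with `mode()` and used as the fill value. This is a sketch on a made-up column, offered as an alternative way to express the same imputation, not as the article's own code.

```python
import numpy as np
import pandas as pd

# Made-up Position column with two missing entries.
sample = pd.DataFrame({"Position": ["SG", "PG", "SG", np.nan, "C", np.nan]})

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_common = sample["Position"].mode()[0]

# Fill the gaps with that value and assign the result back to the column.
sample["Position"] = sample["Position"].fillna(most_common)

print(most_common)
print(sample["Position"].value_counts())
```

Mode imputation is simple but inflates the count of the most common category, so it fits best when only a handful of rows are missing, as is the case here.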

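After that, the data can be grouped by player position for visualization. The views include average total points per position, a histogram of player ages, and scatter plots, grouped by position, of player age against total points, field goal percentage, and total assists.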

position_stats = df.groupby(['Position']).agg
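The per-position summaries described above could be computed along the following lines. This is a hedged sketch on made-up rows: the variable `position_summary` and the chosen aggregations (`mean_points`, `mean_age`, `players`) are assumptions for illustration and are not taken from the article's listing.

```python
import pandas as pd

# Made-up rows using the renamed column names from section 3.2.
sample = pd.DataFrame(
    {
        "Position": ["SG", "SG", "C", "C", "PG"],
        "Total_Points": [1200, 800, 900, 500, 1500],
        "Age": [25, 31, 22, 28, 27],
    }
)

# Named aggregation: one output column per (source column, function) pair.
position_summary = (
    sample.groupby("Position")
    .agg(
        mean_points=("Total_Points", "mean"),
        mean_age=("Age", "mean"),
        players=("Total_Points", "size"),
    )
    .sort_values("mean_points", ascending=False)
)

print(position_summary)
```

A table in this shape can be passed straight to a bar chart, with the index as the position axis and `mean_points` as the bar height.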
