Predicting Career Performance from Rookie Performance

As a huge t-wolves fan, I’ve been curious all year about what we can infer from Karl-Anthony Towns’ great rookie season. To answer this question, I’ve created a simple linear regression model that uses rookie-year performance to predict career performance.

Many have attempted to predict NBA players’ success via regression-style approaches. Notable models I know of include Layne Vashro’s model, which uses combine and college performance to predict career performance. Layne Vashro’s model is a quasi-Poisson GLM. I tried a similar approach, but had the most success when using ws/48 and OLS. I will discuss this a little more at the end of the post.

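The analysis below leans on the standard scientific Python stack. Here is a minimal setup sketch; the import aliases are assumptions, chosen to match how the later code refers to pandas, numpy, matplotlib, and statsmodels.

# Minimal setup sketch (assumed aliases; the later code uses pd, np, plt, and sm)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm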

I collected all the data for this project from basketball-reference.com. I posted the functions for collecting the data on my github. The data is also posted there. Beware, the data collection scripts take a while to run.

This data includes per-36-minute stats and advanced statistics such as usage percentage. I simply took all the per-36 and advanced statistics from a player’s page on basketball-reference.com.

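A minimal sketch of loading these tables, assuming the pickled dataframes produced by the scraping scripts (the file names below match the ones read later in the post):

# Load the scraped tables: one row per player, career stats and rookie-year stats.
df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_15.pkl')
rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl')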

The variable I am trying to predict is average WS/48 over a player’s career. There’s no perfect box-score statistic when it comes to quantifying a player’s performance, but ws/48 seems relatively solid.

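A quick sketch of how that distribution can be inspected, assuming the career dataframe loaded above (Y is the name the later prediction plot uses for the target):

# Target variable: average WS/48 over each player's career.
Y = df['WS/48']
plt.hist(Y.dropna(), bins=30)
plt.xlabel('Career WS/48')
plt.ylabel('Number of players');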

The predicted variable looks pretty Gaussian, so I can use ordinary least squares. This will be nice because while OLS is not flexible, it’s highly interpretable. At the end of the post I’ll mention some more complex models that I will try.

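A sketch of that pruning step, assuming the rookie table loaded above; the dropped columns mirror the ones removed before the per-group regressions later in the post, and rookie_df_drop / X_rookie are the names the later code expects:

# Drop identifying columns from the rookie-year table and pull out a predictor matrix.
rookie_df_drop = rookie_df.drop(['Year','Career Games','Name'],1)
X_rookie = rookie_df_drop.as_matrix() #rookie-year predictors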

Above, I remove some predictors from the rookie data. Let’s run the regression!

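A sketch of the fit itself, assuming the target and predictor matrix defined above are aligned one row per player (estAll is the name used by the prediction plot further down):

# Regress career WS/48 on rookie-year stats with ordinary least squares.
X_rookie = sm.add_constant(X_rookie) #add an intercept term
estAll = sm.OLS(Y, X_rookie).fit()
print(estAll.summary())
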
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  WS/48   R-squared:                       0.476
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     31.72
Date:                Sun, 20 Mar 2016   Prob (F-statistic):          2.56e-194
Time:                        15:29:43   Log-Likelihood:                 3303.9
No. Observations:                1690   AIC:                            -6512.
Df Residuals:                    1642   BIC:                            -6251.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2509      0.078      3.223      0.001         0.098     0.404
x1            -0.0031      0.001     -6.114      0.000        -0.004    -0.002
x2            -0.0004   9.06e-05     -4.449      0.000        -0.001    -0.000
x3            -0.0003   8.12e-05     -3.525      0.000        -0.000    -0.000
x4          1.522e-05   4.73e-06      3.218      0.001      5.94e-06  2.45e-05
x5             0.0030      0.031      0.096      0.923        -0.057     0.063
x6             0.0109      0.019      0.585      0.559        -0.026     0.047
x7            -0.0312      0.094     -0.331      0.741        -0.216     0.154
x8             0.0161      0.027      0.594      0.553        -0.037     0.069
x9            -0.0054      0.018     -0.292      0.770        -0.041     0.031
x10            0.0012      0.007      0.169      0.866        -0.013     0.015
x11            0.0136      0.023      0.592      0.554        -0.031     0.059
x12           -0.0099      0.018     -0.538      0.591        -0.046     0.026
x13            0.0076      0.054      0.141      0.888        -0.098     0.113
x14            0.0094      0.012      0.783      0.433        -0.014     0.033
x15            0.0029      0.002      1.361      0.174        -0.001     0.007
x16            0.0078      0.009      0.861      0.390        -0.010     0.026
x17           -0.0107      0.019     -0.573      0.567        -0.047     0.026
x18           -0.0062      0.018     -0.342      0.732        -0.042     0.029
x19            0.0095      0.017      0.552      0.581        -0.024     0.043
x20            0.0111      0.004      2.853      0.004         0.003     0.019
x21            0.0109      0.018      0.617      0.537        -0.024     0.046
x22           -0.0139      0.006     -2.165      0.030        -0.026    -0.001
x23            0.0024      0.005      0.475      0.635        -0.008     0.012
x24            0.0022      0.001      1.644      0.100        -0.000     0.005
x25           -0.0125      0.012     -1.027      0.305        -0.036     0.011
x26           -0.0006      0.000     -1.782      0.075        -0.001  5.74e-05
x27           -0.0011      0.001     -1.749      0.080        -0.002     0.000
x28            0.0012      0.003      0.487      0.626        -0.004     0.006
x29            0.1824      0.089      2.059      0.040         0.009     0.356
x30           -0.0288      0.025     -1.153      0.249        -0.078     0.020
x31           -0.0128      0.011     -1.206      0.228        -0.034     0.008
x32           -0.0046      0.008     -0.603      0.547        -0.020     0.010
x33           -0.0071      0.005     -1.460      0.145        -0.017     0.002
x34            0.0131      0.012      1.124      0.261        -0.010     0.036
x35           -0.0023      0.001     -2.580      0.010        -0.004    -0.001
x36           -0.0077      0.013     -0.605      0.545        -0.033     0.017
x37            0.0069      0.004      1.916      0.055        -0.000     0.014
x38           -0.0015      0.001     -2.568      0.010        -0.003    -0.000
x39           -0.0002      0.002     -0.110      0.912        -0.005     0.004
x40           -0.0109      0.017     -0.632      0.528        -0.045     0.023
x41           -0.0142      0.017     -0.821      0.412        -0.048     0.020
x42            0.0217      0.017      1.257      0.209        -0.012     0.056
x43            0.0123      0.102      0.121      0.904        -0.188     0.213
x44            0.0441      0.018      2.503      0.012         0.010     0.079
x45            0.0406      0.018      2.308      0.021         0.006     0.075
x46           -0.0410      0.018     -2.338      0.020        -0.075    -0.007
x47            0.0035      0.003      1.304      0.192        -0.002     0.009
==============================================================================
Omnibus:                       42.820   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.973
Skew:                           0.300   Prob(JB):                     1.16e-12
Kurtosis:                       3.649   Cond. No.                     1.88e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

There’s a lot to look at in the regression output (especially with this many features). For an explanation of all the different parts of the regression take a look at this post. Below is a quick plot of predicted ws/48 against actual ws/48.

plt.plot(estAll.predict(X_rookie),Y,'o')
plt.plot(np.arange(0,0.25,0.01),np.arange(0,0.25,0.01),'b-')
plt.ylabel('Career WS/48')
plt.xlabel('Predicted WS/48');

The blue line above is NOT the best-fit line. It’s the identity line. I plot it to help visualize where the model fails. The model seems to primarily fail in the extremes – it tends to overestimate the worst players.

All in all, this model does a remarkably good job given its simplicity (linear regression), but it also leaves a lot of variance unexplained.

One reason this model might miss some variance is that there’s more than one way to be a productive basketball player. For instance, Dwight Howard and Steph Curry find very different ways to contribute. One linear regression model is unlikely to successfully predict both players.

In a previous post, I grouped players according to their on-court performance. These player groupings might help predict career performance.

Below, I will use the same player grouping I developed in my previous post, and examine how these groupings impact my ability to predict career performance.

from sklearn.preprocessing import StandardScaler

df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_15.pkl')
df = df[df['G']>50]
df_drop = df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
X = df_drop.as_matrix() #take data out of dataframe
ScaleModel = StandardScaler().fit(X)
X = ScaleModel.transform(X)
 
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

reduced_model = PCA(n_components=5, whiten=True).fit(X)

reduced_data = reduced_model.transform(X) #transform data into the 5 PCA components space
final_fit = KMeans(n_clusters=6).fit(reduced_data) #fit 6 clusters
df['kmeans_label'] = final_fit.labels_ #label each data point with its clusters

See my other post for more details about this clustering procedure.

Let’s see how WS/48 varies across the groups.

WS_48 = [df[df['kmeans_label']==x]['WS/48'] for x in np.unique(df['kmeans_label'])] #create a vector of ws/48. One for each cluster
plt.boxplot(WS_48);

Some groups perform better than others, but there’s lots of overlap between the groups. Importantly, each group has a fair amount of variability. Each group spans at least 0.15 WS/48. This gives the regression enough room to successfully predict performance in each group.

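A quick way to check that spread, assuming the labeled dataframe from the clustering step above:

# Range of career WS/48 within each cluster.
ws_spread = df.groupby('kmeans_label')['WS/48'].agg(lambda s: s.max() - s.min())
print(ws_spread)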

Now, let’s get a bit of a refresher on what the groups are. Again, my previous post has a good description of these groups.

TS = [np.mean(df[df['kmeans_label']==x]['TS%'])*100 for x in np.unique(df['kmeans_label'])] #create vectors of each stat for each cluster
ThreeAr = [np.mean(df[df['kmeans_label']==x]['3PAr'])*100 for x in np.unique(df['kmeans_label'])]
FTr = [np.mean(df[df['kmeans_label']==x]['FTr'])*100 for x in np.unique(df['kmeans_label'])]
RBD = [np.mean(df[df['kmeans_label']==x]['TRB%']) for x in np.unique(df['kmeans_label'])]
AST = [np.mean(df[df['kmeans_label']==x]['AST%']) for x in np.unique(df['kmeans_label'])]
STL = [np.mean(df[df['kmeans_label']==x]['STL%']) for x in np.unique(df['kmeans_label'])]
TOV = [np.mean(df[df['kmeans_label']==x]['TOV%']) for x in np.unique(df['kmeans_label'])]
USG = [np.mean(df[df['kmeans_label']==x]['USG%']) for x in np.unique(df['kmeans_label'])]

Data = np.array([TS,ThreeAr,FTr,RBD,AST,STL,TOV,USG])
ind = np.arange(1,9)

plt.figure(figsize=(16,8))
plt.plot(ind,Data,'o-',linewidth=2)
plt.xticks(ind,('True Shooting', '3 point Attempt', 'Free Throw Rate', 'Rebound', 'Assist','Steal','TOV','Usage'),rotation=45)
plt.legend(('Group 1','Group 2','Group 3','Group 4','Group 5','Group 6'))
plt.ylabel('Percent');

I’ve plotted the groups across a number of useful categories. For information about these categories, see Basketball Reference’s glossary.

Here’s a quick rehash of the groupings. See my previous post for more detail.

  • Group 1: These are the distributors who shoot a fair number of threes, don’t rebound at all, dish out assists, gather steals, and …turn the ball over.
  • Group 2: These are the scorers who get to the free throw line, dish out assists, and carry a high usage.
  • Group 3: These are the bench players who don’t score…or do much in general.
  • Group 4: These are the 3 point shooters who shoot tons of 3 pointers, almost no free throws, and don’t rebound well.
  • Group 5: These are the mid-range shooters who shoot well, but don’t shoot threes or draw free throws.
  • Group 6: These are the defensive big men who shoot no threes, rebound lots, and carry a low usage.

On to the regression.

rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl')
rookie_df = rookie_df.drop(['Year','Career Games','Name'],1)

X = rookie_df.as_matrix() #take data out of dataframe
ScaleRookie = StandardScaler().fit(X) #scale data
X = ScaleRookie.transform(X) #transform data to scale

reduced_model_rookie = PCA(n_components=10).fit(X) #create pca model of first 10 components. 

You might have noticed the giant condition number in the regression above. This indicates significant multicollinearity of the features, which isn’t surprising since I have many features that reflect the same abilities.

The multicollinearity doesn’t prevent the regression model from making accurate predictions, but it does make the beta weight estimates erratic. With erratic beta weights, it’s hard to tell whether the different clusters use different models when predicting career ws/48.

In the following regression, I put the predicting features through a PCA and keep only the first 10 PCA components. Using only the first 10 PCA components keeps the condition number below 20, indicating that multicollinearity is not a problem. I then examine whether the different groups exhibit different patterns of beta weights (that is, whether different models predict the success of the different groups).

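That claim can be sanity-checked directly by computing the condition number of the reduced design matrix; a sketch, assuming the scaled rookie matrix X and the 10-component PCA model fit above:

# Condition number of the design matrix after keeping the first 10 principal components.
X_reduced = reduced_model_rookie.transform(X)
print(np.linalg.cond(sm.add_constant(X_reduced)))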

cluster_labels = df[df['Year']>1980]['kmeans_label'] #limit labels to players after 1980
rookie_df_drop['kmeans_label'] = cluster_labels #label each data point with its clusters

estHold = [[],[],[],[],[],[]]

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do regression one group at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predicted data

    Group1_rookie = rookie_df_drop[rookie_df_drop['kmeans_label']==group] #get predictor data of group
    Group1_rookie = Group1_rookie.drop(['kmeans_label'],1)

    X = Group1_rookie.as_matrix() #take data out of dataframe
    X = ScaleRookie.transform(X) #scale data

    X = reduced_model_rookie.transform(X) #transform data into the 10 PCA components space

    X = sm.add_constant(X)  # Adds a constant term to the predictor
    est = sm.OLS(Y,X) #create regression model
    est = est.fit()
    #print(est.summary())
    estHold[i] = est

plt.figure(figsize=(12,6)) #plot the beta weights
width=0.12
for i,est in enumerate(estHold):
    plt.bar(np.arange(11)+width*i,est.params,color=plt.rcParams['axes.color_cycle'][i],width=width,yerr=(est.conf_int()[1]-est.conf_int()[0])/2)

plt.xlim(right=11)
plt.xlabel('Principal Components')
plt.legend(('Group 1','Group 2','Group 3','Group 4','Group 5','Group 6'))
plt.ylabel('Beta Weights');

Above, I plot the beta weights for each principal component across the groupings. This plot is a lot to look at, but I wanted to depict how the beta values changed across the groups. They are not drastically different, but they’re also not identical. Error bars depict 95% confidence intervals.

Below I fit a regression to each group, but with all the features. Again, multicollinearity will be a problem, but this will not decrease the regression’s accuracy, which is all I really care about.

X = rookie_df.as_matrix() #take data out of dataframe

cluster_labels = df[df['Year']>1980]['kmeans_label']
rookie_df_drop['kmeans_label'] = cluster_labels #label each data point with its clusters

plt.figure(figsize=(8,6));

estHold = [[],[],[],[],[],[]]

for i,group in enumerate(np.unique(final_fit.labels_)):

    Grouper = df['kmeans_label']==group #do one regression at a time
    Yearer = df['Year']>1980

    Group1 = df[Grouper & Yearer]
    Y = Group1['WS/48'] #get predicted (outcome) data

    Group1_rookie = rookie_df_drop[rookie_df_drop['kmeans_label']==group]
    Group1_rookie = Group1_rookie.drop(['kmeans_label'],1) #get predictor data

    X = Group1_rookie.as_matrix() #take data out of dataframe    

    X = sm.add_constant(X)  # Adds a constant term to the predictor
    est = sm.OLS(Y,X) #fit with linear regression model
    est = est.fit()
    estHold[i] = est
    #print est.summary()

    plt.subplot(3,2,i+1) #plot each regression's prediction against actual data
    plt.plot(est.predict(X),Y,'o',color=plt.rcParams['axes.color_cycle'][i])
    plt.plot(np.arange(-0.1,0.25,0.01),np.arange(-0.1,0.25,0.01),'-')
    plt.title('Group %d'%(i+1))
    plt.text(0.15,-0.05,'$r^2$=%.2f'%est.rsquared)
    plt.xticks([0.0,0.12,0.25])
    plt.yticks([0.0,0.12,0.25]);

The plots above depict each regression’s predictions against actual ws/48. I provide each model’s r² in the plot too.

Some regressions are better than others. For instance, the regression model does a pretty awesome job predicting the bench warmers…I wonder if this is because they have shorter careers… The regression model does not do a good job predicting the 3-point shooters.

Now on to the fun stuff, though.

Below, I create a function for predicting a player’s career WS/48. First, I write a function that finds which cluster a player belongs to, and what the regression model predicts for this player’s career (with 95% confidence intervals).

def player_prediction__regressionModel(PlayerName):
    from statsmodels.sandbox.regression.predstd import wls_prediction_std

    clust_df = pd.read_pickle('nba_bballref_career_stats_2016_Mar_05.pkl')
    clust_df = clust_df[clust_df['Name']==PlayerName]
    clust_df = clust_df.drop(['Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
    new_vect = ScaleModel.transform(clust_df.as_matrix()[0])
    reduced_data = reduced_model.transform(new_vect)
    Group = final_fit.predict(reduced_data)
    clust_df['kmeans_label'] = Group[0]

    Predrookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Mar_15.pkl')
    Predrookie_df = Predrookie_df[Predrookie_df['Name']==PlayerName]
    Predrookie_df = Predrookie_df.drop(['Year','Career Games','Name'],1)
    predX = Predrookie_df.as_matrix() #take data out of dataframe
    predX = sm.add_constant(predX,has_constant='add')  # Adds a constant term to the predictor
    prstd_ols, iv_l_ols, iv_u_ols = wls_prediction_std(estHold[Group[0]],predX,alpha=0.05)
    return {'Name':PlayerName,'Group':Group[0]+1,'Prediction':estHold[Group[0]].predict(predX),'Upper_CI':iv_u_ols,'Lower_CI':iv_l_ols}
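
As a usage sketch, assuming the scaler, PCA models, clustering fit, and per-group regressions defined above are all in scope, a single player’s projection can be pulled like this:

# Example call: project one player's career WS/48 from his rookie season.
towns = player_prediction__regressionModel('Karl-Anthony Towns')
print(towns['Group'], towns['Prediction'], towns['Lower_CI'], towns['Upper_CI'])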
 

Here I create a function that creates a list of all the first round draft picks from a given year.

def gather_draftData(Year):

    import urllib2
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np

    draft_len = 30

    def convert_float(val):
        try:
            return float(val)
        except ValueError:
            return np.nan

    url = 'http://www.basketball-reference.com/draft/NBA_'+str(Year)+'.html'
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html,"lxml")

    draft_num = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[0].text for i in range(draft_len)]
    draft_nam = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[3].text for i in range(draft_len)]

    draft_df = pd.DataFrame([draft_num,draft_nam]).T
    draft_df.columns = ['Number','Name']
    draft_df.index = range(np.size(draft_df,0))
    return draft_df

Below I create predictions for each first-round draft pick from 2015. The Spurs’ first-round pick, Nikola Milutinov, has yet to play, so I do not create a prediction for him.

import matplotlib.patches as mpatches

draft_df = gather_draftData(2015)

draft_df['Name'][14] =  'Kelly Oubre Jr.' #annoying name inconsistencies 

plt.subplots(figsize=(14,6));
plt.xticks(range(1,31),draft_df['Name'],rotation=90)

draft_df = draft_df.drop(17, 0) #Sam Dekker has received little playing time making his prediction highly irratic
draft_df = draft_df.drop(25, 0) #spurs' 1st round pick has not played yet

for name in draft_df['Name']:

    draft_num = draft_df[draft_df['Name']==name]['Number']

    predict_dict = player_prediction__regressionModel(name)
    yerr = (predict_dict['Upper_CI']-predict_dict['Lower_CI'])/2

    plt.errorbar(draft_num,predict_dict['Prediction'],fmt='o',label=name,
                color=plt.rcParams['axes.color_cycle'][predict_dict['Group']-1],yerr=yerr);

plt.xlim(left=0,right=31)
patch = [mpatches.Patch(color=plt.rcParams['axes.color_cycle'][i], label='Group %d'%(i+1)) for i in range(6)]
plt.legend(handles=patch,ncol=3)
plt.ylabel('Predicted WS/48')
plt.xlabel('Draft Position');

The plot above is ordered by draft pick. The error bars depict 95% confidence intervals…which are a little wider than I would like. It’s interesting to look at what clusters these players fit into. Lots of 3-pt shooters! It could be that rookies play a limited role in the offense – just shooting 3s.

As a t-wolves fan, I am relatively happy about the high prediction for Karl-Anthony Towns. His predicted ws/48 is between Marc Gasol and Elton Brand. Again, the CIs are quite wide, so the model says there’s a 95% chance he is somewhere between LeBron James and a player that averages less than 0.1 ws/48.

Karl-Anthony Towns would have the highest predicted ws/48 if it were not for Kevon Looney, whom the model loves. Kevon Looney has not seen much playing time, though, which likely makes his prediction more erratic. Keep in mind I did not use draft position as a predictor in the model.

Sam Dekker has a pretty huge error bar, likely because of his limited playing time this year.

While I fed a ton of features into this model, it’s still just a linear regression. The simplicity of the model might prevent me from making more accurate predictions.

I’ve already started playing with some more complex models. If those work out well, I will post them here. I ended up sticking with a plain linear regression because my vast number of features is a little unwieldy in more complex models. If you’re interested (and the models produce better results), check back in the future.

For now, these models explain between 40 and 70% of the variance in career ws/48 from only a player’s rookie year. Even predicting 30% of the variance is pretty remarkable, so I don’t want to trash this part of the model. Explaining 65% of the variance is pretty awesome. The model gives us a pretty accurate idea of how these “bench players” will perform. For instance, the future does not look bright for players like Emmanuel Mudiay and Tyus Jones. That’s not to say these players are doomed. The model assumes that players will retain their grouping for their entire career. Emmanuel Mudiay and Tyus Jones might start performing more like distributors as their careers progress. This could result in a better career.

One nice part about this model is it tells us where the predictions are less confident. For instance, it is nice to know that we’re relatively confident when predicting bench players, but not when we’re predicting 3-point shooters.

For those curious, I output each group’s regression summary below.

[print(i.summary()) for i in estHold];

Translated from: https://www.pybloggers.com/2016/03/predicting-career-performance-from-rookie-performance/
