深圳新秀租房贵吗_再次根据新秀表现对NBA职业生涯进行预测...

最新推荐文章于 2024-07-25 12:14:33 发布

cumei1658

最新推荐文章于 2024-07-25 12:14:33 发布

阅读量528

点赞数

文章标签：机器学习 python 人工智能深度学习大数据

原文链接：https://www.pybloggers.com/2016/08/revisting-nba-career-predictions-from-rookie-performance-again/

版权

深圳新秀租房贵吗

Now that the NBA season is done, we have complete data from this year’s NBA rookies. In the past I have tried to predict NBA rookies’ future performance using regression models. In this post I am again trying to predict rookies’ future performance, but now using using a classification approach. When using a classification approach, I predict whether player X will be a “great,” “average,” or “poor” player rather than predicting exactly how productive player X will be.

现在NBA赛季已经结束，我们已经掌握了今年NBA新秀的全部数据。过去，我曾尝试使用回归模型来预测NBA新秀的未来表现。在本文中，我再次尝试预测新秀的未来表现，但现在使用分类方法。当使用分类方法时，我预测参与者X是“伟大”，“平均”还是“贫穷”参与者，而不是确切预测参与者X的生产率。

Much of this post re-uses code from the previous posts, so I skim over some of the repeated code.

这篇文章中的大部分重复使用了先前文章中的代码，因此我略过了一些重复的代码。

As usual, I will post all code as a jupyter notebook on my github.

和往常一样，我将所有代码作为jupyter笔记本发布在github上。

Load the data. Reminder – this data is available on my github.

加载数据。提醒-这些数据可在我的github上找到。

Load more data, and normalize the data for the PCA transformation.

加载更多数据，并为PCA转换标准化数据。

In the past I used k-means to group players according to their performance (see my post on grouping players for more info). Here, I use a gaussian mixture model (GMM) to group the players. I use the GMM model because it assigns each player a “soft” label rather than a “hard” label. By soft label I mean that a player simultaneously belongs to several groups. For instance, Russell Westbrook belongs to both my “point guard” group and my “scorers” group. K-means uses hard labels where each player can only belong to one group. I think the GMM model provides a more accurate representation of players, so I’ve decided to use it in this post. Maybe in a future post I will spend more time describing it.

过去，我使用k-means根据球员的表现对球员进行分组（有关更多信息，请参阅我的分组球员文章）。在这里，我使用高斯混合模型（GMM）对玩家进行分组。我使用GMM模型是因为它为每个玩家分配了一个“软”标签，而不是一个“硬”标签。软标签是指玩家同时属于多个组。例如，罗素·威斯布鲁克（Russell Westbrook）属于我的“控球后卫”组和我的“得分手”组。 K-means使用硬标签，其中每个玩家只能属于一个组。我认为GMM模型可以更准确地表示球员，因此我决定在本文中使用它。也许在以后的文章中，我会花更多的时间来描述它。

For anyone wondering, the GMM groupings looked pretty similar to the k-means groupings.

对于任何想知道的人，GMM分组看起来都非常类似于k-means分组。

In this past I have attempted to predict win shares per 48 minutes. I am using win shares as a dependent variable again, but I want to categorize players.

在过去，我试图预测每48分钟的获胜份额。我再次将获胜份额用作因变量，但我想对球员进行分类。

Below I create a histogram of players’ win shares per 48.

下面，我创建了每48个玩家获胜份额的直方图。

I split players into 4 groups which I will refer to as “bad,” “below average,” “above average,” and “great”: Poor players are the bottom 10% in win shares per 48, Below average are the 10-50th percentiles, Above average and 50-90th percentiles, Great are the top 10%. This assignment scheme is relatively arbitrary; the model performs similarly with different assignment schemes.

我将玩家分为4组，分别称为“差”，“低于平均水平”，“高于平均水平”和“伟大”：差的玩家是每48个获胜份额中排名倒数10％的玩家，低于平均水平的是10-前10％高于平均水平的50％和50-90％的百分位数。这种分配方案是相对任意的。该模型在不同的分配方案下的表现类似。

[0.096314496314496317,
 0.40196560196560199,
 0.39950859950859952,
 0.10221130221130222]
[0.096314496314496317,
 0.40196560196560199,
 0.39950859950859952,
 0.10221130221130222]

My goal is to use rookie year performance to classify players into these 4 categories. I have a big matrix with lots of data about rookie year performance, but the reason that I grouped player using the GMM is because I suspect that players in the different groups have different “paths” to success. I am including the groupings in my classification model and computing interaction terms. The interaction terms will allow rookie performance to produce different predictions for the different groups.

我的目标是利用新秀年度表现将球员分为以下4类。我有一个大矩阵，其中包含有关新秀年度表现的大量数据，但之所以使用GMM对球员进行分组是因为我怀疑不同组中的球员有不同的“成功道路”。我将分类包括在分类模型和计算交互项中。交互条件将允许菜鸟表现为不同的群体产生不同的预测。

By including interaction terms, I include quite a few predictor features. I’ve printed the number of predictor features and the number of predicted players below.

通过包括交互项，我包括了很多预测器功能。我在下面打印了预测功能的数量和预测玩家的数量。

from sklearn import preprocessing

df_drop = df[df['Year']>1980]
for x in np.unique(new_labels):
    Label = 'Category%d' % x
    rookie_df_drop[Label] = df_drop[Label] #give rookies the groupings produced by the GMM model

X = rookie_df_drop.as_matrix() #take data out of dataframe   

poly = preprocessing.PolynomialFeatures(2,interaction_only=True) #create interaction terms.
X = poly.fit_transform(X)

Career_data = df[df['Year']>1980]
Y = Career_data['perf_cat'] #get predictor data
print(np.shape(X))
print(np.shape(Y))
from sklearn import preprocessing
 
 df_drop = df [ df [ 'Year' ] > 1980 ]
 for x in np . unique ( new_labels ):
     Label = 'Category %d ' % x
     rookie_df_drop [ Label ] = df_drop [ Label ] #give rookies the groupings produced by the GMM model
 
 X = rookie_df_drop . as_matrix () #take data out of dataframe   
 
 poly = preprocessing . PolynomialFeatures ( 2 , interaction_only = True ) #create interaction terms.
 X = poly . fit_transform ( X )
 
 Career_data = df [ df [ 'Year' ] > 1980 ]
 Y = Career_data [ 'perf_cat' ] #get predictor data
 print ( np . shape ( X ))
 print ( np . shape ( Y ))

Now that I have all the features, it’s time to try and predict which players will be poor, below average, above average, and great. To create these predictions, I will use a logistic regression model.

现在，我已经具备了所有功能，现在该尝试预测哪些球员将是贫穷，低于平均水平，高于平均水平和出色的球员。为了创建这些预测，我将使用逻辑回归模型。

Because I have so many predictors, correlation between predicting features and over-fitting the data are major concerns. I use regularization and cross-validation to combat these issues.

因为我有很多预测变量，所以预测特征与数据过度拟合之间的相关性是主要问题。我使用正则化和交叉验证来解决这些问题。

Specifically, I am using l2 regularization and k-fold 5 cross-validation. Within the cross-validation, I am trying to estimate how much regularization is appropriate.

具体来说，我正在使用l2正则化和k倍5交叉验证。在交叉验证中，我试图估计多少正则化是合适的。

Some important notes – I am using “balanced” weights which tells the model that worse to incorrectly predict the poor and great players than the below average and above average players. I do this because I don’t want the model to completely ignore the less frequent classifications. Second, I use the multi_class multinomial because it limits the number of models I have to fit.

一些重要注意事项–我正在使用“平衡”权重，该权重告诉模型与平均水平以下和平均水平以上的参与者相比，错误地预测贫穷和伟大的参与者更为糟糕。我这样做是因为我不希望模型完全忽略频率较低的分类。其次，我使用multi_class多项式，因为它限制了我必须适合的模型数量。

0.738109219025
0.738109219025

Okay, the model did pretty well, but lets look at where the errors are coming from. To visualize the models accuracy, I am using a confusion matrix. In a confusion matrix, every item on the diagnonal is a correctly classified item. Every item off the diagonal is incorrectly classified. The color bar’s axis is the percent correct. So the dark blue squares represent cells with more items.

好的，模型做得很好，但是让我们看一下错误的出处。为了可视化模型的准确性，我使用了一个混淆矩阵。在混淆矩阵中，诊断中的每个项目都是正确分类的项目。对角线上的每个项目均未正确分类。彩条的轴是正确的百分比。因此，深蓝色方块代表具有更多项目的单元格。

It seems the model is best at predicting poor players and great players. It makes more errors when trying to predict the more average players.

似乎该模型最适合预测不良参与者和优秀参与者。尝试预测更多平均玩家时，它会产生更多错误。

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y, est.predict(X))

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap,vmin=0.0, vmax=1.0)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(np.unique(df['perf_cat'])))
    plt.xticks(tick_marks, np.unique(df['perf_cat']))
    plt.yticks(tick_marks, np.unique(df['perf_cat']))
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
from sklearn.metrics import confusion_matrix
 cm = confusion_matrix ( Y , est . predict ( X ))
 
 def plot_confusion_matrix ( cm , title = 'Confusion matrix' , cmap = plt . cm . Blues ):
     plt . imshow ( cm , interpolation = 'nearest' , cmap = cmap , vmin = 0.0 , vmax = 1.0 )
     plt . title ( title )
     plt . colorbar ()
     tick_marks = np . arange ( len ( np . unique ( df [ 'perf_cat' ])))
     plt . xticks ( tick_marks , np . unique ( df [ 'perf_cat' ]))
     plt . yticks ( tick_marks , np . unique ( df [ 'perf_cat' ]))
     plt . tight_layout ()
     plt . ylabel ( 'True label' )
     plt . xlabel ( 'Predicted label' )
 
 cm_normalized = cm . astype ( 'float' ) / cm . sum ( axis = 1 )[:, np . newaxis ]
 plot_confusion_matrix ( cm_normalized , title = 'Normalized confusion matrix' )

Lets look at what the model predicts for this year’s rookies. Below I modified two functions that I wrote for a previous post. The first function finds a particular year’s draft picks. The second function produces predictions for each draft pick.

让我们看看该模型对今年新秀的预测。下面，我修改了为上一篇文章编写的两个函数。第一个功能查找特定年份的选秀权。第二个功能为每个选秀权产生预测。

def gather_draftData(Year):

    import urllib2
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np

    draft_len = 30

    def convert_float(val):
        try:
            return float(val)
        except ValueError:
            return np.nan

    url = 'http://www.basketball-reference.com/draft/NBA_'+str(Year)+'.html'
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html,"lxml")

    draft_num = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[0].text for i in range(draft_len)]
    draft_nam = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[3].text for i in range(draft_len)]

    draft_df = pd.DataFrame([draft_num,draft_nam]).T
    draft_df.columns = ['Number','Name']
    df.index = range(np.size(df,0))
    return draft_df

def player_prediction__regressionModel(PlayerName):

    clust_df = pd.read_pickle('nba_bballref_career_stats_2016_Apr_15.pkl')
    clust_df = clust_df[clust_df['Name']==PlayerName]
    clust_df = clust_df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
    new_vect = ScaleModel.transform(clust_df.as_matrix().reshape(1,-1))
    reduced_data = reduced_model.transform(new_vect)
    predictions = g.predict_proba(reduced_data)
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        clust_df[Label] = predictions[:,x]

    Predrookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Apr_16.pkl')
    Predrookie_df = Predrookie_df[Predrookie_df['Name']==PlayerName]
    Predrookie_df = Predrookie_df.drop(['Year','Career Games','Name'],1)
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        Predrookie_df[Label] = clust_df[Label] #give rookies the groupings produced by the GMM model
    predX = Predrookie_df.as_matrix() #take data out of dataframe
    predX = poly.fit_transform(predX)
    predictions2 = est.predict_proba(predX)
    return {'Name':PlayerName,'Group':predictions,'Prediction':predictions2[0]}
def gather_draftData ( Year ):
 
     import urllib2
     from bs4 import BeautifulSoup
     import pandas as pd
     import numpy as np
 
     draft_len = 30
 
     def convert_float ( val ):
         try :
             return float ( val )
         except ValueError :
             return np . nan
 
     url = 'http://www.basketball-reference.com/draft/NBA_' + str ( Year ) + '.html'
     html = urllib2 . urlopen ( url )
     soup = BeautifulSoup ( html , "lxml" )
 
     draft_num = [ soup . findAll ( 'tbody' )[ 0 ] . findAll ( 'tr' )[ i ] . findAll ( 'td' )[ 0 ] . text for i in range ( draft_len )]
     draft_nam = [ soup . findAll ( 'tbody' )[ 0 ] . findAll ( 'tr' )[ i ] . findAll ( 'td' )[ 3 ] . text for i in range ( draft_len )]
 
     draft_df = pd . DataFrame ([ draft_num , draft_nam ]) . T
     draft_df . columns = [ 'Number' , 'Name' ]
     df . index = range ( np . size ( df , 0 ))
     return draft_df
 
 def player_prediction__regressionModel ( PlayerName ):
 
     clust_df = pd . read_pickle ( 'nba_bballref_career_stats_2016_Apr_15.pkl' )
     clust_df = clust_df [ clust_df [ 'Name' ] == PlayerName ]
     clust_df = clust_df . drop ([ 'Year' , 'Name' , 'G' , 'GS' , 'MP' , 'FG' , 'FGA' , 'FG%' , '3P' , '2P' , 'FT' , 'TRB' , 'PTS' , 'ORtg' , 'DRtg' , 'PER' , 'TS%' , '3PAr' , 'FTr' , 'ORB%' , 'DRB%' , 'TRB%' , 'AST%' , 'STL%' , 'BLK%' , 'TOV%' , 'USG%' , 'OWS' , 'DWS' , 'WS' , 'WS/48' , 'OBPM' , 'DBPM' , 'BPM' , 'VORP' ], 1 )
     new_vect = ScaleModel . transform ( clust_df . as_matrix () . reshape ( 1 , - 1 ))
     reduced_data = reduced_model . transform ( new_vect )
     predictions = g . predict_proba ( reduced_data )
     for x in np . unique ( new_labels ):
         Label = 'Category %d ' % x
         clust_df [ Label ] = predictions [:, x ]
 
     Predrookie_df = pd . read_pickle ( 'nba_bballref_rookie_stats_2016_Apr_16.pkl' )
     Predrookie_df = Predrookie_df [ Predrookie_df [ 'Name' ] == PlayerName ]
     Predrookie_df = Predrookie_df . drop ([ 'Year' , 'Career Games' , 'Name' ], 1 )
     for x in np . unique ( new_labels ):
         Label = 'Category %d ' % x
         Predrookie_df [ Label ] = clust_df [ Label ] #give rookies the groupings produced by the GMM model
     predX = Predrookie_df . as_matrix () #take data out of dataframe
     predX = poly . fit_transform ( predX )
     predictions2 = est . predict_proba ( predX )
     return { 'Name' : PlayerName , 'Group' : predictions , 'Prediction' : predictions2 [ 0 ]}

Below I create a plot depicting the model’s predictions. On the y-axis are the four classifications. On the x-axis are the players from the 2015 draft. Each cell in the plot is the probability of a player belonging to one of the classifications. Again, dark blue means a cell or more likely. Good news for us T-Wolves fans! The model loves KAT.

在下面，我创建了一个描述模型预测的图。在y轴上有四个分类。 x轴上是2015年选秀的球员。情节中的每个像元都是玩家属于其中一种分类的概率。同样，深蓝色表示一个细胞或更可能。对我们T-狼队的球迷们来说是个好消息！该模特喜欢KAT。

draft_df = gather_draftData(2015)

draft_df['Name'][14] =  'Kelly Oubre Jr.' #annoying name inconsistencies 

plt.subplots(figsize=(14,6));

draft_df = draft_df.drop(25, 0) #spurs' 1st round pick has not played yet

predictions = []
for name in draft_df['Name']:
    draft_num = draft_df[draft_df['Name']==name]['Number']
    predict_dict = player_prediction__regressionModel(name)
    predictions.append(predict_dict['Prediction'])

plt.imshow(np.array(predictions).T, interpolation='nearest', cmap=plt.cm.Blues,vmin=0.0, vmax=1.0)
plt.title('Predicting Future Performance of 2015-16 Rookies')
plt.colorbar(shrink=0.25)
tick_marks = np.arange(len(np.unique(df['perf_cat'])))
plt.xticks(range(0,29),draft_df['Name'],rotation=90)
plt.yticks(range(0,4), ['Poor','Below Average','Above Average','Great'])
plt.tight_layout()
plt.ylabel('Prediction')
plt.xlabel('Draft Position');
draft_df = gather_draftData ( 2015 )
 
 draft_df [ 'Name' ][ 14 ] =  'Kelly Oubre Jr.' #annoying name inconsistencies 
 
 plt . subplots ( figsize = ( 14 , 6 ));
 
 draft_df = draft_df . drop ( 25 , 0 ) #spurs' 1st round pick has not played yet
 
 predictions = []
 for name in draft_df [ 'Name' ]:
     draft_num = draft_df [ draft_df [ 'Name' ] == name ][ 'Number' ]
     predict_dict = player_prediction__regressionModel ( name )
     predictions . append ( predict_dict [ 'Prediction' ])
 
 plt . imshow ( np . array ( predictions ) . T , interpolation = 'nearest' , cmap = plt . cm . Blues , vmin = 0.0 , vmax = 1.0 )
 plt . title ( 'Predicting Future Performance of 2015-16 Rookies' )
 plt . colorbar ( shrink = 0.25 )
 tick_marks = np . arange ( len ( np . unique ( df [ 'perf_cat' ])))
 plt . xticks ( range ( 0 , 29 ), draft_df [ 'Name' ], rotation = 90 )
 plt . yticks ( range ( 0 , 4 ), [ 'Poor' , 'Below Average' , 'Above Average' , 'Great' ])
 plt . tight_layout ()
 plt . ylabel ( 'Prediction' )
 plt . xlabel ( 'Draft Position' );

翻译自: https://www.pybloggers.com/2016/08/revisting-nba-career-predictions-from-rookie-performance-again/

深圳新秀租房贵吗

cumei1658

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
深圳新秀租房贵吗_再次根据新秀表现对NBA职业生涯进行预测...

深圳新秀租房贵吗Now that the NBA season is done, we have complete data from this year’s NBA rookies. In the past I have tried to predict NBA rookies’ future performance using regression models. In this post I...
复制链接

扫一扫