Revisiting NBA Career Predictions from Rookie Performance (Again)


Now that the NBA season is done, we have complete data from this year’s NBA rookies. In the past I have tried to predict NBA rookies’ future performance using regression models. In this post I am again trying to predict rookies’ future performance, but now using a classification approach. When using a classification approach, I predict whether player X will be a “great,” “average,” or “poor” player rather than predicting exactly how productive player X will be.

Much of this post re-uses code from the previous posts, so I skim over some of the repeated code.

As usual, I will post all code as a jupyter notebook on my github.

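A minimal sketch of the setup imports the rest of this analysis relies on (the notebook on github has the exact cell):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt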

Load the data. Reminder – this data is available on my github.

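A minimal sketch of this step, assuming the career-stats pickle file referenced later in the post (the notebook has the exact code):

df = pd.read_pickle('nba_bballref_career_stats_2016_Apr_15.pkl') #career stats for each player
print(np.shape(df))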

Load more data, and normalize the data for the PCA transformation.

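A minimal sketch of this step. It assumes the rookie-stats pickle referenced later, standardizes the career data, and fits a PCA. The fitted objects are named ScaleModel and reduced_model to match the prediction function further down, and the dropped columns mirror that function's drop list; the number of components is an assumption.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Apr_16.pkl') #rookie year stats

#standardize the career stats before the PCA (drop list mirrors the prediction function below)
drop_cols = ['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg',
             'PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%',
             'OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP']
X_career = df.drop(drop_cols, axis=1).values

ScaleModel = StandardScaler().fit(X_career)
X_scaled = ScaleModel.transform(X_career)

reduced_model = PCA(n_components=5).fit(X_scaled) #number of components is an assumption
reduced_data = reduced_model.transform(X_scaled)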

In the past I used k-means to group players according to their performance (see my post on grouping players for more info). Here, I use a gaussian mixture model (GMM) to group the players. I use the GMM model because it assigns each player a “soft” label rather than a “hard” label. By soft label I mean that a player simultaneously belongs to several groups. For instance, Russell Westbrook belongs to both my “point guard” group and my “scorers” group. K-means uses hard labels where each player can only belong to one group. I think the GMM model provides a more accurate representation of players, so I’ve decided to use it in this post. Maybe in a future post I will spend more time describing it.

For anyone wondering, the GMM groupings looked pretty similar to the k-means groupings.

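A minimal sketch of the GMM step. It uses scikit-learn's GaussianMixture (the original 2016 notebook may have used the older GMM class), and the number of groups is an assumption; g, new_labels, and the Category columns match the names used in the code below.

from sklearn.mixture import GaussianMixture

n_groups = 5 #assumed; carried over from the earlier grouping post
g = GaussianMixture(n_components=n_groups, random_state=0).fit(reduced_data)

new_labels = g.predict(reduced_data) #hard group assignment for each player
soft_labels = g.predict_proba(reduced_data) #soft (probabilistic) group memberships

for x in np.unique(new_labels):
    df['Category%d' % x] = soft_labels[:,x] #attach each group's membership probability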

In the past I have attempted to predict win shares per 48 minutes. I am using win shares as a dependent variable again, but now I want to categorize players.

Below I create a histogram of players’ win shares per 48.

I split players into 4 groups which I will refer to as “poor,” “below average,” “above average,” and “great”: poor players are the bottom 10% in win shares per 48, below average players are the 10-50th percentiles, above average players are the 50-90th percentiles, and great players are the top 10%. This assignment scheme is relatively arbitrary; the model performs similarly with different assignment schemes.
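A minimal sketch of this step, splitting players at the 10th, 50th, and 90th percentiles of win shares per 48; the 'WS/48' and 'perf_cat' column names follow the code later in the post.

ws48 = df['WS/48']
cutoffs = np.percentile(ws48, [10, 50, 90])

df['perf_cat'] = np.digitize(ws48, cutoffs) #0=poor, 1=below average, 2=above average, 3=great

print([np.mean(df['perf_cat'] == i) for i in range(4)]) #proportion of players in each category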

[0.096314496314496317,
 0.40196560196560199,
 0.39950859950859952,
 0.10221130221130222]

My goal is to use rookie year performance to classify players into these 4 categories. I have a big matrix with lots of data about rookie year performance, but the reason I grouped players using the GMM is that I suspect players in the different groups have different “paths” to success. I include the groupings in my classification model and compute interaction terms. The interaction terms allow rookie performance to produce different predictions for the different groups.

By including interaction terms, I include quite a few predictor features. I’ve printed the number of predictor features and the number of predicted players below.

from sklearn import preprocessing

df_drop = df[df['Year']>1980]
for x in np.unique(new_labels):
    Label = 'Category%d' % x
    rookie_df_drop[Label] = df_drop[Label] #give rookies the groupings produced by the GMM model

X = rookie_df_drop.as_matrix() #take data out of dataframe   

poly = preprocessing.PolynomialFeatures(2,interaction_only=True) #create interaction terms.
X = poly.fit_transform(X)

Career_data = df[df['Year']>1980]
Y = Career_data['perf_cat'] #the outcome categories we want to predict
print(np.shape(X))
print(np.shape(Y))

Now that I have all the features, it’s time to try and predict which players will be poor, below average, above average, and great. To create these predictions, I will use a logistic regression model.

Because I have so many predictors, correlation between the predictor features and over-fitting the data are major concerns. I use regularization and cross-validation to combat these issues.

Specifically, I am using l2 regularization and 5-fold cross-validation. Within the cross-validation, I am trying to estimate how much regularization is appropriate.

Some important notes – I am using “balanced” class weights, which tell the model that it is worse to incorrectly predict the poor and great players than the below average and above average players. I do this because I don’t want the model to completely ignore the less frequent classifications. Second, I use the multinomial multi_class option because it limits the number of models I have to fit.
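A minimal sketch of a model matching this description: LogisticRegressionCV with an l2 penalty, 5-fold cross-validation over the regularization strength, balanced class weights, and the multinomial option. The exact arguments (and the scoring call that produced the number below) may differ in the notebook.

from sklearn.linear_model import LogisticRegressionCV

est = LogisticRegressionCV(Cs=10,                     #grid of candidate regularization strengths
                           cv=5,                      #5-fold cross-validation
                           penalty='l2',
                           class_weight='balanced',   #don't let the rare classes be ignored
                           multi_class='multinomial', #one multinomial model rather than one-vs-rest
                           solver='lbfgs',
                           max_iter=1000)
est.fit(X, Y)
print(est.score(X, Y)) #mean accuracy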

0.738109219025

Okay, the model did pretty well, but let’s look at where the errors are coming from. To visualize the model’s accuracy, I am using a confusion matrix. In a confusion matrix, every item on the diagonal is a correctly classified item. Every item off the diagonal is incorrectly classified. The color bar’s axis is the percent correct. So the dark blue squares represent cells with more items.

It seems the model is best at predicting poor players and great players. It makes more errors when trying to predict the more average players.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y, est.predict(X))

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap,vmin=0.0, vmax=1.0)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(np.unique(df['perf_cat'])))
    plt.xticks(tick_marks, np.unique(df['perf_cat']))
    plt.yticks(tick_marks, np.unique(df['perf_cat']))
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

Let’s look at what the model predicts for this year’s rookies. Below I modified two functions that I wrote for a previous post. The first function finds a particular year’s draft picks. The second function produces predictions for each draft pick.

def gather_draftData(Year):

    import urllib2
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np

    draft_len = 30

    def convert_float(val):
        try:
            return float(val)
        except ValueError:
            return np.nan

    url = 'http://www.basketball-reference.com/draft/NBA_'+str(Year)+'.html'
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html,"lxml")

    draft_num = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[0].text for i in range(draft_len)]
    draft_nam = [soup.findAll('tbody')[0].findAll('tr')[i].findAll('td')[3].text for i in range(draft_len)]

    draft_df = pd.DataFrame([draft_num,draft_nam]).T
    draft_df.columns = ['Number','Name']
    draft_df.index = range(np.size(draft_df,0))
    return draft_df

def player_prediction__regressionModel(PlayerName):

    clust_df = pd.read_pickle('nba_bballref_career_stats_2016_Apr_15.pkl')
    clust_df = clust_df[clust_df['Name']==PlayerName]
    clust_df = clust_df.drop(['Year','Name','G','GS','MP','FG','FGA','FG%','3P','2P','FT','TRB','PTS','ORtg','DRtg','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM','DBPM','BPM','VORP'],1)
    new_vect = ScaleModel.transform(clust_df.as_matrix().reshape(1,-1))
    reduced_data = reduced_model.transform(new_vect)
    predictions = g.predict_proba(reduced_data)
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        clust_df[Label] = predictions[:,x]

    Predrookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Apr_16.pkl')
    Predrookie_df = Predrookie_df[Predrookie_df['Name']==PlayerName]
    Predrookie_df = Predrookie_df.drop(['Year','Career Games','Name'],1)
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        Predrookie_df[Label] = clust_df[Label] #give rookies the groupings produced by the GMM model
    predX = Predrookie_df.as_matrix() #take data out of dataframe
    predX = poly.fit_transform(predX)
    predictions2 = est.predict_proba(predX)
    return {'Name':PlayerName,'Group':predictions,'Prediction':predictions2[0]}

Below I create a plot depicting the model’s predictions. On the y-axis are the four classifications. On the x-axis are the players from the 2015 draft. Each cell in the plot is the probability of a player belonging to one of the classifications. Again, dark blue means a higher probability. Good news for us T-Wolves fans! The model loves KAT.

draft_df = gather_draftData(2015)

draft_df['Name'][14] =  'Kelly Oubre Jr.' #annoying name inconsistencies 

plt.subplots(figsize=(14,6));

draft_df = draft_df.drop(25, 0) #spurs' 1st round pick has not played yet

predictions = []
for name in draft_df['Name']:
    draft_num = draft_df[draft_df['Name']==name]['Number']
    predict_dict = player_prediction__regressionModel(name)
    predictions.append(predict_dict['Prediction'])

plt.imshow(np.array(predictions).T, interpolation='nearest', cmap=plt.cm.Blues,vmin=0.0, vmax=1.0)
plt.title('Predicting Future Performance of 2015-16 Rookies')
plt.colorbar(shrink=0.25)
tick_marks = np.arange(len(np.unique(df['perf_cat'])))
plt.xticks(range(0,29),draft_df['Name'],rotation=90)
plt.yticks(range(0,4), ['Poor','Below Average','Above Average','Great'])
plt.tight_layout()
plt.ylabel('Prediction')
plt.xlabel('Draft Position');

Translated from: https://www.pybloggers.com/2016/08/revisting-nba-career-predictions-from-rookie-performance-again/
