NBA Player Total Points Prediction: K-Nearest Neighbors

mmい

Category: Machine Learning

Article link: https://blog.csdn.net/zm714981790/article/details/51295486

Dataset

The dataset used in this post, nba_2013.csv, contains NBA player statistics for the 2013-2014 season. Some of its columns:

player – name of the player
pos – the position of the player
g – number of games the player was in
gs – number of games the player started
pts – total points the player scored

import pandas
with open("nba_2013.csv", 'r') as csvfile:
    nba = pandas.read_csv(csvfile)

# The names of the columns in the data.
print(nba.columns.values)
'''
['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']
'''

Euclidean Distance

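The code below computes the Euclidean distance between LeBron James and every player. Note that LeBron James has to be selected with iloc[0]: the boolean filter returns a DataFrame, and loc indexes by row label, but LeBron James's row label is not known in advance, so taking the row at position 0 with iloc is the most suitable choice.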

selected_player = nba[nba["player"] == "LeBron James"].iloc[0]
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']
import math
def euclidean_distance(row):
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_player[k]) ** 2
    return math.sqrt(inner_value)

lebron_distance = nba.apply(euclidean_distance, axis=1)
'''
lebron_distance  : Series (<class 'pandas.core.series.Series'>)
0     3475.792868
1     3148.395020
2     3161.567361
3     1189.554979
4     3216.773098
...
'''
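The selection and distance steps above can be tried on a tiny table. This is only a sketch: the DataFrame `toy`, its values, and the helper `toy_distance` are made up for illustration and are not part of the NBA dataset.

```python
import math
import pandas as pd

# Hypothetical toy data; names and numbers are invented for illustration.
toy = pd.DataFrame(
    {"player": ["A", "B", "C"], "g": [10, 13, 10], "pts": [100, 104, 100]},
    index=[7, 8, 9],  # non-default row labels, so label != position
)

# The boolean filter returns a DataFrame, even when only one row matches.
matches = toy[toy["player"] == "B"]

# iloc[0] takes the first row by position, whatever its label happens to be.
selected = matches.iloc[0]

toy_columns = ["g", "pts"]

def toy_distance(row):
    # Sum of squared differences over the chosen columns, then the square root.
    return math.sqrt(sum((row[k] - selected[k]) ** 2 for k in toy_columns))

toy_distances = toy.apply(toy_distance, axis=1)
print(toy_distances)
```

Here the matching row carries the label 8, so `matches.loc[0]` would raise a KeyError, while `matches.iloc[0]` works regardless of the label.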

Normalizing Columns

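Attributes with a large value range have a disproportionate influence on the distance. To put all attributes on an equal footing, the values are standardized so that every column has mean 0 and variance 1.
nba_numeric.mean() returns the mean of each column, and nba_numeric.std() returns the standard deviation of each column.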

nba_numeric = nba[distance_columns]
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()
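As a quick check of what this standardization does, here is a minimal sketch on two made-up columns with very different scales. The frame `toy_numeric` and its values are assumptions for illustration. Note that pandas `std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy columns on very different scales.
toy_numeric = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "pts": [0.0, 1000.0, 2000.0]}
)

# Column-wise standardization: pandas aligns the per-column mean and std
# with the matching columns of the frame.
toy_normalized = (toy_numeric - toy_numeric.mean()) / toy_numeric.std()

print(toy_normalized)
print(toy_normalized.mean())  # approximately 0 for every column
print(toy_normalized.std())   # approximately 1 for every column
```

After standardization the two columns hold the same values, so neither dominates a distance computed over both.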

Finding The Nearest Neighbor

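Earlier we computed the distance from LeBron James to every player by hand. The scipy.spatial package also provides a distance module with a range of distance functions. Here we use distance.euclidean, which computes the same Euclidean distance as our hand-written function, this time applied to the normalized data. We then sort the distances. The first entry is the distance from LeBron James to himself, which is 0, so the nearest neighbor we are looking for is the second entry. apply is called on the object to iterate over and takes a function as its argument. In this case, however, the distance function needs not only the current row but also the LeBron James vector, so we pass a lambda that receives the row and calls the two-argument distance function inside.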

from scipy.spatial import distance

# Fill in NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)

# Find the normalized vector for lebron james.
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"].iloc[0]

# Find the distance between lebron james and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)
distance_frame = pandas.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_lebron = nba.loc[int(second_smallest)]["player"]

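After computing the distance from lebron_normalized to every player, we build a DataFrame and sort it by distance. To keep track of each player's index through the sort, an extra idx column is added. This is not strictly necessary, because the DataFrame's row labels already are the players' indices. In distance_frame.sort_values("dist", inplace=True), inplace=True sorts the frame in place, which is equivalent to distance_frame = distance_frame.sort_values("dist"). With inplace=False, the original distance_frame is left unchanged and a sorted copy is returned.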
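The point about row labels can be illustrated with a short sketch that skips the extra column and reads the nearest neighbor straight from the sorted index. The Series `toy_dist` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical distances, indexed by the players' row labels.
toy_dist = pd.Series([2.5, 0.0, 1.2, 3.1], index=[0, 1, 2, 3])

# Without inplace=True, sort_values returns a new sorted Series
# and leaves the original order untouched.
sorted_dist = toy_dist.sort_values()

# Position 0 is the player itself (distance 0);
# position 1 is the nearest neighbor.
nearest_label = sorted_dist.index[1]
print(nearest_label)
```

The row labels travel with the values through the sort, so `nearest_label` can be passed directly to `loc` on the original table.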

Generating Training And Testing Sets

Note that the training and test sets are not normalized at this point:

import math
from numpy.random import permutation

# Randomly shuffle the index of nba.
random_indices = permutation(nba.index)
# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)
# Generate the test set by taking the first 1/3 of the randomly shuffled indices.
# Slice from the start so that the first shuffled row is not dropped.
test = nba.loc[random_indices[:test_cutoff]]
# Generate the train set with the rest of the data.
train = nba.loc[random_indices[test_cutoff:]]
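A small sketch of a shuffled one-third split on made-up data, checking that the two sets do not overlap and together cover every row. The frame `toy` and the fixed seed are assumptions added for repeatability.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"pts": range(9)})  # hypothetical 9-row dataset

rng = np.random.default_rng(0)  # fixed seed so the split is repeatable
shuffled = rng.permutation(toy.index.to_numpy())

cutoff = len(toy) // 3  # one third of the rows go to the test set
toy_test = toy.loc[shuffled[:cutoff]]
toy_train = toy.loc[shuffled[cutoff:]]

print(len(toy_test), len(toy_train))
# No row appears in both sets.
print(set(toy_test.index) & set(toy_train.index))
```

Slicing with `[:cutoff]` and `[cutoff:]` on the same shuffled array guarantees that every row lands in exactly one of the two sets.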

Using Sklearn

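sklearn provides a dedicated nearest-neighbor estimator, KNeighborsRegressor. We want to predict a player's total points, pts, which is a continuous value, so this is a regression problem. Note that the training and test sets here are raw, unnormalized data. KNeighborsRegressor handles the distance computation and the neighbor search automatically, but it does not normalize the features for you, so columns with large ranges (such as mp) will dominate the distance unless the data is standardized beforehand. Parameters such as the number of neighbors and the distance metric can be adjusted in KNeighborsRegressor.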

# The columns that we will be making predictions with.
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']
# The column that we want to predict: total points the player scored
y_column = ["pts"]

from sklearn.neighbors import KNeighborsRegressor
# Create the knn model.
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data.
knn.fit(train[x_columns], train[y_column])
# Make predictions on the test set using the fit model.
predictions = knn.predict(test[x_columns])
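To make the prediction step less of a black box, here is a minimal hand-written sketch of k-nearest-neighbor regression with Euclidean distance and uniform weights, which is the behavior of KNeighborsRegressor under its default settings. The function `knn_regress` and the toy arrays are hypothetical and not part of sklearn.

```python
import numpy as np

def knn_regress(train_x, train_y, query, k):
    """Predict a value for `query` as the mean target of its k nearest training rows."""
    # Euclidean distance from the query to every training row.
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Uniform weights: a plain average of the neighbors' targets.
    return train_y[nearest].mean()

# Hypothetical toy data: one feature, target equals 10 * feature.
toy_x = np.array([[1.0], [2.0], [3.0], [10.0]])
toy_y = np.array([10.0, 20.0, 30.0, 100.0])

prediction = knn_regress(toy_x, toy_y, np.array([2.1]), k=3)
print(prediction)
```

The three nearest rows to 2.1 are those with features 2, 3 and 1, so the prediction is the average of their targets. The outlier at 10 is ignored because it is not among the k nearest.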

Computing Error

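We measure the model's error with the mean squared error (MSE). For classification problems we would usually compute the AUC, but that does not apply to regression. The roc_auc_score documentation says: "Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format." MSE is therefore used here instead.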

actual = test[y_column]
mse = (((predictions - actual) ** 2).sum()) / len(predictions)
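The same formula on three made-up numbers, as a sketch that can be checked by hand. The arrays `toy_predictions` and `toy_actual` are hypothetical.

```python
import numpy as np

toy_predictions = np.array([2.0, 4.0, 6.0])
toy_actual = np.array([1.0, 4.0, 8.0])

# Squared errors are 1, 0 and 4; their mean is 5 / 3.
toy_mse = ((toy_predictions - toy_actual) ** 2).mean()
print(toy_mse)
```

Because the errors are squared, the single miss of 2 points contributes four times as much as the miss of 1 point.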
