Buying a Soccer Team: A Machine Learning Approach

An approach to picking players from a pool of 18,000 professionals that is better than random guessing.

As sports become a vital part of our lives, they have also become a hot market for investors seeking better returns, audience engagement, and a stronger public presence. The surge in sports viewership has led to more tournaments, and capitalizing on them has become a difficult task for an investor. We took up the challenge of helping major investors pick the best players from among 18,000 soccer players to build a dream team that can compete with, and outperform, other clubs in the major leagues. We leveraged machine learning algorithms to classify potential team members for our club and to estimate the budget an investor would need to optimize their market gains. The result is a strategy for building the best possible team while respecting the investor's budget limit of 1 billion Euros.

INTRODUCTION

We have a FIFA dataset that contains a few columns named rating, release clause, and wages. We assume these variables will not be available in upcoming out-of-time datasets. They can be used in various ways, for example to label a player as a strong performer, a moderate one, or one not up to the mark. Such a label could also determine which players should be invited to club gatherings, events, etc. We built two models. The first uses supervised learning on the rating variable, framed as a classification problem by splitting the rating into two classes: greater than or equal to 70 (a potential club member) and less than 70 (not). We chose 70 as our threshold because most major clubs only field players rated above 70; to compete with them, we restrict ourselves to players above this threshold. The second model predicts the annual cost to investors of offering a club membership. It uses the predicted rating class obtained from our best classifier instead of the actual rating, and takes a combination of release_clause and annual wages (the cost to investors) as the dependent variable.

DATASET

We are using the FIFA 2019 and 2020 data from the Kaggle FIFA complete player dataset, which contains 18k+ unique players and 100+ attributes extracted from the latest edition of FIFA. It includes:

  • Files present in CSV format.
  • FIFA 2020: 18,278 unique players and 104 attributes for each player (test dataset).
  • FIFA 2019: 17,770 unique players and 104 attributes for each player (train dataset).
  • Player positions, with the role in the club and in the national team.
  • Player attributes with statistics such as Attacking, Skills, Defense, Mentality, GK Skills, etc.
  • Player personal data such as Nationality, Club, DateOfBirth, Wage, Salary, etc.

DATA CLEANING

  • In some places, the two datasets use different data types for the same features. After reading the data dictionary, we brought them into sync.
  • Some variables have in-built formulas, so we corrected their formatting.

  • We removed ‘sofifa_id’, ‘player_url’, ‘short_name’, ‘long_name’, ‘real_face’, ‘dob’, ‘gk_diving’, ‘gk_handling’, ‘gk_kicking’, ‘gk_reflexes’, ‘gk_speed’, ‘gk_positioning’ and ‘body_type’ based on dictionary definitions or because they duplicate other columns, as they add no useful information to our analysis.

  • We converted the overall rating into two binary classes, with rating > 70 as the positive class (many big clubs use this threshold to recruit players); this binary rating is treated as our dependent variable.
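
A rough pandas sketch of these cleaning steps, assuming the Kaggle file names and column names (e.g. `sofifa_id`, `overall`); the exact paths and drop list are illustrative:

```python
import pandas as pd

# Load the Kaggle FIFA player files (paths are placeholders).
train = pd.read_csv("players_19.csv")  # FIFA 2019 -> train dataset
test = pd.read_csv("players_20.csv")   # FIFA 2020 -> test dataset

# Drop identifier, duplicate, and goalkeeper-specific columns that add no value.
drop_cols = ["sofifa_id", "player_url", "short_name", "long_name", "real_face",
             "dob", "gk_diving", "gk_handling", "gk_kicking", "gk_reflexes",
             "gk_speed", "gk_positioning", "body_type"]
train = train.drop(columns=drop_cols, errors="ignore")
test = test.drop(columns=drop_cols, errors="ignore")

# Binary target: 1 if the overall rating is above 70, else 0.
train["rating_class"] = (train["overall"] > 70).astype(int)
test["rating_class"] = (test["overall"] > 70).astype(int)
```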

EXPLORATORY DATA ANALYSIS

We first considered various interesting statistics for performing our exploratory data analysis.


  • Univariate statistics, such as the percentage of missing values across the whole dataset (to guide missing-value treatment), univariate statistics of the continuous variables (count, mean, std, min, max, skewness, kurtosis, unique, missing, IQR), and their distributions.
  • Bivariate statistics: correlation among the features and a t-test for continuous variables, and a chi-square test and Cramér's V for categorical variables.

Univariate

We performed univariate analysis on the continuous variables to get a sense of the distribution of the different fields in our dataset. Based on the summary statistics (mean, std, skewness, kurtosis, etc.), we observed that many key features follow a normal distribution. Moreover, the interquartile range (IQR) was used to detect outliers with Tukey's method.

[Table: univariate statistics of the continuous variables]
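
A minimal sketch of Tukey's rule as used here, assuming the conventional 1.5 × IQR fences (the column name is illustrative):

```python
def tukey_outliers(series, k=1.5):
    """Boolean mask flagging values outside Tukey's fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Example: flag wage outliers (column name assumed from the Kaggle data).
outliers = tukey_outliers(train["wage_eur"].dropna())
print(f"{outliers.mean():.1%} of players flagged as outliers")
```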

For the categorical variables, the univariate analysis consists of their count, unique values, categories with maximum counts (i.e., top), their frequency, and the number of missing values they have. From the categorical table, we can see that player_tags, loaned_from, nation_position, player_traits have more than 54% of missing values. It would not be easy to impute these with any promising values.


[Table: univariate statistics of the categorical variables]

Bivariate

For continuous variables

We built a correlation matrix to get a sense of the extent of the linear relationship between rating and the other explanatory variables, and of which variables can be excluded at later stages. We used the seaborn package in Python to create the heat map shown below.

[Figure: correlation heat map of the continuous variables]
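
A minimal sketch of how such a heat map can be produced with seaborn (figure size and palette are arbitrary choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only.
corr = train.select_dtypes(include="number").corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation among continuous variables")
plt.show()
```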

T-test

We also performed a t-test to check whether the mean of each variable when rating = 1 differs significantly from its mean when rating = 0. After this stage, we removed variables that were either not significant or showed no correlation at all with the dependent variable.
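
A minimal sketch of this test with scipy (Welch's unequal-variance variant is assumed; the feature name is illustrative):

```python
from scipy import stats

def rating_ttest(df, col, target="rating_class"):
    """Two-sample t-test comparing a feature across the two rating classes."""
    pos = df.loc[df[target] == 1, col].dropna()
    neg = df.loc[df[target] == 0, col].dropna()
    return stats.ttest_ind(pos, neg, equal_var=False)

t_stat, p_value = rating_ttest(train, "shooting")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```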

For categorical variables

We performed a chi-square test to check the significance of the categorical variables against the dependent variable, rating. The table below contains the p-values corresponding to the categorical variables; we found that preferred_foot is not significant in our analysis.

To find the correlation between the categorical variables and the dependent variable, we applied Cramér's V.

[Equation: Cramér's V]

V equals the square root of chi-square divided by the sample size n times m, where m is the smaller of (rows - 1) and (columns - 1): V = sqrt(χ²/(n·m)). A minimal Python sketch of this computation appears after the list below.

  • Interpretation: V may be viewed as the association between two variables as a percentage of their maximum possible variation. V² is the mean square canonical correlation between the variables. For 2-by-2 tables, V equals the phi coefficient, a chi-square-based measure of association.

  • Symmetricalness: V is a symmetrical measure. It does not matter which is the independent variable.


  • Data level: V may be used with nominal data or higher.


  • Values: Ranges from 0 to 1.

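A minimal sketch of Cramér's V from a contingency table, following the plain formula above (no bias correction):

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    """Cramér's V between two categorical series: sqrt(chi2 / (n * m))."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = stats.chi2_contingency(table)
    n = table.to_numpy().sum()
    m = min(table.shape[0] - 1, table.shape[1] - 1)
    return np.sqrt(chi2 / (n * m))

# Example: association between preferred foot and the rating class (columns assumed).
print(cramers_v(train["preferred_foot"], train["rating_class"]))
```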

In this scenario, we kept the categorical columns that showed a decent correlation with the dependent variable: 'club_new', 'Pos', 'attack_rate', and 'nation'.

FEATURE ENGINEERING

1. Re-categorizing/Imputing Variables

  • Since team_jersey_number and nation_jersey_number are not actually continuous variables, we decided to treat them as categorical variables.

  • We imputed the missing values of team_position with 'not played' and re-categorized the players into defender, attacker, goalkeeper, resting, mid-fielder, substitute, and not played, reducing 29 unique values to 7 levels.

  • We conjecture that a goalkeeper will have the minimum values for 'pace', 'shooting', 'passing', 'dribbling', 'defending', and 'physic', and imputed those fields accordingly.

  • Moreover, two variables, nationality and club, have very high cardinality. Based on their volume and event rate, we re-categorized them into lower-cardinality variables.

2. Creating Variables

  • From the data, we observed that 'player_positions' captures a player's multiple playing positions, so we assigned each player the total count of on-field positions at which they are available as a new variable, 'playing_positions'.

  • A player's work_rate is given by his attack and defense rates; we therefore split it into two separate variables.

  • We also calculated the length of time an individual player has been associated with the club, to better capture their loyalty to the club.

  • We used one-hot encoding to convert the categorical variables into a form that ML algorithms can use to make better predictions; a minimal sketch follows.
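
A minimal sketch of that encoding step with pandas (the column list is illustrative):

```python
import pandas as pd

# One-hot encode the engineered categorical columns (names assumed).
categorical_cols = ["club_new", "Pos", "attack_rate", "nation"]
train_encoded = pd.get_dummies(train, columns=categorical_cols, drop_first=True)
test_encoded = pd.get_dummies(test, columns=categorical_cols, drop_first=True)

# Align train/test so both frames share the same dummy columns.
train_encoded, test_encoded = train_encoded.align(test_encoded, join="left",
                                                  axis=1, fill_value=0)
```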

3. MODEL 1

Here, Y = rating class, with a population event rate of 31.23% (class 1).

3.1. Logistic Regression

For the logistic regression model, we first performed the classification without regularization, followed by ridge and lasso regression. L1-regularized logistic regression requires solving a convex optimization problem; however, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings.

The objective of logistic regression, when applying a penalty to minimize the loss function, is:

[Equations: L1- and L2-regularized logistic regression objectives]
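
For reference, a sketch of the standard penalized objectives (assuming labels y_i in {-1, +1}; λ is the regularization strength, and this is the usual textbook form rather than a reproduction of the original figures):

```latex
\min_{w,b}\; \sum_{i=1}^{n} \log\bigl(1 + e^{-y_i (w^\top x_i + b)}\bigr) + \lambda \lVert w \rVert_1 \quad \text{(L1, lasso)}

\min_{w,b}\; \sum_{i=1}^{n} \log\bigl(1 + e^{-y_i (w^\top x_i + b)}\bigr) + \frac{\lambda}{2} \lVert w \rVert_2^2 \quad \text{(L2, ridge)}
```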

The best result received from running the logistic regression models pre and post regularization (L1 and L2) can be summarized below:


[Table: logistic regression results before and after L1/L2 regularization]

3.2. KNN

kNN is a case-based learning method that keeps all the training data for classification. One evaluation standard for different algorithms is their performance. Since kNN is a simple yet effective classification method, and convincingly one of the most effective, we were motivated to build a kNN model that improves efficiency while preserving classification accuracy.

[Figure 1: k-NN toy example with 11 training points from two classes]

Looking at Figure 1, a training dataset of 11 data points from two classes {square, triangle} is distributed in a two-dimensional data space. If we use Euclidean distance as our similarity measure, many data points with the same class label are close to each other according to the distance measure in the local area.

For instance, if we take the region where k = 3, represented by the solid circle, and apply majority voting among the classes, our query point {circle} is classified as a triangle. However, if we increase k to 5, represented by the dotted circle, the point is classified as a square. This motivates us to optimize our k-nearest neighbors algorithm to find the k for which the classification error is minimal.

Experiment:

We initially trained our k-NN model with k = 1, splitting our data 70%-30% into training and validation sets. From Table 2 we observe that the training accuracy is 1, which implies the model fits the training data perfectly; however, the accuracy and AUC on the test data are higher than on the validation data, which is indicative of overfitting, so we proceeded to parameter tuning.

[Table 2: k-NN results with k = 1]

Optimization:

We used the elbow method to find the k with the least error on the training data. After searching for the best k, we observed that the lowest error rate occurs at k = 7. Although the optimized results performed better on the training and validation sets, the test AUC decreased.

[Figure: error rate vs. k (elbow plot)]
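
A minimal sketch of this tuning loop with scikit-learn (the split, the range of k values, and the dropped columns are assumptions; it builds on the encoded frame from the feature-engineering sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features/target from the encoded frame built earlier (numeric columns only).
X = (train_encoded.select_dtypes(include="number")
     .drop(columns=["rating_class", "overall"], errors="ignore")
     .fillna(0))
y = train_encoded["rating_class"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

errors = []
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(1 - knn.score(X_val, y_val))  # validation error rate

best_k = int(np.argmin(errors)) + 1
print("best k:", best_k)
```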

Even though the test accuracy is reduced, we observe that the precision and recall have increased, indicating that our model now classifies more of class 1 (players rated above 70), our target class, correctly.

[Figure: precision-recall results for the tuned k-NN model]

3.3. DECISION TREE

The decision tree method is a powerful statistical tool for classification, prediction, interpretation, and data manipulation that has several potential applications in many fields.


Using decision tree models has the following advantages:


  • Simplifies complex relationships between input variables and target variables by dividing original input variables into significant subgroups.

  • A non-parametric approach with no distributional assumptions, so it is easy to understand and interpret.

The main disadvantage is that it can be subject to overfitting and underfitting, particularly when using a small data set.


Experiment:


We trained our decision tree classifier from the sklearn library without passing any parameters. From the table, we observed that the model overfits the data, so we must tune the parameters to get optimized results.

[Table: baseline decision tree results]

Optimization:


We worked with the following parameters:


  • criterion: string, optional (default='gini'):

[Figure: the 'gini' and 'entropy' splitting criteria]
  • max_depth: int or None, optional (default=None):


The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


  • min_samples_split: int, float, optional (default=2):


The minimum number of samples required to split an internal node; if a float, it represents a fraction of the samples.

  • min_weight_fraction_leaf: float, optional (default=0.0):

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided

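A minimal sketch of a grid search over these parameters (the grid values are illustrative, not the ones used in our experiments; X_tr and y_tr are the training split from the k-NN sketch above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 10, 20],
    "min_weight_fraction_leaf": [0.0, 0.05],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                    scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.best_score_)
```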

[Table: decision tree grid-search results]

From the experiments above, we see that Gini outperforms entropy across all variants of the experimental parameters, so our splitting criterion is Gini. Similarly, looking at the other parameters, max_depth = 10, min_samples_split = 17.5, and min_weight_fraction_leaf = 0 with Gini give higher accuracy. Training the model with these parameters, we observe no overfitting, and we capture more true positives in the class 1 category.

[Table: tuned decision tree results]

3.4. SUPPORT VECTOR MACHINES

The folklore view of SVMs is that they find an "optimal" hyperplane as the solution to the learning problem. The simplest formulation of the SVM is the linear one, in which the hyperplane lies in the space of the input data x.

In this case, the hypothesis space is a subset of all hyperplanes of the form:


f(x) = w⋅x +b.


Hard Margin Case:


[Equations: hard-margin SVM formulation]

The maximum margin separating hyperplane objective is to find:


[Equation: maximum-margin objective]

Soft Margin Case:


Slack variables are part of the objective function too:


[Equations: soft-margin SVM objective with slack variables]
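
For reference, a sketch of the standard soft-margin objective, with slack variables ξ_i and the cost coefficient C discussed below:

```latex
\min_{w, b, \xi}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 .
```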

The cost coefficient C > 0 is a hyperparameter that specifies the misclassification penalty; it is tuned by the user based on the classification task and dataset characteristics.


RBF SVMs


In general, the RBF kernel is a reasonable first choice. This kernel nonlinearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle cases where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of the RBF kernel, since a linear kernel with penalty parameter Ĉ has the same performance as an RBF kernel with some parameters (C, γ). A second consideration is the number of hyperparameters, which influences the complexity of model selection.
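
For reference, the RBF kernel in its standard form (γ > 0 controls the kernel width):

```latex
K(x, x') = \exp\bigl(-\gamma \lVert x - x' \rVert^2\bigr)
```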

Experiments:


We subjected our training data to a linear SVM classifier without tuning it for soft margins. The observed results do look promising; however,

[Table: linear SVM baseline results]

The reason for the good score was that the data was almost linearly separable most of the time with very few misclassifications.



Optimization:


We decided to run a grid search with linear and radial basis function kernels, varying C and γ, to train our model efficiently. From the grid search, we obtained the best linear-kernel estimator as

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=’ovr’, degree=3, gamma=’auto_deprecated’, kernel=’linear’, max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)


[Table: results for the best linear-kernel SVM]

And for the radial basis function kernel, we got the best estimator as

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=’ovr’, degree=3, gamma=0.001, kernel=’rbf’, max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)

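A minimal sketch of a grid search along these lines (the parameter grids are illustrative; the best estimators reported above came from the search described in the text):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
]

search = GridSearchCV(SVC(probability=True), param_grid,
                      scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_tr, y_tr)
print(search.best_estimator_)
```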

Since the generalization error (expected loss) is used to approximate the population error, we observed that the validation error of the RBF-kernel model is the smallest among all our models. This is also our best model, as it fits the data better than the rest.

[Table: RBF-kernel SVM results]

The RBF kernel maps the data into a higher (effectively infinite) dimensional space, which helped our model stand out. The precision-recall curve shows how well the positive class is predicted, with an AUC of 0.961.

4. MODEL 2

Here, X is the same as before, now including the predicted rating from Model 1, and Y = release clause + 52 × wage as the cost to investors (the wage is given weekly, hence the factor of 52).
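
A minimal sketch of assembling this target in pandas (the column names `release_clause_eur` and `wage_eur` follow the Kaggle dataset and are assumptions here):

```python
# Cost to investors: release clause plus a full year (52 weeks) of wages.
for df in (train, test):
    df["cost_to_investor"] = df["release_clause_eur"] + 52 * df["wage_eur"]
```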

After selecting significant variables from the univariate and bivariate analyses as before, we plotted scatter plots of the independent variables against the dependent variable.

[Figure: scatter plots of the independent variables against the cost to investors]

It is clearly visible that they follow a relationship, but it does not seem linear. We confirmed this by developing a linear model.


4.1. Linear Model

Results:


R square (train): 0.54

R square (validation): 0.55

R square (test): 0.54

R square is a measure of closeness to perfect prediction. Here, the R square values are not good.

Checking linearity from the residuals: the residuals should be randomly scattered, but we found that they are not. This means a linear model would never be a good choice for fitting this relationship.

[Figure: residual plot of the linear model]

4.2. Decision Trees: This was a better choice than the linear model in this scenario.

Results (Baseline):


Train Data: R square 0.99, RMSE 0.05

Validation Data: R square 0.54, RMSE 8.05

Test Data: R square 0.59, RMSE 7.35

There was a clear indication of overfitting: the model was not performing as expected. Therefore, we tried a grid search over min_split, tree_depth, min_weight_fraction_leaf, and the learning criterion.

As shown in the grid-search results, entropy performed better, with min_split = 3 and max_depth = 15.

[Table: grid-search results for the decision tree regressor]

Results after Grid Search: (Main Model)


Train Data: R square 0.85, RMSE 4.40

Validation Data: R square 0.69, RMSE 6.59

Test Data: R square 0.70, RMSE 6.26

The R square values look far better now. The RMSE is also low, and the problem of overfitting is solved.

Hence, decision trees performed better here for predicting the cost to investors.

Final Strategy:


The final step was to devise a strategy for picking players for our team, keeping in mind:

  • The rating should be greater than 70 (i.e., class 1).
  • The budget is 1 billion Euros, with a squad of around 30 players.

First, we selected only the players whose rating exceeds the threshold of 70, which leaves 5,276 players.

Second, we performed an analysis similar to a decile analysis of the cost to investors: we made buckets of approximately 30 players each from the remaining pool and sorted the buckets by cost to investors in descending order, as sketched below.
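
A rough pandas sketch of this bucketing (it reuses the `rating_class` and `cost_to_investor` columns constructed earlier; whether the predicted or the actual rating class is used is an implementation choice):

```python
# Keep only players in class 1 (rating above 70) and rank them by cost.
pool = test[test["rating_class"] == 1].copy()
pool = pool.sort_values("cost_to_investor", ascending=False).reset_index(drop=True)

# Buckets of ~30 players each, numbered from the most expensive downwards.
pool["bucket"] = pool.index // 30 + 1

# Total cost of recruiting an entire 30-player team from each bucket.
bucket_cost = pool.groupby("bucket")["cost_to_investor"].sum()
print(bucket_cost.head(12))
```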

[Table: total cost to investors for each bucket of ~30 players]

Here, we can observe that the amount needed to pick the whole team from the first bucket is 3.45 billion Euros, which is out of budget. That means we cannot simply pick the top 30 players. The amount needed to pick the team from the 11th bucket is 0.945 billion Euros, which fits our budget; however, picking all the players from this bucket alone would be a poor strategy, as we would leave out almost 300 highly valued players ranked above it. So, the best solution is to pick 8 to 10 core players from the top buckets and the rest of the players from medium- and lower-valued buckets.

This decision can easily be made from the analysis above, and it is up to the investors and team managers to decide what kind of players they want on their team.

5. CONCLUSION

In this work, we constructed two models that use machine learning algorithms to benefit investors: the first meaningfully classifies players as good performers, and the second regresses their cost so the selection fits the investor's budget. The resulting classification-plus-regression pipeline is a new selection model, built with supervised learning, for picking players who can outperform other teams. Ultimately, we narrowed down the player-selection process within a club, which is considerably better than selecting at random.

Future Scope: We could also apply time-series techniques, since both of our dependent variables, rating and cost, depend on previous years' data. For example, if a certain player has a rating of 85 in Dec '19, his rating in Jan '20 would be around 85 +/- 3. Time-series techniques might therefore be useful for this data.


Translated from: https://towardsdatascience.com/buying-a-soccer-team-a-machine-learning-approach-283f51d52511
