Machine Learning: GridSearchCV and RandomizedSearchCV

Introduction

In this article, I'd like to discuss how we can improve the performance of a machine learning model by tuning its hyperparameters.

To improve a model, it is important to know the meaning of the parameters we want to adjust. For this reason, before discussing GridSearchCV and RandomizedSearchCV, I will start by explaining two parameters: C and gamma.

Part I: An overview of some parameters in SVC

In logistic regression and the support vector classifier (SVC), the parameter that determines the strength of the regularization is called C.

With a high C, we have less regularization, which means the model tries to fit the training set as closely as possible. With low values of C, the algorithm instead tries to adjust to the "majority" of data points, which tends to improve the generalization of the model.
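As a rough illustration of how C controls regularization, the sketch below fits a logistic regression with three values of C and prints the size of the learned coefficients. The synthetic dataset from make_classification and the particular C values are assumptions chosen for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # Stronger regularization (small C) shrinks the coefficients toward zero.
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

The coefficient norm should be smallest for the smallest C, since a small C penalizes large coefficients more heavily.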

There is another important parameter called gamma. Before talking about it, though, I think it is important to understand the limitations of linear models.

Linear models can be quite limiting in low-dimensional spaces, as lines and hyperplanes have limited flexibility. One way to make a linear model more flexible is by adding more features, for example, by adding interactions or polynomials of the input features.

In two dimensions, a linear model for classification can only separate points using a line, and that is not always the best choice. One solution is to represent the points in a three-dimensional space instead of a two-dimensional one. In three-dimensional space, we can create a plane that divides and classifies the points of our dataset more precisely.
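The idea of lifting points into a third dimension can be sketched as follows. This is only an illustration under assumptions: the data comes from make_circles (two concentric rings, which no straight line can separate), and the added third feature, the squared distance from the origin, is a hand-picked choice for this dataset.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two concentric rings: not linearly separable in two dimensions.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add a third feature: the squared distance of each point from the origin.
X_3d = np.column_stack([X, (X ** 2).sum(axis=1)])

# Mean 5-fold cross-validated accuracy of the same linear model on both versions.
acc_2d = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
acc_3d = cross_val_score(LogisticRegression(), X_3d, y, cv=5).mean()

print(f"2D features: {acc_2d:.2f}")
print(f"3D features: {acc_3d:.2f}")
```

In the three-dimensional representation, the inner ring sits lower than the outer ring along the new axis, so a plane can separate them.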

There are two ways to map your data into a higher-dimensional space: the polynomial kernel, which computes all possible polynomials of the original features up to a certain degree; and the radial basis function (RBF) kernel, also known as the Gaussian kernel, which measures the distance between data points. Here, the task of gamma is to control the width of the Gaussian kernel: a small gamma gives a wide kernel, so that points far apart still count as similar, while a large gamma gives a narrow kernel.
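To make the role of gamma concrete, here is a minimal sketch that computes the RBF kernel value exp(-gamma * ||x - z||^2) by hand and compares it with scikit-learn's rbf_kernel. The helper name rbf and the two sample points are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two points."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])  # squared distance to x is 2

for gamma in [0.1, 1.0, 10.0]:
    manual = rbf(x[0], z[0], gamma)
    library = rbf_kernel(x, z, gamma=gamma)[0, 0]
    print(f"gamma={gamma}: manual={manual:.6f}, sklearn={library:.6f}")
```

As gamma grows, the similarity between the same two points drops toward zero, which is what "a narrower kernel" means in practice.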

Part II: GridSearchCV

As I showed in my previous article, cross-validation lets us evaluate and improve our model. There is another interesting technique for improving and evaluating a model, called grid search.

Grid search is an effective method for adjusting the parameters of a supervised learning model and improving its generalization performance. With grid search, we try all possible combinations of the parameters of interest and pick the best one.
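Conceptually, grid search is just a loop over every combination of candidate values, scoring each one with cross-validation. The sketch below spells that loop out by hand. The iris dataset and the small 3 × 3 grid are assumptions made to keep the example quick; this is an illustration of the idea rather than how scikit-learn implements it internally.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A deliberately small, hypothetical grid to keep the loop fast.
C_values = [0.1, 1, 10]
gamma_values = [0.01, 0.1, 1]

best_score, best_params = -1.0, None
for C, gamma in product(C_values, gamma_values):
    # Mean accuracy over 5 cross-validation folds for this combination.
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params, round(best_score, 3))
```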

Scikit-learn provides the GridSearchCV class. We first need to specify the parameters we want to search, and then GridSearchCV will perform all the necessary model fits. For example, we can create the dictionary below, which lists all the parameter values we want to search over for our model.

parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

Then we can instantiate the GridSearchCV class with an SVC model and 5-fold cross-validation (cv=5), so each of the 6 × 6 = 36 parameter combinations is evaluated on 5 train/validation splits. We also need to split our data into a training set and a test set, so that the final evaluation uses data that played no part in choosing the parameters.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# X and y are the feature matrix and labels, assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validated grid search over the parameters dictionary
search = GridSearchCV(SVC(), parameters, cv=5)

Now we can fit the search object that we have created with our training data.

search.fit(X_train, y_train)

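The GridSearchCV object searches for the best parameters and, because refit=True by default, automatically fits a new model with those parameters on the whole training set. The results are then available through attributes such as best_params_ and best_score_, and the refit model can be evaluated on the held-out data with search.score(X_test, y_test).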
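For readers who want to run the grid search end to end, here is a self-contained sketch. The original text does not say which dataset X and y come from, so the iris dataset is used here purely as an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: iris stands in for the unspecified X and y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

parameters = {"C": [0.001, 0.01, 0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(SVC(), parameters, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation score:", round(search.best_score_, 3))
# score() uses the model refit on the whole training set (refit=True by default).
print("Test set score:", round(search.score(X_test, y_test), 3))
```

The test set is touched only once, after the search has finished, so its score is an estimate that was not used to pick C and gamma.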

Part III: RandomizedSearchCV

RandomizedSearchCV is very useful when we have many parameters to try and the training time is very long. Instead of trying every combination, it samples a fixed number of parameter settings (n_iter) from the lists or distributions we specify. For this example, I use a random forest classifier, and I assume you already know how this kind of algorithm works.
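To get a feel for the cost difference, the sketch below counts how many candidates an exhaustive grid would contain and compares it with a random sample of 10 settings. The grid values are hypothetical discrete choices for a random forest, and the 5 cross-validation folds are an assumption used only for the arithmetic.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical discrete grid for a random forest (illustrative values only).
grid = {
    "max_depth": [6, 9, None],
    "n_estimators": [50, 70, 100, 150],
    "max_features": [1, 2, 3, 4, 5],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
    "min_samples_leaf": [1, 2, 3],
}

# Every combination: 3 * 4 * 5 * 2 * 2 * 3 candidates.
n_grid = len(ParameterGrid(grid))

# A random sample of 10 candidates from the same grid.
sampled = list(ParameterSampler(grid, n_iter=10, random_state=0))

cv_folds = 5  # assumed number of folds
print("Grid search fits:  ", n_grid * cv_folds)
print("Random search fits:", len(sampled) * cv_folds)
```

With these values the exhaustive grid needs 3600 model fits, against 50 for the random sample.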

The first step is to define the parameters we want to consider, giving each one either a list of candidate values or a distribution to sample from. The search will then select the best setting among the sampled candidates.

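from scipy.stats import randint

# randint(low, high) samples integers from low up to, but not including, high
param = {'max_depth': [6, 9, None],
         'n_estimators': [50, 70, 100, 150],
         'max_features': randint(1, 6),
         'criterion': ['gini', 'entropy'],
         'bootstrap': [True, False],
         'min_samples_leaf': randint(1, 4)}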

Now we can create our RandomizedSearchCV object and fit the data. Finally, we can find the best parameters and the best scores.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

rnd_search = RandomizedSearchCV(RandomForestClassifier(), param,
                                n_iter=10, cv=9)
rnd_search.fit(X, y)
rnd_search.best_params_
rnd_search.best_score_
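A self-contained version of the random search might look like the sketch below. Several things here are assumptions for illustration: the synthetic dataset from make_classification stands in for the unspecified X and y, random_state is fixed for repeatability, and cv=3 is used instead of cv=9 to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumption: synthetic data with 8 features, so max_features up to 5 is valid.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param = {
    "max_depth": [6, 9, None],
    "n_estimators": [50, 70, 100, 150],
    "max_features": randint(1, 6),      # integers 1..5
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
    "min_samples_leaf": randint(1, 4),  # integers 1..3
}

# Sample 10 parameter settings and score each with 3-fold cross-validation.
rnd_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param,
    n_iter=10, cv=3, random_state=0)
rnd_search.fit(X, y)

print("Best parameters:", rnd_search.best_params_)
print("Best cross-validation score:", round(rnd_search.best_score_, 3))
```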

Conclusion

So, grid search is a good choice when we work with a small number of hyperparameters. However, if the number of parameters to consider is particularly high and they differ widely in how much they influence the result, random search is the better choice.

Thanks for reading.

Translated from: https://towardsdatascience.com/machine-learning-gridsearchcv-randomizedsearchcv-d36b89231b10
