Scikit Learn - Stochastic Gradient Descent

This article introduces Stochastic Gradient Descent (SGD), the optimization algorithm in Scikit-learn used to train linear classifiers and regression models. SGD is known for being simple and efficient, and it supports a variety of loss functions and penalty terms. For classification, SGDClassifier offers options such as squared and Huber-type losses; for regression, SGDRegressor provides, in addition to the squared loss, loss functions that are tolerant of outliers. Although SGD requires hyperparameter tuning and is sensitive to feature scaling, its fast training makes it a common choice for large-scale datasets.

Scikit Learn - Stochastic Gradient Descent

Here, we will learn about an optimization algorithm in Sklearn, termed Stochastic Gradient Descent (SGD).


Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression. It has been successfully applied to large-scale datasets because the update to the coefficients is performed for each training instance, rather than once at the end of a full pass over the instances.

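To make the per-instance update concrete, the following is a minimal sketch of the idea (it is not Scikit-learn's actual implementation, and sgd_linear_fit is a hypothetical helper name) showing the coefficients of a squared-loss linear model being adjusted after every single training sample −


import numpy as np

def sgd_linear_fit(X, y, lr = 0.01, n_epochs = 100):
   # Minimal SGD sketch: the weights are updated after each training instance,
   # not once per full pass over the dataset.
   n_samples, n_features = X.shape
   w = np.zeros(n_features)
   b = 0.0
   for epoch in range(n_epochs):
      for i in np.random.permutation(n_samples):   # visit samples in a shuffled order
         error = (X[i] @ w + b) - y[i]             # prediction error for one sample
         w -= lr * error * X[i]                    # gradient step on the squared loss
         b -= lr * error
   return w, b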

SGD Classifier

Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning routine supporting various loss functions and penalties for classification. Scikit-learn provides SGDClassifier module to implement SGD classification.


Parameters

The following table lists the parameters used by the SGDClassifier module −

Sr.No   Parameter & Description

1. loss − str, default = ‘hinge’

It represents the loss function to be used while implementing. The default value is ‘hinge’, which will give us a linear SVM. The other options which can be used are −

  • log − This loss will give us logistic regression, i.e. a probabilistic classifier.

  • modified_huber − a smooth loss that brings tolerance to outliers along with probability estimates.

  • squared_hinge − similar to ‘hinge’ loss but it is quadratically penalized.

  • perceptron − as the name suggests, it is the linear loss used by the perceptron algorithm.

2. penalty − str, ‘none’, ‘l2’, ‘l1’, ‘elasticnet’, default = ‘l2’

It is the regularization term used in the model. By default, it is L2. We can use L1 or ‘elasticnet’ as well, but both might bring sparsity to the model, which is not achievable with L2.

3. alpha − float, default = 0.0001

Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001.

4. l1_ratio − float, default = 0.15

This is called the ElasticNet mixing parameter. Its range is 0 <= l1_ratio <= 1. If l1_ratio = 1, the penalty would be an L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty.

5. fit_intercept − Boolean, default = True

This parameter specifies that a constant (bias or intercept) should be added to the decision function. If it is set to False, no intercept will be used in the calculation and the data will be assumed to be already centered.

6. tol − float or None, optional, default = 1.e-3

This parameter represents the stopping criterion for iterations. If it is not set to None, the iterations will stop when loss > best_loss - tol for n_iter_no_change successive epochs.

7. shuffle − Boolean, optional, default = True

This parameter represents whether we want our training data to be shuffled after each epoch or not.

8. verbose − integer, default = 0

It represents the verbosity level. Its default value is 0.

9. epsilon − float, default = 0.1

This parameter specifies the width of the insensitive region. If loss = ‘epsilon_insensitive’, any difference between the current prediction and the correct label that is smaller than this threshold will be ignored.

10. max_iter − int, optional, default = 1000

As the name suggests, it represents the maximum number of passes over the training data (i.e. epochs).

11. warm_start − bool, optional, default = False

With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e. False, it will erase the previous solution.

12. random_state − int, RandomState instance or None, optional, default = None

This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. The options are as follows −

  • int − In this case, random_state is the seed used by the random number generator.

  • RandomState instance − In this case, random_state is the random number generator.

  • None − In this case, the random number generator is the RandomState instance used by np.random.

13. n_jobs − int or None, optional, default = None

It represents the number of CPUs to be used in the OVA (One Versus All) computation for multi-class problems. The default value is None, which means 1.

14. learning_rate − string, optional, default = ‘optimal’

  • If learning rate is ‘constant’, eta = eta0;

  • If learning rate is ‘optimal’, eta = 1.0/(alpha*(t+t0)), where t0 is chosen by Leon Bottou;

  • If learning rate is ‘invscaling’, eta = eta0/pow(t, power_t);

  • If learning rate is ‘adaptive’, eta = eta0.

15. eta0 − double, default = 0.0

It represents the initial learning rate for the above-mentioned learning rate options, i.e. ‘constant’, ‘invscaling’, or ‘adaptive’.

16. power_t − double, default = 0.5

It is the exponent for the ‘invscaling’ learning rate.

17. early_stopping − bool, default = False

This parameter represents the use of early stopping to terminate training when the validation score is not improving. Its default value is False, but when set to True, it automatically sets aside a stratified fraction of the training data as a validation set and stops training when the validation score is not improving.

18. validation_fraction − float, default = 0.1

It is only used when early_stopping is True. It represents the proportion of training data to set aside as a validation set for early stopping.

19. n_iter_no_change − int, default = 5

It represents the number of iterations with no improvement that the algorithm should run before early stopping.

20. class_weight − dict, {class_label: weight}, ‘balanced’, or None, optional

This parameter represents the weights associated with classes. If not provided, all classes are supposed to have weight 1.

21. average − Boolean or int, optional, default = False

If set to True, it computes the averaged SGD weights across all updates and stores the result in the coef_ attribute. If set to an int greater than 1, averaging begins once the total number of samples seen reaches that value.

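To see how several of these parameters fit together, the following is a small illustrative sketch (the synthetic data and the particular parameter values are arbitrary choices for demonstration, not recommendations). It uses the ‘log’ loss to obtain a probabilistic classifier, an elastic-net penalty, an inverse-scaling learning rate and early stopping −


import numpy as np
from sklearn import linear_model

# Two well-separated blobs of points, 50 samples per class
rng = np.random.RandomState(0)
X = np.r_[rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]]
Y = np.r_[np.ones(50), np.zeros(50)]

# 'log' loss gives a logistic-regression style probabilistic classifier
# (recent Scikit-learn releases name this loss 'log_loss' instead).
clf = linear_model.SGDClassifier(
   loss = 'log', penalty = 'elasticnet', l1_ratio = 0.15, alpha = 0.0001,
   learning_rate = 'invscaling', eta0 = 0.1, power_t = 0.5,
   early_stopping = True, validation_fraction = 0.2, n_iter_no_change = 5,
   max_iter = 1000, tol = 1e-3, random_state = 0
)
clf.fit(X, Y)
clf.predict_proba([[2., 2.]])   # class probabilities for a new sample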

Attributes

The following table lists the attributes used by the SGDClassifier module −

Sr.No   Attributes & Description

1. coef_ − array, shape (1, n_features) if n_classes == 2, else (n_classes, n_features)

This attribute provides the weight assigned to the features.

2. intercept_ − array, shape (1,) if n_classes == 2, else (n_classes,)

It represents the independent term in the decision function.

3. n_iter_ − int

It gives the number of iterations run to reach the stopping criterion.


Implementation Example


Like other classifiers, Stochastic Gradient Descent (SGD) has to be fitted with the following two arrays −

  • An array X holding the training samples. It is of size [n_samples, n_features].

  • An array Y holding the target values, i.e. class labels for the training samples. It is of size [n_samples].

Example

The following Python script uses the SGDClassifier linear model −


import numpy as np
from sklearn import linear_model

# Four training samples with two features each, and their class labels
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

# Fit an SGD classifier with the default hinge loss and an elastic-net penalty
SGDClf = linear_model.SGDClassifier(max_iter = 1000, tol = 1e-3, penalty = "elasticnet")
SGDClf.fit(X, Y)

Output



SGDClassifier(
   alpha = 0.0001, average = False, class_weight = None,
   early_stopping = False, epsilon = 0.1, eta0 = 0.0, fit_intercept = True,
   l1_ratio = 0.15, learning_rate = 'optimal', loss = 'hinge', max_iter = 1000,
   n_iter = None, n_iter_no_change = 5, n_jobs = None, penalty = 'elasticnet',
   power_t = 0.5, random_state = None, shuffle = True, tol = 0.001,
   validation_fraction = 0.1, verbose = 0, warm_start = False
)

Example

Now, once fitted, the model can predict new values as follows −



SGDClf.predict([[2.,2.]])

Output



array([2])

Example

For the above example, we can get the weight vector with the help of the following Python script −


SGDClf.coef_

Output



array([[19.54811198, 9.77200712]])

Example

Similarly, we can get the value of the intercept with the help of the following Python script −


SGDClf.intercept_

Output



array([10.])

Example

We can get the signed distance to the hyperplane by using SGDClassifier.decision_function, as shown in the following Python script −


SGDClf.decision_function([[2., 2.]])

Output



array([68.6402382])
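Likewise, the n_iter_ attribute listed in the attributes table reports how many iterations were actually run before the stopping criterion was met. It can be inspected with the following Python script (the returned integer depends on the data and parameters, so no output is shown here) −


SGDClf.n_iter_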

SGD Regressor

Stochastic Gradient Descent (SGD) regressor basically implements a plain SGD learning routine supporting various loss functions and penalties to fit linear regression models. Scikit-learn provides SGDRegressor module to implement SGD regression.


Parameters

The parameters used by SGDRegressor are almost the same as those used in the SGDClassifier module. The difference lies in the ‘loss’ parameter. For the SGDRegressor module's loss parameter, the possible values are as follows −

  • squared_loss − It refers to the ordinary least squares fit.

  • huber − It corrects the outliers by switching from squared to linear loss past a distance of epsilon. The role of ‘huber’ is to modify ‘squared_loss’ so that the algorithm focuses less on correcting outliers.

  • epsilon_insensitive − Actually, it ignores errors smaller than epsilon.

  • squared_epsilon_insensitive − It is the same as epsilon_insensitive. The only difference is that it becomes a squared loss past a tolerance of epsilon.

Another difference is that the parameter named ‘power_t’ has the default value of 0.25 rather than 0.5 as in SGDClassifier. Furthermore, it doesn’t have ‘class_weight’ and ‘n_jobs’ parameters.

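As a brief illustrative sketch (the synthetic data and the chosen epsilon value are arbitrary, for demonstration only), the epsilon_insensitive loss can be selected like this, and the 0.25 default for power_t can be confirmed on the fitted estimator −


import numpy as np
from sklearn import linear_model

# Small synthetic regression problem: 20 samples, 3 features
rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(20)

# Errors smaller than epsilon are ignored by the epsilon_insensitive loss
reg = linear_model.SGDRegressor(loss = 'epsilon_insensitive', epsilon = 0.1, max_iter = 1000, tol = 1e-3)
reg.fit(X, y)
print(reg.power_t)   # 0.25, the SGDRegressor default mentioned above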

Attributes

The attributes of SGDRegressor are also the same as those of the SGDClassifier module. In addition, it has three extra attributes as follows −

  • average_coef_ − array, shape (n_features,)

    As the name suggests, it provides the average weights assigned to the features.

  • average_intercept_ − array, shape (1,)

    As the name suggests, it provides the averaged intercept term.

  • t_ − int

    It provides the number of weight updates performed during the training phase.

Note − The attributes average_coef_ and average_intercept_ are only available after setting the parameter ‘average’ to True.

Implementation Example


The following Python script uses the SGDRegressor linear model −


import numpy as np
from sklearn import linear_model

# Random regression data: 10 samples with 5 features
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

# Huber loss is tolerant to outliers; average = True enables averaged SGD weights
SGDReg = linear_model.SGDRegressor(
   max_iter = 1000, penalty = "elasticnet", loss = 'huber', tol = 1e-3, average = True
)
SGDReg.fit(X, y)

Output



SGDRegressor(
   alpha = 0.0001, average = True, early_stopping = False, epsilon = 0.1,
   eta0 = 0.01, fit_intercept = True, l1_ratio = 0.15,
   learning_rate = 'invscaling', loss = 'huber', max_iter = 1000,
   n_iter = None, n_iter_no_change = 5, penalty = 'elasticnet', power_t = 0.25,
   random_state = None, shuffle = True, tol = 0.001, validation_fraction = 0.1,
   verbose = 0, warm_start = False
)

Example

Now, once fitted, we can get the weight vector with the help of the following Python script −


SGDReg.coef_

Output



array([-0.00423314, 0.00362922, -0.00380136, 0.00585455, 0.00396787])

Example

Similarly, we can get the value of the intercept with the help of the following Python script −


SGDReg.intercept_

Output

The output is a one-element array of shape (1,) containing the fitted intercept term (the exact value depends on the Scikit-learn version).

Example

We can get the number of weight updates performed during the training phase with the help of the following Python script −


SGDReg.t_

Output



61.0
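Since average = True was passed when constructing SGDReg above, the extra attributes described earlier can be read in the same way. As a brief illustrative note (the exact values depend on the data, and availability of these averaged attributes may depend on the Scikit-learn version), they are accessed as follows −


SGDReg.average_coef_
SGDReg.average_intercept_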

Pros and Cons of SGD

Following are the pros of SGD −

  • Stochastic Gradient Descent (SGD) is very efficient.


  • It is very easy to implement as there are lots of opportunities for code tuning.


Following are the cons of SGD −

  • Stochastic Gradient Descent (SGD) requires several hyperparameters like regularization parameters.

  • It is sensitive to feature scaling, so in practice the features are usually standardized before training (a brief sketch of this follows below).
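As an illustrative sketch of that last point (not part of the original tutorial; the data values are arbitrary), SGDClassifier is commonly combined with StandardScaler in a Pipeline so that all features are on a comparable scale before the gradient updates −


import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Two features on very different scales
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [-1.0, -2000.0], [-2.0, -4000.0]])
Y = np.array([1, 1, 2, 2])

# Standardize the features, then fit the SGD classifier on the scaled data
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter = 1000, tol = 1e-3))
clf.fit(X, Y)
clf.predict([[1.5, 2500.0]])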

Translated from: https://www.tutorialspoint.com/scikit_learn/scikit_learn_stochastic_gradient_descent.htm
