scikit-learn Study Series - Generalized Linear Models

The following is a set of methods intended for regression, in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if ŷ is the predicted value, then

\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p

Across the module, we designate the vector w = (w_1, ..., w_p) as coef_ and w_0 as intercept_.

To perform classification with generalized linear models, see Logistic regression.

1. Ordinary Least Squares

LinearRegression fits a linear model with coefficients w = (w_1, ..., w_p) to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Mathematically it solves a problem of the form:

\min_{w} ||Xw - y||_2^2

[Figure: ordinary least squares fit]

LinearRegression will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member:

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
...                                       
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
>>> reg.coef_
array([0.5, 0.5])

However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular; as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.

Examples:

1.1. Ordinary Least Squares Complexity

This method computes the least squares solution using a singular value decomposition of X. If X is a matrix of size (n, p), this method has a cost of O(n p^2), assuming that n ≥ p.

2. Ridge Regression

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

\min_{w} ||Xw - y||_2^2 + \alpha ||w||_2^2

Here, α ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of α, the greater the amount of shrinkage and thus the more robust the coefficients become to collinearity.

[Figure: ridge coefficients as a function of the regularization]

As with other linear models, Ridge will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member:

>>> from sklearn import linear_model
>>> reg = linear_model.Ridge(alpha=.5)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) 
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)
>>> reg.coef_
array([0.34545455, 0.34545455])
>>> reg.intercept_ 
0.13636...

Examples:

2.1. Complexity

This method has the same order of complexity as Ordinary Least Squares.

2.2. Setting the regularization parameter: generalized Cross-Validation

RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV, except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation:

>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])       
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3, fit_intercept=True, scoring=None,
    normalize=False)
>>> reg.alpha_                                      
0.1

References:

3. Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution depends. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)).

Mathematically, it consists of a linear model trained with an ℓ1 prior as regularizer. The objective function to minimize is:

\min_{w} \frac{1}{2 n_{samples}} ||Xw - y||_2^2 + \alpha ||w||_1

The lasso estimate thus solves the minimization of the least-squares penalty with α||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least Angle Regression for another implementation:

>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha=0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.predict([[1, 1]])
array([0.8])

Also useful for lower-level tasks is the function lasso_path, which computes the coefficients along the full path of possible values.
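
As a quick, minimal sketch of that lower-level interface (the toy data and the n_alphas value below are illustrative choices, not taken from the original text):

>>> import numpy as np
>>> from sklearn.linear_model import lasso_path
>>> X = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 9.]])
>>> y = np.array([1., 2., 3., 4.5])
>>> # alphas: the grid of regularization strengths,
>>> # coefs: one column of coefficients per alpha, dual_gaps: convergence info
>>> alphas, coefs, dual_gaps = lasso_path(X, y, n_alphas=5)
>>> coefs.shape        # (n_features, n_alphas)
(2, 5)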

Examples:

Note: Feature selection with the Lasso

As Lasso regression yields sparse models, it can be used to perform feature selection, as detailed in L1-based feature selection.

The two references below explain the iterations used in the coordinate descent solver of scikit-learn, as well as the duality gap computation used for convergence control.

References:

  • “Regularization Path For Generalized linear Models by Coordinate Descent”, Friedman, Hastie & Tibshirani, J Stat Softw, 2010 (Paper).
  • “An Interior-Point Method for Large-Scale L1-Regularized Least Squares,” S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky, in IEEE Journal of Selected Topics in Signal Processing, 2007 (Paper)

3.1. Setting the regularization parameter

The alpha parameter controls the degree of sparsity of the estimated coefficients.

3.1.1. Using cross-validation

scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV. LassoLarsCV is based on the Least Angle Regression algorithm explained below.

For high-dimensional datasets with many collinear regressors, LassoCV is most often preferable. However, LassoLarsCV has the advantage of exploring more relevant values of the alpha parameter, and if the number of samples is very small compared with the number of features, it is often faster than LassoCV.
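
A minimal sketch of the cross-validated object, assuming synthetic data from make_regression (the dataset shape and cv value are illustrative, not from the original text):

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LassoCV
>>> # synthetic data in which only a few of the 20 features are informative
>>> X, y = make_regression(n_samples=60, n_features=20, n_informative=5,
...                        noise=1.0, random_state=0)
>>> reg = LassoCV(cv=5, random_state=0).fit(X, y)
>>> # reg.alpha_ holds the regularization strength chosen by 5-fold CV and
>>> # reg.coef_ the coefficients refit on the full data with that alpha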

[Figures: cross-validated Lasso alpha selection (lasso_cv_1, lasso_cv_2)]

3.1.2. Information-criteria based model selection

Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayes Information criterion (BIC). It is a computationally cheaper alternative for finding the optimal value of alpha, as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria need a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).
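
A minimal sketch, with the choice of criterion and the synthetic data being illustrative assumptions:

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LassoLarsIC
>>> X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
...                        noise=1.0, random_state=0)
>>> reg = LassoLarsIC(criterion='bic').fit(X, y)   # criterion='aic' also works
>>> # reg.alpha_ is the regularization value that minimizes the chosen
>>> # criterion along the single regularization path computed by LARS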

[Figure: Lasso model selection via AIC/BIC and cross-validation]

Examples:

3.1.3. Comparison with the regularization parameter of SVM

The equivalence between alpha and the regularization parameter of SVM, C, is given by alpha = 1 / C or alpha = 1 / (n_samples * C), depending on the estimator and the exact objective function optimized by the model.

4. Multi-task Lasso

The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.

The following figure compares the location of the non-zero entries in the coefficient matrix W obtained with a simple Lasso or a MultiTaskLasso. The Lasso estimates yield scattered non-zeros, while the non-zeros of the MultiTaskLasso are full columns.

[Figures: sparsity patterns of the coefficient matrix W for the Lasso (multi_task_lasso_1) and the MultiTaskLasso (multi_task_lasso_2)]

Fitting a time-series model, imposing that any active feature be active at all times.

Examples:

Mathematically, it consists of a linear model trained with a mixed ℓ1 ℓ2 prior as regularizer. The objective function to minimize is:

\min_{W} \frac{1}{2 n_{samples}} ||XW - Y||_{Fro}^2 + \alpha ||W||_{21}

where Fro indicates the Frobenius norm:

||A||_{Fro} = \sqrt{\sum_{ij} a_{ij}^2}

and ℓ1 ℓ2 reads:

||A||_{21} = \sum_i \sqrt{\sum_j a_{ij}^2}

The implementation in the class MultiTaskLasso uses coordinate descent as the algorithm to fit the coefficients.
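
A minimal sketch of the joint estimation on a 2D target (the shapes, the alpha value and the random data are illustrative only):

>>> import numpy as np
>>> from sklearn.linear_model import MultiTaskLasso
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(20, 5)
>>> W = np.zeros((5, 3))
>>> W[:2, :] = rng.randn(2, 3)          # only the first 2 features are active
>>> Y = X.dot(W)                        # 3 tasks -> Y has shape (20, 3)
>>> reg = MultiTaskLasso(alpha=0.1).fit(X, Y)
>>> reg.coef_.shape                     # (n_tasks, n_features)
(3, 5)
>>> # the columns of coef_ for the inactive features are driven to zero
>>> # jointly, so the same features are selected for every task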

5. Elastic-Net

ElasticNet is a linear regression model trained with both L1 and L2 priors as regularizer. This combination allows learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of L1 and L2 using the l1_ratio parameter.

Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.

A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation.

The objective function to minimize is in this case:

\min_{w} \frac{1}{2 n_{samples}} ||Xw - y||_2^2 + \alpha \rho ||w||_1 + \frac{\alpha (1 - \rho)}{2} ||w||_2^2

[Figure: Lasso and Elastic-Net coordinate descent paths]

The class ElasticNetCV can be used to set the parameters alpha (α) and l1_ratio (ρ) by cross-validation.
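
A minimal sketch (the l1_ratio grid and the synthetic data are illustrative assumptions, not recommendations from the original text):

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import ElasticNetCV
>>> X, y = make_regression(n_samples=80, n_features=20, n_informative=5,
...                        noise=1.0, random_state=0)
>>> reg = ElasticNetCV(l1_ratio=[.1, .5, .9, 1.], cv=5).fit(X, y)
>>> # reg.alpha_ and reg.l1_ratio_ hold the values selected by cross-validation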

Examples:

The two references below explain the iterations used in the coordinate descent solver of scikit-learn, as well as the duality gap computation used for convergence control.

References:

  • “Regularization Path For Generalized linear Models by Coordinate Descent”, Friedman, Hastie & Tibshirani, J Stat Softw, 2010 (Paper).
  • “An Interior-Point Method for Large-Scale L1-Regularized Least Squares,” S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky, in IEEE Journal of Selected Topics in Signal Processing, 2007 (Paper)

6. Multi-task Elastic-Net

The MultiTaskElasticNet is an elastic-net model that estimates sparse coefficients for multiple regression problems jointly: Y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.

Mathematically, it consists of a linear model trained with a mixed ℓ1 ℓ2 prior and an ℓ2 prior as regularizer. The objective function to minimize is:

\min_{W} \frac{1}{2 n_{samples}} ||XW - Y||_{Fro}^2 + \alpha \rho ||W||_{21} + \frac{\alpha (1 - \rho)}{2} ||W||_{Fro}^2

The implementation in the class MultiTaskElasticNet uses coordinate descent as the algorithm to fit the coefficients.

The class MultiTaskElasticNetCV can be used to set the parameters alpha (α) and l1_ratio (ρ) by cross-validation.

7. Least Angle Regression

Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step, it finds the predictor most correlated with the response. When there are multiple predictors having equal correlation, instead of continuing along the same predictor, it proceeds in a direction equiangular between the predictors.

The advantages of LARS are:

  • It is numerically efficient in contexts where p >> n (i.e., when the number of dimensions is significantly greater than the number of points).
  • It is computationally just as fast as forward selection and has the same order of complexity as Ordinary Least Squares.
  • It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
  • If two variables are almost equally correlated with the response, their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and is also more stable.
  • It is easily modified to produce solutions for other estimators, like the Lasso.

The disadvantages of the LARS method include:

  • Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article.

The LARS model can be used via the estimator Lars, or its low-level implementation lars_path.

8. LARS Lasso

LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate descent, it yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.

[Figure: Lasso path computed with LARS]

 

>>> from sklearn import linear_model
>>> reg = linear_model.LassoLars(alpha=.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])  
LassoLars(alpha=0.1, copy_X=True, eps=..., fit_intercept=True,
     fit_path=True, max_iter=500, normalize=True, positive=False,
     precompute='auto', verbose=False)
>>> reg.coef_    
array([0.717157..., 0.        ])

Examples:

The Lars algorithm provides the full path of the coefficients along the regularization parameter almost for free, so a common operation is to retrieve the path with the function lars_path.
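
A minimal sketch of retrieving the path, with toy data chosen only for illustration:

>>> import numpy as np
>>> from sklearn.linear_model import lars_path
>>> X = np.array([[0., 0.], [1., 1.], [2., 3.], [3., 5.]])
>>> y = np.array([0., 1., 2., 4.])
>>> # alphas: regularization values at each kink of the piecewise-linear path,
>>> # active: indices of the active features, coefs: coefficients at each kink
>>> alphas, active, coefs = lars_path(X, y, method='lasso')
>>> coefs.shape[0] == X.shape[1]        # one row of coefficients per feature
True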

8.1. Mathematical formulation

The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.

Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The full coefficient path is stored in the array coef_path_, which has shape (n_features, max_features + 1). The first column is always zero.

References:

9. Orthogonal Matching Pursuit (OMP)

OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the L0 pseudo-norm).

Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

\arg\min_{\gamma} ||y - X\gamma||_2^2 \quad \text{subject to} \quad ||\gamma||_0 \leq n_{nonzero\_coefs}

Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:

\arg\min_{\gamma} ||\gamma||_0 \quad \text{subject to} \quad ||y - X\gamma||_2^2 \leq \text{tol}

OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection onto the space of the previously chosen dictionary elements.
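
A minimal sketch of recovering a sparse vector by constraining the number of non-zero coefficients (the 3-sparse ground truth and the noiseless setting are illustrative assumptions):

>>> import numpy as np
>>> from sklearn.linear_model import OrthogonalMatchingPursuit
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(50, 20)
>>> w = np.zeros(20)
>>> w[[2, 7, 11]] = [1.5, -2.0, 3.0]    # a 3-sparse ground-truth vector
>>> y = X.dot(w)                        # noiseless observations
>>> reg = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
>>> support = np.flatnonzero(reg.coef_) # expected to recover indices 2, 7, 11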

Examples:

References:

10. Bayesian Regression

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.

This can be done by introducing uninformative priors over the hyperparameters of the model. The ℓ2 regularization used in Ridge Regression is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the parameters w with precision λ⁻¹. Instead of setting lambda manually, it is possible to treat it as a random variable to be estimated from the data.

To obtain a fully probabilistic model, the output y is assumed to be Gaussian distributed around Xw:

p(y|X, w, \alpha) = \mathcal{N}(y | Xw, \alpha)

Alpha is again treated as a random variable that is to be estimated from the data.

The advantages of Bayesian Regression are:

  • It adapts to the data at hand.
  • It can be used to include regularization parameters in the estimation procedure.

The disadvantages of Bayesian regression include:

  • Inference of the model can be time consuming.

References:

  • A good introduction to Bayesian methods is given in C. Bishop: Pattern Recognition and Machine learning
  • Original Algorithm is detailed in the book Bayesian learning for neural networks by Radford M. Neal

10.1. Bayesian Ridge Regression

BayesianRidge estimates a probabilistic model of the regression problem as described above. The prior for the parameter w is given by a spherical Gaussian:

p(w|\lambda) = \mathcal{N}(w | 0, \lambda^{-1} I_p)

The priors over α and λ are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian.

The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge. The parameters w, α and λ are estimated jointly during the fit of the model. The remaining hyperparameters are the parameters of the gamma priors over α and λ. These are usually chosen to be non-informative. The parameters are estimated by maximizing the marginal log likelihood.

By default, α1 = α2 = λ1 = λ2 = 10^{-6}.

[Figure: Bayesian Ridge Regression fit]

Bayesian Ridge Regression is used for regression:

>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)

After being fitted, the model can be used to predict new values:

>>> reg.predict([[1, 0.]])
array([0.50000013])

The weights w of the model can be accessed:

>>> reg.coef_
array([0.49999993, 0.49999993])

Due to the Bayesian framework, the weights found differ slightly from the ones found by Ordinary Least Squares. However, Bayesian Ridge Regression is more robust to ill-posed problems.

Examples:

References:

10.2. Automatic Relevance Determination - ARD

ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser weights w [1] [2]. ARDRegression poses a different prior over w, by dropping the assumption of the Gaussian being spherical.

Instead, the distribution over w is assumed to be an axis-parallel, elliptical Gaussian distribution.

This means each weight w_i is drawn from a Gaussian distribution, centered on zero and with a precision λ_i:

p(w|\lambda) = \mathcal{N}(w | 0, A^{-1})

with diag(A) = λ = {λ_1, ..., λ_p}.

In contrast to Bayesian Ridge Regression, each coordinate w_i has its own standard deviation λ_i. The prior over all λ_i is chosen to be the same gamma distribution given by the hyperparameters λ_1 and λ_2.

[Figure: ARD regression fit]

ARD is also known in the literature as Sparse Bayesian Learning and the Relevance Vector Machine [3] [4].
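
A minimal sketch, reusing the same toy data as the BayesianRidge example above for comparison (illustrative only):

>>> from sklearn.linear_model import ARDRegression
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = ARDRegression().fit(X, Y)
>>> # reg.coef_ holds the (typically sparser) weights and reg.lambda_ the
>>> # estimated precision of each weight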

Examples:

References:

11. Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

The implementation of logistic regression in scikit-learn can be accessed from the class LogisticRegression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.
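
A minimal sketch of fitting the class on a built-in dataset; the solver, penalty and multi_class settings here are illustrative choices, not taken from the original text:

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs',
...                          multi_class='multinomial', max_iter=1000).fit(X, y)
>>> # clf.predict(X[:2]) returns class labels and clf.predict_proba(X[:2])
>>> # the corresponding class probabilities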

As an optimization problem, binary class L2 penalized logistic regression minimizes the following cost function:

\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

Similarly, L1 regularized logistic regression solves the following optimization problem

\min_{w, c} ||w||_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

Note that, in this notation, it’s assumed that the observation y_i takes values in the set {-1, 1} at trial i.

The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”:

The solver “liblinear” uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library, which is shipped with scikit-learn. However, the CD algorithm implemented in liblinear cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion so separate binary classifiers are trained for all classes. This happens under the hood, so LogisticRegression instances using this solver behave as multiclass classifiers. For L1 penalization sklearn.svm.l1_min_c allows to calculate the lower bound for C in order to get a non “null” (all feature weights to zero) model.

The “lbfgs”, “sag” and “newton-cg” solvers only support L2 penalization and are found to converge faster for some high dimensional data. Setting multi_class to “multinomial” with these solvers learns a true multinomial logistic regression model [5], which means that its probability estimates should be better calibrated than the default “one-vs-rest” setting.

The “sag” solver uses a Stochastic Average Gradient descent [6]. It is faster than other solvers for large datasets, when both the number of samples and the number of features are large.

The “saga” solver [7] is a variant of “sag” that also supports the non-smooth penalty=”l1” option. This is therefore the solver of choice for sparse multinomial logistic regression.

The “lbfgs” is an optimization algorithm that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm [8], which belongs to quasi-Newton methods. The “lbfgs” solver is recommended for use for small data-sets but for larger datasets its performance suffers. [9]

The following table summarizes the penalties supported by each solver:

Penalties                      liblinear  lbfgs  newton-cg  sag  saga
Multinomial + L2 penalty       no         yes    yes        yes  yes
OVR + L2 penalty               yes        yes    yes        yes  yes
Multinomial + L1 penalty       no         no     no         no   yes
OVR + L1 penalty               yes        no     no         no   yes
Behaviors
Penalize the intercept (bad)   yes        no     no         no   no
Faster for large datasets      no         no     no         yes  yes
Robust to unscaled datasets    yes        yes    yes        no   no

The “lbfgs” solver is used by default for its robustness. For large datasets the “saga” solver is usually faster. For large datasets, you may also consider using SGDClassifier with ‘log’ loss, which might be even faster but requires more tuning.

Examples:

Differences from liblinear:

There might be a difference in the scores obtained between LogisticRegression with solver=liblinear or LinearSVC and the external liblinear library directly, when fit_intercept=False and the fit coef_ (or) the data to be predicted are zeroes. This is because for the sample(s) with decision_function zero, LogisticRegression and LinearSVC predict the negative class, while liblinear predicts the positive class. Note that a model with fit_intercept=False and having many samples with decision_function zero is likely to be an underfit, bad model, and you are advised to set fit_intercept=True and increase the intercept_scaling.

Note: Feature selection with sparse logistic regression

A logistic regression with L1 penalty yields sparse models, and can thus be used to perform feature selection, as detailed in L1-based feature selection.

LogisticRegressionCV implements Logistic Regression with built-in cross-validation to find the optimal C parameter. The “newton-cg”, “sag”, “saga” and “lbfgs” solvers are found to be faster for high-dimensional dense data, due to warm-starting. For the multiclass case, if the multi_class option is set to “ovr”, an optimal C is obtained for each class, and if the multi_class option is set to “multinomial”, an optimal C is obtained by minimizing the cross-entropy loss.

References:

12. Stochastic Gradient Descent - SGD

Stochastic gradient descent is a simple yet very efficient approach to fitting linear models. It is particularly useful when the number of samples (and the number of features) is very large. The partial_fit method allows online/out-of-core learning.

The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log", SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM).
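
A minimal sketch contrasting the two losses on toy data (the data and hyperparameter values are illustrative):

>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> y = [0, 0, 1, 1]
>>> # loss="log" -> logistic regression; loss="hinge" -> linear SVM
>>> log_clf = SGDClassifier(loss="log", max_iter=1000, tol=1e-3).fit(X, y)
>>> svm_clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3).fit(X, y)
>>> # both expose predict(); only the log-loss model exposes predict_proba()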

References:

13. Perceptron

The Perceptron is another simple classification algorithm suitable for large scale learning. By default:

  • It does not require a learning rate.
  • It is not regularized (penalized).
  • It updates its model only on mistakes.

The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss, and that the resulting models are sparser.

14. Passive Aggressive Algorithms

The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C.

For classification, PassiveAggressiveClassifier can be used with loss='hinge' (PA-I) or loss='squared_hinge' (PA-II). For regression, PassiveAggressiveRegressor can be used with loss='epsilon_insensitive' (PA-I) or loss='squared_epsilon_insensitive' (PA-II).
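
A minimal sketch for the classification case (the toy data and the value of C are illustrative):

>>> from sklearn.linear_model import PassiveAggressiveClassifier
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> y = [0, 0, 1, 1]
>>> clf = PassiveAggressiveClassifier(C=1.0, loss='hinge', max_iter=1000,
...                                   tol=1e-3, random_state=0).fit(X, y)
>>> # loss='hinge' corresponds to PA-I; loss='squared_hinge' to PA-II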

References:

15. Robustness regression: outliers and modeling errors

Robust regression is interested in fitting a regression model in the presence of corrupt data: either outliers, or errors in the model.

[Figure: Theil-Sen regression on corrupted data]

15.1. Different scenarios and useful concepts

There are different things to keep in mind when dealing with data corrupted by outliers:

  • Outliers in X or in y?

    [Figures: outliers in the y direction vs. outliers in the X direction]

  • Fraction of outliers versus amplitude of error

    The number of outlying points matters, but also how much they are outliers.

    [Figures: small outliers vs. large outliers]

An important notion of robust fitting is that of the breakdown point: the fraction of data that can be outlying for the fit to start missing the inlying data.

Note that, in general, robust fitting in a high-dimensional setting (large n_features) is very hard. The robust models here will probably not work in these settings.

Trade-offs: which estimator?

Scikit-learn provides three robust regression estimators: RANSAC, Theil Sen and HuberRegressor.

  • HuberRegressor should be faster than RANSAC and Theil Sen unless the number of samples is very large, i.e. n_samples >> n_features. This is because RANSAC and Theil Sen fit on smaller subsets of the data. However, both Theil Sen and RANSAC are unlikely to be as robust as HuberRegressor for the default parameters.
  • RANSAC is faster than Theil Sen and scales much better with the number of samples
  • RANSAC will deal better with large outliers in the y direction (most common situation)
  • Theil Sen will cope better with medium-size outliers in the X direction, but this property will disappear in large dimensional settings.

When in doubt, use RANSAC

15.2. RANSAC: RANdom SAmple Consensus

RANSAC (RANdom SAmple Consensus) fits a model from random subsets of inliers from the complete data set.

RANSAC is a non-deterministic algorithm producing only a reasonable result with a certain probability, which is dependent on the number of iterations (see max_trials parameter). It is typically used for linear and non-linear regression problems and is especially popular in the fields of photogrammetric computer vision.

The algorithm splits the complete input sample data into a set of inliers, which may be subject to noise, and outliers, which are e.g. caused by erroneous measurements or invalid hypotheses about the data. The resulting model is then estimated only from the determined inliers.

[Figure: RANSAC linear fit with inliers and outliers]

15.2.1. Details of the algorithm

Each iteration performs the following steps:

  1. Select min_samples random samples from the original data and check whether the set of data is valid (see is_data_valid).
  2. Fit a model to the random subset (base_estimator.fit) and check whether the estimated model is valid (see is_model_valid).
  3. Classify all data as inliers or outliers by calculating the residuals to the estimated model (base_estimator.predict(X) - y) - all data samples with absolute residuals smaller than the residual_threshold are considered as inliers.
  4. Save fitted model as best model if number of inlier samples is maximal. In case the current estimated model has the same number of inliers, it is only considered as the best model if it has better score.

These steps are performed either a maximum number of times (max_trials) or until one of the special stop criteria are met (see stop_n_inliers and stop_score). The final model is estimated using all inlier samples (consensus set) of the previously determined best model.

The is_data_valid and is_model_valid functions allow to identify and reject degenerate combinations of random sub-samples. If the estimated model is not needed for identifying degenerate cases, is_data_valid should be used as it is called prior to fitting the model and thus leading to better computational performance.
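
A minimal sketch of the estimator following these steps (the synthetic line and the injected outliers are illustrative assumptions):

>>> import numpy as np
>>> from sklearn.linear_model import RANSACRegressor
>>> rng = np.random.RandomState(0)
>>> X = np.arange(100, dtype=float)[:, np.newaxis]
>>> y = 2.0 * X.ravel() + 1.0
>>> y[:10] += 500.0 * rng.rand(10)      # corrupt a few samples with large outliers
>>> reg = RANSACRegressor(random_state=0).fit(X, y)
>>> # reg.inlier_mask_ flags the samples kept as inliers and reg.estimator_
>>> # is the final model refit on the consensus set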

Examples:

References:

15.3. Theil-Sen estimator: generalized-median-based estimator

The TheilSenRegressor estimator uses a generalization of the median in multiple dimensions. It is thus robust to multivariate outliers. Note however that the robustness of the estimator decreases quickly with the dimensionality of the problem. It loses its robustness properties and becomes no better than ordinary least squares in high dimension.

Examples:

References:

15.3.1. Theoretical considerations

TheilSenRegressor is comparable to the Ordinary Least Squares (OLS) in terms of asymptotic efficiency and as an unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric method which means it makes no assumption about the underlying distribution of the data. Since Theil-Sen is a median-based estimator, it is more robust against corrupted data aka outliers. In univariate setting, Theil-Sen has a breakdown point of about 29.3% in case of a simple linear regression which means that it can tolerate arbitrary corrupted data of up to 29.3%.

[Figure: Theil-Sen vs. ordinary least squares fit]

The implementation of TheilSenRegressor in scikit-learn follows a generalization to a multivariate linear regression model [10] using the spatial median which is a generalization of the median to multiple dimensions [11].

In terms of time and space complexity, Theil-Sen scales according to

\binom{n_{samples}}{n_{subsamples}}

which makes it infeasible to be applied exhaustively to problems with a large number of samples and features. Therefore, the magnitude of a subpopulation can be chosen to limit the time and space complexity by considering only a random subset of all possible combinations.
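
A minimal sketch; the max_subpopulation value shown simply caps how many random subsets are considered and is an illustrative choice:

>>> import numpy as np
>>> from sklearn.linear_model import TheilSenRegressor
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(200, 2)
>>> y = X.dot([1.0, -2.0]) + 0.1 * rng.randn(200)
>>> # max_subpopulation limits the number of random sample subsets considered
>>> # when the number of possible combinations becomes too large
>>> reg = TheilSenRegressor(max_subpopulation=10000, random_state=0).fit(X, y)
>>> # reg.coef_ should be close to [1.0, -2.0] despite the median-based fitting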

Examples:

References:

15.4. Huber Regression

The HuberRegressor is different to Ridge because it applies a linear loss to samples that are classified as outliers. A sample is classified as an inlier if the absolute error of that sample is lesser than a certain threshold. It differs from TheilSenRegressor and RANSACRegressor because it does not ignore the effect of the outliers but gives a lesser weight to them.

[Figure: HuberRegressor vs. Ridge on data with strong outliers]

The loss function that HuberRegressor minimizes is given by

\min_{w, \sigma} \sum_{i=1}^{n} \left( \sigma + H_\epsilon\left( \frac{X_i w - y_i}{\sigma} \right) \sigma \right) + \alpha ||w||_2^2

where

H_\epsilon(z) = \begin{cases} z^2, & \text{if } |z| < \epsilon \\ 2\epsilon |z| - \epsilon^2, & \text{otherwise} \end{cases}

It is advised to set the parameter epsilon to 1.35 to achieve 95% statistical efficiency.
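
A minimal sketch using the suggested epsilon (the injected outliers are an illustrative assumption):

>>> import numpy as np
>>> from sklearn.linear_model import HuberRegressor
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(0., 10., size=(50, 1))
>>> y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=50)
>>> y[:4] += 50.0                       # a few strong outliers in the y direction
>>> reg = HuberRegressor(epsilon=1.35).fit(X, y)
>>> # samples with large residuals only incur a linear loss, so reg.coef_
>>> # stays close to the true slope of 3.0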

15.5. Notes

The HuberRegressor differs from using SGDRegressor with loss set to huber in the following ways.

  • HuberRegressor is scaling invariant. Once epsilon is set, scaling X and y down or up by different values would produce the same robustness to outliers as before, whereas with SGDRegressor epsilon has to be set again when X and y are scaled.
  • HuberRegressor should be more efficient to use on data with a small number of samples, while SGDRegressor needs a number of passes on the training data to produce the same robustness.

Examples:

References:

  • Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172

In addition, this estimator is different from the R implementation of Robust Regression, because the R implementation does a weighted least squares fit, with weights given to each sample on the basis of how much the residual exceeds a certain threshold.

16. Polynomial regression: extending linear models with basis functions

A common pattern within machine learning is to use linear models trained on nonlinear functions of the data. This approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range of data.

For example, a simple linear regression can be extended by constructing polynomial features from the coefficients. In the standard linear regression case, you might have a model that looks like this for two-dimensional data:

\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2

If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-order polynomials, so that the model looks like this:

\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2

The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating a new variable

z = [x_1, x_2, x_1 x_2, x_1^2, x_2^2]

With this re-labeling of the data, our problem can be written

\hat{y}(w, z) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5

We see that the resulting polynomial regression is in the same class of linear models considered above (i.e. the model is linear in w) and can be solved by the same techniques. By considering linear fits within a higher-dimensional space built with these basis functions, the model has the flexibility to fit a much broader range of data.

Here is an example of applying this idea to one-dimensional data, using polynomial features of varying degrees:

[Figure: polynomial interpolation with features of varying degree]

This figure is created using the PolynomialFeatures preprocessor. This preprocessor transforms an input data matrix into a new data matrix of a given degree. It can be used as follows:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(degree=2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

The features of X have been transformed from [x_1, x_2] to [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2], and can now be used within any linear model.

This sort of preprocessing can be streamlined with the Pipeline tools. A single object representing a simple polynomial regression can be created and used as follows:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)),
...                   ('linear', LinearRegression(fit_intercept=False))])
>>> # fit to an order-3 polynomial data
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps['linear'].coef_
array([ 3., -2.,  1., -1.])

The linear model trained on polynomial features is able to exactly recover the input polynomial coefficients.

In some cases it’s not necessary to include higher powers of any single feature, but only the so-called interaction features that multiply together at most d distinct features. These can be gotten from PolynomialFeatures with the setting interaction_only=True.

For example, when dealing with boolean features, x_i^n = x_i for all n and is therefore useless; but x_i x_j represents the conjunction of two booleans. This way, we can solve the XOR problem with a linear classifier:

>>> from sklearn.linear_model import Perceptron
>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> y = X[:, 0] ^ X[:, 1]
>>> y
array([0, 1, 1, 0])
>>> X = PolynomialFeatures(interaction_only=True).fit_transform(X).astype(int)
>>> X
array([[1, 0, 0, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 0],
       [1, 1, 1, 1]])
>>> clf = Perceptron(fit_intercept=False, max_iter=10, tol=None,
...                  shuffle=False).fit(X, y)

And the classifier “predictions” are perfect:

>>> clf.predict(X)
array([0, 1, 1, 0])
>>> clf.score(X, y)
1.0