[Updating] sklearn (11): Generalized Linear Models

LinearRegression

1. Objective function: ordinary least squares

2. The ordinary-least-squares coefficient estimate assumes that the features are mutually independent. When features are collinear, the design matrix X becomes (nearly) singular, so the closed-form solution for the coefficients cannot be computed reliably; moreover, the least-squares estimate becomes highly sensitive to errors in the observed y values, which increases the variance of the model.
3. Code example:

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
>>> reg.coef_
array([0.5, 0.5])

Ridge Regression

1. Objective function: ordinary least squares with an L2 penalty

2. Penalized ordinary least squares solves the collinearity problem of plain OLS: increasing the penalty term alpha makes the model more robust to collinearity among the features; the larger alpha is, the more robust the model.
3. Ridge regression can be fitted with two functions:

# Function 1: Ridge() fits a ridge regression model for a fixed alpha on the training data. There is no cross-validation, so alpha cannot be selected automatically.
>>> from sklearn import linear_model
>>> reg = linear_model.Ridge(alpha=0.5)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)
>>> reg.coef_
array([0.34545455, 0.34545455])
>>> reg.intercept_ 
0.13636...

# Function 2: RidgeCV() uses cross-validation to pick the best alpha from a list of candidates.
>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])       
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3, fit_intercept=True, scoring=None,
    normalize=False)
>>> reg.alpha_                                      
0.1

The solver parameter of Ridge() and the gcv_mode parameter of RidgeCV() both select a particular computation method. Which step of ridge regression each method applies to, and how to choose among them, is still unclear to me; to be updated after further reading.

Lasso

1. Objective function: ordinary least squares with an L1 penalty

2. The Lasso objective can be optimized in two ways: (1) coordinate descent; (2) Least Angle Regression. Lasso is useful not only for prediction but also as a feature-selection tool, because the L1 constraint drives a large fraction of the coefficients to exactly zero, discarding redundant features; the larger alpha is, the stronger the shrinkage.
3. Functions that implement Lasso:

# Lasso() fits a Lasso model at a fixed alpha value
>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha = 0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.predict([[1, 1]])
array([0.8])

# lasso_path() computes the Lasso coefficient path over a series of alpha values
>>> import numpy as np
>>> from sklearn.linear_model import lasso_path
>>> X = np.array([[1, 2, 3.1], [2.3, 5.4, 4.3]]).T
>>> y = np.array([1, 2, 3.1])
>>> # Use lasso_path to compute a coefficient path
>>> _, coef_path, _ = lasso_path(X, y, alphas=[5., 1., .5])
>>> print(coef_path)
[[0.         0.         0.46874778]
 [0.2159048  0.4425765  0.23689075]]

# LassoCV() and LassoLarsCV()
# LassoCV() uses cross-validation to pick the best alpha from a list of candidates.
# LassoLarsCV() does the same, but in general LassoCV() is preferable.
# LassoLarsCV() solves the objective with Least Angle Regression, so when the number of
# samples is small compared to the number of features it is faster than LassoCV().
# LassoLarsIC() uses BIC or AIC for model selection instead of cross-validation.
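A minimal sketch (not from the original post) of selecting alpha with LassoCV; the synthetic data from make_regression and the candidate alphas are purely illustrative.

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=20, noise=1.0, random_state=0)
# 5-fold cross-validation over the candidate alphas
reg = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5).fit(X, y)
print(reg.alpha_)   # the alpha selected by cross-validation
print(reg.coef_)    # sparse coefficient vector at the selected alpha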

4. Compared with the regularization coefficient C in SVM: alpha = 1/C.

Multi-task Lasso

Multi-task Lasso is a linear model of the following form: given the features, find a weight matrix W such that W*feature = [y1, y2, y3, ...], i.e. several targets are regressed jointly.
1. Objective function: the regularizer is a mixed L1/L2 term; what is minimized is the Frobenius norm of the residuals plus this regularizer, which clearly differs from the other linear regressions above.

2. Optimization algorithm: coordinate descent.
3. Example code:

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskLasso(alpha=0.1)
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])  # fit(X, y)
MultiTaskLasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
        normalize=False, random_state=None, selection='cyclic', tol=0.0001,
        warm_start=False)
>>> print(clf.coef_)
[[0.89393398 0.        ]
 [0.89393398 0.        ]]
>>> print(clf.intercept_)
[0.10606602 0.10606602]

Elastic Net

Elastic Net combines the properties of Lasso and ridge regression: it produces a fairly sparse weight vector w while keeping the stability of Ridge.
Elastic Net is useful when several features are correlated with one another. Unlike Lasso, which randomly picks just one of the correlated features, Elastic Net tends to keep both.
1. Objective function: ordinary least squares with both L1 and L2 penalty terms.

2. There are two related functions; code examples:

# Function 1: ElasticNet()
sklearn.linear_model.ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, precompute=False, max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
# alpha: constant that multiplies the penalty terms;
# l1_ratio: when l1_ratio=0 the L1 penalty term vanishes and only the L2 penalty remains;
# alpha cannot be selected by cross-validation with this class;

>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=2, random_state=0)
>>> regr = ElasticNet(random_state=0)
>>> regr.fit(X, y)
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_) 
[18.83816048 64.55968825]
>>> print(regr.intercept_) 
1.451...
>>> print(regr.predict([[0, 0]])) 
[1.451...]

# Function 2: ElasticNetCV()
sklearn.linear_model.ElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, cv='warn', copy_X=True, verbose=0, n_jobs=None, positive=False, random_state=None, selection='cyclic')
# alpha can be selected by cross-validation

>>> from sklearn.linear_model import ElasticNetCV
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=2, random_state=0)
>>> regr = ElasticNetCV(cv=5, random_state=0)
>>> regr.fit(X, y)
ElasticNetCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
       l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=None,
       normalize=False, positive=False, precompute='auto', random_state=0,
       selection='cyclic', tol=0.0001, verbose=0)
>>> print(regr.alpha_) 
0.1994727942696716
>>> print(regr.intercept_) 
0.398...
>>> print(regr.predict([[0, 0]])) 
[0.398...]

Multi-task Elastic Net

Multi-task Elastic Net is a linear regression of the following form: given the features, find a weight matrix W such that W*feature = [y1, y2, y3, ...].
1. Objective function: the Frobenius norm of the residuals with both a mixed L1/L2 regularizer and an L2 regularizer.

2. There are two related functions; code examples below:

# Function 1: MultiTaskElasticNet()
sklearn.linear_model.MultiTaskElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, random_state=None, selection='cyclic')
# alpha cannot be selected by cross-validation with this class;

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskElasticNet(alpha=0.1)
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])
MultiTaskElasticNet(alpha=0.1, copy_X=True, fit_intercept=True,
        l1_ratio=0.5, max_iter=1000, normalize=False, random_state=None,
        selection='cyclic', tol=0.0001, warm_start=False)
>>> print(clf.coef_)
[[0.45663524 0.45612256]
 [0.45663524 0.45612256]]
>>> print(clf.intercept_)
[0.0872422 0.0872422]

# Function 2: MultiTaskElasticNetCV()
sklearn.linear_model.MultiTaskElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, max_iter=1000, tol=0.0001, cv='warn', copy_X=True, verbose=0, n_jobs=None, random_state=None, selection='cyclic')
# alpha can be selected by cross-validation

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskElasticNetCV(cv=3)
>>> clf.fit([[0, 0], [1, 1], [2, 2]],
...         [[0, 0], [1, 1], [2, 2]])
MultiTaskElasticNetCV(alphas=None, copy_X=True, cv=3, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=1000, n_alphas=100,
       n_jobs=None, normalize=False, random_state=None, selection='cyclic',
       tol=0.0001, verbose=0)
>>> print(clf.coef_)
[[0.52875032 0.46958558]
 [0.52875032 0.46958558]]
>>> print(clf.intercept_)
[0.00166409 0.00166409]

Least Angle Regression(LARS)

1. LARS is a linear regression algorithm for high-dimensional data. Like Lasso, it can also be used for feature selection. Its pros and cons:
1) Advantages:

  • LARS is numerically efficient in contexts where the number of features is much greater than the number of samples;
  • LARS is as fast as forward selection and has the same order of computational complexity as ordinary least squares;
  • If two variables are almost equally correlated with y, their coefficients increase at roughly the same rate during the fit, so the algorithm is fairly stable;
2) Disadvantages:
  • Because LARS works by iteratively refitting the residuals, it is relatively sensitive to noise.
2. There are two functions that implement LARS:
# Function 1: lars_path() can run either lars or lasso
sklearn.linear_model.lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=2.220446049250313e-16, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False)
# Gram: the precomputed Gram matrix X^T * X
# alpha_min: minimum alpha along the path; when method='lasso' it corresponds to the regularization coefficient alpha
# method: {lar, lasso}
# eps: the machine-precision regularization used when computing the Cholesky diagonal factors; increase it for very ill-conditioned systems. My own understanding is that these factors relate to the equiangular (bisector) direction u_A built from the active set X_A in LARS, and that eps stabilizes that computation; to be verified.
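A minimal sketch (not from the original post) of calling lars_path() on the diabetes dataset; the dataset choice and printed attributes are purely illustrative.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
# alphas: the knots of the path; coefs: one column of coefficients per knot
alphas, active, coefs = lars_path(X, y, method='lar')
print(alphas.shape)   # number of knots along the path
print(coefs.shape)    # (n_features, n_knots)
print(active)         # indices of the features in the order they entered the model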

# Function 2: Lars() only runs the LARS algorithm
sklearn.linear_model.Lars(fit_intercept=True, verbose=False, normalize=True, precompute='auto', n_nonzero_coefs=500, eps=2.220446049250313e-16, copy_X=True, fit_path=True, positive=False)
# n_nonzero_coefs: target number of non-zero coefficients
# eps: same as eps in lars_path(); to be verified
# positive: if True, the fitted coefficients are restricted to be positive;

The exact meaning of the eps parameter is not entirely clear to me; to be updated after checking.

Reference post:
机器学习方法:回归(三):最小角回归Least Angle Regression(LARS),forward stagewise selection

LARS Lasso

LARS Lasso uses the LARS algorithm to optimize the Lasso objective; the procedure is similar to forward stepwise regression. The LARS algorithm provides the full path of the coefficients along the regularization parameter almost for free.
LARS Lasso can be fitted with the following two functions:

#LassoLars()
>>> from sklearn import linear_model
>>> reg = linear_model.LassoLars(alpha=.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])  
LassoLars(alpha=0.1, copy_X=True, eps=..., fit_intercept=True,
     fit_path=True, max_iter=500, normalize=True, positive=False,
     precompute='auto', verbose=False)
>>> reg.coef_    
array([0.717157..., 0.        ])

# lars_path(method='lasso') computes the same Lasso solution path; a sketch follows.
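A minimal sketch (assumption, not from the original post): obtaining the full Lasso path for a tiny dataset via lars_path(method='lasso'); the data values are arbitrary.

import numpy as np
from sklearn.linear_model import lars_path

X = np.array([[0., 0.], [1., 1.], [2., 3.]])
y = np.array([0., 1., 2.5])
alphas, _, coefs = lars_path(X, y, method='lasso')
print(alphas)   # knots of the regularization path, from large alpha down to 0
print(coefs)    # coefficient values at each knot, one column per alpha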

Orthogonal matching pursuit

The core idea of the algorithm: treat every feature as an atom. In the first iteration, select the atom x1 that maximizes the inner product <x1, y>, fit y with it, and keep the residual y', which is orthogonal to x1. In the next iteration, select the atom x2 that maximizes <x2, y'>, refit with both atoms, and keep the residual y'', which is orthogonal to the span of the two selected atoms; note that in this second iteration the coefficient of x1 is recomputed and generally differs from its value in the first iteration. Repeat until the stopping criterion is met (a target number of non-zero coefficients, or a residual tolerance).
1. Objective function: argmin_w ||y - X*w||_2^2 subject to ||w||_0 <= n_nonzero_coefs (equivalently, minimize ||w||_0 subject to ||y - X*w||_2^2 <= tol).

2. There are two related functions; code examples below:

# Function 1: OrthogonalMatchingPursuit() solves the single-target problem
sklearn.linear_model.OrthogonalMatchingPursuit(n_nonzero_coefs=None, tol=None, fit_intercept=True, normalize=True, precompute='auto')
# n_nonzero_coefs: limits the number of non-zero coefficients;
# tol: if tol is not None it overrides n_nonzero_coefs; tol corresponds to the residual-constrained form of the objective, while n_nonzero_coefs corresponds to the sparsity-constrained form (find the n_nonzero_coefs non-zero coefficients that minimize the objective)

# Function 2: orthogonal_mp() solves the multi-target problem
sklearn.linear_model.orthogonal_mp(X, y, n_nonzero_coefs=None, tol=None, precompute=False, copy_X=True, return_path=False, return_n_iter=False)
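A minimal sketch (assumption, not from the original post): fitting OrthogonalMatchingPursuit with a fixed sparsity level on a tiny hand-made dataset; the values are arbitrary.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

X = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 1., 0.],
              [3., 2., 1.]])
y = np.array([1., 2., 0.5, 3.])
# keep at most 2 atoms (features) in the final model
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(X, y)
print(omp.coef_)       # at most 2 non-zero coefficients
print(omp.predict(X))  # fitted values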

Reference post:
MP算法和OMP算法及其思想

Bayesian Regression

Bayesian Ridge Regression

1. Bayesian ridge regression assumes that the prior P(w) over the weights w is a Gaussian distribution governed by a precision hyperparameter alpha, i.e. P(w|alpha), and that the noise on the training data D is Gaussian with precision beta, so the probabilistic form of the linear model is P(t|x, w, beta). The goal is to find the most probable values of alpha and beta given that the fitted model must map train_x to train_target, i.e. to maximize the posterior P(alpha, beta|D).
2. Pros and cons of Bayesian regression:
Pros:

  1. It adapts to the data at hand, reuses the observed data during estimation, and helps prevent over-fitting.
  2. The regularization term is introduced naturally during estimation.
Cons:
  1. The learning (inference) procedure is computationally expensive.
3. Example code:

>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)  # in this class, alpha_* corresponds to the beta above (noise precision) and lambda_* to the alpha above (weight precision)
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)
>>> reg.predict([[1, 0.]])
array([0.50000013]) 
>>> reg.coef_
array([0.49999993, 0.49999993])

Reference post:
贝叶斯线性回归(Bayesian Linear Regression)

Automatic Relevance Determination - ARD

1. ARD is very similar to Bayesian ridge regression. One major difference: in ARD the prior P(w) over the weights is not a spherical Gaussian with a single precision alpha, but an axis-parallel, elliptical Gaussian distribution, i.e. every weight w_i has its own precision alpha_i; roughly, w ~ Gaussian(0, A) with diag(A) = {alpha_1, alpha_2, alpha_3, ...}. In Bayesian ridge regression, by contrast, all weights share a single precision alpha.
The objective function is the same as in Bayesian ridge regression.
2. Code example:

sklearn.linear_model.ARDRegression(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, compute_score=False, threshold_lambda=10000.0, fit_intercept=True, normalize=False, copy_X=True, verbose=False)
# tol: stop iterating once the change of the weights between two iterations is smaller than tol, i.e. the weights have converged;
# alpha_1, alpha_2: the two hyperparameters of the Gamma prior over alpha, the precision of the Gaussian noise on the training samples;
# lambda_1, lambda_2: analogous to alpha_1 and alpha_2; lambda is the precision of the Gaussian prior over the weights w;
# threshold_lambda: threshold for removing weights;

>>> from sklearn import linear_model
>>> clf = linear_model.ARDRegression()
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
ARDRegression(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,
        copy_X=True, fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06,
        n_iter=300, normalize=False, threshold_lambda=10000.0, tol=0.001,
        verbose=False)
>>> clf.predict([[1, 1]])
array([1.])

Reference posts:
相关向量机(RVM)
相关向量机

Logistic Regression

1. Logistic regression estimates its parameters by maximum likelihood. In sklearn a penalty term is added to the objective, and instead of maximizing the likelihood its negative is minimized. Two penalties are available, L1 and L2; with y_i in {-1, 1} the objectives are:

L2: min_{w,c} (1/2) * w^T * w + C * sum_i log(1 + exp(-y_i * (x_i^T * w + c)))
L1: min_{w,c} ||w||_1 + C * sum_i log(1 + exp(-y_i * (x_i^T * w + c)))

2. LogisticRegression has four important parameters: penalty, multi_class, solver and C. Details below:

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)
# penalty: which penalty term to use, 'l1' or 'l2';
# dual: bool; the dual formulation is only available for penalty='l2' with solver='liblinear'; when n_samples > n_features it is better to set dual=False. I do not yet understand the reason; to be updated.
# tol: tolerance for the stopping criterion (the minimum value reached by the objective? what it refers to exactly is unclear to me; to be updated)
# C: inverse of regularization strength;
# fit_intercept: bool; whether to add an intercept;
# intercept_scaling: my understanding is that it scales the intercept; to be verified.
# class_weight: {class_label: weight}
# random_state
# solver: {liblinear, lbfgs, newton-cg, sag, saga}
# max_iter
# multi_class: {ovr, multinomial, auto}; 'ovr' means one-versus-rest, the multiclass scheme used by liblinear: several binary classifiers are trained, each separating one class from the rest. 'multinomial' is a true multiclass model; every solver except liblinear supports it.
# verbose
# warm_start: bool; if True, the solution of the previous call is reused as the initialization of the next fit
# n_jobs

[To be updated]: dual
[To be updated]: tol
[To be updated]: intercept_scaling

A closer look at the purpose of each solver:
liblinear uses a coordinate descent algorithm to find the optimal parameters. Note that the CD algorithm cannot learn a true multinomial model; it only solves multiclass problems by training several one-versus-rest classifiers.
"lbfgs", "sag" and "newton-cg" converge faster on high-dimensional data. All of them support multi_class='multinomial', and the resulting probability estimates are better calibrated than with multi_class='ovr'.
"sag" uses Stochastic Average Gradient descent; on large datasets (where both n_samples and n_features are large) it converges faster than the other solvers.
"saga" is a variant of "sag" that supports the L1 penalty as well as L2, so it can fit a sparse multinomial logistic regression. "saga" is usually the best choice.
"liblinear" is suited to small datasets.
"lbfgs", "sag", "newton-cg" and "saga" are suited to multinomial problems.
"sag" and "saga" converge quickly on large datasets, provided the features have roughly the same scale (which can be ensured with the preprocessing module).
The solvers differ in which penalty and multi_class settings they support; a summary:

Property                            liblinear  lbfgs  newton-cg  sag  saga
Penalizes the intercept (bad idea)  yes        no     no         no   no
Faster for large datasets           no         no     no         yes  yes
Robust to unscaled datasets         yes        yes    yes        no   no
3. Code example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
...                          multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :]) 
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...
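As mentioned above, "saga" with an L1 penalty yields a sparse multinomial model. A minimal sketch (assumption, not from the original post); C=0.5 and the scaling step are illustrative choices, and multi_class applies to the older sklearn versions this post is based on.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # sag/saga converge faster on similarly scaled features
clf = LogisticRegression(penalty='l1', solver='saga', C=0.5,
                         multi_class='multinomial', max_iter=1000).fit(X, y)
print(clf.coef_)    # the L1 penalty drives some coefficients exactly to zero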

Stochastic Gradient Descent - SGD

1. For linear classification, the core SGD update (perceptron-style) is w = w + y*x for every misclassified point x, with y in {1, -1}.
2. Pros and cons:

  • Pros: efficient; easy to implement (many opportunities for code optimization);
  • Cons: sensitive to feature scaling; requires tuning many hyperparameters (e.g. the regularization term and the number of iterations);

3. SGDClassifier and SGDRegressor fit linear models for classification and regression respectively.

# Function 1: SGDClassifier(); with loss='hinge' it fits a linear SVM;
sklearn.linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False, n_iter=None)
#loss:{hinge,log,modified_huber,squared_hinge,perceptron,squared_loss,huber,epsilon_insensitive,squared_epsilon_insensitive}

# Function 2: SGDRegressor(); with the default loss='squared_loss' it fits a linear regression by SGD;
sklearn.linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False, n_iter=None)
# loss: {squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive}
# penalty: {l1, l2, elasticnet}
# alpha: constant that multiplies the regularization term (default 0.0001); also used to compute the learning rate when learning_rate='optimal'. How that learning rate is derived is unclear to me; to be updated.
# l1_ratio: the elastic net mixing parameter
# fit_intercept
# max_iter
# tol: stopping criterion on previous_loss - loss
# shuffle
# verbose
# epsilon: the epsilon in the epsilon-insensitive loss functions (and the switching point of the Huber loss)
# random_state
# learning_rate: {constant, optimal, invscaling, adaptive}
# eta0: the initial learning rate for the 'constant', 'invscaling' or 'adaptive' schedules.
# power_t: the exponent of the inverse-scaling learning rate; exact usage to be verified.
# early_stopping: bool
# validation_fraction: the proportion of training data to set aside as a validation set for early stopping.
# n_iter_no_change: if there is no improvement after n_iter_no_change iterations, early stopping is triggered
# warm_start
# average: bool; if True, the averaged SGD weights are computed
# n_iter
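A minimal sketch (assumption, not from the original post): an SGD-trained linear SVM (loss='hinge') on scaled synthetic data; make_classification and the hyperparameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)   # SGD is sensitive to feature scaling
clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4,
                    max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
print(clf.score(X, y))                  # training accuracy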

4. Loss functions
loss_classifier: {hinge, log, modified_huber, squared_hinge, perceptron, squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive}
loss_regressor: {squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive}

  • hinge: the hinge loss (linear SVM).
  • squared_hinge: like hinge, but with a quadratic penalty.
  • log: the logistic regression loss function.
  • huber: modifies 'squared_loss' to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon (see: Huber loss function).
  • modified_huber: another smooth loss that brings tolerance to outliers as well as probability estimates.
  • squared_loss: ordinary least squares.
  • epsilon_insensitive: the loss used by SVR (support vector regression); errors smaller than epsilon are ignored (reference post: 支持向量回归-SVR).
  • squared_epsilon_insensitive: the same as epsilon_insensitive, but becomes squared loss past a tolerance of epsilon.
  • perceptron: the perceptron loss.

[To be updated]: how the learning rate is computed when learning_rate='optimal'
[To be updated]: usage of power_t

Perceptron

The core update rule: w = w + y*x for every misclassified point x, with y in {1, -1}.
Suitable for large-scale learning.
Can be used as an online algorithm.
1. Characteristics of the algorithm:

  • No learning rate is required.
  • No regularization term (by default).
  • Weights are updated only on misclassified points.
  • Compared with SGD with hinge loss, it converges faster and the resulting models are sparser.
2. Function (a usage sketch follows the notes below):
sklearn.linear_model.Perceptron(penalty=None, alpha=0.0001, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, eta0=1.0, n_jobs=None, random_state=0, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, n_iter=None)
# penalty: {l1, l2, elasticnet}; how the penalty (regularization) term is applied to the perceptron is unclear to me; to be updated
# eta0: constant by which the updates are multiplied. Defaults to 1.

[To be updated]: penalty
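A minimal sketch (assumption, not from the original post): training a Perceptron on synthetic two-class data; make_classification and its parameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = Perceptron(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
print(clf.coef_)           # weights learned from the misclassified points only
print(clf.score(X, y))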

Passive Aggressive Algorithms

An online algorithm.
1. Similar to the perceptron, but with a modified update step size.
2. Functions

# function for classification
sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, shuffle=True, verbose=0, loss='hinge', n_jobs=None, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)
#C:Maximum step size (regularization). Defaults to 1.0.
#loss:{hinge,squared_hinge}

#function for regression
sklearn.linear_model.PassiveAggressiveRegressor(C=1.0, fit_intercept=True, max_iter=None, tol=None, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, shuffle=True, verbose=0, loss='epsilon_insensitive', epsilon=0.1, random_state=None, warm_start=False, average=False, n_iter=None)
#loss:{epsilon_insensitive,squared_epsilon_insensitive}

# Note: the loss functions of the classifier and the regressor differ slightly; a usage sketch follows.
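A minimal sketch (assumption, not from the original post): feeding data to PassiveAggressiveClassifier in mini-batches with partial_fit, mimicking the online setting; the batching and make_classification parameters are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = PassiveAggressiveClassifier(C=1.0, random_state=0)
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    # classes must be passed on the first call to partial_fit
    clf.partial_fit(X[batch], y[batch], classes=np.unique(y))
print(clf.score(X, y))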

Reference post:
Online Passive Aggressive

Robustness regression: outliers and modeling errors

Robust regression fits a regression model to corrupt data, i.e. data that contain outliers or that should not really be modelled by the regression model at all.

Note: robust regression does not perform well in high-dimensional settings.

Robust regression covers three estimators: Theil-Sen, RANSAC, and HuberRegressor.

1. When to use each of the three estimators:
1) Illustration of the fitting errors of the three estimators on data corrupted in X or in y:


2) Strengths and weaknesses of the three estimators:

  • On small datasets HuberRegressor is faster than RANSAC and Theil-Sen, but on large datasets the situation is reversed, because each iteration of RANSAC and Theil-Sen works on a subset of the data rather than on the full sample;
  • With default parameters, HuberRegressor is more robust than RANSAC and Theil-Sen;
  • RANSAC is faster than Theil-Sen and scales better with the dataset size;
  • In general, RANSAC deals best with large outliers in the y direction;
  • Theil-Sen deals best with medium-sized outliers in the X direction, but this advantage fades as the number of features grows;
  • When unsure which estimator to use, use RANSAC!

2. Introduction to the three estimators (a small comparison sketch follows this list):

  • RANSAC:
    well suited to photogrammetric computer vision problems; see the reference post RANSAC算法详解.

  • Theil-Sen:
    Theil-Sen regression is comparable to OLS in terms of asymptotic unbiasedness, but it is a non-parametric regression that makes no assumption about the distribution of the data. It is a median-based estimator, which makes it very robust to outliers; however, its performance degrades markedly as the number of features grows, eventually becoming even worse than OLS.
    The core idea: pick a number of subsets of the dataset, fit a linear model on each, and take the median of the fitted slopes as the slope of the final model. In the most extreme case, a line is drawn through every pair of points in the dataset and the median of all those slopes is taken as the final slope (my own understanding).

  • Huber Regression
    Unlike RANSAC and Theil-Sen, Huber regression does not remove the influence of the outliers altogether; instead, it switches their contribution to the loss from a quadratic to a linear loss function.
    Its loss function is shown in the figure below.

    HuberRegressor differs from SGD with loss='huber' in the following ways:
    for HuberRegressor, once the epsilon parameter is set, rescaling the dataset (X, y) does not change the final result, whereas with SGD the epsilon parameter has to be re-tuned after every rescaling of the data;
    for a small number of samples HuberRegressor is more efficient, while SGD may need many more iterations to reach the same result.
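A minimal comparison sketch (assumption, not from the original post) of the three estimators on 1-D data with a few large outliers in the y direction; all values are illustrative.

import numpy as np
from sklearn.linear_model import HuberRegressor, RANSACRegressor, TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)
y[:5] += 50.0                                   # inject a few large outliers in y

for est in (HuberRegressor(), RANSACRegressor(random_state=0), TheilSenRegressor(random_state=0)):
    est.fit(X, y)
    name = type(est).__name__
    # RANSAC stores the final inlier-fitted model in est.estimator_
    coef = est.estimator_.coef_ if name == 'RANSACRegressor' else est.coef_
    print(name, coef)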

Polynomial regression: extending linear models with basis functions

1. Core idea:
Start from the basic linear model y = w*x + b + epsilon.
By mapping the features x to g(x), the linear model is turned into a non-linear one: y = w*g(x) + b + epsilon.
For example:
original features: [x1, x2]; linear model: y = w1*x1 + w2*x2 + b + epsilon
transformed features: [x1, x2, x1*x2, x1^2, x2^2]; polynomial model: y = w1*x1 + w2*x2 + w3*x1*x2 + w4*x1^2 + w5*x2^2
2. Code example:

sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
# degree: the degree of the polynomial
# interaction_only=True: e.g. for features (x1, x2, x3) with degree=2, the polynomial transformation produces (x1, x2, x3, x1*x2, x1*x3, x2*x3), without the pure powers
# include_bias=True: adds a constant column, which is equivalent to adding an intercept
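A minimal sketch (not from the original post) showing the effect of interaction_only on a three-feature row; the input values are arbitrary.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1., 2., 3.]])
# full degree-2 expansion: x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X))
# interaction terms only: x1, x2, x3, x1*x2, x1*x3, x2*x3
print(PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X))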

>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)),
...                   ('linear', LinearRegression(fit_intercept=False))])
>>> # fit to an order-3 polynomial data
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps['linear'].coef_
array([ 3., -2.,  1., -1.])

Official documentation
