[Updating] sklearn (11): Generalized Linear Models

LinearRegression

1. Objective function: ordinary least squares

2. The ordinary-least-squares coefficient estimate assumes that the features are mutually independent. When features are collinear, the design matrix X becomes (nearly) singular, so the closed-form solution for the coefficients cannot be computed reliably; moreover, the least-squares estimate becomes highly sensitive to errors in the observed y values, which increases the variance of the model.
3. Code example:

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
>>> reg.coef_
array([0.5, 0.5])

Ridge Regression

1. Objective function: ordinary least squares with an L2 penalty

2. Penalized ordinary least squares solves the collinearity problem of plain OLS: increasing the penalty term alpha makes the model more robust to collinearity among the features; the larger alpha is, the more robust the model.
3. Ridge regression can be fitted with two functions:

# Function 1: Ridge() fits a ridge regression model for a fixed alpha on the training data. There is no cross-validation, so alpha cannot be selected automatically.
>>> from sklearn import linear_model
>>> reg = linear_model.Ridge(alpha=0.5)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)
>>> reg.coef_
array([0.34545455, 0.34545455])
>>> reg.intercept_ 
0.13636...

# Function 2: RidgeCV() uses cross-validation to pick the best alpha from a list of candidates.
>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])       
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3, fit_intercept=True, scoring=None,
    normalize=False)
>>> reg.alpha_                                      
0.1

The solver parameter of Ridge() and the gcv_mode parameter of RidgeCV() both select a particular computation method. Which step of ridge regression each method applies to, and how to choose among them, is still unclear to me; to be updated after further reading.

Lasso

1. Objective function: ordinary least squares with an L1 penalty

2. The Lasso objective can be optimized in two ways: (1) coordinate descent; (2) Least Angle Regression. Lasso is useful not only for prediction but also as a feature-selection tool, because the L1 constraint drives a large fraction of the coefficients to exactly zero, discarding redundant features; the larger alpha is, the stronger the shrinkage.
3. Functions that implement Lasso:

# Lasso() fits a Lasso model at a fixed alpha value
>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha = 0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.predict([[1, 1]])
array([0.8])

# lasso_path() computes the Lasso coefficient path over a series of alpha values
>>> import numpy as np
>>> from sklearn.linear_model import lasso_path
>>> X = np.array([[1, 2, 3.1], [2.3, 5.4, 4.3]]).T
>>> y = np.array([1, 2, 3.1])
>>> # Use lasso_path to compute a coefficient path
>>> _, coef_path, _ = lasso_path(X, y, alphas=[5., 1., .5])
>>> print(coef_path)
[[0.         0.         0.46874778]
 [0.2159048  0.4425765  0.23689075]]

# LassoCV() and LassoLarsCV()
# LassoCV() uses cross-validation to pick the best alpha from a list of candidates.
# LassoLarsCV() does the same, but in general LassoCV() is preferable.
# LassoLarsCV() solves the objective with Least Angle Regression, so when the number of
# samples is small compared to the number of features it is faster than LassoCV().
# LassoLarsIC() uses BIC or AIC for model selection instead of cross-validation.
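A minimal sketch (not from the original post) of selecting alpha with LassoCV; the synthetic data from make_regression and the candidate alphas are purely illustrative.

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=20, noise=1.0, random_state=0)
# 5-fold cross-validation over the candidate alphas
reg = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5).fit(X, y)
print(reg.alpha_)   # the alpha selected by cross-validation
print(reg.coef_)    # sparse coefficient vector at the selected alpha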

4. Compared with the regularization coefficient C in SVM: alpha = 1/C.

Multi-task Lasso

Multi-task Lasso is a linear model of the following form: given the features, find a weight matrix W such that W*feature = [y1, y2, y3, ...], i.e. several targets are regressed jointly.
1. Objective function: the regularizer is a mixed L1/L2 term; what is minimized is the Frobenius norm of the residuals plus this regularizer, which clearly differs from the other linear regressions above.

2. Optimization algorithm: coordinate descent.
3. Example code:

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskLasso(alpha=0.1)
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])  # fit(X, y)
MultiTaskLasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
        normalize=False, random_state=None, selection='cyclic', tol=0.0001,
        warm_start=False)
>>> print(clf.coef_)
[[0.89393398 0.        ]
 [0.89393398 0.        ]]
>>> print(clf.intercept_)
[0.10606602 0.10606602]

Elastic Net

Elastic Net combines the properties of Lasso and ridge regression: it produces a fairly sparse weight vector w while keeping the stability of Ridge.
Elastic Net is useful when several features are correlated with one another. Unlike Lasso, which randomly picks just one of the correlated features, Elastic Net tends to keep both.
1. Objective function: ordinary least squares with both L1 and L2 penalty terms.

2. There are two related functions; code examples:

# Function 1: ElasticNet()
sklearn.linear_model.ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, precompute=False, max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
# alpha: constant that multiplies the penalty terms;
# l1_ratio: when l1_ratio=0 the L1 penalty term vanishes and only the L2 penalty remains;
# alpha cannot be selected by cross-validation with this class;

>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=2, random_state=0)
>>> regr = ElasticNet(random_state=0)
>>> regr.fit(X, y)
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_) 
[18.83816048 64.55968825]
>>> print(regr.intercept_) 
1.451...
>>> print(regr.predict([[0, 0]])) 
[1.451...]

# Function 2: ElasticNetCV()
sklearn.linear_model.ElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, cv='warn', copy_X=True, verbose=0, n_jobs=None, positive=False, random_state=None, selection='cyclic')
# alpha can be selected by cross-validation

>>> from sklearn.linear_model import ElasticNetCV
>>> from sklearn.datasets import make_regression
>>>
>>> X, y = make_regression(n_features=2, random_state=0)
>>> regr = ElasticNetCV(cv=5, random_state=0)
>>> regr.fit(X, y)
ElasticNetCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
       l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=None,
       normalize=False, positive=False, precompute='auto', random_state=0,
       selection='cyclic', tol=0.0001, verbose=0)
>>> print(regr.alpha_) 
0.1994727942696716
>>> print(regr.intercept_) 
0.398...
>>> print(regr.predict([[0, 0]])) 
[0.398...]

Multi-task Elastic Net

Multi-task Elastic Net is a linear regression of the following form: given the features, find a weight matrix W such that W*feature = [y1, y2, y3, ...].
1. Objective function: the Frobenius norm of the residuals with both a mixed L1/L2 regularizer and an L2 regularizer.

2. There are two related functions; code examples below:

# Function 1: MultiTaskElasticNet()
sklearn.linear_model.MultiTaskElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, random_state=None, selection='cyclic')
# alpha cannot be selected by cross-validation with this class;

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskElasticNet(alpha=0.1)
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])
MultiTaskElasticNet(alpha=0.1, copy_X=True, fit_intercept=True,
        l1_ratio=0.5, max_iter=1000, normalize=False, random_state=None,
        selection='cyclic', tol=0.0001, warm_start=False)
>>> print(clf.coef_)
[[0.45663524 0.45612256]
 [0.45663524 0.45612256]]
>>> print(clf.intercept_)
[0.0872422 0.0872422]

# Function 2: MultiTaskElasticNetCV()
sklearn.linear_model.MultiTaskElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, max_iter=1000, tol=0.0001, cv='warn', copy_X=True, verbose=0, n_jobs=None, random_state=None, selection='cyclic')
# alpha can be selected by cross-validation

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskElasticNetCV(cv=3)
>>> clf.fit([[0, 0], [1, 1], [2, 2]],
...         [[0, 0], [1, 1], [2, 2]])
MultiTaskElasticNetCV(alphas=None, copy_X=True, cv=3, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=1000, n_alphas=100,
       n_jobs=None, normalize=False, random_state=None, selection='cyclic',
       tol=0.0001, verbose=0)
>>> print(clf.coef_)
[[0.52875032 0.46958558]
 [0.52875032 0.46958558]]
>>> print(clf.intercept_)
[0.00166409 0.00166409]

Least Angle Regression(LARS)

1. LARS is a linear regression algorithm for high-dimensional data. Like Lasso, it can also be used for feature selection. Its pros and cons:
1) Advantages:

  • LARS is numerically efficient in contexts where the number of features is much greater than the number of samples;
  • LARS is as fast as forward selection and has the same order of computational complexity as ordinary least squares;
  • If two variables are almost equally correlated with y, their coefficients increase at roughly the same rate during the fit, so the algorithm is fairly stable;
2) Disadvantages:
  • Because LARS works by iteratively refitting the residuals, it is relatively sensitive to noise.
2. There are two functions that implement LARS:
# Function 1: lars_path() can run either lars or lasso
sklearn.linear_model.lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=2.220446049250313e-16, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False)
# Gram: the precomputed Gram matrix X^T * X
# alpha_min: minimum alpha along the path; when method='lasso' it corresponds to the regularization coefficient alpha
# method: {lar, lasso}
# eps: the machine-precision regularization used when computing the Cholesky diagonal factors; increase it for very ill-conditioned systems. My own understanding is that these factors relate to the equiangular (bisector) direction u_A built from the active set X_A in LARS, and that eps stabilizes that computation; to be verified.
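A minimal sketch (not from the original post) of calling lars_path() on the diabetes dataset; the dataset choice and printed attributes are purely illustrative.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
# alphas: the knots of the path; coefs: one column of coefficients per knot
alphas, active, coefs = lars_path(X, y, method='lar')
print(alphas.shape)   # number of knots along the path
print(coefs.shape)    # (n_features, n_knots)
print(active)         # indices of the features in the order they entered the model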

# Function 2: Lars() only runs the LARS algorithm
sklearn.linear_model.Lars(fit_intercept=True, verbose=False, normalize=True, precompute='auto', n_nonzero_coefs=500, eps=2.220446049250313e-16, copy_X=True, fit_path=True, positive=False)
# n_nonzero_coefs: target number of non-zero coefficients
# eps: same as eps in lars_path(); to be verified
# positive: if True, the fitted coefficients are restricted to be positive;

The exact meaning of the eps parameter is not entirely clear to me; to be updated after checking.

Reference post:
机器学习方法:回归(三):最小角回归Least Angle Regression(LARS),forward stagewise selection

LARS Lasso

LARS Lasso uses the LARS algorithm to optimize the Lasso objective; the procedure is similar to forward stepwise regression. The LARS algorithm provides the full path of the coefficients along the regularization parameter almost for free.
LARS Lasso can be fitted with the following two functions:

#LassoLars()
>>> from sklearn import linear_model
>>> reg = linear_model.LassoLars(alpha=.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])  
LassoLars(alpha=0.1, copy_X=True, eps=..., fit_intercept=True,
     fit_path=True, max_iter=500, normalize=True, positive=False,
     precompute='auto', verbose=False)
>>> reg.coef_    
array([0.717157..., 0.        ])

# lars_path(method='lasso') computes the same Lasso solution path; a sketch follows.
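A minimal sketch (assumption, not from the original post): obtaining the full Lasso path for a tiny dataset via lars_path(method='lasso'); the data values are arbitrary.

import numpy as np
from sklearn.linear_model import lars_path

X = np.array([[0., 0.], [1., 1.], [2., 3.]])
y = np.array([0., 1., 2.5])
alphas, _, coefs = lars_path(X, y, method='lasso')
print(alphas)   # knots of the regularization path, from large alpha down to 0
print(coefs)    # coefficient values at each knot, one column per alpha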

Orthogonal matching pursuit

The core idea of the algorithm: treat every feature as an atom. In the first iteration, select the atom x1 that maximizes the inner product <x1, y>, fit y with it, and keep the residual y', which is orthogonal to x1. In the next iteration, select the atom x2 that maximizes <x2, y'>, refit with both atoms, and keep the residual y'', which is orthogonal to the span of the two selected atoms; note that in this second iteration the coefficient of x1 is recomputed and generally differs from its value in the first iteration. Repeat until the stopping criterion is met (a target number of non-zero coefficients, or a residual tolerance).
1. Objective function: argmin_w ||y - X*w||_2^2 subject to ||w||_0 <= n_nonzero_coefs (equivalently, minimize ||w||_0 subject to ||y - X*w||_2^2 <= tol).

2. There are two related functions; code examples below:

# Function 1: OrthogonalMatchingPursuit() solves the single-target problem
sklearn.linear_model.OrthogonalMatchingPursuit(n_nonzero_coefs=None, tol=None, fit_intercept=True, normalize=True, precompute='auto')
# n_nonzero_coefs: limits the number of non-zero coefficients;
# tol: if tol is not None it overrides n_nonzero_coefs; tol corresponds to the residual-constrained form of the objective, while n_nonzero_coefs corresponds to the sparsity-constrained form (find the n_nonzero_coefs non-zero coefficients that minimize the objective)

# Function 2: orthogonal_mp() solves the multi-target problem
sklearn.linear_model.orthogonal_mp(X, y, n_nonzero_coefs=None, tol=None, precompute=False, copy_X=True, return_path=False, return_n_iter=False)
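A minimal sketch (assumption, not from the original post): fitting OrthogonalMatchingPursuit with a fixed sparsity level on a tiny hand-made dataset; the values are arbitrary.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

X = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 1., 0.],
              [3., 2., 1.]])
y = np.array([1., 2., 0.5, 3.])
# keep at most 2 atoms (features) in the final model
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(X, y)
print(omp.coef_)       # at most 2 non-zero coefficients
print(omp.predict(X))  # fitted values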

Reference post:
MP算法和OMP算法及其思想

Bayesian Regression

Bayesian Ridge Regression

1. Bayesian ridge regression assumes that the prior P(w) over the weights w is a Gaussian distribution governed by a precision hyperparameter alpha, i.e. P(w|alpha), and that the noise on the training data D is Gaussian with precision beta, so the probabilistic form of the linear model is P(t|x, w, beta). The goal is to find the most probable values of alpha and beta given that the fitted model must map train_x to train_target, i.e. to maximize the posterior P(alpha, beta|D).
2. Pros and cons of Bayesian regression:
Pros:

  1. It adapts to the data at hand, reuses the observed data during estimation, and helps prevent over-fitting.
  2. The regularization term is introduced naturally during estimation.
Cons:
  1. The learning (inference) procedure is computationally expensive.
3. Example code:

>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)  # in this class, alpha_* corresponds to the beta above (noise precision) and lambda_* to the alpha above (weight precision)
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)
>>> reg.predict([[1, 0.]])
array([0.50000013]) 
>>> reg.coef_
array([0.49999993, 0.49999993])

Reference post:
贝叶斯线性回归(Bayesian Linear Regression)

Automatic Relevance Determination - ARD

1. ARD is very similar to Bayesian ridge regression. One major difference: in ARD the prior P(w) over the weights is not a spherical Gaussian with a single precision alpha, but an axis-parallel, elliptical Gaussian distribution, i.e. every weight w_i has its own precision alpha_i; roughly, w ~ Gaussian(0, A) with diag(A) = {alpha_1, alpha_2, alpha_3, ...}. In Bayesian ridge regression, by contrast, all weights share a single precision alpha.
The objective function is the same as in Bayesian ridge regression.
2. Code example:

sklearn.linear_model.ARDRegression(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, compute_score=False, threshold_lambda=10000.0, fit_intercept=True, normalize=False, copy_X=True, verbose=False)
# tol: stop iterating once the change of the weights between two iterations is smaller than tol, i.e. the weights have converged;
# alpha_1, alpha_2: the two hyperparameters of the Gamma prior over alpha, the precision of the Gaussian noise on the training samples;
# lambda_1, lambda_2: analogous to alpha_1 and alpha_2; lambda is the precision of the Gaussian prior over the weights w;
# threshold_lambda: threshold for removing weights;

>>> from sklearn import linear_model
>>> clf = linear_model.ARDRegression()
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
ARDRegression(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,
        copy_X=True, fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06,
        n_iter=300, normalize=False, threshold_lambda=10000.0, tol=0.001,
        verbose=False)
>>> clf.predict([[1, 1]])
array([1.])

Reference posts:
相关向量机(RVM)
相关向量机

Logistic Regression

1. Logistic regression estimates its parameters by maximum likelihood. In sklearn a penalty term is added to the objective, and instead of maximizing the likelihood its negative is minimized. Two penalties are available, L1 and L2; with y_i in {-1, 1} the objectives are:

L2: min_{w,c} (1/2) * w^T * w + C * sum_i log(1 + exp(-y_i * (x_i^T * w + c)))
L1: min_{w,c} ||w||_1 + C * sum_i log(1 + exp(-y_i * (x_i^T * w + c)))

2. LogisticRegression has four important parameters: penalty, multi_class, solver and C. Details below:

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)
# penalty: which penalty term to use, 'l1' or 'l2';
# dual: bool; the dual formulation is only available for penalty='l2' with solver='liblinear'; when n_samples > n_features it is better to set dual=False. I do not yet understand the reason; to be updated.
# tol: tolerance for the stopping criterion (the minimum value reached by the objective? what it refers to exactly is unclear to me; to be updated)
# C: inverse of regularization strength;
# fit_intercept: bool; whether to add an intercept;
# intercept_scaling: my understanding is that it scales the intercept; to be verified.
# class_weight: {class_label: weight}
# random_state
# solver: {liblinear, lbfgs, newton-cg, sag, saga}
# max_iter
# multi_class: {ovr, multinomial, auto}; 'ovr' means one-versus-rest, the multiclass scheme used by liblinear: several binary classifiers are trained, each separating one class from the rest. 'multinomial' is a true multiclass model; every solver except liblinear supports it.
# verbose
# warm_start: bool; if True, the solution of the previous call is reused as the initialization of the next fit
# n_jobs

[To be updated]: dual
[To be updated]: tol
[To be updated]: intercept_scaling

A closer look at the purpose of each solver:
liblinear uses a coordinate descent algorithm to find the optimal parameters. Note that the CD algorithm cannot learn a true multinomial model; it only solves multiclass problems by training several one-versus-rest classifiers.
"lbfgs", "sag" and "newton-cg" converge faster on high-dimensional data. All of them support multi_class='multinomial', and the resulting probability estimates are better calibrated than with multi_class='ovr'.
"sag" uses Stochastic Average Gradient descent; on large datasets (where both n_samples and n_features are large) it converges faster than the other solvers.
"saga" is a variant of "sag" that supports the L1 penalty as well as L2, so it can fit a sparse multinomial logistic regression. "saga" is usually the best choice.
"liblinear" is suited to small datasets.
"lbfgs", "sag", "newton-cg" and "saga" are suited to multinomial problems.
"sag" and "saga" converge quickly on large datasets, provided the features have roughly the same scale (which can be ensured with the preprocessing module).
The solvers differ in which penalty and multi_class settings they support; a summary:

Property                            liblinear  lbfgs  newton-cg  sag  saga
Penalizes the intercept (bad idea)  yes        no     no         no   no
Faster for large datasets           no         no     no         yes  yes
Robust to unscaled datasets         yes        yes    yes        no   no
3. Code example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
...                          multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :]) 
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...
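As mentioned above, "saga" with an L1 penalty yields a sparse multinomial model. A minimal sketch (assumption, not from the original post); C=0.5 and the scaling step are illustrative choices, and multi_class applies to the older sklearn versions this post is based on.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # sag/saga converge faster on similarly scaled features
clf = LogisticRegression(penalty='l1', solver='saga', C=0.5,
                         multi_class='multinomial', max_iter=1000).fit(X, y)
print(clf.coef_)    # the L1 penalty drives some coefficients exactly to zero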

Stochastic Gradient Descent - SGD

1. For linear classification, the core SGD update (perceptron-style) is w = w + y*x for every misclassified point x, with y in {1, -1}.
2. Pros and cons:

  • Pros: efficient; easy to implement (many opportunities for code optimization);
  • Cons: sensitive to feature scaling; requires tuning many hyperparameters (e.g. the regularization term and the number of iterations);

3. SGDClassifier and SGDRegressor fit linear models for classification and regression respectively.

# Function 1: SGDClassifier(); with loss='hinge' it fits a linear SVM;
sklearn.linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False, n_iter=None)
#loss:{hinge,log,modified_huber,squared_hinge,perceptron,squared_loss,huber,epsilon_insensitive,squared_epsilon_insensitive}

# Function 2: SGDRegressor(); with the default loss='squared_loss' it fits a linear regression by SGD;
sklearn.linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False, n_iter=None)
# loss: {squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive}
# penalty: {l1, l2, elasticnet}
# alpha: constant that multiplies the regularization term (default 0.0001); also used to compute the learning rate when learning_rate='optimal'. How that learning rate is derived is unclear to me; to be updated.
# l1_ratio: the elastic net mixing parameter
# fit_intercept
# max_iter
# tol: stopping criterion on previous_loss - loss
# shuffle
# verbose
# epsilon: the epsilon in the epsilon-insensitive loss functions (and the switching point of the Huber loss)
# random_state
# learning_rate: {constant, optimal, invscaling, adaptive}
# eta0: the initial learning rate for the 'constant', 'invscaling' or 'adaptive' schedules.
# power_t: the exponent of the inverse-scaling learning rate; exact usage to be verified.
# early_stopping: bool
# validation_fraction: the proportion of training data to set aside as a validation set for early stopping.
# n_iter_no_change: if there is no improvement after n_iter_no_change iterations, early stopping is triggered
# warm_start
# average: bool; if True, the averaged SGD weights are computed
# n_iter
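A minimal sketch (assumption, not from the original post): an SGD-trained linear SVM (loss='hinge') on scaled synthetic data; make_classification and the hyperparameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)   # SGD is sensitive to feature scaling
clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4,
                    max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
print(clf.score(X, y))                  # training accuracy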

4. Loss functions
loss_classifier: {hinge, log, modified_huber, squared_hinge, perceptron, squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive}
loss_regressor: {squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive}

  • hinge: the hinge loss (linear SVM).
  • squared_hinge: like hinge, but with a quadratic penalty.
  • log: the logistic regression loss function.
  • huber: modifies 'squared_loss' to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon (see: Huber loss function).
  • modified_huber: another smooth loss that brings tolerance to outliers as well as probability estimates.
  • squared_loss: ordinary least squares.
  • epsilon_insensitive: the loss used by SVR (support vector regression); errors smaller than epsilon are ignored (reference post: 支持向量回归-SVR).
  • squared_epsilon_insensitive: the same as epsilon_insensitive, but becomes squared loss past a tolerance of epsilon.
  • perceptron: the perceptron loss.

[To be updated]: how the learning rate is computed when learning_rate='optimal'
[To be updated]: usage of power_t

Perceptron

The core update rule: w = w + y*x for every misclassified point x, with y in {1, -1}.
Suitable for large-scale learning.
Can be used as an online algorithm.
1. Characteristics of the algorithm:

  • No learning rate is required.
  • No regularization term (by default).
  • Weights are updated only on misclassified points.
  • Compared with SGD with hinge loss, it converges faster and the resulting models are sparser.
2. Function (a usage sketch follows the notes below):
sklearn.linear_model.Perceptron(penalty=None, alpha=0.0001, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, eta0=1.0, n_jobs=None, random_state=0, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, n_iter=None)
# penalty: {l1, l2, elasticnet}; how the penalty (regularization) term is applied to the perceptron is unclear to me; to be updated
# eta0: constant by which the updates are multiplied. Defaults to 1.

[To be updated]: penalty
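A minimal sketch (assumption, not from the original post): training a Perceptron on synthetic two-class data; make_classification and its parameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = Perceptron(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
print(clf.coef_)           # weights learned from the misclassified points only
print(clf.score(X, y))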

Passive Aggressive Algorithms

An online algorithm.
1. Similar to the perceptron, but with a modified update step size.
2. Functions

# function for classification
sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, shuffle=True, verbose=0, loss='hinge', n_jobs=None, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)
#C:Maximum step size (regularization). Defaults to 1.0.
#loss:{hinge,squared_hinge}

#function for regression
sklearn.linear_model.PassiveAggressiveRegressor(C=1.0, fit_intercept=True, max_iter=None, tol=None, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, shuffle=True, verbose=0, loss='epsilon_insensitive', epsilon=0.1, random_state=None, warm_start=False, average=False, n_iter=None)
#loss:{epsilon_insensitive,squared_epsilon_insensitive}

# Note: the loss functions of the classifier and the regressor differ slightly; a usage sketch follows.
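A minimal sketch (assumption, not from the original post): feeding data to PassiveAggressiveClassifier in mini-batches with partial_fit, mimicking the online setting; the batching and make_classification parameters are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = PassiveAggressiveClassifier(C=1.0, random_state=0)
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    # classes must be passed on the first call to partial_fit
    clf.partial_fit(X[batch], y[batch], classes=np.unique(y))
print(clf.score(X, y))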

Reference post:
Online Passive Aggressive

Robustness regression: outliers and modeling errors

Robust regression fits a regression model to corrupt data, i.e. data that contain outliers or that should not really be modelled by the regression model at all.

Note: robust regression does not perform well in high-dimensional settings.

Robust regression covers three estimators: Theil-Sen, RANSAC, and HuberRegressor.

1. When to use each of the three estimators:
1) Illustration of the fitting errors of the three estimators on data corrupted in X or in y:


2) Strengths and weaknesses of the three estimators:

  • On small datasets HuberRegressor is faster than RANSAC and Theil-Sen, but on large datasets the situation is reversed, because each iteration of RANSAC and Theil-Sen works on a subset of the data rather than on the full sample;
  • With default parameters, HuberRegressor is more robust than RANSAC and Theil-Sen;
  • RANSAC is faster than Theil-Sen and scales better with the dataset size;
  • In general, RANSAC deals best with large outliers in the y direction;
  • Theil-Sen deals best with medium-sized outliers in the X direction, but this advantage fades as the number of features grows;
  • When unsure which estimator to use, use RANSAC!

2. Introduction to the three estimators (a small comparison sketch follows this list):

  • RANSAC:
    well suited to photogrammetric computer vision problems; see the reference post RANSAC算法详解.

  • Theil-Sen:
    Theil-Sen regression is comparable to OLS in terms of asymptotic unbiasedness, but it is a non-parametric regression that makes no assumption about the distribution of the data. It is a median-based estimator, which makes it very robust to outliers; however, its performance degrades markedly as the number of features grows, eventually becoming even worse than OLS.
    The core idea: pick a number of subsets of the dataset, fit a linear model on each, and take the median of the fitted slopes as the slope of the final model. In the most extreme case, a line is drawn through every pair of points in the dataset and the median of all those slopes is taken as the final slope (my own understanding).

  • Huber Regression
    Unlike RANSAC and Theil-Sen, Huber regression does not remove the influence of the outliers altogether; instead, it switches their contribution to the loss from a quadratic to a linear loss function.
    Its loss function is shown in the figure below.

    HuberRegressor differs from SGD with loss='huber' in the following ways:
    for HuberRegressor, once the epsilon parameter is set, rescaling the dataset (X, y) does not change the final result, whereas with SGD the epsilon parameter has to be re-tuned after every rescaling of the data;
    for a small number of samples HuberRegressor is more efficient, while SGD may need many more iterations to reach the same result.
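A minimal comparison sketch (assumption, not from the original post) of the three estimators on 1-D data with a few large outliers in the y direction; all values are illustrative.

import numpy as np
from sklearn.linear_model import HuberRegressor, RANSACRegressor, TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)
y[:5] += 50.0                                   # inject a few large outliers in y

for est in (HuberRegressor(), RANSACRegressor(random_state=0), TheilSenRegressor(random_state=0)):
    est.fit(X, y)
    name = type(est).__name__
    # RANSAC stores the final inlier-fitted model in est.estimator_
    coef = est.estimator_.coef_ if name == 'RANSACRegressor' else est.coef_
    print(name, coef)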

Polynomial regression: extending linear models with basis functions

1. Core idea:
Start from the basic linear model y = w*x + b + epsilon.
By mapping the features x to g(x), the linear model is turned into a non-linear one: y = w*g(x) + b + epsilon.
For example:
original features: [x1, x2]; linear model: y = w1*x1 + w2*x2 + b + epsilon
transformed features: [x1, x2, x1*x2, x1^2, x2^2]; polynomial model: y = w1*x1 + w2*x2 + w3*x1*x2 + w4*x1^2 + w5*x2^2
2. Code example:

sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
# degree: the degree of the polynomial
# interaction_only=True: e.g. for features (x1, x2, x3) with degree=2, the polynomial transformation produces (x1, x2, x3, x1*x2, x1*x3, x2*x3), without the pure powers
# include_bias=True: adds a constant column, which is equivalent to adding an intercept
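A minimal sketch (not from the original post) showing the effect of interaction_only on a three-feature row; the input values are arbitrary.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1., 2., 3.]])
# full degree-2 expansion: x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X))
# interaction terms only: x1, x2, x3, x1*x2, x1*x3, x2*x3
print(PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X))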

>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)),
...                   ('linear', LinearRegression(fit_intercept=False))])
>>> # fit to an order-3 polynomial data
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps['linear'].coef_
array([ 3., -2.,  1., -1.])

Official documentation
