Machine Learning: Regularization

In this chapter, we introduce regularization methods with some hands-on Python code.


Regularization

  • Why regularization: in the bias-variance trade-off, we have seen that complicated models can have quite low bias, but they tend to overfit because they have high variance: the model has captured noise in the training set. So we need to control the variance.
  • How regularization: by constraining or regularizing the coefficient estimates, or equivalently, shrinking the coefficient estimates toward zero, we can reduce the variance of the model. A small simulation below illustrates this effect.
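
Here is a minimal simulation sketch (the synthetic data and parameter values are assumptions chosen for illustration, not from the original text) comparing how much OLS and ridge coefficient estimates vary across resampled training sets:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n, k = 30, 10
beta_true = rng.randn(k)

ols_coefs, ridge_coefs = [], []
for _ in range(200):
    # Draw a fresh training set from the same population each time
    X = rng.randn(n, k)
    y = X @ beta_true + rng.randn(n) * 2.0
    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=10.0).fit(X, y).coef_)

# Shrinkage reduces the variance of the estimates (at the cost of some bias)
print("OLS coefficient variance:  ", np.var(ols_coefs, axis=0).mean())
print("Ridge coefficient variance:", np.var(ridge_coefs, axis=0).mean())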

Lasso

The idea of lasso is to use an $\ell_1$ penalty, which has the effect of forcing some of the coefficient estimates to be exactly equal to 0 when the tuning parameter $\alpha$ is sufficiently large. The results are called sparse solutions of linear models penalized with the $\ell_1$ norm.

  • Mathematically, it consists of a linear model with an added regularization term or shrinkage penalty.
    $$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2 + \alpha\sum_{j=1}^{k}|\beta_j|$$
    Here we can see that all the $\beta$'s are penalized equally.
    An equivalent optimization problem:
    $$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{k}|\beta_j| \le s$$
  • $\alpha$ is called a tuning parameter. The larger $\alpha$ is, the more the coefficient estimates are shrunk toward zero.
  • Lasso is able to do variable selection, and the resulting model is simpler and more interpretable.
from sklearn import linear_model

# Fit a lasso model with tuning parameter alpha = 0.1 on a tiny toy dataset
reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
reg.predict([[1, 1]])
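
To see the variable-selection effect in action, here is a small sketch (the synthetic data and the alpha value are assumptions for illustration): with irrelevant features present, several lasso coefficients come out exactly zero.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Only 3 of the 10 features actually carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)                             # several entries are exactly 0.0
print("selected:", np.flatnonzero(lasso.coef_))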

Ridge

The idea of ridge is just like lasso: we add a regularization term. However, this time we use the $\ell_2$ norm.
$$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2 + \alpha\sum_{j=1}^{k}\beta_j^2$$
or the equivalent optimization problem:
$$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{k}\beta_j^2 \le s$$

  • Why the $\ell_2$ norm: the $\ell_2$ norm penalizes larger $\beta$'s more heavily.
  • Ridge cannot do variable selection: coefficients are shrunk toward zero but are never set exactly to zero.
  • Ridge has a closed-form solution that resembles the OLS one: $\hat{\beta} = (X^TX + \alpha I)^{-1}X^TY$ (see the sketch after this list for a quick check).
  • Ridge is quite fast: for a fixed $\alpha$ we only need to fit a single model, whereas subset-selection approaches with OLS would have to consider on the order of $2^k$ candidate models.


    In some cases we may have more variables than observations, which is typical in healthcare data.
  • If we have $n$ observations and $k$ variables with $k > n$, lasso will select at most $n$ variables, but ridge will still keep all $k$ variables.
  • For such a dataset, plain linear regression cannot be estimated either, because $X^TX$ is not invertible. Ridge, however, still gives a model because $(X^TX + \alpha I)^{-1}$ exists for $\alpha > 0$.
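
As a quick check of the closed-form expression above, here is a minimal sketch (centered synthetic data so the intercept can be dropped; the dataset and alpha value are assumptions) comparing it against sklearn's Ridge:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# Centered toy data so the closed form (without an intercept) applies directly
X, y = make_regression(n_samples=50, n_features=5, noise=1.0, random_state=0)
X = X - X.mean(axis=0)
y = y - y.mean()

alpha = 1.0
# Closed-form ridge solution: beta = (X^T X + alpha I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(beta_closed, ridge.coef_))  # expected: True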

Elastic Net

Elastic-net is a weighted combination of Ridge and Lasso:
$$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2 + \alpha\Big(L\sum_{j=1}^{k}|\beta_j| + (1-L)\sum_{j=1}^{k}\beta_j^2\Big)$$
or the equivalent optimization problem:
$$\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad L\sum_{j=1}^{k}|\beta_j| + (1-L)\sum_{j=1}^{k}\beta_j^2 \le s$$
In particular, Elastic Net reduces to lasso when L = 1 and to ridge when L = 0.
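
In scikit-learn, the mixing weight L plays the role of the l1_ratio parameter of ElasticNet (up to scikit-learn's exact scaling of the penalty terms). A minimal usage sketch; the data and parameter values are assumptions:

from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# l1_ratio=1.0 gives a pure lasso penalty, l1_ratio=0.0 a pure ridge penalty
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(enet.coef_)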

  • Why elastic net and lasso can do variable selection but ridge cannot: the Lasso and Elastic Net constraint regions have corners. As a result, the solution can land on a corner, giving a $\hat{\beta}$ value of exactly 0. The Ridge constraint region has no corners, so we will generally not get a $\hat{\beta}$ value of exactly 0. Thus, Lasso and Elastic Net can both be used for variable selection, while Ridge cannot.
  • Comparing Lasso and Elastic Net: Lasso has sharper corners, so it reduces the number of variables more aggressively than Elastic Net. Elastic Net, on the other hand, is gentler with variable selection.

[Figure: RSS contours together with the constraint regions of Ridge (circle), Lasso (diamond with corners), and Elastic Net]

  • In this picture, the ellipses are the contours of the RSS.
  • Ridge regression has a circular constraint region with no sharp points, while Lasso and Elastic Net both have corners on the axes. The intersection of the RSS contours with the Ridge constraint region will generally not occur on an axis, so the ridge coefficient estimates will be exclusively non-zero. A small empirical check follows below.
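
As that empirical check, here is a hedged sketch (synthetic data; the alpha values are arbitrary choices for illustration) counting how many coefficients each method sets exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=80, n_features=20, n_informative=5,
                       noise=2.0, random_state=1)

for name, model in [("Lasso", Lasso(alpha=1.0)),
                    ("ElasticNet", ElasticNet(alpha=1.0, l1_ratio=0.5)),
                    ("Ridge", Ridge(alpha=1.0))]:
    coefs = model.fit(X, y).coef_
    print(name, "zero coefficients:", int(np.sum(coefs == 0.0)))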

Lasso: a real case

In lasso, we care about the tuning parameter $\alpha$. We will use cross-validation to find it.

GridSearchCV

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

X, y = datasets.load_diabetes(return_X_y=True)

lasso = Lasso()

# Repeated 10-fold cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Candidate values for the tuning parameter alpha
lasso_alphas = np.linspace(0.01, 0.1, 20)
grid = {'alpha': lasso_alphas}
gscv = GridSearchCV(lasso, grid, cv=cv, n_jobs=-1)

results = gscv.fit(X, y)

# Note: with the default scoring, best_score_ is the mean cross-validated R^2, not the MSE
print('R^2:  %.5f ' % results.best_score_)
print('Config: %s ' % results.best_params_)
plt.plot(lasso_alphas, results.cv_results_['mean_test_score'])
plt.xlim(0.01, 0.1)
plt.show()
R^2:  0.47184 
Config: {'alpha': 0.019473684210526317}

[Figure: mean cross-validated score versus alpha]
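
Since GridSearchCV refits the best model on the full data by default (refit=True), the fitted estimator can be used directly. A small usage sketch, not part of the original code:

# The refit lasso with the selected alpha is available as best_estimator_
best_lasso = results.best_estimator_
print(best_lasso.coef_)
print(best_lasso.predict(X[:3]))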

LassoCV

Specifically for lasso: sklearn.linear_model.LassoCV.

LassoCV(eps=1e-3, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, cv=None, random_state=None, n_jobs=None)

  • n_jobs: number of CPUs to use during the cross-validation
  • .score(X, y): returns the coefficient of determination of the prediction
  • .alpha_: the best alpha, i.e. the amount of penalization chosen by cross-validation
  • .mse_path_: mean squared error on each fold, for each alpha tried
  • .alphas_: the grid of alphas used for fitting
  • .coef_: estimated coefficients for each variable
from sklearn.linear_model import LassoCV

# 20-fold cross-validation over an automatically generated grid of alphas
lassocv = LassoCV(cv=20, random_state=0).fit(X, y)

print('Score: %.5f' % lassocv.score(X, y))  # R^2 on the training data
print(lassocv.predict(X[:1]))               # prediction for the first observation
print('Alpha: %.5f' % lassocv.alpha_)       # alpha chosen by cross-validation
print(lassocv.coef_)                        # fitted coefficients
Score: 0.53886
[197.0943]
Alpha: 0.03265
[  -0.     -217.6243  525.7042  299.8325 -112.4265   -0.     -244.4867
    0.      497.7109   87.9674  -15.3141  -42.8252   34.8853  -63.6733
  -52.1601   50.1955   30.108   -63.8399  -30.2097   72.6927   61.6619
   47.4095  119.0708   41.8884]

Plot the MSE path:

# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort
# License: BSD 3 clause

import time

import matplotlib.pyplot as plt

from sklearn.linear_model import LassoCV, LassoLarsCV

# Small offset to avoid taking the log of zero on the semilog axis
EPSILON = 1e-4

# #############################################################################
# LassoCV: coordinate descent

# Compute paths
print("Computing regularization path using the coordinate descent lasso...")
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1

# Display results
plt.figure()
ymin, ymax = 2300, 3800
plt.semilogx(model.alphas_ + EPSILON, model.mse_path_, ':')
plt.plot(model.alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_ + EPSILON, linestyle='--', color='k',
            label='alpha: CV estimate ' + str(round(model.alpha_, 3)))

plt.legend()

plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent '
          '(train time: %.2fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)

# #############################################################################
# LassoLarsCV: least angle regression

# Compute paths
print("Computing regularization path using the Lars lasso...")
t1 = time.time()
model = LassoLarsCV(cv=20, normalize=False).fit(X, y)
t_lasso_lars_cv = time.time() - t1

# Display results
plt.figure()
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ':')
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
             label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_, linestyle='--', color='k',
            label='alpha CV ' + str(round(model.alpha_, 3)))
plt.legend()

plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (train time: %.2fs)'
          % t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)

plt.show()

[Figure: mean squared error on each fold versus alpha, coordinate descent (LassoCV)]
[Figure: mean squared error on each fold versus alpha, Lars (LassoLarsCV)]


LassoLarsIC

I edited the code to also show the alpha values selected by AIC and BIC.

# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort
# License: BSD 3 clause

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets

# This is to avoid division by zero while doing np.log10
EPSILON = 1e-4

X, y = datasets.load_diabetes(return_X_y=True)

rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)]  # add some bad features

# normalize data as done by Lars to allow for comparison
X /= np.sqrt(np.sum(X ** 2, axis=0))

# #############################################################################
# LassoLarsIC: least angle regression with BIC/AIC criterion

model_bic = LassoLarsIC(criterion='bic', normalize=False)
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_

model_aic = LassoLarsIC(criterion='aic', normalize=False)
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_


def plot_ic_criterion(model, name, color, ic):
    criterion_ = model.criterion_
    plt.semilogx(model.alphas_ + EPSILON, criterion_, '--', color=color,
                 linewidth=3, label='%s criterion' % name)
    vertical_label = 'alpha: %s estimate ' % name
    vertical_label += str(round(ic, 3))
    plt.axvline(model.alpha_ + EPSILON, color=color, linewidth=3,
                label= vertical_label )
    plt.xlabel(r'$\alpha$')
    plt.ylabel('criterion')


plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b', alpha_aic_)
plot_ic_criterion(model_bic, 'BIC', 'r', alpha_bic_)
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)'
          % t_bic)
plt.show()

[Figure: AIC and BIC criterion curves versus alpha (LassoLarsIC)]
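
To read the selected values off directly rather than from the plot, a small addition (not part of the original script) could print them:

# Alphas selected by each information criterion, and the resulting model sizes
print("alpha chosen by AIC:", alpha_aic_)
print("alpha chosen by BIC:", alpha_bic_)
print("AIC model keeps", int(np.sum(model_aic.coef_ != 0)), "variables")
print("BIC model keeps", int(np.sum(model_bic.coef_ != 0)), "variables")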


Source:

Business Analytics - Yi Zhang
ISLR
sklearn
