线性模型

最新推荐文章于 2024-07-31 14:30:10 发布

u200710

最新推荐文章于 2024-07-31 14:30:10 发布

阅读量243

点赞数

分类专栏： scikit-learn 文章标签：机器学习 python

原文链接：https://scikit-learn.org/stable/modules/linear_model.html

版权

scikit-learn 专栏收录该内容

21 篇文章 1 订阅

订阅专栏

线性模型

本文后续的方法用于解决目标值是特征线性组合的回归问题。在数学符号中，假设 $\hat{y}$ 是预测值。
$\hat{y}(w, x) = w_0+w_1x_1+...+w_px_p$
在后续描述中，将向量 $w=(w_1,...,w_p)$ 定义为coef_，标量 $w_0$ 定义为intercept_。

普通最小二乘法

LinearRegression拟合带参数 $w=(w_1,...,w_p)$ 的线性模型以最小化实际目标值和预测目标值之间的残差平方和。数学上，它求解如下形式的问题：
$\max \limits_{w}||Xw-y||_2^{2}$
LinearRegression以数组 $X$ 和 $y$ 作为fit方法的输入，将线性模型的系数 $w$ 保存在成员coef_中，如下例

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# 取某一列的数据，并以列向量的形式
diabetes_X = diabetes_X[:, np.newaxis, 2]

# 划分数据集
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# 声明linear regression对象
regr = linear_model.LinearRegression()

# 训练模型
regr.fit(diabetes_X_train, diabetes_y_train)

# 拟合测试集中的数据
diabetes_y_pred = regr.predict(diabetes_X_test)

print('Coefficients: \n', regr.coef_)

print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))

print('Coefficient of determination: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

普通最小二乘的系数依赖于特征的独立性。当特征相关和矩阵 $X$ 中的列近似线性相关时，矩阵 $X$ 会近似于奇异，因此，将导致最小二乘估计对目标值中的随机误差高度敏感，进一步会产生较大的方差。多重共线性的情况将产生。最小二乘的解由 $X$ 的奇异值分解(Singular Value Decomposition)计算得到的。

岭回归与分类

加入 L1 或 L2正则化，让权值尽可能小，最后构造一个所有参数都比较小的模型。因为一般认为参数值小的模型比较简单，能适应不同的数据集，也在一定程度上避免了过拟合现象。

可以设想一下对于一个线性回归方程，若参数很大，那么只要数据偏移一点点，就会对结果造成很大的影响；但如果参数足够小，数据偏移得多一点也不会对结果造成什幺影响，一种流行的说法是抗扰动能力强。

回归

岭回归(Ridge Regression)通过添加系数大小的惩罚项来解决普通最小二乘中的一些问题。岭回归的系数最小化如下的目标函数：
$\min \limits_{w}||Xw-y||_2^{2}+\alpha||w||_2^2$
复杂性参数 $\alpha$ 控制收缩量： $\alpha$ 的值越大，收缩量越大，系数对共线性更加鲁棒。

和其它线性模型一样，Ridge以数组 $X$ 和 $y$ 作为fit方法的输入，将线性模型的系数 $w$ 保存在成员coef_中，如下例

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)

coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)

ax = plt.gca()
# 解决负号不能正常显示的问题
for tick in ax.yaxis.get_major_ticks():
        tick.label1.set_fontproperties('stixgeneral')
for tick in ax.xaxis.get_major_ticks():
        tick.label1.set_fontproperties('stixgeneral')
        
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

以上代码所画的图表明，当 $\alpha$ 较大时，正则化的影响支配了平方损失函数，特征系数趋近于0，当 $\alpha$ 逐渐变小时，系数的取值会倾向于最小二乘法对应的值，在此过程中，特征系数的取值会发生极大地震荡，因此，在实际应用中，有必要调整Ridge系数，保证两者的平衡。

分类

RidgeClassifier首先将二元的目标值转化为{-1, 1}，然后将此分类问题当作一个回归问题，优化和回归问题相同的目标函数。预测的类别对应于回归预测的符号。对于多分类问题，会将其当作多分类回归，预测的类别对应于目标值最高的输出。

使用一个最小二乘损失函数而不是传统的logistic或折页损失函数去拟合一个分类模型，看上去是有问题的。但是，在实际中，从精确度、准确率或召回率来看，所有这些模型都能够产生相似的交叉验证分数，然而，使用最小二乘损失函数的RidgeClassifer可以使用不同计算性能的数值求解器。

RidgeClassifier计算明显比LogisiticRegression快，这是因为它只需要计算投影矩阵 $X^TX)^{-1}X^T$ 一次。

这个分类器有时候也会被当做使用线性核的最小二乘支持向量机。

该分类器的使用如下例

# coding: utf-8

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt

# 命令行配置
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, ComplementNB # scikit-learn version >= 0.20.0
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# https://www.cnblogs.com/wj-1314/p/8974021.html
# dest: 可以决定解析后,取值时的属性名, 尤其适于有多个等价参数. 不指定时就是选项不加-的字符串
# action: 用于控制对选项和参数的处理
# "store": 储存值到 dest 指定的属性,强制要求后面提供参数
# "store_true": 当使用该选项时,后面的 dest 将设置为 true, 不跟参数
op = OptionParser()
op.add_option('--report',
              action='store_true', dest='print_report',
              help='Print a detailed classification report.')
op.add_option('--chi2_select',
              action='store', type='int', dest='select_chi2',
              help='Select some number of features using a chi-squared test.')
op.add_option('--confusion_matrix',
              action='store_true', dest='print_cm',
              help='Print the confusion matrix.')
op.add_option('--top10',
              action='store_true', dest='print_top10',
              help='Print ten most discriminative terms per class for every classifier.')
op.add_option('--all_categories',
              action='store_true',
              help='Whether to use all categories or not.')
op.add_option('--use_hashing',
              action='store_true',
              help='Use a hashing vectorizer.')
op.add_option('--n_features',
              action='store', type=int, default=2**16,
              help='n_features when using the hashing vectorizer.')
op.add_option('--filtered',
              action='store_true',
              help='Remove newsgroup information that is easily overfit: headers, signatures, and quoting.')


def is_interactive():
    return not hasattr(sys.modules['__main__'], '__file__')


argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error('this script takes no arguments.')
    sys.exit(1)

print(__doc__)
op.print_help()
print()

# 下载数据
if opts.all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space'
    ]

if opts.filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print('Loading 20 newsgroups dataset for categories:')
print(categories if categories else 'all')

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

target_names = data_train.target_names


def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6


data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)


print('%d documents - %0.3fMB (training set)'% (len(data_train.data), data_train_size_mb))
print('%d documents - %0.3fMB (test set)'% (len(data_test.data), data_test_size_mb))
print('%d categories' % len(target_names))
print()

y_train, y_test = data_train.target, data_test.target

print('Extracting features from the training data using a sparse vectorizer')
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                                   n_features=opts.n_features)
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(data_train.data)
duration = time()-t0
print('done in %fs at %0.3fMB/s'%(duration, data_train_size_mb/duration))
print('n_samples: %d, n_features: %d' % X_train.shape)
print()

print('Extracting features from the test data using the same vectorizer')
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time()-t0
print('done in %fs at %0.3fMB/s' % (duration, data_test_size_mb/duration))
print('n_samples: %d, n_features: %d' % X_test.shape)
print()

if opts.use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names()

if opts.select_chi2:
    print('Extracting %d best features by a chi-squared test' % opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names:
        feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]
    print('done in %fs' % (time()-t0))
    print()

if feature_names:
    feature_names = np.asarray(feature_names)

def trim(s):
    return s if len(s) <= 80 else s[:77]+"..."

# 测试分类器
def benchmark(clf):
    print('_'*80)
    print('Training: ')
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time()-t0
    print('train time: %0.3fs'%train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print('test time: %0.3fs' % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print('accuracy: %0.3f' % score)

    if hasattr(clf, 'coef_'):
        print('dimensionalty: %d' % clf.coef_.shape[1])
        print('density: %f' % density(clf.coef_))

        if opts.print_top10 and feature_names is not None:
            print('top 10 keywords per class: ')
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim('%s: %s' % (label, ' '.join(feature_names[top10]))))
        print()

    if opts.print_report:
        print('classification report: ')
        print(metrics.classification_report(y_test, pred,
                                            target_names=target_names))

    if opts.print_cm:
        print('confusion matrix: ')
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time


results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver='sag'), 'Ridge Classifier'),
        (Perceptron(max_iter=50), 'Perceptron'),
        (PassiveAggressiveClassifier(max_iter=50), 'Passive-Aggression'),
        (KNeighborsClassifier(n_neighbors=10), 'kNN'),
        (RandomForestClassifier(), 'Random forest')):
    print('='*80)
    print(name)
    results.append(benchmark(clf))

for penalty in ['l2', 'l1']:
    print('='*80)
    print('%s penalty' % penalty.upper())

    results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))

    results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))

print('='*80)
print('Elastic-Net penalty')
results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty='elasticnet')))

print('='*80)
print('NearestCentroid (aka Rocchio classifier)')
results.append(benchmark(NearestCentroid()))

print('='*80)
print('Naive Bayes')
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))

print('='*80)
print('LinearSVC with L1-based feature selection')
results.append(benchmark(Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty='l1', dual=False,
                                                    tol=1e-3))),
    ('classification', LinearSVC(penalty='l2'))])))

# 画图
indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

plt.figure(figsize=(12, 8))
plt.title('Score')
plt.barh(indices, score, 0.2, label='score', color='navy')
plt.barh(indices+0.3, training_time, 0.2, label='training time', color='c')
plt.barh(indices+0.6, test_time, 0.2, label='test time', color='darkorange')
plt.yticks(())
plt.legend(loc='best')
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)

for i, c in zip(indices, clf_names):
    plt.text(-0.3, i, c)

plt.show()

不同分类器比较
RidgeCV能够使用内置alpha参数交叉验证实现岭回归。除了使用留一法交叉验证的形式外，它和GridSearchCV使用相同的方式工作。

import numpy as np
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
reg.fit([[0, 0], [0, 0], [1, 1]], [0, 0.1, 1])
print(reg.alpha_)

Lasso (Least absolute shrinkage and selection operator)

Lasso是一个用来估计稀疏系数的线性模型。它倾向于有较少非零系数的解，能够根据解，有效减少特征的数量，常用于压缩感知领域。在特定情况下，它能够准确地包含非零系数集合。

数学表达式上，它包含一个添加了正则项的线性模型，最小化的目标函数为：
$\min \limits_{w} \frac{1}{2n_{samples}} ||Xw-y||_2^{2}+\alpha||w||_1$
该分类器使用如下例：

from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
print(reg.predict([[1, 1]]))

由于 Lasso 容易使得部分权重取 0，所以可以用其做特征选择，Lasso的名字就指出了它是一个选择算子。权重为0的特征对问题没有贡献，直接去掉权重为0的特征，模型的输出值不变。

两个使用Lasso的例子，代码如下：

# coding: utf-8
# Lasso and Elastic Net for Sparse Signals

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

np.random.seed(42)

n_samples, n_features = 50, 100
X = np.random.randn(n_samples, n_features)

idx = np.arange(n_features)
coef = (-1) ** idx * np.exp(-idx / 10)
coef[10:] = 0
y = np.dot(X, coef)

y += 0.01 * np.random.normal(size=n_samples)

n_samples = X.shape[0]
X_train, y_train = X[: n_samples//2], y[: n_samples//2]
X_test, y_test = X[n_samples//2 :], y[n_samples//2 :]

from sklearn.linear_model import Lasso

alpha = 0.1
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(lasso)
print("r^2 on test data: %f" % r2_score_lasso)

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=alpha, l1_ratio=0.7)

y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, y_pred_enet)
print(enet)
print('r^2 on test data: %f' % r2_score_enet)

# np.where(enet.coef_)[0] gives the non-zero value index
m, s, _ = plt.stem(np.where(enet.coef_)[0], enet.coef_[enet.coef_!=0],
                   markerfmt='x', label='Elastic net coefficients',
                   use_line_collection=True)

plt.setp([m, s], color='#2ca02c')
m, s, _ = plt.stem(np.where(lasso.coef_)[0], lasso.coef_[lasso.coef_!=0],
                   markerfmt='x', label='Lasso coefficients',
                   use_line_collection=True)
plt.setp([m, s], color='#ff7f03')
plt.stem(np.where(coef)[0], coef[coef!=0], label='True coefficients',
         markerfmt='bx', use_line_collection=True)

plt.legend(loc='best')
plt.title('Lasso $R^2$: %.3f, Elastic Net $R^2$: %.3f' % (r2_score_lasso, r2_score_enet))
plt.show()

# coding: utf-8
# Compressive sensing: tomography reconstruction with L1 prior (Lasso)

import numpy as np
from scipy import sparse
from scipy import ndimage
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt


def _weights(x, dx=1, orig=0):
    x = np.ravel(x)  # 将多维数组降为一维
    floor_x = np.floor((x-orig)/dx).astype(np.int64)
    alpha = (x - orig - floor_x*dx) /dx
    return np.hstack((floor_x, floor_x+1)), np.hstack((1-alpha, alpha))


def _generate_center_coordinates(l_x):
    X, Y = np.mgrid[:l_x, :l_x].astype(np.float64)
    center = l_x/2.
    X += 0.5 - center
    Y += 0.5 - center
    return X, Y


def build_projection_operator(l_x, n_dir):
    X, Y = _generate_center_coordinates(l_x)
    angles = np.linspace(0, np.pi, n_dir, endpoint=False)
    data_inds, weights, camera_inds = [], [], []
    data_unravel_indices = np.arange(l_x ** 2)
    data_unravel_indices = np.hstack((data_unravel_indices,
                                      data_unravel_indices))

    for i, angle in enumerate(angles):
        Xrot = np.cos(angle)*X - np.sin(angle)*Y
        inds, w = _weights(Xrot, dx=1, orig=X.min())
        mask = np.logical_and(inds >= 0, inds < l_x)
        weights += list(w[mask])
        camera_inds += list(inds[mask]+i*l_x)
        data_inds += list(data_unravel_indices[mask])

    proj_operator = sparse.coo_matrix((weights, (camera_inds, data_inds)))
    return proj_operator


def generate_synthetic_data():
    rs = np.random.RandomState(0)
    n_pts = 36
    x, y = np.ogrid[0:l, 0:l]
    mask_outer = (x - l / 2.) ** 2 + (y - l / 2.) ** 2 < (l / 2.) ** 2
    mask = np.zeros((l, l))
    points = l * rs.rand(2, n_pts)
    mask[(points[0]).astype(np.int), (points[1]).astype(np.int)] = 1
    mask = ndimage.gaussian_filter(mask, sigma=l / n_pts)
    res = np.logical_and(mask > mask.mean(), mask_outer)
    return np.logical_xor(res, ndimage.binary_erosion(res))


l = 128
proj_operator = build_projection_operator(l, l//7)
data = generate_synthetic_data()
proj = proj_operator * data.ravel()[:, np.newaxis]
proj += 0.15*np.random.randn(*proj.shape)

rgr_ridge = Ridge(alpha=0.2)
rgr_ridge.fit(proj_operator, proj.ravel())
rec_l2 = rgr_ridge.coef_.reshape(l, l)

rgr_lasso = Lasso(alpha=0.001)
rgr_lasso.fit(proj_operator, proj.ravel())
rec_l1 = rgr_lasso.coef_.reshape(l, l)

plt.figure(figsize=(8, 3.3))

plt.subplot(131)
plt.imshow(data, cmap=plt.cm.gray, interpolation='nearest')
plt.axis('off')
plt.title('original image')

plt.subplot(132)
plt.imshow(rec_l2, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L2 penalization')
plt.axis('off')

plt.subplot(133)
plt.imshow(rec_l1, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L1 penalization')
plt.axis('off')

plt.subplots_adjust(hspace=0.01, wspace=0.01, top=1, bottom=0, left=0, right=1)

plt.show()

设置正则化参数

alpha系数控制了所估计系数的稀疏性程度。

使用交叉验证

scikit-learn使用基于交叉验证的LassoCV和LassoLarsCV方法设置Lasso的算法参数alpha。LassoLarsCV是基于Least Angle Regreesion的算法。

对于带许多多重共线性特征的高维数据集，LassoCV是最常用的方法。但是，LassoLarsCV有着能够探索更相关的alpha参数值，如果样本数远小于特征数，它比LassoCV快。

基于信息准则的模型选择

估计器LassoLarsIC使用Akaike和Bayes信息准则(AIC和BIC)。当使用k折交叉验证时，由于正则路径只计算一次，而不是k+1次，LassoLarsIC是一种更廉价计算的方法以找到alpha的最优值。

AIC准则是由日本统计学家Akaike与1973年提出的，全称是最小化信息量准则（Akaike Information Criterion）。它是拟合精度和参数个数的加权函数：AIC=2（模型参数的个数）-2ln（模型的极大似然函数）。

AIC为模型选择提供了有效的规则，但也有不足之处。当样本容量很大时，在AIC准则中拟合误差提供的信息就要受到样本容量的放大，而参数个数的惩罚因子却和样本容量没关系（一直是2），因此当样本容量很大时，使用AIC准则选择的模型不收敛于真实模型，它通常比真实模型所含的未知参数个数要多。

BIC（Bayesian InformationCriterion）贝叶斯信息准则是Schwartz在1978年根据Bayes理论提出的判别准则，称为SBC准则(也称BIC)，弥补了AIC的不足。SBC的定义为：BIC = ln(n)(模型中参数的个数) - 2ln(模型的极大似然函数值)

下例介绍上述三种方法：

# coding: utf-8
# Lasso model selection: Cross-Validation / AIC / BIC

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets

EPSILON = 1e-4

X, y = datasets.load_diabetes(return_X_y=True)

rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)]

X /= np.sqrt(np.sum(X**2, axis=0))

model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_

model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_


def plot_ic_criterion(model, name, color):
    criterion_ = model.criterion_
    plt.semilogx(model.alphas_ + EPSILON, criterion_, '--', color=color,
                 linewidth=3, label='%s criterion' % name)
    plt.axvline(model.alpha_ + EPSILON, color=color, linewidth=3,
                label='alpha: %s estimate'% name)
    plt.xlabel(r'$\alpha$')
    plt.ylabel('criterion')


plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'bic', 'r')
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)' % t_bic)


print('Computing regularization path using the coordinate descent lasso...')
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1

plt.figure()
ymin, ymax = 2300, 3800
plt.semilogx(model.alphas_+EPSILON, model.mse_path_, ':')
plt.plot(model.alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_+EPSILON, linestyle='--', color='k',
            label='alpha: CV estimate')
plt.legend()

plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent (training time %.3fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)


print('Computing regularization path using the Lars lasso')
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1

plt.figure()
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ':')
plt.plot(model.cv_alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_+EPSILON, linestyle='--', color='k',
            label='alpha CV')
plt.legend()

plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (training time %.3fs)' % t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)

plt.show()

参数alpha和SVM的正则参数的等价性等于alpha = 1 / C或alpha = 1 / (n_samples*C)。

多任务的Lasso

MultiTaskLasso是一个用于估计多重回归问题稀疏系数的线性模型：y是一个大小为(n_samples, n_tasks)的二维数组。它的约束是对于所有回归问题所选择的特征是相同的，也称作tasks。

下例比较了由单个简单Lasso或MultiTaskLasso所得到系数矩阵的非零系数的位置。

# coding: utf-8
# Joint feature selection with multi-task Lasso

import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import MultiTaskLasso, Lasso

rng = np.random.RandomState(42)

n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5
coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2*np.pi, n_tasks)
for k in range(n_relevant_features):
    coef[:, k] = np.sin((1. + rng.randn(1))*times + 3*rng.randn(1))

X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)

coef_lasso_ = np.array([Lasso(alpha=0.5).fit(X, y).coef_ for y in Y.T])
coef_multi_task_lasso_ = MultiTaskLasso(alpha=1.0).fit(X, Y).coef_

fig = plt.figure(figsize=(8, 5))
plt.subplot(121)
plt.spy(coef_lasso_)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'Lasso')
plt.subplot(122)
plt.spy(coef_multi_task_lasso_)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'MultiTaskLasso')
fig.suptitle('Coefficient non-zero location')

feature_to_plot = 0
plt.figure()
lw = 2
plt.plot(coef[:, feature_to_plot], color='seagreen', linewidth=lw,
         label='Ground truth')
plt.plot(coef_lasso_[:, feature_to_plot], color='cornflowerblue', linewidth=lw,
         label='Lasso')
plt.plot(coef_multi_task_lasso_[:, feature_to_plot], color='gold', linewidth=lw,
        label='MultiTaskLasso')
plt.legend(loc='upper center')
plt.axis('tight')
plt.ylim([-1.1, 1.1])
plt.show()

数学上，它是一个带混合 $l_1l_2$ 正则项的线性模型，目标函数为
$\min \limits_{w} \frac{1}{2n_{samples}} ||XW-Y||_{Fro}^{2}+\alpha||w||_{21}$
其中， $F r o$ 是Frobenius范数 $||A||_{Fro}=\sqrt{\sum\limits_{ij}\alpha_{ij}^2}$ ， $l_1l_2$ 表示 $||A||_{21}=\sum\limits_{i}\sqrt{\sum\limits_{j}\alpha_{ij}^2}$ 。

MultiTaskLasso同样使用坐标下降算法拟合参数。

Elastic-Net

Elastic-Net是一个使用了 $l_1$ 和 $l_2$ 范数作为正则项的线性回归模型。这种组合既可以学习仅有少数非零参数的稀疏模型，如Lasso，也可以仍然保持Ridge的正则属性。通过使用l1_ratio参数可以控制 $l_1$ 和 $l_2$ 的凸组合。

当特征中有很多彼此相关联的特征时，Elastic-Net大有用处。Lasso有可能随机选择一个，而Ridge有可能两个都选。

Lasso和Ridge的平衡允许Elastic_Net继承Ridge部分稳定性。它的目标为：
$\min \limits_{W} \frac{1}{2n_{samples}} ||Xw-y||_{2}^{2}+\alpha\rho||w||_{1}+\frac{\alpha(1-\rho)}{2}||w||_2^2$
ElasticNetCV能够通过交叉验证设置参数alpha和l1_ratio。

# coding: utf-8
# Lasso and elastic net (L1 and L2 penalisation) implemented using a coordinate descent

from itertools import cycle
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)

X /= X.std(axis=0)

eps = 5e-3

print('Computing regularization path using the lasso...')
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps, fit_intercept=False)

print('Computing regularization path using the positive lasso...')
alphas_positive_lasso, coefs_positive_lasso, _ = lasso_path(
    X, y, eps=eps, positive=True, fit_intercept=False)

print('Computing regularization path using the elastic net...')
alphas_enet, coefs_enet, _ = enet_path(
    X, y, eps=eps, l1_ratio=0.8, fit_intercept=False)

print('Computing regularization path using the positive elastic net...')
alphas_positive_enet, coefs_positive_enet, _ = enet_path(
    X, y, eps=eps, l1_ratio=0.8, positive=True, fit_intercept=False)

plt.figure(1)
colors = cycle(['b', 'r', 'g', 'c', 'k'])
neg_log_alphas_lasso = -np.log10(alphas_lasso)
neg_log_alphas_enet = -np.log10(alphas_enet)
for coef_l, coef_e, c in zip(coefs_lasso, coefs_enet, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
    l2 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)

plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and Elastic-Net Paths')
plt.legend(([l1[-1], l2[-1]]), ('Lasso', 'Elastic-Net'), loc='lower left')
plt.axis('tight')

plt.figure(2)
neg_log_alphas_positive_lasso = -np.log10(alphas_positive_lasso)
for coef_l, coef_pl, c in zip(coefs_lasso, coefs_positive_lasso, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
    l2 = plt.plot(neg_log_alphas_positive_lasso, coef_pl, linestyle='--', c=c)

plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and positive Lasso')
plt.legend((l1[-1], l2[-1]), ('Lasso', 'positive Lasso'), loc='lower left')
plt.axis('tight')


plt.figure(3)
neg_log_alphas_positive_enet = -np.log10(alphas_positive_enet)
for (coef_e, coef_pe, c) in zip(coefs_enet, coefs_positive_enet, colors):
    l1 = plt.plot(neg_log_alphas_enet, coef_e, c=c)
    l2 = plt.plot(neg_log_alphas_positive_enet, coef_pe, linestyle='--', c=c)

plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Elastic-Net and positive Elastic-Net')
plt.legend((l1[-1], l2[-1]), ('Elastic-Net', 'positive Elastic-Net'),
           loc='lower left')
plt.axis('tight')
plt.show()

Multi-task Elastic-Net

MultiTaskElasticNet的数学形式如下
$\min \limits_{W} \frac{1}{2n_{samples}} ||XW-Y||_{Fro}^{2}+\alpha\rho||w||_{21}+\frac{\alpha(1-\rho)}{2}||w||_{Fro}^2$
MultiTaskElasticNet和MultiTaskElasticNetCV的用法与前述相同。

最小角度回归

最小角度回归Least Angle Regression, LARS是用于处理高维数据的回归算法。它类似于前向分段回归。LARS模型可以使用Lars或者lars_path或者lars_path_gram等估计器。

LARS Lasso

LassoLars是使用了LARS算法的lasso模型，不同于基于坐标下降的实现，它能够得到精确解。

# coding: utf-8
# Lasso path using LARS

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)

print('Computing regularization path using LARS ...')
_, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)

xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficient')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()

正交匹配追踪算法 (OrthogonalMatchingPursuit)

orthogonalMatchingPursuit和orthogonal_mp使用OMP算法来拟合线性模型的参数，但这个模型在非零系数的数量( $l_0$ 范数)上有约束。作为和Least Angle Regression一样的前向特征选择方法，正交匹配追踪能够近似得到带固定非零元素个数的最优参数向量。
$\argmin \limits_{\gamma}||y-X\gamma||_2^2 subjec to ||\gamma||_0 \leq n_{nonzerocoefs}$
以下代码为例：

# coding: utf-8
# Orthogonal Matching Pursuit

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.linear_model import OrthogonalMatchingPursuitCV
from sklearn.datasets import make_sparse_coded_signal

n_components, n_features = 512, 100
n_nonzero_coefs = 17

y, X, w = make_sparse_coded_signal(n_samples=1,
                                   n_components=n_components,
                                   n_features=n_features,
                                   n_nonzero_coefs=n_nonzero_coefs,
                                   random_state=0)

idx, = w.nonzero()

y_noisy = y + 0.05 * np.random.randn(len(y))

plt.figure(figsize=(7, 7))
plt.subplot(4, 1, 1)
plt.xlim(0, 512)
plt.title('Sparse signal')
plt.stem(idx, w[idx], use_line_collection=True)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs)
omp.fit(X, y)
coef = omp.coef_
idx_r, = coef.nonzero()
plt.subplot(4, 1, 2)
plt.xlim(0, 512)
plt.title('Recovered signal from noise free measurements')
plt.stem(idx_r, w[idx_r], use_line_collection=True)

omp.fit(X, y_noisy)
coef = omp.coef_
idx_r, = coef.nonzero()
plt.subplot(4, 1, 3)
plt.xlim(0, 512)
plt.title("Recovered signal from noisy measurements")
plt.stem(idx_r, coef[idx_r], use_line_collection=True)

omp_cv = OrthogonalMatchingPursuitCV()
omp_cv.fit(X, y_noisy)
coef = omp_cv.coef_
idx_r, = coef.nonzero()
plt.subplot(4, 1, 4)
plt.xlim(0, 512)
plt.title("Recovered signal from noisy measurements with CV")
plt.stem(idx_r, coef[idx_r], use_line_collection=True)

plt.subplots_adjust(0.06, 0.04, 0.94, 0.90, 0.20, 0.38)
plt.suptitle('Sparse signal recovery with Orthogonal Matching Pursuit',
             fontsize=16)

plt.show()

贝叶斯回归

贝叶斯回归技术能够将正则项参数包含在估计过程中，正则化参数没有强制设置，而按照手边的数据进行调整。这能通过在模型的超参上引入无信息的先验知识。在岭回归和分类中使用的 $l_2$ 正则项等价于在参数 $w$ 和 $\lambda^{-1}$ 高斯先验下，知道一个最大的后验估计。不用手工设置lambda值，可将其作为一个可以从数据中估计得到的随机变量。

为了得到一个全概率模型，输出 $y$ 可以假设为一个在 $X w$ 附近的高斯分布：
$\alpha) = N(y|Xw, \alpha)$
其中， $\alpha$ 被当做一个可以从数据中估计得到的随机变量。

贝叶斯岭回归

BayesianRidge估计如上所述回归问题的一个概率模型。系数 $w$ 的先验由一个球形的高斯分布给出：
$p(w|\gamma) = N(w|0,\lambda^{-1}I_p)$

$\alpha$ 和 $\lambda$ 的先验分布可选择为gamma分布。所得到的模型被称为Bayes Ridge Regression。

在贝叶斯框架下，上述找到的权重稍微不同于普通最小二乘所得到的权重。但是，贝叶斯岭回归对病态问题更加鲁棒。

# coding: utf-8
# Bayesian Ridge Regression

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.linear_model import BayesianRidge, LinearRegression

# Generating simulated data with Gaussian weights
np.random.seed(0)
n_samples, n_features = 100, 100
X = np.random.randn(n_samples, n_features)  # Create Gaussian data

# Create weights with a precision lambda_ of 4.
lambda_ = 4.
w = np.zeros(n_features)

# Keep 10 weights of interest
# return 10 rand int from 0 to n_features
relevant_features = np.random.randint(0, n_features, 10)
for i in relevant_features:
    # 通过loc和scale参数可以指定随机变量的偏移和缩放参数，这里对应的是正态分布的期望和标准差
    w[i] = stats.norm.rvs(loc=0, scale=1./np.sqrt(lambda_))

# Create noise with a precision alpha of 50.0
alpha_ = 50
noise = stats.norm.rvs(loc=0, scale= 1./np.sqrt(alpha_), size=n_samples)

# Create the target
y = np.dot(X,w) + noise

# Fit the Bayesian Ridge Regression an dan OLS for comparison
clf = BayesianRidge(compute_score=True)
clf.fit(X, y)

ols = LinearRegression()
ols.fit(X, y)

# Plot true weights, estimated weights, histogram of the weights, and
# predictions with standard deviations
lw = 2
plt.figure(figsize=(6, 5))
plt.title('Weights of the model')
plt.plot(clf.coef_, color='lightgreen', linewidth=lw,
         label='Bayesian Ridge estimate')
plt.plot(w, color='gold', label='Ground truth')
plt.plot(ols.coef_, color='navy', linestyle='--', label='OLS estimate')
plt.xlabel('Features')
plt.ylabel('Values of the weights')
plt.legend(loc='best', prop=dict(size=12))

plt.figure(figsize=(6, 5))
plt.title('Histogram of the weights')
plt.hist(clf.coef_, bins=n_features, color='gold', log=True,
         edgecolor='black')
# 构造一个数组,用指定值填充其元素
plt.scatter(clf.coef_[relevant_features], np.full(len(relevant_features), 5.0),
            color='navy', label='Relevant features')
plt.ylabel('Features')
plt.xlabel('Values of the weights')
plt.legend('upper left')

plt.figure(figsize=(6, 5))
plt.title("Marginal log-likelihood")
plt.plot(clf.scores_, color='navy', linewidth=lw)
plt.ylabel("Score")
plt.xlabel("Iterations")

def f(x, noise_amount):
    y = np.sqrt(x) * np.sin(x)
    noise = np.random.normal(0, 1, len(x))
    return y + noise_amount * noise

degree = 10
X = np.linspace(0, 10, 100)
y = f(X, noise_amount=0.1)
clf_poly = BayesianRidge()
clf_poly.fit(np.vander(X, degree), y)

X_plot = np.linspace(0, 11, 25)
y_plot = f(X_plot, noise_amount=0)
# np.vander: output matrix are powers of the input vector
y_mean, y_std = clf_poly.predict(np.vander(X_plot, degree), return_std=True)
plt.figure(figsize=(6, 5))
plt.errorbar(X_plot, y_mean, y_std, color='navy',
             label="Polynomial Bayesian Ridge Regression", linewidth=lw)
plt.plot(X_plot, y_plot, color='gold', linewidth=lw,
         label="Ground Truth")
plt.ylabel("Output y")
plt.xlabel("Feature X")
plt.legend(loc="lower left")

plt.show()

# coding: utf-8
# Curve Fitting with Bayesian Ridge Regression

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import BayesianRidge

def func(x): return np.sin(2*np.pi*x)

# generate sinusoidal data with noise
size = 25
rng = np.random.RandomState(1234)
x_train = rng.uniform(0., 1., size)
y_train = func(x_train) + rng.normal(scale=0.1, size=size)
x_test = np.linspace(0., 1., 100)

n_order = 3
X_train = np.vander(x_train, n_order+1, increasing=True)
X_test = np.vander(x_test, n_order+1, increasing=True)

reg = BayesianRidge(tol=1e-6, fit_intercept=False, compute_score=True)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    if i == 0:
        # np.var: the variance of the array elements
        init = [1/np.var(y_train), 1.]
    else:
        init = [1., 1e-3]
        reg.set_params(alpha_init=init[0], lambda_init =init[1])
    reg.fit(X_train, y_train)
    ymean, ystd = reg.predict(X_test, return_std=True)

    ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
    ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
    ax.plot(x_test, ymean, color="red", label="predict mean")
    ax.fill_between(x_test, ymean - ystd, ymean + ystd,
                    color="pink", alpha=0.5, label="predict std")
    ax.set_ylim(-1.3, 1.3)
    ax.legend()
    title = "$\\alpha$_init$={:.2f},\\ \\lambda$_init$={}$".format(
        init[0], init[1])
    if i == 0:
        title += " (Default)"
    ax.set_title(title, fontsize=12)
    text = "$\\alpha={:.1f}$\n$\\lambda={:.3f}$\n$L={:.1f}$".format(
        reg.alpha_, reg.lambda_, reg.scores_[-1])
    ax.text(0.05, -1.0, text, fontsize=12)

plt.tight_layout()
plt.show()

自动相关关联(Automatic Relevance Determination)

ARDRegression与Bayresian Ridge Regression相似，但是能够产生更稀疏的系数w。它去掉了高斯分布为球形的假设，使用了一个不同的先验，为一个轴平行，椭圆高斯分布。

ARD也被称为Sparse Bayesian Learning和Relevance Vector Machine。

ARD使用代码如下：

# coding: utf-8
# Automatic Relevance Determination Regression (ARD)

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.linear_model import ARDRegression, LinearRegression

# #############################################################################
# Generating simulated data with Gaussian weights

# Parameters of the example
np.random.seed(0)
n_samples, n_features = 100, 100
# Create Gaussian data
X = np.random.randn(n_samples, n_features)
# Create weights with a precision lambda_ of 4.
lambda_ = 4.
w = np.zeros(n_features)
# Only keep 10 weights of interest
relevant_features = np.random.randint(0, n_features, 10)
for i in relevant_features:
    w[i] = stats.norm.rvs(loc=0, scale=1. / np.sqrt(lambda_))
# Create noise with a precision alpha of 50.
alpha_ = 50.
noise = stats.norm.rvs(loc=0, scale=1. / np.sqrt(alpha_), size=n_samples)
# Create the target
y = np.dot(X, w) + noise

# #############################################################################
# Fit the ARD Regression
clf = ARDRegression(compute_score=True)
clf.fit(X, y)

ols = LinearRegression()
ols.fit(X, y)

# #############################################################################
# Plot the true weights, the estimated weights, the histogram of the
# weights, and predictions with standard deviations
plt.figure(figsize=(6, 5))
plt.title("Weights of the model")
plt.plot(clf.coef_, color='darkblue', linestyle='-', linewidth=2,
         label="ARD estimate")
plt.plot(ols.coef_, color='yellowgreen', linestyle=':', linewidth=2,
         label="OLS estimate")
plt.plot(w, color='orange', linestyle='-', linewidth=2, label="Ground truth")
plt.xlabel("Features")
plt.ylabel("Values of the weights")
plt.legend(loc=1)

plt.figure(figsize=(6, 5))
plt.title("Histogram of the weights")
plt.hist(clf.coef_, bins=n_features, color='navy', log=True)
plt.scatter(clf.coef_[relevant_features], np.full(len(relevant_features), 5.),
            color='gold', marker='o', label="Relevant features")
plt.ylabel("Features")
plt.xlabel("Values of the weights")
plt.legend(loc=1)

plt.figure(figsize=(6, 5))
plt.title("Marginal log-likelihood")
plt.plot(clf.scores_, color='navy', linewidth=2)
plt.ylabel("Score")
plt.xlabel("Iterations")


# Plotting some predictions for polynomial regression
def f(x, noise_amount):
    y = np.sqrt(x) * np.sin(x)
    noise = np.random.normal(0, 1, len(x))
    return y + noise_amount * noise

degree = 10
X = np.linspace(0, 10, 100)
y = f(X, noise_amount=1)
clf_poly = ARDRegression(threshold_lambda=1e5)
clf_poly.fit(np.vander(X, degree), y)

X_plot = np.linspace(0, 11, 25)
y_plot = f(X_plot, noise_amount=0)
y_mean, y_std = clf_poly.predict(np.vander(X_plot, degree), return_std=True)
plt.figure(figsize=(6, 5))
plt.errorbar(X_plot, y_mean, y_std, color='navy',
             label="Polynomial ARD", linewidth=2)
plt.plot(X_plot, y_plot, color='gold', linewidth=2,
         label="Ground Truth")
plt.ylabel("Output y")
plt.xlabel("Feature X")
plt.legend(loc="lower left")
plt.show()

Logistic回归

Logistic regression是一个用于分类而不是回归的线性模型。在文献中，Logistic regression有被称作logit regression、maximum-entropy classifiation (MaxEnt)或log-linear classifier。在此模型中，描述一次实验可能结果的概率使用logistic function建模 $\frac{1}{1+e^{-x}}$ 。

逻辑回归是以LogisticRegression对象实现的。LogisticRegression能够拟合二元、一对其它或多类的逻辑回归，并且也可以设置 $l_1$ 、 $l_2$ 或Elastic-Net的正则项。

正则在机器学习中常用且默认使用，但是在统计中不常见。正则的另一个好处是提高数值的稳定性。没有正则可将C的值设置为一个较大值。

作为一个优化问题，二元且使用 $l_2$ 正则项的逻辑回归最小化如下成本函数：
$\min\limits_{w, c}\frac{1}{2}w^Tw+C\sum_{i=1}^nlog(exp(-y_i(X_i^Tw+c))+1)$
同样的，二元且使用 $l_1$ 正则项的逻辑回归最小化如下成本函数：
$\min\limits_{w, c}||w||_1+C\sum_{i=1}^nlog(exp(-y_i(X_i^Tw+c))+1)$
Elastic-Net正则项是由上述两种组合而成，其成本函数如下：
$\min\limits_{w, c}\frac{1-\rho}{2}w^Tw+\rho||w||_1+C\sum_{i=1}^nlog(exp(-y_i(X_i^Tw+c))+1)$
其中， $\rho$ 控制两种正则项之间的强度，与l1_ratio参数对应。

在LogisticRegression中实现的求解器有liblinear、newton-cg、lbfgs、sag和saga。具体特征如下：
各种求解算法特点
因其鲁棒性，默认使用lbfgs。当数据集较大时，使用saga更快，同样，也可以使用带log损失函数的SGDClassifier，可能会更快，但是需要更多调整。

使用如下例：

# coding: utf-8
# L1 Penalty and Sparsity in Logistic Regression

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)

X = StandardScaler().fit_transform(X)

y = (y > 4).astype(np.int)

l1_ratio = 0.5

fig, axes = plt.subplots(3, 3)

for i, (C, axes_row) in enumerate(zip((1, 0.1, 0.01), axes)):
    clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01, solver='saga')
    clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01, solver='saga')
    clf_en_LR = LogisticRegression(C=C, penalty='elasticnet', solver='saga', l1_ratio=l1_ratio, tol=0.01)

    clf_l1_LR.fit(X, y)
    clf_l2_LR.fit(X, y)
    clf_en_LR.fit(X, y)

    coef_l1_LR = clf_l1_LR.coef_.ravel()
    coef_l2_LR = clf_l2_LR.coef_.ravel()
    coef_en_LR = clf_en_LR.coef_.ravel()

    sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
    sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100
    sparsity_en_LR = np.mean(coef_en_LR == 0) * 100

    print("{:<40} {:.2f}%".format("Sparsity with L1 penalty:", sparsity_l1_LR))
    print("{:<40} {:.2f}%".format("Sparsity with Elastic-Net penalty:",
                                  sparsity_en_LR))
    print("{:<40} {:.2f}%".format("Sparsity with L2 penalty:", sparsity_l2_LR))
    print("{:<40} {:.2f}".format("Score with L1 penalty:",
                                 clf_l1_LR.score(X, y)))
    print("{:<40} {:.2f}".format("Score with Elastic-Net penalty:",
                                 clf_en_LR.score(X, y)))
    print("{:<40} {:.2f}".format("Score with L2 penalty:",
                                 clf_l2_LR.score(X, y)))

    if i == 0:
        axes_row[0].set_title('L1 penalty')
        axes_row[1].set_title('Elastic-Net\nl1_ratio = %s' % l1_ratio)
        axes_row[2].set_title('L2 penalty')

    for ax, coefs in zip(axes_row, [coef_l1_LR, coef_en_LR, coef_l2_LR]):
        ax.imshow(np.abs(coefs.reshape(8, 8)), interpolation='nearest',
                  cmap='binary', vmax=1, vmin=0)
        ax.set_xticks(())
        ax.set_yticks(())

    axes_row[0].set_ylabel('C = %s' % C)

plt.show()

# coding: utf-8
# Regularization path of L1- Logistic Regression

from time import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn import datasets
from sklearn.svm import l1_min_c

iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 2]
y = y[y != 2]

X /= X.max()
# sklearn.svm.l1_min_c allows to calculate the lower bound for C in order to get a non “null” (all feature weights to zero) model
cs = l1_min_c(X, y, loss='log') * np.logspace(0, 7, 16)

print('Computing regularization path ...')
start = time()
clf = linear_model.LogisticRegression(penalty='l1', solver='liblinear',
                                      tol=1e-6, max_iter=int(1e6),
                                      warm_start=True, intercept_scaling=10000.)

coefs_ = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(X, y)
    coefs_.append(clf.coef_.ravel().copy())
print('This took %0.3fs' % (time()-start))

coefs_ = np.array(coefs_)
plt.plot(np.log10(cs), coefs_, marker='o')
ymin, ymax = plt.ylim()
plt.xlabel('log(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.show()

随机梯度下降(Stochastic Gradient Descent)

随机梯度下降是一种拟合线性模型的简单并有效的方法。类SGDClassifier和SGDRegressor实现了上述算法，能够适应不同的损失函数和惩罚。

感知器(Perceptron)

Perception是另外一种用于大规模学习的分类算法。

被动攻击算法(Passive Aggressive Algorithms)

被动攻击算法是另一类大规模学习的算法。PassiveAggressiveClassifier和PassiveAggressiveRegressor。

鲁棒回归(Robustness regression)

鲁棒回归主要用于拟合存在异常点和错误的回归模型。

RANSAC：RANdom SAmple Consensus
Theil-Sen estimator
Huber Regression

多项式回归：使用基函数扩展线性模型

在机器学习中常用的模式是使用线性模型在数据的非线性函数上训练。这种方法在维持线性模型快的性能上，允许他们拟合更广的数据。

多项式回归(Polynomial Regression)通过考虑在使用这些基函数构建的高维空间线性拟合，模型有拟合更广数据的灵活性。

# coding: utf-8
# Polynomial interpolation

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline


def f(x):
    return x*np.sin(x)


x_plot = np.linspace(0, 10, 100)

x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)

X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,
         label='ground truth')
plt.scatter(x, y, color='navy', s=30, marker='o', label='training points')

for count, degree in enumerate([3, 4, 5]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw,
             label='degree %d'%degree)

plt.legend(loc='best')
plt.show()