线性模型
本文后续的方法用于解决目标值是特征线性组合的回归问题。在数学符号中,假设
y
^
\hat{y}
y^是预测值。
y
^
(
w
,
x
)
=
w
0
+
w
1
x
1
+
.
.
.
+
w
p
x
p
\hat{y}(w, x) = w_0+w_1x_1+...+w_px_p
y^(w,x)=w0+w1x1+...+wpxp
在后续描述中,将向量
w
=
(
w
1
,
.
.
.
,
w
p
)
w=(w_1,...,w_p)
w=(w1,...,wp)定义为coef_
,标量
w
0
w_0
w0定义为intercept_
。
普通最小二乘法
LinearRegression
拟合带参数
w
=
(
w
1
,
.
.
.
,
w
p
)
w=(w_1,...,w_p)
w=(w1,...,wp)的线性模型以最小化实际目标值和预测目标值之间的残差平方和。数学上,它求解如下形式的问题:
max
w
∣
∣
X
w
−
y
∣
∣
2
2
\max \limits_{w}||Xw-y||_2^{2}
wmax∣∣Xw−y∣∣22
LinearRegression
以数组
X
X
X和
y
y
y作为fit
方法的输入,将线性模型的系数
w
w
w保存在成员coef_
中,如下例
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# 取某一列的数据,并以列向量的形式
diabetes_X = diabetes_X[:, np.newaxis, 2]
# 划分数据集
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# 声明linear regression对象
regr = linear_model.LinearRegression()
# 训练模型
regr.fit(diabetes_X_train, diabetes_y_train)
# 拟合测试集中的数据
diabetes_y_pred = regr.predict(diabetes_X_test)
print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print('Coefficient of determination: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
普通最小二乘的系数依赖于特征的独立性。当特征相关和矩阵 X X X中的列近似线性相关时,矩阵 X X X会近似于奇异,因此,将导致最小二乘估计对目标值中的随机误差高度敏感,进一步会产生较大的方差。多重共线性的情况将产生。最小二乘的解由 X X X的奇异值分解(Singular Value Decomposition)计算得到的。
岭回归与分类
加入 L1
或 L2
正则化,让权值尽可能小,最后构造一个所有参数都比较小的模型。因为一般认为参数值小的模型比较简单,能适应不同的数据集,也在一定程度上避免了过拟合现象。
可以设想一下对于一个线性回归方程,若参数很大,那么只要数据偏移一点点,就会对结果造成很大的影响;但如果参数足够小,数据偏移得多一点也不会对结果造成什幺影响,一种流行的说法是抗扰动能力强。
回归
岭回归(Ridge Regression)通过添加系数大小的惩罚项来解决普通最小二乘中的一些问题。岭回归的系数最小化如下的目标函数:
min
w
∣
∣
X
w
−
y
∣
∣
2
2
+
α
∣
∣
w
∣
∣
2
2
\min \limits_{w}||Xw-y||_2^{2}+\alpha||w||_2^2
wmin∣∣Xw−y∣∣22+α∣∣w∣∣22
复杂性参数
α
\alpha
α控制收缩量:
α
\alpha
α的值越大,收缩量越大,系数对共线性更加鲁棒。
和其它线性模型一样,Ridge
以数组
X
X
X和
y
y
y作为fit
方法的输入,将线性模型的系数
w
w
w保存在成员coef_
中,如下例
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
coefs = []
for a in alphas:
ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
ridge.fit(X, y)
coefs.append(ridge.coef_)
ax = plt.gca()
# 解决负号不能正常显示的问题
for tick in ax.yaxis.get_major_ticks():
tick.label1.set_fontproperties('stixgeneral')
for tick in ax.xaxis.get_major_ticks():
tick.label1.set_fontproperties('stixgeneral')
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
以上代码所画的图表明,当
α
\alpha
α较大时,正则化的影响支配了平方损失函数,特征系数趋近于0,当
α
\alpha
α逐渐变小时,系数的取值会倾向于最小二乘法对应的值,在此过程中,特征系数的取值会发生极大地震荡,因此,在实际应用中,有必要调整Ridge
系数,保证两者的平衡。
分类
RidgeClassifier
首先将二元的目标值转化为{-1, 1},然后将此分类问题当作一个回归问题,优化和回归问题相同的目标函数。预测的类别对应于回归预测的符号。对于多分类问题,会将其当作多分类回归,预测的类别对应于目标值最高的输出。
使用一个最小二乘损失函数而不是传统的logistic或折页损失函数去拟合一个分类模型,看上去是有问题的。但是,在实际中,从精确度、准确率或召回率来看,所有这些模型都能够产生相似的交叉验证分数,然而,使用最小二乘损失函数的RidgeClassifer
可以使用不同计算性能的数值求解器。
RidgeClassifier
计算明显比LogisiticRegression
快,这是因为它只需要计算投影矩阵
(
X
T
X
)
−
1
X
T
(X^TX)^{-1}X^T
(XTX)−1XT一次。
这个分类器有时候也会被当做使用线性核的最小二乘支持向量机。
该分类器的使用如下例
# coding: utf-8
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt
# 命令行配置
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, ComplementNB # scikit-learn version >= 0.20.0
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s')
# https://www.cnblogs.com/wj-1314/p/8974021.html
# dest: 可以决定解析后,取值时的属性名, 尤其适于有多个等价参数. 不指定时就是选项不加-的字符串
# action: 用于控制对选项和参数的处理
# "store": 储存值到 dest 指定的属性,强制要求后面提供参数
# "store_true": 当使用该选项时,后面的 dest 将设置为 true, 不跟参数
op = OptionParser()
op.add_option('--report',
action='store_true', dest='print_report',
help='Print a detailed classification report.')
op.add_option('--chi2_select',
action='store', type='int', dest='select_chi2',
help='Select some number of features using a chi-squared test.')
op.add_option('--confusion_matrix',
action='store_true', dest='print_cm',
help='Print the confusion matrix.')
op.add_option('--top10',
action='store_true', dest='print_top10',
help='Print ten most discriminative terms per class for every classifier.')
op.add_option('--all_categories',
action='store_true',
help='Whether to use all categories or not.')
op.add_option('--use_hashing',
action='store_true',
help='Use a hashing vectorizer.')
op.add_option('--n_features',
action='store', type=int, default=2**16,
help='n_features when using the hashing vectorizer.')
op.add_option('--filtered',
action='store_true',
help='Remove newsgroup information that is easily overfit: headers, signatures, and quoting.')
def is_interactive():
return not hasattr(sys.modules['__main__'], '__file__')
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
op.error('this script takes no arguments.')
sys.exit(1)
print(__doc__)
op.print_help()
print()
# 下载数据
if opts.all_categories:
categories = None
else:
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space'
]
if opts.filtered:
remove = ('headers', 'footers', 'quotes')
else:
remove = ()
print('Loading 20 newsgroups dataset for categories:')
print(categories if categories else 'all')
data_train = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories,
shuffle=True, random_state=42,
remove=remove)
print('data loaded')
target_names = data_train.target_names
def size_mb(docs):
return sum(len(s.encode('utf-8')) for s in docs) / 1e6
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print('%d documents - %0.3fMB (training set)'% (len(data_train.data), data_train_size_mb))
print('%d documents - %0.3fMB (test set)'% (len(data_test.data), data_test_size_mb))
print('%d categories' % len(target_names))
print()
y_train, y_test = data_train.target, data_test.target
print('Extracting features from the training data using a sparse vectorizer')
t0 = time()
if opts.use_hashing:
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
n_features=opts.n_features)
X_train = vectorizer.transform(data_train.data)
else:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
duration = time()-t0
print('done in %fs at %0.3fMB/s'%(duration, data_train_size_mb/duration))
print('n_samples: %d, n_features: %d' % X_train.shape)
print()
print('Extracting features from the test data using the same vectorizer')
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time()-t0
print('done in %fs at %0.3fMB/s' % (duration, data_test_size_mb/duration))
print('n_samples: %d, n_features: %d' % X_test.shape)
print()
if opts.use_hashing:
feature_names = None
else:
feature_names = vectorizer.get_feature_names()
if opts.select_chi2:
print('Extracting %d best features by a chi-squared test' % opts.select_chi2)
t0 = time()
ch2 = SelectKBest(chi2, k=opts.select_chi2)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
if feature_names:
feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]
print('done in %fs' % (time()-t0))
print()
if feature_names:
feature_names = np.asarray(feature_names)
def trim(s):
return s if len(s) <= 80 else s[:77]+"..."
# 测试分类器
def benchmark(clf):
print('_'*80)
print('Training: ')
print(clf)
t0 = time()
clf.fit(X_train, y_train)
train_time = time()-t0
print('train time: %0.3fs'%train_time)
t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print('test time: %0.3fs' % test_time)
score = metrics.accuracy_score(y_test, pred)
print('accuracy: %0.3f' % score)
if hasattr(clf, 'coef_'):
print('dimensionalty: %d' % clf.coef_.shape[1])
print('density: %f' % density(clf.coef_))
if opts.print_top10 and feature_names is not None:
print('top 10 keywords per class: ')
for i, label in enumerate(target_names):
top10 = np.argsort(clf.coef_[i])[-10:]
print(trim('%s: %s' % (label, ' '.join(feature_names[top10]))))
print()
if opts.print_report:
print('classification report: ')
print(metrics.classification_report(y_test, pred,
target_names=target_names))
if opts.print_cm:
print('confusion matrix: ')
print(metrics.confusion_matrix(y_test, pred))
print()
clf_descr = str(clf).split('(')[0]
return clf_descr, score, train_time, test_time
results = []
for clf, name in (
(RidgeClassifier(tol=1e-2, solver='sag'), 'Ridge Classifier'),
(Perceptron(max_iter=50), 'Perceptron'),
(PassiveAggressiveClassifier(max_iter=50), 'Passive-Aggression'),
(KNeighborsClassifier(n_neighbors=10), 'kNN'),
(RandomForestClassifier(), 'Random forest')):
print('='*80)
print(name)
results.append(benchmark(clf))
for penalty in ['l2', 'l1']:
print('='*80)
print('%s penalty' % penalty.upper())
results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))
results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))
print('='*80)
print('Elastic-Net penalty')
results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty='elasticnet')))
print('='*80)
print('NearestCentroid (aka Rocchio classifier)')
results.append(benchmark(NearestCentroid()))
print('='*80)
print('Naive Bayes')
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))
print('='*80)
print('LinearSVC with L1-based feature selection')
results.append(benchmark(Pipeline([
('feature_selection', SelectFromModel(LinearSVC(penalty='l1', dual=False,
tol=1e-3))),
('classification', LinearSVC(penalty='l2'))])))
# 画图
indices = np.arange(len(results))
results = [[x[i] for x in results] for i in range(4)]
clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)
plt.figure(figsize=(12, 8))
plt.title('Score')
plt.barh(indices, score, 0.2, label='score', color='navy')
plt.barh(indices+0.3, training_time, 0.2, label='training time', color='c')
plt.barh(indices+0.6, test_time, 0.2, label='test time', color='darkorange')
plt.yticks(())
plt.legend(loc='best')
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)
for i, c in zip(indices, clf_names):
plt.text(-0.3, i, c)
plt.show()
RidgeCV
能够使用内置alpha
参数交叉验证实现岭回归。除了使用留一法交叉验证的形式外,它和GridSearchCV
使用相同的方式工作。
import numpy as np
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
reg.fit([[0, 0], [0, 0], [1, 1]], [0, 0.1, 1])
print(reg.alpha_)
Lasso (Least absolute shrinkage and selection operator)
Lasso
是一个用来估计稀疏系数的线性模型。它倾向于有较少非零系数的解,能够根据解,有效减少特征的数量,常用于压缩感知领域。在特定情况下, 它能够准确地包含非零系数集合。
数学表达式上,它包含一个添加了正则项的线性模型,最小化的目标函数为:
min
w
1
2
n
s
a
m
p
l
e
s
∣
∣
X
w
−
y
∣
∣
2
2
+
α
∣
∣
w
∣
∣
1
\min \limits_{w} \frac{1}{2n_{samples}} ||Xw-y||_2^{2}+\alpha||w||_1
wmin2nsamples1∣∣Xw−y∣∣22+α∣∣w∣∣1
该分类器使用如下例:
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
print(reg.predict([[1, 1]]))
由于 Lasso
容易使得部分权重取 0,所以可以用其做特征选择,Lasso
的名字就指出了它是一个选择算子。权重为0的特征对问题没有贡献,直接去掉权重为0的特征,模型的输出值不变。
两个使用Lasso
的例子,代码如下:
# coding: utf-8
# Lasso and Elastic Net for Sparse Signals
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
np.random.seed(42)
n_samples, n_features = 50, 100
X = np.random.randn(n_samples, n_features)
idx = np.arange(n_features)
coef = (-1) ** idx * np.exp(-idx / 10)
coef[10:] = 0
y = np.dot(X, coef)
y += 0.01 * np.random.normal(size=n_samples)
n_samples = X.shape[0]
X_train, y_train = X[: n_samples//2], y[: n_samples//2]
X_test, y_test = X[n_samples//2 :], y[n_samples//2 :]
from sklearn.linear_model import Lasso
alpha = 0.1
lasso = Lasso(alpha=alpha)
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(lasso)
print("r^2 on test data: %f" % r2_score_lasso)
from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=alpha, l1_ratio=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, y_pred_enet)
print(enet)
print('r^2 on test data: %f' % r2_score_enet)
# np.where(enet.coef_)[0] gives the non-zero value index
m, s, _ = plt.stem(np.where(enet.coef_)[0], enet.coef_[enet.coef_!=0],
markerfmt='x', label='Elastic net coefficients',
use_line_collection=True)
plt.setp([m, s], color='#2ca02c')
m, s, _ = plt.stem(np.where(lasso.coef_)[0], lasso.coef_[lasso.coef_!=0],
markerfmt='x', label='Lasso coefficients',
use_line_collection=True)
plt.setp([m, s], color='#ff7f03')
plt.stem(np.where(coef)[0], coef[coef!=0], label='True coefficients',
markerfmt='bx', use_line_collection=True)
plt.legend(loc='best')
plt.title('Lasso $R^2$: %.3f, Elastic Net $R^2$: %.3f' % (r2_score_lasso, r2_score_enet))
plt.show()
# coding: utf-8
# Compressive sensing: tomography reconstruction with L1 prior (Lasso)
import numpy as np
from scipy import sparse
from scipy import ndimage
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt
def _weights(x, dx=1, orig=0):
x = np.ravel(x) # 将多维数组降为一维
floor_x = np.floor((x-orig)/dx).astype(np.int64)
alpha = (x - orig - floor_x*dx) /dx
return np.hstack((floor_x, floor_x+1)), np.hstack((1-alpha, alpha))
def _generate_center_coordinates(l_x):
X, Y = np.mgrid[:l_x, :l_x].astype(np.float64)
center = l_x/2.
X += 0.5 - center
Y += 0.5 - center
return X, Y
def build_projection_operator(l_x, n_dir):
X, Y = _generate_center_coordinates(l_x)
angles = np.linspace(0, np.pi, n_dir, endpoint=False)
data_inds, weights, camera_inds = [], [], []
data_unravel_indices = np.arange(l_x ** 2)
data_unravel_indices = np.hstack((data_unravel_indices,
data_unravel_indices))
for i, angle in enumerate(angles):
Xrot = np.cos(angle)*X - np.sin(angle)*Y
inds, w = _weights(Xrot, dx=1, orig=X.min())
mask = np.logical_and(inds >= 0, inds < l_x)
weights += list(w[mask])
camera_inds += list(inds[mask]+i*l_x)
data_inds += list(data_unravel_indices[mask])
proj_operator = sparse.coo_matrix((weights, (camera_inds, data_inds)))
return proj_operator
def generate_synthetic_data():
rs = np.random.RandomState(0)
n_pts = 36
x, y = np.ogrid[0:l, 0:l]
mask_outer = (x - l / 2.) ** 2 + (y - l / 2.) ** 2 < (l / 2.) ** 2
mask = np.zeros((l, l))
points = l * rs.rand(2, n_pts)
mask[(points[0]).astype(np.int), (points[1]).astype(np.int)] = 1
mask = ndimage.gaussian_filter(mask, sigma=l / n_pts)
res = np.logical_and(mask > mask.mean(), mask_outer)
return np.logical_xor(res, ndimage.binary_erosion(res))
l = 128
proj_operator = build_projection_operator(l, l//7)
data = generate_synthetic_data()
proj = proj_operator * data.ravel()[:, np.newaxis]
proj += 0.15*np.random.randn(*proj.shape)
rgr_ridge = Ridge(alpha=0.2)
rgr_ridge.fit(proj_operator, proj.ravel())
rec_l2 = rgr_ridge.coef_.reshape(l, l)
rgr_lasso = Lasso(alpha=0.001)
rgr_lasso.fit(proj_operator, proj.ravel())
rec_l1 = rgr_lasso.coef_.reshape(l, l)
plt.figure(figsize=(8, 3.3))
plt.subplot(131)
plt.imshow(data, cmap=plt.cm.gray, interpolation='nearest')
plt.axis('off')
plt.title('original image')
plt.subplot(132)
plt.imshow(rec_l2, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L2 penalization')
plt.axis('off')
plt.subplot(133)
plt.imshow(rec_l1, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L1 penalization')
plt.axis('off')
plt.subplots_adjust(hspace=0.01, wspace=0.01, top=1, bottom=0, left=0, right=1)
plt.show()
设置正则化参数
alpha
系数控制了所估计系数的稀疏性程度。
使用交叉验证
scikit-learn
使用基于交叉验证的LassoCV
和LassoLarsCV
方法设置Lasso
的算法参数alpha
。LassoLarsCV
是基于Least Angle Regreesion
的算法。
对于带许多多重共线性特征的高维数据集,LassoCV
是最常用的方法。但是,LassoLarsCV
有着能够探索更相关的alpha
参数值,如果样本数远小于特征数,它比LassoCV
快。
基于信息准则的模型选择
估计器LassoLarsIC
使用Akaike
和Bayes
信息准则(AIC
和BIC
)。当使用k
折交叉验证时,由于正则路径只计算一次,而不是k+1
次,LassoLarsIC
是一种更廉价计算的方法以找到alpha
的最优值。
AIC准则是由日本统计学家Akaike与1973年提出的,全称是最小化信息量准则(Akaike Information Criterion)。它是拟合精度和参数个数的加权函数:AIC=2(模型参数的个数)-2ln(模型的极大似然函数)
。
AIC为模型选择提供了有效的规则,但也有不足之处。当样本容量很大时,在AIC准则中拟合误差提供的信息就要受到样本容量的放大,而参数个数的惩罚因子却和样本容量没关系(一直是2),因此当样本容量很大时,使用AIC准则选择的模型不收敛于真实模型,它通常比真实模型所含的未知参数个数要多。
BIC(Bayesian InformationCriterion)贝叶斯信息准则是Schwartz在1978年根据Bayes理论提出的判别准则,称为SBC准则(也称BIC),弥补了AIC的不足。SBC的定义为:BIC = ln(n)(模型中参数的个数) - 2ln(模型的极大似然函数值)
下例介绍上述三种方法:
# coding: utf-8
# Lasso model selection: Cross-Validation / AIC / BIC
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets
EPSILON = 1e-4
X, y = datasets.load_diabetes(return_X_y=True)
rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)]
X /= np.sqrt(np.sum(X**2, axis=0))
model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_
model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_
def plot_ic_criterion(model, name, color):
criterion_ = model.criterion_
plt.semilogx(model.alphas_ + EPSILON, criterion_, '--', color=color,
linewidth=3, label='%s criterion' % name)
plt.axvline(model.alpha_ + EPSILON, color=color, linewidth=3,
label='alpha: %s estimate'% name)
plt.xlabel(r'$\alpha$')
plt.ylabel('criterion')
plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'bic', 'r')
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)' % t_bic)
print('Computing regularization path using the coordinate descent lasso...')
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1
plt.figure()
ymin, ymax = 2300, 3800
plt.semilogx(model.alphas_+EPSILON, model.mse_path_, ':')
plt.plot(model.alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_+EPSILON, linestyle='--', color='k',
label='alpha: CV estimate')
plt.legend()
plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent (training time %.3fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
print('Computing regularization path using the Lars lasso')
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1
plt.figure()
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ':')
plt.plot(model.cv_alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_+EPSILON, linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (training time %.3fs)' % t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
plt.show()
参数alpha
和SVM
的正则参数的等价性等于alpha = 1 / C
或alpha = 1 / (n_samples*C)
。
多任务的Lasso
MultiTaskLasso
是一个用于估计多重回归问题稀疏系数的线性模型:y
是一个大小为(n_samples, n_tasks)
的二维数组。它的约束是对于所有回归问题所选择的特征是相同的,也称作tasks
。
下例比较了由单个简单Lasso
或MultiTaskLasso
所得到系数矩阵的非零系数的位置。
# coding: utf-8
# Joint feature selection with multi-task Lasso
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import MultiTaskLasso, Lasso
rng = np.random.RandomState(42)
n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5
coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2*np.pi, n_tasks)
for k in range(n_relevant_features):
coef[:, k] = np.sin((1. + rng.randn(1))*times + 3*rng.randn(1))
X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)
coef_lasso_ = np.array([Lasso(alpha=0.5).fit(X, y).coef_ for y in Y.T])
coef_multi_task_lasso_ = MultiTaskLasso(alpha=1.0).fit(X, Y).coef_
fig = plt.figure(figsize=(8, 5))
plt.subplot(121)
plt.spy(coef_lasso_)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'Lasso')
plt.subplot(122)
plt.spy(coef_multi_task_lasso_)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'MultiTaskLasso')
fig.suptitle('Coefficient non-zero location')
feature_to_plot = 0
plt.figure()
lw = 2
plt.plot(coef[:, feature_to_plot], color='seagreen', linewidth=lw,
label='Ground truth')
plt.plot(coef_lasso_[:, feature_to_plot], color='cornflowerblue', linewidth=lw,
label='Lasso')
plt.plot(coef_multi_task_lasso_[:, feature_to_plot], color='gold', linewidth=lw,
label='MultiTaskLasso')
plt.legend(loc='upper center')
plt.axis('tight')
plt.ylim([-1.1, 1.1])
plt.show()
数学上,它是一个带混合
l
1
l
2
l_1l_2
l1l2正则项的线性模型,目标函数为
min
w
1
2
n
s
a
m
p
l
e
s
∣
∣
X
W
−
Y
∣
∣
F
r
o
2
+
α
∣
∣
w
∣
∣
21
\min \limits_{w} \frac{1}{2n_{samples}} ||XW-Y||_{Fro}^{2}+\alpha||w||_{21}
wmin2nsamples1∣∣XW−Y∣∣Fro2+α∣∣w∣∣21
其中,
F
r
o
Fro
Fro是Frobenius范数
∣
∣
A
∣
∣
F
r
o
=
∑
i
j
α
i
j
2
||A||_{Fro}=\sqrt{\sum\limits_{ij}\alpha_{ij}^2}
∣∣A∣∣Fro=ij∑αij2,
l
1
l
2
l_1l_2
l1l2表示
∣
∣
A
∣
∣
21
=
∑
i
∑
j
α
i
j
2
||A||_{21}=\sum\limits_{i}\sqrt{\sum\limits_{j}\alpha_{ij}^2}
∣∣A∣∣21=i∑j∑αij2。
MultiTaskLasso
同样使用坐标下降算法拟合参数。
Elastic-Net
Elastic-Net
是一个使用了
l
1
l_1
l1和
l
2
l_2
l2范数作为正则项的线性回归模型。这种组合既可以学习仅有少数非零参数的稀疏模型,如Lasso
,也可以仍然保持Ridge
的正则属性。通过使用l1_ratio
参数可以控制
l
1
l_1
l1和
l
2
l_2
l2的凸组合。
当特征中有很多彼此相关联的特征时,Elastic-Net
大有用处。Lasso
有可能随机选择一个,而Ridge
有可能两个都选。
Lasso
和Ridge
的平衡允许Elastic_Net
继承Ridge
部分稳定性。它的目标为:
min
W
1
2
n
s
a
m
p
l
e
s
∣
∣
X
w
−
y
∣
∣
2
2
+
α
ρ
∣
∣
w
∣
∣
1
+
α
(
1
−
ρ
)
2
∣
∣
w
∣
∣
2
2
\min \limits_{W} \frac{1}{2n_{samples}} ||Xw-y||_{2}^{2}+\alpha\rho||w||_{1}+\frac{\alpha(1-\rho)}{2}||w||_2^2
Wmin2nsamples1∣∣Xw−y∣∣22+αρ∣∣w∣∣1+2α(1−ρ)∣∣w∣∣22
ElasticNetCV
能够通过交叉验证设置参数alpha
和l1_ratio
。
# coding: utf-8
# Lasso and elastic net (L1 and L2 penalisation) implemented using a coordinate descent
from itertools import cycle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True)
X /= X.std(axis=0)
eps = 5e-3
print('Computing regularization path using the lasso...')
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps, fit_intercept=False)
print('Computing regularization path using the positive lasso...')
alphas_positive_lasso, coefs_positive_lasso, _ = lasso_path(
X, y, eps=eps, positive=True, fit_intercept=False)
print('Computing regularization path using the elastic net...')
alphas_enet, coefs_enet, _ = enet_path(
X, y, eps=eps, l1_ratio=0.8, fit_intercept=False)
print('Computing regularization path using the positive elastic net...')
alphas_positive_enet, coefs_positive_enet, _ = enet_path(
X, y, eps=eps, l1_ratio=0.8, positive=True, fit_intercept=False)
plt.figure(1)
colors = cycle(['b', 'r', 'g', 'c', 'k'])
neg_log_alphas_lasso = -np.log10(alphas_lasso)
neg_log_alphas_enet = -np.log10(alphas_enet)
for coef_l, coef_e, c in zip(coefs_lasso, coefs_enet, colors):
l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
l2 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)
plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and Elastic-Net Paths')
plt.legend(([l1[-1], l2[-1]]), ('Lasso', 'Elastic-Net'), loc='lower left')
plt.axis('tight')
plt.figure(2)
neg_log_alphas_positive_lasso = -np.log10(alphas_positive_lasso)
for coef_l, coef_pl, c in zip(coefs_lasso, coefs_positive_lasso, colors):
l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
l2 = plt.plot(neg_log_alphas_positive_lasso, coef_pl, linestyle='--', c=c)
plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and positive Lasso')
plt.legend((l1[-1], l2[-1]), ('Lasso', 'positive Lasso'), loc='lower left')
plt.axis('tight')
plt.figure(3)
neg_log_alphas_positive_enet = -np.log10(alphas_positive_enet)
for (coef_e, coef_pe, c) in zip(coefs_enet, coefs_positive_enet, colors):
l1 = plt.plot(neg_log_alphas_enet, coef_e, c=c)
l2 = plt.plot(neg_log_alphas_positive_enet, coef_pe, linestyle='--', c=c)
plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Elastic-Net and positive Elastic-Net')
plt.legend((l1[-1], l2[-1]), ('Elastic-Net', 'positive Elastic-Net'),
loc='lower left')
plt.axis('tight')
plt.show()
Multi-task Elastic-Net
MultiTaskElasticNet
的数学形式如下
min
W
1
2
n
s
a
m
p
l
e
s
∣
∣
X
W
−
Y
∣
∣
F
r
o
2
+
α
ρ
∣
∣
w
∣
∣
21
+
α
(
1
−
ρ
)
2
∣
∣
w
∣
∣
F
r
o
2
\min \limits_{W} \frac{1}{2n_{samples}} ||XW-Y||_{Fro}^{2}+\alpha\rho||w||_{21}+\frac{\alpha(1-\rho)}{2}||w||_{Fro}^2
Wmin2nsamples1∣∣XW−Y∣∣Fro2+αρ∣∣w∣∣21+2α(1−ρ)∣∣w∣∣Fro2
MultiTaskElasticNet
和MultiTaskElasticNetCV
的用法与前述相同。
最小角度回归
最小角度回归Least Angle Regression, LARS
是用于处理高维数据的回归算法。它类似于前向分段回归。LARS
模型可以使用Lars
或者lars_path
或者lars_path_gram
等估计器。
LARS Lasso
LassoLars
是使用了LARS
算法的lasso
模型,不同于基于坐标下降的实现,它能够得到精确解。
# coding: utf-8
# Lasso path using LARS
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True)
print('Computing regularization path using LARS ...')
_, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)
xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]
plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficient')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()
正交匹配追踪算法 (OrthogonalMatchingPursuit)
orthogonalMatchingPursuit
和orthogonal_mp
使用OMP
算法来拟合线性模型的参数,但这个模型在非零系数的数量(
l
0
l_0
l0范数)上有约束。作为和Least Angle Regression
一样的前向特征选择方法,正交匹配追踪能够近似得到带固定非零元素个数的最优参数向量。
arg min
γ
∣
∣
y
−
X
γ
∣
∣
2
2
s
u
b
j
e
c
t
o
∣
∣
γ
∣
∣
0
≤
n
n
o
n
z
e
r
o
c
o
e
f
s
\argmin \limits_{\gamma}||y-X\gamma||_2^2 subjec to ||\gamma||_0 \leq n_{nonzerocoefs}
γargmin∣∣y−Xγ∣∣22subjecto∣∣γ∣∣0≤nnonzerocoefs
以下代码为例:
# coding: utf-8
# Orthogonal Matching Pursuit
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.linear_model import OrthogonalMatchingPursuitCV
from sklearn.datasets import make_sparse_coded_signal
n_components, n_features = 512, 100
n_nonzero_coefs = 17
y, X, w = make_sparse_coded_signal(n_samples=1,
n_components=n_components,
n_features=n_features,
n_nonzero_coefs=n_nonzero_coefs,
random_state=0)
idx, = w.nonzero()
y_noisy = y + 0.05 * np.random.randn(len(y))
plt.figure(figsize=(7, 7))
plt.subplot(4, 1, 1)
plt.xlim(0, 512)
plt.title('Sparse signal')
plt.stem(idx, w[idx], use_line_collection=True)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs)
omp.fit(X, y)
coef = omp.coef_
idx_r, = coef.nonzero()
plt.subplot(4, 1, 2)
plt.xlim(0, 512)
plt.title('Recovered signal from noise free measurements')
plt.stem(idx_r, w[idx_r], use_line_collection=True)
omp.fit(X, y_noisy)
coef = omp.coef_
idx_r, = coef.nonzero()
plt.subplot(4, 1, 3)
plt.xlim(0, 512)
plt.title("Recovered signal from noisy measurements")
plt.stem(idx_r, coef[idx_r], use_line_collection=True)
omp_cv = OrthogonalMatchingPursuitCV()
omp_cv.fit(X, y_noisy)
coef = omp_cv.coef_
idx_r, = coef.nonzero()
plt.subplot(4, 1, 4)
plt.xlim(0, 512)
plt.title("Recovered signal from noisy measurements with CV")
plt.stem(idx_r, coef[idx_r], use_line_collection=True)
plt.subplots_adjust(0.06, 0.04, 0.94, 0.90, 0.20, 0.38)
plt.suptitle('Sparse signal recovery with Orthogonal Matching Pursuit',
fontsize=16)
plt.show()
贝叶斯回归
贝叶斯回归技术能够将正则项参数包含在估计过程中,正则化参数没有强制设置,而按照手边的数据进行调整。这能通过在模型的超参上引入无信息的先验知识。在岭回归和分类中使用的
l
2
l_2
l2正则项等价于在参数
w
w
w和
λ
−
1
\lambda^{-1}
λ−1高斯先验下,知道一个最大的后验估计。不用手工设置lambda
值,可将其作为一个可以从数据中估计得到的随机变量。
为了得到一个全概率模型,输出
y
y
y可以假设为一个在
X
w
Xw
Xw附近的高斯分布:
p
(
y
∣
X
,
w
,
α
)
=
N
(
y
∣
X
w
,
α
)
p(y|X, w, \alpha) = N(y|Xw, \alpha)
p(y∣X,w,α)=N(y∣Xw,α)
其中,
α
\alpha
α被当做一个可以从数据中估计得到的随机变量。
贝叶斯岭回归
BayesianRidge
估计如上所述回归问题的一个概率模型。系数
w
w
w的先验由一个球形的高斯分布给出:
p
(
w
∣
γ
)
=
N
(
w
∣
0
,
λ
−
1
I
p
)
p(w|\gamma) = N(w|0,\lambda^{-1}I_p)
p(w∣γ)=N(w∣0,λ−1Ip)
α
\alpha
α和
λ
\lambda
λ的先验分布可选择为gamma
分布。所得到的模型被称为Bayes Ridge Regression
。
在贝叶斯框架下,上述找到的权重稍微不同于普通最小二乘所得到的权重。但是,贝叶斯岭回归对病态问题更加鲁棒。
# coding: utf-8
# Bayesian Ridge Regression
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import BayesianRidge, LinearRegression
# Generating simulated data with Gaussian weights
np.random.seed(0)
n_samples, n_features = 100, 100
X = np.random.randn(n_samples, n_features) # Create Gaussian data
# Create weights with a precision lambda_ of 4.
lambda_ = 4.
w = np.zeros(n_features)
# Keep 10 weights of interest
# return 10 rand int from 0 to n_features
relevant_features = np.random.randint(0, n_features, 10)
for i in relevant_features:
# 通过loc和scale参数可以指定随机变量的偏移和缩放参数,这里对应的是正态分布的期望和标准差
w[i] = stats.norm.rvs(loc=0, scale=1./np.sqrt(lambda_))
# Create noise with a precision alpha of 50.0
alpha_ = 50
noise = stats.norm.rvs(loc=0, scale= 1./np.sqrt(alpha_), size=n_samples)
# Create the target
y = np.dot(X,w) + noise
# Fit the Bayesian Ridge Regression an dan OLS for comparison
clf = BayesianRidge(compute_score=True)
clf.fit(X, y)
ols = LinearRegression()
ols.fit(X, y)
# Plot true weights, estimated weights, histogram of the weights, and
# predictions with standard deviations
lw = 2
plt.figure(figsize=(6, 5))
plt.title('Weights of the model')
plt.plot(clf.coef_, color='lightgreen', linewidth=lw,
label='Bayesian Ridge estimate')
plt.plot(w, color='gold', label='Ground truth')
plt.plot(ols.coef_, color='navy', linestyle='--', label='OLS estimate')
plt.xlabel('Features')
plt.ylabel('Values of the weights')
plt.legend(loc='best', prop=dict(size=12))
plt.figure(figsize=(6, 5))
plt.title('Histogram of the weights')
plt.hist(clf.coef_, bins=n_features, color='gold', log=True,
edgecolor='black')
# 构造一个数组,用指定值填充其元素
plt.scatter(clf.coef_[relevant_features], np.full(len(relevant_features), 5.0),
color='navy', label='Relevant features')
plt.ylabel('Features')
plt.xlabel('Values of the weights')
plt.legend('upper left')
plt.figure(figsize=(6, 5))
plt.title("Marginal log-likelihood")
plt.plot(clf.scores_, color='navy', linewidth=lw)
plt.ylabel("Score")
plt.xlabel("Iterations")
def f(x, noise_amount):
y = np.sqrt(x) * np.sin(x)
noise = np.random.normal(0, 1, len(x))
return y + noise_amount * noise
degree = 10
X = np.linspace(0, 10, 100)
y = f(X, noise_amount=0.1)
clf_poly = BayesianRidge()
clf_poly.fit(np.vander(X, degree), y)
X_plot = np.linspace(0, 11, 25)
y_plot = f(X_plot, noise_amount=0)
# np.vander: output matrix are powers of the input vector
y_mean, y_std = clf_poly.predict(np.vander(X_plot, degree), return_std=True)
plt.figure(figsize=(6, 5))
plt.errorbar(X_plot, y_mean, y_std, color='navy',
label="Polynomial Bayesian Ridge Regression", linewidth=lw)
plt.plot(X_plot, y_plot, color='gold', linewidth=lw,
label="Ground Truth")
plt.ylabel("Output y")
plt.xlabel("Feature X")
plt.legend(loc="lower left")
plt.show()
# coding: utf-8
# Curve Fitting with Bayesian Ridge Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import BayesianRidge
def func(x): return np.sin(2*np.pi*x)
# generate sinusoidal data with noise
size = 25
rng = np.random.RandomState(1234)
x_train = rng.uniform(0., 1., size)
y_train = func(x_train) + rng.normal(scale=0.1, size=size)
x_test = np.linspace(0., 1., 100)
n_order = 3
X_train = np.vander(x_train, n_order+1, increasing=True)
X_test = np.vander(x_test, n_order+1, increasing=True)
reg = BayesianRidge(tol=1e-6, fit_intercept=False, compute_score=True)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
if i == 0:
# np.var: the variance of the array elements
init = [1/np.var(y_train), 1.]
else:
init = [1., 1e-3]
reg.set_params(alpha_init=init[0], lambda_init =init[1])
reg.fit(X_train, y_train)
ymean, ystd = reg.predict(X_test, return_std=True)
ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
ax.plot(x_test, ymean, color="red", label="predict mean")
ax.fill_between(x_test, ymean - ystd, ymean + ystd,
color="pink", alpha=0.5, label="predict std")
ax.set_ylim(-1.3, 1.3)
ax.legend()
title = "$\\alpha$_init$={:.2f},\\ \\lambda$_init$={}$".format(
init[0], init[1])
if i == 0:
title += " (Default)"
ax.set_title(title, fontsize=12)
text = "$\\alpha={:.1f}$\n$\\lambda={:.3f}$\n$L={:.1f}$".format(
reg.alpha_, reg.lambda_, reg.scores_[-1])
ax.text(0.05, -1.0, text, fontsize=12)
plt.tight_layout()
plt.show()
自动相关关联(Automatic Relevance Determination)
ARDRegression
与Bayresian Ridge Regression
相似,但是能够产生更稀疏的系数w
。它去掉了高斯分布为球形的假设,使用了一个不同的先验,为一个轴平行,椭圆高斯分布。
ARD
也被称为Sparse Bayesian Learning
和Relevance Vector Machine
。
ARD
使用代码如下:
# coding: utf-8
# Automatic Relevance Determination Regression (ARD)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import ARDRegression, LinearRegression
# #############################################################################
# Generating simulated data with Gaussian weights
# Parameters of the example
np.random.seed(0)
n_samples, n_features = 100, 100
# Create Gaussian data
X = np.random.randn(n_samples, n_features)
# Create weights with a precision lambda_ of 4.
lambda_ = 4.
w = np.zeros(n_features)
# Only keep 10 weights of interest
relevant_features = np.random.randint(0, n_features, 10)
for i in relevant_features:
w[i] = stats.norm.rvs(loc=0, scale=1. / np.sqrt(lambda_))
# Create noise with a precision alpha of 50.
alpha_ = 50.
noise = stats.norm.rvs(loc=0, scale=1. / np.sqrt(alpha_), size=n_samples)
# Create the target
y = np.dot(X, w) + noise
# #############################################################################
# Fit the ARD Regression
clf = ARDRegression(compute_score=True)
clf.fit(X, y)
ols = LinearRegression()
ols.fit(X, y)
# #############################################################################
# Plot the true weights, the estimated weights, the histogram of the
# weights, and predictions with standard deviations
plt.figure(figsize=(6, 5))
plt.title("Weights of the model")
plt.plot(clf.coef_, color='darkblue', linestyle='-', linewidth=2,
label="ARD estimate")
plt.plot(ols.coef_, color='yellowgreen', linestyle=':', linewidth=2,
label="OLS estimate")
plt.plot(w, color='orange', linestyle='-', linewidth=2, label="Ground truth")
plt.xlabel("Features")
plt.ylabel("Values of the weights")
plt.legend(loc=1)
plt.figure(figsize=(6, 5))
plt.title("Histogram of the weights")
plt.hist(clf.coef_, bins=n_features, color='navy', log=True)
plt.scatter(clf.coef_[relevant_features], np.full(len(relevant_features), 5.),
color='gold', marker='o', label="Relevant features")
plt.ylabel("Features")
plt.xlabel("Values of the weights")
plt.legend(loc=1)
plt.figure(figsize=(6, 5))
plt.title("Marginal log-likelihood")
plt.plot(clf.scores_, color='navy', linewidth=2)
plt.ylabel("Score")
plt.xlabel("Iterations")
# Plotting some predictions for polynomial regression
def f(x, noise_amount):
y = np.sqrt(x) * np.sin(x)
noise = np.random.normal(0, 1, len(x))
return y + noise_amount * noise
degree = 10
X = np.linspace(0, 10, 100)
y = f(X, noise_amount=1)
clf_poly = ARDRegression(threshold_lambda=1e5)
clf_poly.fit(np.vander(X, degree), y)
X_plot = np.linspace(0, 11, 25)
y_plot = f(X_plot, noise_amount=0)
y_mean, y_std = clf_poly.predict(np.vander(X_plot, degree), return_std=True)
plt.figure(figsize=(6, 5))
plt.errorbar(X_plot, y_mean, y_std, color='navy',
label="Polynomial ARD", linewidth=2)
plt.plot(X_plot, y_plot, color='gold', linewidth=2,
label="Ground Truth")
plt.ylabel("Output y")
plt.xlabel("Feature X")
plt.legend(loc="lower left")
plt.show()
Logistic回归
Logistic regression
是一个用于分类而不是回归的线性模型。在文献中,Logistic regression
有被称作logit regression
、maximum-entropy classifiation (MaxEnt)
或log-linear classifier
。在此模型中,描述一次实验可能结果的概率使用logistic function
建模
f
(
x
)
=
1
1
+
e
−
x
f(x) = \frac{1}{1+e^{-x}}
f(x)=1+e−x1。
逻辑回归是以LogisticRegression
对象实现的。LogisticRegression
能够拟合二元、一对其它或多类的逻辑回归,并且也可以设置
l
1
l_1
l1、
l
2
l_2
l2或Elastic-Net
的正则项。
正则在机器学习中常用且默认使用,但是在统计中不常见。正则的另一个好处是提高数值的稳定性。没有正则可将C
的值设置为一个较大值。
作为一个优化问题,二元且使用
l
2
l_2
l2正则项的逻辑回归最小化如下成本函数:
min
w
,
c
1
2
w
T
w
+
C
∑
i
=
1
n
l
o
g
(
e
x
p
(
−
y
i
(
X
i
T
w
+
c
)
)
+
1
)
\min\limits_{w, c}\frac{1}{2}w^Tw+C\sum_{i=1}^nlog(exp(-y_i(X_i^Tw+c))+1)
w,cmin21wTw+Ci=1∑nlog(exp(−yi(XiTw+c))+1)
同样的,二元且使用
l
1
l_1
l1正则项的逻辑回归最小化如下成本函数:
min
w
,
c
∣
∣
w
∣
∣
1
+
C
∑
i
=
1
n
l
o
g
(
e
x
p
(
−
y
i
(
X
i
T
w
+
c
)
)
+
1
)
\min\limits_{w, c}||w||_1+C\sum_{i=1}^nlog(exp(-y_i(X_i^Tw+c))+1)
w,cmin∣∣w∣∣1+Ci=1∑nlog(exp(−yi(XiTw+c))+1)
Elastic-Net
正则项是由上述两种组合而成,其成本函数如下:
min
w
,
c
1
−
ρ
2
w
T
w
+
ρ
∣
∣
w
∣
∣
1
+
C
∑
i
=
1
n
l
o
g
(
e
x
p
(
−
y
i
(
X
i
T
w
+
c
)
)
+
1
)
\min\limits_{w, c}\frac{1-\rho}{2}w^Tw+\rho||w||_1+C\sum_{i=1}^nlog(exp(-y_i(X_i^Tw+c))+1)
w,cmin21−ρwTw+ρ∣∣w∣∣1+Ci=1∑nlog(exp(−yi(XiTw+c))+1)
其中,
ρ
\rho
ρ控制两种正则项之间的强度,与l1_ratio
参数对应。
在LogisticRegression
中实现的求解器有liblinear
、newton-cg
、lbfgs
、sag
和saga
。具体特征如下:
因其鲁棒性,默认使用lbfgs
。当数据集较大时,使用saga
更快,同样,也可以使用带log
损失函数的SGDClassifier
,可能会更快,但是需要更多调整。
使用如下例:
# coding: utf-8
# L1 Penalty and Sparsity in Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
X, y = datasets.load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = (y > 4).astype(np.int)
l1_ratio = 0.5
fig, axes = plt.subplots(3, 3)
for i, (C, axes_row) in enumerate(zip((1, 0.1, 0.01), axes)):
clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01, solver='saga')
clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01, solver='saga')
clf_en_LR = LogisticRegression(C=C, penalty='elasticnet', solver='saga', l1_ratio=l1_ratio, tol=0.01)
clf_l1_LR.fit(X, y)
clf_l2_LR.fit(X, y)
clf_en_LR.fit(X, y)
coef_l1_LR = clf_l1_LR.coef_.ravel()
coef_l2_LR = clf_l2_LR.coef_.ravel()
coef_en_LR = clf_en_LR.coef_.ravel()
sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100
sparsity_en_LR = np.mean(coef_en_LR == 0) * 100
print("{:<40} {:.2f}%".format("Sparsity with L1 penalty:", sparsity_l1_LR))
print("{:<40} {:.2f}%".format("Sparsity with Elastic-Net penalty:",
sparsity_en_LR))
print("{:<40} {:.2f}%".format("Sparsity with L2 penalty:", sparsity_l2_LR))
print("{:<40} {:.2f}".format("Score with L1 penalty:",
clf_l1_LR.score(X, y)))
print("{:<40} {:.2f}".format("Score with Elastic-Net penalty:",
clf_en_LR.score(X, y)))
print("{:<40} {:.2f}".format("Score with L2 penalty:",
clf_l2_LR.score(X, y)))
if i == 0:
axes_row[0].set_title('L1 penalty')
axes_row[1].set_title('Elastic-Net\nl1_ratio = %s' % l1_ratio)
axes_row[2].set_title('L2 penalty')
for ax, coefs in zip(axes_row, [coef_l1_LR, coef_en_LR, coef_l2_LR]):
ax.imshow(np.abs(coefs.reshape(8, 8)), interpolation='nearest',
cmap='binary', vmax=1, vmin=0)
ax.set_xticks(())
ax.set_yticks(())
axes_row[0].set_ylabel('C = %s' % C)
plt.show()
# coding: utf-8
# Regularization path of L1- Logistic Regression
from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import datasets
from sklearn.svm import l1_min_c
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 2]
y = y[y != 2]
X /= X.max()
# sklearn.svm.l1_min_c allows to calculate the lower bound for C in order to get a non “null” (all feature weights to zero) model
cs = l1_min_c(X, y, loss='log') * np.logspace(0, 7, 16)
print('Computing regularization path ...')
start = time()
clf = linear_model.LogisticRegression(penalty='l1', solver='liblinear',
tol=1e-6, max_iter=int(1e6),
warm_start=True, intercept_scaling=10000.)
coefs_ = []
for c in cs:
clf.set_params(C=c)
clf.fit(X, y)
coefs_.append(clf.coef_.ravel().copy())
print('This took %0.3fs' % (time()-start))
coefs_ = np.array(coefs_)
plt.plot(np.log10(cs), coefs_, marker='o')
ymin, ymax = plt.ylim()
plt.xlabel('log(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.show()
随机梯度下降(Stochastic Gradient Descent)
随机梯度下降是一种拟合线性模型的简单并有效的方法。类SGDClassifier
和SGDRegressor
实现了上述算法,能够适应不同的损失函数和惩罚。
感知器(Perceptron)
Perception
是另外一种用于大规模学习的分类算法。
被动攻击算法(Passive Aggressive Algorithms)
被动攻击算法是另一类大规模学习的算法。PassiveAggressiveClassifier
和PassiveAggressiveRegressor
。
鲁棒回归(Robustness regression)
鲁棒回归主要用于拟合存在异常点和错误的回归模型。
- RANSAC:RANdom SAmple Consensus
- Theil-Sen estimator
- Huber Regression
多项式回归:使用基函数扩展线性模型
在机器学习中常用的模式是使用线性模型在数据的非线性函数上训练。这种方法在维持线性模型快的性能上,允许他们拟合更广的数据。
多项式回归(Polynomial Regression)通过考虑在使用这些基函数构建的高维空间线性拟合,模型有拟合更广数据的灵活性。
# coding: utf-8
# Polynomial interpolation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def f(x):
return x*np.sin(x)
x_plot = np.linspace(0, 10, 100)
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]
colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,
label='ground truth')
plt.scatter(x, y, color='navy', s=30, marker='o', label='training points')
for count, degree in enumerate([3, 4, 5]):
model = make_pipeline(PolynomialFeatures(degree), Ridge())
model.fit(X, y)
y_plot = model.predict(X_plot)
plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw,
label='degree %d'%degree)
plt.legend(loc='best')
plt.show()