模型评估与模型选择
一、损失函数和风险函数
- 损失函数:一次预测的好坏
- 风险函数:平均意义下模型预测的好坏
- 0-1损失函数 (0-1 loss function)
L ( Y , f ( X ) ) = { 1 , Y ≠ f ( X ) 0 , Y = f ( X ) L(Y,f(X))= \begin{cases} 1, &Y\neq{f(X)}\\0,&Y={f(X)} \end{cases} L(Y,f(X))={1,0,Y̸=f(X)Y=f(X) - 平方损失函数(quadratic loss function)
L ( Y , f ( X ) ) = ( Y − f ( X ) ) 2 L(Y,f(X))=(Y-f(X))^2 L(Y,f(X))=(Y−f(X))2 - 绝对损失函数(absolute loss function)
L ( Y , f ( X ) ) = ∣ Y − f ( X ) ∣ L(Y,f(X))=|Y-f(X)| L(Y,f(X))=∣Y−f(X)∣ - 对数损失函数(logarithmic loss function) 或 对数似然损失函数 (loglikelihood loss function)
L ( Y , P ( Y ∣ X ) ) = − log P ( Y ∣ X ) L(Y,P(Y|X))=-\log{P(Y|X)} L(Y,P(Y∣X))=−logP(Y∣X)
1.经验风险和结构风险
- 损失函数的期望,称为风险函数(risk function)或期望损失(expected loss):
R e x p ( f ) = E p [ L ( Y , f ( X ) ) ] = ∫ x × y L ( y , f ( x ) ) P ( x , y ) d x d y R_{exp}(f) = E_p[L(Y,f(X))]=\int_{x\times{y}}L(y,f(x))P(x,y)dxdy Rexp(f)=Ep[L(Y,f(X))]=∫x×yL(y,f(x))P(x,y)dxdy
学习目标是期望风险最小,但不知道联合分布 P ( Y , X ) P(Y,X) P(Y,X),不能直接计算 R e x p ( f ) R_{exp}(f) Rexp(f),则只有计算给定数据集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , . . . , ( x N , y N ) } T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\} T={(x1,y1),(x2,y2),...,(xN,yN)}下的经验风险(empirical risk)或经验损失(empirical loss):
R e m p ( f ) = 1 N ∑ i = 1 N L ( y i , f ( x i ) ) R_{emp}(f) =\frac{1}{N}\sum_{i=1}^NL(y_i,f(x_i)) Remp(f)=N1i=1∑NL(yi,f(xi))
期望风险 R e x p ( f ) R_{exp}(f) Rexp(f)是关于联合分布的期望损失,经验风险 R e m p ( f ) R_{emp}(f) Remp(f)是关于训练样本集的平均损失,根据大数定律,当样本容量N趋于无穷时,经验风险趋于期望风险。
2.经验风险最小化与结构风险最小化
由于现实中的训练样本数据有限,所以经验风险估计期望风险往往不理想。当样本容量很小时,经验风险最小化(ERM)会产生“过拟合”(over-fitting),而结构风险最小化的提出,很好的解决了这一现象,等价于正则化(regularization),加 入正则化项(regularizer),或罚项(penalty term):
R
s
r
m
(
f
)
=
1
N
∑
i
=
1
N
L
(
y
i
,
f
(
x
i
)
)
+
λ
J
(
f
)
R_{srm}(f) =\frac{1}{N}\sum_{i=1}^NL(y_i,f(x_i))+\lambda{J(f)}
Rsrm(f)=N1i=1∑NL(yi,f(xi))+λJ(f)
-
J
(
f
)
J(f)
J(f)为模型复杂度。
例如:贝叶斯中的最大后验概率估计就是结构风险最小化。
当样本足够大时,经验风险最小化能保证很好的学习率,比如:极大似然估计。
二、模型评估与模型选择
过拟合是指,学习模型中的参数过多,对已知数据预测的很好,对未知数据预测的很差。
1.最小二乘法
回归问题中最常用的损失函数是平方损失函数,在此情况下回归问题可以用最小二乘法(least squares)求解。使用最小二乘法拟和曲线,对于数据:
(
x
i
,
y
i
)
(
i
=
1
,
2
,
3...
,
m
)
(x_i, y_i)(i=1, 2, 3...,m)
(xi,yi)(i=1,2,3...,m)
拟合出函数:
h
(
x
)
h(x)
h(x)有误差,即残差:
r
i
=
h
(
x
i
)
−
y
i
r_i=h(x_i)-y_i
ri=h(xi)−yi;此时,
L
2
L_2
L2范数(残差平方和)最小时,
h
(
x
)
h(x)
h(x)和
y
y
y相似度最高,更拟合度更高。
一般的
H
(
x
)
H(x)
H(x)为
n
n
n次的多项式,
H
(
x
)
=
w
0
+
w
1
x
+
w
2
x
2
+
.
.
.
w
n
x
n
H(x)=w_0+w_1x+w_2x^2+...w_nx^n
H(x)=w0+w1x+w2x2+...wnxn,
w
(
w
0
,
w
1
,
w
2
,
.
.
.
,
w
n
)
w(w_0,w_1,w_2,...,w_n)
w(w0,w1,w2,...,wn)为参数,最小二乘法就是要找到一组
w
(
w
0
,
w
1
,
w
2
,
.
.
.
,
w
n
)
w(w_0,w_1,w_2,...,w_n)
w(w0,w1,w2,...,wn) 使得
∑
i
=
1
n
(
h
(
x
i
)
−
y
i
)
2
\sum_{i=1}^n(h(x_i)-y_i)^2
∑i=1n(h(xi)−yi)2 (残差平方和) 最小,即,求
m
i
n
∑
i
=
1
n
(
h
(
x
i
)
−
y
i
)
2
min\sum_{i=1}^n(h(x_i)-y_i)^2
min∑i=1n(h(xi)−yi)2
2.过拟合代码实现
例:用目标函数 y=sin2πx , 加上一个正太分布的噪音干扰,用多项式去拟合。
- Tips:
numpy.poly1d([1,2,3]) 会生成 1 x 2 + 2 x 1 + 3 x 0 1x^2+2x^1+3x^0 1x2+2x1+3x0。
import numpy as np
import scipy as sp
from scipy.optimize import leastsq
import matplotlib.pyplot as plt
%matplotlib inline
# m目标函数
def real_func(x):
return np.sin(2*np.pi*x)
# 多项式
def fit_func(p, x):
f = np.poly1d(p)
return f(x)
# 残差
def residuals_func(p, x, y):
ret = fit_func(p, x) - y
return ret
# 十个点
x = np.linspace(0, 1, 10)
x_points = np.linspace(0, 1, 1000)
# 控制产生相同的随机数的种子
sed =np.random.RandomState(21)
# 加上正态分布噪音的目标函数的值
y_ = real_func(x)
y = [sed.normal(0, 0.1) + y1 for y1 in y_]
def fitting(M=0):
'''
n为多项式的次数
'''
# 随机初始化多项式参数
p_init = np.random.rand(M+1)
print('p_init:',p_init)
# 最小二乘法
p_lsq = leastsq(residuals_func, p_init, args=(x, y))
print('Fitting Parameters:', p_lsq)
# 可视化
plt.plot(x_points, real_func(x_points), label='real')
plt.plot(x_points, fit_func(p_lsq[0], x_points), label='fitted curve')
plt.plot(x, y, 'bo', label='noise')
plt.legend()
return p_lsq
# M=0
p_lsq_0 = fitting(M=0)
p_init:[0.46587517]
Fitting Parameters:(array([-0.00668687]), 1)
# M=1
p_lsq_1 = fitting(M=1)
p_init:[0.62648156 0.81328629]
Fitting Parameters: (array([-1.3305229 , 0.65857459]), 1)
# M=3
p_lsq_3 = fitting(M=3)
p_init:[0.12654294 0.42151414 0.75198249 0.3700417 ]
Fitting Parameters: (array([ 21.44292392, -31.88550029, 10.59452721, -0.0413459 ]), 1)
# M=9
p_lsq_2 = fitting(M=9)
p_init:[0.96064814 0.78296949 0.07377036 0.26075913 0.43151252 0.23580643 0.25113008 0.10649864 0.83788795 0.67427705]
Fitting Parameters: (array([-4.85959241e+04, 2.20624465e+05, -4.21250214e+05, 4.39338967e+05, -2.71868355e+05, 1.01360274e+05, -2.19859062e+04, 2.47930883e+03, -1.02611751e+02, -5.19642495e-03]), 2)
当M=9时,多项式曲线通过了每个数据点,但是造成了过拟合。
3.模型选择
下图描述了训练误差和测试误差与模型复杂度之间的关系。
为了选择最优模型,常用的方法:正则化和交叉验证。
a.正则化
正则化符合奥卡姆剃刀原理,如果结果显示过拟合, 引入正则化项(regularizer),降低过拟合
Q
(
x
)
=
∑
i
=
1
n
(
h
(
x
i
−
y
i
)
2
+
λ
∣
∣
w
∣
∣
2
Q(x)=\sum^{n}_{i=1}(h(x_i-y_i)^2 +\lambda||w||^2
Q(x)=i=1∑n(h(xi−yi)2+λ∣∣w∣∣2
回归问题中,损失函数是平方损失,正则化可以是参数向量的L2范数,也可以是L1范数;
-
L 1 : r e g u l a r i z a t i o n ∗ a b s ( p ) L_1: regularization*abs(p) L1:regularization∗abs(p)
-
L 2 : 0.5 ∗ r e g u l a r i z a t i o n ∗ n p . s q u a r e ( p ) L_2: 0.5 * regularization * np.square(p) L2:0.5∗regularization∗np.square(p)
regularization = 0.0001
def residuals_func_regularization(p, x, y):
ret = fit_func(p, x) -y
# print(ret)
ret = np.append(ret, np.sqrt(0.5*regularization*np.square(p))) # L2f范数作为正则化项
# print('regularization after-',ret)
return ret
# 最小二乘法,加正则化项
p_init = np.random.rand(9+1)
p_lsq_regularization = leastsq(residuals_func_regularization, p_init, args=(x,y))
p_init,p_lsq_regularization
(array([7.65285012e-01, 8.52894992e-01, 8.13129743e-01, 4.63652998e-01, 5.00845195e-01, 5.31076396e-01, 7.44912796e-01, 2.72188092e-01, 5.81395570e-01, 6.08023769e-04]),
(array([-7.10884861e+00, -2.50953883e+00, 2.61943098e+00, 7.18836823e+00, 9.10024923e+00, 5.24227933e+00, -6.35936158e+00, -1.65316867e+01, 8.37207391e+00, -1.39990642e-02]), 1))
plt.plot(x_points, real_func(x_points), label='real')
plt.plot(x_points, fit_func(p_lsq_9[0], x_points), label='fitted curve')
plt.plot(x_points, fit_func(p_lsq_regularization[0], x_points),label='regularization')
plt.plot(x, y, 'bo', label='noise')
plt.legend()
三、ROC曲线绘制
1.定义
二分类评价指标 :
- TP - true positive
- FN - false negative
- FP - false positive
- TN - true negative
精确率(真正例
/
/
/预测为正例的):
P
=
T
P
T
P
+
F
P
P = \frac{TP}{TP + FP}
P=TP+FPTP
召回率(真正例
/
/
/实为正例的):
P
=
T
P
T
P
+
F
N
P = \frac{TP}{TP + FN}
P=TP+FNTP
F
1
F_1
F1值:
2
F
1
=
1
P
+
1
R
F
1
=
2
T
P
2
T
P
+
F
N
+
F
P
\frac{2}{F_1} = \frac{1}{P} + \frac{1}{R}\\F_1=\frac{2TP}{2TP + FN + FP}
F12=P1+R1F1=2TP+FN+FP2TP
真正例率 - TPR:
T
P
R
=
T
P
T
P
+
F
N
TPR= \frac{TP}{TP + FN}
TPR=TP+FNTP
假正例率 - FPR:
F
P
R
=
F
P
F
P
+
T
N
FPR= \frac{FP}{FP + TN}
FPR=FP+TNFP
ROC曲线的横轴是FPR,纵轴是TPR。
2.绘制ROC曲线
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm,datasets
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
# 载入数据
iris = datasets.load_iris()
X = iris.data # (150,4)
y = iris.target # (150,)
# 二值化输出
y = label_binarize(y, classes=[0, 1, 2]) # (150, 3)
n_classes = y.shape[1]
# 添加噪声
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape # (150,4)
X = np.c_[X, random_state.randn(n_samples, 200*n_features)] # (150, 804)
# 切分数据集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
#model
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
random_state=random_state))
# fit,score
y_score = classifier.fit(X_train,y_train).decision_function(X_test)
# 计算每个类别的ROC曲线和面积
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], thresholds = roc_curve(y_test[:, i], y_score[:, i])
# print(fpr[i].shape, tpr[i].shape, thresholds.shape)
roc_auc[i] = auc(fpr[i], tpr[i])
# print(roc_auc[i])
#计算微平均ROC曲线和面积
fpr['micro'], tpr['micro'], _ =roc_curve(y_test.ravel(),y_score.ravel())
roc_auc['micro'] = auc(fpr['micro'], tpr['micro'])
# 取出所有的FPR
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
#
mean_tpr = np.zeros_like(all_fpr)
# print(tpr)
# print(mean_tpr)
for i in range(n_classes):
mean_tpr += interp(all_fpr, fpr[i], tpr[i])
# print(mean_tpr)
mean_tpr/=n_classes
fpr['macro'] = all_fpr
tpr['macro'] = mean_tpr
roc_auc['macro'] = auc(fpr['macro'], tpr['macro'])
plt.figure()
# 线的宽度
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
lw=lw, label='ROC curve of class 2 (area =%0.4f)'% roc_auc[2])
plt.plot(fpr[1], tpr[1],lw=lw,color='aqua',
label='ROC curve of class 1 (area =%0.4f)'% roc_auc[1])
plt.plot(fpr[0], tpr[0],lw=lw,color='cornflowerblue',
label='ROC curve of class 0 (area =%0.4f)'% roc_auc[0])
plt.plot([0,1], [0,1], lw=lw, linestyle='--')
plt.plot(fpr['micro'], tpr['micro'], lw=lw, color='deeppink', linestyle=':',
label='micro-average ROC curve(area =%0.4f)'% roc_auc['micro'])
plt.plot(fpr['macro'], tpr['macro'], lw=lw, color='navy', linestyle=':',
label='macro-average ROC curve(area =%0.4f)'% roc_auc['macro'])
plt.xlim([0.0, 1.0], 'k--', lw=lw)
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
# label的位置
plt.legend(loc="lower right")
plt.show()