模型评估与模型选择

最新推荐文章于 2022-05-05 07:00:00 发布

Ashlyn_z

最新推荐文章于 2022-05-05 07:00:00 发布

阅读量341

点赞数

分类专栏：统计学习方法 Python 机器学习文章标签：《统计学习方法》

本文链接：https://blog.csdn.net/qq_42231980/article/details/88374585

版权

机器学习同时被 3 个专栏收录

7 篇文章 0 订阅

订阅专栏

统计学习方法

3 篇文章 0 订阅

订阅专栏

Python

3 篇文章 0 订阅

订阅专栏

模型评估与模型选择

一、损失函数和风险函数
- 1.经验风险和结构风险
- 2.经验风险最小化与结构风险最小化
二、模型评估与模型选择
三、ROC曲线绘制
- 1.定义
- 2.绘制ROC曲线

一、损失函数和风险函数

损失函数：一次预测的好坏
风险函数：平均意义下模型预测的好坏
0-1损失函数 (0-1 loss function)
$\begin{cases} 1, &Y\neq{f(X)}\\0,&Y={f(X)} \end{cases}$
平方损失函数(quadratic loss function)
$L(Y,f(X))=(Y-f(X))^2$
绝对损失函数(absolute loss function)
$L (Y, f (X)) = ∣ Y - f (X) ∣$
对数损失函数(logarithmic loss function) 或对数似然损失函数 (loglikelihood loss function)
$L(Y,P(Y|X))=-\log{P(Y|X)}$

1.经验风险和结构风险

损失函数的期望，称为风险函数(risk function)或期望损失(expected loss)：
$R_{exp}(f) = E_p[L(Y,f(X))]=\int_{x\times{y}}L(y,f(x))P(x,y)dxdy$
学习目标是期望风险最小，但不知道联合分布 $P (Y, X)$ ，不能直接计算 $R_{exp}(f)$ ，则只有计算给定数据集 $T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$ 下的经验风险(empirical risk)或经验损失(empirical loss)：
$R_{emp}(f) =\frac{1}{N}\sum_{i=1}^NL(y_i,f(x_i))$
期望风险 $R_{exp}(f)$ 是关于联合分布的期望损失，经验风险 $R_{emp}(f)$ 是关于训练样本集的平均损失，根据大数定律，当样本容量N趋于无穷时，经验风险趋于期望风险。

2.经验风险最小化与结构风险最小化

由于现实中的训练样本数据有限，所以经验风险估计期望风险往往不理想。当样本容量很小时，经验风险最小化(ERM)会产生“过拟合”(over-fitting),而结构风险最小化的提出，很好的解决了这一现象，等价于正则化(regularization)，加入正则化项(regularizer)，或罚项(penalty term)：
$R_{srm}(f) =\frac{1}{N}\sum_{i=1}^NL(y_i,f(x_i))+\lambda{J(f)}$

$J (f)$ 为模型复杂度。
例如：贝叶斯中的最大后验概率估计就是结构风险最小化。
当样本足够大时，经验风险最小化能保证很好的学习率，比如：极大似然估计。

二、模型评估与模型选择

过拟合是指，学习模型中的参数过多，对已知数据预测的很好，对未知数据预测的很差。

1.最小二乘法

回归问题中最常用的损失函数是平方损失函数，在此情况下回归问题可以用最小二乘法(least squares)求解。使用最小二乘法拟和曲线，对于数据： $x_i, y_i)(i=1, 2, 3...,m)$
拟合出函数： $h (x)$ 有误差，即残差： $r_i=h(x_i)-y_i$ ；此时， $L_2$ 范数(残差平方和)最小时， $h (x)$ 和 $y$ 相似度最高，更拟合度更高。
一般的 $H (x)$ 为 $n$ 次的多项式， $H(x)=w_0+w_1x+w_2x^2+...w_nx^n$ ， $w(w_0,w_1,w_2,...,w_n)$ 为参数，最小二乘法就是要找到一组 $w(w_0,w_1,w_2,...,w_n)$ 使得 $\sum_{i=1}^n(h(x_i)-y_i)^2$ (残差平方和) 最小，即，求 $min\sum_{i=1}^n(h(x_i)-y_i)^2$

2.过拟合代码实现

例：用目标函数 y=sin2πx , 加上一个正太分布的噪音干扰，用多项式去拟合。

Tips:
numpy.poly1d([1,2,3]) 会生成 $1x^2+2x^1+3x^0$ 。

import numpy as np
import scipy as sp
from scipy.optimize import leastsq
import matplotlib.pyplot as plt
%matplotlib inline
# m目标函数
def real_func(x):
    return np.sin(2*np.pi*x)

# 多项式
def fit_func(p, x):
    f = np.poly1d(p)
    return f(x)

# 残差
def residuals_func(p, x, y):
    ret = fit_func(p, x) - y
    return ret

# 十个点
x = np.linspace(0, 1, 10)
x_points = np.linspace(0, 1, 1000)
# 控制产生相同的随机数的种子
sed =np.random.RandomState(21)
# 加上正态分布噪音的目标函数的值
y_ = real_func(x)
y = [sed.normal(0, 0.1) + y1 for y1 in y_]

def fitting(M=0):
    '''
    n为多项式的次数
    '''
    # 随机初始化多项式参数
    p_init = np.random.rand(M+1)
    print('p_init：',p_init)
    # 最小二乘法
    p_lsq = leastsq(residuals_func, p_init, args=(x, y))
    print('Fitting Parameters：', p_lsq)
    
    # 可视化
    plt.plot(x_points, real_func(x_points), label='real')
    plt.plot(x_points, fit_func(p_lsq[0], x_points), label='fitted curve')
    plt.plot(x, y, 'bo', label='noise')
    plt.legend()
    return p_lsq

# M=0
p_lsq_0 = fitting(M=0)

p_init：[0.46587517]
Fitting Parameters：(array([-0.00668687]), 1) 在这里插入图片描述

# M=1
p_lsq_1 = fitting(M=1)

p_init：[0.62648156 0.81328629]
Fitting Parameters: (array([-1.3305229 , 0.65857459]), 1)
在这里插入图片描述

# M=3
p_lsq_3 = fitting(M=3)

p_init：[0.12654294 0.42151414 0.75198249 0.3700417 ]
Fitting Parameters: (array([ 21.44292392, -31.88550029, 10.59452721, -0.0413459 ]), 1) 在这里插入图片描述

# M=9
p_lsq_2 = fitting(M=9)

p_init：[0.96064814 0.78296949 0.07377036 0.26075913 0.43151252 0.23580643 0.25113008 0.10649864 0.83788795 0.67427705]
Fitting Parameters: (array([-4.85959241e+04, 2.20624465e+05, -4.21250214e+05, 4.39338967e+05, -2.71868355e+05, 1.01360274e+05, -2.19859062e+04, 2.47930883e+03, -1.02611751e+02, -5.19642495e-03]), 2)

在这里插入图片描述
当M=9时，多项式曲线通过了每个数据点，但是造成了过拟合。

3.模型选择

下图描述了训练误差和测试误差与模型复杂度之间的关系。
在这里插入图片描述
为了选择最优模型，常用的方法：正则化和交叉验证。

a.正则化

正则化符合奥卡姆剃刀原理，如果结果显示过拟合，引入正则化项(regularizer)，降低过拟合
$Q(x)=\sum^{n}_{i=1}(h(x_i-y_i)^2 +\lambda||w||^2$
回归问题中，损失函数是平方损失，正则化可以是参数向量的L2范数，也可以是L1范数；

$L_1: regularization*abs(p)$
$L_2: 0.5 * regularization * np.square(p)$

regularization = 0.0001

def residuals_func_regularization(p, x, y):
    ret = fit_func(p, x) -y
#     print(ret)
    ret = np.append(ret, np.sqrt(0.5*regularization*np.square(p))) # L2f范数作为正则化项
#     print('regularization after-',ret)
    return ret
# 最小二乘法，加正则化项
p_init = np.random.rand(9+1)
p_lsq_regularization = leastsq(residuals_func_regularization, p_init, args=(x,y))
p_init,p_lsq_regularization

(array([7.65285012e-01, 8.52894992e-01, 8.13129743e-01, 4.63652998e-01, 5.00845195e-01, 5.31076396e-01, 7.44912796e-01, 2.72188092e-01, 5.81395570e-01, 6.08023769e-04]),
(array([-7.10884861e+00, -2.50953883e+00, 2.61943098e+00, 7.18836823e+00, 9.10024923e+00, 5.24227933e+00, -6.35936158e+00, -1.65316867e+01, 8.37207391e+00, -1.39990642e-02]), 1))

plt.plot(x_points, real_func(x_points), label='real')
plt.plot(x_points, fit_func(p_lsq_9[0], x_points), label='fitted curve')
plt.plot(x_points, fit_func(p_lsq_regularization[0], x_points),label='regularization')
plt.plot(x, y, 'bo', label='noise')
plt.legend()

在这里插入图片描述

三、ROC曲线绘制

1.定义

二分类评价指标：

TP - true positive
FN - false negative
FP - false positive
TN - true negative

精确率（真正例 $/$ 预测为正例的）：
$\frac{TP}{TP + FP}$
召回率（真正例 $/$ 实为正例的）：
$\frac{TP}{TP + FN}$
$F_1$ 值:
$\frac{2}{F_1} = \frac{1}{P} + \frac{1}{R}\\F_1=\frac{2TP}{2TP + FN + FP}$
真正例率 - TPR：
$\frac{TP}{TP + FN}$
假正例率 - FPR：
$\frac{FP}{FP + TN}$
ROC曲线的横轴是FPR，纵轴是TPR。

2.绘制ROC曲线

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm,datasets
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp

# 载入数据
iris = datasets.load_iris()
X = iris.data         # (150,4)
y = iris.target       # (150,)
# 二值化输出
y = label_binarize(y, classes=[0, 1, 2])     # (150, 3)
n_classes = y.shape[1]

# 添加噪声
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape  # (150,4)
X = np.c_[X, random_state.randn(n_samples, 200*n_features)]  # (150, 804)

# 切分数据集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

#model
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
# fit,score
y_score = classifier.fit(X_train,y_train).decision_function(X_test)

# 计算每个类别的ROC曲线和面积
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], thresholds = roc_curve(y_test[:, i], y_score[:, i])
   # print(fpr[i].shape, tpr[i].shape, thresholds.shape)
    roc_auc[i] = auc(fpr[i], tpr[i])
    # print(roc_auc[i])
    
#计算微平均ROC曲线和面积
fpr['micro'], tpr['micro'], _ =roc_curve(y_test.ravel(),y_score.ravel())
roc_auc['micro'] = auc(fpr['micro'], tpr['micro'])

# 取出所有的FPR
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# 
mean_tpr = np.zeros_like(all_fpr)
# print(tpr)
# print(mean_tpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])
    # print(mean_tpr)
    
mean_tpr/=n_classes

fpr['macro'] = all_fpr
tpr['macro'] = mean_tpr
roc_auc['macro'] = auc(fpr['macro'], tpr['macro'])

plt.figure()
# 线的宽度
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
        lw=lw, label='ROC curve of class 2 (area =%0.4f)'% roc_auc[2])
plt.plot(fpr[1], tpr[1],lw=lw,color='aqua',
         label='ROC curve of class 1 (area =%0.4f)'% roc_auc[1])
plt.plot(fpr[0], tpr[0],lw=lw,color='cornflowerblue',
         label='ROC curve of class 0 (area =%0.4f)'% roc_auc[0])
plt.plot([0,1], [0,1], lw=lw, linestyle='--')

plt.plot(fpr['micro'], tpr['micro'], lw=lw, color='deeppink', linestyle=':', 
         label='micro-average ROC curve(area =%0.4f)'% roc_auc['micro'])
plt.plot(fpr['macro'], tpr['macro'], lw=lw, color='navy', linestyle=':', 
         label='macro-average ROC curve(area =%0.4f)'% roc_auc['macro'])

plt.xlim([0.0, 1.0], 'k--', lw=lw)
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
# label的位置
plt.legend(loc="lower right")
plt.show()

在这里插入图片描述

Ashlyn_z

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
模型评估与模型选择

模型评估与模型选择一、损失函数和风险函数1.经验风险和结构风险2.经验风险最小化与结构风险最小化二、模型评估与模型选择1.最小二乘法2.过拟合代码实现3.模型选择a.正则化一、损失函数和风险函数损失函数：一次预测的好坏风险函数：平均意义下模型预测的好坏0-1损失函数 (0-1 loss function)L(Y,f(X))={1,Y≠f(X)0,Y=f(X)L(Y,f(X))=\be...
复制链接

扫一扫