Machine Learning: Model Evaluation and Selection, Categories, the Three Elements, and Cross-Validation

hellobigorange

Last modified 2024-07-10 10:58:10

First published 2019-01-15 23:59:44

Article link: https://blog.csdn.net/qq_34229228/article/details/86500311

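X: features; Y: class labels.
Feature space: the space spanned by the n features.
Feature vector: one row of the data, i.e. one sample point in the feature space.
Training set: used to fit the learner's parameters.
Test set: used after training to check how well the learner has learned.
Mean squared error (MSE): a common performance measure for regression tasks. Geometrically it corresponds to Euclidean distance, and solving a model by minimizing the MSE is called the least squares method. In linear regression, least squares looks for the line that minimizes the sum of squared vertical distances (residuals) from the sample points to the line.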
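
As a rough illustration of the MSE and least squares definitions above, the sketch below fits a line with NumPy's `np.linalg.lstsq` and then evaluates the MSE of the fit. The toy data and the helper name `mse` are assumptions made for this example, not part of the original post.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))


# Design matrix: one column for x and a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])
# lstsq minimizes ||A w - y||^2, i.e. the sum of squared residuals
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(slope, intercept)
print(mse(y, slope * x + intercept))
```

Because the toy points lie on a line, the fitted slope and intercept recover that line and the MSE is close to zero; with noisy data the MSE would stay positive.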

I. Error rate and accuracy

Error rate: number of misclassified samples / total number of samples
Accuracy: 1 - error rate

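II. Performance measures and the confusion matrix

The measures in this section mainly apply to binary classification.

In binary classification, each sample falls into one of four groups according to its true class and its predicted class: TP (true positive), FP (false positive), TN (true negative), and FN (false negative). Together these form the binary confusion matrix.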
[Figure: binary confusion matrix (TP, FP, TN, FN)]
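1. Accuracy: the proportion of all samples that are predicted correctly:
$\frac{TP+TN}{TP+FP+FN+TN}$

2. Precision (P): the proportion of predicted positives that are truly positive:
$\frac{TP}{TP+FP}$
3. Recall, also called the true positive rate (R, TPR): the proportion of actual positives that are predicted positive:
$\frac{TP}{TP+FN}$

4. False positive rate (FPR): the proportion of actual negatives that are predicted positive:
$\frac{FP}{FP+TN}$
5. F1 score: the harmonic mean of precision and recall:
$F_1 = \frac{2PR}{P+R}$
Note: precision and recall pull against each other; in general it is hard to make both high at once.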
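
The five formulas above can be checked by hand with a few lines of plain Python. This is only a sketch: the function names `confusion_counts` and `binary_metrics` are made up for the example, class 1 is assumed to be the positive class, and every denominator is assumed to be non-zero. The label vectors are the same ones used in the scikit-learn metrics script at the end of this post.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for 0/1 labels, treating 1 as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Apply the formulas above; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 0, 1, 1, 1, 1]
counts = confusion_counts(y_true, y_hat)
scores = binary_metrics(*counts)
print(counts)
print(scores)
```

On these vectors the classifier labels almost everything positive, which gives a fairly high recall but a false positive rate of 1: an example of why a single number rarely tells the whole story.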

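III. P-R curve, ROC, and AUC

1. P-R curve: the precision-recall curve. In practice, P-R curves are often plotted to compare learners.
[Figure: P-R curves of learners A, B, and C]
In the figure above, the curve of A completely encloses that of C, so A outperforms C. The curves of A and B cross, so they cannot be ranked directly; the break-even point is introduced for this case.
Break-even point (BEP): the value at which precision equals recall. A larger BEP is better, so in this figure A is better than B.

2. ROC curve: another curve commonly used to compare learners, the receiver operating characteristic curve. The horizontal axis is FPR (false positive rate) and the vertical axis is TPR (true positive rate).

[Figure: ROC curves of several learners]
In the figure above, the outermost curve completely encloses the innermost one, so the outermost learner is clearly better. When two curves cross, AUC is used to compare them.
3. AUC: the area under the ROC curve. For a finite sample the ROC curve is piecewise linear, so this is the area of the polygon between the curve and the FPR axis.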
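
To make the "area under the ROC curve" concrete, here is a minimal NumPy sketch that builds the ROC points by sweeping the threshold from high to low and then sums trapezoids. The helper names `roc_points` and `auc_trapezoid` are hypothetical, and the sketch assumes there are no tied scores and that both classes are present. The four samples are the same toy data used in the scikit-learn example later in this post.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, starting at (0, 0). Assumes no tied scores."""
    order = np.argsort(-scores)        # indices sorted by score, high to low
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)     # true positives as the threshold drops
    fps = np.cumsum(y_sorted == 0)     # false positives as the threshold drops
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under a piecewise-linear curve: sum of trapezoid areas."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


y = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.5, 0.3, 0.8])
fpr, tpr = roc_points(y, y_score)
auc_value = auc_trapezoid(fpr, tpr)
print(fpr, tpr, auc_value)
```

An AUC of 0.5 corresponds to the diagonal (random guessing) and 1.0 to a perfect ranking of positives above negatives.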

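IV. The basic machine learning workflow

Split the data into a training set and a test set;
Fit the model on the feature vectors and labels of the training set;
Evaluate the fitted model on the test set.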
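
The three steps above can be sketched end to end with NumPy alone. Everything specific here is an assumption chosen to keep the example short: the synthetic two-blob data, the 70/30 split, and the nearest-centroid classifier standing in for "the model".

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-class data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Step 1: shuffle, then split into 70% training and 30% test data
idx = rng.permutation(len(X))
n_train = int(0.7 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Step 2: "training" here means computing one centroid per class
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

# Step 3: predict the nearest centroid for each test point and score it
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
y_pred = dists.argmin(axis=1)
test_accuracy = float(np.mean(y_pred == y_test))
print(test_accuracy)
```

The key point is that the centroids are computed from the training rows only; the test rows are touched for the first time in step 3.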

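V. Categories of machine learning

Supervised learning: every input sample in the training set has a class label.

Classification (e.g. spam filtering): decision trees, KNN, Bayesian classifiers, SVM, LR (logistic regression)
Regression (e.g. house price prediction): linear regression, Ridge regression, LASSO

Unsupervised learning: the training set has no class labels.
Clustering: groups samples by the similarity of their features;
Dimensionality reduction (PCA, LDA): reduces the number of dimensions with a learning algorithm, as opposed to feature selection, which keeps a subset of the original features.

Semi-supervised learning: some samples have class labels and some do not.
Cluster assumption: mix the labeled and unlabeled samples and split them into groups by feature similarity, so that similarity is high within a group and low between groups. Each group then contains both labeled and unlabeled samples; the unlabeled samples in a group are given the majority label of the labeled samples in that group, after which every sample has a label.

Reinforcement learning: mainly addresses sequential decision problems, such as choosing the next move in Go.

Transfer learning: suited to small-dataset and personalization problems; it addresses model adaptability.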
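
The cluster assumption described under semi-supervised learning can be illustrated with a small "cluster, then label by majority vote" sketch. The synthetic data, the choice of six labeled points, the bare-bones k-means loop, and the convention of marking unlabeled samples with -1 are all assumptions for this example; the sketch also assumes every cluster contains at least one labeled sample.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two blobs; true labels are known only for checking
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)

# Only six samples carry a label; -1 marks "unlabeled"
labels = np.full(len(X), -1)
labeled_idx = [0, 1, 2, 50, 51, 52]
labels[labeled_idx] = y_true[labeled_idx]

# Minimal k-means (k=2) on labeled and unlabeled samples together
centroids = X[[0, -1]].copy()
for _ in range(10):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    cluster = dists.argmin(axis=1)
    centroids = np.array([X[cluster == k].mean(axis=0) for k in range(2)])

# Within each cluster, unlabeled samples take the majority label
propagated = labels.copy()
for k in range(2):
    members = cluster == k
    known = labels[members & (labels >= 0)]
    majority = np.bincount(known).argmax()
    propagated[members & (labels < 0)] = majority

print(np.mean(propagated == y_true))
```

This works well here only because the two blobs are well separated; when clusters do not line up with classes, majority voting propagates wrong labels.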

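VI. The three elements of machine learning

Machine learning model = data + algorithm + strategy
Learning method = model + algorithm + strategy

Model: for example $y=\theta_0+\theta_1x+\theta_2x^2$
- Decision function: outputs 0 or 1.
- Conditional probability function: outputs a probability.
- Model selection: the model should generalize well. For overfitting: add samples, increase the regularization coefficient $\lambda$, or reduce the number of features. For underfitting: decrease $\lambda$, add polynomial (feature) terms, or raise their degree. When generalization ability is equal, follow Occam's razor and choose the simpler model.
Strategy (loss function): 0-1 loss, squared loss, absolute loss, log loss
Algorithm: the method used to solve for the parameters, which is an optimization process (Newton's method, gradient descent)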
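
The three elements can be seen side by side in a short sketch: the quadratic model from above, squared loss as the strategy, and plain gradient descent as the algorithm. The noiseless toy data, the learning rate, and the iteration count are assumptions picked so that the example converges quickly.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2x + 3x^2
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Model: y_hat = theta_0 + theta_1 * x + theta_2 * x^2
Phi = np.column_stack([np.ones_like(x), x, x ** 2])


def squared_loss(theta):
    """Strategy: mean squared loss of the model on the data."""
    return float(np.mean((Phi @ theta - y) ** 2))


# Algorithm: gradient descent on the squared loss
theta = np.zeros(3)
learning_rate = 0.1
for _ in range(5000):
    grad = 2.0 * Phi.T @ (Phi @ theta - y) / len(y)
    theta -= learning_rate * grad

print(theta, squared_loss(theta))
```

Swapping any one element (a different model family, a different loss, or Newton's method instead of gradient descent) leaves the other two unchanged, which is the point of separating them.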

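VII. Cross-validation

Randomly split the dataset D into k folds (K=6 is assumed here).
In each round, use one fold as the test set and the remaining k-1 folds as the training set.
[Figure: 6-fold cross-validation split]
Finally, average the accuracies from the k rounds and use the mean as the estimate of the true accuracy of the model or hypothesis function.

Leave-one-out validation: in each round a single sample is the test set and all remaining samples form the training set. It is the special case of k-fold cross-validation with k equal to the number of samples.
Why use cross-validation?

Cross-validation estimates a model's predictive performance, in particular how a trained model behaves on new data, so it helps detect overfitting and limit it during model selection.
It also extracts as much useful information as possible from a limited amount of data.

Suggestion: when the dataset is small and other methods no longer improve performance, try K-Fold, and use it to record which hyperparameters work better.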
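
The splitting step of k-fold cross-validation, including the leave-one-out special case, can be sketched as a small generator. The function name `k_fold_indices`, the sample count of 12, and the fixed seed are assumptions for illustration; in practice a library splitter would normally be used instead.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)   # shuffle once
    folds = np.array_split(idx, k)     # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# 6-fold split of 12 samples: each round tests on 2 and trains on 10
splits = list(k_fold_indices(12, 6))
for train_idx, test_idx in splits:
    print(sorted(test_idx.tolist()), len(train_idx))

# Leave-one-out is the case k == n_samples
loo_splits = list(k_fold_indices(12, 12))
```

To finish the procedure, a model would be fitted on each `train_idx`, scored on the matching `test_idx`, and the k scores averaged.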

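Supplement:

1. Empirical risk:

The average loss of a model on the training dataset is called the empirical risk. It measures how good the model's predictions are on average.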
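$R_{emp}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_i,f(x_i))$, where $L$ is the loss function and $N$ is the number of training samples.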
2. Structural risk:

Structural risk adds to the empirical risk a regularization (penalty) term that expresses model complexity.

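$R_{srm}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_i,f(x_i))+\lambda J(f)$, where $J(f)$ measures model complexity and $\lambda \ge 0$ is its weight.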

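3. Regularization term:

Reduces model complexity.

4. Hyperparameters:

Parameters that must be specified before training, such as the number of iterations.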
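
The difference between empirical and structural risk can be shown numerically. In the sketch below the loss is assumed to be squared loss, the model a linear map `X @ theta`, and the complexity term the squared L2 norm of the parameters; the regularization weight `lam` plays the role of a hyperparameter fixed before training. All of these choices, and the tiny dataset, are assumptions for the example.

```python
import numpy as np


def empirical_risk(theta, X, y):
    """Average squared loss of the linear model X @ theta on the data."""
    return float(np.mean((X @ theta - y) ** 2))


def structural_risk(theta, X, y, lam):
    """Empirical risk plus an L2 penalty weighted by the hyperparameter lam."""
    return empirical_risk(theta, X, y) + lam * float(theta @ theta)


# Hypothetical tiny dataset
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])

theta_exact = np.array([1.0, 2.0])    # fits the data perfectly
theta_small = np.array([0.5, 1.0])    # smaller weights, worse fit

for theta in (theta_exact, theta_small):
    print(empirical_risk(theta, X, y), structural_risk(theta, X, y, lam=0.1))
```

With `lam=0.1` the exact fit still has the lower structural risk; raising `lam` shifts the balance toward the smaller parameter vector, which is how the penalty term discourages complex models.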

VIII. AUC in Python

import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt

# Build TPR and FPR from a small hand-made dataset
y = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.5, 0.3, 0.8])
# roc_curve sweeps the threshold over the scores from high to low; at
# threshold 0.8, samples with y_pred >= 0.8 are predicted 1 and the rest 0
fpr, tpr, thresholds = metrics.roc_curve(y, y_pred)
auc = metrics.roc_auc_score(y, y_pred)  # area under the ROC curve
# The ROC curve always passes through (0, 0). Current scikit-learn releases
# already prepend that point, so no manual np.insert is needed here
print('tpr', tpr)
print('fpr', fpr)
print('thresholds', thresholds)
print('auc', auc)

# Plot the ROC curve
plt.figure()
plt.scatter(fpr, tpr, c='green')
plt.plot(fpr, tpr, c='red')
plt.plot([0, 1], [0, 1], 'b--')  # diagonal: random guessing
plt.title('ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

[Figure: ROC curve produced by the script above]

Iris: logistic regression and AUC

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, label_binarize

if __name__ == '__main__':
    # Read the data (expects the iris.data file in the working directory)
    list1 = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
    data = pd.read_csv(r'iris.data', header=None,
                       names=list1)
    x = data.iloc[:, 0:4]
    y = data['class']
    y = pd.Categorical(y).codes  # encode the class names as integers 0, 1, 2

    # Reduce to two dimensions with PCA
    pca = PCA(n_components=2)
    x = pca.fit_transform(x)

    # Split into training and test sets. train_test_split returns
    # (x_train, x_test, y_train, y_test) in that order
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    pipe_lr = Pipeline([('sc', StandardScaler()),
                        ('poly', PolynomialFeatures(degree=2)),
                        ('clc', LogisticRegression())])
    pipe_lr.fit(x_train, y_train)  # train
    y_hat = pipe_lr.predict(x_test)  # predict
    # For a classifier, score() returns accuracy, not R^2
    print('Test accuracy:', pipe_lr.score(x_test, y_test))
    print('Training accuracy:', pipe_lr.score(x_train, y_train))

    # ROC curve and AUC, micro-averaged over the three one-vs-rest problems
    y_one_hot = label_binarize(y_test, classes=[0, 1, 2])  # one-hot encoding
    y_pro = pipe_lr.decision_function(x_test)
    fpr, tpr, thresholds = metrics.roc_curve(y_one_hot.ravel(), y_pro.ravel(),
                                             drop_intermediate=False)
    auc = metrics.roc_auc_score(y_one_hot.ravel(), y_pro.ravel())
    print('auc', auc)
    plt.figure()
    plt.plot(fpr, tpr)
    plt.title('ROC of Iris Logistic Regression')
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend(['auc:{}'.format(auc)], fontsize=13)
    plt.show()

    # Plot the decision regions of the logistic regression classifier
    N = 500  # 500 x 500 grid
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, N)
    x1, x2 = np.meshgrid(t1, t2)
    x_new = np.stack((x1.ravel(), x2.ravel()), axis=1)  # shape (N * N, 2)
    # Predict the class of every grid point
    y_new = pipe_lr.predict(x_new)
    y_new = y_new.reshape(N, N)
    cm_light = ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    color_dark = ListedColormap(['g', 'r', 'b'])

    plt.figure()

    plt.pcolormesh(x1, x2, y_new, cmap=cm_light)
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap=color_dark)
    plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, cmap=color_dark, marker='+')
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris Logistic Regression')
    plt.grid()
    patchs = [mpatches.Patch(color='#77E0A0', label='Iris-setosa'),
              mpatches.Patch(color='#FF8080', label='Iris-versicolor'),
              mpatches.Patch(color='#A0A0FF', label='Iris-virginica')]
    plt.legend(handles=patchs, fancybox=True, framealpha=0.8)
    plt.show()

[Figure: plot output of the script above]

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score
from sklearn.metrics import precision_recall_fscore_support, classification_report


if __name__ == "__main__":
    y_true = np.array([1, 1, 1, 1, 0, 0])
    y_hat = np.array([1, 0, 1, 1, 1, 1])
    print('Accuracy:\t', accuracy_score(y_true, y_hat))

    # The precision is the ratio 'tp / (tp + fp)' where 'tp' is the number of
    # true positives and 'fp' the number of false positives. The precision is
    # intuitively the ability of the classifier not to label as positive a sample
    # that is negative.
    # The best value is 1 and the worst value is 0.
    precision = precision_score(y_true, y_hat)
    print('Precision:\t', precision)

    # The recall is the ratio 'tp / (tp + fn)' where 'tp' is the number of
    # true positives and 'fn' the number of false negatives. The recall is
    # intuitively the ability of the classifier to find all the positive samples.
    # The best value is 1 and the worst value is 0.
    recall = recall_score(y_true, y_hat)
    print('Recall:  \t', recall)

    # F1 score, also known as balanced F-score or F-measure
    # The F1 score is the harmonic mean of the precision and recall, where an
    # F1 score reaches its best value at 1 and worst score at 0.
    # The relative contribution of precision and recall to the F1 score are
    # equal. The formula for the F1 score is:
    #     F1 = 2 * (precision * recall) / (precision + recall)
    print('f1 score: \t', f1_score(y_true, y_hat))
    print(2 * (precision * recall) / (precision + recall))  # same value by hand

    # The F-beta score is the weighted harmonic mean of precision and recall,
    # reaching its optimal value at 1 and its worst value at 0.
    # The 'beta' parameter determines the weight of recall in the combined
    # score. 'beta < 1' lends more weight to precision, while 'beta > 1'
    # favors recall ('beta -> 0' considers only precision, 'beta -> inf' only recall).
    print('F-beta:')
    for beta in np.logspace(-3, 3, num=7, base=10):
        fbeta = fbeta_score(y_true, y_hat, beta=beta)
        print('\tbeta=%9.3f\tF-beta=%.5f' % (beta, fbeta))
        # print((1 + beta**2) * precision * recall / (beta**2 * precision + recall))

    print(precision_recall_fscore_support(y_true, y_hat, beta=1))
    print(classification_report(y_true, y_hat))