Model Metrics
Loss measures how good a model's predictions are in supervised learning.
Some metrics for classification:
Accuracy: the fraction of correct predictions
sum(y==y_hat)/y.size
Precision: among the examples predicted as a given class, the fraction that truly belong to that class
sum((y_hat==1)&(y==1))/sum(y_hat==1)
Recall: among the examples that truly belong to a given class, the fraction predicted as that class
sum((y_hat==1)&(y==1))/sum(y==1)
F1: the harmonic mean of precision and recall: 2pr/(p + r)
For binary classification:
ROC curve (receiver operating characteristic): each point on the curve reflects the sensitivity to the same signal stimulus, i.e. the trade-off at one decision threshold.
x-axis: false positive rate (FPR), the fraction of actual negatives that are predicted positive; specificity TNR = 1 - FPR
y-axis: true positive rate (TPR) = Recall, also called sensitivity (coverage of the positive class)
AUC (Area Under the Curve): the area under the ROC curve, between 0 and 1 (0.5 corresponds to random guessing); the larger, the better
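A minimal sketch of computing these metrics with sklearn; the labels y and scores y_score below are made up purely for illustration:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])                          # true labels
y_score = np.array([0.9, 0.3, 0.6, 0.8, 0.4, 0.2, 0.35, 0.7])   # predicted scores
y_hat = (y_score >= 0.5).astype(int)                             # thresholded predictions

print(accuracy_score(y, y_hat))    # sum(y == y_hat) / y.size
print(precision_score(y, y_hat))   # TP / (TP + FP)
print(recall_score(y, y_hat))      # TP / (TP + FN)
print(f1_score(y, y_hat))          # 2pr / (p + r)
print(roc_auc_score(y, y_score))   # area under the ROC curve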
Underfitting & Overfitting
- Training error: model error on the training data
- Generalization error: model error on new data
Model Complexity
The capacity of a set of functions to fit data points
- The number of learnable parameters
- The value range for those parameters
Data Complexity
- number of examples
- number of features in each example
- the separability of the classes
Generalization error
Model Validation
The test dataset is never seen by the model and may be used only once.
The validation dataset is usually a held-out subset of the data; it is not used for training and can be reused for hyperparameter tuning.
Beware of datasets that are not randomly distributed when splitting.
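A minimal sketch of carving out validation and test sets with sklearn's train_test_split; the synthetic data and the 60/20/20 proportions are assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# hold out 20% as the test set (looked at only once, at the very end)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# carve a validation set out of the remaining 80% for hyperparameter tuning
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)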
K-fold cross validation
Split the original dataset into K equal parts. Hold out one part as the test set and train on the rest, then compute the model's accuracy on the held-out part. Repeat with a different part held out each time, and report the average accuracy as the final model accuracy.
from sklearn.model_selection import KFold
# n_splits: how many folds; shuffle: whether to shuffle the data; random_state: whether to fix the random seed
KFold(n_splits=2, shuffle=False, random_state=None)
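A minimal sketch of the K-fold loop described above; the synthetic dataset and the decision-tree model are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(np.mean(scores))   # average accuracy over the 5 folds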
- Common mistakes
Bias & Variance
Bias: the gap between the learned model and the true model.
Variance: the variability among the learned model's own results (e.g. across different training samples).
The cross terms in the decomposition are all zero.
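A sketch of the decomposition this refers to, assuming squared loss and data y = f(x) + ε with E[ε] = 0 and Var(ε) = σ² (this notation is an assumption, not from the notes):

E[(y - ŷ)²] = (f - E[ŷ])² + E[(ŷ - E[ŷ])²] + σ²
            = Bias² + Variance + irreducible noise

The cross terms, such as E[(f - E[ŷ])(E[ŷ] - ŷ)], vanish because E[ŷ - E[ŷ]] = 0 and the noise ε is independent of ŷ.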
Bagging (Bootstrap AGGregatING)
Parallel: each base learner is trained on a bootstrap sample (random sampling with replacement).
Use majority voting for classification and simple averaging for regression.
Combines several unstable models into one that is relatively stable, with lower variance.
This relies on an inequality that uses the nonzero variance together with the Cauchy-Schwarz inequality,
i.e. relatively unstable learners should be used as the base learners.
The main effect is to reduce model variance.
import numpy as np
from sklearn.base import clone

class Bagging:
    def __init__(self, base_learner, n_learners):
        # make n_learners independent copies of the base learner
        self.learners = [clone(base_learner) for _ in range(n_learners)]

    def fit(self, X, y):
        for learner in self.learners:
            # bootstrap: sample len(X) indices with replacement
            examples = np.random.choice(
                np.arange(len(X)), int(len(X)), replace=True)
            learner.fit(X.iloc[examples, :], y.iloc[examples])

    def predict(self, X):
        preds = [learner.predict(X) for learner in self.learners]
        # regression: average the base learners' predictions
        return np.array(preds).mean(axis=0)
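A possible usage sketch for the class above; the synthetic dataset and the decision-tree base learner are assumptions (X and y are wrapped in pandas objects because fit indexes with .iloc):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_arr, y_arr = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X, y = pd.DataFrame(X_arr), pd.Series(y_arr)
bag = Bagging(DecisionTreeRegressor(max_depth=5), n_learners=10)
bag.fit(X, y)
print(bag.predict(X)[:5])   # averaged predictions of the 10 trees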
# Random forest: an application of bagging with decision trees as base learners
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
'''
n_estimators: the number of trees
bootstrap: True means train each tree on a bootstrap sample
warm_start: False means fit a completely new forest
class_weight: 'balanced', 'balanced_subsample'
max_samples: the number of samples used to train each base estimator
'''
model_RF = RandomForestClassifier(n_estimators=100, criterion='gini',max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None,
random_state=None, verbose=0, warm_start=False,class_weight=None, ccp_alpha=0.0, max_samples=None)
model_RF = RandomForestRegressor(criterion='mse') # criterion:'mse','mae'
Boosting
Combines several weak models with relatively large bias into a stronger model with smaller bias.
The models are learned sequentially.
The main effect is to reduce model bias.
- AdaBoost
Each sub-model tries to boost the overall performance; the sample weights are updated over repeated iterations:
if a sample has already been classified correctly, its weight is lowered when constructing the next training set;
if a sample has been misclassified, its weight is raised.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)
- Gradient Boosting
Decision trees trained by gradient descent: each new tree fits the residual left by the previous ones.
import numpy as np
from sklearn.base import clone

class GradientBoosting:
    def __init__(self, base_learner, n_learners, learning_rate):
        self.learners = [clone(base_learner) for _ in range(n_learners)]
        self.lr = learning_rate

    def fit(self, X, y):
        # the residual starts as the labels themselves
        residual = y.copy()
        for learner in self.learners:
            # fit the original features against the current residual
            learner.fit(X, residual)
            # subtract this learner's training predictions, scaled by the learning rate
            residual -= self.lr * learner.predict(X)

    def predict(self, X):
        preds = [learner.predict(X) for learner in self.learners]
        # sum the base learners' predictions, scaled by the learning rate
        return np.array(preds).sum(axis=0) * self.lr
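A possible usage sketch for this class, again on a toy regression problem; the base learner and hyperparameters are assumptions (y is a float Series, so the in-place residual update works):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_arr, y_arr = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X, y = pd.DataFrame(X_arr), pd.Series(y_arr)
gb = GradientBoosting(DecisionTreeRegressor(max_depth=3), n_learners=50, learning_rate=0.1)
gb.fit(X, y)
print(gb.predict(X)[:5])   # scaled sum of the 50 trees' predictions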
# gbdt
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
# loss: 'deviance' turns the regression-tree outputs into class probabilities via logistic regression; 'exponential' is equivalent to AdaBoost
# subsample: <1 is the fraction of the samples used; training on a subsample lowers variance but raises bias
# criterion: 'friedman_mse', 'mse', 'mae' - the measure of split quality
# n_iter_no_change: early stopping - the number of iterations with no improvement in the validation score
# Classifier
model_gb = GradientBoostingClassifier( loss='deviance', learning_rate=0.1, n_estimators=100,
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
min_impurity_decrease=0.0, init=None,
random_state=None, max_features=None, verbose=0,
max_leaf_nodes=None, warm_start=False, validation_fraction=0.1,
n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
# Regressor
model_gb = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100,
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
min_impurity_decrease=0.0, min_impurity_split=None, init=None,
random_state=None, max_features=None, verbose=0,
max_leaf_nodes=None, warm_start=False, validation_fraction=0.1,
n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
Stacking
Multi-layer stacking overfits easily,
so each level is trained on different data: every level repeatedly uses K-fold bagging to produce the dataset for training the next level, which is computationally expensive.
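A minimal single-level sketch with sklearn's StackingClassifier, which feeds cross-validated (out-of-fold) predictions of the base estimators to the final estimator; the particular estimators chosen here are assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50)),
                ('svc', SVC())],
    final_estimator=LogisticRegression(),
    cv=5)   # 5-fold out-of-fold predictions train the final estimator
stack.fit(X, y)
print(stack.score(X, y))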
Course reference: Mu Li (李沐), Stanford Fall 2021 "Practical Machine Learning" (a Chinese version is available on Bilibili)
English course homepage: https://c.d2l.ai/stanford-cs329p/