[University of Wisconsin course notes] "Evaluating Machine Learning Methods": study notes (with PDF link)

"Evaluating Machine Learning Methods"
Original link: www.cs.wisc.edu/~dpage/cs760/

Lecture Outline

•  test sets
•  learning curves
•  validation (tuning) sets
•  stratified sampling
•  cross validation
•  internal cross validation
•  confusion matrices
•  TP, FP, TN, FN
•  ROC curves
•  confidence intervals for error
•  pairwise t-tests for comparing learning systems
•  scatter plots for comparing learning systems
•  lesion studies
•  recall/sensitivity/true positive rate (TPR)
•  precision/positive predictive value (PPV)
•  specificity and false positive rate (FPR or 1-specificity)
•  precision-recall (PR) curves

Test Sets revisited

Q: How can we get an unbiased estimate of the accuracy of a learned model?

A:

  1. when learning a model, you should pretend that you don’t
    have the test data yet.*
  2. if the test-set labels influence the learned model in any way,
    accuracy estimates will be biased

*: In some applications, it is reasonable to assume that you have access to the feature vector (i.e. x) but not the y part of each test instance


Learning Curve

Q: How does the accuracy of a learning method change as a function of the training-set size? How can we measure this?

A: given a training/test set partition (a runnable sketch follows the list):
•  for each sample size s on the learning curve
•  (optionally) repeat n times:
  •  randomly select s instances from the training set
  •  learn the model
  •  evaluate the model on the test set to determine accuracy a
•  plot (s, a), or (s, avg. accuracy) with error bars
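A minimal Python sketch of this procedure; the dataset (scikit-learn's breast-cancer data) and the decision-tree learner are placeholder choices, not part of the original notes.

# learning curve: accuracy vs. training-set size, on a fixed test set
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for s in (25, 50, 100, 200, 400):                 # sample sizes on the curve
    accs = []
    for _ in range(10):                           # (optionally) repeat n times
        idx = rng.choice(len(X_tr), size=s, replace=False)
        model = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx])
        accs.append(model.score(X_te, y_te))      # accuracy a on the test set
    print(s, np.mean(accs))                       # in practice, plot (s, avg. accuracy)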


Validation (tuning) Sets Revisited

Q: What if we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning)?

A: set aside a validation (tuning) set from the training data and use it to assess candidate models, so that the test set is touched only once, at the very end.

Limitations of using a single training/test partition

Q1: we may not have enough data to make sufficiently large training and test sets

  1. a larger test set gives us a more reliable estimate of accuracy (i.e. a lower-variance estimate)
  2. but… a larger training set will be more representative of how much data we actually have for the learning process

Q2: a single training set doesn't tell us how sensitive accuracy is to a particular training sample (one remedy: random resampling, below)

Random Resampling
  1. We can address the second issue by repeatedly, randomly partitioning the available data into training and test sets (a sketch follows the list).
  2. On simple random sampling (SRS): every unit has the same probability of being selected, and the units are drawn independently of one another, with no linkage or exclusion among them.
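A minimal sketch of random resampling: repeatedly re-partition the data and average the accuracy estimates. The dataset and classifier are placeholder choices.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
accs = []
for seed in range(10):                            # 10 random train/test partitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    accs.append(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
print(f"accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")

The spread across seeds is exactly the sensitivity to a particular training sample that a single partition cannot reveal.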

Stratified Sampling

To keep class proportions consistent across splits: when randomly selecting training or validation sets, we may want to ensure that class proportions are maintained in each selected set.

[Figure: stratified sampling of a 20-instance dataset (12 positive, 8 negative) into training, test, and validation sets]

Q: How does this work?
A: As the figure above shows, in roughly the following steps (a sketch follows the list):

  1. Group the labeled dataset by class: in this example, the 20 instances split into 12 positives and 8 negatives.
  2. Sample from each class at a fixed rate to form the training and test sets: here the rate is 50%, so the training set receives 6+ and 4-, and the test set likewise receives 6+ and 4-.
  3. Finally, sample from the training set at the same rate to obtain the validation set.
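A minimal sketch of a stratified split with scikit-learn: `stratify=y` keeps the class proportions the same in both halves; the toy data mirrors the 12+/8- example above.

from sklearn.model_selection import train_test_split

y = [1]*12 + [0]*8                       # 12 positive, 8 negative, as in the example
X = [[i] for i in range(20)]             # dummy feature vectors

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)
print(sum(y_tr), sum(y_te))              # -> 6 6 : each half keeps 6 positives (and 4 negatives)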

Cross Validation

[Figure: n-fold cross validation: the labeled data is partitioned into n folds; each fold in turn serves as the test set while the remaining folds form the training set]
[Figure: a worked example of the fold assignments]

Keypoints

  • 10-fold cross validation is common, but smaller values of
    n are often used when learning takes a lot of time
  • in leave-one-out cross validation, n = the number of instances
  • in stratified cross validation, stratified sampling is used
    when partitioning the data
  • CV makes efficient use of the available data for testing
  • note that whenever we use multiple training sets, as in
    CV and random resampling, we are evaluating a learning
    method as opposed to an individual learned model.

A line worth remembering: we are evaluating a learning method, as opposed to an individual learned model.
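A minimal sketch of stratified 10-fold CV with scikit-learn; the dataset and classifier are placeholder choices. Note that the resulting score characterizes the method, not any single learned model.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # 10-fold, stratified
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())        # average accuracy over the 10 held-out folds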

Internal Cross validation

Internal cross validation runs inside the training set only; note that ordinary cross validation, by contrast, applies n-fold CV to the entire labeled dataset.
The procedure is sketched below.

# Example: using internal cross validation to select k in k-NN
# (the lecture's pseudocode made runnable with scikit-learn; X and y are
#  assumed to be numpy arrays holding the training set)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, candidate_ks=(1, 3, 5, 7, 9), n_folds=5):
    # partition the training set into n folds s_1 ... s_n
    folds = list(KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X))
    mean_acc = {}
    for k in candidate_ks:                 # for each value of k considered
        accs = []
        for train_idx, val_idx in folds:
            # learn a k-NN model using all folds but s_i, evaluate accuracy on s_i
            model = KNeighborsClassifier(n_neighbors=k).fit(X[train_idx], y[train_idx])
            accs.append(model.score(X[val_idx], y[val_idx]))
        mean_acc[k] = np.mean(accs)
    # select the k that resulted in the best accuracy over s_1 ... s_n
    return max(mean_acc, key=mean_acc.get)

# then learn the final model using the entire training set and the selected k


NOTE!!
the selection steps above are run independently for each training set
(i.e. if we're using 10-fold CV to measure the overall accuracy
of our k-NN approach, the selection procedure would be executed 10 times)

Confusion Matrices

How can we understand what types of mistakes a learned model makes?
A confusion matrix tabulates, per class, the counts of correct and incorrect predictions, giving a very direct picture of how a model's mistakes are distributed.


A worked computation is added after the 2-class case below.

Confusion Matrix for 2-class Problem

The simplest case, the two-class problem, is analyzed below.
[Figure: 2-class confusion matrix: predicted vs. actual class, with cells TP, FN, FP, TN]
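A minimal sketch of extracting TP/FN/FP/TN from a 2-class confusion matrix and computing accuracy; the labels and predictions below are made up for illustration.

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                                 # -> 2 1 1 4
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))   # -> 0.75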

Q: Is accuracy an adequate measure of predictive performance?

Accuracy may not be a useful measure in cases where:

  • there is a large class skew.
    Is 98% accuracy good if 97% of the instances are negative? A model that always predicts "negative" already achieves 97%, so when one class dominates, accuracy cannot meaningfully gauge performance.
  • there are differential misclassification costs: say, getting a positive wrong costs more than getting a negative wrong (e.g. in a medical domain, a false positive results in an extraneous test, but a false negative results in a failure to treat a disease; here the false negative is the costlier error).
  • we are most interested in a subset of high-confidence predictions.

Other Accuracy Metrics

From the 2-class confusion matrix above:
$TPR = \frac{TP}{TP+FN}$ (true positive rate, i.e. recall/sensitivity)
$FPR = \frac{FP}{TN+FP}$ (false positive rate, i.e. 1 - specificity)

Further Discussion

Recall (= TPR): of all the actual positives, the fraction that the model places in the positive class, i.e. how many of the positives are "recalled":
$Recall = TPR = \frac{TP}{TP+FN}$
Precision (PPV): of the instances the model places in the positive class, the fraction that are truly positive; this is the positive-class accuracy we usually care about:
$Precision = \frac{TP}{TP+FP}$
Note: although we would like both Precision and Recall to be high, in some scenarios they are in tension. If we predict positive on just a single instance and it really is positive, then Precision = 1.0 while Recall may be low (the set may contain many other positives); conversely, if we predict positive on every instance, then Recall = 1.0 while Precision will be low. In this sense the two metrics need some controlling variable, such as the decision threshold, to trade them off (a tiny worked example follows).
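A tiny worked instance of the two formulas, with made-up counts:

TP, FP, FN = 6, 2, 4
recall = TP / (TP + FN)      # 6/10 = 0.6
precision = TP / (TP + FP)   # 6/8  = 0.75
print(recall, precision)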

ROC Curve

The ROC curve (receiver operating characteristic curve) is a composite index of sensitivity and specificity for a continuous-valued score, and it exposes their relationship graphically: sweep a series of cut-off thresholds over the continuous output, compute the resulting sequence of sensitivity/specificity pairs, then plot sensitivity on the vertical axis against (1 - specificity) on the horizontal axis. The larger the area under the curve, the higher the diagnostic accuracy. The point on the curve closest to the top-left corner of the plot is the cut-off at which both sensitivity and specificity are high.
The ROC curve can thus be used to judge how good a classifier is: a strong classifier should perform well across the full range of thresholds.
[Figure: example ROC curves, with a worked TPR/FPR example]
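A minimal sketch of plotting an ROC curve and its AUC with scikit-learn; the labels and scores below are made up for illustration.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # model confidence for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC =", roc_auc_score(y_true, y_scores))

plt.plot(fpr, tpr)                                    # sensitivity vs. 1 - specificity
plt.xlabel("false positive rate (1 - specificity)")
plt.ylabel("true positive rate (sensitivity)")
plt.show()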

ROC example
  • Naive Bayes classifiers
    In machine learning, a naive Bayes classifier is a relatively simple probabilistic classifier based on Bayes' theorem; "naive" refers to the strong independence assumption placed on the model's features: correlations between features are not taken into account.
    A well-known application of naive Bayes is spam filtering, where text features are used to recognize spam; it is a common method in text classification. The classifier picks tokens (typically the words in an email) that associate with spam versus non-spam, then applies Bayes' theorem to compute the probabilities used to classify the mail (a sketch follows the link below).
    [Figure: ROC curves comparing BayesNet and NaiveBayes]
    The figure shows that BayesNet performs better than NaiveBayes here.

NaiveBayes: https://blog.csdn.net/syoya1997/article/details/78618885
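A minimal sketch of naive Bayes spam classification with scikit-learn; the tiny corpus is made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win cash now", "meeting at noon", "cheap pills now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = non-spam

X = CountVectorizer().fit_transform(emails)  # token counts as features
model = MultinomialNB().fit(X, labels)       # P(class | tokens) via Bayes' theorem
print(model.predict(X[:1]))                  # classifies the first email (spam)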

Algorithm for creating an ROC curve

In outline, the construction works as follows: sort the test instances by the model's confidence that they are positive, then sweep the decision threshold from high to low, recording a (FPR, TPR) point at each step (a runnable sketch follows).

[Figure: ROC-curve construction algorithm]
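A minimal sketch of this threshold-sweep construction; `scores` are model confidences for the positive class, `y` the true labels (1/0).

import numpy as np

def roc_points(scores, y):
    order = np.argsort(-np.asarray(scores))  # sort by confidence, descending
    y = np.asarray(y)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr, fpr, tp, fp = [0.0], [0.0], 0, 0
    for label in y:                          # lower the threshold one instance at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / P)
        fpr.append(fp / N)
    return fpr, tpr                          # plot TPR (y-axis) against FPR (x-axis)

print(roc_points([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 0]))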

QUESTION

[Figure: question: how do class proportions affect the precision of positive predictions at fixed TPR/FPR?]
As can be seen, with TPR and FPR held fixed, the proportion of positive to negative instances determines the precision of the positive predictions (PPV); the computation is worked below.
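A worked computation (the TPR/FPR values and class proportions are assumed numbers for illustration). Let $\pi$ be the fraction of positive instances; then

$PPV = \frac{TPR \cdot \pi}{TPR \cdot \pi + FPR \cdot (1-\pi)}$

With TPR = 0.9 and FPR = 0.1: at $\pi = 0.5$, $PPV = \frac{0.45}{0.45+0.05} = 0.9$; at $\pi = 0.03$, $PPV = \frac{0.027}{0.027+0.097} \approx 0.22$. The same TPR/FPR pair yields very different precision under heavy class skew.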

P-R Curve

[Figure: precision-recall curves]
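A minimal sketch of a precision-recall curve with scikit-learn, reusing the made-up labels and scores from the ROC sketch above.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()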
