《Evaluating Machine Learning Methods》
Source: www.cs.wisc.edu/~dpage/cs760/
Lecture Outline
• test sets
• learning curves
• validation (tuning) sets
• stratified sampling
• cross validation
• internal cross validation
• confusion matrices
• TP, FP, TN, FN
• ROC curves
• confidence intervals for error
• pairwise t-tests for comparing learning systems
• scatter plots for comparing learning systems
• lesion studies
• recall/sensitivity/true positive rate (TPR)
• precision/positive predictive value (PPV)
• specificity and false positive rate (FPR or 1-specificity)
• precision-recall (PR) curves
Test Sets Revisited
Q: How can we get an unbiased estimate of the accuracy of a learned model?
A:
- when learning a model, you should pretend that you don't have the test data yet*
- if the test-set labels influence the learned model in any way, accuracy estimates will be biased

*: in some applications, it is reasonable to assume that you have access to the feature vector (i.e. x) but not the y part of each test instance
Learning Curve
Q: How does the accuracy of a learning method change as a function of the training-set size, and how do we plot it?
A: given a training/test set partition:
• for each sample size S on the learning curve
• (optionally) repeat n times
• randomly select S instances from the training set
• learn a model
• evaluate the model on the test set to determine its accuracy a
• plot (S, a) or (S, avg. accuracy with error bars)
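The loop above can be sketched in Python. This is a minimal sketch in which a trivial majority-class learner stands in for the real learning algorithm; `majority_learner` and `learning_curve` are illustrative names, not from the lecture.

```python
import random

def majority_learner(train):
    # trivial stand-in learner: always predict the most common label in train
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def model_accuracy(model, test):
    # the "model" here is just a constant label; count how often it matches
    return sum(1 for _, y in test if y == model) / len(test)

def learning_curve(train, test, sizes, n_repeats=5, seed=0):
    rng = random.Random(seed)
    curve = []
    for s in sizes:                          # each sample size S on the curve
        accs = []
        for _ in range(n_repeats):           # (optionally) repeat n times
            sample = rng.sample(train, s)    # randomly select S instances
            model = majority_learner(sample)
            accs.append(model_accuracy(model, test))
        curve.append((s, sum(accs) / len(accs)))  # (S, avg. accuracy)
    return curve
```

With a real learner, `majority_learner` would be replaced by a call that fits a model to `sample`, and `model_accuracy` by evaluation of that model on the held-out test set.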
Validation (tuning) Sets Revisited
Q: Suppose we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning). How can we get them?
A: hold out a validation (tuning) set from the training data, separate from the test set.
Limitations of using a single training/test partition
Q1: we may not have enough data to make sufficiently large training and test sets
- a larger test set gives us a more reliable estimate of accuracy (i.e. a lower-variance estimate)
- but… a larger training set will be more representative of how much data we actually have for the learning process
Q2: a single training set doesn't tell us how sensitive accuracy is to a particular training sample (a remedy: random resampling, next)
Random Resampling
- we can address the second issue by repeatedly randomly partitioning the available data into training and test sets
- this is simple random sampling (SRS): every instance has an equal probability of being selected, and instances are drawn independently, with no association or exclusion between them
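A minimal sketch of the repeated random partitioning; the 30% test fraction, the number of rounds, and the function name are illustrative choices.

```python
import random

def random_resample(data, test_fraction=0.3, n_rounds=3, seed=0):
    # repeatedly shuffle the data and cut it into a train/test partition
    rng = random.Random(seed)
    splits = []
    for _ in range(n_rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        splits.append((shuffled[:cut], shuffled[cut:]))  # (train, test)
    return splits
```

Each round yields a different train/test partition, so the variation in accuracy across rounds shows how sensitive the learner is to the particular training sample.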
Stratified Sampling
When randomly selecting training or validation sets, we may want to ensure that class proportions are maintained in each selected set.
Q: How does this work?
A: roughly the following steps (illustrated in the figure):
- group the labeled dataset by class: in this example, 20 instances are split into 12 positive and 8 negative instances
- draw the same fraction from each class to form the training and test sets: with a 50% split, the training set receives 6+ / 4−, and likewise the test set receives 6+ / 4−
- finally, draw a validation set from the training set using the same per-class proportions
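The per-class splitting can be sketched as below; with the 12+ / 8− example above and a 50% fraction, each side receives 6 positives and 4 negatives. The function name is illustrative.

```python
import random
from collections import defaultdict

def stratified_split(data, fraction=0.5, seed=0):
    # group instances by class label, then draw the same fraction from each
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in data:
        by_class[y].append((x, y))
    first, second = [], []
    for label, items in by_class.items():
        items = items[:]
        rng.shuffle(items)
        cut = int(round(len(items) * fraction))
        first.extend(items[:cut])     # e.g. training set
        second.extend(items[cut:])    # e.g. test set
    return first, second
```

Applying the same function again to the training half (with the appropriate fraction) yields a stratified validation set, mirroring the last step above.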
Cross Validation
An example partitioning is shown in the figure.
Keypoints
- 10-fold cross validation is common, but smaller values of n are often used when learning takes a lot of time
- in leave-one-out cross validation, n = # instances
- in stratified cross validation, stratified sampling is used when partitioning the data
- CV makes efficient use of the available data for testing
- note that whenever we use multiple training sets, as in CV and random resampling, we are evaluating a learning method, as opposed to an individual learned model
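A minimal sketch of the n-fold partitioning. Assigning instance i to fold i mod n is one simple choice (stratified CV would first group instances by class); the function name is illustrative.

```python
def kfold_partitions(data, n_folds):
    # assign instance i to fold i % n_folds, then yield one
    # (train, test) partition per fold
    folds = [data[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Every instance appears in exactly one test fold, which is why CV makes efficient use of the available data for testing.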
Internal Cross Validation
Internal cross validation is performed within the training set; note that (plain) cross validation, by contrast, applies k-fold CV to the entire labeled dataset.
Pseudocode for the procedure:
# Example: using internal cross validation to select k in k-NN
given a training set
partition the training set into n folds, s1 … sn
for each value of k considered:
    for i = 1 to n:
        learn a k-NN model using all folds but si
        evaluate its accuracy on si
select the k that resulted in the best accuracy over s1 … sn
learn a model using the entire training set and the selected k
NOTE: the steps above are run independently for each training set (i.e. if we're using 10-fold CV to measure the overall accuracy of our k-NN approach, then the whole procedure is executed 10 times)
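The pseudocode can be made runnable. This sketch works on 1-dimensional instances; the fold assignment, tie-breaking, and function names are illustrative choices.

```python
from collections import Counter

def knn_predict(train, x, k):
    # classify x by majority vote over the k nearest training instances
    # (1-D absolute distance)
    nearest = sorted(train, key=lambda inst: abs(inst[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def select_k_by_internal_cv(train, candidate_ks, n_folds=5):
    # partition the training set into n folds
    folds = [train[i::n_folds] for i in range(n_folds)]
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = total = 0
        for i in range(n_folds):
            held_out = folds[i]
            rest = [inst for j, f in enumerate(folds) if j != i for inst in f]
            for x, y in held_out:          # evaluate accuracy on fold si
                correct += knn_predict(rest, x, k) == y
                total += 1
        acc = correct / total
        if acc > best_acc:                 # keep the best-scoring k
            best_k, best_acc = k, acc
    return best_k
```

After `select_k_by_internal_cv` returns, the final model would be learned on the entire training set with the selected k, matching the last line of the pseudocode.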
Confusion Matrices
How can we understand what types of mistakes a learned model makes?
A confusion matrix tabulates, per class, the counts of correct and incorrect predictions, giving a very direct view of how the predictions are distributed.
Confusion Matrix for 2-class Problem
The simplest case is the two-class confusion matrix, analyzed below.
Q: Is accuracy an adequate measure of predictive performance?
A: accuracy may not be a useful measure in cases where
- there is a large class skew: is 98% accuracy good if 97% of the instances are negative? When one class vastly outnumbers the other, accuracy says little about predictive performance
- there are differential misclassification costs – say, getting a positive wrong costs more than getting a negative wrong (e.g. in a medical domain, a false positive results in an extraneous test, but a false negative results in a failure to treat a disease, so a false negative costs more than a false positive)
- we are most interested in a subset of high-confidence predictions
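The class-skew caveat can be checked numerically: with 97% negatives, a classifier that never predicts positive already scores 97% accuracy, so 98% is barely an improvement. A minimal confusion-matrix sketch (function names are illustrative):

```python
def confusion_matrix(actual, predicted, positive="+"):
    # tally (TP, FP, TN, FN) for a 2-class problem
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# 3 positives among 100 instances; a model that always says negative
actual = ["+"] * 3 + ["-"] * 97
tp, fp, tn, fn = confusion_matrix(actual, ["-"] * 100)
print(accuracy(tp, fp, tn, fn))  # 0.97, despite catching zero positives
```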
Other Accuracy Metrics
Further Discussion
Recall (= TPR, the true positive rate) is the fraction of all actual positives that end up assigned to the positive class – i.e. how many of the positives the model "recalls":
Recall=TPR=\frac{TP}{TP+FN}
Precision (= PPV, the positive predictive value) is the fraction of instances assigned to the positive class that really are positive – i.e. the accuracy of the model's positive predictions, which is usually what we care about:
Precision=\frac{TP}{TP+FP}
Note: although we would like both Precision and Recall to be high, in some settings the two are in tension. For example, if we predict positive on only a single instance and that instance really is positive, then Precision = 1.0, but Recall may be very low (the dataset may contain many more positives); conversely, if we predict positive on every instance, then Recall = 1.0, but Precision will be low. In this sense, the two values need some controlling variable (e.g. a decision threshold) to balance them.
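The trade-off described above, in code; the counts are illustrative (10 actual positives among 100 instances):

```python
def recall(tp, fn):
    # TPR: fraction of actual positives that the model finds
    return tp / (tp + fn)

def precision(tp, fp):
    # PPV: fraction of positive predictions that are correct
    return tp / (tp + fp)

# predict positive on a single, truly positive instance (10 positives exist)
print(precision(tp=1, fp=0), recall(tp=1, fn=9))     # 1.0 0.1
# predict positive on all 100 instances (90 of them negative)
print(precision(tp=10, fp=90), recall(tp=10, fn=0))  # 0.1 1.0
```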
ROC Curve
The ROC curve (receiver operating characteristic curve) summarizes the relationship between sensitivity and specificity for a classifier with a continuous output: a series of cutoff values is applied to that output, each yielding a (sensitivity, specificity) pair, and the curve plots sensitivity (TPR) on the y-axis against 1 − specificity (FPR) on the x-axis. The larger the area under the curve, the better the classifier's discrimination; the point closest to the top-left corner of the plot is the cutoff with both high sensitivity and high specificity.
An ROC curve can thus be used to judge a classifier's quality: a good classifier should perform well across the whole range of thresholds.
An example analyzing TPR/FPR follows.
ROC example
- Naive Bayes classifiers
In machine learning, a naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem; "naive" refers to the strong independence assumption made about the model's features – correlations between features are not taken into account.
A well-known application is spam filtering, a common text-classification task that identifies spam from textual features: the classifier selects tokens (typically the words in an email) associated with spam vs. non-spam, then applies Bayes' theorem to compute the probability that a given email is spam.
In the example plot, BayesNet performs better than NaiveBayes.
NaiveBayes reference: https://blog.csdn.net/syoya1997/article/details/78618885
Algorithm for creating an ROC curve
The curve is built in a few steps: sort the test instances by decreasing classifier confidence, then sweep a threshold down the sorted list, emitting one (FPR, TPR) point per distinct confidence value.
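A sketch of this standard construction; the handling of tied scores (one point per distinct score) and the function name are the usual conventions, chosen here for illustration.

```python
def roc_points(scores, labels, positive=1):
    # sort instances by decreasing classifier confidence, then sweep the
    # threshold, emitting one (FPR, TPR) point per distinct score
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos                      # assumes pos > 0 and neg > 0
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    prev = None
    for s, y in ranked:
        if prev is not None and s != prev:       # score changed: emit a point
            points.append((fp / neg, tp / pos))
        if y == positive:
            tp += 1
        else:
            fp += 1
        prev = s
    points.append((fp / neg, tp / pos))          # final point is always (1, 1)
    return points
```

For a perfect ranking (all positives scored above all negatives), the curve passes through (0, 1), the top-left corner.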
Question
Note that for fixed TPR and FPR, the proportion of positive vs. negative instances still affects the precision of the positive predictions (the probability that a predicted positive really is positive).
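This follows directly from the definitions: writing \pi for the fraction of positive instances in a dataset of N instances, we have TP = TPR \cdot \pi N and FP = FPR \cdot (1-\pi) N, so

```latex
Precision=\frac{TPR\cdot\pi}{TPR\cdot\pi+FPR\cdot(1-\pi)}
```

Hence, even with TPR and FPR held fixed, a smaller positive fraction \pi drives precision down.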