Paper Reading Notes, Series 1, Part 1: A Cost-Sensitive Learning Paper

Article link: https://blog.csdn.net/weixin_44518524/article/details/89316166

A cost-sensitive boosting method for classifying imbalanced datasets

Handling datasets with an imbalanced distribution of positive and negative samples: related paper no. 1
Paper title: Cost-sensitive boosting for classification of imbalanced data
Reference: Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358-3378.

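*imbalanced class distribution: a class distribution in which one class has far more samples than the other
*bi-class problem: a classification problem with exactly two classes
*binary classification problem: same meaning as bi-class problem
*harmonic mean: the reciprocal of the average of the reciprocals [a fairly useful measure]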
*Two workshops on this problem were held in 2000 and 2003 at the AAAI and ICML conferences, respectively. A number of research papers dedicated to this problem can be found in Ref. [19] and other publications.

Object (what)

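How to reduce the effect that a dataset with an imbalanced distribution of positive and negative samples (the CIP, class imbalance problem) has on classifier accuracy.
back up: The nature of the class imbalance problem comes down to four points: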
1>Imbalanced class distribution
2>Small sample size
3>Separability
4>Within-class concepts

Contributions (why)

Reduces the impact of imbalanced datasets on the learner's prediction accuracy.

Method (how)

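The paper's novelty: cost items are introduced into the AdaBoost learning framework in three different ways to make AdaBoost cost-sensitive. The three resulting methods are “AdaC1”, “AdaC2” and “AdaC3”.
AdaBoost increases the weights of misclassified samples and decreases the weights of correctly classified ones. The problem with this is that [it treats samples of different classes identically, so it cannot reflect the differences between the classes in the dataset].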
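As a minimal sketch of the reweighting step just described, assuming labels in {-1, +1} and the usual discrete AdaBoost formulas (the function name `adaboost_reweight` and the toy data are made up for illustration):

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the standard (discrete) AdaBoost weight update.

    weights: current sample weights, assumed to sum to 1
    y_true, y_pred: labels in {-1, +1}
    Returns (alpha, new_weights).
    """
    # Weighted training error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # exp(-alpha * y * h) is > 1 for mistakes and < 1 for correct predictions.
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, y_true, y_pred)]
    z = sum(raw)  # normalization factor so the new weights sum to 1
    return alpha, [r / z for r in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [+1, -1, -1, -1]
y_pred = [-1, -1, -1, -1]  # the lone positive is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

Note that the update only looks at whether `y` and `h` agree. A misclassified negative would receive exactly the same boost as a misclassified positive, which is the limitation the paper sets out to fix.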
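The three new algorithms proposed in the paper address this problem. To reflect how important it is to identify each sample correctly (for example, when negatives are many and positives are few, correctly identifying a positive matters more than correctly identifying a negative), the new methods keep the basic AdaBoost framework and add cost items to AdaBoost's weight update formula. Each sample has an associated cost item; the larger its value, the more important it is that the sample be predicted correctly.
There are three ways to add the cost item to the weight update formula:
1>Inside the exponent (AdaC1)
2>Outside the exponent (AdaC2)
3>Both inside and outside the exponent (AdaC3)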
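Written out, the three placements look as follows. Here D^t(i) is the weight of sample i at round t, h_t the weak classifier, α_t its weight, Z_t the normalization factor and C_i the cost item of sample i. This is my own transcription of the update rules, so check the notation against the paper; the matching formula for α_t, which the paper derives separately for each variant, is not reproduced here.

```latex
\begin{align*}
\text{AdaBoost:}\quad & D^{t+1}(i) = \frac{D^{t}(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t} \\
\text{AdaC1:}\quad    & D^{t+1}(i) = \frac{D^{t}(i)\,\exp\!\big(-\alpha_t\, C_i\, y_i\, h_t(x_i)\big)}{Z_t} \\
\text{AdaC2:}\quad    & D^{t+1}(i) = \frac{C_i\, D^{t}(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t} \\
\text{AdaC3:}\quad    & D^{t+1}(i) = \frac{C_i\, D^{t}(i)\,\exp\!\big(-\alpha_t\, C_i\, y_i\, h_t(x_i)\big)}{Z_t}
\end{align*}
```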
(The original post showed the update formulas here as a screenshot taken directly from the paper.)
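A rough sketch of how the three placements of the cost item differ in code. It assumes labels in {-1, +1} and cost items in (0, 1], and it takes `alpha` as a given input rather than recomputing it per variant as the paper does. The function name `cost_sensitive_reweight` and the toy numbers are hypothetical.

```python
import math

def cost_sensitive_reweight(weights, costs, y_true, y_pred, alpha, variant):
    """Apply one AdaC1 / AdaC2 / AdaC3 style weight update.

    costs: per-sample cost items C_i (assumed here to lie in (0, 1])
    alpha: weight of the current weak learner, taken as given in this sketch
    variant: "AdaC1" (cost inside the exponent), "AdaC2" (outside),
             or "AdaC3" (both inside and outside)
    """
    if variant not in ("AdaC1", "AdaC2", "AdaC3"):
        raise ValueError("unknown variant: " + variant)
    raw = []
    for w, c, y, h in zip(weights, costs, y_true, y_pred):
        inside = c if variant in ("AdaC1", "AdaC3") else 1.0
        outside = c if variant in ("AdaC2", "AdaC3") else 1.0
        raw.append(outside * w * math.exp(-alpha * inside * y * h))
    z = sum(raw)  # normalization factor
    return [r / z for r in raw]

weights = [0.25, 0.25, 0.25, 0.25]
costs = [1.0, 0.5, 0.5, 0.5]   # the rare positive sample gets the larger cost
y_true = [+1, -1, -1, -1]
y_pred = [-1, +1, -1, -1]      # one positive and one negative are misclassified
updated = {
    name: cost_sensitive_reweight(weights, costs, y_true, y_pred, 0.5, name)
    for name in ("AdaC1", "AdaC2", "AdaC3")
}
```

In this toy case, all three variants leave the misclassified positive (index 0) with a larger weight than the misclassified negative (index 1), whereas plain AdaBoost would weight the two mistakes equally.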
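back up: Other boosting methods that can handle imbalanced data fall into two groups. The first group consists of classification algorithms that can be used directly, such as AdaCost, CSB1, CSB2 and RareBoost. The second group builds on data synthesis algorithms combined with the boosting process, such as SMOTEBoost and DataBoost-IM. This paper focuses on the first group: all three new methods are meant to be usable directly once developed.
The paper's summary of prior approaches: Data-level approaches, Algorithm-level approaches, Cost-sensitive learning & Ensemble learning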
[1]Data-level approaches:resampling
1>randomly oversampling the small class
2>randomly undersampling the prevalent class
3>oversampling the small class(targeted)
4>undersampling the prevalent class(targeted)
back up: Ref. [33] of the paper concludes that "a balanced class distribution (class size ratio = 1:1) performs relatively well but is not necessarily optimal. Optimal class distributions differ from data to data." In other words, a dataset with a 1:1 class ratio is generally recommended, but it is not guaranteed to be the best choice.
As of 2007, there was no systematic solution for selecting quality samples.
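Items 1> and 2> of the resampling list (random oversampling and random undersampling) can be sketched as below. The helper names are made up for illustration, and the sketch targets a 1:1 ratio only because it is the simplest default, not because it is optimal.

```python
import random

def split_by_class(samples, labels, minority_label):
    """Separate (sample, label) pairs into minority and majority lists."""
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    return minority, majority

def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until the classes are 1:1."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, labels, minority_label)
    # Draw minority samples with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_undersample(samples, labels, minority_label, seed=0):
    """Drop randomly chosen majority samples until the classes are 1:1."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, labels, minority_label)
    # Keep a random subset of the majority class, without replacement.
    kept = rng.sample(majority, len(minority))
    return kept + minority

samples = list(range(10))
labels = [1, 1] + [0] * 8          # 2 positives, 8 negatives
over = random_oversample(samples, labels, minority_label=1)
under = random_undersample(samples, labels, minority_label=1)
```

Oversampling keeps all the data but repeats minority samples; undersampling throws majority samples away. Neither chooses which samples are worth keeping, which is the open question noted above.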

[2]Algorithm-level approaches
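1>A common strategy to deal with the CIP is to choose an appropriate inductive bias
eg: for decision trees, adjust the probabilistic estimate at the tree leaf.
2>Another approach (again for decision trees) is to develop new pruning techniques
For SVMs, proposals include using different penalty constants for different classes, or adjusting the class boundary based on the kernel-alignment ideal (Ref. [13] of the paper covers this in detail).
back up: Two problem settings need to be defined clearly here. The first, “recognition-based one-class learning”, means that only one class of targets has to be recognized and there is no counterpart class. The second, “two classifier learning algorithm”, means that the targets to be recognized in the dataset include both positive and negative samples.
3>Specific algorithms include: AdaBoost, Random Forest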
[3]Cost-sensitive learning(the combination of data-level and algorithm-level, which is [1]+[2])
Background: cost matrix. A cost matrix encodes the penalty of classifying samples from one class as another. Let C(i,j) denote the cost of predicting an instance from class i as class j. With this notation, C(+,-) is the cost of misclassifying a positive (rare class) instance as a negative (prevalent class) instance, and C(-,+) is the cost of the opposite case.

A cost-sensitive classification technique takes the cost matrix into consideration during model building and generates a model that has the lowest cost.
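To make the C(i,j) notation concrete, here is a small sketch that scores two sets of predictions against a cost matrix. The cost values and the toy labels are invented for illustration and do not come from the paper.

```python
def total_cost(y_true, y_pred, cost):
    """Sum C(i, j) over all samples: i is the true class, j the predicted one."""
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred))

# Hypothetical costs: missing a positive is 5x worse than a false alarm.
cost = {
    ("+", "+"): 0.0, ("+", "-"): 5.0,
    ("-", "+"): 1.0, ("-", "-"): 0.0,
}

y_true = ["+", "+", "-", "-", "-", "-"]
always_negative = ["-", "-", "-", "-", "-", "-"]  # misses both positives
cautious = ["+", "+", "+", "+", "-", "-"]         # raises two false alarms

cost_always_negative = total_cost(y_true, always_negative, cost)
cost_cautious = total_cost(y_true, cautious, cost)
```

Both predictors get 4 of 6 samples right, so plain accuracy cannot separate them, but under this cost matrix the one that misses the rare class is far more expensive.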

3 main categories of cost-sensitive classification:
<1>Weighting the data space (also called “cost-sensitive learning by example weighting”),
which is derived from the “translation theorem”.
See the original paper for details.
<2>Making a specific classifier learning algorithm cost-sensitive
<3>Using Bayes risk theory to assign each sample to its lowest risk class
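Category <3> can be sketched in a few lines. Using the C(i,j) notation from the cost matrix definition, the risk of predicting class j for a sample x is R(j|x) = Σ_i P(i|x)·C(i,j), and the sample is assigned to the class with the smallest risk. The posterior values and costs below are made up, and `lowest_risk_class` is a hypothetical helper name.

```python
def lowest_risk_class(posteriors, cost):
    """Return the class j minimizing R(j | x) = sum_i P(i | x) * C(i, j).

    posteriors: dict mapping each class i to P(i | x)
    cost: dict mapping (true class i, predicted class j) to C(i, j)
    """
    classes = list(posteriors)

    def risk(j):
        return sum(posteriors[i] * cost[(i, j)] for i in classes)

    return min(classes, key=risk)

posteriors = {"+": 0.3, "-": 0.7}  # the model thinks "-" is more likely

# Hypothetical unequal costs: a missed positive costs 5, a false alarm costs 1.
unequal_cost = {("+", "+"): 0.0, ("+", "-"): 5.0,
                ("-", "+"): 1.0, ("-", "-"): 0.0}
# Equal costs reduce the rule to picking the most probable class.
equal_cost = {("+", "+"): 0.0, ("+", "-"): 1.0,
              ("-", "+"): 1.0, ("-", "-"): 0.0}

decision_unequal = lowest_risk_class(posteriors, unequal_cost)
decision_equal = lowest_risk_class(posteriors, equal_cost)
```

With the unequal costs, predicting "+" has risk 0.7 and predicting "-" has risk 1.5, so the sample goes to "+" even though "-" is the more probable class.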

[4]Ensemble learning
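Put simply, ensemble learning combines several different base classifiers and aggregates their individual predictions into a final answer. An ensemble's generalization ability is usually better than that of each individual base learner.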
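A toy sketch of the aggregation step, assuming labels in {-1, +1} and a (optionally weighted) vote as the combination rule. Voting is only one way to combine base learners, and the three hand-written prediction lists stand in for real classifiers.

```python
def majority_vote(per_learner_predictions, learner_weights=None):
    """Combine {-1, +1} predictions from several base learners by weighted vote.

    per_learner_predictions: one list of predictions per base learner
    learner_weights: optional weight per learner (defaults to equal weights)
    """
    if learner_weights is None:
        learner_weights = [1.0] * len(per_learner_predictions)
    combined = []
    # zip(*...) regroups the predictions sample by sample.
    for votes in zip(*per_learner_predictions):
        score = sum(w * v for w, v in zip(learner_weights, votes))
        combined.append(1 if score >= 0 else -1)
    return combined

y_true = [1, 1, -1, -1, 1, -1]
h1 = [-1, 1, -1, -1, 1, -1]   # wrong on sample 0
h2 = [1, 1, 1, -1, 1, -1]     # wrong on sample 2
h3 = [1, 1, -1, -1, -1, -1]   # wrong on sample 4
ensemble = majority_vote([h1, h2, h3])
```

Each base learner makes one mistake, but because the mistakes fall on different samples, the vote recovers the correct label everywhere. Real base learners rarely err this independently, so the gain is usually smaller.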

Experimental verification

Common ways to measure how good a model is: most of them are derived from the confusion matrix. Three are introduced below: F-measure, G-mean and ROC analysis.
[1]F-measure
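Use when: only the quality of the predictions on the positive class matters
ps: see the paper for the formula; sadly I have not yet found how to type formulas in the editor
[2]G-mean
Use when: the accuracy on both the positive and the negative class matters
[3]ROC analysis
Put simply, the ROC curve is traced out by the (FP_rate, TP_rate) points obtained at different thresholds. Unless one curve completely dominates another, it is hard to say which classifier is strictly better, which is why AUC was introduced: it is defined as the area under the ROC curve.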
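The three measures can be sketched as follows, using their common textbook forms: F-measure as the harmonic mean of precision and recall, G-mean as the square root of TP_rate times TN_rate, and AUC by the trapezoidal rule over the ROC points. The paper's exact definitions may carry extra parameters, and the function names and toy data here are made up.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn

def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall (the beta = 1 case)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fn, fp, tn):
    """Geometric mean of the true positive rate and the true negative rate."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def roc_points(y_true, scores, positive=1):
    """(FP_rate, TP_rate) pairs from sweeping the threshold over every score."""
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == positive)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
f_value = f_measure(tp, fn, fp)
g_value = g_mean(tp, fn, fp, tn)
area = auc(roc_points(y_true, scores))
```

F-measure ignores TN entirely, which is why it suits the case where only the positive class matters, while G-mean drops to zero if either class is completely misclassified.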
The paper's experiments validate the three algorithms on medical diagnosis datasets of positive and negative cases.

Result

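Testing the three new algorithms on the medical diagnosis datasets gives the following results:
1>Except for AdaC1, the other two algorithms can achieve recall values higher than their precision values
2>Both AdaC2 and AdaC3 are sensitive to the cost setup
3>Comparing AdaC1 with AdaCost, AdaCost's recall is higher than AdaC1's in most cases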