Paper Reading: CSAL: Cost sensitive active learning for multi-source drifting stream

Latest recommended article published on 2024-08-28 11:21:36

浴缸里看海

Tags: paper reading

Article link: https://blog.csdn.net/m0_46581637/article/details/132660941

This paper proposes a system for performing active learning in the setting of multi-source classification.

Multi-source classification problem: the classification model must be trained on multiple source streams together with a target stream, and then classify samples on the target stream.

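The main difficulties are:

Obtaining ground-truth labels in a stream environment consumes a lot of human labor.
How to handle knowledge transfer across multiple streams.
Concept drift occurs frequently in a stream environment; how can it be detected without ground-truth labels?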

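The CSAL system proposes the following new algorithms to address these problems:

a novel asymmetry weighting mechanism: increases the weight of beneficial transfer and reduces the weight of negative transfer.
a novel multi-perspective similarity estimation: source similarity estimation, used to quantify how similar the target stream and each source stream are.
a parallel multiple hypothesis drift detection method (PMH-DDM): performs drift detection when ground-truth labels are lacking, by monitoring the prediction model's confidence; a drop in confidence indicates that drift has occurred.
a novel cost-sensitive hybrid active labeling strategy (quite a mouthful): put simply, when the multi-perspective similarity estimation finds that the association between the target stream and the source streams has weakened (concept drift may have occurred), the sampling threshold is lowered so that more ground-truth labels are sampled.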

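The figure shows the rough structure of CSAL. The multi-source ensemble framework (ME for short) is the prediction model trained by CSAL. When PMH-DDM detects concept drift, a new classifier is added to ME; when no concept drift is detected, the newest classifier is trained with the samples labeled through the active learning strategy.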

The pseudocode expresses the same thing as the figure above.
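The update rule described above (add a classifier on drift, otherwise keep training the newest one) can be sketched roughly as follows. This is not the paper's pseudocode: `MajorityClassifier`, `Ensemble`, and `update` are hypothetical names, and the base learner is a toy stand-in chosen only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MajorityClassifier:
    """Toy base learner (an assumption): predicts the most frequent label seen."""
    counts: dict = field(default_factory=dict)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        # Returns None until at least one labeled sample has been seen.
        return max(self.counts, key=self.counts.get) if self.counts else None


@dataclass
class Ensemble:
    """Hypothetical stand-in for the multi-source ensemble (ME)."""
    members: List[MajorityClassifier] = field(
        default_factory=lambda: [MajorityClassifier()]
    )

    def update(self, labeled_batch, drift_detected):
        if drift_detected:
            # Drift reported by the detector: start a fresh classifier.
            self.members.append(MajorityClassifier())
        else:
            # No drift: train the newest classifier on the actively labeled samples.
            newest = self.members[-1]
            for x, y in labeled_batch:
                newest.learn(x, y)
```

Older members are kept after a drift, so the ensemble can still draw on earlier concepts through its weights.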

Next, a detailed look at the innovations proposed in this paper.

asymmetry weighting mechanism

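This one is easy to understand. In ensemble learning, every classifier has a corresponding weight, and adjusting that weight controls how much say the classifier has in the whole ensemble. Increasing the weights of classifiers that classify correctly and decreasing the weights of those that classify incorrectly achieves knowledge transfer well.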
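A minimal sketch of this kind of weight adjustment is given below. The multiplicative rule and the `reward` / `penalty` factors are assumptions made for illustration, not the formulas from the paper; the only point is that correct and incorrect members are treated asymmetrically.

```python
def update_weights(weights, predictions, true_label, reward=1.1, penalty=0.5):
    """Scale each member's weight up if it was right, down if it was wrong.

    `reward` and `penalty` are hypothetical factors; using different
    magnitudes for the two cases is what makes the update asymmetric.
    """
    updated = [
        w * (reward if p == true_label else penalty)
        for w, p in zip(weights, predictions)
    ]
    total = sum(updated)
    return [w / total for w in updated]  # renormalize so the weights sum to 1


def weighted_vote(weights, predictions):
    """Return the label with the largest total weight behind it."""
    scores = {}
    for w, p in zip(weights, predictions):
        scores[p] = scores.get(p, 0.0) + w
    return max(scores, key=scores.get)
```

With a penalty stronger than the reward, a source classifier that keeps misleading the target loses influence quickly, which is one way to limit negative transfer.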

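multi-perspective similarity estimation

Source similarity estimation. To detect both covariate shift (the data distribution changes while the decision boundary stays the same) and the inductive transfer problem (the decision boundary itself changes), this algorithm runs two kinds of tests on the target ensemble classifier: one decision consistency-based and one probability distribution-based.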

The decision consistency-based test directly compares whether the decisions of the source ensemble classifiers and the target ensemble classifiers are fully consistent.
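One simple way to turn this comparison into a number is the agreement rate over a batch of samples, sketched below. The function name and the use of a rate (rather than a strict all-or-nothing check) are assumptions; a value of 1.0 corresponds to fully consistent decisions.

```python
def decision_consistency(source_preds, target_preds):
    """Fraction of samples on which source and target classifiers agree."""
    if len(source_preds) != len(target_preds):
        raise ValueError("prediction lists must have the same length")
    if not source_preds:
        return 0.0  # no evidence of agreement on an empty batch
    agree = sum(s == t for s, t in zip(source_preds, target_preds))
    return agree / len(source_preds)
```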

The probability distribution-based test is also simple: it runs a Kolmogorov-Smirnov (K-S) test on the target stream and the source stream.
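For reference, the two-sample K-S statistic is the largest gap between the empirical CDFs of the two samples. The sketch below computes it in plain Python for a single numeric feature; applying it one feature at a time is an assumption, since this note does not say how the paper handles multi-dimensional data.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: max |ECDF_a(v) - ECDF_b(v)| over all observed v."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for v in a + b:
        # bisect_right counts the values <= v, i.e. the empirical CDF at v.
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A statistic near 0 suggests the source and target windows follow a similar distribution, and a value near 1 suggests they barely overlap. In practice `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.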

cost sensitive active learning strategy

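A strategy for actively choosing which samples need labeling, to save on manual labeling cost (to be honest, active learning already saves cost by itself, so the "cost-sensitive" part feels entirely redundant).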

This is also easy to understand: it uses both the random selection strategy and the uncertainty strategy from active learning. The random selection strategy needs no further explanation, so the focus here is on the uncertainty strategy.

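The uncertainty strategy is in effect the difference between the classifier's top two prediction scores. If this difference is large, the model is confident in its prediction; otherwise the model is unsure of its own prediction, and a human needs to supply the ground-truth label.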

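The algorithm looks complicated, but those formulas are really just computing the threshold for the uncertainty strategy. Knowing what they are for is enough; the paper does not give a derivation for the formulas, so we can simply treat them as mapping functions.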
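The hybrid of random selection and margin-based uncertainty described above can be sketched as follows. The fixed `margin_threshold` and `random_rate` values are hypothetical placeholders; in the paper the threshold comes from its own formulas, which are not reproduced here.

```python
import random


def margin(probabilities):
    """Difference between the two largest predicted class probabilities."""
    if len(probabilities) < 2:
        raise ValueError("need at least two class probabilities")
    top = sorted(probabilities, reverse=True)
    return top[0] - top[1]


def should_query(probabilities, margin_threshold=0.2, random_rate=0.05, rng=random):
    """Decide whether to ask a human for the ground-truth label."""
    # Random selection: occasionally query regardless of confidence.
    if rng.random() < random_rate:
        return True
    # Uncertainty: query when the top two classes are too close to call.
    return margin(probabilities) < margin_threshold
```

The random branch matters in a drifting stream because a model can stay confident while being wrong, and pure uncertainty sampling would never query those samples.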

Parallel multi-hypothesis drift detection (after reading for ages, this is the only small part that is relevant to me...)

The core of the unsupervised DDM is to observe the target stream classifier's confidence in order to decide whether concept drift has occurred.
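As a rough illustration of label-free drift detection by watching confidence, the sketch below compares a sliding-window mean of the classifier's confidence with a reference level. It is a deliberately simplified stand-in, not PMH-DDM: the class name, `window`, and `tolerance` are all assumptions, and the parallel multi-hypothesis part of the paper's method is not modeled.

```python
from collections import deque


class ConfidenceDriftMonitor:
    """Flags drift when mean confidence falls well below a reference level."""

    def __init__(self, window=50, tolerance=0.15):
        self.window = window
        self.tolerance = tolerance
        self.reference = None  # mean confidence of the first full window
        self.recent = deque(maxlen=window)

    def add(self, confidence):
        """Feed one prediction confidence; return True if drift is flagged."""
        self.recent.append(confidence)
        if len(self.recent) < self.window:
            return False  # not enough data yet
        mean = sum(self.recent) / self.window
        if self.reference is None:
            self.reference = mean
            return False
        if self.reference - mean > self.tolerance:
            # Confidence dropped: report drift and start over on the new concept.
            self.reference = None
            self.recent.clear()
            return True
        return False
```

Each call to `add` takes the top predicted probability for one unlabeled target sample, so no ground-truth label is needed to raise the flag.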

Ref: CSAL: Cost sensitive active learning for multi-source drifting stream