Scattered observations:
- Most of these papers tackle multi-class problems; the binary-class imbalance setting is comparatively rare
Table of contents
- A Novel Model for Imbalanced Data Classification(aaai2020)
- Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss(nips2019)
- A Scalable Exemplar-based Subspace Clustering Algorithm for Class-Imbalanced Data(eccv2018)
- Learning to Balance - Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks(iclr2020)
- Trainable Undersampling for Class-Imbalance Learning(aaai2019)
- Dynamic Curriculum Learning for Imbalanced Data Classification(iccv2019)
- Online Continual Learning from Imbalanced Data(icml2020)
- Multi-Class Imbalanced Graph Convolutional Network Learning(ijcai2020)
- Learning from Few Positives: a Provably Accurate Metric Learning Algorithm to Deal with Imbalanced Data(ijcai2020)
- Long-tail Session-based Recommendation(recsys2020)
- Iterative Metric Learning for Imbalance Data Classification(ijcai2018)
A Novel Model for Imbalanced Data Classification(aaai2020)
Combines sampling, a custom loss, weight adjustment, and ensemble learning.
Four modules: DBC, DSI, AWA, EL.
The training data feeds into DBC, which builds several balanced small datasets (data blocks); each data block feeds into DSI, while AWA adjusts the weights; the resulting kNN models are combined by ensembling.
Sampling: DBC / Data Block Construction
Core idea: undersample the majority class.
Steps (a code sketch follows the list):
- Split the majority data $S_{maj}$ into $\delta^*$ chunks $\{C_1, C_2, \dots, C_{\delta^*}\}$, each containing roughly as many samples as the minority class
- For each $C_i$, merge $C_i$ with $S_{min}$ to form a block $B_i$
- Return $B = \{B_1, B_2, \dots, B_{\delta^*}\}$
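A minimal NumPy sketch of this block construction, assuming $\delta^*$ is simply the floor of the majority/minority size ratio; the function name `build_blocks` and the shuffling step are my own choices, not from the paper:

```python
import numpy as np

def build_blocks(X_maj, X_min, rng=None):
    """Split the majority samples into minority-sized chunks C_i and
    pair each chunk with the full minority set, yielding balanced blocks B_i."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(X_maj))       # shuffle so chunks are unbiased
    n_min = len(X_min)
    delta = max(1, len(X_maj) // n_min)     # number of blocks, delta* (assumed)
    blocks = []
    for i in range(delta):
        chunk = X_maj[idx[i * n_min:(i + 1) * n_min]]   # C_i
        blocks.append(np.vstack([chunk, X_min]))        # B_i = C_i ∪ S_min
    return blocks
```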
Custom loss: DSI / Data Space Improvement
Uses the LMNN algorithm to learn a linear transformation matrix $L$, with the loss (a code sketch follows the definitions):
$$\begin{aligned} \varphi(L) &= (1 - \lambda)\,\varphi_{pull}(L) + \lambda\,\varphi_{push}(L) \\ \varphi_{pull}(L) &= \sum_{i,\, j \in N(i)} \| L(x_i - x_j) \|^2 \\ \varphi_{push}(L) &= \sum_{i,j,l} \left[ 1 + \| L(x_i - x_j) \|^2 - \| L(x_i - x_l) \|^2 \right]_+ \end{aligned}$$
where:
- $\varphi_{pull}(L)$ penalizes same-label samples that lie far from the current sample
- $N(i)$ is the set of same-label samples in the neighborhood of sample $i$
- $\varphi_{push}(L)$ penalizes differently-labeled samples that lie close to the current sample
- $[a]_+ = \max(a, 0)$
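A minimal NumPy sketch of this LMNN-style objective, assuming the neighbor and impostor index sets are precomputed; `lmnn_loss`, `targets`, and `impostors` are hypothetical names for illustration:

```python
import numpy as np

def lmnn_loss(L, X, targets, impostors, lam=0.5):
    """LMNN-style loss over a linear map L.
    targets:   (i, j) pairs where j is a same-label neighbor of i (the set N(i))
    impostors: (i, j, l) triples where l carries a different label than i
    """
    # pull term: keep same-label neighbors close after the mapping
    pull = sum(np.sum((L @ (X[i] - X[j])) ** 2) for i, j in targets)
    # push term: hinge on differently-labeled samples that invade the margin
    push = sum(max(0.0, 1.0 + np.sum((L @ (X[i] - X[j])) ** 2)
                        - np.sum((L @ (X[i] - X[l])) ** 2))
               for i, j, l in impostors)
    return (1.0 - lam) * pull + lam * push
```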
Weight adjustment: AWA / Adaptive Weight Adjustment
The weights are adjusted dynamically; each weight is set according to how well the current classifier performs on its block $B_i$.
Confusion matrix over the unstable samples:
| Sample | Predicted negative | Predicted positive |
| --- | --- | --- |
| Positive | $c_{1,0}$ | $c_{1,1}$ |
| Negative | $c_{0,0}$ | $c_{0,1}$ |
Let $x$ denote the importance ratio between the minority and majority samples, and compute:
$$\begin{cases} gain_{mat} = x \cdot (c_{1,1} - c_{1,0}) + (c_{0,0} - c_{0,1}) \\ gain_{pos} = x \cdot (c_{1,1} + c_{1,0}) + (-c_{0,0} - c_{0,1}) \\ gain_{neg} = x \cdot (-c_{1,1} - c_{1,0}) + (c_{0,0} + c_{0,1}) \end{cases}$$
$gain_{mat}$ is the overall gain. If the largest of the three gains equals $gain_{mat}$, set $W_n = W_d$; otherwise take the larger of $gain_{pos}$ and $gain_{neg}$ and update $W_n = W_t + \Delta$, where $W_t$ is an initial value.
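A hedged sketch of this update rule. How $W_d$ and $\Delta$ are chosen, and exactly which weight the directional gain bumps, are not spelled out above, so `w_default` and `delta` below are placeholder assumptions:

```python
def awa_update(c11, c10, c00, c01, x, w_default, w_init, delta):
    """Adaptive weight update from the unstable-sample confusion matrix.
    x is the importance ratio between minority and majority samples."""
    gain_mat = x * (c11 - c10) + (c00 - c01)       # overall gain
    gain_pos = x * (c11 + c10) - (c00 + c01)       # gain from favoring positives
    gain_neg = -x * (c11 + c10) + (c00 + c01)      # gain from favoring negatives
    if gain_mat >= max(gain_pos, gain_neg):
        return w_default                           # W_n = W_d
    # otherwise the larger directional gain wins and the weight is bumped
    return w_init + delta                          # W_n = W_t + Δ
```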
Ensemble learning: EL / Ensemble Learning
Voting uses the weights computed by AWA: a sample is predicted positive only when weight_p * (number of classifiers voting positive) > weight_n * (number of classifiers voting negative).
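A minimal sketch of this weighted vote; the function and argument names are my own:

```python
def weighted_vote(votes, weight_p, weight_n):
    """votes: list of 0/1 predictions, one per base kNN classifier.
    Predict positive only if the weighted positive votes outweigh
    the weighted negative votes."""
    n_pos = sum(votes)
    n_neg = len(votes) - n_pos
    return 1 if weight_p * n_pos > weight_n * n_neg else 0
```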
Experiments
Public datasets.
An ablation study removes DBC, AWA, and DSI in turn.
The ablation shows DBC is the most useful module.
Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss(nips2019)
A purpose-built loss function (LDAM loss) + re-weighting (DRW).
LDAM loss
Definition of the margin:
$$\gamma(x, y) = f(x)_y - \max_{j \neq y} f(x)_j$$
For a target label $y$, the margin between sample $x$ and $y$ is the probability the model assigns to predicting $x$ as $y$, minus the largest probability among the non-$y$ labels. For example: suppose there are labels $y_1, y_2, y_3$, the model predicts $x$ as each of them with probabilities $p_1 = f(x)_{y_1}$, $p_2 = f(x)_{y_2}$, $p_3 = f(x)_{y_3}$, and $p_2 > p_1 > p_3$; then $\gamma(x, y_1) = p_1 - p_2$, $\gamma(x, y_2) = p_2 - p_1$, $\gamma(x, y_3) = p_3 - p_2$.
Define the margin of label $j$ as the minimum margin over its sample set $S_j$:
$$\gamma_j = \min_{i \in S_j} \gamma(x_i, y_i)$$
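A small NumPy sketch of these two definitions; `sample_margin`, `class_margins`, and the toy score vector are illustrative only:

```python
import numpy as np

def sample_margin(scores, y):
    """gamma(x, y): score for the true label minus the best other score."""
    others = np.delete(scores, y)
    return scores[y] - others.max()

def class_margins(S, labels, n_classes):
    """gamma_j: minimum sample margin within each class j."""
    gam = np.full(n_classes, np.inf)
    for scores, y in zip(S, labels):
        gam[y] = min(gam[y], sample_margin(scores, y))
    return gam

# toy check against the worked example above: p2 > p1 > p3
scores = np.array([0.3, 0.5, 0.2])   # p1, p2, p3
print(sample_margin(scores, 0))      # gamma(x, y1) = p1 - p2 = -0.2
```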
The paper argues that the generalization error on imbalanced data has an upper bound:
$$\text{imbalanced test error} \leq \frac{1}{\gamma_{min}} \sqrt{\frac{C(F)}{n}}$$
where $C(F)$ is a function of the model class $F$ (a complexity measure).
To minimize the error, we minimize $\frac{1}{\gamma_{min}}\sqrt{\frac{C(F)}{n}}$; for binary classification this becomes:
$$\min \; \frac{1}{\gamma_1 \sqrt{n_1}} + \frac{1}{\gamma_2 \sqrt{n_2}}$$
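Solving this trade-off in the paper yields per-class margins $\gamma_j \propto n_j^{-1/4}$, which is what the LDAM loss enforces by subtracting a per-class margin from the true-class logit. A minimal PyTorch sketch under that assumption; the constant `C` is a tunable hyperparameter and the scale `s` follows the paper's reference implementation:

```python
import torch
import torch.nn.functional as F

def ldam_loss(logits, target, cls_counts, C=0.5, s=30.0):
    """LDAM loss sketch: subtract Delta_j = C / n_j^{1/4} from the
    true-class logit, then apply scaled cross-entropy.
    cls_counts: tensor of per-class sample counts n_j."""
    margins = C / cls_counts.float() ** 0.25   # Delta_j, larger for rare classes
    adjusted = logits.clone()
    adjusted[torch.arange(len(target)), target] -= margins[target]
    return F.cross_entropy(s * adjusted, target)
```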