Scattered observations:
- Most of these papers tackle multi-class problems; the binary-class imbalance setting is comparatively rare
Table of contents
- A Novel Model for Imbalanced Data Classification(aaai2020)
- Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss(nips2019)
- A Scalable Exemplar-based Subspace Clustering Algorithm for Class-Imbalanced Data(eccv2018)
- Learning to Balance - Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks(iclr2020)
- Trainable Undersampling for Class-Imbalance Learning(aaai2019)
- Dynamic Curriculum Learning for Imbalanced Data Classification(iccv2019)
- Online Continual Learning from Imbalanced Data(icml2020)
- Multi-Class Imbalanced Graph Convolutional Network Learning(ijcai2020)
- Learning from Few Positives: a Provably Accurate Metric Learning Algorithm to Deal with Imbalanced Data(ijcai2020)
- Long-tail Session-based Recommendation(recsys2020)
- Iterative Metric Learning for Imbalance Data Classification(ijcai2018)
A Novel Model for Imbalanced Data Classification(aaai2020)
Combines sampling, a custom loss, weight adjustment, and ensemble learning.
Four modules: DBC, DSI, AWA, EL.
The training data feeds into DBC, which builds several balanced small datasets (data blocks); each data block feeds into DSI, while AWA adjusts the weights; the resulting kNN models are combined by ensembling.
Sampling: DBC / Data Block Construction
Core idea: undersample the majority class.
Steps (a code sketch follows the list):
- Split the majority data $S_{maj}$ into $\delta^*$ chunks $\{C_1, C_2, \dots, C_{\delta^*}\}$, each containing roughly as many samples as the minority class
- For each $C_i$, merge $C_i$ with $S_{min}$ to form a block $B_i$
- Return $B = \{B_1, B_2, \dots, B_{\delta^*}\}$
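A minimal NumPy sketch of this block construction, assuming $\delta^*$ is simply the floor of the majority/minority size ratio; the function name `build_blocks` and the shuffling step are my own choices, not from the paper:

```python
import numpy as np

def build_blocks(X_maj, X_min, rng=None):
    """Split the majority samples into minority-sized chunks C_i and
    pair each chunk with the full minority set, yielding balanced blocks B_i."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(X_maj))       # shuffle so chunks are unbiased
    n_min = len(X_min)
    delta = max(1, len(X_maj) // n_min)     # number of blocks, delta* (assumed)
    blocks = []
    for i in range(delta):
        chunk = X_maj[idx[i * n_min:(i + 1) * n_min]]   # C_i
        blocks.append(np.vstack([chunk, X_min]))        # B_i = C_i ∪ S_min
    return blocks
```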
Custom loss: DSI / Data Space Improvement
Uses the LMNN algorithm to learn a linear transformation matrix $L$, with the loss (a code sketch follows the definitions):
$$\begin{aligned} \varphi(L) &= (1 - \lambda)\,\varphi_{pull}(L) + \lambda\,\varphi_{push}(L) \\ \varphi_{pull}(L) &= \sum_{i,\, j \in N(i)} \| L(x_i - x_j) \|^2 \\ \varphi_{push}(L) &= \sum_{i,j,l} \left[ 1 + \| L(x_i - x_j) \|^2 - \| L(x_i - x_l) \|^2 \right]_+ \end{aligned}$$
where:
- $\varphi_{pull}(L)$ penalizes same-label samples that lie far from the current sample
- $N(i)$ is the set of same-label samples in the neighborhood of sample $i$
- $\varphi_{push}(L)$ penalizes differently-labeled samples that lie close to the current sample
- $[a]_+ = \max(a, 0)$
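A minimal NumPy sketch of this LMNN-style objective, assuming the neighbor and impostor index sets are precomputed; `lmnn_loss`, `targets`, and `impostors` are hypothetical names for illustration:

```python
import numpy as np

def lmnn_loss(L, X, targets, impostors, lam=0.5):
    """LMNN-style loss over a linear map L.
    targets:   (i, j) pairs where j is a same-label neighbor of i (the set N(i))
    impostors: (i, j, l) triples where l carries a different label than i
    """
    # pull term: keep same-label neighbors close after the mapping
    pull = sum(np.sum((L @ (X[i] - X[j])) ** 2) for i, j in targets)
    # push term: hinge on differently-labeled samples that invade the margin
    push = sum(max(0.0, 1.0 + np.sum((L @ (X[i] - X[j])) ** 2)
                        - np.sum((L @ (X[i] - X[l])) ** 2))
               for i, j, l in impostors)
    return (1.0 - lam) * pull + lam * push
```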
Weight adjustment: AWA / Adaptive Weight Adjustment
The weights are adjusted dynamically; each weight is set according to how well the current classifier performs on its block $B_i$.
Confusion matrix over the unstable samples:
| Sample | Predicted negative | Predicted positive |
| --- | --- | --- |
| Positive | $c_{1,0}$ | $c_{1,1}$ |
| Negative | $c_{0,0}$ | $c_{0,1}$ |
Let $x$ denote the importance ratio between the minority and majority samples, and compute:
$$\begin{cases} gain_{mat} = x \cdot (c_{1,1} - c_{1,0}) + (c_{0,0} - c_{0,1}) \\ gain_{pos} = x \cdot (c_{1,1} + c_{1,0}) + (-c_{0,0} - c_{0,1}) \\ gain_{neg} = x \cdot (-c_{1,1} - c_{1,0}) + (c_{0,0} + c_{0,1}) \end{cases}$$
$gain_{mat}$ is the overall gain. If the largest of the three gains equals $gain_{mat}$, set $W_n = W_d$; otherwise take the larger of $gain_{pos}$ and $gain_{neg}$ and update $W_n = W_t + \Delta$, where $W_t$ is an initial value.
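A hedged sketch of this update rule. How $W_d$ and $\Delta$ are chosen, and exactly which weight the directional gain bumps, are not spelled out above, so `w_default` and `delta` below are placeholder assumptions:

```python
def awa_update(c11, c10, c00, c01, x, w_default, w_init, delta):
    """Adaptive weight update from the unstable-sample confusion matrix.
    x is the importance ratio between minority and majority samples."""
    gain_mat = x * (c11 - c10) + (c00 - c01)       # overall gain
    gain_pos = x * (c11 + c10) - (c00 + c01)       # gain from favoring positives
    gain_neg = -x * (c11 + c10) + (c00 + c01)      # gain from favoring negatives
    if gain_mat >= max(gain_pos, gain_neg):
        return w_default                           # W_n = W_d
    # otherwise the larger directional gain wins and the weight is bumped
    return w_init + delta                          # W_n = W_t + Δ
```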
Ensemble learning: EL / Ensemble Learning
Voting uses the weights computed by AWA: a sample is predicted positive only when weight_p * (number of classifiers voting positive) > weight_n * (number of classifiers voting negative).
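A minimal sketch of this weighted vote; the function and argument names are my own:

```python
def weighted_vote(votes, weight_p, weight_n):
    """votes: list of 0/1 predictions, one per base kNN classifier.
    Predict positive only if the weighted positive votes outweigh
    the weighted negative votes."""
    n_pos = sum(votes)
    n_neg = len(votes) - n_pos
    return 1 if weight_p * n_pos > weight_n * n_neg else 0
```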
Experiments
Public datasets.
An ablation study removes DBC, AWA, and DSI in turn.
The ablation shows DBC is the most useful module.
Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss(nips2019)
A purpose-built loss function (LDAM loss) + re-weighting (DRW).
LDAM loss
Definition of the margin:
$$\gamma(x, y) = f(x)_y - \max_{j \neq y} f(x)_j$$
For a target label $y$, the margin between sample $x$ and $y$ is the probability the model assigns to predicting $x$ as $y$, minus the largest probability among the non-$y$ labels. For example: suppose there are labels $y_1, y_2, y_3$, the model predicts $x$ as each of them with probabilities $p_1 = f(x)_{y_1}$, $p_2 = f(x)_{y_2}$, $p_3 = f(x)_{y_3}$, and $p_2 > p_1 > p_3$; then $\gamma(x, y_1) = p_1 - p_2$, $\gamma(x, y_2) = p_2 - p_1$, $\gamma(x, y_3) = p_3 - p_2$.
Define the margin of label $j$ as the minimum margin over its sample set $S_j$:
$$\gamma_j = \min_{i \in S_j} \gamma(x_i, y_i)$$
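A small NumPy sketch of these two definitions; `sample_margin`, `class_margins`, and the toy score vector are illustrative only:

```python
import numpy as np

def sample_margin(scores, y):
    """gamma(x, y): score for the true label minus the best other score."""
    others = np.delete(scores, y)
    return scores[y] - others.max()

def class_margins(S, labels, n_classes):
    """gamma_j: minimum sample margin within each class j."""
    gam = np.full(n_classes, np.inf)
    for scores, y in zip(S, labels):
        gam[y] = min(gam[y], sample_margin(scores, y))
    return gam

# toy check against the worked example above: p2 > p1 > p3
scores = np.array([0.3, 0.5, 0.2])   # p1, p2, p3
print(sample_margin(scores, 0))      # gamma(x, y1) = p1 - p2 = -0.2
```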
The paper argues that the generalization error on imbalanced data has an upper bound:
$$\text{imbalanced test error} \leq \frac{1}{\gamma_{min}} \sqrt{\frac{C(F)}{n}}$$
where $C(F)$ is a function of the model class $F$ (a complexity measure).
To minimize the error, we minimize $\frac{1}{\gamma_{min}}\sqrt{\frac{C(F)}{n}}$; for binary classification this becomes:
$$\min \; \frac{1}{\gamma_1 \sqrt{n_1}} + \frac{1}{\gamma_2 \sqrt{n_2}}$$
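Solving this trade-off in the paper yields per-class margins $\gamma_j \propto n_j^{-1/4}$, which is what the LDAM loss enforces by subtracting a per-class margin from the true-class logit. A minimal PyTorch sketch under that assumption; the constant `C` is a tunable hyperparameter and the scale `s` follows the paper's reference implementation:

```python
import torch
import torch.nn.functional as F

def ldam_loss(logits, target, cls_counts, C=0.5, s=30.0):
    """LDAM loss sketch: subtract Delta_j = C / n_j^{1/4} from the
    true-class logit, then apply scaled cross-entropy.
    cls_counts: tensor of per-class sample counts n_j."""
    margins = C / cls_counts.float() ** 0.25   # Delta_j, larger for rare classes
    adjusted = logits.clone()
    adjusted[torch.arange(len(target)), target] -= margins[target]
    return F.cross_entropy(s * adjusted, target)
```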