Data Imbalance: The Problem and Some Solutions

adam-liu

Article link: https://blog.csdn.net/qq_41664845/article/details/125202908

Dataset distribution
Solution 1: Oversampling
- Implementation
- Results
Solution 2: Focal loss
Solution 3: [Modified cross-entropy loss](https://kexue.fm/archives/4293)

Dataset distribution

| | Label 1 | Label 0 |
|---|---|---|
| Training set | 187035 (79.3%) | 48923 (20.7%) |
| Test set | 20702 (79.4%) | 5383 (20.6%) |

Label 1 has roughly 3.8 times as many samples as label 0 (close to 4:1), so the dataset is quite heavily imbalanced.
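The shares and the ratio quoted above can be re-derived from the raw counts. The sketch below is only a sanity check; `counts` and `class_share` are illustrative names that do not appear in the original code.

```python
# Sample counts copied from the dataset table (label -> number of samples).
counts = {
    "train": {1: 187035, 0: 48923},
    "test": {1: 20702, 0: 5383},
}


def class_share(split):
    """Return (fraction of label-1 samples, label-1 to label-0 ratio)."""
    n1, n0 = counts[split][1], counts[split][0]
    return n1 / (n1 + n0), n1 / n0


for split in counts:
    share, ratio = class_share(split)
    print(f"{split}: label 1 = {share:.1%}, ratio 1:0 = {ratio:.2f}")
```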

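Solution 1: Oversampling

Oversampling procedure: every label-0 sample is repeated four times and then combined with the label-1 samples, which brings the two classes to roughly the same size (195692 vs. 187035 training samples).

Implementation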

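import numpy as np

# get() is the author's data-loading helper (not shown here).
# The indexing below assumes NumPy arrays and a 1-D train_y.
train_x, train_y = get(data)
# ndarray.repeat(a, b) repeats every element a times along axis b.
# Indices of the minority class (label 0), each repeated 4 times.
ridx = np.argwhere(train_y == 0).reshape(-1).repeat(4, 0)
# Indices of the majority class (label 1), used once.
idx = np.argwhere(train_y == 1).reshape(-1)
oversample_idx = np.concatenate([ridx, idx])
train_x, train_y = train_x[oversample_idx], train_y[oversample_idx]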
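To make the effect of the index trick visible, here is a minimal self-contained sketch on made-up toy data. The arrays `toy_x` and `toy_y` and the shuffling step are assumptions added for illustration; they are not part of the original experiment.

```python
import numpy as np

# Toy labels with a 4:1 imbalance: eight 1s and two 0s (made-up data).
toy_y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
toy_x = np.arange(len(toy_y)).reshape(-1, 1)  # one dummy feature per sample

# Each minority (label 0) index appears 4 times, each majority index once.
minority_idx = np.flatnonzero(toy_y == 0).repeat(4)
majority_idx = np.flatnonzero(toy_y == 1)
idx = np.concatenate([minority_idx, majority_idx])

# Shuffle so the duplicated samples are not all grouped at the start.
rng = np.random.default_rng(0)
rng.shuffle(idx)

over_x, over_y = toy_x[idx], toy_y[idx]
print(np.bincount(toy_y), "->", np.bincount(over_y))
```

Oversampling of this kind should only be applied to the training split. Duplicating samples before splitting would leak copies of the same sample into the test set.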

Results

Oversampling brings a modest improvement:

Accuracy: +0.48
Precision: +0.59
Recall: +0.48
F1: +0.53

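Solution 2: Focal loss

[1] Focal Loss for Dense Object Detection, Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár, ICCV, 2017

Overview

Focal loss is a loss function designed for classification with imbalanced samples.
$\alpha$: a class-balancing weight. Positive samples are weighted by $\alpha$ and negative samples by $1-\alpha$; this per-sample weight is written $\alpha_t$.
$\gamma$: the focusing parameter. It controls how quickly the weight of easy samples decays through the factor $(1-p_t)^{\gamma}$: easy samples ($p_t$ close to 1) get a small weight, hard samples a larger one.
$p_t$: the predicted probability of the true class, i.e. $p$ for a positive sample and $1-p$ for a negative one.
$\mathrm{FL}(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t)$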
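A small numeric sketch may help to see what the formula does. It evaluates plain cross-entropy and focal loss for three positive samples of decreasing confidence; the prediction values are made up for illustration, and the NumPy helpers `focal_loss` and `cross_entropy` are not the author's code.

```python
import numpy as np


def cross_entropy(p, y):
    """Per-sample binary cross-entropy, -log(p_t)."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -np.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-sample binary focal loss, -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


p = np.array([0.9, 0.6, 0.1])  # easy, borderline, hard (illustrative)
y = np.array([1, 1, 1])        # all three are positive samples

ce = cross_entropy(p, y)
fl = focal_loss(p, y)
# The ratio fl / ce equals alpha_t * (1 - p_t)^gamma, the weight focal loss applies.
print("cross-entropy:", ce)
print("focal loss:   ", fl)
print("weight:       ", fl / ce)
```

With $\gamma = 2$ the weight on the easy sample ($p_t = 0.9$) is 81 times smaller than on the hard one ($p_t = 0.1$), since $(0.1)^2 / (0.9)^2 = 1/81$. Setting $\gamma = 0$ removes the modulation and leaves an $\alpha$-weighted cross-entropy.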

Implementation

import tensorflow as tf


def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Build a binary focal loss function that can be passed to model.compile()."""
    alpha = tf.constant(alpha, dtype=tf.float32)
    gamma = tf.constant(gamma, dtype=tf.float32)

    def binary_focal_loss_fixed(y_true, y_pred):
        """
        y_true: labels in {0, 1}, shape (None, 1).
        y_pred: probabilities, i.e. the output of a sigmoid, same shape.
        """
        y_true = tf.cast(y_true, tf.float32)
        # Clip so that log(p_t) stays finite and 1 - p_t never goes negative.
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(tf.cast(y_pred, tf.float32), eps, 1.0 - eps)
        # alpha for positive samples, 1 - alpha for negative samples.
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        # p_t: predicted probability of the true class.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        focal_loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
        return tf.reduce_mean(focal_loss)

    return binary_focal_loss_fixed


# model and optimizer are assumed to be defined elsewhere.
model.compile(loss=binary_focal_loss(alpha=0.25, gamma=2.0), metrics=["accuracy"], optimizer=optimizer)

Results

The effect is negligible:

F1: -0.05%

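Solution 3: Modified cross-entropy loss

Overview

For a binary classifier we would like the model to output 1 for every positive sample and 0 for every negative sample, but limits on the model's fitting capacity mean this is generally out of reach. At prediction time, moreover, anything above 0.5 is treated as positive and anything below 0.5 as negative. This means the model can be updated selectively. Suppose the threshold is set to 0.6: if the model's output for a positive sample is already above 0.6, that sample is not used to update the model, and if its output for a negative sample is already below 0.4, that sample is not used either. Only the remaining samples, that is, those whose outputs fall in the 0.4 to 0.6 band or are still on the wrong side of it, drive the update. The model then concentrates on the ambiguous samples, which should improve classification; this is the same idea as the margin in a traditional SVM.

In theory this also helps against overfitting, because it stops the model from lowering the loss by fitting the already easy samples ever harder. It is like a teacher who only cares about pushing the top students from 80 to 90 points and does nothing to help the weaker ones, which is clearly not a good teacher.

Code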

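import tensorflow as tf

margin = 0.6
# tf.sign(x) returns 1 for x > 0, -1 for x < 0 and 0 for x == 0,
# so theta is a step function: 1 for t > 0, 0 for t < 0 (and 0.5 at t == 0).
theta = lambda t: (tf.sign(t) + 1.) / 2.

def loss(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    # The mask is 0 for positives predicted above margin and for negatives
    # predicted below 1 - margin; those samples produce no gradient.
    mask = (1 - theta(y_true - margin) * theta(y_pred - margin)
              - theta(1 - margin - y_true) * theta(1 - margin - y_pred))
    # Masked binary cross-entropy; 1e-8 avoids log(0).
    return -mask * (y_true * tf.math.log(y_pred + 1e-8) + (1 - y_true) * tf.math.log(1 - y_pred + 1e-8))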
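To check which samples the loss above actually keeps, the masking factor can be evaluated on its own. The following is a NumPy sketch of that factor only; the function name `update_mask` and the six example predictions are made up for illustration.

```python
import numpy as np


def update_mask(y_true, y_pred, margin=0.6):
    """Return 1.0 where a sample still contributes to the loss, 0.0 where it is skipped."""
    theta = lambda t: (np.sign(t) + 1.0) / 2.0  # step function built from sign
    return (1.0
            - theta(y_true - margin) * theta(y_pred - margin)
            - theta(1.0 - margin - y_true) * theta(1.0 - margin - y_pred))


# Three positive samples followed by three negative samples (illustrative values).
y_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
y_pred = np.array([0.9, 0.55, 0.2, 0.1, 0.45, 0.8])

mask = update_mask(y_true, y_pred)
print(mask)  # confident correct predictions (0.9 and 0.1) are masked out
```

Only the positive sample predicted at 0.9 and the negative sample predicted at 0.1 are dropped. The borderline predictions (0.55 and 0.45) and the clearly wrong ones (0.2 and 0.8) keep their full cross-entropy loss.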