[Machine Learning] Handling Class-Imbalanced Data

小言从不摸鱼

Last modified: 2024-09-03 11:37:14


Category: Machine Learning. Tags: machine learning, artificial intelligence

First published: 2024-08-31 20:03:30

Article link: https://blog.csdn.net/2301_76820214/article/details/141757414


🍔 Introduction

🍔 Option 1: LogisticRegression's built-in parameter

🍔 Option 2: imbalanced-learn

3.1 Installation

3.2 Oversampling

3.3 Undersampling

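🍔 Introduction

In practice, the data collected for modeling is often imbalanced. For example, in a training dataset, class A may account for 95% of the samples while class B accounts for only 5%.

🐼 Class imbalance affects model training, so it needs to be handled. The main approaches are: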

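Oversampling: increase the number of minority-class samples. For example, add class B samples until classes A and B are balanced.

Undersampling: reduce the number of majority-class samples. For example, remove class A samples until classes A and B are balanced.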
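
To see why a split like 95% / 5% is a problem, here is a minimal sketch using made-up labels (the lists `y_true` and `y_pred` are illustrative assumptions, not data from this article). A "model" that always predicts the majority class looks accurate while never finding a single minority sample:

```python
# Hypothetical toy labels matching a 95% / 5% class split
y_true = ["A"] * 95 + ["B"] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = ["A"] * len(y_true)

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class B: fraction of true B samples that were predicted as B
b_total = sum(t == "B" for t in y_true)
b_found = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = b_found / b_total

print(f"accuracy     = {accuracy:.2f}")   # 0.95
print(f"recall for B = {recall_b:.2f}")   # 0.00
```

This is why accuracy alone is a poor yardstick on imbalanced data, and why per-class metrics such as recall are usually checked as well.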

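🍔 Option 1: LogisticRegression's built-in parameter

To handle imbalanced data, the class_weight="balanced" parameter automatically assigns a weight to each class according to how frequently it appears: the rarer the class, the larger its weight.

Example code 💯: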

# Handle imbalanced data with class weights
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
# Drop the first 40 samples (all of class 0) to make the data imbalanced
features = iris.data[40:, :]
target = iris.target[40:]
# target now holds 10 samples of class 0, 50 of class 1, and 50 of class 2
# Relabel as a binary problem: class 0 stays 0, every other class becomes 1
target = np.where((target == 0), 0, 1)
# Standardize the features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# class_weight="balanced" weights each class inversely to its frequency
logistic_regression = LogisticRegression(random_state=0, class_weight="balanced")
model = logistic_regression.fit(features_standardized, target)
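
The weights that class_weight="balanced" produces can be worked out by hand. scikit-learn documents the rule as n_samples / (n_classes * count_of_class). The helper `balanced_class_weights` below is a hypothetical name used only for this sketch, applied to the same class counts as the iris example (10 samples of class 0, 100 of class 1):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Same class counts as the relabeled iris target: 10 zeros and 100 ones
target = [0] * 10 + [1] * 100
weights = balanced_class_weights(target)
print(weights)  # {0: 5.5, 1: 0.55}
```

The minority class ends up with ten times the weight of the majority class, which mirrors the 1:10 ratio of the class counts.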

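🍔 Option 2: imbalanced-learn

imbalanced-learn is an open-source Python library dedicated to machine learning on imbalanced datasets. It provides a range of resampling techniques, ensemble methods, and machine learning algorithms aimed at improving classification performance on imbalanced data. An overview follows.

Main features

🐻 Resampling techniques: undersampling (e.g. TomekLinks, RandomUnderSampler), oversampling (e.g. SMOTE, ADASYN), and methods that combine the two (e.g. SMOTEENN, SMOTETomek). These techniques adjust the number of samples in each class so that the classes become balanced.

🐻 Ensemble methods: imbalanced-learn also provides ensemble methods, such as ensemble learning and adaptive ensemble learning, which combine the predictions of multiple classifiers to improve overall classification performance.

🐻 Machine learning algorithms: besides resampling techniques and ensemble methods, imbalanced-learn includes algorithms designed specifically for imbalanced datasets, such as EasyEnsembleClassifier and BalancedRandomForestClassifier.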

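Characteristics

🐻 Variety: imbalanced-learn offers many different resampling techniques and ensemble methods, so users can pick the one that suits their dataset and requirements.

🐻 Extensibility: the library integrates with common Python libraries such as scikit-learn and pandas, so it can easily be combined with other machine learning algorithms and tools.

🐻 Flexibility: imbalanced-learn exposes many parameters and customization options that users can tune for different scenarios and requirements.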

3.1 Installation

Installing imbalanced-learn is straightforward with a package manager such as pip or conda. For example, the pip command is:

pip install imbalanced-learn

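3.2 Oversampling

Random oversampling: randomly pick samples from the minority class and duplicate them to increase the number of minority-class samples.
Synthetic Minority Over-sampling Technique (SMOTE):
1. For each minority sample, compute its K nearest neighbors within the minority class.
2. For each minority sample, randomly select one or more samples from its K nearest neighbors.
3. Choose a point on the line segment between the minority sample and each selected neighbor as a new sample.
4. Add the new samples to the minority class.

Example code 💯: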
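
The SMOTE procedure described above can be sketched from scratch with NumPy. This is a simplified illustration, not the imbalanced-learn implementation. The function name `smote_sketch`, its parameters, and the toy points are all assumptions made for this example:

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)             # pick a minority sample at random
        nn = rng.choice(neighbours[base])  # pick one of its nearest neighbours
        lam = rng.random()                 # position on the segment, in [0, 1)
        # Interpolate between the sample and the chosen neighbour
        synthetic[i] = X_minority[base] + lam * (X_minority[nn] - X_minority[base])

    # The caller appends these rows to the minority class
    return synthetic


# Hypothetical minority-class points
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_points = smote_sketch(X_min, n_new=10, k=2)
print(new_points.shape)  # (10, 2)
```

Every synthetic point lies on a segment between two existing minority samples, so SMOTE fills in the region the minority class already occupies instead of duplicating points exactly.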

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from collections import Counter


# Random oversampling
def test01(X, y):
    from imblearn.over_sampling import RandomOverSampler
    # Build the random oversampler
    ros = RandomOverSampler(random_state=0)
    # Duplicate minority-class samples at random and return a balanced dataset
    X_resampled, y_resampled = ros.fit_resample(X, y)
    # Check the class counts of the new dataset
    print(Counter(y_resampled))
    # Visualize the data
    plt.title("Randomly oversampled dataset")
    plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
    plt.show()


# Synthetic minority oversampling (SMOTE)
def test02(X, y):
    from imblearn.over_sampling import SMOTE
    # Build the SMOTE object
    smote = SMOTE(random_state=0)
    # Synthesize new minority-class samples and return a balanced dataset
    X_resampled, y_resampled = smote.fit_resample(X, y)
    # Check the class counts of the new dataset
    print(Counter(y_resampled))
    # Visualize the data
    plt.title("SMOTE-oversampled dataset")
    plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
    plt.show()


if __name__ == "__main__":

    # Build a three-class dataset with a 1% / 5% / 94% class split
    X, y = make_classification(n_samples=5000,
                               n_features=2,
                               n_informative=2,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=3,
                               n_clusters_per_class=1,
                               weights=[0.01, 0.05, 0.94],
                               random_state=0)

    # Count the samples in each class
    print(Counter(y))

    # Visualize the data
    plt.title("Class-imbalanced dataset")
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()

    # Random oversampling
    test01(X, y)
    # Synthetic minority oversampling (SMOTE)
    test02(X, y)

3.3 Undersampling

Random undersampling: randomly remove majority-class samples until the class counts are balanced.

Example code 💯:
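
What random undersampling does can be sketched by hand with NumPy. This is a simplified illustration under assumed toy data, not the imbalanced-learn implementation, and `random_undersample` is a hypothetical helper name:

```python
from collections import Counter

import numpy as np


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the smallest class

    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement so no row is kept twice
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]


# Hypothetical labels with a 1% / 5% / 94% split across 500 samples
y = np.array([0] * 5 + [1] * 25 + [2] * 470)
X = np.arange(len(y) * 2).reshape(-1, 2)

X_res, y_res = random_undersample(X, y)
print(Counter(y_res.tolist()))  # Counter({0: 5, 1: 5, 2: 5})
```

The result is balanced, but only 15 of the 500 samples survive. Discarding most of the majority class is the main cost of undersampling.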

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from collections import Counter


def test(X, y):
    from imblearn.under_sampling import RandomUnderSampler
    # Build the random undersampler
    rus = RandomUnderSampler(random_state=0)
    # Drop majority-class samples at random and return a balanced dataset
    X_resampled, y_resampled = rus.fit_resample(X, y)
    # Check the class counts of the new dataset
    print(Counter(y_resampled))
    # Visualize the data
    plt.title("Randomly undersampled dataset")
    plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
    plt.show()


if __name__ == "__main__":

    # Build a three-class dataset with a 1% / 5% / 94% class split
    X, y = make_classification(n_samples=5000,
                               n_features=2,
                               n_informative=2,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=3,
                               n_clusters_per_class=1,
                               weights=[0.01, 0.05, 0.94],
                               random_state=0)

    # Count the samples in each class
    print(Counter(y))

    # Visualize the data
    plt.title("Class-imbalanced dataset")
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()

    # Random undersampling
    test(X, y)