Preprocessing: Handling Class Imbalance in Python (Upsampling the Minority Class)

糯米君_

Article link: https://blog.csdn.net/fgg1234567890/article/details/110209687

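During model fitting, one way to deal with imbalanced class proportions is to assign a larger penalty to wrong predictions on the minority class. In scikit-learn, this penalty can be adjusted conveniently by setting the parameter class_weight='balanced', which most classifiers support.
Other common strategies for dealing with class imbalance include upsampling the minority class, downsampling the majority class, and generating synthetic training samples. Unfortunately, there is no universally best solution and no technique that works best for every problem. In practice it is therefore advisable to try different strategies on the problem at hand, evaluate the results, and choose the technique that fits best.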
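
To make the class_weight option mentioned above more concrete, here is a minimal sketch. The toy data from make_classification, the choice of LogisticRegression, and the names X_demo, y_demo, weights and clf are assumptions made for illustration; they are not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced toy data (an assumption for illustration only):
# roughly 90% of the samples belong to class 0.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], random_state=1)

# 'balanced' weights follow n_samples / (n_classes * np.bincount(y)),
# so the minority class receives the larger weight.
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_demo), y=y_demo)
print(dict(zip(np.unique(y_demo), weights)))

# Passing class_weight='balanced' applies the same weighting during fitting.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_demo, y_demo)
```

The printed dictionary shows how much more a mistake on the minority class costs than a mistake on the majority class; the exact values depend on the class counts in the data.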

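The scikit-learn library implements a simple resample function that can help upsample the minority class by drawing new samples from the dataset with replacement: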

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample

df = pd.read_csv('xxx\\wdbc.data',
                 header=None)
print(df.head())

X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

# Take all 357 benign samples and stack them with the first 40 malignant samples to create a clearly imbalanced dataset.
X_imb = np.vstack((X[y == 0], X[y == 1][:40]))
y_imb = np.hstack((y[y == 0], y[y == 1][:40]))

# A model that always predicts the majority class (benign, class 0) would reach about 90% accuracy
y_pred = np.zeros(y_imb.shape[0])
print(np.mean(y_pred == y_imb))  # fraction of correct predictions

# Upsample the minority class
print('Number of class 1 samples before:', X_imb[y_imb == 1].shape[0])

X_upsampled, y_upsampled = resample(X_imb[y_imb == 1],
                                    y_imb[y_imb == 1],
                                    replace=True,
                                    n_samples=X_imb[y_imb == 0].shape[0],
                                    random_state=123)

print('Number of class 1 samples after:', X_upsampled.shape[0])

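# Stack the original class 0 samples with the upsampled class 1 samples
X_bal = np.vstack((X_imb[y_imb == 0], X_upsampled))
y_bal = np.hstack((y_imb[y_imb == 0], y_upsampled))

# The dataset is now balanced, so a majority-vote prediction rule achieves only 50% accuracy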
y_pred = np.zeros(y_bal.shape[0])
print(np.mean(y_pred == y_bal) * 100)
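
Downsampling the majority class, one of the other strategies listed earlier, can be sketched with the same resample function. The snippet below is an illustration under stated assumptions: it uses random placeholder features with the same 357/40 class split instead of the WDBC file, and the names X_demo, y_demo, X_down and y_down are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Placeholder data (assumption): 357 majority and 40 minority samples
# with 5 random features each, standing in for the real dataset.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(397, 5))
y_demo = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))

n_minority = X_demo[y_demo == 1].shape[0]

# Draw without replacement so that every kept majority sample is unique.
X_down, y_down = resample(X_demo[y_demo == 0],
                          y_demo[y_demo == 0],
                          replace=False,
                          n_samples=n_minority,
                          random_state=123)

X_bal_down = np.vstack((X_down, X_demo[y_demo == 1]))
y_bal_down = np.hstack((y_down, y_demo[y_demo == 1]))
print(np.bincount(y_bal_down))  # [40 40]
```

The trade-off compared with upsampling is that most of the majority-class samples are discarded, which may matter when the dataset is already small.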

Output:
          0  1      2      3       4  ...      27      28      29      30       31
0    842302  M  17.99  10.38  122.80  ...  0.6656  0.7119  0.2654  0.4601  0.11890
1    842517  M  20.57  17.77  132.90  ...  0.1866  0.2416  0.1860  0.2750  0.08902
2  84300903  M  19.69  21.25  130.00  ...  0.4245  0.4504  0.2430  0.3613  0.08758
3  84348301  M  11.42  20.38   77.58  ...  0.8663  0.6869  0.2575  0.6638  0.17300
4  84358402  M  20.29  14.34  135.10  ...  0.2050  0.4000  0.1625  0.2364  0.07678

[5 rows x 32 columns]
0.8992443324937027
Number of class 1 samples before: 40
Number of class 1 samples after: 357
50.0
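
The roughly 90% accuracy of the always-benign baseline hides the fact that it never detects a malignant case. The sketch below, which is an added illustration and not part of the original code, rebuilds only the 357/40 label vector and compares accuracy with recall and balanced accuracy from sklearn.metrics; the names y_true and y_majority are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Labels with the same class split as the imbalanced dataset above:
# 357 samples of class 0 (benign) and 40 of class 1 (malignant).
y_true = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))

# Baseline that always predicts the majority class.
y_majority = np.zeros_like(y_true)

print(accuracy_score(y_true, y_majority))           # about 0.899
print(recall_score(y_true, y_majority))             # 0.0, no class 1 sample is found
print(balanced_accuracy_score(y_true, y_majority))  # 0.5
```

Metrics such as recall or balanced accuracy are one reasonable way to compare the strategies discussed here, since plain accuracy rewards ignoring the minority class.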