Machine Learning 11: Sampling for Imbalanced Data


哎呦-_-不错

Category: Machine Learning Basics. Tags: machine learning, python, imbalanced data sampling

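Original articles on this blog may not be used for commercial purposes without my permission. Please credit the source when reposting.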

Article link: https://blog.csdn.net/weixin_46649052/article/details/108602302


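This article covers strategies for handling imbalanced data: oversampling methods (random oversampling, SMOTE, Borderline-SMOTE, and ADASYN), undersampling methods (prototype generation, prototype selection, and NearMiss), and the combined over- and undersampling methods SMOTEENN and SMOTETomek.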

Summary generated automatically by CSDN.


1. Oversampling (upsampling)

1) Random oversampling

# Count samples per class
from collections import Counter

import matplotlib.pyplot as plt
# Oversampling
from imblearn.over_sampling import RandomOverSampler
# Build a synthetic dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

print(Counter(y))  # Counter({2: 47, 1: 2, 0: 1})
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

ros = RandomOverSampler(random_state=0)
# Oversample: duplicate minority-class rows until all classes are the same size
X_resampled, y_resampled = ros.fit_resample(X, y)

plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()
print(Counter(y_resampled))         # Counter({2: 47, 1: 47, 0: 47})
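
To make clearer what random oversampling does, here is a minimal NumPy sketch of the same idea: rows of each smaller class are drawn with replacement until every class matches the largest one. The function name `random_oversample` and the toy arrays are hypothetical, introduced only for illustration. This is not how `RandomOverSampler` is implemented internally.

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate rows of the smaller classes (drawn with replacement)
    until every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    counts = Counter(y.tolist())
    target = max(counts.values())
    X_parts, y_parts = [X], [y]
    for label, count in counts.items():
        if count == target:
            continue  # already as large as the biggest class
        idx = np.flatnonzero(y == label)
        # Sample existing row indices of this class with replacement
        extra = rng.choice(idx, size=target - count, replace=True)
        X_parts.append(X[extra])
        y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy data (hypothetical values): 6 rows of class 2, 2 of class 1, 1 of class 0
X_toy = np.arange(18, dtype=float).reshape(9, 2)
y_toy = np.array([2, 2, 2, 2, 2, 2, 1, 1, 0])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(sorted(Counter(y_bal.tolist()).items()))  # [(0, 6), (1, 6), (2, 6)]
```

Because the new rows are exact copies of existing ones, the resampled scatter plot looks the same as the original: duplicated points sit exactly on top of each other.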


2) SMOTE

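SMOTE picks minority-class samples at random to synthesize new samples, without considering the samples around them. This tends to cause two problems:

If the selected minority sample is surrounded only by other minority samples, the synthesized sample adds little useful information. This is similar to a support vector machine, where points far from the margin have little influence on the decision boundary.
If the selected minority sample is surrounded only by majority samples, it may be noise, and the synthesized sample will largely overlap with the surrounding majority samples, making classification harder.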
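
The synthesis step behind these two problems can be sketched as follows. This is a simplified illustration of the usual SMOTE interpolation, assuming the minority samples contain no duplicate rows. The names `smote_sketch` and `X_min` are hypothetical, and the real `imblearn` `SMOTE` class handles more cases than this. Note that the neighbours are searched among minority samples only, which is why the surrounding majority samples play no role.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from the minority samples X_min by
    interpolating between a sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours: with no duplicate rows, each point is
    # returned as its own nearest neighbour, so column 0 is dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # Randomly pick base samples, ignoring what surrounds them
    base = rng.integers(0, len(X_min), size=n_new)
    # For each base sample, pick one of its k minority neighbours
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between them
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class (hypothetical values): 13 points in 2 dimensions
X_min = np.random.default_rng(1).normal(size=(13, 2))
X_new = smote_sketch(X_min, n_new=20)
print(X_new.shape)  # (20, 2)
```

Every synthetic point lies on a line segment between two existing minority samples, so a noisy minority sample deep inside the majority region produces new points in that same region.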

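# The basic idea of SMOTE is to analyze the minority-class samples and
# synthesize new samples from them, which are then added to the dataset
from collections import Counter
import matplotlib.pyplot as plt
# Oversampling
from imblearn.over_sampling import RandomOverSampler, SMOTE
# Build a synthetic dataset
from sklearn.datasets import make_classification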

X, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           random_state=500)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
print(Counter(y))           # Counter({2: 930, 1: 57, 0: 13})
X_resampled_smote, y_resampled_smote = SMOTE().fit_resample