Handling Imbalanced Data: Oversampling and Undersampling (Python Code)

gao_vip

Last modified: 2023-06-27 11:11:27


First published: 2023-06-26 23:15:00

Article link: https://blog.csdn.net/weixin_41233157/article/details/131403617

sklearn.datasets.make_classification
Documentation link:
https://www.w3cschool.cn/doc_scikit_learn/scikit_learn-modules-generated-sklearn-datasets-make_classification.html

sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

I. Principles of Oversampling and Undersampling

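Class imbalance is a very common problem in data analysis and modeling. It is usually handled with oversampling, undersampling, or a combination of the two. Common oversampling methods include random oversampling, SMOTE (interpolation-based), and ADASYN (which uses k-nearest neighbors to decide where more synthetic samples are needed). Common undersampling methods include random undersampling, EasyEnsemble, BalanceCascade, NearMiss, Tomek Links, and Edited Nearest Neighbours (ENN).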

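Oversampling (also called up-sampling) balances the classes by increasing the number of minority-class samples.
SMOTE, for example, generates synthetic samples by interpolation instead of duplicating existing minority samples. This expands the region of feature space covered by the minority class, which can help the model learn the minority class's characteristics and improve its performance. The main steps are: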

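1. Randomly pick one sample from the minority class.
2. Find that sample's K nearest neighbors within the minority class (say K = 5).
3. Randomly pick one of those K neighbors.
4. Pick a random point on the line segment between the sample and the chosen neighbor; this point is the new synthetic sample.
5. Repeat the steps above until the required number of synthetic samples has been generated.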
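The steps above can be sketched in a few lines of NumPy. This is only an illustration of the interpolation idea, not the imblearn implementation: the function name `smote_sketch` and its parameters are made up for this example, and it uses a brute-force distance matrix that is only suitable for small arrays.

```python
import numpy as np


def smote_sketch(x_minority, n_new, k=5, seed=0):
    """Hypothetical helper: interpolate between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    n = len(x_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between all minority samples
    diff = x_minority[:, None, :] - x_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)               # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # step 2: k nearest neighbors per sample

    base = rng.integers(0, n, size=n_new)        # step 1: random minority samples
    pick = rng.integers(0, k, size=n_new)        # step 3: one random neighbor for each
    chosen = neighbors[base, pick]
    gap = rng.random((n_new, 1))                 # step 4: random position on the segment
    return x_minority[base] + gap * (x_minority[chosen] - x_minority[base])


# Toy minority class: 20 points in 2 dimensions
x_min = np.random.default_rng(42).normal(size=(20, 2))
x_syn = smote_sketch(x_min, n_new=30)
print(x_syn.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points never leave the bounding box of the original minority data.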

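Undersampling (also called down-sampling) balances the classes by reducing the number of majority-class samples.
Random undersampling, for example, randomly selects a portion of the majority-class samples and deletes them. Its major drawback is that it ignores how the samples are distributed, and because the selection is entirely random, it may discard important information carried by the majority class.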
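As a rough sketch of the idea, random undersampling amounts to drawing, without replacement, as many indices from each class as the smallest class has. The helper name `random_undersample` below is hypothetical and only meant to illustrate the mechanism; in practice imblearn's `RandomUnderSampler` does this job.

```python
import numpy as np


def random_undersample(x, y, seed=0):
    """Hypothetical helper: keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Sampling without replacement; the smallest class keeps all of its samples
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))
    return x[kept], y[kept]


# Toy data: 10 samples of class 0 and 90 samples of class 1
y_demo = np.array([0] * 10 + [1] * 90)
x_demo = np.arange(100, dtype=float).reshape(-1, 1)
x_bal, y_bal = random_undersample(x_demo, y_demo)
print(np.bincount(y_bal))  # [10 10]
```

Which 10 majority samples survive depends only on the seed, which is exactly the weakness described above: nothing in the selection looks at where the samples sit in feature space.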

II. Code Example

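# -*- coding: utf-8 -*-
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler  # other samplers, not used below
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

# Generate a random imbalanced dataset.
# weights=[0.1, 0.9]: class 0 is the ~10% minority, class 1 the ~90% majority.
x, y = make_classification(n_samples=1000, n_features=5, n_classes=2,
                           weights=[0.1, 0.9], random_state=123)
print('Original positives:', np.sum(y == 1), 'Original negatives:', np.sum(y == 0), 'Original total:', len(x))


# SMOTE oversampling: synthesize minority samples until both classes are the same size
smote = SMOTE(random_state=123)
x_new, y_new = smote.fit_resample(x, y)
print('Positives after SMOTE:', np.sum(y_new == 1), 'Negatives after SMOTE:', np.sum(y_new == 0), 'Total after SMOTE:', len(x_new))

# Random undersampling: drop majority samples until both classes are the same size
rus = RandomUnderSampler(random_state=123)
x_new2, y_new2 = rus.fit_resample(x, y)
print('Positives after random undersampling:', np.sum(y_new2 == 1), 'Negatives after random undersampling:', np.sum(y_new2 == 0), 'Total after random undersampling:', len(x_new2))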

Results: (screenshot of the printed sample counts; image not available)

Appendix

=================================================
Parameters of make_classification

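1. n_samples: int, optional (default=100)
The number of samples.

2. n_features: int, optional (default=20)
The total number of features per sample.

3. n_informative: int, optional (default=2)
The number of informative features. It must be large enough that n_classes * n_clusters_per_class does not exceed 2 ** n_informative.

4. n_redundant: int, optional (default=2)
The number of redundant features, generated as random linear combinations of the informative features.

5. n_repeated: int, optional (default=0)
The number of duplicated features, drawn randomly from the informative and redundant features.

6. n_classes: int, optional (default=2)
The number of classes in the dataset.

7. n_clusters_per_class: int, optional (default=2)
The number of clusters per class.

8. weights: list of floats or None (default=None)
The proportion of samples assigned to each class. If None, the classes are balanced. Note that if len(weights) == n_classes - 1, the weight of the last class is inferred automatically. If the weights sum to more than 1, more than n_samples samples may be returned.

9. flip_y: float, optional (default=0.01)
The fraction of samples whose class label is randomly reassigned. Larger values add more label noise.

10. random_state: int, RandomState instance or None, optional (default=None)
The seed for the random number generator.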
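A few of these parameters are easier to understand by trying them. The following sketch, added here for illustration with arbitrarily chosen values, compares class counts with and without label noise (`flip_y`) for the same `weights`, and shows the error raised when `n_informative` is too small for the requested number of classes and clusters.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same weights, with label noise disabled (flip_y=0) and exaggerated (flip_y=0.2)
_, y_clean = make_classification(n_samples=1000, n_features=5, weights=[0.1, 0.9],
                                 flip_y=0, random_state=0)
_, y_noisy = make_classification(n_samples=1000, n_features=5, weights=[0.1, 0.9],
                                 flip_y=0.2, random_state=0)
print('flip_y=0  :', np.bincount(y_clean))
print('flip_y=0.2:', np.bincount(y_noisy))

# 3 classes * 2 clusters per class = 6 clusters, but 2 ** n_informative is only 4
try:
    make_classification(n_informative=2, n_classes=3, n_clusters_per_class=2)
    constraint_enforced = False
except ValueError as err:
    constraint_enforced = True
    print('ValueError:', err)
```

With `flip_y=0` the class counts should follow `weights` closely, while a large `flip_y` pulls the proportions toward balance because the reassigned labels are drawn at random.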