Imblearn package study (handling imbalanced data with over-sampling, under-sampling, and combined sampling)

芒萝

Category: python. Tags: python, imbalanced data

Article link: https://blog.csdn.net/kizgel/article/details/78553009

Imblearn package study
1. Preliminaries
- 1.1 Compressed Sparse Rows (CSR)
2. Over-sampling
- 2.1 Practical examples
3. Under-sampling
- 3.1 Prototype generation
- 3.2 Prototype selection
  - 3.2.1 Controlled under-sampling techniques
  - 3.2.2 Cleaning under-sampling techniques
4. Combining over-sampling and under-sampling
5. Ensemble examples
- 5.1 Example
- 5.2 Chaining ensemble of samplers and estimators
6. Loading data
- 6.1 Imbalanced datasets
- 6.2 Generating imbalanced data
References

Imblearn package study

1. Preliminaries

Sparse input

For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to the sampler. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.

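1.1 Compressed Sparse Rows (CSR)

A sparse matrix contains many zero elements, so storing it as a full dense matrix A wastes a lot of memory.

CSR compresses the matrix row by row and represents the original matrix with three arrays:

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])

data array: stores all the non-zero elements of matrix A, row by row;

indices array: the column index in matrix A of each element in data;

indptr array: for each row of A, the index in data of that row's first non-zero element, followed by one final entry equal to the total number of non-zero elements. Row i therefore occupies data[indptr[i]:indptr[i+1]].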

from scipy import sparse
mtx = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
mtx.todense()

Out[27]: 
matrix([[1, 0, 2],
        [0, 0, 3],
        [4, 5, 6]])
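
To make the role of the three arrays concrete, here is a minimal pure-Python sketch that expands them back into a dense matrix. The helper name csr_to_dense is hypothetical and not part of scipy; it only illustrates how indptr slices data and indices row by row.

```python
def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a dense list-of-lists matrix (illustrative helper)."""
    dense = []
    for row in range(len(indptr) - 1):
        values = [0] * n_cols
        # Row `row` owns the slice data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            values[indices[k]] = data[k]
        dense.append(values)
    return dense


data = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 0, 1, 2]
indptr = [0, 2, 3, 6]
dense = csr_to_dense(data, indices, indptr, n_cols=3)
print(dense)  # [[1, 0, 2], [0, 0, 3], [4, 5, 6]]
```

The result matches the output of mtx.todense() above.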

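Why study imbalanced data at all? When the proportions of positive and negative samples are extremely skewed, the model's predictions become biased toward the majority class. For details, see the official documentation, which uses a support vector machine to visualize the classification results under different class ratios.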

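2. Over-sampling

2.1 Practical examples

2.1.1 Naive random over-sampling

The simplest way to deal with imbalanced data is to generate more minority-class samples. The most basic approach is to sample randomly, with replacement, from the existing minority samples and add the draws as new samples. The RandomOverSampler class implements this.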

from sklearn.datasets import make_classification
from collections import Counter
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
Counter(y)
Out[10]: Counter({0: 64, 1: 262, 2: 4674})

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
# fit_sample was renamed to fit_resample in later imbalanced-learn releases
X_resampled, y_resampled = ros.fit_resample(X, y)


sorted(Counter(y_resampled).items())
Out[13]:
[(0, 4674), (1, 4674), (2, 4674)]

As shown above, simple random sampling of the minority-class samples brings the class ratio to 1:1:1.
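
To show what this kind of resampling amounts to, here is a minimal standard-library sketch. The function random_over_sample and the toy data are hypothetical and are not imblearn's implementation; the sketch assumes every class is raised to the size of the largest class by drawing with replacement.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original samples belonging to this class.
        pool = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement to fill the gap up to the majority size.
        for i in rng.choices(pool, k=target - count):
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


X_toy = [[0.0, 0.1], [0.2, 0.3], [1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]]
y_toy = [0, 0, 1, 1, 1, 1]
X_bal, y_bal = random_over_sample(X_toy, y_toy)
print(sorted(Counter(y_bal).items()))  # [(0, 4), (1, 4)]
```

Every added row is an exact copy of an existing minority row, which is why this method adds no new information and can encourage over-fitting on the duplicated points.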

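2.1.2 From random over-sampling to `SMOTE` and `ADASYN`

Besides random over-sampling, there are two popular methods for over-sampling the minority classes: (i) Synthetic Minority Oversampling Technique (SMOTE); (ii) Adaptive Synthetic (ADASYN).

SMOTE: for a minority-class sample a, randomly pick one of its nearest neighbors b, then randomly choose a point c on the line segment between a and b as a new minority-class sample;

ADASYN: focuses on generating new minority-class samples near those original samples that are misclassified by a k-nearest-neighbors classifier.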

from imblearn.over_sampling import SMOTE, ADASYN

X_resampled_smote, y_resampled_smote = SMOTE().fit_resample(X, y)

sorted(Counter(y_resampled_smote).items())
Out[29]:
[(0, 4674), (1, 4674), (2, 4674)]
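
The interpolation step in the SMOTE description can be sketched as follows. This is a simplified standard-library illustration, not imblearn's code: the function smote_sketch, its parameters, and the toy points are hypothetical, and it assumes neighbors are searched only among the minority samples.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbors of a, excluding a itself.
        others = [p for p in minority if p is not a]
        neighbors = sorted(others, key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbors)
        # c = a + t * (b - a) with t in [0, 1) lies on the segment from a to b.
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


minority_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_toy, n_new=6, k=2)
print(len(new_points))  # 6
```

Unlike random over-sampling, the new points are not copies: each one lies strictly between two existing minority samples.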

X_resampled_adasyn, y_resampled_adasyn = ADASYN().fit_resample(X, y)

sorted(Counter(y_resampled_adasyn).items())
Out[30]:
[(0, 4674), (1, 4674), (2, 4674)]
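
ADASYN's focus on hard-to-classify samples can be illustrated by the weighting step alone. The sketch below assumes the usual formulation, in which each minority sample is weighted by the fraction of its k nearest neighbors that belong to another class; the function adasyn_weights and the toy data are hypothetical, and the interpolation that follows the weighting is omitted.

```python
import math


def adasyn_weights(X, y, minority_label, k=3):
    """Return, per minority sample, its share of the synthetic samples to generate."""
    ratios = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # k nearest neighbors of xi among all other samples, whatever their class.
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: math.dist(xi, X[j]))[:k]
        # Fraction of those neighbors that belong to another class.
        ratios.append(sum(y[j] != minority_label for j in nearest) / k)
    total = sum(ratios)
    if total == 0:
        # No minority sample has foreign neighbors: spread evenly.
        return [1 / len(ratios)] * len(ratios)
    # Normalize so the weights sum to 1; harder samples get larger weights.
    return [r / total for r in ratios]


# Four minority points (label 0) followed by five majority points (label 1).
X_toy = [[0.0, 0.0], [0.1, 0.0], [1.2, 0.0], [2.0, 0.0],
         [2.1, 0.0], [2.25, 0.0], [2.45, 0.0], [2.6, 0.0], [2.8, 0.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1]
weights = adasyn_weights(X_toy, y_toy, minority_label=0, k=2)
print([round(w, 2) for w in weights])  # [0.0, 0.0, 0.33, 0.67]
```

The two minority points deep inside their own cluster get weight 0, while the point surrounded by majority samples receives the largest share of new samples.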

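2.1.3 Variants of `SMOTE`

The basic SMOTE algorithm treats all minority-class samples alike, which can lead to a sub-optimal decision function. Several SMOTE variants were therefore proposed: they focus on the minority-class samples near the border of the optimal decision function and generate samples in the direction opposite to the nearest-neighbor class.

The kind parameter of SMOTE selects which variant to use: (i) borderline1, (ii) borderline2, (iii) svm: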

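from imblearn.over_sampling import SMOTE, ADASYN
# Newer imbalanced-learn releases drop `kind` in favor of the separate
# BorderlineSMOTE and SVMSMOTE classes.
X_resampled, y_resampled = SMOTE(kind='borderline1').fit_resample(X, y)

print(sorted(Counter(y_resampled).items()))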
Out[31]:
[(0, 4674), (1, 4674), (2, 4674)]
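
How the borderline variants decide which minority samples to focus on can be sketched as below. This assumes the usual Borderline-SMOTE rule: a minority sample is "noise" if all of its m nearest neighbors belong to other classes, "danger" if at least half of them do, and "safe" otherwise; only the "danger" samples are then used to generate new points. The function classify_minority and the toy data are hypothetical.

```python
import math


def classify_minority(X, y, minority_label, m=4):
    """Label each minority sample as 'safe', 'danger' or 'noise' by its neighborhood."""
    kinds = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # m nearest neighbors of xi among all other samples, whatever their class.
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: math.dist(xi, X[j]))[:m]
        n_other = sum(y[j] != minority_label for j in nearest)
        if n_other == m:
            kinds[i] = "noise"    # surrounded entirely by other classes
        elif n_other >= m / 2:
            kinds[i] = "danger"   # near the class border
        else:
            kinds[i] = "safe"     # well inside the minority region
    return kinds


# Four minority points (label 0) followed by five majority points (label 1).
X_toy = [[0.0, 0.0], [0.1, 0.0], [1.2, 0.0], [2.0, 0.0],
         [2.1, 0.0], [2.25, 0.0], [2.45, 0.0], [2.6, 0.0], [2.8, 0.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1]
kinds = classify_minority(X_toy, y_toy, minority_label=0, m=2)
print(kinds)  # {0: 'safe', 1: 'safe', 2: 'danger', 3: 'noise'}
```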