Data Analysis Notes: How to Balance Samples by Resampling with the imblearn Library (Over-sampling + Under-sampling)

暖仔会飞

Last modified: 2022-05-04 00:58:46


First published: 2021-12-02 13:00:02

Article link: https://blog.csdn.net/qq_42902997/article/details/121674852


What is class imbalance

import pandas as pd
import numpy as np
import seaborn as sns


values = {"姓名":["A","B","C","D","E","F","G","H","I","J","K","L","G","H","I","J","K","L"],
         "年龄":[55,70,80,90,60,30,67,44,60,30,67,44,30,67,30,67,30,67],
          "头发颜色":["白","白","白","白","白","黑","白","黑","白","白","黑","白","白","黑","白","白","黑","黑"]}
table = pd.DataFrame(values)

table

	姓名	年龄	头发颜色
0	A	55	白
1	B	70	白
2	C	80	白
3	D	90	白
4	E	60	白
5	F	30	黑
6	G	67	白
7	H	44	黑
8	I	60	白
9	J	30	白
10	K	67	黑
11	L	44	白
12	G	30	白
13	H	67	黑
14	I	30	白
15	J	67	白
16	K	30	黑
17	L	67	黑

table["头发颜色"] = pd.Categorical(table["头发颜色"]).codes

table

	姓名	年龄	头发颜色
0	A	55	0
1	B	70	0
2	C	80	0
3	D	90	0
4	E	60	0
5	F	30	1
6	G	67	0
7	H	44	1
8	I	60	0
9	J	30	0
10	K	67	1
11	L	44	0
12	G	30	0
13	H	67	1
14	I	30	0
15	J	67	0
16	K	30	1
17	L	67	1

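The histogram below shows that when 头发颜色 (hair color) is used as the classification label, the classes are imbalanced:
there are 12 white-haired samples (code 0) but only 6 black-haired ones (code 1).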

table["头发颜色"].plot(kind="hist")
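
The class counts can also be read off directly rather than from a histogram. A minimal sketch, assuming the label column has already been encoded as 0 (白) and 1 (黑) as above; the names `labels`, `counts`, and `imbalance_ratio` are illustrative only:

```python
import pandas as pd

# Encoded label column of the example table: 0 = 白 (white), 1 = 黑 (black).
labels = pd.Series(
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
    name="头发颜色",
)

# Number of rows per class, largest class first.
counts = labels.value_counts()

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(imbalance_ratio)
```

A ratio well above 1 is a quick signal that some form of resampling may be worth considering.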


How to balance the dataset: resampling

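The goal is for every label in the dataset to end up with roughly the same number of samples.
There are two ways to get there: randomly drop part of the majority class (under-sampling), or expand the minority class (over-sampling).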
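
The two options differ only in which class size becomes the target. A small sketch of that arithmetic, using a hypothetical helper `resampling_targets` that is not part of pandas or imblearn:

```python
from collections import Counter


def resampling_targets(labels):
    """Return the per-class row count each strategy aims for (hypothetical helper)."""
    counts = Counter(labels)
    return {
        "under_sampling": min(counts.values()),  # shrink every class to the smallest one
        "over_sampling": max(counts.values()),   # grow every class to the largest one
    }


# 12 samples of class 0 and 6 of class 1, as in the example table.
targets = resampling_targets([0] * 12 + [1] * 6)
print(targets)
```

Under-sampling discards data, while over-sampling adds repeated or synthetic rows, so which one fits better depends on how much data there is to begin with.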

Under-sampling

As the name suggests, under-sampling shrinks the larger class.

Select all rows of the majority class

df_white = table.loc[table["头发颜色"] == 0]  # rows with white hair (majority class)
df_black = table.loc[table["头发颜色"] == 1]  # rows with black hair (minority class)

df_white

	姓名	年龄	头发颜色
0	A	55	0
1	B	70	0
2	C	80	0
3	D	90	0
4	E	60	0
6	G	67	0
8	I	60	0
9	J	30	0
11	L	44	0
12	G	30	0
14	I	30	0
15	J	67	0

df_black

	姓名	年龄	头发颜色
5	F	30	1
7	H	44	1
10	K	67	1
13	H	67	1
16	K	30	1
17	L	67	1

Randomly sample a fixed number of rows to keep

df_white = df_white.sample(n=6,random_state=30)

df_white

	姓名	年龄	头发颜色
0	A	55	0
8	I	60	0
12	G	30	0
11	L	44	0
1	B	70	0
3	D	90	0
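
The `n=6` above is hard-coded to the size of the minority class. A sketch that derives it from the data instead, assuming the same column layout; the frame is rebuilt here without the 姓名 column so the snippet runs on its own, and `df_white_down` is an illustrative name:

```python
import pandas as pd

table = pd.DataFrame({
    "年龄": [55, 70, 80, 90, 60, 30, 67, 44, 60, 30, 67, 44, 30, 67, 30, 67, 30, 67],
    "头发颜色": [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
})

df_white = table.loc[table["头发颜色"] == 0]  # majority class
df_black = table.loc[table["头发颜色"] == 1]  # minority class

# Keep exactly as many majority rows as there are minority rows.
df_white_down = df_white.sample(n=len(df_black), random_state=30)

print(len(df_white_down), len(df_black))
```

This way the code keeps working if the class sizes change.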

Concatenate with the minority class to form the final dataset

table_undersampling = pd.concat([df_black,df_white],axis=0,ignore_index=True)

table_undersampling

	姓名	年龄	头发颜色
0	F	30	1
1	H	44	1
2	K	67	1
3	H	67	1
4	K	30	1
5	L	67	1
6	A	55	0
7	I	60	0
8	G	30	0
9	L	44	0
10	B	70	0
11	D	90	0

The under-sampled dataset is now balanced

table_undersampling["头发颜色"].plot(kind="hist")


Over-sampling

Expand the minority class with the imblearn library

from imblearn.over_sampling import SMOTE

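# sampling_strategy controls which classes are over-sampled.
# 'minority' resamples only the smallest class, so it balances the data only in the two-class case.
# 'auto' (equivalent to 'not majority' for over-sampling) resamples every class except the largest,
# so it also balances datasets with more than two classes.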
sm = SMOTE(sampling_strategy='auto', random_state=7)


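# Fit the sampler and generate the resampled data.
# SMOTE interpolates numeric features, so the non-numeric 姓名 column is dropped along with the label.
oversampled_data, oversampled_label = sm.fit_resample(table.drop(['姓名', '头发颜色'], axis=1), table['头发颜色'])
oversampled_table = pd.concat([oversampled_data, oversampled_label], axis=1)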
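
`fit_resample` returns the original rows plus synthetic minority rows. SMOTE builds each synthetic row by taking a minority sample, choosing one of its nearest minority neighbours, and interpolating between the two. The following is a simplified NumPy illustration of that single step for the one-feature case used here. It is not imblearn's implementation, and `synthesize_one` and its parameters are hypothetical names:

```python
import numpy as np


def synthesize_one(minority, i, k, rng):
    """Return one synthetic sample between minority[i] and one of its k nearest neighbours."""
    x = minority[i]
    dist = np.linalg.norm(minority - x, axis=1)  # distance from x to every minority sample
    order = np.argsort(dist)
    neighbours = order[order != i][:k]           # k closest samples, excluding x itself
    x_nn = minority[rng.choice(neighbours)]      # pick one neighbour at random
    lam = rng.random()                           # interpolation factor in [0, 1)
    return x + lam * (x_nn - x)


# 年龄 values of the six black-haired (minority) rows in the example table.
minority = np.array([[30.0], [44.0], [67.0], [67.0], [30.0], [67.0]])
rng = np.random.default_rng(7)

# Row 1 has age 44; its two nearest neighbours both have age 30,
# so the synthetic age falls somewhere between 30 and 44.
new_sample = synthesize_one(minority, i=1, k=2, rng=rng)
print(new_sample)
```

Because the new values are interpolated, the synthetic ages need not match any age in the original table.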

The over-sampled dataset is now balanced

oversampled_table["头发颜色"].plot(kind="hist")

