SMOTE sampling: handling imbalanced data

from imblearn.over_sampling import SMOTE
import pandas as pd 
C:\ProgramData\Anaconda3\lib\importlib\_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
df = pd.read_csv('base_done.csv')
data = df[:20].iloc[:,1:10]
data
    sex    age  provider  level  verified  using_time  regist_type  card_a_cnt  card_b_cnt
 0    0  24853         0      1         0       24713            1       24719       24712
 1    1  25011         0      1         0       24743            7       24712       24712
 2    1  24877         0      2         0       24744            7       24719       24725
 3    0  24925         0      2         0       24715            1       24712       24712
 4    1  24877         2      1         0       24706            3       24712       24712
 5    0  24944         0      2         0       24720            1       24719       24712
 6    0  24840         0      2         0       24727            1       24712       24712
 7    0  24944         0      2         0       24709            1       24712       24712
 8    0  24908         0      2         0       24730            1       24712       24712
 9    1  24956         0      2         0       24741            7       24719       24719
10    0  24920         0      2         0       24741            7       24719       24719
11    0  24871         0      2         0       24722            2       24712       24712
12    1  24889         0      2         0       24725            1       24719       24719
13    1  24865         2      2         0       24710            3       24712       24712
14    0  24944         0      2         0       24727            1       24712       24712
15    1  24931         0      2         0       24733            7       24712       24719
16    0  24963         0      1         0       24721            2       24712       24712
17    0  24877         0      2         0       24727            1       24712       24712
18    0  24901         0      2         0       24733            7       24725       24719
19    0  24859         2      1         0       24713            3       24712       24712
X = data.drop(columns='provider')  # keep X as a DataFrame so that X_res keeps the column names
y = data.provider
data.provider.value_counts()
0    17
2     3
Name: provider, dtype: int64
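sm = SMOTE(sampling_strategy={0:17,2:15},k_neighbors=2)  # sampling_strategy: target sample count per class after resampling (the default 'auto' balances the classes 1:1); k_neighbors: number of nearest neighbours used by the kNN step
X_res, y_res = sm.fit_resample(X, y)  # returns the original samples followed by the synthetic ones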
y_res.value_counts()
0    17
2    15
Name: provider, dtype: int64
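
The dictionary passed as `sampling_strategy` gives the desired number of samples per class after resampling, so the number of synthetic rows to generate for a class is its target minus its current count. A minimal standard-library sketch of that arithmetic follows; the helper name `synthetic_counts` is hypothetical and is not part of imblearn.

```python
from collections import Counter


def synthetic_counts(labels, targets):
    """Return how many synthetic samples each class needs to reach its target.

    `targets` mirrors the dict given to sampling_strategy: {class: final count}.
    """
    current = Counter(labels)
    needed = {}
    for label, target in targets.items():
        if target < current[label]:
            # Over-sampling can only add rows, it never removes them
            raise ValueError(f"target for class {label!r} is below its current size")
        needed[label] = target - current[label]
    return needed


# Class distribution from the example above: 17 rows of class 0, 3 rows of class 2
labels = [0] * 17 + [2] * 3
print(synthetic_counts(labels, {0: 17, 2: 15}))  # {0: 0, 2: 12}
```

With the strategy used here, class 0 is left untouched and class 2 needs 12 new rows.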
X_res
    sex    age  level  verified  using_time  regist_type  card_a_cnt  card_b_cnt
 0    0  24853      1         0       24713            1       24719       24712
 1    1  25011      1         0       24743            7       24712       24712
 2    1  24877      2         0       24744            7       24719       24725
 3    0  24925      2         0       24715            1       24712       24712
 4    1  24877      1         0       24706            3       24712       24712
 5    0  24944      2         0       24720            1       24719       24712
 6    0  24840      2         0       24727            1       24712       24712
 7    0  24944      2         0       24709            1       24712       24712
 8    0  24908      2         0       24730            1       24712       24712
 9    1  24956      2         0       24741            7       24719       24719
10    0  24920      2         0       24741            7       24719       24719
11    0  24871      2         0       24722            2       24712       24712
12    1  24889      2         0       24725            1       24719       24719
13    1  24865      2         0       24710            3       24712       24712
14    0  24944      2         0       24727            1       24712       24712
15    1  24931      2         0       24733            7       24712       24719
16    0  24963      1         0       24721            2       24712       24712
17    0  24877      2         0       24727            1       24712       24712
18    0  24901      2         0       24733            7       24725       24719
19    0  24859      1         0       24713            3       24712       24712
20    0  24860      1         0       24712            3       24712       24712
21    0  24871      1         0       24708            3       24712       24712
22    0  24872      1         0       24707            3       24712       24712
23    1  24865      1         0       24709            3       24712       24712
24    0  24863      1         0       24710            3       24712       24712
25    0  24859      1         0       24712            3       24712       24712
26    1  24870      1         0       24708            3       24712       24712
27    1  24875      1         0       24706            3       24712       24712
28    1  24872      1         0       24707            3       24712       24712
29    0  24873      1         0       24707            3       24712       24712
30    0  24862      1         0       24711            3       24712       24712
31    0  24859      1         0       24712            3       24712       24712
32    0  24873      1         0       24707            3       24712       24712
33    1  24866      1         0       24709            3       24712       24712
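
Rows 20 to 33 of `X_res` are the synthetic samples. Each one is a point on the line segment between a class-2 sample and one of its nearest class-2 neighbours. A rough standard-library illustration, using class-2 rows 4 and 19 as read from the output above; the helper `interpolate` and the fixed `gap=0.5` are assumptions for illustration, whereas SMOTE draws the gap at random between 0 and 1.

```python
def interpolate(x_i, x_j, gap):
    """Return x_i + gap * (x_j - x_i), element-wise, for 0 <= gap <= 1."""
    return [a + gap * (b - a) for a, b in zip(x_i, x_j)]


# Class-2 rows 4 and 19 of the original data, with the provider column removed.
# Column order: sex, age, level, verified, using_time, regist_type, card_a_cnt, card_b_cnt
row_4 = [1, 24877, 1, 0, 24706, 3, 24712, 24712]
row_19 = [0, 24859, 1, 0, 24713, 3, 24712, 24712]

synthetic = interpolate(row_4, row_19, gap=0.5)
print(synthetic)
# [0.5, 24868.0, 1.0, 0.0, 24709.5, 3.0, 24712.0, 24712.0]
```

Every feature of the result lies between the two parent rows, which matches the synthetic rows above: their `age` values stay within 24859 to 24877 and their `using_time` values within 24706 to 24713, the range spanned by the three original class-2 rows. Note that a binary column such as `sex` is interpolated as a plain number, so plain SMOTE is best suited to continuous features.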

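SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling algorithm for imbalanced data sets. It increases the number of minority-class samples by synthesising new ones, which reduces the imbalance between classes. A Python implementation of the idea:

``` python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote(X, y, k=5, sampling_ratio=1.0):
    """
    X: 2-D array, the original feature matrix
    y: 1-D array, the labels of the original data
    k: number of nearest neighbours to choose from
    sampling_ratio: desired size of each smaller class after resampling,
                    as a fraction of the largest class
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Group the sample indices by class label
    groups = {label: np.flatnonzero(y == label) for label in np.unique(y)}

    # Target size for every class, relative to the largest class
    n_majority = max(len(indices) for indices in groups.values())
    n_target = int(sampling_ratio * n_majority)

    new_X, new_y = [], []
    for label, indices in groups.items():
        n_synthetic = n_target - len(indices)
        if n_synthetic <= 0 or len(indices) < 2:
            # Class is already large enough, or too small to interpolate
            continue
        # Neighbours are searched within the class only; ask for one extra
        # neighbour because each point is returned as its own nearest neighbour
        n_neighbors = min(k, len(indices) - 1)
        nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X[indices])
        neighbours = nn.kneighbors(X[indices], return_distance=False)[:, 1:]
        for _ in range(n_synthetic):
            # Pick a random sample i of this class and one of its neighbours j
            pos = np.random.randint(len(indices))
            i = indices[pos]
            j = indices[np.random.choice(neighbours[pos])]
            # Interpolate between i and j
            gap = np.random.rand()
            new_X.append(X[i] + gap * (X[j] - X[i]))
            new_y.append(label)

    # Append the synthetic samples to the original data
    if new_X:
        X = np.vstack([X, np.array(new_X)])
        y = np.hstack([y, np.array(new_y)])
    return X, y
```

Here `X` is the original data set and `y` holds its labels. `k` is the number of nearest neighbours considered, and `sampling_ratio` is the size each smaller class should reach relative to the largest class (1.0 balances the classes).

The code first groups the samples by class and then works out how many synthetic samples each smaller class needs. For each synthetic sample it picks a random sample of that class, chooses one of its k nearest neighbours within the same class at random, and interpolates between the two to create the new sample. Finally the synthetic samples are appended to the original data and returned.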