1. Oversampling (upsampling)
1) Random oversampling
from collections import Counter

import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Build a small, heavily imbalanced 3-class dataset.
X, y = make_classification(n_samples=50, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print(Counter(y))  # class counts before resampling
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

# Randomly duplicate minority-class samples until all classes have the same size.
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(Counter(y_resampled))  # class counts after resampling
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()
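The idea behind random oversampling can be sketched with NumPy alone: keep every original sample and draw extra copies, with replacement, from each smaller class until it matches the largest one. The helper `random_oversample` and the toy arrays `X_demo` and `y_demo` are made up for illustration; this is a rough sketch of the idea, not imblearn's actual implementation.

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, rng=None):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = np.random.default_rng() if rng is None else rng
    counts = Counter(y.tolist())
    target = max(counts.values())
    keep = [np.arange(len(y))]  # all original samples are kept
    for label, n in counts.items():
        idx = np.flatnonzero(y == label)
        # Draw the missing samples with replacement from this class.
        keep.append(rng.choice(idx, size=target - n, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]


# Hypothetical toy data: one sample of class 0, three of class 1.
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_demo = np.array([0, 1, 1, 1])
X_over, y_over = random_oversample(X_demo, y_demo, rng=np.random.default_rng(0))
print(Counter(y_over.tolist()))
print(len(np.unique(X_over, axis=0)))  # number of distinct points
```

The classes end up balanced, but the number of distinct points does not change: random oversampling only repeats existing samples, which is why the scatter plot after resampling looks the same as before.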
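2) SMOTE
SMOTE picks minority-class samples at random as the basis for synthesizing new samples, without looking at what surrounds each picked sample. This tends to cause two problems:
If the picked minority sample is surrounded only by other minority samples, the synthesized sample adds little useful information. This is similar to a support vector machine, where points far from the margin have little effect on the decision boundary.
If the picked minority sample is surrounded only by majority samples, it may well be noise, and the synthesized sample will largely overlap with the neighboring majority samples, which makes classification harder.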
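To make the synthesis step concrete, here is a minimal NumPy sketch of how one SMOTE sample is generated: pick a minority sample, pick one of its k nearest minority neighbors, and place the new point somewhere on the line segment between them. The function `smote_one` and the toy array `X_min` are made up for illustration, and the sketch leaves out everything imblearn's `SMOTE` does around this step (per-class sampling targets, efficient neighbor search).

```python
import numpy as np


def smote_one(X_min, k=5, rng=None):
    """Synthesize one point from minority-class samples X_min of shape (n, d)."""
    rng = np.random.default_rng() if rng is None else rng
    k = min(k, len(X_min) - 1)  # cannot have more neighbors than other samples
    # Pick a minority sample at random; its surroundings are not inspected.
    i = rng.integers(len(X_min))
    # Euclidean distance from that sample to every minority sample.
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    dist[i] = np.inf  # exclude the sample itself
    # Indices of its k nearest minority neighbors.
    neighbors = np.argsort(dist)[:k]
    j = rng.choice(neighbors)
    # The new point lies on the segment between the sample and the neighbor.
    lam = rng.random()
    return X_min[i] + lam * (X_min[j] - X_min[i])


# Hypothetical toy minority class: 10 points in 2-D.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
x_new = smote_one(X_min, k=5, rng=rng)
print(x_new)
```

Since neither the random pick nor the interpolation ever looks at majority-class samples, a noisy minority point sitting inside a majority region is as likely to be used as any other, which is the second problem described above.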
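from collections import Counter

import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build an imbalanced 3-class dataset (about 1% / 5% / 94%).
X, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           random_state=500)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
print(Counter(y))  # class counts before resampling

# fit_resample replaces the older fit_sample, which recent imblearn releases no longer provide.
X_resampled_smote, y_resampled_smote = SMOTE().fit_resample(X, y)
print(Counter(y_resampled_smote))  # class counts after resampling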