Undersampling and Oversampling Methods

1. Oversampling with SMOTE

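When oversampling with SMOTE, split the data into training and validation sets first, and then oversample only the training set. Otherwise synthetic points built from validation rows leak into training, the validation score becomes far too optimistic, and the resulting severe overfitting goes unnoticed.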
https://beckernick.github.io/oversampling-modeling/

Usage:

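import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, so the validation set contains only real samples
X_train, X_val, y_train, y_val = train_test_split(train_df[predictors], train_df[target], test_size=0.15, random_state=1234)
# Recent imbalanced-learn: sampling_strategy replaces ratio and fit_resample replaces fit_sample;
# kind and m_neighbors are no longer SMOTE arguments, and n_jobs is deprecated
oversampler = SMOTE(sampling_strategy='auto', random_state=np.random.randint(100), k_neighbors=5)
os_X_train, os_y_train = oversampler.fit_resample(X_train, y_train)
print('Resampled dataset shape {}'.format(Counter(os_y_train)))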
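
For intuition about what k_neighbors controls, the following is a simplified sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its k nearest minority neighbours, and place a new point somewhere on the line segment between them. It is not imbalanced-learn's implementation; the function name smote_sketch and the sample points are made up for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k_neighbors])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points with two features each.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_points, n_new=6, k_neighbors=2)
print(len(new_points))  # 6
```

Because every synthetic point is derived from existing rows, running this step before the train/validation split would place near-copies of validation rows in the training set, which is the leakage described above.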

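Note: after oversampling, do not pass the original pandas.DataFrame straight to the model. Older imbalanced-learn releases return NumPy arrays from resampling, so the feature names are lost and no longer match the DataFrame; pass the validation features as an array as well (X_val.values below). If your version returns a DataFrame instead, make sure both sides carry the same column names.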
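
If you would rather keep the feature names than drop to arrays, one option is to wrap the resampled array in a DataFrame again using the original columns. This is a minimal sketch; the column names and values below are hypothetical stand-ins for X_train and for the array returned by the resampler.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train and for the resampled array.
X_train = pd.DataFrame({"loan_amnt": [1000, 2500], "int_rate": [0.12, 0.08]})
os_X_train = np.array([[1000, 0.12], [2500, 0.08], [1750, 0.10]])

# Re-attach the original column names to the resampled array.
os_X_train_df = pd.DataFrame(os_X_train, columns=X_train.columns)
print(list(os_X_train_df.columns))  # ['loan_amnt', 'int_rate']
```

With names restored on the training side, the validation DataFrame can be passed unchanged, provided its columns are in the same order.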

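import numpy as np
from xgboost import XGBClassifier

model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=-1,                # replaces the deprecated nthread
    scale_pos_weight=1,
    random_state=27,          # replaces the deprecated seed
    # XGBoost 1.6+ expects these two here; older releases take them in fit()
    eval_metric='auc',
    early_stopping_rounds=3,
)

# Pass arrays on both sides so the feature names cannot disagree
model.fit(
    np.asarray(os_X_train),
    os_y_train,
    eval_set=[(X_val.values, y_val)],
    verbose=True,
)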

https://www.kaggle.com/ktattan/recall-97-with-smote-random-forest-tsne

2. Undersampling (also called downsampling)

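import pandas as pd

def down_sample(df):
    """
    Undersample: keep every row with acc_now_delinq == 1
    and a random 10% of the rows with acc_now_delinq == 0.
    """
    df1 = df[df['acc_now_delinq'] == 1]   # kept in full
    df2 = df[df['acc_now_delinq'] == 0]   # to be thinned out
    df3 = df2.sample(frac=0.1)            # pass random_state=... for a reproducible draw
    return pd.concat([df1, df3], ignore_index=True)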
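
The fixed frac=0.1 above only balances the classes if the original imbalance happens to be about 10:1. A possible variant, sketched here under the assumption of a binary 0/1 target, samples the negatives to a chosen ratio per positive row and fixes the seed. The function name down_sample_to_ratio, its parameters, and the demo frame are hypothetical.

```python
import pandas as pd


def down_sample_to_ratio(df, target_col, ratio=1.0, random_state=0):
    """Keep all positive rows and about `ratio` negative rows per positive row."""
    positives = df[df[target_col] == 1]
    negatives = df[df[target_col] == 0]
    # Never ask for more negatives than exist.
    n_keep = min(len(negatives), int(len(positives) * ratio))
    sampled = negatives.sample(n=n_keep, random_state=random_state)
    # Shuffle so the two classes are not grouped in blocks.
    combined = pd.concat([positives, sampled])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Hypothetical demo data: 5 positive rows and 95 negative rows.
demo = pd.DataFrame({"acc_now_delinq": [1] * 5 + [0] * 95, "x": range(100)})
balanced = down_sample_to_ratio(demo, "acc_now_delinq", ratio=2.0)
print(balanced["acc_now_delinq"].value_counts().to_dict())
```

As with oversampling, apply this only to the training split, so that the validation set keeps the original class distribution.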