Python and Machine Learning: Common Methods and Examples for Handling Imbalanced Datasets

兰泽S

Last modified 2024-02-05 18:24:36


First published 2024-02-05 18:22:35

Article link: https://blog.csdn.net/weixin_53848907/article/details/135976144


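This article reviews several common methods for handling imbalanced datasets: oversampling, undersampling, class weighting, and sample weighting. They are illustrated below with a credit card default dataset.

Imbalanced datasets, and long-tailed data in particular, have long been both important and difficult. In practice, the metric to optimize should be chosen according to the business need. For example, credit card default detection places a premium on identifying the positive class (defaults), so a higher recall is usually needed. We evaluate the models with AUC and the F1 score; the overall results are summarized in the table below. None of the results is strong, but F1 does improve after every treatment. (Note: apart from the base model, none of the models in this article was tuned, so their parameters may not suit the processed data.)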
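To make the choice of metric concrete, here is a minimal sketch in plain Python of how precision, recall, and F1 are computed from predictions. The toy labels and the helper name `precision_recall_f1` are made up for illustration; they are not taken from the credit card data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: 8 non-default users (0) and 2 default users (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts "no default" looks good on accuracy only
y_all_negative = [0] * 10
accuracy_all_negative = sum(t == p for t, p in zip(y_true, y_all_negative)) / len(y_true)
print(accuracy_all_negative)                         # 0.8
print(precision_recall_f1(y_true, y_all_negative))   # (0.0, 0.0, 0.0)

# A model that finds one of the two defaults, with one false alarm
y_some = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_some))           # (0.5, 0.5, 0.5)
```

The always-negative model reaches 80% accuracy while catching no defaults at all, which is why accuracy is a poor guide on imbalanced data and why recall and F1 are tracked here.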

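1. Loading the dataset

For simplicity, this article trains directly on preprocessed credit card default data. In the training set, the ratio of non-default to default users is about 3.5:1.

# Import the required modules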
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn import metrics 
from sklearn.metrics import precision_recall_curve,classification_report
from sklearn.metrics import confusion_matrix, accuracy_score

import warnings
warnings.filterwarnings('ignore')

import lightgbm as lgb

# Preprocessed data
df_train = pd.read_csv('./train_set_pre.csv')
df_test = pd.read_csv('./test_set_pre.csv')

X_train = df_train.drop(['default.payment.next.month'],axis=1)
y_train = df_train['default.payment.next.month']

X_test = df_test.drop(['default.payment.next.month'],axis=1)
y_test = df_test['default.payment.next.month']

# Custom scoring report function
def model_report(model, train_x, train_y, test_x, test_y):
    # Performance on the training set
    train_pre = model.predict(train_x)
    train_score = model.predict_proba(train_x)[:, 1]
    train_auc = metrics.roc_auc_score(train_y, train_score)
    # Performance on the test set
    test_pre = model.predict(test_x)
    test_score = model.predict_proba(test_x)[:, 1]
    test_auc = metrics.roc_auc_score(test_y, test_score)
    test_f1 = metrics.f1_score(test_y, test_pre)
    print('Training-set AUC:', train_auc)  # compared with the test AUC to check for overfitting
    print('Test-set AUC:', test_auc)
    print('Test-set F1 score:', test_f1)

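2. Training directly on the original dataset

Check the label distribution of the original dataset:

label_counts = y_train.value_counts()
print(label_counts)
# In this dataset the ratio is 16304 / 4696
print("\nRatio of label 0 to label 1:", label_counts[0] / label_counts[1])

We use a LightGBM model whose parameters have already been tuned with grid search.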

gbm_base = lgb.LGBMClassifier(max_depth=5, num_leaves=15, subsample=0.8, learning_rate=0.1, 
                         colsample_bytree = 0.8, n_estimators=80, metrics='auc')
gbm_model_base = gbm_base.fit(X_train, y_train)

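Check the model's performance on the training and test sets.

model_report(gbm_model_base, X_train, y_train, X_test, y_test)

For binary classification, the predictions can be adjusted by changing the threshold used to turn predicted scores into class labels. This approach does not carry over to multi-class problems.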
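The effect of moving the threshold can be sketched in a few lines of plain Python. The scores below are hypothetical stand-ins for the output of `predict_proba(...)[:, 1]`, and the helper names are our own.

```python
def predict_with_threshold(scores, threshold=0.5):
    """Label a sample as positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def recall_and_f1(y_true, y_pred):
    """Return (recall, f1) for the positive class 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return recall, f1


# Hypothetical labels and predicted probabilities of default
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.55, 0.80, 0.15]

recall_05, f1_05 = recall_and_f1(y_true, predict_with_threshold(scores, 0.5))
recall_04, f1_04 = recall_and_f1(y_true, predict_with_threshold(scores, 0.4))
print(recall_05, f1_05)
print(recall_04, f1_04)
```

On this toy data, lowering the threshold from 0.5 to 0.4 recovers the missed default at the cost of one false alarm, so recall rises. On real data the threshold should be chosen on a validation set rather than on the test set.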

3. Oversampling

Oversampling increases the number of minority-class samples.

We use the SMOTE class from the imblearn package to oversample.

from imblearn.over_sampling import SMOTE 

oversampler=SMOTE(random_state=0)
os_x,os_y=oversampler.fit_resample(X_train,y_train)

# Check the label distribution after oversampling
os_y.value_counts()

Label distribution after oversampling:

Train the model and check the results:

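from sklearn.base import clone

# fit() returns the estimator itself, so fit a fresh copy to avoid overwriting gbm_model_base
gbm_model_os = clone(gbm_base).fit(os_x, os_y)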
model_report(gbm_model_os,os_x,os_y,X_test,y_test)

As shown, the test-set AUC dropped by 0.016 and the F1 score rose by 0.035.
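For intuition about what SMOTE does, the core idea is to create a synthetic point on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea in plain Python, not imblearn's implementation; the function name `smote_like_samples` and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority-class points with two features each
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region already occupied by the minority class. That can help the model, but it can also blur the class boundary when the classes overlap, which may be one reason AUC does not improve here.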

4. Undersampling

Undersampling reduces the number of majority-class samples. This article uses random undersampling as an example.

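from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy='not minority', random_state=42)
"""
Parameter notes:
sampling_strategy:
  float: the ratio (minority-class count) / (majority-class count after resampling); binary classification only
  str:   'majority' resamples only the majority class,
         'not minority' resamples every class except the minority class,
         'not majority' resamples every class except the majority class,
         'all' resamples every class,
         'auto' is equivalent to 'not minority'
  dict:  keys are the classes to resample, values are the desired number of samples for each class
  The default is the string 'auto'.
"""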

rus_x, rus_y = rus.fit_resample(X_train, y_train)
print(rus_y.value_counts())

Label distribution after undersampling:

Train the model and check the results:

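from sklearn.base import clone

# fit() returns the estimator itself, so fit a fresh copy to avoid overwriting earlier models
gbm_model_rus = clone(gbm_base).fit(rus_x, rus_y)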
model_report(gbm_model_rus,rus_x,rus_y,X_test,y_test)

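The results show that AUC is slightly lower than without any treatment, while F1 improves markedly.

(Other undersampling methods can be listed with dir(imblearn.under_sampling).)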
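Random undersampling itself is simple enough to sketch by hand: keep every minority sample and draw an equally sized random subset from each larger class. The version below is a plain-Python illustration under that assumption, not imblearn's implementation, and `random_undersample` is a name of our own.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=42):
    """Randomly drop samples so that every class has as many samples
    as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_keep = min(counts.values())
    keep = []
    for label in counts:
        indices = [i for i, t in enumerate(y) if t == label]
        # Sampling n_keep out of n_keep indices keeps the whole minority class
        keep.extend(rng.sample(indices, n_keep))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 7 majority rows (label 0) and 3 minority rows (label 1)
X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
print(X_res)
```

The price of this balance is discarded data: on the credit card set, most of the majority-class rows are thrown away, so the model sees far fewer non-default examples than it otherwise would.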

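5. Class weights

Raising the weight of the minority class strengthens the model's fit to that class. This is usually set through the model's class_weight parameter.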

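from sklearn.utils.class_weight import compute_class_weight

# Use a separate name for the result so the imported function is not shadowed
classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
"""Parameter notes:
class_weight:
    'balanced': weights are n_samples / (n_classes * np.bincount(y)); np.bincount(y) returns the number of samples for each integer label from 0 to max(y);
    dict: classes and their corresponding weights;
    None: uniform weights;
classes: all values the label can take
y: the labels of the original data
"""

# tolist() converts NumPy scalars to plain Python numbers for a clean printout
class_weight_dict = dict(zip(classes.tolist(), class_weights.tolist()))
print(class_weight_dict)
# Output: {0: 0.6440137389597644, 1: 2.2359454855195913}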
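As a quick check on the 'balanced' formula, the weights can be reproduced by hand from the class counts reported in Section 2 (16304 non-default and 4696 default users). This is a minimal sketch in plain Python, not scikit-learn's implementation:

```python
# Class counts of the training set, as reported in Section 2
counts = {0: 16304, 1: 4696}

n_samples = sum(counts.values())   # 21000
n_classes = len(counts)            # 2

# 'balanced' weight for each class: n_samples / (n_classes * class_count)
balanced = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(balanced)

# With these weights, each class contributes the same total weight
total_weight = {label: balanced[label] * counts[label] for label in counts}
print(total_weight)
```

The weights match the printed output of the scikit-learn call, and the total weight of each class comes out equal (about 10500 each), which is the sense in which the classes are "balanced".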

Fit the model with the class_weight parameter set:

gbm_w1 = lgb.LGBMClassifier(max_depth=5, num_leaves=15, subsample=0.8, learning_rate=0.1, 
                         colsample_bytree = 0.8, n_estimators=80, metrics='auc', 
                         class_weight = class_weight_dict)

gbm_model_w1 = gbm_w1.fit(X_train, y_train)
model_report(gbm_model_w1,X_train, y_train, X_test,y_test)

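6. Sample weights

Raising the sample weights of the minority class strengthens the model's fit to that class. When each sample's weight equals the weight of its class, sample weights and class weights have the same effect. If both class weights and sample weights are set, the final weight is their product.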
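How the two kinds of weight combine can be illustrated with a hand-written weighted log loss. This is only a sketch of the arithmetic described above, not LightGBM's internal code; the function `weighted_log_loss` and the toy numbers are assumptions for illustration.

```python
import math


def weighted_log_loss(y_true, p_pred, sample_weight=None, class_weight=None):
    """Weighted average of the binary log loss. The final weight of each
    sample is its sample weight multiplied by the weight of its class."""
    sample_weight = sample_weight or [1.0] * len(y_true)
    class_weight = class_weight or {}
    total, weight_sum = 0.0, 0.0
    for t, p, w in zip(y_true, p_pred, sample_weight):
        final_weight = w * class_weight.get(t, 1.0)
        loss = -math.log(p if t == 1 else 1.0 - p)
        total += final_weight * loss
        weight_sum += final_weight
    return total / weight_sum


# Toy labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.2, 0.3, 0.6]
cw = {0: 0.5, 1: 2.0}  # hypothetical class weights

# The same weights expressed per class and per sample give the same loss
loss_class = weighted_log_loss(y_true, p_pred, class_weight=cw)
loss_sample = weighted_log_loss(y_true, p_pred, sample_weight=[cw[t] for t in y_true])
print(loss_class, loss_sample)

# Setting both multiplies them: equivalent to sample weights of cw[t] ** 2
loss_both = weighted_log_loss(y_true, p_pred,
                              sample_weight=[cw[t] for t in y_true], class_weight=cw)
loss_squared = weighted_log_loss(y_true, p_pred, sample_weight=[cw[t] ** 2 for t in y_true])
print(loss_both, loss_squared)
```

Passing the same weights through both channels therefore squares them, which is usually not what is intended, so it is safer to use one mechanism at a time.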

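Sample weights should be an array with the same length as the data, where each value is the weight applied to the loss of the corresponding sample. The function below computes them. These sample weights have the same effect as the class weights in Section 5.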

def BalancedSampleWeights(y_train, class_weight_coef):
    # Assumes the labels are consecutive integers starting at 0 (required for np.bincount to line up with classes)
    classes = np.unique(y_train)  # all class labels, already sorted
    class_samples = np.bincount(y_train)  # number of samples for each integer label from 0 to max(y_train)
    total_samples = class_samples.sum()  # number of training samples
    n_classes = len(class_samples)  # number of classes
    weights = total_samples / (n_classes * class_samples * 1.0)  # per-class weight, same formula as the 'balanced' class weights in Section 5
    class_weight_dict = {key: value for (key, value) in zip(classes, weights)}  # map each class to its weight
    ## Optionally rescale the weight of one class, controlled by class_weight_coef
    # class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * class_weight_coef
    sample_weights = [class_weight_dict[i] for i in y_train]  # weight of each individual sample
    return sample_weights

class_weight_coef = 1
weight=BalancedSampleWeights(y_train,class_weight_coef)

gbm_sw = lgb.LGBMClassifier(max_depth=5, num_leaves=15, subsample=0.8, learning_rate=0.1, 
                         colsample_bytree = 0.8, n_estimators=80, metrics='auc')

gbm_model_sw = gbm_sw.fit(X_train,y_train,sample_weight = weight)

model_report(gbm_model_sw,X_train,y_train,X_test,y_test)

The results are the same as in Section 5.
