KESCI "Loan Risk Assessment with Transfer Learning" competition: a rewritten baseline using XGBoost and SMOTE

伊玛目的门徒

Categories: python, sklearn, quantitative finance. Tags: XGBOOST, SMOTE, KESCI, KAGGLE, data models

Article link: https://blog.csdn.net/qq_37195257/article/details/102882769

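A "Loan Risk Assessment" solution via transfer learning

Competition overview

Finance is an important setting for putting algorithms into production. This practice competition focuses on the "loan risk assessment" problem, exploring what transfer learning, a subfield of machine learning, can offer in financial scenarios and how it can be applied in practice.

The dataset used in this practice competition is an open industry dataset, which we have framed as a transfer learning problem:

Participants are given 40,000 records from business A and 4,000 records from business B, and must build a credit scoring model for business B. Business A is unsecured credit lending: the borrower provides no collateral, obtains the loan on the strength of their own creditworthiness, and that creditworthiness serves as the guarantee of repayment. Business B is cash lending, i.e. payday loans. Compared with ordinary consumer finance products, cash loans have five main characteristics: small amounts, short terms, no collateral, fast processing, and high interest rates, which fits their low borrowing threshold.
Because businesses A and B are related, the focus of this practice competition is how participants transfer knowledge from business A to business B in order to strengthen the credit scoring model for business B.

If you have some machine learning background and want to push the boundaries of your skills, this practice competition is the ideal arena. Here you will gain:

An excellent opportunity to get started with transfer learning
Friends in the competition community, and the chance to build your own team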

----------------------------------------------

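The original baseline has no upsampling of the data (SMOTE) and no hyperparameter tuning, so I made a few small changes.

The data is extremely imbalanced, so upsampling it is both appropriate and necessary.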
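
To make the upsampling idea concrete, here is a minimal pure-Python sketch of the interpolation step at the heart of SMOTE: each synthetic minority row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours. The function name `smote_like_oversample`, the parameter `k`, and the toy data are illustrative assumptions; this is not the imblearn implementation used in the code further down.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy imbalanced data (made up): 8 majority rows and 3 minority rows.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Generate enough synthetic minority rows to balance the two classes.
new_rows = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

Because every synthetic row lies between two real minority rows, the minority class grows without simply duplicating existing records.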

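The algorithm used is XGBoost.

Walkthrough of the rewrite: https://www.kesci.com/home/project/5dbe6302080dc300371f3219

Unfortunately, the resulting model did not achieve a high score on the test set. It does not use transfer learning as the organizers intended; it only learns from the provided B dataset, and because the B dataset is small, the model is underfit to some degree.

The code is attached for my own review and as a reference for other learners: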

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
import os
import gc
sns.set(style = 'white', context= 'notebook', palette = 'deep')
sns.set_style('white')
pd.set_option('display.max_columns', 500)
train_A = pd.read_csv('/home/kesci/input/qyxx6708/A_train.csv')
train_B = pd.read_csv('/home/kesci/input/qyxx6708/B_train.csv')
test = pd.read_csv('/home/kesci/input/qyxx6708/B_test.csv')
sample = pd.read_csv('/home/kesci/input/qyxx6708/submit_sample.csv')

# Data description omitted
#.....

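# Drop the column that is completely empty
train_B = train_B.drop('UserInfo_170', axis = 1)
# Handle severely missing features.
# Features with severe missingness add a lot of noise and interfere with learning.
# To make the model more robust, we drop them: with a threshold of 1%, any feature
# that is more than 99% missing is removed (the threshold can be adjusted).
train_B_info = train_B.describe()
useful_col = []
for col in train_B_info.columns:
    # The 'count' row holds the number of non-missing values; .ix was removed from pandas, so use .loc
    if train_B_info.loc['count', col] > train_B.shape[0]*0.01:
        useful_col.append(col)
train_B_1 = train_B[useful_col].copy()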
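
The filter above keeps a column only when its non-missing count exceeds 1% of the rows. A toy, self-contained sketch of the same rule, using plain dictionaries in place of a DataFrame (the helper name `keep_columns`, the column names, and the values are made up for illustration):

```python
def keep_columns(columns, min_fraction=0.01):
    """Return the names of columns whose share of non-missing values
    exceeds min_fraction. Missing values are represented by None."""
    kept = []
    for name, values in columns.items():
        non_missing = sum(v is not None for v in values)
        if non_missing > len(values) * min_fraction:
            kept.append(name)
    return kept

# Hypothetical toy table with 200 rows per column.
n_rows = 200
toy = {
    "dense": [1.0] * n_rows,                    # 100% present
    "half": [1.0, None] * (n_rows // 2),        # 50% present
    "sparse": [1.0] + [None] * (n_rows - 1),    # 0.5% present, dropped
    "empty": [None] * n_rows,                   # 0% present, dropped
}
print(keep_columns(toy))
```

With pandas, the per-column share of non-missing values can also be obtained as `df.notna().mean()`, which avoids going through `describe()`.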


# Fill missing values with -999
train_B_1 = train_B_1.fillna(-999)

# Handle highly linearly correlated features.
# If two features are perfectly linearly correlated, we only need to keep one of them,
# because the information in the second is fully contained in the first. If both are
# kept, model performance will very likely drop.
# We therefore drop features that are highly linearly correlated.
relation = train_B_1.corr()
length = relation.shape[0]
high_corr = list()
final_cols = []
del_cols = []
for i in range(length):
    if relation.columns[i] not in del_cols:
        final_cols.append(relation.columns[i])
        for j in range(i+1, length):
            # Only positive correlations above 0.98 are treated as duplicates here
            if (relation.iloc[i,j] > 0.98) and (relation.columns[j] not in del_cols):
                del_cols.append(relation.columns[j])

train_B_1 = train_B_1[final_cols]
train_B_flag = train_B_1['flag']
train_B_1.drop('no', axis = 1, inplace = True)
train_B_1.drop('flag', axis = 1, inplace = True)
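
The greedy pruning loop above can be tried on a tiny example. The sketch below is a standalone illustration with a hand-written Pearson correlation and invented column names; like the loop above, it keeps the first column of each highly correlated group and only checks positive correlations.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_correlated(columns, threshold=0.98):
    """Greedily keep the first column of every highly correlated group."""
    names = list(columns)
    dropped = set()
    kept = []
    for i, name_i in enumerate(names):
        if name_i in dropped:
            continue
        kept.append(name_i)
        for name_j in names[i + 1:]:
            if name_j in dropped:
                continue
            if pearson(columns[name_i], columns[name_j]) > threshold:
                dropped.add(name_j)
    return kept

# Hypothetical toy columns (no constant columns, so the variance is never zero).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],       # perfectly correlated with "a"
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],               # weakly related to "a"
    "a_shifted": [11.0, 12.0, 13.0, 14.0, 15.0],  # also a linear copy of "a"
}
print(prune_correlated(toy))
```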


# Prepare the data for the model and deal with the class imbalance
# Install xgboost with pip through a mirror
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple xgboost

import xgboost as xgb
dtrain_B = xgb.DMatrix(data = train_B_1, label = train_B_flag)
# Install imbalanced-learn (imported as imblearn) for upsampling
!pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
# Upsample the training set
smote = SMOTE(random_state=2)
# fit_sample was removed in newer imbalanced-learn releases; fit_resample is the current name
X_train_os,y_train_os = smote.fit_resample(train_B_1, train_B_flag)

print('Number of training records after upsampling:', len(X_train_os))
print('Training X shape:', X_train_os.shape, ', y shape:', y_train_os.shape)
print('Total number of records:', X_train_os.shape[0])
print('After upsampling, class 1 has {} rows and class 0 has {} rows.'.format(sum(y_train_os==1),sum(y_train_os==0)))

X_train_os=pd.DataFrame(X_train_os)
X_train_os.columns = train_B_1.columns
dtrain_B_caiyang = xgb.DMatrix(data = X_train_os, label = y_train_os)


-----------
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
cv_params={'eta':[0.1,0.01]}    # 0.1
param_test1 = {
    'max_depth':range(3,10,1),
    'min_child_weight':range(1,6,1)
}

# Best values reported by the original run: {'max_depth': 9, 'min_child_weight': 1}
param_test2 = {
    # reg_lambda / reg_alpha are the sklearn-wrapper names for XGBoost's lambda (L2) and alpha (L1)
    'reg_lambda':range(3,10,1),
    'reg_alpha':range(3,10,1),
}

# Best values reported by the original run: {'lambda': 3, 'alpha': 3}

#gbm = xgb.XGBClassifier(**params)
#opt_clf = GridSearchCV(estimator=gbm,param_grid=cv_params,cv=5)
gsearch3 = GridSearchCV(
    estimator=xgb.XGBClassifier(learning_rate=0.1, gamma=0,
                            subsample=0.8, colsample_bytree=0.8,max_depth=9,min_child_weight=1, objective='binary:logistic', n_jobs=8,
                            scale_pos_weight=1, random_state=27), param_grid=param_test2, scoring='roc_auc',
    # the iid argument was removed from GridSearchCV in newer scikit-learn releases
    cv=5)

gsearch3.fit(X_train_os, np.ravel(y_train_os))
print('Best parameter values: {0}'.format(gsearch3.best_params_))
print('Best model score: {0}'.format(gsearch3.best_score_))
-------------

# Train the model on the upsampled data

Trate = 0.25
params = {'booster':'gbtree',
          'eta':0.1,
          'max_depth':9,
          'max_delta_step':0,
          'subsample':0.9,
          'colsample_bytree':0.9,
          'base_score':Trate,
          'objective':'binary:logistic',
          'lambda':3,
          'alpha':3,
          'seed':100,    # the native XGBoost parameter is 'seed', not 'random_seed'
          'min_child_weight':1
}
params['eval_metric'] ='auc'
xgb_model2 = xgb.train(params, dtrain_B_caiyang, num_boost_round=200, maximize = True,
                      verbose_eval= True )

# Check the training score: AUC on the original training set for the model trained on the upsampled data
from sklearn.metrics import roc_auc_score
y_pred = xgb_model2.predict(xgb.DMatrix(train_B_1))
auc_score = roc_auc_score(train_B_flag,y_pred)
auc_score

# Result: 0.8564752128867918
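
For reference, the AUC reported above is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counting one half. A minimal sketch that computes it directly from that definition, using made-up labels and scores (the function name `auc_by_pairs` is illustrative, and the quadratic loop is only meant for small inputs). Note that the score above is measured on rows the model was trained on, so it is likely to be optimistic compared with a held-out estimate.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Made-up labels and predicted scores for six rows.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(auc_by_pairs(labels, scores))
```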

--------
# Write the predictions to a CSV file
prediction = xgb_model2.predict(xgb.DMatrix(test[train_B_1.columns].fillna(-999)))
test['pred'] = prediction
test[['no','pred']].to_csv('submission2.csv', index = False)