Financial Risk Control - Loan Default Prediction - Task04: Modeling and Parameter Tuning

Financial Risk Control Learning Competition

https://tianchi.aliyun.com/competition/entrance/531830/information

I. Competition Data

The task is to predict whether a user will default on a loan. The dataset is visible and downloadable after registration. It comes from the loan records of a credit platform and contains more than 1.2 million records with 47 columns of variables, 15 of which are anonymized. To keep the competition fair, 800,000 records are sampled as the training set, 200,000 as test set A, and 200,000 as test set B; fields such as employmentTitle, purpose, postCode and title are desensitized.

Import the data-analysis libraries

# Standard library and general utility imports
import io, os, sys, types, time, datetime, math, random, requests, subprocess, tempfile

# Third-party imports
# Data processing
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
from tqdm import tqdm
import missingno
import seaborn as sns 
# from pandas.tools.plotting import scatter_matrix  # moved to pandas.plotting.scatter_matrix in modern pandas
from mpl_toolkits.mplot3d import Axes3D
# plt.style.use('seaborn')  # change the plot style
plt.rcParams['font.family'] = ['Arial Unicode MS', 'Microsoft Yahei', 'SimHei', 'sans-serif']  # fix garbled Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # fix garbled minus signs with the SimHei font

# Feature selection and encoding
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize  # Imputer was removed; use sklearn.impute.SimpleImputer
# from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute

# Machine learning
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

# Grid search and random search
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# Model metrics (classification)
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

# Warning handling
import warnings
warnings.filterwarnings('ignore')

# Inline plotting in Jupyter
%matplotlib inline

# Scientific computing aliases (numpy and matplotlib are already imported above)
import scipy as sc
import sklearn as sk

# Plotting toolkits
import seaborn as sns
import pyecharts.options as opts
from pyecharts.charts import Line, Grid

Load the datasets

  • train
  • test
# Dataset paths

train_path = 'train.csv'
test_path = 'testA.csv'
dataset_path = './'
data_train_path = dataset_path + train_path
data_test_path = dataset_path + test_path


# Read the csv files into DataFrames
train = pd.read_csv(data_train_path)
test_a = pd.read_csv(data_test_path)

Task4 Modeling and Parameter Tuning

  • Learn the machine-learning models commonly used in financial risk control
  • Learn the modeling and parameter-tuning workflow

Introduction to the underlying models

Since the theory behind these algorithms is lengthy, here are some blogs and textbooks recommended for beginners to fill in the background.

  • 1 Logistic regression

    • https://blog.csdn.net/han_xiaoyang/article/details/49123419
  • 2 Decision tree

    • https://blog.csdn.net/c406495762/article/details/76262487
  • 3 GBDT

    • https://zhuanlan.zhihu.com/p/45145899
  • 4 XGBoost

    • https://blog.csdn.net/wuzhongqiang/article/details/104854890
  • 5 LightGBM

    • https://blog.csdn.net/wuzhongqiang/article/details/105350579
  • 6 CatBoost

    • https://mp.weixin.qq.com/s/xloTLr5NJBgBspMQtxPoFA
  • 7 Time-series models (optional)

    • RNN: https://zhuanlan.zhihu.com/p/45289691

    • LSTM: https://zhuanlan.zhihu.com/p/83496936

  • 8 Recommended textbooks:

    • 《机器学习》 (Machine Learning) https://book.douban.com/subject/26708119/

    • 《统计学习方法》 (Statistical Learning Methods) https://book.douban.com/subject/10590856/

    • 《面向机器学习的特征工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/

    • 《信用评分模型技术与应用》 (Credit Scoring Models: Techniques and Applications) https://book.douban.com/subject/1488075/

    • 《数据化风控》 (Data-Driven Risk Control) https://book.douban.com/subject/30282558/

Modeling code

Import the relevant modules and initialize the configuration

import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt
"""
sns 相关设置
@return:
"""
# 声明使用 Seaborn 样式
sns.set()
# 有五种seaborn的绘图风格,它们分别是:darkgrid, whitegrid, dark, white, ticks。默认的主题是darkgrid。
sns.set_style("whitegrid")
# 有四个预置的环境,按大小从小到大排列分别为:paper, notebook, talk, poster。其中,notebook是默认的。
sns.set_context('talk')
# 中文字体设置-黑体
plt.rcParams['font.sans-serif'] = ['SimHei']
# 解决保存图像是负号'-'显示为方块的问题
plt.rcParams['axes.unicode_minus'] = False
# 解决Seaborn中文显示问题并调整字体大小
sns.set(font='SimHei')

Read the data

  • The reduce_mem_usage function downcasts each column to the smallest dtype that can hold its values, which reduces memory usage and is useful for large datasets.
def reduce_mem_usage(df):
    """Downcast each column to the smallest dtype that can hold its values."""
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns become pandas categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
# Read the data
train = pd.read_csv('./train.csv')
test = pd.read_csv('./testA.csv')
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
Memory usage of dataframe is 286.87 MB
Memory usage after optimization is: 69.46 MB
Decreased by 75.8%
Memory usage of dataframe is 70.19 MB
Memory usage after optimization is: 17.20 MB
Decreased by 75.5%
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id                    800000 non-null int32
loanAmnt              800000 non-null float16
term                  800000 non-null int8
interestRate          800000 non-null float16
installment           800000 non-null float16
grade                 800000 non-null category
subGrade              800000 non-null category
employmentTitle       799999 non-null float32
employmentLength      753201 non-null category
homeOwnership         800000 non-null int8
annualIncome          800000 non-null float32
verificationStatus    800000 non-null int8
issueDate             800000 non-null category
isDefault             800000 non-null int8
purpose               800000 non-null int8
postCode              799999 non-null float16
regionCode            800000 non-null int8
dti                   799761 non-null float16
delinquency_2years    800000 non-null float16
ficoRangeLow          800000 non-null float16
ficoRangeHigh         800000 non-null float16
openAcc               800000 non-null float16
pubRec                800000 non-null float16
pubRecBankruptcies    799595 non-null float16
revolBal              800000 non-null float32
revolUtil             799469 non-null float16
totalAcc              800000 non-null float16
initialListStatus     800000 non-null int8
applicationType       800000 non-null int8
earliesCreditLine     800000 non-null category
title                 799999 non-null float16
policyCode            800000 non-null float16
n0                    759730 non-null float16
n1                    759730 non-null float16
n2                    759730 non-null float16
n3                    759730 non-null float16
n4                    766761 non-null float16
n5                    759730 non-null float16
n6                    759730 non-null float16
n7                    759730 non-null float16
n8                    759729 non-null float16
n9                    759730 non-null float16
n10                   766761 non-null float16
n11                   730248 non-null float16
n12                   759730 non-null float16
n13                   759730 non-null float16
n14                   759730 non-null float16
dtypes: category(5), float16(30), float32(3), int32(1), int8(8)
memory usage: 69.5 MB
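One caveat about the float16 columns above (a hedged note, not part of the original): float16 has only about three significant decimal digits and a maximum of roughly 65504, so aggressive downcasting can silently round values or overflow. A minimal illustration:

```python
import numpy as np

# float16 has an 11-bit significand, so near 35000 the gap between
# representable values is 32 -- downcasting rounds silently.
amount = np.float64(35000.47)
print(np.float16(amount))   # 35008.0

# float16 overflows above ~65504 and becomes inf
print(np.float16(70000.0))  # inf
```

If exact values matter (for example monetary amounts), consider keeping such columns as float32 instead.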
# Feature engineering: see the previous task
from sklearn.model_selection import KFold
# Separate features and target for cross-validation
y_train = train.loc[:,'isDefault']
X_train = train.drop(['id','issueDate','isDefault'], axis=1)
X_test = test.drop(['id','issueDate'], axis=1)

# 5-fold cross-validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

Modeling example

  • Model with the LightGBM gradient-boosting ensemble under 5-fold cross-validation
import lightgbm as lgb
"""使用lightgbm 5折交叉验证进行建模预测"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'learning_rate': 0.1,
                'metric': 'auc',
        
                'min_child_weight': 1e-3,
                'num_leaves': 31,
                'max_depth': -1,
                'reg_lambda': 0,
                'reg_alpha': 0,
                'feature_fraction': 1,
                'bagging_fraction': 1,
                'bagging_freq': 0,
                'seed': 2020,
                'nthread': 8,
                'verbose': -1,
    }
    
    # Note: LightGBM >= 4.0 removed verbose_eval/early_stopping_rounds from lgb.train;
    # pass callbacks=[lgb.log_evaluation(1000), lgb.early_stopping(10)] instead.
    model = lgb.train(params, train_set=train_matrix, num_boost_round=2000, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=10)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
************************************ 1 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[108]	valid_0's auc: 0.71916
[0.7191601264391831]
************************************ 2 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[110]	valid_0's auc: 0.715545
[0.7191601264391831, 0.715544695574905]
************************************ 3 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[88]	valid_0's auc: 0.718961
[0.7191601264391831, 0.715544695574905, 0.7189611956227128]
************************************ 4 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[132]	valid_0's auc: 0.718808
[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632]
************************************ 5 ************************************
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[126]	valid_0's auc: 0.71875
[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632, 0.7187502453796062]
lgb_scotrainre_list:[0.7191601264391831, 0.715544695574905, 0.7189611956227128, 0.7188078144554632, 0.7187502453796062]
lgb_score_mean:0.718244815494374
lgb_score_std:0.0013575028097615738
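Note that the loop above keeps only the last fold's model for later prediction. A common refinement is to average test-set predictions across folds and collect out-of-fold (OOF) predictions for validation. A minimal sketch on synthetic data, with a sklearn classifier standing in for LightGBM (the data and estimator here are illustrative, not from the original):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (the real code uses the competition frames)
X, y = make_classification(n_samples=1000, n_features=10, random_state=2020)
X_tr, X_te = X[:800], X[800:]
y_tr = y[:800]

kf = KFold(n_splits=5, shuffle=True, random_state=2020)
test_pred = np.zeros(len(X_te))   # test predictions, averaged over folds
oof_pred = np.zeros(len(X_tr))    # out-of-fold predictions on the training set

for train_idx, valid_idx in kf.split(X_tr):
    clf = LogisticRegression(max_iter=1000).fit(X_tr[train_idx], y_tr[train_idx])
    # Each training sample is predicted exactly once, by the fold that held it out
    oof_pred[valid_idx] = clf.predict_proba(X_tr[valid_idx])[:, 1]
    # Accumulate an equal-weight average of the five fold models
    test_pred += clf.predict_proba(X_te)[:, 1] / kf.get_n_splits()

print('OOF AUC:', roc_auc_score(y_tr, oof_pred))
```

The OOF AUC is an honest estimate of generalization, and the averaged `test_pred` is what would be submitted.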
from sklearn import metrics
from sklearn.metrics import roc_auc_score

"""预测并计算roc的相关指标"""
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the untuned LightGBM model on the validation set: {}'.format(roc_auc))
"""画出roc曲线图"""
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label = 'Val AUC = %0.4f' % roc_auc)
plt.ylim(0,1)
plt.xlim(0,1)
plt.legend(loc='best')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Plot the diagonal (random-guess baseline)
plt.plot([0,1],[0,1],'r--')
plt.show()
AUC of the untuned LightGBM model on the validation set: 0.7187502453796062

[Figure: validation ROC curve]

Parameter tuning

  • Grid search (generally recommended)

  • sklearn provides GridSearchCV for grid search: feed in the model's candidate parameters and it returns the best score and the best parameters. Compared with greedy tuning, grid search tends to find better results, but it only suits small datasets; once the data scale grows, it becomes very hard to finish.

  • Again taking LightGBM as the example, tune it with grid search:

"""通过网格搜索确定最优参数"""
from sklearn.model_selection import GridSearchCV

def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, 
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # 5-fold stratified cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
    
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs= 8
                                  )
    grid_search = GridSearchCV(estimator=model_lgb, 
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                              )
    grid_search.fit(X_train, y_train)

    print('Current best parameters: {}'.format(grid_search.best_params_))
    print('Current best CV score: {}'.format(grid_search.best_score_))
"""以下代码未运行,耗时较长,请谨慎运行,且每一步的最优参数需要在下一步进行手动更新,请注意"""

"""
需要注意一下的是,除了获取上面的获取num_boost_round时候用的是原生的lightgbm(因为要用自带的cv)
下面配合GridSearchCV时必须使用sklearn接口的lightgbm。
"""
"""设置n_estimators 为581,调整num_leaves和max_depth,这里选择先粗调再细调"""
lgb_params = {'num_leaves': range(10, 80, 5), 'max_depth': range(3,10,2)}
get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=None, max_depth=None, min_data_in_leaf=20, 
                   min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, 
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)

"""num_leaves为30,max_depth为7,进一步细调num_leaves和max_depth"""
lgb_params = {'num_leaves': range(25, 35, 1), 'max_depth': range(5,9,1)}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=None, max_depth=None, min_data_in_leaf=20, 
                   min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, 
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)

"""
确定min_data_in_leaf为45,min_child_weight为0.001 ,下面进行bagging_fraction、feature_fraction和bagging_freq的调参
"""
lgb_params = {'bagging_fraction': [i/10 for i in range(5,10,1)], 
              'feature_fraction': [i/10 for i in range(5,10,1)],
              'bagging_freq': range(0,81,10)
             }
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                   min_child_weight=0.001,bagging_fraction=None, feature_fraction=None, bagging_freq=None, 
                   min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)

"""
确定bagging_fraction为0.4、feature_fraction为0.6、bagging_freq为 ,下面进行reg_lambda、reg_alpha的调参
"""
lgb_params = {'reg_lambda': [0,0.001,0.01,0.03,0.08,0.3,0.5], 'reg_alpha': [0,0.001,0.01,0.03,0.08,0.3,0.5]}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                   min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, 
                   min_split_gain=0, reg_lambda=None, reg_alpha=None, param_grid=lgb_params)

"""
确定reg_lambda、reg_alpha都为0,下面进行min_split_gain的调参
"""
lgb_params = {'min_split_gain': [i/10 for i in range(0,11,1)]}
get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                   min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, 
                   min_split_gain=None, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
"""
参数确定好了以后,我们设置一个比较小的learning_rate 0.005,来确定最终的num_boost_round
"""
# 5-fold stratified cross-validation
# cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
final_params = {
                'boosting_type': 'gbdt',
                'learning_rate': 0.01,
                'num_leaves': 29,
                'max_depth': 7,
                'min_data_in_leaf':45,
                'min_child_weight':0.001,
                'bagging_fraction': 0.9,
                'feature_fraction': 0.9,
                'bagging_freq': 40,
                'min_split_gain': 0,
                'reg_lambda':0,
                'reg_alpha':0,
                'nthread': 6
               }

# Build the full-training-set Dataset that lgb.cv expects
lgb_train = lgb.Dataset(X_train, label=y_train)
cv_result = lgb.cv(train_set=lgb_train,
                   early_stopping_rounds=20,
                   num_boost_round=5000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   seed=0,
                  )

print('Number of boosting rounds: {}'.format(len(cv_result['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['auc-mean'])))
  • In practice, first set a fairly large learning rate (0.1 in the example above), determine the number of trees with lightgbm's native cv function, then tune the other parameters with the example code above.

  • Finally, with the best parameters found, set a smaller learning rate (e.g. 0.05), determine the number of trees again with the cv function, and fix the final parameters.

  • Note that on large datasets, each tuning layer above can take a long time.
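When the full grid is too expensive, RandomizedSearchCV (imported earlier alongside GridSearchCV) samples a fixed number of parameter combinations instead of exhausting the grid, which scales much better. A minimal sketch on synthetic data with a sklearn estimator; the estimator, ranges and n_iter here are illustrative assumptions, not from the original:

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Distributions to sample from, rather than a fixed grid
param_dist = {'n_estimators': sp_randint(50, 200),
              'max_depth': sp_randint(3, 10)}

# n_iter=10 evaluates only 10 random combinations
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=10, scoring='roc_auc',
                            cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
                            random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The same idea applies to lgb.LGBMClassifier: swap it in as the estimator and sample the parameter ranges tuned step by step above.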

Summary

  • The basic modeling workflow: split the dataset, evaluate the model with K-fold cross-validation, visualize performance with the ROC curve, and tune the parameters so the model fits the distribution of the data and reaches a good classification result.

II. Evaluation Metric

The submission is, for each test sample, the probability that it is class 1, i.e. the probability that y is 1. Models are evaluated by AUC (the higher the better).

Commonly used classification metrics:

  • Accuracy, AUC, Recall, Precision, F1, Kappa

This learning competition uses AUC.

  • AUC is the area under the ROC curve.
  • The ROC space plots the false positive rate (FPR) on the X axis and the true positive rate (TPR) on the Y axis.
    • TPR: among all samples that are actually positive, the fraction correctly classified as positive.
    • FPR: among all samples that are actually negative, the fraction incorrectly classified as positive.
  • AUC typically ranges between 0.5 and 1; the larger the area, the better the model separates the classes, so an AUC closer to 1.0 indicates a stronger model, and AUC = 1 means perfect separation.
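The definitions above can be checked on a toy example: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Two negatives and two positives with predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Of the 4 (positive, negative) pairs, 3 are ranked correctly
# (0.35 > 0.1, 0.8 > 0.1, 0.8 > 0.4) and 1 is not (0.35 < 0.4)
print(roc_auc_score(y_true, y_score))  # 0.75

# roc_curve returns the FPR (x axis) and TPR (y axis) at each score threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)
```

This pairwise-ranking view also explains why AUC only depends on the ordering of the scores, not on their absolute values.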

III. Submitting Results

Before submitting, make sure the prediction file has the same format as sample_submit.csv and that the file extension is csv.
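A minimal sketch of assembling the submission file. The column names id and isDefault and the probabilities below are placeholders for illustration; in the real pipeline the probabilities come from model.predict(X_test, num_iteration=model.best_iteration), and the header should be checked against sample_submit.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predicted default probabilities (placeholders)
ids = np.arange(800000, 800005)
test_pred = np.array([0.12, 0.80, 0.33, 0.05, 0.55])

# One probability per test sample; keep the header identical to sample_submit.csv
submission = pd.DataFrame({'id': ids, 'isDefault': test_pred})
submission.to_csv('submission.csv', index=False)
print(submission)
```

Writing with index=False keeps the pandas row index out of the file, which would otherwise add an extra unnamed column.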
