0 Background
Give Me Some Credit (https://www.kaggle.com/c/GiveMeSomeCredit/overview) is a Kaggle competition on credit scoring: by improving credit-scoring techniques, we predict the probability that a borrower will run into financial distress within the next two years, and use that prediction to decide whether to extend credit. The goal is to build a model that helps banks make the best lending decisions. In this post we walk through the data exploration and a LightGBM baseline for the task.
The data fields are as follows:
Among them, SeriousDlqin2yrs records whether serious delinquency occurred within the past two years; it is also the field to be predicted on the test set.
Part 1: Import the required packages and data
import numpy as np
import pandas as pd
import os, datetime, sys, random, time
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
plt.style.use('fivethirtyeight')
%matplotlib inline
from scipy import stats, special
import seaborn as sns  # used below for countplot, heatmap and lineplot
import shap  # SHAP values for interpreting feature contributions later on
import warnings
warnings.filterwarnings('ignore')
train_data=pd.read_csv("./GiveMeSomeCredit/cs-training.csv",encoding="utf-8")
test_data=pd.read_csv("./GiveMeSomeCredit/cs-test.csv",encoding="utf-8")
print(train_data.head())
# print(test_data.head())
The printed train set looks like this:
Unnamed: 0 SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0 150000 non-null int64
SeriousDlqin2yrs 150000 non-null int64
RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
age 150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
DebtRatio 150000 non-null float64
MonthlyIncome 120269 non-null float64
NumberOfOpenCreditLinesAndLoans 150000 non-null int64
NumberOfTimes90DaysLate 150000 non-null int64
NumberRealEstateLoansOrLines 150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB
There are 150,000 records in total; MonthlyIncome and NumberOfDependents contain null values and need further handling.
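A quick way to quantify the missing values per column (a minimal check on the frame just loaded):
# count nulls per column: only MonthlyIncome and NumberOfDependents contain them
print(train_data.isnull().sum())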
Part 2: Exploratory analysis and data cleaning
To start with, the 'Unnamed: 0' column is just the record ID, which carries no useful information, so we drop it:
# remove id
dev_train=train_data.drop("Unnamed: 0",axis=1)
# do the same for the test set
print(test_data.info())
dev_test=test_data.drop("Unnamed: 0",axis=1)
(1) Examine the distribution of each column
print(dev_train.describe())
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 1.202690e+05 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 0.066840 6.048438 52.295207 0.421033 353.005076 6.670221e+03 8.452760 0.265973 1.018240 0.240387 0.757222
std 0.249746 249.755371 14.771866 4.192781 2037.818523 1.438467e+04 5.145951 4.169304 1.129771 4.155179 1.115086
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.029867 41.000000 0.000000 0.175074 3.400000e+03 5.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.154181 52.000000 0.000000 0.366508 5.400000e+03 8.000000 0.000000 1.000000 0.000000 0.000000
75% 0.000000 0.559046 63.000000 0.000000 0.868254 8.249000e+03 11.000000 0.000000 2.000000 0.000000 1.000000
max 1.000000 50708.000000 109.000000 98.000000 329664.000000 3.008750e+06 58.000000 98.000000 54.000000 98.000000 20.000000
A closer look at the summary shows:
- SeriousDlqin2yrs: the distribution is far from uniform, i.e. positive and negative samples are significantly imbalanced.
- RevolvingUtilizationOfUnsecuredLines: the maximum is extreme while the mean is small, which suggests many outliers.
- age: the minimum is 0, which must be an outlier, possibly caused by a missing value; the maximum of 109 also looks abnormal.
- NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse all have a maximum of 98 and similar standard deviations, so they may be correlated.
So, let's move on to visual analysis.
# check whether positive and negative samples are balanced
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# pandas' built-in plotting
dev_train['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0, 0.1], autopct="%1.1f%%", ax=axes[0])
axes[0].set_title("SeriousDlqin2yrs")
sns.countplot(x="SeriousDlqin2yrs", data=dev_train, ax=axes[1])
axes[1].set_title("SeriousDlqin2yrs")
plt.show()
The plots show that positive and negative samples are severely imbalanced, which could be addressed by undersampling.
In LightGBM, two relevant parameters can be set: is_unbalance and scale_pos_weight.
is_unbalance: when True, the algorithm tries to automatically rebalance the weight of the dominant label (using the pos/neg ratio of the label column).
scale_pos_weight: defaults to 1, i.e. positive and negative labels are assumed to carry equal weight. For imbalanced datasets the recommended value is given by the formula below (a short sketch of the calculation follows it):
scale_pos_weight = number of negative samples / number of positive samples
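A minimal sketch of that calculation on this dataset (using the train_data frame loaded earlier):
# negative/positive ratio computed from the training labels
neg = (train_data['SeriousDlqin2yrs'] == 0).sum()
pos = (train_data['SeriousDlqin2yrs'] == 1).sum()
print(neg / pos)  # roughly 14 here, given the 93.3% / 6.7% split
# the result could then be passed as scale_pos_weight to lgb.LGBMClassifier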
(2) Analyze the correlation between each feature and the target (SeriousDlqin2yrs)
1. Use regplot() to plot the relationship between each feature and the target:
fig = plt.figure(figsize=[25, 25])
for col, i in zip(dev_train.columns, range(1, 13)):
    axes = fig.add_subplot(7, 2, i)
    sns.regplot(x=dev_train[col], y=dev_train.SeriousDlqin2yrs, ax=axes)
plt.show()
As shown above, it is clear that:
- NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have a strong positive correlation with the label SeriousDlqin2yrs, and their outliers are distributed in a similar way. The positive correlation of NumberOfDependents with the label suggests that more dependents go together with a higher default rate, which matches everyday intuition.
- MonthlyIncome, age and DebtRatio show a clear negative correlation with the label; for MonthlyIncome the negative relationship with default is the most obvious. RevolvingUtilizationOfUnsecuredLines, NumberOfOpenCreditLinesAndLoans and NumberRealEstateLoansOrLines show somewhat weaker negative correlations.
2. Positive-correlation analysis
Draw box plots of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate:
dev_train.boxplot(column=['NumberOfTime30-59DaysPastDueNotWorse',
                          'NumberOfTime60-89DaysPastDueNotWorse',
                          'NumberOfTimes90DaysLate'], figsize=(15, 5))
Next, look at the values above 80 in these three columns:
# value_counts of NumberOfTime30-59DaysPastDueNotWorse at or above 80
print(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80]
      ['NumberOfTime30-59DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTime60-89DaysPastDueNotWorse at or above 80
print(dev_train[dev_train['NumberOfTime60-89DaysPastDueNotWorse'] >= 80]
      ['NumberOfTime60-89DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTimes90DaysLate at or above 80
print(dev_train[dev_train['NumberOfTimes90DaysLate'] >= 80]
      ['NumberOfTimes90DaysLate'].value_counts())
The results are strikingly consistent.
# when NumberOfTime30-59DaysPastDueNotWorse >= 80, check the corresponding values of
# NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate
print(np.unique(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80]
                ['NumberOfTime60-89DaysPastDueNotWorse']))
print(np.unique(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80]
                ['NumberOfTimes90DaysLate']))
In both cases the values are [96, 98].
Meanwhile, look at the value counts of the data below 80:
# value_counts of NumberOfTime30-59DaysPastDueNotWorse below 80
print(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] < 80]
      ['NumberOfTime30-59DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTime60-89DaysPastDueNotWorse below 80
print(dev_train[dev_train['NumberOfTime60-89DaysPastDueNotWorse'] < 80]
      ['NumberOfTime60-89DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTimes90DaysLate below 80
print(dev_train[dev_train['NumberOfTimes90DaysLate'] < 80]
      ['NumberOfTimes90DaysLate'].value_counts())
Result: below 80, the maxima of the three columns are 13, 11 and 17 respectively.
Relative to the size of the whole dataset, these [96, 98] outliers could simply be dropped. But since the test data will show the same pattern, we can instead replace them with each column's largest normal value; each of those maxima occurs only once, so even the largest normal values are very rare.
# correct the outliers
dev_train.loc[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
dev_train.loc[dev_train['NumberOfTime60-89DaysPastDueNotWorse'] >= 80, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
dev_train.loc[dev_train['NumberOfTimes90DaysLate'] >= 80, 'NumberOfTimes90DaysLate'] = 17
This step must also be applied when processing the test data, as sketched below.
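A minimal sketch of the same correction applied to the test set (assuming dev_test was created above by dropping 'Unnamed: 0'):
# apply the same outlier correction to the test set
dev_test.loc[dev_test['NumberOfTime30-59DaysPastDueNotWorse'] >= 80, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
dev_test.loc[dev_test['NumberOfTime60-89DaysPastDueNotWorse'] >= 80, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
dev_test.loc[dev_test['NumberOfTimes90DaysLate'] >= 80, 'NumberOfTimes90DaysLate'] = 17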
3. Negative-correlation analysis
Based on the analysis above, we suspect that DebtRatio is also related to the other negatively correlated factors, so we analyze it further. First, its quartile summary is as follows:
Debt Ratio:
count 150000.000000
mean 353.005076
std 2037.818523
min 0.000000
25% 0.175074
50% 0.366508
75% 0.868254
max 329664.000000
Name: DebtRatio, dtype: float64
Draw the box plot of DebtRatio:
dev_train.boxplot(column=['DebtRatio'],figsize=(5,5))
For a ratio variable, DebtRatio's extreme values are clearly abnormal.
quantiles = [x for x in range(75, 100, 3)]
for i in quantiles:
    print(i, '% quantile of debt ratio is: ', dev_train.DebtRatio.quantile(i / 100))
75 % quantile of debt ratio is: 0.86825377325
78 % quantile of debt ratio is: 1.2750686934600006
81 % quantile of debt ratio is: 14.0
84 % quantile of debt ratio is: 121.0
87 % quantile of debt ratio is: 635.0
90 % quantile of debt ratio is: 1267.0
93 % quantile of debt ratio is: 1917.070000000007
96 % quantile of debt ratio is: 2791.0
99 % quantile of debt ratio is: 4979.040000000037
Examining the data above the 75th percentile, DebtRatio increases sharply from about the 81st percentile onward.
With DebtRatio in mind, let's look at MonthlyIncome, age and the other indicators for these records:
print(dev_train[dev_train['DebtRatio'] >=
                dev_train['DebtRatio'].quantile(0.95)][['age',
                'MonthlyIncome', 'RevolvingUtilizationOfUnsecuredLines',
                'NumberOfOpenCreditLinesAndLoans',
                'NumberRealEstateLoansOrLines',
                'NumberOfDependents']].describe())
age MonthlyIncome RevolvingUtilizationOfUnsecuredLines \
count 7501.000000 379.000000 7501.000000
mean 53.515131 0.084433 10.105646
std 10.987772 0.278403 271.335260
min 25.000000 0.000000 0.000000
25% 46.000000 0.000000 0.043129
50% 54.000000 0.000000 0.188120
75% 62.000000 0.000000 0.535650
max 94.000000 1.000000 13930.000000
NumberOfOpenCreditLinesAndLoans NumberRealEstateLoansOrLines \
count 7501.000000 7501.000000
mean 10.585389 1.924143
std 5.105795 1.227998
min 1.000000 0.000000
25% 7.000000 1.000000
50% 10.000000 2.000000
75% 13.000000 2.000000
max 43.000000 23.000000
NumberOfDependents
count 6997.000000
mean 0.528798
std 1.013590
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 10.000000
Replacing the quantile with explicit DebtRatio values and drilling down step by step, at DebtRatio = 500 there are 20,614 records in total; MonthlyIncome is non-null in 1,305 of them, of which 986 are 0 and 319 are 1 (a ratio of roughly 3:1), and NumberOfDependents is non-null in 18,803 of them, of which 14,444 are 0. These counts can be reproduced with the sketch below.
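A minimal sketch that reproduces those counts (same dev_train frame as above):
# drill down on records with DebtRatio >= 500
high_dr = dev_train[dev_train['DebtRatio'] >= 500]
print(len(high_dr))                                    # total records
print(high_dr['MonthlyIncome'].notnull().sum())        # non-null MonthlyIncome
print(high_dr['MonthlyIncome'].value_counts().head())  # how many of those are 0 vs 1
print(high_dr['NumberOfDependents'].notnull().sum())   # non-null NumberOfDependents
print((high_dr['NumberOfDependents'] == 0).sum())      # how many of those are 0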
Based on this we set up a fill rule for the null values: when DebtRatio >= 500, set the null MonthlyIncome to 0 and the null NumberOfDependents to 0;
when DebtRatio < 500, fill the null MonthlyIncome with the mean and the null NumberOfDependents with the median (the median of NumberOfDependents is 0, so a single median fill covers both branches).
# fill the nulls in NumberOfDependents with the median (which is 0)
dev_train['NumberOfDependents'].fillna(dev_train['NumberOfDependents'].median(), inplace=True)
dev_train.loc[(dev_train['DebtRatio'] >= 500) & (dev_train['MonthlyIncome'].isnull()), 'MonthlyIncome'] = 0.0
dev_train.loc[(dev_train['DebtRatio'] < 500) & (dev_train['MonthlyIncome'].isnull()),
              'MonthlyIncome'] = dev_train['MonthlyIncome'].mean()
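A quick check that the fills worked (no nulls should remain in either column):
print(dev_train[['MonthlyIncome', 'NumberOfDependents']].isnull().sum())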
Next, look at the records with abnormal age values.
# handle age == 0: there is only one such record, so replace it with the median
print(dev_train.loc[dev_train['age'] < 18])
dev_train.loc[dev_train['age'] == 0, 'age'] = dev_train['age'].median()
Finally, run a correlation analysis across all columns:
fig = plt.figure(figsize=[15, 10])
masked = np.zeros_like(dev_train.corr(), dtype=bool)  # np.bool is deprecated; the built-in bool works
masked[np.triu_indices_from(masked)] = True
sns.heatmap(dev_train.corr(), cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9), mask=masked, annot=True, center=0)
plt.title("Correlation Matrix (HeatMap)", fontsize=15)
From the heatmap, the features most strongly correlated with the label SeriousDlqin2yrs are NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, as confirmed numerically below.
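The same ranking can be read off numerically from the correlation matrix; a small sketch:
# correlation of every feature with the label, sorted
print(dev_train.corr()['SeriousDlqin2yrs'].drop('SeriousDlqin2yrs').sort_values(ascending=False))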
Part 3: Baseline
1. Split the dataset
from sklearn import preprocessing, metrics, model_selection, ensemble, tree, linear_model
dev_x = dev_train.drop(['SeriousDlqin2yrs'], axis=1)
dev_y = dev_train['SeriousDlqin2yrs']
# train/validation split
X_train, X_val, y_train, y_val = model_selection.train_test_split(dev_x, dev_y, test_size=0.3, random_state=2020)
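Given the class imbalance noted earlier, an optional variant (not what this run used) is to stratify the split on the label so both sets keep the roughly 6.7% positive rate:
# optional: stratified split that preserves the positive-class ratio
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    dev_x, dev_y, test_size=0.3, random_state=2020, stratify=dev_y)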
2. Build the model
import lightgbm as lgb
lgb_classifer = lgb.LGBMClassifier(objective='binary',  # binary log loss
                                   n_jobs=-1, random_state=2020,
                                   importance_type='gain')  # use gain as the feature-importance measure
lgbParameters={
'max_depth' : [2,3,4,5],
'learning_rate': [0.05, 0.1,0.125,0.15],
'colsample_bytree' : [0.2,0.4,0.6,0.8,1],
'n_estimators' : [400,500,600,700,800,900],
'min_split_gain' : [0.15,0.20,0.25,0.3,0.35], #equivalent to gamma in XGBoost
'subsample': [0.6,0.7,0.8,0.9,1],
'min_child_weight': [6,7,8,9,10],
'scale_pos_weight': [10,15,20],
'min_data_in_leaf' : [100,200,300,400,500,600,700,800,900],
'num_leaves' : [20,30,40,50,60,70,80,90,100]
}
# randomized hyperparameter search with cross-validation
lgbModel = model_selection.RandomizedSearchCV(lgb_classifer,
                                              param_distributions=lgbParameters,
                                              cv=5,  # 5-fold cross-validation
                                              random_state=2020
                                              )
# start training
lgbModel.fit(X_train, y_train, feature_name=X_train.columns.to_list())
3. Get the best parameters from the search
bestEstimatorLGB=lgbModel.best_estimator_
bestEstimatorLGB
""" LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.2,
importance_type='gain', learning_rate=0.125, max_depth=4,
min_child_samples=20, min_child_weight=9, min_data_in_leaf=500,
min_split_gain=0.15, n_estimators=500, n_jobs=-1, num_leaves=80,
objective='binary', random_state=2020, reg_alpha=0.0,
reg_lambda=0.0, scale_pos_weight=10, silent=True, subsample=0.9,
subsample_for_bin=200000, subsample_freq=0)
"""
4. Rebuild the model with the best parameters
# train with the best parameters
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.2,
                                      importance_type='gain',
                                      max_depth=4,
                                      min_child_weight=9,   # minimum sum of sample weights required in a child node
                                      min_data_in_leaf=500,
                                      min_split_gain=0.15,  # minimum gain required to make a split
                                      n_estimators=500,
                                      num_leaves=80,
                                      objective='binary',
                                      random_state=2020,
                                      scale_pos_weight=10,  # weight of the positive class
                                      subsample=0.9,        # fraction of data randomly sampled (without replacement) per iteration
                                      ).fit(X_train, y_train,
                                            feature_name=X_train.columns.to_list())
5. Prediction results on the validation set
val_test_pred_lgb=bestEstimatorLGB.predict(X_val)
print(metrics.classification_report(y_val,val_test_pred_lgb))
precision recall f1-score support
0 0.97 0.87 0.92 41969
1 0.27 0.67 0.39 3031
accuracy 0.86 45000
macro avg 0.62 0.77 0.65 45000
weighted avg 0.93 0.86 0.88 45000
6. Evaluate the predictions with various metrics
# confusion matrix
metrics.confusion_matrix(y_val, val_test_pred_lgb)
LGBMMetrics=pd.DataFrame({'Model':'LightGBM',
'MSE':round(metrics.mean_squared_error(y_val,val_test_pred_lgb)*100,2),
'RMSE':round(np.sqrt(metrics.mean_squared_error(y_val,val_test_pred_lgb)*100),2),
'MAE':round(metrics.mean_absolute_error(y_val,val_test_pred_lgb)*100,2),
'Accuracy Train':round(bestEstimatorLGB.score(X_train,y_train)*100,2),
'Accuracy Test': round(bestEstimatorLGB.score(X_val,y_val)*100,2),
'F-Beta Score (B=2)':round(metrics.fbeta_score(y_val,
val_test_pred_lgb,
beta=2)*100,2)
},index=[1])
print(LGBMMetrics)
Model MSE RMSE MAE Accuracy Train Accuracy Test F-1 F-Beta Score (B=2)
1 LightGBM 14.34 3.79 14.34 86.12 85.66 51.5 51.53
Note: MSE = mean squared error; RMSE = root mean squared error; MAE = mean absolute error; F-Beta Score = F2 score (recall weighted more heavily than precision).
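For reference, fbeta_score with beta=2 follows the standard F-beta formula, in which recall is weighted beta^2 times as heavily as precision; a minimal check against the library value:
precision = metrics.precision_score(y_val, val_test_pred_lgb)
recall = metrics.recall_score(y_val, val_test_pred_lgb)
beta = 2
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
print(f_beta, metrics.fbeta_score(y_val, val_test_pred_lgb, beta=beta))  # the two values should match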
Plot the ROC curve and compute the AUC:
val_pred_lgb=bestEstimatorLGB.predict_proba(X_val)
val_pred_lgb=val_pred_lgb[:,1]
# roc_curve takes the true labels and predicted probabilities and returns the false positive rate and true positive rate
fpr, tpr, _ = metrics.roc_curve(y_val, val_pred_lgb)
rocAuc = metrics.auc(fpr, tpr)
plt.figure(figsize=(12, 6))
plt.title("ROC Curve")
sns.lineplot(x=fpr, y=tpr, label="AUC for LightGBM Model = %0.2f" % rocAuc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
# check how much importance the GBDT assigns to each feature
lgb.plot_importance(bestEstimatorLGB,importance_type='gain')
Initial prediction and submission:
# apply the same cleaning steps to dev_test first, then predict; the empty label column is
# dropped, and the test set's 'Unnamed: 0' column is used as the submission ID
test_id = test_data['Unnamed: 0']
lgb_probs = bestEstimatorLGB.predict_proba(dev_test.drop('SeriousDlqin2yrs', axis=1))
lgb_df = pd.DataFrame({'ID': test_id, 'Probability': lgb_probs[:, 1]})
lgb_df.to_csv('./submission.csv', index=False)
After submitting on the Kaggle submission page, we get the following result:
Optimization:
1. As a first optimization, try widening the search ranges in lgbParameters, as follows:
lgbParameters={
'max_depth' : [3,4,5,6,7],
'learning_rate': [0.01,0.025,0.05, 0.075,0.1,0.125],
'colsample_bytree' : [0.2,0.4,0.6,0.8,1],
'n_estimators' : [400,450,500,550,600,650,700,800],
'min_split_gain' : [0.15,0.20,0.25,0.3,0.35], #equivalent to gamma in XGBoost
'subsample': [0.65,0.7,0.75,0.8,0.85,0.9,0.95,1],
'min_child_weight': [6,7,8,9,10,11],
'min_child_samples':[x for x in range(15,70,10)],
'scale_pos_weight': [3,5,8,10,13,15],
'min_data_in_leaf' : [150,200,350,450,500,550,600,650,700],
'num_leaves' : [5,15,25,35,40,45,55,60,70,75]
}
This yields a new set of best parameters:
bestEstimatorLGB=lgbModel.best_estimator_
bestEstimatorLGB
"""
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.8,
importance_type='gain', learning_rate=0.01, max_depth=6,
min_child_samples=25, min_child_weight=6, min_data_in_leaf=650,
min_split_gain=0.15, n_estimators=600, n_jobs=-1, num_leaves=75,
objective='binary', random_state=2020, reg_alpha=0.0,
reg_lambda=0.0, scale_pos_weight=3, silent=True, subsample=1,
subsample_for_bin=200000, subsample_freq=0)
"""
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.8,  # feature subsampling
                                      importance_type='gain',
                                      learning_rate=0.01,
                                      max_depth=6,
                                      min_child_samples=25,
                                      min_child_weight=6,   # minimum sum of sample weights required in a child node
                                      min_data_in_leaf=650,
                                      min_split_gain=0.15,  # minimum gain required to make a split
                                      n_estimators=600,
                                      num_leaves=75,
                                      objective='binary',
                                      random_state=2020,
                                      scale_pos_weight=3,   # weight of the positive class
                                      subsample=1,          # fraction of data randomly sampled (without replacement) per iteration
                                      ).fit(X_train, y_train,
                                            feature_name=X_train.columns.to_list())
With the model rebuilt from these best parameters, the AUC on the validation set reaches 0.87.
Submitting the new predictions to Kaggle gives the following result:
Although the score improved by only about 0.003, it moved the submission up more than 300 places on the leaderboard.
2. Further analysis and optimization
We use the shap package to examine each feature's contribution to the final score.
explainer=shap.TreeExplainer(bestEstimatorLGB)
shap_values=explainer.shap_values(X_train)
shap.summary_plot(shap_values[1],X_train)
From the plot, for RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, larger values produce larger SHAP values, which shows these features contribute heavily to the final score. Meanwhile, DebtRatio, MonthlyIncome and NumberOfDependents are tightly clustered and contribute little, so we try combining them into new features.
# new features
new_dev_x = dev_x.copy()
new_dev_x['monthlyIncomePerPerson'] = new_dev_x['MonthlyIncome'] / (new_dev_x['NumberOfDependents'] + 1)
# approximate monthly debt payment (DebtRatio is debt payment divided by income)
new_dev_x['monthlyDebt'] = new_dev_x['MonthlyIncome'] * new_dev_x['DebtRatio']
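The original text does not show it, but the later validation report (support of 30,000) implies the data were split again with the new features before refitting; a sketch assuming a 20% validation set:
# assumed step: re-split using the feature-engineered frame
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    new_dev_x, dev_y, test_size=0.2, random_state=2020)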
This gives a new best model:
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.6,  # feature subsampling
                                      importance_type='gain',
                                      learning_rate=0.025,
                                      max_depth=7,
                                      min_child_samples=55,
                                      min_child_weight=10,  # minimum sum of sample weights required in a child node
                                      min_data_in_leaf=450,
                                      min_split_gain=0.35,  # minimum gain required to make a split
                                      n_estimators=450,
                                      num_leaves=75,
                                      objective='binary',
                                      random_state=2020,
                                      scale_pos_weight=5,   # weight of the positive class
                                      subsample=0.9,        # fraction of data randomly sampled (without replacement) per iteration
                                      ).fit(X_train, y_train,
                                            feature_name=X_train.columns.to_list())
val_test_pred_lgb=bestEstimatorLGB.predict(X_val)
print(metrics.classification_report(y_val,val_test_pred_lgb))
"""
precision recall f1-score support
0 0.97 0.93 0.95 27980
1 0.37 0.56 0.44 2020
accuracy 0.91 30000
macro avg 0.67 0.74 0.70 30000
weighted avg 0.93 0.91 0.91 30000
"""
val_pred_lgb=bestEstimatorLGB.predict_proba(X_val)
val_pred_lgb=val_pred_lgb[:,1]
# confusion matrix
metrics.confusion_matrix(y_val, val_test_pred_lgb)
fpr, tpr, _ = metrics.roc_curve(y_val, val_pred_lgb)
rocAuc = metrics.auc(fpr, tpr)
plt.figure(figsize=(12, 6))
plt.title("ROC Curve")
sns.lineplot(x=fpr, y=tpr, label="AUC for LightGBM Model = %0.2f" % rocAuc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
# feature importances by gain
lgb.plot_importance(bestEstimatorLGB, importance_type='gain')
Apart from the increased contribution of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, the final AUC improvement is quite limited. The result of the Kaggle submission is as follows: