0 Background
Give Me Some Credit (https://www.kaggle.com/c/GiveMeSomeCredit/overview) is a Kaggle competition on credit scoring: by improving credit-scoring techniques, we predict the probability that a borrower will run into financial distress within the next two years, and use that prediction to decide whether to extend credit. The goal is to build a model that helps banks make the best lending decisions. In this post we walk through the data exploration and a LightGBM baseline for the task.
The data fields are as follows:
Among them, SeriousDlqin2yrs records whether serious delinquency occurred within the past two years; it is also the field to be predicted on the test set.
Part 1: Import the required packages and data
import numpy as np
import pandas as pd
import os, datetime, sys, random, time
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
plt.style.use('fivethirtyeight')
%matplotlib inline
from scipy import stats, special
import seaborn as sns  # used below for countplot, heatmap and lineplot
import shap  # SHAP values for interpreting feature contributions later on
import warnings
warnings.filterwarnings('ignore')
train_data=pd.read_csv("./GiveMeSomeCredit/cs-training.csv",encoding="utf-8")
test_data=pd.read_csv("./GiveMeSomeCredit/cs-test.csv",encoding="utf-8")
print(train_data.head())
# print(test_data.head())
The printed train set looks like this:
Unnamed: 0 SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0 150000 non-null int64
SeriousDlqin2yrs 150000 non-null int64
RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
age 150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
DebtRatio 150000 non-null float64
MonthlyIncome 120269 non-null float64
NumberOfOpenCreditLinesAndLoans 150000 non-null int64
NumberOfTimes90DaysLate 150000 non-null int64
NumberRealEstateLoansOrLines 150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB
There are 150,000 records in total; MonthlyIncome and NumberOfDependents contain null values and need further handling.
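A quick way to quantify the missing values per column (a minimal check on the frame just loaded):
# count nulls per column: only MonthlyIncome and NumberOfDependents contain them
print(train_data.isnull().sum())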
Part 2: Exploratory analysis and data cleaning
To start with, the 'Unnamed: 0' column is just the record ID, which carries no useful information, so we drop it:
# remove id
dev_train=train_data.drop("Unnamed: 0",axis=1)
# do the same for the test set
print(test_data.info())
dev_test=test_data.drop("Unnamed: 0",axis=1)
(1) Examine the distribution of each column
print(dev_train.describe())
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 1.202690e+05 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 0.066840 6.048438 52.295207 0.421033 353.005076 6.670221e+03 8.452760 0.265973 1.018240 0.240387 0.757222
std 0.249746 249.755371 14.771866 4.192781 2037.818523 1.438467e+04 5.145951 4.169304 1.129771 4.155179 1.115086
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.029867 41.000000 0.000000 0.175074 3.400000e+03 5.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.154181 52.000000 0.000000 0.366508 5.400000e+03 8.000000 0.000000 1.000000 0.000000 0.000000
75% 0.000000 0.559046 63.000000 0.000000 0.868254 8.249000e+03 11.000000 0.000000 2.000000 0.000000 1.000000
max 1.000000 50708.000000 109.000000 98.000000 329664.000000 3.008750e+06 58.000000 98.000000 54.000000 98.000000 20.000000
A closer look at the summary shows:
- SeriousDlqin2yrs: the distribution is far from uniform, i.e. positive and negative samples are significantly imbalanced.
- RevolvingUtilizationOfUnsecuredLines: the maximum is extreme while the mean is small, which suggests many outliers.
- age: the minimum is 0, which must be an outlier, possibly caused by a missing value; the maximum of 109 also looks abnormal.
- NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse all have a maximum of 98 and similar standard deviations, so they may be correlated.
So, let's move on to visual analysis.
# check whether positive and negative samples are balanced
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# pandas' built-in plotting
dev_train['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0, 0.1], autopct="%1.1f%%", ax=axes[0])
axes[0].set_title("SeriousDlqin2yrs")
sns.countplot(x="SeriousDlqin2yrs", data=dev_train, ax=axes[1])
axes[1].set_title("SeriousDlqin2yrs")
plt.show()
The plots show that positive and negative samples are severely imbalanced, which could be addressed by undersampling.
In LightGBM, two relevant parameters can be set: is_unbalance and scale_pos_weight.
is_unbalance: when True, the algorithm tries to automatically rebalance the weight of the dominant label (using the pos/neg ratio of the label column).
scale_pos_weight: defaults to 1, i.e. positive and negative labels are assumed to carry equal weight. For imbalanced datasets the recommended value is given by the formula below (a short sketch of the calculation follows it):
scale_pos_weight = number of negative samples / number of positive samples
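A minimal sketch of that calculation on this dataset (using the train_data frame loaded earlier):
# negative/positive ratio computed from the training labels
neg = (train_data['SeriousDlqin2yrs'] == 0).sum()
pos = (train_data['SeriousDlqin2yrs'] == 1).sum()
print(neg / pos)  # roughly 14 here, given the 93.3% / 6.7% split
# the result could then be passed as scale_pos_weight to lgb.LGBMClassifier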
(2) Analyze the correlation between each feature and the target (SeriousDlqin2yrs)
1. Use regplot() to plot the relationship between each feature and the target:
fig = plt.figure(figsize=[25, 25])
for col, i in zip(dev_train.columns, range(1, 13)):
    axes = fig.add_subplot(7, 2, i)
    sns.regplot(x=dev_train[col], y=dev_train.SeriousDlqin2yrs, ax=axes)
plt.show()
As shown above, it is clear that:
- NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have a strong positive correlation with the label SeriousDlqin2yrs, and their outliers are distributed in a similar way. The positive correlation of NumberOfDependents with the label suggests that more dependents go together with a higher default rate, which matches everyday intuition.
- MonthlyIncome, age and DebtRatio show a clear negative correlation with the label; for MonthlyIncome the negative relationship with default is the most obvious. RevolvingUtilizationOfUnsecuredLines, NumberOfOpenCreditLinesAndLoans and NumberRealEstateLoansOrLines show somewhat weaker negative correlations.
2. Positive-correlation analysis
Draw box plots of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate:
dev_train.boxplot(column=['NumberOfTime30-59DaysPastDueNotWorse',
                          'NumberOfTime60-89DaysPastDueNotWorse',
                          'NumberOfTimes90DaysLate'], figsize=(15, 5))
Next, look at the values above 80 in these three columns:
# value_counts of NumberOfTime30-59DaysPastDueNotWorse at or above 80
print(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80]
      ['NumberOfTime30-59DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTime60-89DaysPastDueNotWorse at or above 80
print(dev_train[dev_train['NumberOfTime60-89DaysPastDueNotWorse'] >= 80]
      ['NumberOfTime60-89DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTimes90DaysLate at or above 80
print(dev_train[dev_train['NumberOfTimes90DaysLate'] >= 80]
      ['NumberOfTimes90DaysLate'].value_counts())
The results are strikingly consistent.
# when NumberOfTime30-59DaysPastDueNotWorse >= 80, check the corresponding values of
# NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate
print(np.unique(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80]
                ['NumberOfTime60-89DaysPastDueNotWorse']))
print(np.unique(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80]
                ['NumberOfTimes90DaysLate']))
In both cases the values are [96, 98].
Meanwhile, look at the value counts of the data below 80:
# value_counts of NumberOfTime30-59DaysPastDueNotWorse below 80
print(dev_train[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] < 80]
      ['NumberOfTime30-59DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTime60-89DaysPastDueNotWorse below 80
print(dev_train[dev_train['NumberOfTime60-89DaysPastDueNotWorse'] < 80]
      ['NumberOfTime60-89DaysPastDueNotWorse'].value_counts())
# value_counts of NumberOfTimes90DaysLate below 80
print(dev_train[dev_train['NumberOfTimes90DaysLate'] < 80]
      ['NumberOfTimes90DaysLate'].value_counts())
Result: below 80, the maxima of the three columns are 13, 11 and 17 respectively.
Relative to the size of the whole dataset, these [96, 98] outliers could simply be dropped. But since the test data will show the same pattern, we can instead replace them with each column's largest normal value; each of those maxima occurs only once, so even the largest normal values are very rare.
# correct the outliers
dev_train.loc[dev_train['NumberOfTime30-59DaysPastDueNotWorse'] >= 80, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
dev_train.loc[dev_train['NumberOfTime60-89DaysPastDueNotWorse'] >= 80, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
dev_train.loc[dev_train['NumberOfTimes90DaysLate'] >= 80, 'NumberOfTimes90DaysLate'] = 17
This step must also be applied when processing the test data, as sketched below.
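A minimal sketch of the same correction applied to the test set (assuming dev_test was created above by dropping 'Unnamed: 0'):
# apply the same outlier correction to the test set
dev_test.loc[dev_test['NumberOfTime30-59DaysPastDueNotWorse'] >= 80, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
dev_test.loc[dev_test['NumberOfTime60-89DaysPastDueNotWorse'] >= 80, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
dev_test.loc[dev_test['NumberOfTimes90DaysLate'] >= 80, 'NumberOfTimes90DaysLate'] = 17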
3. Negative-correlation analysis
Based on the analysis above, we suspect that DebtRatio is also related to the other negatively correlated factors, so we analyze it further. First, its quartile summary is as follows:
Debt Ratio:
count 150000.000000
mean 353.005076
std 2037.818523
min 0.000000
25% 0.175074
50% 0.366508
75% 0.868254
max 329664.000000
Name: DebtRatio, dtype: float64
Draw the box plot of DebtRatio:
dev_train.boxplot(column=['DebtRatio'],figsize=(5,5))
For a ratio variable, DebtRatio's extreme values are clearly abnormal.
quantiles = [x for x in range(75, 100, 3)]
for i in quantiles:
    print(i, '% quantile of debt ratio is: ', dev_train.DebtRatio.quantile(i / 100))
75 % quantile of debt ratio is: 0.86825377325
78 % quantile of debt ratio is: 1.2750686934600006
81 % quantile of debt ratio is: 14.0
84 % quantile of debt ratio is: 121.0
87 % quantile of debt ratio is: 635.0
90 % quantile of debt ratio is: 1267.0
93 % quantile of debt ratio is: 1917.070000000007
96 % quantile of debt ratio is: 2791.0
99 % quantile of debt ratio is: 4979.040000000037
Examining the data above the 75th percentile, DebtRatio increases sharply from about the 81st percentile onward.
With DebtRatio in mind, let's look at MonthlyIncome, age and the other indicators for these records:
print(dev_train[dev_train['DebtRatio'] >=
                dev_train['DebtRatio'].quantile(0.95)][['age',
                'MonthlyIncome', 'RevolvingUtilizationOfUnsecuredLines',
                'NumberOfOpenCreditLinesAndLoans',
                'NumberRealEstateLoansOrLines',
                'NumberOfDependents']].describe())
age MonthlyIncome RevolvingUtilizationOfUnsecuredLines \
count 7501.000000 379.000000 7501.000000
mean 53.515131 0.084433 10.105646
std 10.987772 0.278403 271.335260
min 25.000000 0.000000 0.000000
25% 46.000000 0.000000 0.043129
50% 54.000000 0.000000 0.188120
75% 62.000000 0.000000 0.535650
max 94.000000 1.000000 13930.000000
NumberOfOpenCreditLinesAndLoans NumberRealEstateLoansOrLines \
count 7501.000000 7501.000000
mean 10.585389 1.924143
std 5.105795 1.227998
min 1.000000 0.000000
25% 7.000000 1.000000
50% 10.000000 2.000000
75% 13.000000 2.000000
max 43.000000 23.000000
NumberOfDependents
count 6997.000000
mean 0.528798
std 1.013590
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 10.000000
Replacing the quantile with explicit DebtRatio values and drilling down step by step, at DebtRatio = 500 there are 20,614 records in total; MonthlyIncome is non-null in 1,305 of them, of which 986 are 0 and 319 are 1 (a ratio of roughly 3:1), and NumberOfDependents is non-null in 18,803 of them, of which 14,444 are 0. These counts can be reproduced with the sketch below.
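A minimal sketch that reproduces those counts (same dev_train frame as above):
# drill down on records with DebtRatio >= 500
high_dr = dev_train[dev_train['DebtRatio'] >= 500]
print(len(high_dr))                                    # total records
print(high_dr['MonthlyIncome'].notnull().sum())        # non-null MonthlyIncome
print(high_dr['MonthlyIncome'].value_counts().head())  # how many of those are 0 vs 1
print(high_dr['NumberOfDependents'].notnull().sum())   # non-null NumberOfDependents
print((high_dr['NumberOfDependents'] == 0).sum())      # how many of those are 0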
Based on this we set up a fill rule for the null values: when DebtRatio >= 500, set the null MonthlyIncome to 0 and the null NumberOfDependents to 0;
when DebtRatio < 500, fill the null MonthlyIncome with the mean and the null NumberOfDependents with the median (the median of NumberOfDependents is 0, so a single median fill covers both branches).
# fill the nulls in NumberOfDependents with the median (which is 0)
dev_train['NumberOfDependents'].fillna(dev_train['NumberOfDependents'].median(), inplace=True)
dev_train.loc[(dev_train['DebtRatio'] >= 500) & (dev_train['MonthlyIncome'].isnull()), 'MonthlyIncome'] = 0.0
dev_train.loc[(dev_train['DebtRatio'] < 500) & (dev_train['MonthlyIncome'].isnull()),
              'MonthlyIncome'] = dev_train['MonthlyIncome'].mean()
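A quick check that the fills worked (no nulls should remain in either column):
print(dev_train[['MonthlyIncome', 'NumberOfDependents']].isnull().sum())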
Next, look at the records with abnormal age values.
# handle age == 0: there is only one such record, so replace it with the median
print(dev_train.loc[dev_train['age'] < 18])
dev_train.loc[dev_train['age'] == 0, 'age'] = dev_train['age'].median()
Finally, run a correlation analysis across all columns:
fig = plt.figure(figsize=[15, 10])
masked = np.zeros_like(dev_train.corr(), dtype=bool)  # np.bool is deprecated; the built-in bool works
masked[np.triu_indices_from(masked)] = True
sns.heatmap(dev_train.corr(), cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9), mask=masked, annot=True, center=0)
plt.title("Correlation Matrix (HeatMap)", fontsize=15)
From the heatmap, the features most strongly correlated with the label SeriousDlqin2yrs are NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, as confirmed numerically below.
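The same ranking can be read off numerically from the correlation matrix; a small sketch:
# correlation of every feature with the label, sorted
print(dev_train.corr()['SeriousDlqin2yrs'].drop('SeriousDlqin2yrs').sort_values(ascending=False))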
Part 3: Baseline
1. Split the dataset
from sklearn import preprocessing, metrics, model_selection, ensemble, tree, linear_model
dev_x = dev_train.drop(['SeriousDlqin2yrs'], axis=1)
dev_y = dev_train['SeriousDlqin2yrs']
# train/validation split
X_train, X_val, y_train, y_val = model_selection.train_test_split(dev_x, dev_y, test_size=0.3, random_state=2020)
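Given the class imbalance noted earlier, an optional variant (not what this run used) is to stratify the split on the label so both sets keep the roughly 6.7% positive rate:
# optional: stratified split that preserves the positive-class ratio
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    dev_x, dev_y, test_size=0.3, random_state=2020, stratify=dev_y)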
2. Build the model
import lightgbm as lgb
lgb_classifer = lgb.LGBMClassifier(objective='binary',  # binary log loss
                                   n_jobs=-1, random_state=2020,
                                   importance_type='gain')  # use gain as the feature-importance measure
lgbParameters={
'max_depth' : [2,3,4,5],
'learning_rate': [0.05, 0.1,0.125,0.15],
'colsample_bytree' : [0.2,0.4,0.6,0.8,1],
'n_estimators' : [400,500,600,700,800,900],
'min_split_gain' : [0.15,0.20,0.25,0.3,0.35], #equivalent to gamma in XGBoost
'subsample': [0.6,0.7,0.8,0.9,1],
'min_child_weight': [6,7,8,9,10],
'scale_pos_weight': [10,15,20],
'min_data_in_leaf' : [100,200,300,400,500,600,700,800,900],
'num_leaves' : [20,30,40,50,60,70,80,90,100]
}
# randomized hyperparameter search with cross-validation
lgbModel = model_selection.RandomizedSearchCV(lgb_classifer,
                                              param_distributions=lgbParameters,
                                              cv=5,  # 5-fold cross-validation
                                              random_state=2020
                                              )
# start training
lgbModel.fit(X_train, y_train, feature_name=X_train.columns.to_list())
3. Get the best parameters from the search
bestEstimatorLGB=lgbModel.best_estimator_
bestEstimatorLGB
""" LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.2,
importance_type='gain', learning_rate=0.125, max_depth=4,
min_child_samples=20, min_child_weight=9, min_data_in_leaf=500,
min_split_gain=0.15, n_estimators=500, n_jobs=-1, num_leaves=80,
objective='binary', random_state=2020, reg_alpha=0.0,
reg_lambda=0.0, scale_pos_weight=10, silent=True, subsample=0.9,
subsample_for_bin=200000, subsample_freq=0)
"""
4. Rebuild the model with the best parameters
# train with the best parameters
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.2,
                                      importance_type='gain',
                                      max_depth=4,
                                      min_child_weight=9,   # minimum sum of sample weights required in a child node
                                      min_data_in_leaf=500,
                                      min_split_gain=0.15,  # minimum gain required to make a split
                                      n_estimators=500,
                                      num_leaves=80,
                                      objective='binary',
                                      random_state=2020,
                                      scale_pos_weight=10,  # weight of the positive class
                                      subsample=0.9,        # fraction of data randomly sampled (without replacement) per iteration
                                      ).fit(X_train, y_train,
                                            feature_name=X_train.columns.to_list())
5. Prediction results on the validation set
val_test_pred_lgb=bestEstimatorLGB.predict(X_val)
print(metrics.classification_report(y_val,val_test_pred_lgb))
precision recall f1-score support
0 0.97 0.87 0.92 41969
1 0.27 0.67 0.39 3031
accuracy 0.86 45000
macro avg 0.62 0.77 0.65 45000
weighted avg 0.93 0.86 0.88 45000
6. Evaluate the predictions with various metrics
# confusion matrix
metrics.confusion_matrix(y_val, val_test_pred_lgb)
LGBMMetrics=pd.DataFrame({'Model':'LightGBM',
'MSE':round(metrics.mean_squared_error(y_val,val_test_pred_lgb)*100,2),
'RMSE':round(np.sqrt(metrics.mean_squared_error(y_val,val_test_pred_lgb)*100),2),
'MAE':round(metrics.mean_absolute_error(y_val,val_test_pred_lgb)*100,2),
'Accuracy Train':round(bestEstimatorLGB.score(X_train,y_train)*100,2),
'Accuracy Test': round(bestEstimatorLGB.score(X_val,y_val)*100,2),
'F-Beta Score (B=2)':round(metrics.fbeta_score(y_val,
val_test_pred_lgb,
beta=2)*100,2)
},index=[1])
print(LGBMMetrics)
Model MSE RMSE MAE Accuracy Train Accuracy Test F-1 F-Beta Score (B=2)
1 LightGBM 14.34 3.79 14.34 86.12 85.66 51.5 51.53
Note: MSE = mean squared error; RMSE = root mean squared error; MAE = mean absolute error; F-Beta Score = F2 score (recall weighted more heavily than precision).
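For reference, fbeta_score with beta=2 follows the standard F-beta formula, in which recall is weighted beta^2 times as heavily as precision; a minimal check against the library value:
precision = metrics.precision_score(y_val, val_test_pred_lgb)
recall = metrics.recall_score(y_val, val_test_pred_lgb)
beta = 2
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
print(f_beta, metrics.fbeta_score(y_val, val_test_pred_lgb, beta=beta))  # the two values should match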
Plot the ROC curve and compute the AUC:
val_pred_lgb=bestEstimatorLGB.predict_proba(X_val)
val_pred_lgb=val_pred_lgb[:,1]
# roc_curve takes the true labels and predicted probabilities and returns the false positive rate and true positive rate
fpr, tpr, _ = metrics.roc_curve(y_val, val_pred_lgb)
rocAuc = metrics.auc(fpr, tpr)
plt.figure(figsize=(12, 6))
plt.title("ROC Curve")
sns.lineplot(x=fpr, y=tpr, label="AUC for LightGBM Model = %0.2f" % rocAuc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
# check how much importance the GBDT assigns to each feature
lgb.plot_importance(bestEstimatorLGB,importance_type='gain')
Initial prediction and submission:
# apply the same cleaning steps to dev_test first, then predict; the empty label column is
# dropped, and the test set's 'Unnamed: 0' column is used as the submission ID
test_id = test_data['Unnamed: 0']
lgb_probs = bestEstimatorLGB.predict_proba(dev_test.drop('SeriousDlqin2yrs', axis=1))
lgb_df = pd.DataFrame({'ID': test_id, 'Probability': lgb_probs[:, 1]})
lgb_df.to_csv('./submission.csv', index=False)
After submitting on the Kaggle submission page, we get the following result:
Optimization:
1. As a first optimization, try widening the search ranges in lgbParameters, as follows:
lgbParameters={
'max_depth' : [3,4,5,6,7],
'learning_rate': [0.01,0.025,0.05, 0.075,0.1,0.125],
'colsample_bytree' : [0.2,0.4,0.6,0.8,1],
'n_estimators' : [400,450,500,550,600,650,700,800],
'min_split_gain' : [0.15,0.20,0.25,0.3,0.35], #equivalent to gamma in XGBoost
'subsample': [0.65,0.7,0.75,0.8,0.85,0.9,0.95,1],
'min_child_weight': [6,7,8,9,10,11],
'min_child_samples':[x for x in range(15,70,10)],
'scale_pos_weight': [3,5,8,10,13,15],
'min_data_in_leaf' : [150,200,350,450,500,550,600,650,700],
'num_leaves' : [5,15,25,35,40,45,55,60,70,75]
}
This yields a new set of best parameters:
bestEstimatorLGB=lgbModel.best_estimator_
bestEstimatorLGB
"""
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.8,
importance_type='gain', learning_rate=0.01, max_depth=6,
min_child_samples=25, min_child_weight=6, min_data_in_leaf=650,
min_split_gain=0.15, n_estimators=600, n_jobs=-1, num_leaves=75,
objective='binary', random_state=2020, reg_alpha=0.0,
reg_lambda=0.0, scale_pos_weight=3, silent=True, subsample=1,
subsample_for_bin=200000, subsample_freq=0)
"""
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.8,  # feature subsampling
                                      importance_type='gain',
                                      learning_rate=0.01,
                                      max_depth=6,
                                      min_child_samples=25,
                                      min_child_weight=6,   # minimum sum of sample weights required in a child node
                                      min_data_in_leaf=650,
                                      min_split_gain=0.15,  # minimum gain required to make a split
                                      n_estimators=600,
                                      num_leaves=75,
                                      objective='binary',
                                      random_state=2020,
                                      scale_pos_weight=3,   # weight of the positive class
                                      subsample=1,          # fraction of data randomly sampled (without replacement) per iteration
                                      ).fit(X_train, y_train,
                                            feature_name=X_train.columns.to_list())
With the model rebuilt from these best parameters, the AUC on the validation set reaches 0.87.
Submitting the new predictions to Kaggle gives the following result:
Although the score improved by only about 0.003, it moved the submission up more than 300 places on the leaderboard.
2. Further analysis and optimization
We use the shap package to examine each feature's contribution to the final score.
explainer=shap.TreeExplainer(bestEstimatorLGB)
shap_values=explainer.shap_values(X_train)
shap.summary_plot(shap_values[1],X_train)
From the plot, for RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, larger values produce larger SHAP values, which shows these features contribute heavily to the final score. Meanwhile, DebtRatio, MonthlyIncome and NumberOfDependents are tightly clustered and contribute little, so we try combining them into new features.
# new features
new_dev_x = dev_x.copy()
new_dev_x['monthlyIncomePerPerson'] = new_dev_x['MonthlyIncome'] / (new_dev_x['NumberOfDependents'] + 1)
# approximate monthly debt payment (DebtRatio is debt payment divided by income)
new_dev_x['monthlyDebt'] = new_dev_x['MonthlyIncome'] * new_dev_x['DebtRatio']
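The original text does not show it, but the later validation report (support of 30,000) implies the data were split again with the new features before refitting; a sketch assuming a 20% validation set:
# assumed step: re-split using the feature-engineered frame
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    new_dev_x, dev_y, test_size=0.2, random_state=2020)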
This gives a new best model:
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.6,  # feature subsampling
                                      importance_type='gain',
                                      learning_rate=0.025,
                                      max_depth=7,
                                      min_child_samples=55,
                                      min_child_weight=10,  # minimum sum of sample weights required in a child node
                                      min_data_in_leaf=450,
                                      min_split_gain=0.35,  # minimum gain required to make a split
                                      n_estimators=450,
                                      num_leaves=75,
                                      objective='binary',
                                      random_state=2020,
                                      scale_pos_weight=5,   # weight of the positive class
                                      subsample=0.9,        # fraction of data randomly sampled (without replacement) per iteration
                                      ).fit(X_train, y_train,
                                            feature_name=X_train.columns.to_list())
val_test_pred_lgb=bestEstimatorLGB.predict(X_val)
print(metrics.classification_report(y_val,val_test_pred_lgb))
"""
precision recall f1-score support
0 0.97 0.93 0.95 27980
1 0.37 0.56 0.44 2020
accuracy 0.91 30000
macro avg 0.67 0.74 0.70 30000
weighted avg 0.93 0.91 0.91 30000
"""
val_pred_lgb=bestEstimatorLGB.predict_proba(X_val)
val_pred_lgb=val_pred_lgb[:,1]
# confusion matrix
metrics.confusion_matrix(y_val, val_test_pred_lgb)
fpr, tpr, _ = metrics.roc_curve(y_val, val_pred_lgb)
rocAuc = metrics.auc(fpr, tpr)
plt.figure(figsize=(12, 6))
plt.title("ROC Curve")
sns.lineplot(x=fpr, y=tpr, label="AUC for LightGBM Model = %0.2f" % rocAuc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
# feature importances by gain
lgb.plot_importance(bestEstimatorLGB, importance_type='gain')
Apart from the increased contribution of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, the final AUC improvement is quite limited. The result of the Kaggle submission is as follows: