A Credit Scorecard Model Based on LogisticRegression
Introduction:
- A thriving market that meets consumer demand requires that both individuals and companies can obtain credit. For the lender, the central problem most banks face is credit risk: the risk that the borrower fails to repay. Understanding the causes of credit risk and controlling it strengthens a bank's operations and helps allocate resources effectively.
- A credit scorecard measures a customer's credit risk as a single score, and banks use it to decide whether a loan should be granted. For individuals, four kinds of scorecards are commonly distinguished: the A card (application scorecard), B card (behavior scorecard), C card (collection scorecard), and F card (fraud scorecard). The "scorecard" people usually talk about is the A card, also called the applicant rating model, used mainly to rate new customers in financing businesses.
- This project builds a credit scorecard with a machine-learning algorithm to predict the probability that a credit-card applicant will default, helping lenders make better decisions.
Approach:
First, understand the business: what a scorecard is and how it is built. Then load the data and perform exploration and preprocessing. Next, bin every feature, compute each bin's WOE, and map the WOE values back onto the data.
Finally, build and validate the model and produce the scorecard.
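As a preview of the binning step, the WOE (weight of evidence) of a bin is the log of the ratio between the bin's share of good customers and its share of bad customers (some texts use the inverse ratio; the sign convention just flips). A minimal sketch with hypothetical bins — the bin edges and counts below are illustrative, not from this dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical per-bin good/bad counts for one feature (illustrative numbers)
bins = pd.DataFrame({
    "bin":  ["(-inf, 30]", "(30, 50]", "(50, inf)"],
    "good": [800, 1500, 700],
    "bad":  [120, 90, 40],
})

# WOE_i = ln( (good_i / good_total) / (bad_i / bad_total) )
good_dist = bins["good"] / bins["good"].sum()
bad_dist = bins["bad"] / bins["bad"].sum()
bins["woe"] = np.log(good_dist / bad_dist)

# IV (information value) summarizes the whole feature's predictive power
iv = ((good_dist - bad_dist) * bins["woe"]).sum()
print(bins)
print("IV:", round(iv, 4))
```

A bin with relatively more bad customers gets a negative WOE, one with relatively more good customers a positive WOE; IV is the usual criterion for keeping or dropping a feature after binning.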
Data source and project outcome:
The dataset is the credit-card user data from the Kaggle competition
Give Me Some Credit
(https://www.kaggle.com/c/GiveMeSomeCredit/overview).
After understanding the data, cleaning it, and engineering features, a logistic regression model was built and its fit evaluated with ROC/AUC (AUC reached 0.87), yielding the final credit scoring system.
1. Import libraries and load the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
%matplotlib inline
data = pd.read_csv(r'H:\数据分析\项目\信用预测\cs-training.csv',index_col=0)
Feature definitions
- SeriousDlqin2yrs: whether the borrower was 90 or more days past due (the good/bad label)
- RevolvingUtilizationOfUnsecuredLines: total balance on credit cards and personal lines of credit divided by the sum of credit limits
- age: age of the borrower
- NumberOfTime30-59DaysPastDueNotWorse: number of times in the past two years the borrower was 30-59 days past due but no worse
- DebtRatio: monthly debt payments, alimony, and living costs divided by monthly gross income
- MonthlyIncome: monthly income
- NumberOfOpenCreditLinesAndLoans: number of open loans and lines of credit
- NumberOfTimes90DaysLate: number of times in the past two years the borrower was 90 or more days past due
- NumberRealEstateLoansOrLines: number of mortgage and real-estate loans, including home-equity lines of credit
- NumberOfDependents: number of dependents in the household, excluding the borrower (spouse, children, etc.)
2. Data exploration and preprocessing
# Inspect the first rows
data.head()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
3 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
4 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
# Inspect the data structure
data.shape
(150000, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
SeriousDlqin2yrs 150000 non-null int64
RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
age 150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
DebtRatio 150000 non-null float64
MonthlyIncome 120269 non-null float64
NumberOfOpenCreditLinesAndLoans 150000 non-null int64
NumberOfTimes90DaysLate 150000 non-null int64
NumberRealEstateLoansOrLines 150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB
2.1 Remove duplicates
- Banking data often has hundreds of features, so the chance of two records agreeing on every feature is vanishingly small. Even in such an extreme case, we can treat it as a small loss of information and drop the record as a duplicate.
data.drop_duplicates(inplace=True)
data.info()  # the row count dropped, so duplicates did exist
<class 'pandas.core.frame.DataFrame'>
Int64Index: 149391 entries, 1 to 150000
Data columns (total 11 columns):
SeriousDlqin2yrs 149391 non-null int64
RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
age 149391 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
DebtRatio 149391 non-null float64
MonthlyIncome 120170 non-null float64
NumberOfOpenCreditLinesAndLoans 149391 non-null int64
NumberOfTimes90DaysLate 149391 non-null int64
NumberRealEstateLoansOrLines 149391 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
NumberOfDependents 145563 non-null float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB
data.reset_index(inplace=True, drop=True)  # rebuild a clean index
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
SeriousDlqin2yrs 149391 non-null int64
RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
age 149391 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
DebtRatio 149391 non-null float64
MonthlyIncome 120170 non-null float64
NumberOfOpenCreditLinesAndLoans 149391 non-null int64
NumberOfTimes90DaysLate 149391 non-null int64
NumberRealEstateLoansOrLines 149391 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
NumberOfDependents 145563 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
2.2 Fill missing values
# Fraction of missing values per column
data.isnull().mean()
SeriousDlqin2yrs 0.000000
RevolvingUtilizationOfUnsecuredLines 0.000000
age 0.000000
NumberOfTime30-59DaysPastDueNotWorse 0.000000
DebtRatio 0.000000
MonthlyIncome 0.195601
NumberOfOpenCreditLinesAndLoans 0.000000
NumberOfTimes90DaysLate 0.000000
NumberRealEstateLoansOrLines 0.000000
NumberOfTime60-89DaysPastDueNotWorse 0.000000
NumberOfDependents 0.025624
dtype: float64
Two features need filling: MonthlyIncome and NumberOfDependents. NumberOfDependents is only about 2.5% missing, so this project fills it with the column mean.
data['NumberOfDependents'].fillna(data['NumberOfDependents'].mean(),inplace=True)
data.isnull().mean()
SeriousDlqin2yrs 0.000000
RevolvingUtilizationOfUnsecuredLines 0.000000
age 0.000000
NumberOfTime30-59DaysPastDueNotWorse 0.000000
DebtRatio 0.000000
MonthlyIncome 0.195601
NumberOfOpenCreditLinesAndLoans 0.000000
NumberOfTimes90DaysLate 0.000000
NumberRealEstateLoansOrLines 0.000000
NumberOfTime60-89DaysPastDueNotWorse 0.000000
NumberOfDependents 0.000000
dtype: float64
MonthlyIncome is almost 20% missing. Income is too important to a credit score to drop, so the feature must be filled; this project fills it with a random forest.
def fill_missing_rf(X, y, to_fill):
    """
    Fill one feature's missing values with a random forest.
    Parameters:
        X: the feature matrix containing the column to fill
        y: the label column (complete, no missing values)
        to_fill: name of the feature to fill
    """
    # Build a new feature matrix and a new target
    df = X.copy()
    fill = df.loc[:, to_fill]
    df = pd.concat([df.loc[:, df.columns != to_fill], pd.DataFrame(y)], axis=1)
    # Rows where the target is known form the training set; the rest get predicted
    Ytrain = fill[fill.notnull()]
    Ytest = fill[fill.isnull()]
    Xtrain = df.iloc[Ytrain.index, :]
    Xtest = df.iloc[Ytest.index, :]
    # Fit a random forest regressor and predict the missing values
    from sklearn.ensemble import RandomForestRegressor as rfr
    rfr = rfr(n_estimators=100)
    rfr = rfr.fit(Xtrain, Ytrain)
    Y_predict = rfr.predict(Xtest)
    return Y_predict
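The same pattern can be sanity-checked on synthetic data before trusting it on the real dataset — mask values whose truth is known, predict them back, and measure the error. Everything below (column names, coefficients, noise scale) is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic data: "income" depends on two observable features plus noise
df = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
})
df["income"] = 3.0 * df["f1"] - 2.0 * df["f2"] + rng.normal(scale=0.5, size=n)

# Mask ~20% of income to simulate missing values, keeping the truth aside
mask = rng.random(n) < 0.2
truth = df.loc[mask, "income"].copy()
df.loc[mask, "income"] = np.nan

# Fit on rows where income is known, predict the masked rows
known = df["income"].notnull()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(df.loc[known, ["f1", "f2"]], df.loc[known, "income"])
pred = rf.predict(df.loc[~known, ["f1", "f2"]])
df.loc[~known, "income"] = pred

print("mean absolute error vs truth:", np.abs(pred - truth.values).mean())
```

If the error is close to the noise level of the synthetic relationship, the imputation pipeline is wired correctly.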
X = data.iloc[:,1:]
y = data.iloc[:,0]
X.shape
(149391, 10)
# Pass X, y, and the feature to fill (to_fill) into the function defined above
y_pred = fill_missing_rf(X,y,'MonthlyIncome')
y_pred
array([0.19, 0.34, 0.12, ..., 0.2 , 0.12, 0. ])
# Write the predicted values back into the missing slots
data.loc[data.loc[:,'MonthlyIncome'].isnull(),'MonthlyIncome'] = y_pred
data.info()  # no missing values left
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
SeriousDlqin2yrs 149391 non-null int64
RevolvingUtilizationOfUnsecuredLines 149391 non-null float64
age 149391 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 149391 non-null int64
DebtRatio 149391 non-null float64
MonthlyIncome 149391 non-null float64
NumberOfOpenCreditLinesAndLoans 149391 non-null int64
NumberOfTimes90DaysLate 149391 non-null int64
NumberRealEstateLoansOrLines 149391 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 149391 non-null int64
NumberOfDependents 149391 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
2.3 Handle outliers with descriptive statistics
data.describe([0.01,0.1,0.25,0.5,0.75,0.9,0.99]).T
count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SeriousDlqin2yrs | 149391.0 | 0.066999 | 0.250021 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 |
RevolvingUtilizationOfUnsecuredLines | 149391.0 | 6.071087 | 250.263672 | 0.0 | 0.0 | 0.003199 | 0.030132 | 0.154235 | 0.556494 | 0.978007 | 1.093922 | 50708.0 |
age | 149391.0 | 52.306237 | 14.725962 | 0.0 | 24.0 | 33.000000 | 41.000000 | 52.000000 | 63.000000 | 72.000000 | 87.000000 | 109.0 |
NumberOfTime30-59DaysPastDueNotWorse | 149391.0 | 0.393886 | 3.852953 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 4.000000 | 98.0 |
DebtRatio | 149391.0 | 354.436740 | 2041.843455 | 0.0 | 0.0 | 0.034991 | 0.177441 | 0.368234 | 0.875279 | 1275.000000 | 4985.100000 | 329664.0 |
MonthlyIncome | 149391.0 | 5428.750772 | 13241.513403 | 0.0 | 0.0 | 0.190000 | 1800.000000 | 4429.000000 | 7416.000000 | 10800.000000 | 23250.000000 | 3008750.0 |
NumberOfOpenCreditLinesAndLoans | 149391.0 | 8.480892 | 5.136515 | 0.0 | 0.0 | 3.000000 | 5.000000 | 8.000000 | 11.000000 | 15.000000 | 24.000000 | 58.0 |
NumberOfTimes90DaysLate | 149391.0 | 0.238120 | 3.826165 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 98.0 |
NumberRealEstateLoansOrLines | 149391.0 | 1.022391 | 1.130196 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 2.000000 | 4.000000 | 54.0 |
NumberOfTime60-89DaysPastDueNotWorse | 149391.0 | 0.212503 | 3.810523 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 98.0 |
NumberOfDependents | 149391.0 | 0.759863 | 1.101749 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 4.000000 | 20.0 |
# The minimum age is 0, which makes no sense for a bank customer; check how many such rows exist
(data["age"] == 0).sum()
1
# Treat it as a missing value and drop the sample
data = data[data["age"] != 0]
"""
另外,有三个指标看起来很奇怪:
"NumberOfTime30-59DaysPastDueNotWorse"
"NumberOfTime60-89DaysPastDueNotWorse"
"NumberOfTimes90DaysLate"
这三个指标分别是“过去两年内出现35-59天逾期但是没有发展的更坏的次数”,“过去两年内出现60-89天逾期但是没
有发展的更坏的次数”,“过去两年内出现90天逾期的次数”。这三个指标,在99%的分布的时候依然是2,最大值却是
98,看起来很不正常。
"""
data[data.loc[:,"NumberOfTimes90DaysLate"] > 90].count()
SeriousDlqin2yrs 225
RevolvingUtilizationOfUnsecuredLines 225
age 225
NumberOfTime30-59DaysPastDueNotWorse 225
DebtRatio 225
MonthlyIncome 225
NumberOfOpenCreditLinesAndLoans 225
NumberOfTimes90DaysLate 225
NumberRealEstateLoansOrLines 225
NumberOfTime60-89DaysPastDueNotWorse 225
NumberOfDependents 225
dtype: int64
# 225 samples show this pattern, and on inspection their labels are not all 1 —
# they are not all bad customers, which is clearly abnormal.
# So we can conclude these samples are anomalies of some kind and should be dropped.
data = data[data.loc[:,'NumberOfTimes90DaysLate']<90]
data.shape
(149165, 11)
data.describe([0.01,0.1,0.25,0.5,0.75,0.9,0.99])
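Dropping the anomalous rows is one choice. An alternative worth knowing (a sketch, not what this project does) is winsorizing: cap extreme values at a chosen quantile instead of deleting the samples, which keeps the row count intact. The series below is synthetic, standing in for a heavy-tailed feature like DebtRatio:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A heavy-tailed series with a handful of extreme outliers appended
s = pd.Series(np.concatenate([rng.lognormal(0, 1, 995),
                              [5000, 8000, 9000, 9500, 9999]]))

# Cap everything above the 99th percentile at that percentile (winsorizing)
upper = s.quantile(0.99)
capped = s.clip(upper=upper)

print("before, max:", s.max(), "| after, max:", round(capped.max(), 2))
print("rows kept:", len(capped), "of", len(s))
```

Winsorizing trades a small bias at the tail for keeping every observation, which matters when the anomalous rows still carry signal in their other features.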