1 Background
In automated, big-data loan approval, credit scoring has matured into a reliable risk-valuation technique, and in consumer-finance risk control the credit scorecard model is already widely used.
What is a credit scorecard?
In short, it makes use of the information already known about a customer, which may come from third-party platforms (e.g. Zhima Credit, JD Baitiao, WeChat, bank credit cards). The customer's credit standing is quantified from this historical data, and the direct expression of that quantification is a credit score.
Today we show how to build the kind of credit-card scoring model that banks commonly use. The data come from the famous data-science competition site Kaggle: the Give Me Some Credit dataset, the anonymized credit history of a bank's customers. The full dataset holds 150,000 customer records, and that volume of data helps safeguard the model's accuracy. The Kaggle contributor Zoe has published a large, systematic, complete code set; here we simplify considerably, aiming only for a glimpse of the whole.
A complete credit scorecard model consists of the following parts:
data preprocessing, feature (variable) selection, WOE encoding and binning of variables, logistic regression model development and evaluation, and construction of the scorecard and an automated scoring system.
The data come from the Kaggle dataset Give Me Some Credit: 150,000 samples in total, covering 11 variables.
2 Data Preprocessing
Frankly, this step is tedious. In any statistical analysis, data preprocessing takes up 70% of the time or more, because first-hand data are invariably messy and full of useless records, and a dirty dataset produces baffling results. So strictly we ought to go and "clean" the data; but since cleaning is such a grind, for convenience this walkthrough simply skips the cleaning step.
What you see from here on is therefore an already clean, tidy dataset. (Details omitted.)
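Although cleaning is skipped here, a minimal sketch of the steps one would typically apply to this dataset may still be useful. The column names follow the Kaggle files; the specific choices (median fill for income, dropping rows with missing dependents, removing `age == 0` entry errors) are illustrative assumptions, not what the author actually did:

```python
import pandas as pd

def clean_credit_data(df):
    """Minimal cleaning sketch for the Give Me Some Credit data.
    Column names follow the Kaggle files; the fill/drop choices
    below are illustrative, not the author's method."""
    # MonthlyIncome has many missing values; one simple choice is median fill
    df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())
    # NumberOfDependents has only a few missing values; drop those rows
    df = df.dropna(subset=['NumberOfDependents'])
    # age == 0 is clearly a data-entry error
    df = df[df['age'] > 0]
    # remove exact duplicate records
    df = df.drop_duplicates()
    return df
```

Applied to the raw `cs-training.csv` this leaves a dataset with no missing values in the modelled columns.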
3 WOE Binning of Variables
Feature (variable) selection and ranking matter a great deal in data analysis and machine learning. Good feature selection improves model performance and helps us understand the data's characteristics and underlying structure, which in turn guides further improvement of the model and the algorithm. We first try optimal (monotonic) binning of each continuous variable; when a variable's distribution does not support optimal binning, we instead bin it with hand-specified cut points.
import numpy as np
import pandas as pd
import scipy.stats as stats

def mono_bin(Y, X, n):
    """Monotonic (optimal) binning: shrink the number of quantile buckets
    until the bucket means of X and Y are perfectly monotonically
    (Spearman) correlated, then compute per-bucket WOE and the IV."""
    good = Y.sum()
    bad = Y.count() - good
    r = 0
    while np.abs(r) < 1:
        d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.qcut(X, n)})
        d2 = d1.groupby(['Bucket'])
        r, p = stats.spearmanr(d2['X'].mean(), d2['Y'].mean())
        n = n - 1
        print(r, n)
    d3 = pd.DataFrame(d2['X'].min(), columns=['min'])
    d3['min'] = d2['X'].min()
    d3['max'] = d2['X'].max()
    d3['sum'] = d2['Y'].sum()
    d3['total'] = d2['Y'].count()
    d3['rate'] = d2['Y'].mean()
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    d3['woe'] = np.log(d3['goodattribute'] / d3['badattribute'])
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')  # sort_index(by=...) is deprecated; sort_values is the current API
    woe = list(d4['woe'].values)
    print(d4)
    print('-' * 30)
    cut = []
    cut.append(float('-inf'))  # float('inf') is positive infinity; don't write a bare inf
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    return d4, iv, woe, cut

dfx1, ivx1, woex1, cutx1 = mono_bin(train['SeriousDlqin2yrs'], train['RevolvingUtilizationOfUnsecuredLines'], n=10)
dfx2, ivx2, woex2, cutx2 = mono_bin(train['SeriousDlqin2yrs'], train['age'], n=10)
dfx4, ivx4, woex4, cutx4 = mono_bin(train['SeriousDlqin2yrs'], train['DebtRatio'], n=20)
dfx5, ivx5, woex5, cutx5 = mono_bin(train['SeriousDlqin2yrs'], train['MonthlyIncome'], n=10)
Variables that do not admit optimal binning are binned with hand-specified cut points instead:
def self_bin(Y, X, cat):
    """WOE binning with hand-specified cut points `cat`."""
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})
    d2 = d1.groupby(['Bucket'])
    d3 = pd.DataFrame(d2['X'].min(), columns=['min'])
    d3['min'] = d2['X'].min()
    d3['max'] = d2['X'].max()
    d3['sum'] = d2['Y'].sum()
    d3['total'] = d2['Y'].count()
    d3['rate'] = d2['Y'].mean()
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    d3['woe'] = np.log(d3['goodattribute'] / d3['badattribute'])
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    print(d4)
    print('-' * 40)
    woe = list(d3['woe'].values)
    return d4, iv, woe

ninf = float('-inf')
pinf = float('inf')
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0, 1, 2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]

dfx3, ivx3, woex3 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfTime30-59DaysPastDueNotWorse'], cutx3)
dfx6, ivx6, woex6 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfOpenCreditLinesAndLoans'], cutx6)
dfx7, ivx7, woex7 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfTimes90DaysLate'], cutx7)
dfx8, ivx8, woex8 = self_bin(train['SeriousDlqin2yrs'], train['NumberRealEstateLoansOrLines'], cutx8)
dfx9, ivx9, woex9 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfTime60-89DaysPastDueNotWorse'], cutx9)
dfx10, ivx10, woex10 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfDependents'], cutx10)
Next, every raw value is replaced by the WOE of the bin it falls into, in both the training and the test set:

from pandas import Series

data = pd.read_csv('data/TrainData.csv')

def replace_woe(series, cut, woe):
    """Map each value to the WOE of its bin (bins are left-closed:
    a value in [cut[i], cut[i+1]) gets woe[i])."""
    result = []  # renamed from `list`, which shadowed the builtin
    i = 0
    while i < len(series):
        valuek = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if valuek >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(woe[m])
        i += 1
    return result

data['RevolvingUtilizationOfUnsecuredLines'] = Series(replace_woe(data['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1))
data['age'] = Series(replace_woe(data['age'], cutx2, woex2))
data['NumberOfTime30-59DaysPastDueNotWorse'] = Series(replace_woe(data['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3))
data['DebtRatio'] = Series(replace_woe(data['DebtRatio'], cutx4, woex4))
data['MonthlyIncome'] = Series(replace_woe(data['MonthlyIncome'], cutx5, woex5))
data['NumberOfOpenCreditLinesAndLoans'] = Series(replace_woe(data['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6))
data['NumberOfTimes90DaysLate'] = Series(replace_woe(data['NumberOfTimes90DaysLate'], cutx7, woex7))
data['NumberRealEstateLoansOrLines'] = Series(replace_woe(data['NumberRealEstateLoansOrLines'], cutx8, woex8))
data['NumberOfTime60-89DaysPastDueNotWorse'] = Series(replace_woe(data['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9))
data['NumberOfDependents'] = Series(replace_woe(data['NumberOfDependents'], cutx10, woex10))

# replace with WOE in the test set as well
test = pd.read_csv('data/TestData.csv')
test['RevolvingUtilizationOfUnsecuredLines'] = Series(replace_woe(test['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1))
test['age'] = Series(replace_woe(test['age'], cutx2, woex2))
test['NumberOfTime30-59DaysPastDueNotWorse'] = Series(replace_woe(test['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3))
test['DebtRatio'] = Series(replace_woe(test['DebtRatio'], cutx4, woex4))
test['MonthlyIncome'] = Series(replace_woe(test['MonthlyIncome'], cutx5, woex5))
test['NumberOfOpenCreditLinesAndLoans'] = Series(replace_woe(test['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6))
test['NumberOfTimes90DaysLate'] = Series(replace_woe(test['NumberOfTimes90DaysLate'], cutx7, woex7))
test['NumberRealEstateLoansOrLines'] = Series(replace_woe(test['NumberRealEstateLoansOrLines'], cutx8, woex8))
test['NumberOfTime60-89DaysPastDueNotWorse'] = Series(replace_woe(test['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9))
test['NumberOfDependents'] = Series(replace_woe(test['NumberOfDependents'], cutx10, woex10))
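The element-by-element `replace_woe` loop is easy to follow but slow over 150,000 rows. A vectorized alternative (my own sketch, not part of the original code) reproduces the same left-closed bin convention with `pd.cut`:

```python
import pandas as pd

def replace_woe_vectorized(series, cut, woe):
    """Vectorized equivalent of the replace_woe loop:
    pd.cut(..., right=False) assigns each value to its left-closed
    interval [cut[i], cut[i+1]), and labels=False returns the
    integer bin code, which indexes into the WOE list."""
    codes = pd.cut(series, bins=cut, labels=False, right=False)
    return codes.map(lambda i: woe[int(i)])
```

It returns the same values as the loop, bin for bin, in a fraction of the time.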
4 Building the Logistic Regression Model
import statsmodels.api as sm

Y = data['SeriousDlqin2yrs']
X = data.drop(['SeriousDlqin2yrs', 'DebtRatio', 'MonthlyIncome',
               'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines',
               'NumberOfDependents'], axis=1)
X1 = sm.add_constant(X)
logit = sm.Logit(Y, X1)
result = logit.fit()
print(result.summary())
With the significance level set at 0.01, every term in the fitted logistic regression is highly significant. We then validate the model on the test set, using the ROC curve and the AUC to assess how well it discriminates.
from sklearn.metrics import roc_curve, auc
import matplotlib
import matplotlib.pyplot as plt  # plt is used below but was never imported

matplotlib.rcParams['font.sans-serif'] = ['FangSong']  # default font (CJK-capable)
matplotlib.rcParams['axes.unicode_minus'] = False

Y_test = test['SeriousDlqin2yrs']
X_test = test.drop(['SeriousDlqin2yrs', 'DebtRatio', 'MonthlyIncome',
                    'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines',
                    'NumberOfDependents'], axis=1)

# evaluate the model's fit with the ROC curve and AUC
X2 = sm.add_constant(X_test)
resu = result.predict(X2)
fpr, tpr, threshold = roc_curve(Y_test, resu)
rocauc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label='AUC=%0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
The plot shows an AUC of 0.85, indicating good predictive power and high accuracy. This confirms that the five selected features are an effective basis for the corresponding part of the scorecard's points.
5 Building the Credit Scorecard
The most basic ingredients of a scorecard are the base score and the points-to-double-the-odds.
Scorecard parameters: base score + PDO (points to double the odds).
Base score: 600.
PDO: 20, meaning the score rises by 20 points every time the good:bad odds double; the base good:bad odds are set to 20:1.
Total personal score = base score + the points from each feature.
Score = offset + factor * log(odds)
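The formula can be made concrete with the parameters just given. This short check (the helper name `score` is mine) derives `factor` and `offset` exactly as the scorecard code below does with its `p` and `q`:

```python
import math

PDO = 20          # points needed to double the odds
BASE_SCORE = 600  # score assigned at the base odds
BASE_ODDS = 20    # good:bad odds at the base score

factor = PDO / math.log(2)                          # scale, about 28.85
offset = BASE_SCORE - factor * math.log(BASE_ODDS)  # about 513.56

def score(odds):
    """Score = offset + factor * ln(odds)."""
    return offset + factor * math.log(odds)
```

At odds 20:1 the score is exactly 600, and doubling the odds to 40:1 adds exactly PDO = 20 points, which is what "points to double the odds" means.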
import math

def get_score(coe, woe, p):
    """Points for each bin of one feature: coefficient * WOE * factor."""
    scores = []
    for w in woe:
        score = round(coe * w * p, 0)
        scores.append(score)
    return scores

def compute_score(series, cut, scores):
    """Look up each value's bin (same left-closed convention as
    replace_woe) and return that bin's points."""
    i = 0
    result = []  # renamed from `list`, which shadowed the builtin
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j = j - 1
                m = m - 1
        result.append(scores[m])
        i = i + 1
    return result

coe = [9.738849, 0.638002, 0.505995, 1.032246, 1.790041, 1.131956]  # regression coefficients
p = 20 / math.log(2)                        # factor (scale)
q = 600 - 20 * math.log(20) / math.log(2)   # offset (base)
basescore = round(q + p * coe[0], 0)

# Only the high-IV features go into the scorecard; the final score is their sum.
x1 = get_score(coe[1], woex1, p)
x2 = get_score(coe[2], woex2, p)
x3 = get_score(coe[3], woex3, p)
x7 = get_score(coe[4], woex7, p)
x9 = get_score(coe[5], woex9, p)
# Each entry of x1 corresponds to one interval of cutx1.
# PDO (Points to Double the Odds): the score step at which the good:bad odds double.
print(x1)
print(x2)
print(x3)
print(x7)
print(x9)

test1 = pd.read_csv('data/TestData.csv')
test1['BaseScore'] = Series(np.zeros(len(test1)) + basescore)
test1['x1'] = Series(compute_score(test1['RevolvingUtilizationOfUnsecuredLines'], cutx1, x1))
test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
test1['x3'] = Series(compute_score(test1['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, x3))
test1['x7'] = Series(compute_score(test1['NumberOfTimes90DaysLate'], cutx7, x7))
test1['x9'] = Series(compute_score(test1['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, x9))
test1['score'] = test1['BaseScore'] + test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] + test1['x9']
test1.to_csv('data/scoredata.csv')
Summary
In automated big-data approval, credit scoring has matured into a reliable risk-valuation technique, and credit scorecard models are widely used in consumer-finance risk control. Quantifying a customer's credit standing from historical data yields an intuitive credit score. By mining the Kaggle dataset Give Me Some Credit and following the principles of scorecard construction, we built a simple credit scoring system through data preprocessing, variable selection, and model building and prediction.