Learning to Build a Credit Scorecard

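Study notes on building a credit scorecard, reproducing an existing experiment. Reference articles: 基于Python的信用评分卡模型分析 (Python-based credit scorecard model analysis), parts 1 and 2.

It is only a reproduction, but I'm still pretty excited~
Running code cell by cell in Jupyter notebook makes it very convenient for learning and taking notes.

I renamed the dataset's attributes to Chinese to make them easier to inspect, and the code prints many intermediate results to help me understand the overall scorecard-building process. The hard parts are binning and the score calculation (I'm still working through the formula).

Data processing and analysis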

import pandas as pd
import matplotlib.pyplot as plt  # plotting library
import matplotlib
import seaborn as sns
from sklearn.metrics import roc_curve,auc
import statsmodels.api as sm
data = pd.read_csv('dataSet/cs-training.csv')
data.describe().to_csv('dataSet/cs-trainingDes.csv')
data.head()
   SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \
0                 1                              0.766127   45   
1                 0                              0.957151   40   
2                 0                              0.658180   38   
3                 0                              0.233810   30   
4                 0                              0.907239   49   

   NumberOfTime30-59DaysPastDueNotWorse  DebtRatio  MonthlyIncome  \
0                                     2   0.802982         9120.0   
1                                     0   0.121876         2600.0   
2                                     1   0.085113         3042.0   
3                                     0   0.036050         3300.0   
4                                     1   0.024926        63588.0   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0                               13                        0   
1                                4                        0   
2                                2                        1   
3                                5                        0   
4                                7                        0   

   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents  
0                             6                                     0                 2.0  
1                             0                                     0                 1.0  
2                             0                                     0                 0.0  
3                             0                                     0                 0.0  
4                             1                                     0                 0.0  
# View the descriptive statistics of data
dataDes = pd.read_csv('dataSet/cs-trainingDes.csv')
dataDes
   Unnamed: 0  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines            age  \
0       count     150000.000000                         150000.000000  150000.000000   
1        mean          0.066840                              6.048438      52.295207   
2         std          0.249746                            249.755371      14.771866   
3         min          0.000000                              0.000000       0.000000   
4         25%          0.000000                              0.029867      41.000000   
5         50%          0.000000                              0.154181      52.000000   
6         75%          0.000000                              0.559046      63.000000   
7         max          1.000000                          50708.000000     109.000000   

   NumberOfTime30-59DaysPastDueNotWorse      DebtRatio  MonthlyIncome  \
0                         150000.000000  150000.000000   1.202690e+05   
1                              0.421033     353.005076   6.670221e+03   
2                              4.192781    2037.818523   1.438467e+04   
3                              0.000000       0.000000   0.000000e+00   
4                              0.000000       0.175074   3.400000e+03   
5                              0.000000       0.366508   5.400000e+03   
6                              0.000000       0.868254   8.249000e+03   
7                             98.000000  329664.000000   3.008750e+06   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0                    150000.000000            150000.000000   
1                         8.452760                 0.265973   
2                         5.145951                 4.169304   
3                         0.000000                 0.000000   
4                         5.000000                 0.000000   
5                         8.000000                 0.000000   
6                        11.000000                 0.000000   
7                        58.000000                98.000000   

   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents  
0                 150000.000000                         150000.000000       146076.000000  
1                      1.018240                              0.240387            0.757222  
2                      1.129771                              4.155179            1.115086  
3                      0.000000                              0.000000            0.000000  
4                      0.000000                              0.000000            0.000000  
5                      1.000000                              0.000000            0.000000  
6                      2.000000                              0.000000            1.000000  
7                     54.000000                             98.000000           20.000000  

Rename the columns of data

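data.rename(columns={
    'SeriousDlqin2yrs': '是否逾期',                              # overdue or not (target)
    'RevolvingUtilizationOfUnsecuredLines': '信用额度',          # credit line
    'NumberOfTime30-59DaysPastDueNotWorse': '逾期30到60天次数',  # times 30 to 60 days past due
    'DebtRatio': '债务占收入比',                                 # debt-to-income ratio
    'NumberOfOpenCreditLinesAndLoans': '未偿还贷款',             # outstanding loans
    'NumberOfTimes90DaysLate': '逾期90天次数',                   # times 90 days late
    'NumberRealEstateLoansOrLines': '抵押财产',                  # mortgaged property
    'NumberOfTime60-89DaysPastDueNotWorse': '逾期60到89天次数',  # times 60 to 89 days past due
    'NumberOfDependents': '家庭人数',                            # family size
}, inplace=True)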
data.head()
   是否逾期      信用额度  age  逾期30到60天次数    债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.766127   45           2  0.802982         9120.0     13        0   
1     0  0.957151   40           0  0.121876         2600.0      4        0   
2     0  0.658180   38           1  0.085113         3042.0      2        1   
3     0  0.233810   30           0  0.036050         3300.0      5        0   
4     0  0.907239   49           1  0.024926        63588.0      7        0   

   抵押财产  逾期60到89天次数  家庭人数  
0     6           0   2.0  
1     0           0   1.0  
2     0           0   0.0  
3     0           0   0.0  
4     1           0   0.0  

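In the describe table above, the count for MonthlyIncome and NumberOfDependents is below 150000, so both columns contain missing values.

The function below fills the missing values with a random forest prediction. MonthlyIncome is moved to the first column and used as the label y, the other columns are used as features, and a random forest regressor is fitted to predict MonthlyIncome. The predicted values are then written back into the original data.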

from sklearn.ensemble import RandomForestRegressor

# Fill the missing MonthlyIncome values with a random forest prediction
def set_missing(df):
    # Take the numeric features, with MonthlyIncome (column 5) moved to the front as the label y.
    # Column 10 (家庭人数), which also has missing values, is not used.
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where MonthlyIncome is known and rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].to_numpy()
    unknown = process_df[process_df.MonthlyIncome.isnull()].to_numpy()
    # X: feature values
    X = known[:, 1:]
    # y: label values
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Write the predictions back into the original data
    df.loc[df.MonthlyIncome.isnull(), 'MonthlyIncome'] = predicted
    return df

data = set_missing(data)  # use the random forest for the column with many missing values
[8311. 1159. 8311. ... 1159. 2554. 2554.]
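data = data.dropna()           # drop the few remaining rows with missing values
data = data.drop_duplicates()  # drop duplicate rows
data.to_csv('MissingData.csv', index=False)
# Dropping rows leaves gaps in the row index, so read the file back to get a fresh index
data = pd.read_csv('MissingData.csv')
#print(data)
data = data[data['age'] > 0]  # remove the outliers where age equals 0
data.loc[:100, data.columns[[1, 2]]].boxplot()  # plot.box() also works
print(data.head())
plt.show()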
   是否逾期      信用额度  age  逾期30到60天次数    债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.766127   45           2  0.802982         9120.0     13        0   
1     0  0.957151   40           0  0.121876         2600.0      4        0   
2     0  0.658180   38           1  0.085113         3042.0      2        1   
3     0  0.233810   30           0  0.036050         3300.0      5        0   
4     0  0.907239   49           1  0.024926        63588.0      7        0   

   抵押财产  逾期60到89天次数  家庭人数  
0     6           0   2.0  
1     0           0   1.0  
2     0           0   0.0  
3     0           0   0.0  
4     1           0   0.0  
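A box plot is a convenient way to see the range of a variable. Here two variables are drawn in the same plot, and the much larger range of one of them squashes the other's box so that it cannot be seen; plotting them one at a time avoids the problem.

[Figure: box plot of 信用额度 and age]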

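Remove the outliers in NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse (the code below filters on the first of these, 逾期30到60天次数).
In the dataset good customers are labeled 0 and defaulting customers 1. The more intuitive reading is that a customer who performs normally and pays interest is 1, so we invert the label.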

# Remove outliers
data = data[data['逾期30到60天次数'] < 90].copy()
# Invert the target variable (SeriousDlqin2yrs): 1 = good customer, 0 = default
data['是否逾期'] = 1 - data['是否逾期']
from sklearn.model_selection import train_test_split
Y = data['是否逾期']
X = data.iloc[:, 1:]
# The test set is 30% of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv', index=False)
test.to_csv('TestData.csv', index=False)

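Variable binning is a name for the discretization of continuous variables. Three approaches are common in credit scorecard development: equal-width, equal-frequency, and optimal binning. Equal-width binning (equal length intervals) uses intervals of the same length, for example age in ten-year bands. Equal-frequency binning (equal frequency intervals) fixes the number of bins first and then puts roughly the same number of observations in each. Optimal binning, also called supervised discretization, uses recursive partitioning to split a continuous variable into bins; behind it is an algorithm that searches for a good grouping based on conditional inference.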
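As a small illustration of the first two strategies, here is a sketch using pandas: `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins. The `ages` values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages, invented for this illustration only
ages = pd.Series([21, 22, 23, 24, 25, 30, 35, 40, 60, 80])

# Equal-width: 3 intervals of the same length spanning the observed range
equal_width = pd.cut(ages, bins=3)

# Equal-frequency: 2 intervals holding roughly the same number of rows
equal_freq = pd.qcut(ages, q=2)

# Count the rows per bin, listed in bin order
width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

With skewed data the equal-width bins end up very uneven (most rows land in the first bin here), while the equal-frequency bins stay balanced. That is one reason `mono_bin` below starts from `pd.qcut`.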

import numpy as np
from scipy import stats

# Automatic binning function ("optimal" binning)
def mono_bin(Y, X, n = 20):
    r = 0
    good=Y.sum()
    bad=Y.count()-good
    # Reduce the number of quantile bins until the bin means of X and Y
    # are perfectly monotonic (|Spearman r| = 1)
    while np.abs(r) < 1:
        # Assign each value of X to its Bucket interval
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index = True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    #print(d1)
    d3 = pd.DataFrame(d2.X.min(), columns = ['min'])
    print(d3)      # debug output
    d3['min']=d2.min().X
    print(d2.X)    # debug output
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe']=np.log((d3['rate']/(1-d3['rate']))/(good/bad))
    d3['goodattribute']=d3['sum']/good
    d3['badattribute']=(d3['total']-d3['sum'])/bad
    iv=((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    d3['IV'] = iv
    d4 = d3.sort_values(by = 'min')
    print("=" * 60)
    print(d4)
    # Cut points at the quantiles of X. At this point n is one less than the
    # number of bins actually used, because it is decremented after the last qcut.
    cut=[]
    cut.append(float('-inf'))
    for i in range(1,n+1):
        qua=X.quantile(i/(n+1))
        cut.append(round(qua,4))
    cut.append(float('inf'))
    woe=list(d4['woe'].round(3))
    return d4,iv,cut,woe

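WoE analysis bins an indicator, computes the WoE value of each bin, and looks at how WoE changes as the indicator changes. WoE is defined as:
woe = ln(goodattribute / badattribute)
where goodattribute is the share of all good customers that fall in the bin and badattribute is the share of all bad customers that fall in it. For the analysis, sort each indicator from small to large and compute the WoE of each bin. For a positive indicator, the larger the indicator the smaller the WoE; for an inverse indicator, the larger the indicator the larger the WoE. The steeper the negative slope for a positive indicator, or the positive slope for an inverse indicator, the better the indicator separates good from bad. A WoE curve that is close to a flat line means the indicator discriminates poorly. If a positive indicator turns out to be positively correlated with WoE, or an inverse indicator negatively correlated, the indicator contradicts its economic meaning and should be removed.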
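To make the definition concrete, here is a minimal sketch that computes the WoE of each bin and the IV of a variable from good/bad counts. The counts are invented; `good_attr` and `bad_attr` correspond to goodattribute and badattribute in `mono_bin`. It also checks that the form used in the code, ln((rate/(1-rate))/(good/bad)), is the same quantity as ln(goodattribute/badattribute).

```python
import math

# Hypothetical (good, bad) counts per bin, invented for illustration
bins = [(900, 100), (800, 200), (300, 200)]

good_total = sum(g for g, _ in bins)
bad_total = sum(b for _, b in bins)

woe_values = []
iv = 0.0
for g, b in bins:
    good_attr = g / good_total   # share of all goods that fall in this bin
    bad_attr = b / bad_total     # share of all bads that fall in this bin
    woe = math.log(good_attr / bad_attr)
    # Same quantity written as in mono_bin: bin odds divided by overall odds
    rate = g / (g + b)
    woe_alt = math.log((rate / (1 - rate)) / (good_total / bad_total))
    assert math.isclose(woe, woe_alt, abs_tol=1e-9)
    woe_values.append(woe)
    iv += (good_attr - bad_attr) * woe

print([round(w, 3) for w in woe_values], round(iv, 4))
```

A bin whose good/bad mix equals the overall mix gets a WoE of 0 and adds nothing to the IV; bins that deviate in either direction add a positive amount.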

import numpy as np
from scipy import stats
dfx1, ivx1,cutx1,woex1=mono_bin(data.是否逾期,data.信用额度,n=10)
#print(dfx1, ivx1,cutx1,woex1)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
                       min           max    sum  total      rate       woe  \
Bucket                                                                       
(-0.001, 0.0311]  0.000000      0.031125  35659  36339  0.981287  1.322345   
(0.0311, 0.158]   0.031128      0.158089  35590  36338  0.979415  1.225098   
(0.158, 0.558]    0.158100      0.558255  34499  36338  0.949392  0.294389   
(0.558, 50708.0]  0.558278  50708.000000  29900  36339  0.822807 -1.101834   

                  goodattribute  badattribute        IV  
Bucket                                                   
(-0.001, 0.0311]       0.262879      0.070060  0.989174  
(0.0311, 0.158]        0.262370      0.077066  0.989174  
(0.158, 0.558]         0.254327      0.189470  0.989174  
(0.558, 50708.0]       0.220423      0.663404  0.989174  
dfx2, ivx2,cutx2,woex2=mono_bin(data.是否逾期, data.age, n=10)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD828>
============================================================
                min  max    sum  total      rate       woe  goodattribute  \
Bucket                                                                      
(20.999, 33.0]   21   33  14471  16287  0.888500 -0.561809       0.106681   
(33.0, 40.0]     34   40  16073  17737  0.906185 -0.369403       0.118491   
(40.0, 45.0]     41   45  14683  16043  0.915228 -0.258113       0.108243   
(45.0, 49.0]     46   49  13619  14828  0.918465 -0.215647       0.100400   
(49.0, 54.0]     50   54  16516  17814  0.927136 -0.093814       0.121756   
(54.0, 59.0]     55   59  15757  16670  0.945231  0.210985       0.116161   
(59.0, 64.0]     60   64  15923  16613  0.958466  0.501509       0.117385   
(64.0, 71.0]     65   71  14194  14608  0.971659  0.897390       0.104638   
(71.0, 107.0]    72  107  14412  14754  0.976820  1.103687       0.106246   

                badattribute        IV  
Bucket                                  
(20.999, 33.0]      0.187101  0.241178  
(33.0, 40.0]        0.171440  0.241178  
(40.0, 45.0]        0.140120  0.241178  
(45.0, 49.0]        0.124562  0.241178  
(49.0, 54.0]        0.133732  0.241178  
(54.0, 59.0]        0.094066  0.241178  
(59.0, 64.0]        0.071090  0.241178  
(64.0, 71.0]        0.042654  0.241178  
(71.0, 107.0]       0.035236  0.241178  
dfx4, ivx4,cutx4,woex4 =mono_bin(data.是否逾期, data.债务占收入比, n=20)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
                        min            max    sum  total      rate       woe  \
Bucket                                                                         
(-0.001, 0.236]    0.000000       0.235948  45593  48452  0.940993  0.131963   
(0.236, 0.545]     0.235953       0.544862  45434  48451  0.937731  0.074679   
(0.545, 329664.0]  0.544864  329664.000000  44621  48451  0.920951 -0.181979   

                   goodattribute  badattribute        IV  
Bucket                                                    
(-0.001, 0.236]         0.336113      0.294560  0.019231  
(0.236, 0.545]          0.334940      0.310839  0.019231  
(0.545, 329664.0]       0.328947      0.394601  0.019231  
dfx5, ivx5,cutx5,woex5 =mono_bin(data.是否逾期, data.MonthlyIncome, n=10) 
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6FBF28>
============================================================
                        min        max    sum  total      rate       woe  \
Bucket                                                                     
(-0.001, 3400.0]        0.0     3400.0  44952  48760  0.921903 -0.168828   
(3400.0, 6850.0]     3401.0     6850.0  44600  48145  0.926368 -0.105123   
(6850.0, 3008750.0]  6851.0  3008750.0  46096  48449  0.951433  0.337716   

                     goodattribute  badattribute        IV  
Bucket                                                      
(-0.001, 3400.0]          0.331387      0.392335  0.047012  
(3400.0, 6850.0]          0.328792      0.365238  0.047012  
(6850.0, 3008750.0]       0.339821      0.242427  0.047012  
def self_bin(Y,X,cat):
    good=Y.sum()
    bad=Y.count()-good
    d1=pd.DataFrame({'X':X,'Y':Y,'Bucket':pd.cut(X,cat)})
    d2=d1.groupby('Bucket', as_index = True)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    woe = list(d4['woe'].round(3))
    return d4, iv,woe

# Discretize continuous variables
pinf = float('inf')   # positive infinity
ninf = float('-inf')  # negative infinity
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0,1,2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
dfx3, ivx3,woex3 = self_bin(data.是否逾期, data['逾期30到60天次数'], cutx3)
dfx6, ivx6 ,woex6= self_bin(data.是否逾期, data['未偿还贷款'], cutx6)
dfx7, ivx7,woex7 = self_bin(data.是否逾期, data['逾期90天次数'], cutx7)
dfx8, ivx8,woex8 = self_bin(data.是否逾期, data['抵押财产'], cutx8)
dfx9, ivx9,woex9 = self_bin(data.是否逾期, data['逾期60到89天次数'], cutx9)
dfx10, ivx10,woex10 = self_bin(data.是否逾期, data['家庭人数'], cutx10)

============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  117077  122020  0.959490  0.527540       0.863094   
(0.0, 1.0]     1    1   13381   15744  0.849911 -0.903415       0.098645   
(1.0, 3.0]     2    3    4467    6279  0.711419 -1.735033       0.032931   
(3.0, 5.0]     4    5     606    1075  0.563721 -2.381042       0.004467   
(5.0, inf]     6   13     117     236  0.495763 -2.654269       0.000863   

             badattribute  
Bucket                     
(-inf, 0.0]      0.509273  
(0.0, 1.0]       0.243458  
(1.0, 3.0]       0.186689  
(3.0, 5.0]       0.048321  
(5.0, inf]       0.012260  
============================================================
             min  max    sum   total      rate       woe  goodattribute  \
Bucket                                                                    
(-inf, 1.0]    0    1   4438    5322  0.833897 -1.023817       0.032717   
(1.0, 2.0]     2    2   5577    6162  0.905063 -0.382525       0.041114   
(2.0, 3.0]     3    3   7853    8519  0.921822 -0.169958       0.057892   
(3.0, 5.0]     4    5  22082   23622  0.934807  0.025661       0.162789   
(5.0, inf]     6   58  95698  101729  0.940715  0.126966       0.705488   

             badattribute  
Bucket                     
(-inf, 1.0]      0.091078  
(1.0, 2.0]       0.060272  
(2.0, 3.0]       0.068617  
(3.0, 5.0]       0.158665  
(5.0, inf]       0.621368  
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  131008  137449  0.953139  0.375256       0.965794   
(0.0, 1.0]     1    1    3396    5130  0.661988 -1.965152       0.025035   
(1.0, 3.0]     2    3    1041    2178  0.477961 -2.725530       0.007674   
(3.0, 5.0]     4    5     142     417  0.340528 -3.298263       0.001047   
(5.0, inf]     6   17      61     180  0.338889 -3.305569       0.000450   

             badattribute  
Bucket                     
(-inf, 0.0]      0.663610  
(0.0, 1.0]       0.178652  
(1.0, 3.0]       0.117144  
(3.0, 5.0]       0.028333  
(5.0, inf]       0.012260  

============================================================
             min  max    sum  total      rate       woe  goodattribute  \
Bucket                                                                   
(-inf, 0.0]    0    0  48757  53172  0.916968 -0.235478       0.359438   
(0.0, 1.0]     1    1  48477  51191  0.946983  0.245347       0.357373   
(1.0, 2.0]     2    2  29410  31155  0.943990  0.187261       0.216811   
(2.0, 3.0]     3    3   5812   6230  0.932905 -0.005120       0.042846   
(3.0, inf]     4   54   3192   3606  0.885191 -0.594782       0.023531   

             badattribute  
Bucket                     
(-inf, 0.0]      0.454873  
(0.0, 1.0]       0.279621  
(1.0, 2.0]       0.179786  
(2.0, 3.0]       0.043066  
(3.0, inf]       0.042654  
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  130993  138127  0.948352  0.272953       0.965683   
(0.0, 1.0]     1    1    3905    5647  0.691518 -1.830095       0.028788   
(1.0, 3.0]     2    3     688    1415  0.486219 -2.692457       0.005072   
(3.0, inf]     4   11      62     165  0.375758 -3.144914       0.000457   

             badattribute  
Bucket                     
(-inf, 0.0]      0.735009  
(0.0, 1.0]       0.179477  
(1.0, 3.0]       0.074902  
(3.0, inf]       0.010612  

============================================================
             min   max    sum  total      rate       woe  goodattribute  \
Bucket                                                                    
(-inf, 0.0]  0.0   0.0  81248  86234  0.942181  0.153553       0.598962   
(0.0, 1.0]   1.0   1.0  24370  26291  0.926933 -0.096812       0.179656   
(1.0, 2.0]   2.0   2.0  17929  19500  0.919436 -0.202612       0.132173   
(2.0, 3.0]   3.0   3.0   8646   9479  0.912122 -0.297501       0.063738   
(3.0, 5.0]   4.0   5.0   3241   3605  0.899029 -0.450836       0.023893   
(5.0, inf]   6.0  20.0    214    245  0.873469 -0.705330       0.001578   

             badattribute  
Bucket                     
(-inf, 0.0]      0.513703  
(0.0, 1.0]       0.197919  
(1.0, 2.0]       0.161859  
(2.0, 3.0]       0.085823  
(3.0, 5.0]       0.037503  
(5.0, inf]       0.003194  
plt.rcParams['font.sans-serif'] = ['SimHei']  # font needed to display the Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly
corr = data.corr()  # correlation coefficients between the variables
#print(corr.index)
xticks = list(corr.index)  # x-axis labels
yticks = list(corr.index)  # y-axis labels
fig = plt.figure(figsize=(22, 20))  # a large figsize makes the heat map bigger
ax1 = fig.add_subplot(1, 1, 1)
# Draw the heat map of correlation coefficients; cmap sets the colors (other options: 'PuRd', 'YlGnBu', 'rainbow')
sns.heatmap(corr, annot=True, cmap='RdPu', ax=ax1, annot_kws={'size': 16, 'weight': 'light', 'color': 'green'})
ax1.set_xticklabels(xticks, rotation=90, fontsize=20)
ax1.set_yticklabels(yticks, rotation=0, fontsize=20)
plt.show()

[Figure: heat map of the correlation coefficients]

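The IV (Information Value) is generally used to measure the predictive power of an independent variable.
After binning, each field gets one IV value that represents how strongly the field influences the label field. The larger the IV, the better the binning works and the stronger the field's influence on the label.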
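As a sketch of how the IV values can drive variable selection, the snippet below ranks the ten IV values from this post (the ivlist output further down, rounded to four decimals) and keeps those above a cut-off. The 0.1 cut-off is an assumption, a commonly quoted rule of thumb rather than something stated in this post; it happens to keep the same five variables that the logistic regression later on uses.

```python
# IV values copied from the ivlist output in this post, in column order x1..x10
iv_by_var = {
    'x1': 0.9892, 'x2': 0.2412, 'x3': 0.7189, 'x4': 0.0192, 'x5': 0.0470,
    'x6': 0.0797, 'x7': 0.8427, 'x8': 0.0599, 'x9': 0.5587, 'x10': 0.0347,
}

IV_CUTOFF = 0.1  # assumed rule-of-thumb threshold, not taken from the post

# Rank the variables from strongest to weakest
ranked = sorted(iv_by_var.items(), key=lambda kv: kv[1], reverse=True)
# Keep the variables at or above the cut-off, in column order
selected = [name for name, iv in iv_by_var.items() if iv >= IV_CUTOFF]

for name, iv in ranked:
    flag = 'keep' if iv >= IV_CUTOFF else 'drop'
    print(f'{name:>3}  IV={iv:.4f}  {flag}')
```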

list(data.columns)  # column names
['是否逾期',
 '信用额度',
 'age',
 '逾期30到60天次数',
 '债务占收入比',
 'MonthlyIncome',
 '未偿还贷款',
 '逾期90天次数',
 '抵押财产',
 '逾期60到89天次数',
 '家庭人数']
ivlist=[ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]
ivlist
[0.9891738801650342,
 0.24117787840722144,
 0.7189254612784397,
 0.019231014490398168,
 0.04701224378739177,
 0.07968800751468878,
 0.8426781922043317,
 0.059857660209756414,
 0.5586891401396025,
 0.03472056480690539]

Forgive this old-timer's girlish heart for insisting on pink, hahaha~

ivlist=[ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]  # IV of each variable
xticks = list(data.columns)[1:]  # x-axis labels: the ten feature names (target column skipped)
index=['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']  # short names of the ten variables
fig1 = plt.figure(figsize=(14,8))
ax1 = fig1.add_subplot(1, 1, 1)
x = np.arange(len(index))+1
#ax1.bar(x, ivlist, width=0.4,facecolor = 'hotpink')
ax1.bar(x, ivlist, width=0.4,facecolor = 'lightcoral')  # draw the bar chart
ax1.set_xticks(x)
ax1.set_xticklabels(xticks, rotation=90, fontsize=14)
ax1.set_ylabel('IV(Information Value)', fontsize=14)
# Add the value above each bar
for a, b in zip(x, ivlist):
    plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
plt.show()

[Figure: bar chart of the IV of each variable]

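The Weight of Evidence (WOE) transformation turns a logistic regression model into the standard scorecard format. The purpose of the WOE transformation is not to improve model quality. Some variables should not be included in the model, either because they add no value to it or because the error associated with their coefficients is large. A standard credit scorecard can in fact be built without the WOE transformation; the logistic regression model then has to handle a larger number of independent variables. That makes the modeling procedure more complex, but the resulting scorecard is the same.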
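One way to see why a WoE-based logistic model maps directly onto a scorecard, sketched with the symbols used in the scoring code later in this post (p is the factor, q the offset). Here β_0 and β_i stand for the fitted constant and coefficients, and WoE_i for the WoE of the bin that variable i falls into; the sketch assumes the model predicts the log-odds of being a good customer, which matches the inverted label.

```latex
\ln(\mathrm{odds}) = \beta_0 + \sum_i \beta_i \,\mathrm{WoE}_i

\mathrm{Score} = q + p \,\ln(\mathrm{odds})
               = \underbrace{(q + p\,\beta_0)}_{\text{base score}}
               + \sum_i \underbrace{p\,\beta_i\,\mathrm{WoE}_i}_{\text{points for variable } i}
```

Because each variable only takes as many WoE values as it has bins, every term of the sum is a small lookup table of points per bin, which is exactly what a scorecard is.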

Before building the model, we need to convert the selected variables to their WoE values for credit scoring.
# Remove the rows whose value in col lies more than 1.5 * IQR outside the quartiles
def outlier_processing(df, col):
    s = df[col]
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df = df[df[col] <= upper]
    df = df[df[col] >= lower]
    return df
from sklearn.model_selection import train_test_split
data = pd.read_csv('MissingData.csv')
# Remove the outliers where age equals 0
data = data[data['age'] > 0]
data = data[data['逾期30到60天次数'] < 90].copy()  # remove outliers
data['是否逾期'] = 1 - data['是否逾期']
Y = data['是否逾期']
X = data.iloc[:, 1:]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv', index=False)
test.to_csv('TestData.csv', index=False)
print(train.shape)
print(test.shape)
(101747, 11)
(43607, 11)
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
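# Replace each value with the WoE of its bin
def replace_woe(series, cut, woe):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        # Search from the last cut point downwards for the first one that value reaches.
        # Bins are treated as [cut[m], cut[m+1]) here, while pd.cut in self_bin uses
        # (cut[m], cut[m+1]], so a value equal to a cut point gets the WoE of the next bin.
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(woe[m])
        i += 1
    return result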
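For comparison, here is a vectorized sketch of the same lookup using `pd.cut`. It uses the same right-closed intervals as `self_bin`, so a value equal to a cut point stays in the lower bin, whereas the loop above assigns it to the next bin. The helper name `woe_lookup` is hypothetical, the input values are made up, and the cut points and WoE values are `cutx3` and `woex3` from this post. It assumes the series has no missing values.

```python
import numpy as np
import pandas as pd

def woe_lookup(series, cut, woe):
    """Map each value to the WoE of the (left, right] bin it falls into."""
    # labels=False returns the integer position of the bin for each value
    codes = pd.cut(series, bins=cut, labels=False)
    return pd.Series(np.asarray(woe)[codes.to_numpy()], index=series.index)

cutx3 = [float('-inf'), 0, 1, 3, 5, float('inf')]
woex3 = [0.528, -0.903, -1.735, -2.381, -2.654]

# Made-up counts of past-due events, one per bin
late_counts = pd.Series([0, 1, 2, 4, 7])
mapped = woe_lookup(late_counts, cutx3, woex3)
print(mapped.tolist())
```

For a count variable like this one the boundary convention matters: a customer with 0 past-due events belongs to the (-inf, 0] bin in the binning table, and only a right-closed lookup puts them there.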

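We replace every variable and save the results to trainWoeData.csv and TestWoeData.csv:
Split the whole dataset into two parts, one for training and one for testing
Convert the attribute values of the training part to WOE, and likewise for the test part
Train a logistic regression on the training set to get the model, then evaluate it on the test set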

# (cut points, WoE values) of each column, from the binning step above
woe_maps = {
    '信用额度': (cutx1, woex1),
    'age': (cutx2, woex2),
    '逾期30到60天次数': (cutx3, woex3),
    '债务占收入比': (cutx4, woex4),
    'MonthlyIncome': (cutx5, woex5),
    '未偿还贷款': (cutx6, woex6),
    '逾期90天次数': (cutx7, woex7),
    '抵押财产': (cutx8, woex8),
    '逾期60到89天次数': (cutx9, woex9),
    '家庭人数': (cutx10, woex10),
}

# Replace the TrainData values with WoE
data = pd.read_csv('TrainData.csv')
#print(data.head())
for col, (cut, woe) in woe_maps.items():
    data[col] = Series(replace_woe(data[col], cut, woe))
data.to_csv('trainWoeData.csv', index=False)

# Replace the TestData values with WoE
test = pd.read_csv('TestData.csv')
for col, (cut, woe) in woe_maps.items():
    test[col] = Series(replace_woe(test[col], cut, woe))
test.to_csv('TestWoeData.csv', index=False)
# Training
matplotlib.rcParams['axes.unicode_minus'] = False
# Load the data
data = pd.read_csv('trainWoeData.csv')
# Dependent variable
Y=data['是否逾期']
# Independent variables; drop those with little influence on the dependent variable
X=data.drop(['是否逾期','债务占收入比','MonthlyIncome', '未偿还贷款','抵押财产','家庭人数'],axis=1)
X1=sm.add_constant(X)
logit=sm.Logit(Y,X1)
result=logit.fit()
print(result.params)


# Testing
test = pd.read_csv('TestWoeData.csv')
Y_test = test['是否逾期']
# Drop the same columns as in training so the features match the fitted model
X_test = test.drop(['是否逾期', '债务占收入比', 'MonthlyIncome', '未偿还贷款','抵押财产', '家庭人数'], axis=1)
X3 = sm.add_constant(X_test)
resu = result.predict(X3)
fpr, tpr, threshold = roc_curve(Y_test, resu)
rocauc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.show()

Optimization terminated successfully.
         Current function value: 0.186940
         Iterations 8
const         9.555259
信用额度          0.630777
age           0.511745
逾期30到60天次数    1.035706
逾期90天次数       1.747674
逾期60到89天次数    1.085101
dtype: float64

[Figure: ROC curve]

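The ROC curve and AUC are used to evaluate how well the model fits. The figure above is the ROC curve; the AUC shown there is 0.81, which suggests the model separates good and bad customers reasonably well.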
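AUC can also be read as the probability that a randomly chosen good customer gets a higher predicted value than a randomly chosen bad one. Here is a minimal pure-Python sketch of that definition with made-up labels and scores. The function name `pairwise_auc` is hypothetical, and this quadratic-time version is only meant for small inputs.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented example: 1 = good customer, 0 = bad customer
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3]
auc_value = pairwise_auc(labels, scores)
print(auc_value)
```

Of the six good/bad pairs in this toy example, five are ordered correctly, so the AUC is 5/6. A value of 0.5 would correspond to the red diagonal in the ROC plot.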

Computing the score

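# Compute the scores
# coe holds the logistic regression coefficients (constant first).
# The values are hard-coded and differ slightly from the result.params printed above.
coe=[9.738849,0.638002,0.505995,1.032246,1.790041,1.131956]
# Take 600 as the base score, PDO = 20 (the good:bad odds double for every 20 points), and base odds of 20.
p = 20 / math.log(2)
q = 600 - 20 * math.log(20) / math.log(2)
baseScore = round(q + p * coe[0], 0)
baseScore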
795.0
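The scaling above can be checked numerically. In this sketch `score_from_odds` is a hypothetical helper; the constants (base score 600, PDO 20, base odds 20) and the intercept 9.738849 come from the code above.

```python
import math

BASE_SCORE, PDO, BASE_ODDS = 600, 20, 20

p = PDO / math.log(2)                     # points per unit of ln(odds)
q = BASE_SCORE - p * math.log(BASE_ODDS)  # offset chosen so that odds of 20 map to 600

def score_from_odds(odds):
    """Score assigned to a given good:bad odds under this scaling."""
    return q + p * math.log(odds)

at_base = score_from_odds(20)   # odds equal to the base odds
doubled = score_from_odds(40)   # odds twice the base odds
# Base score of the scorecard: the offset plus the points from the intercept
base_from_intercept = round(q + p * 9.738849)
print(round(at_base, 6), round(doubled, 6), base_from_intercept)
```

The first value is the base score of 600, the second is exactly PDO points higher, and the third reproduces the 795 printed above. That is the whole role of p and q: they convert the model's log-odds into a scale where 600 means odds of 20 and every 20 points doubles the odds.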
# Compute the points for each bin of one variable
def get_score(coe,woe,factor,label):
    scores=[]
    for w in woe:
        score=round(coe*w*factor,0)
        scores.append(score)
    print(list(data.columns)[label],'woe:',woe,'score:',scores)
    return scores

# Points for each variable
x1 = get_score(coe[1], woex1, p,1)
x2 = get_score(coe[2], woex2, p,2)
x3 = get_score(coe[3], woex3, p,3)
x7 = get_score(coe[4], woex7, p,7)
x9 = get_score(coe[5], woex9, p,9)

信用额度 woe: [1.322, 1.225, 0.294, -1.102] score: [24.0, 23.0, 5.0, -20.0]
age woe: [-0.562, -0.369, -0.258, -0.216, -0.094, 0.211, 0.502, 0.897, 1.104] score: [-8.0, -5.0, -4.0, -3.0, -1.0, 3.0, 7.0, 13.0, 16.0]
逾期30到60天次数 woe: [0.528, -0.903, -1.735, -2.381, -2.654] score: [16.0, -27.0, -52.0, -71.0, -79.0]
逾期90天次数 woe: [0.375, -1.965, -2.726, -3.298, -3.306] score: [19.0, -101.0, -141.0, -170.0, -171.0]
逾期60到89天次数 woe: [0.273, -1.83, -2.692, -3.145] score: [9.0, -60.0, -88.0, -103.0]

Scoring criteria

[Figure: scoring criteria]

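# Look up the points for each value of a variable
def compute_score(series, cut, score):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        # Same lookup as replace_woe: find the last cut point that value reaches
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(score[m])
        i += 1
    return result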
test1 = pd.read_csv('TestData.csv')
test1['BaseScore']=Series(np.zeros(len(test1)))+baseScore
test1['x1'] = Series(compute_score(test1['信用额度'], cutx1, x1))
test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
test1['x3'] = Series(compute_score(test1['逾期30到60天次数'], cutx3, x3))
test1['x7'] = Series(compute_score(test1['逾期90天次数'], cutx7, x7))
test1['x9'] = Series(compute_score(test1['逾期60到89天次数'], cutx9, x9))
test1['Score'] = test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] +test1['x9']  + baseScore
test1.to_csv('ScoreData.csv', index=False)

x1 to x9 are the points for the corresponding fields; adding them to the base score of 795 gives the final score.

test1.head()
   是否逾期      信用额度  age  逾期30到60天次数       债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.617352   41           4     0.167589        15000.0     14        1   
1     1  0.084176   58           0     0.388851        14583.0     12        0   
2     1  0.307757   47           0     0.181313        18900.0     10        0   
3     1  0.003265   65           0     0.304616         6000.0      6        0   
4     1  0.018517   38           0  5870.000000         2554.0      6        0   

   抵押财产  逾期60到89天次数  家庭人数  BaseScore    x1    x2    x3     x7    x9  Score  
0     1           0   2.0      795.0 -20.0  -4.0 -71.0 -141.0 -60.0  499.0  
1     3           0   1.0      795.0  23.0   3.0 -27.0 -101.0 -60.0  633.0  
2     2           0   3.0      795.0   5.0  -3.0 -27.0 -101.0 -60.0  609.0  
3     2           0   0.0      795.0  24.0  13.0 -27.0 -101.0 -60.0  644.0  
4     1           0   0.0      795.0  24.0  -5.0 -27.0 -101.0 -60.0  626.0  

Reference blogs
[1]: http://math.stackexchange.com/
[2]: https://www.jianshu.com/p/159f381c661d

I'll add more here if I look into this further.