Python Data Analysis in Practice [11]: Building a Credit-Risk Scorecard Model with scorecardpy [source code link at the end]

Scorecard model

  • The scorecardpy library

GitHub: https://github.com/ShichenXie/scorecardpy

1. Data Preprocessing

import scorecardpy as sc
import pandas as pd
import numpy as np

scorecardpy ships with a sample dataset:

dat = sc.germancredit()

Check the number of rows and columns:

dat.shape
(1000, 21)

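The data has 1,000 rows and 21 columns.

Inspect the contents. sample() draws random rows, so it gives a broader view of the data than head(), which only shows the first rows.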

dat.sample(5)
(shown transposed: one line per column of dat, one column per sampled row)
column | 547 | 617 | 186 | 776 | 243
status.of.existing.checking.account | no checking account | ... < 0 DM | 0 <= ... < 200 DM | no checking account | no checking account
duration.in.month | 24 | 6 | 9 | 36 | 12
credit.history | existing credits paid back duly till now | critical account/ other credits existing (not ... | all credits at this bank paid back duly | critical account/ other credits existing (not ... | critical account/ other credits existing (not ...
purpose | radio/television | car (new) | car (used) | car (new) | business
credit.amount | 1552 | 3676 | 5129 | 3535 | 1185
savings.account.and.bonds | ... < 100 DM | ... < 100 DM | ... < 100 DM | ... < 100 DM | ... < 100 DM
present.employment.since | 4 <= ... < 7 years | 1 <= ... < 4 years | ... >= 7 years | 4 <= ... < 7 years | 1 <= ... < 4 years
installment.rate.in.percentage.of.disposable.income | 3 | 1 | 2 | 4 | 3
personal.status.and.sex | male : single | male : single | female : divorced/separated/married | male : single | female : divorced/separated/married
other.debtors.or.guarantors | none | none | none | none | none
... | ... | ... | ... | ... | ...
property | car or other, not in attribute Savings account... | real estate | unknown / no property | car or other, not in attribute Savings account... | real estate
age.in.years | 32 | 37 | 74 | 37 | 27
other.installment.plans | bank | none | bank | none | none
housing | own | rent | for free | own | own
number.of.existing.credits.at.this.bank | 1 | 3 | 1 | 2 | 2
job | skilled employee / official | skilled employee / official | management/ self-employed/ highly qualified em... | skilled employee / official | skilled employee / official
number.of.people.being.liable.to.provide.maintenance.for | 2 | 2 | 2 | 1 | 1
telephone | none | none | yes, registered under the customers name | yes, registered under the customers name | none
foreign.worker | yes | yes | yes | yes | yes
creditability | good | good | bad | good | good

5 rows × 21 columns

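The value 'none' shows up in several columns. It is treated as missing here and replaced with np.nan, which makes it easy to compute the missing rate of each variable.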
dat = dat.replace('none',np.nan)

Compute the missing rate of each variable:

(dat.isnull().sum()/dat.shape[0]).map(lambda x:"{:.2%}".format(x))
status.of.existing.checking.account                          0.00%
duration.in.month                                            0.00%
credit.history                                               0.00%
purpose                                                      0.00%
credit.amount                                                0.00%
savings.account.and.bonds                                    0.00%
present.employment.since                                     0.00%
installment.rate.in.percentage.of.disposable.income          0.00%
personal.status.and.sex                                      0.00%
other.debtors.or.guarantors                                 90.70%
present.residence.since                                      0.00%
property                                                     0.00%
age.in.years                                                 0.00%
other.installment.plans                                     81.40%
housing                                                      0.00%
number.of.existing.credits.at.this.bank                      0.00%
job                                                          0.00%
number.of.people.being.liable.to.provide.maintenance.for     0.00%
telephone                                                   59.60%
foreign.worker                                               0.00%
creditability                                                0.00%
dtype: object

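other.debtors.or.guarantors (guarantors): more than 90% of this column is missing, so it can be dropped.

other.installment.plans (installment plans): the missing rate is also high (81.40%) and only two categories remain, so it can be dropped as well.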

dat["other.installment.plans"].value_counts()
bank      139
stores     47
Name: other.installment.plans, dtype: int64

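telephone: like a name, the phone number itself carries little value for modeling. Whether a phone number was provided at all could be a useful feature, but that is not explored here.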

dat = dat.drop(columns=["other.debtors.or.guarantors","other.installment.plans","telephone"])

Inspect the DataFrame info:

dat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                                                    Non-Null Count  Dtype   
---  ------                                                    --------------  -----   
 0   status.of.existing.checking.account                       1000 non-null   category
 1   duration.in.month                                         1000 non-null   int64   
 2   credit.history                                            1000 non-null   category
 3   purpose                                                   1000 non-null   object  
 4   credit.amount                                             1000 non-null   int64   
 5   savings.account.and.bonds                                 1000 non-null   category
 6   present.employment.since                                  1000 non-null   category
 7   installment.rate.in.percentage.of.disposable.income       1000 non-null   int64   
 8   personal.status.and.sex                                   1000 non-null   category
 9   present.residence.since                                   1000 non-null   int64   
 10  property                                                  1000 non-null   category
 11  age.in.years                                              1000 non-null   int64   
 12  housing                                                   1000 non-null   category
 13  number.of.existing.credits.at.this.bank                   1000 non-null   int64   
 14  job                                                       1000 non-null   category
 15  number.of.people.being.liable.to.provide.maintenance.for  1000 non-null   int64   
 16  foreign.worker                                            1000 non-null   category
 17  creditability                                             1000 non-null   object  
dtypes: category(9), int64(7), object(2)
memory usage: 80.8+ KB

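The data consists of int64, category, and object columns. The category dtype is a special case in pandas, so converting those columns to object is recommended.

Check how many distinct values each variable has:

# Convert category columns to object (str) along the way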
for c in dat.columns:
    if str(dat[c].dtype) == "category":
        dat[c] = dat[c].astype(str)
    print(c,":",len(dat[c].unique()))
status.of.existing.checking.account : 4
duration.in.month : 33
credit.history : 5
purpose : 10
credit.amount : 921
savings.account.and.bonds : 5
present.employment.since : 5
installment.rate.in.percentage.of.disposable.income : 4
personal.status.and.sex : 3
present.residence.since : 4
property : 4
age.in.years : 53
housing : 3
number.of.existing.credits.at.this.bank : 4
job : 4
number.of.people.being.liable.to.provide.maintenance.for : 2
foreign.worker : 2
creditability : 2

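credit.amount has 921 distinct values and age.in.years has 53.
Variables with many distinct values need to be merged into intervals; for variables with few values, it depends on the case.

Descriptive statistics

Check the mean, maximum, minimum, and quantiles of each numeric variable: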

dat.describe()
      | duration.in.month | credit.amount | installment.rate.in.percentage.of.disposable.income | present.residence.since | age.in.years | number.of.existing.credits.at.this.bank | number.of.people.being.liable.to.provide.maintenance.for
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000
mean  | 20.903000 | 3271.258000 | 2.973000 | 2.845000 | 35.546000 | 1.407000 | 1.155000
std   | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086
min   | 4.000000 | 250.000000 | 1.000000 | 1.000000 | 19.000000 | 1.000000 | 1.000000
25%   | 12.000000 | 1365.500000 | 2.000000 | 2.000000 | 27.000000 | 1.000000 | 1.000000
50%   | 18.000000 | 2319.500000 | 3.000000 | 3.000000 | 33.000000 | 1.000000 | 1.000000
75%   | 24.000000 | 3972.250000 | 4.000000 | 4.000000 | 42.000000 | 2.000000 | 1.000000
max   | 72.000000 | 18424.000000 | 4.000000 | 4.000000 | 75.000000 | 4.000000 | 2.000000

Correlation between variables

dat.corr()
 | duration.in.month | credit.amount | installment.rate.in.percentage.of.disposable.income | present.residence.since | age.in.years | number.of.existing.credits.at.this.bank | number.of.people.being.liable.to.provide.maintenance.for
duration.in.month | 1.000000 | 0.624984 | 0.074749 | 0.034067 | -0.036136 | -0.011284 | -0.023834
credit.amount | 0.624984 | 1.000000 | -0.271316 | 0.028926 | 0.032716 | 0.020795 | 0.017142
installment.rate.in.percentage.of.disposable.income | 0.074749 | -0.271316 | 1.000000 | 0.049302 | 0.058266 | 0.021669 | -0.071207
present.residence.since | 0.034067 | 0.028926 | 0.049302 | 1.000000 | 0.266419 | 0.089625 | 0.042643
age.in.years | -0.036136 | 0.032716 | 0.058266 | 0.266419 | 1.000000 | 0.149254 | 0.118201
number.of.existing.credits.at.this.bank | -0.011284 | 0.020795 | 0.021669 | 0.089625 | 0.149254 | 1.000000 | 0.109667
number.of.people.being.liable.to.provide.maintenance.for | -0.023834 | 0.017142 | -0.071207 | 0.042643 | 0.118201 | 0.109667 | 1.000000

The correlation between credit.amount and duration.in.month is 0.624984. Depending on the business context, only one variable from a highly correlated pair may be kept.

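2. Variable Selection

Reference: https://zhuanlan.zhihu.com/p/80134853

Scorecard modeling commonly uses WOE and IV to screen variables, typically keeping those with IV > 0.02. The larger the IV, the stronger the variable's predictive power for y and the stronger the case for including it in the model.

WOE (Weight of Evidence): how strongly a given bin of a variable influences y.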

  • Formula:
    $$WOE_i=\ln\left(\frac{R_{0i}}{R_{0T}}\right)-\ln\left(\frac{R_{1i}}{R_{1T}}\right)$$
    R_{0i}: number of samples with y = 0 in the i-th bin of the variable.
    R_{0T}: total number of samples with y = 0.
    R_{1i}: number of samples with y = 1 in the i-th bin of the variable.
    R_{1T}: total number of samples with y = 1.

  • Example:
    Split age.in.years into the four bins [-inf,26.0), [26.0,35.0), [35.0,40.0), [40.0,inf), count the samples with y = 0 (good) and y = 1 (bad) in each bin, and compute the WOE.
    For instance, the WOE of age.in.years in the bin [26.0,35.0):
    $$WOE_i=\ln\left(\frac{R_{0i}}{R_{0T}}\right)-\ln\left(\frac{R_{1i}}{R_{1T}}\right)=\ln\left(\frac{246}{700}\right)-\ln\left(\frac{112}{300}\right)=-0.060465$$
    The WOE of the other bins is computed the same way.
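
The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch, not scorecardpy's implementation; the good/bad counts are the ones listed for age.in.years in this article, and the variable names are illustrative.

```python
import math

# Good/bad counts per age.in.years bin, taken from the example in this article.
bins = ["[-inf,26.0)", "[26.0,35.0)", "[35.0,40.0)", "[40.0,inf)"]
good = [110, 246, 123, 221]  # y = 0
bad = [80, 112, 30, 78]      # y = 1

good_total = sum(good)  # R_0T
bad_total = sum(bad)    # R_1T

# WOE_i = ln(R_0i / R_0T) - ln(R_1i / R_1T)
woe = [
    math.log(g / good_total) - math.log(b / bad_total)
    for g, b in zip(good, bad)
]

for label, value in zip(bins, woe):
    print(f"{label:>12}  WOE = {value:.6f}")
```

With this sign convention a negative WOE marks a bin with a higher share of bad samples than of good ones.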

IV (Information Value): how much information a variable carries about y.

  • Formula:
    $$IV=\sum_{i=1}^n\left(\frac{R_{0i}}{R_{0T}}-\frac{R_{1i}}{R_{1T}}\right)\cdot WOE_i$$
  • Example (age.in.years, with WOE_i computed by the WOE formula above, so each term is positive):
    $$\begin{aligned}
    IV&=\sum_{i=1}^n\left(\frac{R_{0i}}{R_{0T}}-\frac{R_{1i}}{R_{1T}}\right)\cdot WOE_i\\
    &=\left(\frac{110}{700}-\frac{80}{300}\right)\cdot(-0.528844)\\
    &+\left(\frac{246}{700}-\frac{112}{300}\right)\cdot(-0.060465)\\
    &+\left(\frac{123}{700}-\frac{30}{300}\right)\cdot 0.563689\\
    &+\left(\frac{221}{700}-\frac{78}{300}\right)\cdot 0.194156\\
    &=0.112742
    \end{aligned}$$
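
The IV sum can be checked the same way. The sketch below recomputes it from the raw good/bad counts of the four age.in.years bins; it only assumes the counts quoted in this article and should land close to the total_iv of 0.112742 reported by scorecardpy.

```python
import math

# Good (y = 0) and bad (y = 1) counts for the four age.in.years bins.
good = [110, 246, 123, 221]
bad = [80, 112, 30, 78]
good_total, bad_total = sum(good), sum(bad)

iv = 0.0
for g, b in zip(good, bad):
    good_share = g / good_total
    bad_share = b / bad_total
    woe = math.log(good_share) - math.log(bad_share)
    # Both factors always share the same sign, so every bin adds a non-negative amount.
    iv += (good_share - bad_share) * woe

print(round(iv, 6))
```

Because the two factors flip sign together, the IV does not depend on whether WOE is defined as good over bad or bad over good.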

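The formulas look complicated, but on closer inspection the math behind them is simple. They are also already implemented in scorecardpy, so you only need to call the functions with the right arguments.

WOE of age.in.years as computed by scorecardpy. Note that scorecardpy reports WOE with the opposite sign, ln(bad share / good share), so its woe column is the negative of the value from the WOE formula above (0.060465 instead of -0.060465 for [26.0,35.0)); the IV is the same either way.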

# bins_adj_df[bins_adj_df.variable=="age.in.years"]
 | level_1 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
4 | 0 | age.in.years | [-inf,26.0) | 190 | 0.190 | 110 | 80 | 0.421053 | 0.528844 | 0.057921 | 0.112742 | 26.0 | False
5 | 1 | age.in.years | [26.0,35.0) | 358 | 0.358 | 246 | 112 | 0.312849 | 0.060465 | 0.001324 | 0.112742 | 35.0 | False
6 | 2 | age.in.years | [35.0,40.0) | 153 | 0.153 | 123 | 30 | 0.196078 | -0.563689 | 0.042679 | 0.112742 | 40.0 | False
7 | 3 | age.in.years | [40.0,inf) | 299 | 0.299 | 221 | 78 | 0.260870 | -0.194156 | 0.010817 | 0.112742 | inf | False

sc.var_filter()

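  • dt: the input DataFrame
  • y: name of the y variable
  • iv_limit: variables with IV below this value are dropped (default 0.02)
  • missing_limit: variables with a missing rate above this value are dropped (default 0.95)
  • identical_limit: variables whose most frequent value exceeds this share are dropped (default 0.95)
  • positive: label of the bad class
  • var_rm: names of variables to force-remove
  • var_kp: names of variables to force-keep
  • return_rm_reason: whether to return the reason each variable was removed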
dt_s = sc.var_filter(dat,y="creditability",iv_limit=0.02)
dat.shape
(1000, 18)
dt_s.shape
(1000, 13)

var_filter() reduces the data from 18 columns to 13 (12 predictors plus y).

Splitting the data

sc.split_df(dt, y=None, ratio=0.7, seed=186)

train,test = sc.split_df(dt=dt_s,y="creditability").values()

Distribution of y in the training data:

train.creditability.value_counts()
0    490
1    210
Name: creditability, dtype: int64

Distribution of y in the test data:

test.creditability.value_counts()
0    210
1     90
Name: creditability, dtype: int64

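3. Variable Binning

Common binning methods include chi-square binning and decision-tree binning, among others. This section briefly introduces chi-square binning.

Why bin?
After binning, small fluctuations in a variable no longer affect the stability of the model. Take income as an example: 10000 and 11000 probably have the same effect on y, so putting them into the same bin is a reasonable choice.

Binning requirements

  1. Ideally 5 to 7 bins per variable
  2. Ordered, monotonic, balanced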

Chi-square binning:

Reference: https://zhuanlan.zhihu.com/p/115267395

  • Idea: the chi-square statistic measures how likely it is that the gap between expected and observed counts is due to chance. Adjacent bins whose class distributions do not differ significantly are merged.
  • Chi-square statistic:
    $$\chi^2=\sum_{i=1}^n\sum_{c=1}^m\frac{(A_{ic}-E_{ic})^2}{E_{ic}}$$
    n: number of bins being compared.
    m: number of classes of y, usually 2.
    A_{ic}: observed number of samples of class c in bin i.
    E_{ic}: expected number of samples of class c in bin i, $$E_{ic}=\frac{T_i\cdot T_c}{T}$$ where T_i is the total of bin i, T_c is the total of class c, and T is the total sample size.

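  • Steps (numeric variables):
    1. Deduplicate and sort the values to get the initial bins A1, A2, A3, ..., and count the samples in each bin.
    2. Compute the chi-square statistic of every pair of adjacent bins (A1 with A2, A2 with A3, and so on).
    3. If the smallest chi-square statistic is below the threshold (derived from the degrees of freedom and the confidence level), merge that pair into a new bin.
    4. Repeat steps 2 and 3 until a stopping condition is met:
       - the smallest chi-square statistic exceeds the threshold, or
       - the number of bins reaches the specified number.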
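
The merging steps can be sketched in pure Python for a binary target. This is an illustrative sketch, not scorecardpy's chimerge code: the function names chi2_pair and chimerge are hypothetical, the default threshold 3.841 is the 95% chi-square critical value for 1 degree of freedom, and the toy data is made up.

```python
def chi2_pair(a, b):
    """Chi-square statistic of two adjacent bins; a and b are [good, bad] counts."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for c in range(2):
            expected = sum(row) * (a[c] + b[c]) / total  # E_ic = T_i * T_c / T
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(values, labels, max_bins=5, threshold=3.841):
    """Return the lower edge of each bin after merging."""
    # Step 1: one bin per distinct value, holding [count of y=0, count of y=1].
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, [0, 0])[y] += 1
    edges = sorted(counts)
    bins = [counts[e] for e in edges]

    # Stop once the requested number of bins is reached.
    while len(bins) > max_bins:
        # Step 2: chi-square of every adjacent pair.
        stats = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        # Stop when even the most similar pair differs significantly.
        if stats[i] > threshold:
            break
        # Step 3: merge the most similar pair.
        bins[i] = [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]
        del bins[i + 1]
        del edges[i + 1]
    return edges


# Toy data: values 1-4 are all good (0), values 5-8 are all bad (1).
values = [v for v in range(1, 9) for _ in range(5)]
labels = [0 if v <= 4 else 1 for v in values]
print(chimerge(values, labels, max_bins=2))
```

On the toy data the only significant boundary lies between 4 and 5, so the sketch ends with two bins starting at 1 and 5.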

woebin()

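  • scorecardpy uses decision-tree binning by default, method='tree'
  • Chi-square binning is used here, method='chimerge'
  • The return value is a dict with one DataFrame per variable; use pandas.concat() to view them all together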
bins = sc.woebin(dt_s,y="creditability",method="chimerge")
bins["installment.rate.in.percentage.of.disposable.income"]
 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
0 | installment.rate.in.percentage.of.disposable.i... | [-inf,3.0) | 367 | 0.367 | 271 | 96 | 0.261580 | -0.190473 | 0.012789 | 0.019769 | 3.0 | False
1 | installment.rate.in.percentage.of.disposable.i... | [3.0,inf) | 633 | 0.633 | 429 | 204 | 0.322275 | 0.103961 | 0.006980 | 0.019769 | inf | False
bins_df = pd.concat(bins).reset_index().drop(columns="level_0")
bins_df
 | level_1 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
0 | 0 | credit.amount | [-inf,1400.0) | 267 | 0.267 | 185 | 82 | 0.307116 | 0.033661 | 0.000305 | 0.171431 | 1400.0 | False
1 | 1 | credit.amount | [1400.0,1800.0) | 105 | 0.105 | 87 | 18 | 0.171429 | -0.728239 | 0.046815 | 0.171431 | 1800.0 | False
2 | 2 | credit.amount | [1800.0,2000.0) | 60 | 0.060 | 39 | 21 | 0.350000 | 0.228259 | 0.003261 | 0.171431 | 2000.0 | False
3 | 3 | credit.amount | [2000.0,4000.0) | 322 | 0.322 | 248 | 74 | 0.229814 | -0.362066 | 0.038965 | 0.171431 | 4000.0 | False
4 | 4 | credit.amount | [4000.0,inf) | 246 | 0.246 | 141 | 105 | 0.426829 | 0.552498 | 0.082085 | 0.171431 | inf | False
5 | 0 | age.in.years | [-inf,26.0) | 190 | 0.190 | 110 | 80 | 0.421053 | 0.528844 | 0.057921 | 0.123935 | 26.0 | False
6 | 1 | age.in.years | [26.0,35.0) | 358 | 0.358 | 246 | 112 | 0.312849 | 0.060465 | 0.001324 | 0.123935 | 35.0 | False
7 | 2 | age.in.years | [35.0,37.0) | 79 | 0.079 | 67 | 12 | 0.151899 | -0.872488 | 0.048610 | 0.123935 | 37.0 | False
8 | 3 | age.in.years | [37.0,inf) | 373 | 0.373 | 277 | 96 | 0.257373 | -0.212371 | 0.016080 | 0.123935 | inf | False
9 | 0 | housing | own | 713 | 0.713 | 527 | 186 | 0.260870 | -0.194156 | 0.025795 | 0.082951 | own | False
10 | 1 | housing | rent%,%for free | 287 | 0.287 | 173 | 114 | 0.397213 | 0.430205 | 0.057156 | 0.082951 | rent%,%for free | False
11 | 0 | property | real estate | 282 | 0.282 | 222 | 60 | 0.212766 | -0.461035 | 0.054007 | 0.112634 | real estate | False
12 | 1 | property | building society savings agreement/ life insur... | 564 | 0.564 | 391 | 173 | 0.306738 | 0.031882 | 0.000577 | 0.112634 | building society savings agreement/ life insur... | False
13 | 2 | property | unknown / no property | 154 | 0.154 | 87 | 67 | 0.435065 | 0.586082 | 0.058050 | 0.112634 | unknown / no property | False
14 | 0 | duration.in.month | [-inf,8.0) | 87 | 0.087 | 78 | 9 | 0.103448 | -1.312186 | 0.106849 | 0.282618 | 8.0 | False
15 | 1 | duration.in.month | [8.0,16.0) | 344 | 0.344 | 264 | 80 | 0.232558 | -0.346625 | 0.038294 | 0.282618 | 16.0 | False
16 | 2 | duration.in.month | [16.0,34.0) | 399 | 0.399 | 270 | 129 | 0.323308 | 0.108688 | 0.004813 | 0.282618 | 34.0 | False
17 | 3 | duration.in.month | [34.0,44.0) | 100 | 0.100 | 58 | 42 | 0.420000 | 0.524524 | 0.029973 | 0.282618 | 44.0 | False
18 | 4 | duration.in.month | [44.0,inf) | 70 | 0.070 | 30 | 40 | 0.571429 | 1.134980 | 0.102689 | 0.282618 | inf | False
19 | 0 | status.of.existing.checking.account | no checking account | 394 | 0.394 | 348 | 46 | 0.116751 | -1.176263 | 0.404410 | 0.666012 | no checking account | False
20 | 1 | status.of.existing.checking.account | ... >= 200 DM / salary assignments for at leas... | 63 | 0.063 | 49 | 14 | 0.222222 | -0.405465 | 0.009461 | 0.666012 | ... >= 200 DM / salary assignments for at leas... | False
21 | 2 | status.of.existing.checking.account | 0 <= ... < 200 DM | 269 | 0.269 | 164 | 105 | 0.390335 | 0.401392 | 0.046447 | 0.666012 | 0 <= ... < 200 DM | False
22 | 3 | status.of.existing.checking.account | ... < 0 DM | 274 | 0.274 | 139 | 135 | 0.492701 | 0.818099 | 0.205693 | 0.666012 | ... < 0 DM | False
23 | 0 | installment.rate.in.percentage.of.disposable.i... | [-inf,3.0) | 367 | 0.367 | 271 | 96 | 0.261580 | -0.190473 | 0.012789 | 0.019769 | 3.0 | False
24 | 1 | installment.rate.in.percentage.of.disposable.i... | [3.0,inf) | 633 | 0.633 | 429 | 204 | 0.322275 | 0.103961 | 0.006980 | 0.019769 | inf | False
25 | 0 | savings.account.and.bonds | ... >= 1000 DM%,%500 <= ... < 1000 DM%,%unknow... | 294 | 0.294 | 245 | 49 | 0.166667 | -0.762140 | 0.142266 | 0.189391 | ... >= 1000 DM%,%500 <= ... < 1000 DM%,%unknow... | False
26 | 1 | savings.account.and.bonds | 100 <= ... < 500 DM%,%... < 100 DM | 706 | 0.706 | 455 | 251 | 0.355524 | 0.252453 | 0.047125 | 0.189391 | 100 <= ... < 500 DM%,%... < 100 DM | False
27 | 0 | present.employment.since | 4 <= ... < 7 years%,%... >= 7 years | 427 | 0.427 | 324 | 103 | 0.241218 | -0.298717 | 0.035704 | 0.082865 | 4 <= ... < 7 years%,%... >= 7 years | False
28 | 1 | present.employment.since | 1 <= ... < 4 years | 339 | 0.339 | 235 | 104 | 0.306785 | 0.032103 | 0.000352 | 0.082865 | 1 <= ... < 4 years | False
29 | 2 | present.employment.since | unemployed%,%... < 1 year | 234 | 0.234 | 141 | 93 | 0.397436 | 0.431137 | 0.046809 | 0.082865 | unemployed%,%... < 1 year | False
30 | 0 | personal.status.and.sex | male : single%,%male : married/widowed | 640 | 0.640 | 469 | 171 | 0.267188 | -0.161641 | 0.016164 | 0.042633 | male : single%,%male : married/widowed | False
31 | 1 | personal.status.and.sex | female : divorced/separated/married | 360 | 0.360 | 231 | 129 | 0.358333 | 0.264693 | 0.026469 | 0.042633 | female : divorced/separated/married | False
32 | 0 | credit.history | critical account/ other credits existing (not ... | 293 | 0.293 | 243 | 50 | 0.170648 | -0.733741 | 0.132423 | 0.291829 | critical account/ other credits existing (not ... | False
33 | 1 | credit.history | delay in paying off in the past%,%existing cre... | 618 | 0.618 | 421 | 197 | 0.318770 | 0.087869 | 0.004854 | 0.291829 | delay in paying off in the past%,%existing cre... | False
34 | 2 | credit.history | all credits at this bank paid back duly%,%no c... | 89 | 0.089 | 36 | 53 | 0.595506 | 1.234071 | 0.154553 | 0.291829 | all credits at this bank paid back duly%,%no c... | False
35 | 0 | purpose | retraining%,%car (used)%,%radio/television | 392 | 0.392 | 312 | 80 | 0.204082 | -0.513679 | 0.091973 | 0.142092 | retraining%,%car (used)%,%radio/television | False
36 | 1 | purpose | furniture/equipment%,%domestic appliances%,%bu... | 608 | 0.608 | 388 | 220 | 0.361842 | 0.279920 | 0.050119 | 0.142092 | furniture/equipment%,%domestic appliances%,%bu... | False

woebin_plot()

  • Plots the distribution of a variable across its bins
bins["age.in.years"]
 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
0 | age.in.years | [-inf,26.0) | 190 | 0.190 | 110 | 80 | 0.421053 | 0.528844 | 0.057921 | 0.123935 | 26.0 | False
1 | age.in.years | [26.0,35.0) | 358 | 0.358 | 246 | 112 | 0.312849 | 0.060465 | 0.001324 | 0.123935 | 35.0 | False
2 | age.in.years | [35.0,37.0) | 79 | 0.079 | 67 | 12 | 0.151899 | -0.872488 | 0.048610 | 0.123935 | 37.0 | False
3 | age.in.years | [37.0,inf) | 373 | 0.373 | 277 | 96 | 0.257373 | -0.212371 | 0.016080 | 0.123935 | inf | False
sc.woebin_plot(bins["age.in.years"])

[Figure: binning plot for age.in.years]

sc.woebin_plot(bins["credit.amount"])

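[Figure: binning plot for credit.amount]

The plots show that the bad probability (badprob) of age.in.years and credit.amount is not monotonic across bins, so their breaks need to be adjusted next.

Adjusting the bins

  • scorecardpy supports both custom breaks and automatic adjustment.
  • Adjusting by hand, guided by business knowledge and practical experience, tends to work better.
# Automatic adjustment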
# break_adj = sc.woebin_adj(dt_s,y="creditability",bins=bins)
bins["credit.amount"]
 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
0 | credit.amount | [-inf,1400.0) | 267 | 0.267 | 185 | 82 | 0.307116 | 0.033661 | 0.000305 | 0.171431 | 1400.0 | False
1 | credit.amount | [1400.0,1800.0) | 105 | 0.105 | 87 | 18 | 0.171429 | -0.728239 | 0.046815 | 0.171431 | 1800.0 | False
2 | credit.amount | [1800.0,2000.0) | 60 | 0.060 | 39 | 21 | 0.350000 | 0.228259 | 0.003261 | 0.171431 | 2000.0 | False
3 | credit.amount | [2000.0,4000.0) | 322 | 0.322 | 248 | 74 | 0.229814 | -0.362066 | 0.038965 | 0.171431 | 4000.0 | False
4 | credit.amount | [4000.0,inf) | 246 | 0.246 | 141 | 105 | 0.426829 | 0.552498 | 0.082085 | 0.171431 | inf | False

Splitting age into [-inf,26.0), [26.0,35.0), [35.0,40.0), [40.0,inf) roughly satisfies monotonicity. Splitting the credit amount into [-inf,1400.0), [1400.0,1900.0), [1900.0,4000.0), [4000.0,inf) roughly satisfies it as well.
# Manual breaks
break_adj = {
    'age.in.years':[26,35,40],
    'credit.amount':[1400,1900,4000]
}
bins_adj = sc.woebin(dt_s,y="creditability",breaks_list=break_adj)
bins_adj_df = pd.concat(bins_adj).reset_index().drop(columns="level_0")
bins_adj_df[bins_adj_df.variable.isin(["age.in.years",'credit.amount'])]
 | level_1 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
0 | 0 | credit.amount | [-inf,1400.0) | 267 | 0.267 | 185 | 82 | 0.307116 | 0.033661 | 0.000305 | 0.141144 | 1400.0 | False
1 | 1 | credit.amount | [1400.0,1900.0) | 131 | 0.131 | 104 | 27 | 0.206107 | -0.501256 | 0.029359 | 0.141144 | 1900.0 | False
2 | 2 | credit.amount | [1900.0,4000.0) | 356 | 0.356 | 270 | 86 | 0.241573 | -0.296777 | 0.029395 | 0.141144 | 4000.0 | False
3 | 3 | credit.amount | [4000.0,inf) | 246 | 0.246 | 141 | 105 | 0.426829 | 0.552498 | 0.082085 | 0.141144 | inf | False
4 | 0 | age.in.years | [-inf,26.0) | 190 | 0.190 | 110 | 80 | 0.421053 | 0.528844 | 0.057921 | 0.112742 | 26.0 | False
5 | 1 | age.in.years | [26.0,35.0) | 358 | 0.358 | 246 | 112 | 0.312849 | 0.060465 | 0.001324 | 0.112742 | 35.0 | False
6 | 2 | age.in.years | [35.0,40.0) | 153 | 0.153 | 123 | 30 | 0.196078 | -0.563689 | 0.042679 | 0.112742 | 40.0 | False
7 | 3 | age.in.years | [40.0,inf) | 299 | 0.299 | 221 | 78 | 0.260870 | -0.194156 | 0.010817 | 0.112742 | inf | False
sc.woebin_plot(bins_adj["age.in.years"])

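[Figure: binning plot for age.in.years after manual adjustment]

sc.woebin_plot(bins_adj['credit.amount'])

[Figure: binning plot for credit.amount after manual adjustment]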

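4. WOE Transformation

Replace every raw value with the WOE of the bin it falls into. The transformation is optional, but after it:

  • values within a variable can be compared with each other
  • different variables can be compared with each other
  • all variables live on the same scale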
train_woe = sc.woebin_ply(train,bins_adj)
test_woe = sc.woebin_ply(test,bins_adj)
train_woe.sample(5)
(shown transposed: one line per column of train_woe, one column per sampled row)
column | 723 | 331 | 690 | 537 | 0
creditability | 0 | 1 | 0 | 0 | 0
credit.amount_woe | 0.033661 | -0.501256 | 0.033661 | -0.296777 | 0.033661
age.in.years_woe | -0.194156 | 0.060465 | 0.528844 | -0.563689 | -0.194156
housing_woe | -0.194156 | -0.194156 | -0.194156 | -0.194156 | -0.194156
property_woe | -0.461035 | -0.461035 | 0.028573 | 0.028573 | -0.461035
duration.in.month_woe | -0.346625 | 0.108688 | -0.346625 | 0.108688 | -1.312186
status.of.existing.checking.account_woe | 0.614204 | -1.176263 | 0.614204 | 0.614204 | 0.614204
installment.rate.in.percentage.of.disposable.income_woe | 0.103961 | 0.103961 | -0.155466 | 0.103961 | 0.103961
savings.account.and.bonds_woe | -0.762140 | 0.139552 | 0.271358 | 0.271358 | -0.762140
present.employment.since_woe | 0.032103 | 0.032103 | 0.032103 | -0.235566 | -0.235566
personal.status.and.sex_woe | 0.264693 | 0.264693 | 0.264693 | 0.264693 | -0.165548
credit.history_woe | 0.088319 | -0.733741 | -0.733741 | -0.733741 | -0.733741
purpose_woe | -0.410063 | 0.279920 | 0.279920 | 0.279920 | -0.410063
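
Conceptually, the transformation is a lookup from a raw value to the WOE of its bin. The sketch below shows that lookup for age.in.years only, using the manually adjusted breaks and the woe values from the binning table; age_to_woe is a hypothetical helper for illustration, not a scorecardpy function.

```python
from bisect import bisect_right

# Breaks and WOE values of age.in.years, copied from the adjusted binning table.
AGE_BREAKS = [26, 35, 40]
AGE_WOE = [0.528844, 0.060465, -0.563689, -0.194156]


def age_to_woe(age):
    """Map an age to the WOE of its left-closed, right-open bin."""
    return AGE_WOE[bisect_right(AGE_BREAKS, age)]


for age in (22, 26, 39, 40, 67):
    print(age, age_to_woe(age))
```

bisect_right places a value equal to a break into the bin on its right, which matches the [lower, upper) intervals shown in the table.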

5. Building the Model

The model is a logistic regression fitted on the WOE-transformed variables.

from sklearn.linear_model import LogisticRegression
y_train = train_woe.loc[:,"creditability"]
X_train = train_woe.loc[:,train_woe.columns!="creditability"]
y_test = test_woe.loc[:,"creditability"]
X_test = test_woe.loc[:,test_woe.columns!="creditability"]
lr = LogisticRegression(penalty='l1',C=0.9,solver='saga',n_jobs=-1)
lr.fit(X_train,y_train)
LogisticRegression(C=0.9, n_jobs=-1, penalty='l1', solver='saga')
lr.coef_
array([[0.77881419, 0.6892819 , 0.36660545, 0.37598509, 0.59990642,
        0.75916199, 1.68181704, 0.50153176, 0.23641609, 0.70438936,
        0.63125597, 0.99437898]])
lr.intercept_
array([-0.82463787])

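6. Model Evaluation

Logistic regression outputs the probability that a sample belongs to label 1.
A prediction of 0.6 means the sample has probability 0.6 of being label 1. How high must the probability be before the sample is assigned label 1? That takes a threshold, which can be chosen with the KS statistic: samples above the threshold are labeled 1, and samples below it are labeled 0.

TPR and FPR:
$$TPR=\frac{\text{number of samples predicted 1 whose true label is 1}}{\text{number of samples whose true label is 1}}$$
$$FPR=\frac{\text{number of samples predicted 1 whose true label is 0}}{\text{number of samples whose true label is 0}}$$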
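Steps to draw the ROC curve:

  1. Deduplicate and sort the predicted y_score values to get a series of thresholds.
  2. Use each y_score value as the threshold, count the samples, and compute TPR and FPR.
  3. With that set of points, plot FPR on the x-axis and TPR on the y-axis.

AUC:

  • The area enclosed between the ROC curve and the x-axis.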
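
The ROC steps and the AUC can be sketched without any library. The functions roc_points and auc below are illustrative names, the labels and scores are made up, and the area is approximated with the trapezoidal rule; sc.perf_eva performs the real evaluation in this article.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Steps 1 and 2: every distinct score serves as a threshold.
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]                # made-up labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # made-up predicted probabilities
print(auc(roc_points(y_true, y_score)))
```

In this toy sample 8 of the 9 (positive, negative) pairs are ranked correctly, so the area equals 8/9.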

KS curve: used to find the best threshold
$$KS=\max(TPR-FPR)$$

  • The x-axis holds the thresholds (their position in the sorted list works too); TPR and FPR are plotted on the same axes.
train_pred = lr.predict_proba(X_train)[:,1]
test_pred =  lr.predict_proba(X_test)[:,1]
train_perf = sc.perf_eva(y_train,train_pred,title="train")

[Figure: sc.perf_eva performance plots for the training set]

test_perf = sc.perf_eva(y_test,test_pred,title="test")

[Figure: sc.perf_eva performance plots for the test set]
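
The KS statistic can also be computed by hand in a single pass over the samples sorted by score. The sketch below uses made-up labels and scores; ks_statistic is a hypothetical helper, not part of scorecardpy.

```python
def ks_statistic(y_true, y_score):
    """Return (KS, threshold), where KS = max(TPR - FPR) over all thresholds."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    pairs = sorted(zip(y_score, y_true), reverse=True)

    tp = fp = 0
    best_ks, best_threshold = 0.0, None
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Evaluate only after all samples sharing this score have been counted.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        gap = tp / positives - fp / negatives
        if gap > best_ks:
            best_ks, best_threshold = gap, score
    return best_ks, best_threshold


y_true = [1, 1, 1, 0, 0, 1, 0, 0]                    # made-up labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]   # made-up probabilities
print(ks_statistic(y_true, y_score))
```

The returned threshold is the score at which the gap between TPR and FPR is widest, which is the cut-off the KS criterion suggests.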

7. Score Stability

PSI (Population Stability Index)

  • Compares the actual distribution (A) obtained on the training data with the expected distribution (E) obtained on the test set:
    $$PSI=\sum_{i=1}^n(A_i-E_i)\cdot\ln\left(\frac{A_i}{E_i}\right)$$
    A_i: share of the actual distribution that falls into the i-th bin.
    E_i: share of the expected distribution that falls into the i-th bin.

The smaller the PSI, the more stable the model. A PSI below 0.1 is usually taken to mean good stability.
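
As a rough illustration of the PSI formula, the sketch below bins two score samples on shared breaks and sums the per-bin terms. The function name psi, the scores, and the break are all made up; skipping empty bins is a simplification of this sketch, not necessarily what sc.perf_psi does.

```python
import math


def psi(actual, expected, breaks):
    """PSI between two score samples, binned on the same ascending breaks."""

    def shares(scores):
        counts = [0] * (len(breaks) + 1)
        for s in scores:
            # Bin index = number of breaks that are <= the score.
            counts[sum(s >= b for b in breaks)] += 1
        return [c / len(scores) for c in counts]

    total = 0.0
    for a, e in zip(shares(actual), shares(expected)):
        if a > 0 and e > 0:  # skip empty bins so the logarithm stays defined
            total += (a - e) * math.log(a / e)
    return total


train_scores = [400, 400, 600, 600]  # made-up scores: shares 0.50 / 0.50
test_scores = [400, 600, 600, 600]   # made-up scores: shares 0.25 / 0.75
print(round(psi(train_scores, test_scores, breaks=[500]), 6))
```

Identical distributions give a PSI of exactly 0, and the value grows as the two distributions drift apart.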

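# card is the scorecard explained in the score mapping section; it is built here because the scores are needed first
card = sc.scorecard(bins_adj, lr, X_train.columns)
train_score = sc.scorecard_ply(train, card, print_step=0)
test_score = sc.scorecard_ply(test, card, print_step=0)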

sc.perf_psi(
    score = {'train':train_score,'test':test_score},
    label = {'train':y_train,'test':y_test}
)

[Figure: sc.perf_psi output comparing the train and test score distributions]

Score mapping

Reference: https://github.com/xsj0609/data_science/tree/master/ScoreCard

Logistic regression result:
$$f(x)=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n$$
Score formula:
$$Score=A-B\cdot\ln\left(\frac{p}{1-p}\right)$$
where p is the customer's probability of default.
Two conditions must be fixed before scores can be computed:
  1. The score P_0 assigned at a given odds \theta_0 (bad to good). scorecardpy defaults to \theta_0=\frac{1}{19}, P_0=600.
  2. The number of points PDO by which the score changes when the odds double. scorecardpy defaults to PDO=50.
From these it follows that:
$$B=\frac{PDO}{\ln(2)},\quad A=P_0+B\cdot\ln(\theta_0),\quad \ln\left(\frac{p}{1-p}\right)=f(x)$$
Example:

Computing the base points:

import math
B = 50/math.log(2)
A = 600+B*math.log(1/19)
basepoints=A-B*lr.intercept_[0]
print("A:",A,"B:",B,"basepoints:",basepoints)
A: 387.6036243278207 B: 72.13475204444818 basepoints: 447.0886723193208
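
As a quick sanity check on A and B, the sketch below converts a default probability into a score and confirms the two calibration conditions: odds of 1/19 map to 600 points, and doubling the odds lowers the score by 50. The function score_from_probability is a hypothetical helper written for this check.

```python
import math

P0, THETA0, PDO = 600, 1 / 19, 50  # scorecardpy defaults quoted in this article

B = PDO / math.log(2)
A = P0 + B * math.log(THETA0)


def score_from_probability(p):
    """Score = A - B * ln(p / (1 - p)), with p the predicted default probability."""
    return A - B * math.log(p / (1 - p))


p_ref = THETA0 / (1 + THETA0)             # odds 1/19 correspond to p = 0.05
p_double = 2 * THETA0 / (1 + 2 * THETA0)  # odds doubled to 2/19

print(round(score_from_probability(p_ref), 2))
print(round(score_from_probability(p_double), 2))
```

Higher default probabilities therefore map to lower scores, with every doubling of the odds costing PDO points.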

How the points for credit.amount are computed

bins_adj_df[bins_adj_df["variable"]=="credit.amount"]
 | level_1 | variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values
0 | 0 | credit.amount | [-inf,1400.0) | 267 | 0.267 | 185 | 82 | 0.307116 | 0.033661 | 0.000305 | 0.141144 | 1400.0 | False
1 | 1 | credit.amount | [1400.0,1900.0) | 131 | 0.131 | 104 | 27 | 0.206107 | -0.501256 | 0.029359 | 0.141144 | 1900.0 | False
2 | 2 | credit.amount | [1900.0,4000.0) | 356 | 0.356 | 270 | 86 | 0.241573 | -0.296777 | 0.029395 | 0.141144 | 4000.0 | False
3 | 3 | credit.amount | [4000.0,inf) | 246 | 0.246 | 141 | 105 | 0.426829 | 0.552498 | 0.082085 | 0.141144 | inf | False
lr.coef_
array([[0.77881419, 0.6892819 , 0.36660545, 0.37598509, 0.59990642,
        0.75916199, 1.68181704, 0.50153176, 0.23641609, 0.70438936,
        0.63125597, 0.99437898]])
lr.intercept_
array([-0.82463787])
# Points for the bin [-inf,1400.0); credit.amount is the first column, so its coefficient is 0.77881419
-B*0.77881419*0.033661
-1.8910604547516296
# [1400.0,1900.0)
-B*0.77881419*(-0.501256)
28.160345780190216
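
The same calculation extends to all four credit.amount bins. The sketch below applies points = -B * coefficient * WOE to each bin and rounds to whole points; the coefficient and WOE values are copied from the output shown in this article, and the rounding step is an assumption made to compare against the integer points of the final card.

```python
import math

B = 50 / math.log(2)
COEF = 0.77881419  # coefficient of credit.amount_woe, first entry of lr.coef_

# WOE per credit.amount bin, from the adjusted binning table.
woe_by_bin = {
    "[-inf,1400.0)": 0.033661,
    "[1400.0,1900.0)": -0.501256,
    "[1900.0,4000.0)": -0.296777,
    "[4000.0,inf)": 0.552498,
}

# points = -B * coefficient * WOE, rounded to the nearest integer
points = {b: round(-B * COEF * w) for b, w in woe_by_bin.items()}
print(points)
```

Bins with negative WOE (fewer bad customers than average) earn positive points, and bins with positive WOE lose points.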

Computing the points for all bins:

card = sc.scorecard(bins_adj,lr,X_train.columns)
card_df = pd.concat(card)
card_df
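(outer index) | (index) | variable | bin | points
basepoints | 0 | basepoints | NaN | 447.0
credit.amount | 0 | credit.amount | [-inf,1400.0) | -2.0
credit.amount | 1 | credit.amount | [1400.0,1900.0) | 28.0
credit.amount | 2 | credit.amount | [1900.0,4000.0) | 17.0
credit.amount | 3 | credit.amount | [4000.0,inf) | -31.0
age.in.years | 4 | age.in.years | [-inf,26.0) | -26.0
age.in.years | 5 | age.in.years | [26.0,35.0) | -3.0
age.in.years | 6 | age.in.years | [35.0,40.0) | 28.0
age.in.years | 7 | age.in.years | [40.0,inf) | 10.0
housing | 8 | housing | own | 5.0
housing | 9 | housing | rent | -11.0
housing | 10 | housing | for free | -12.0
property | 11 | property | real estate | 13.0
property | 12 | property | building society savings agreement/ life insur... | -1.0
property | 13 | property | car or other, not in attribute Savings account... | -1.0
property | 14 | property | unknown / no property | -16.0
duration.in.month | 15 | duration.in.month | [-inf,8.0) | 57.0
duration.in.month | 16 | duration.in.month | [8.0,16.0) | 15.0
duration.in.month | 17 | duration.in.month | [16.0,34.0) | -5.0
duration.in.month | 18 | duration.in.month | [34.0,44.0) | -23.0
duration.in.month | 19 | duration.in.month | [44.0,inf) | -49.0
status.of.existing.checking.account | 20 | status.of.existing.checking.account | no checking account | 64.0
status.of.existing.checking.account | 21 | status.of.existing.checking.account | ... >= 200 DM / salary assignments for at leas... | 22.0
status.of.existing.checking.account | 22 | status.of.existing.checking.account | 0 <= ... < 200 DM%,%... < 0 DM | -34.0
installment.rate.in.percentage.of.disposable.income | 23 | installment.rate.in.percentage.of.disposable.i... | [-inf,2.0) | 30.0
installment.rate.in.percentage.of.disposable.income | 24 | installment.rate.in.percentage.of.disposable.i... | [2.0,3.0) | 19.0
installment.rate.in.percentage.of.disposable.income | 25 | installment.rate.in.percentage.of.disposable.i... | [3.0,inf) | -13.0
savings.account.and.bonds | 26 | savings.account.and.bonds | ... >= 1000 DM%,%500 <= ... < 1000 DM%,%unknow... | 28.0
savings.account.and.bonds | 27 | savings.account.and.bonds | 100 <= ... < 500 DM | -5.0
savings.account.and.bonds | 28 | savings.account.and.bonds | ... < 100 DM | -10.0
present.employment.since | 29 | present.employment.since | 4 <= ... < 7 years | 7.0
present.employment.since | 30 | present.employment.since | ... >= 7 years | 4.0
present.employment.since | 31 | present.employment.since | 1 <= ... < 4 years | -1.0
present.employment.since | 32 | present.employment.since | unemployed%,%... < 1 year | -7.0
personal.status.and.sex | 33 | personal.status.and.sex | male : single | 8.0
personal.status.and.sex | 34 | personal.status.and.sex | male : married/widowed | 7.0
personal.status.and.sex | 35 | personal.status.and.sex | female : divorced/separated/married | -13.0
credit.history | 36 | credit.history | critical account/ other credits existing (not ... | 33.0
credit.history | 37 | credit.history | delay in paying off in the past | -4.0
credit.history | 38 | credit.history | existing credits paid back duly till now | -4.0
credit.history | 39 | credit.history | all credits at this bank paid back duly%,%no c... | -56.0
purpose | 40 | purpose | retraining%,%car (used) | 58.0
purpose | 41 | purpose | radio/television | 29.0
purpose | 42 | purpose | furniture/equipment%,%domestic appliances%,%bu... | -20.0

With the points of every bin of every variable computed, a customer's score is obtained by mapping each of the customer's values to its bin and adding up the points, base points included.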
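
Scoring a customer then amounts to a table lookup and a sum. The sketch below uses only two of the twelve variables of the card, so the result is a partial score for illustration; the applicant is hypothetical and partial_score is not a scorecardpy function (sc.scorecard_ply does the full job).

```python
from bisect import bisect_right

BASEPOINTS = 447

# Partial card: breaks and points of two variables, copied from the card table.
CARD = {
    "credit.amount": ([1400, 1900, 4000], [-2, 28, 17, -31]),
    "age.in.years": ([26, 35, 40], [-26, -3, 28, 10]),
}


def partial_score(customer):
    """Base points plus the points of the bin each value falls into."""
    total = BASEPOINTS
    for variable, (breaks, points) in CARD.items():
        total += points[bisect_right(breaks, customer[variable])]
    return total


customer = {"credit.amount": 1500, "age.in.years": 30}  # hypothetical applicant
print(partial_score(customer))
```

Here the applicant lands in [1400.0,1900.0) for credit.amount (+28) and in [26.0,35.0) for age.in.years (-3), on top of the 447 base points.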

That completes the scorecard model.

Source code

Link: https://pan.baidu.com/s/1DAI1hxWPHEb6-46erjDaKg?pwd=e4sw
Extraction code: e4sw
