说明
1.数据集来源为kaggle,因为是一个demo,且需要验证结果,所以仅取其训练集;
2.本项目主要为了说明构建一张简单的评分卡(A卡,也就是贷前)的过程,所以省略了EDA的部分,仅仅做一个简单的流程说明。
3.数据集和代码已经上传,点击下载下载链接
导入相关包
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV,StratifiedKFold
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix,roc_auc_score,roc_curve
import warnings
import plotly.express as px
import sklearn.metrics as metrics
warnings.filterwarnings('ignore')
plt.rcParams['font.family']='Microsoft YaHei' #显示中文标签
plt.style.use ('ggplot') #设定绘图风格
import seaborn as sns
pd.set_option('max_columns',1000)
pd.set_option('max_row',300)
pd.set_option('display.float_format', lambda x:' %.5f' % x)
数据集
导入数据
df_data = pd.read_csv('cs-training.csv',encoding='GB18030',index_col=0)
df_data.shape
缺失值处理
查看数据缺失值
null_val_sums = df_data.isnull().sum()
pd.DataFrame({"特征名称": null_val_sums.index, "缺失值数目": null_val_sums.values,
"缺失值占比": null_val_sums.values / len(df_data) })
缺失值填补
填补方法很多,比如给一个极端值-9999,或者分箱单独分为一箱,平均值填充,中位数填充等等,此处我们用中位数做简单填补
for i in null_val_sums.index:
df_data[i] = df_data[i].fillna(df_data[i].median())
null_val_sums = df_data.isnull().sum()
pd.DataFrame({"特征名称": null_val_sums.index, "缺失值数目": null_val_sums.values,
"缺失值占比": null_val_sums.values / len(df_data) })
切分数据集
分为两部分,df_data1为建卡数据,参与模型的训练和测试,df_data2为验证数据,不参与任何过程。
df_data1, df_data2 = train_test_split(df_data, test_size=0.2, random_state=42,stratify=df_data['SeriousDlqin2yrs'])
df_data3 = df_data1.copy()
分箱
分箱有单独的分箱策略,比如有可能根据业务相关进行分箱,或者采取卡方分箱等等,此处采用qcut进行简单的等频分箱
'''
如果不是二分类,就将其等频分箱为4箱,如果有异常,则加入列表;
qcut按照等频方式分箱,且要求分位点处的取值唯一。
'''
for i in null_val_sums.index:
pro_list = []
if len(set(list(df_data1[i]))) > 2:
try:
df_data1[i] = pd.qcut(df_data1[i],4)
except:
pro_list.append(i)
print(pro_list)
qcut当有多个元素有相同的分位点处取值时,就会报错,添加.rank(method=‘first’)后还有问题,不想改了。。。用一次手动分箱
set(list(df_data1['NumberOfDependents']))
bins = [-math.inf ,2, 10, math.inf ]
df_data1['NumberOfDependents'] = pd.cut(df_data1['NumberOfDependents'],bins)
计算IV值
1 WOE
WOE的全称是“Weight of Evidence”,即证据权重。WOE是对原始自变量的一种编码形式。
要对一个变量进行WOE编码,需要首先把这个变量进行分组处理,对于第i组,WOE的计算公式如下:
W
O
E
i
=
l
n
(
p
y
i
p
n
i
)
=
l
n
(
#
y
i
#
y
T
#
n
i
#
n
T
)
WOE_i = ln(\frac{py_i}{pn_i}) = ln(\frac{\frac{\#y_i}{\#y_T}}{\frac{\#n_i}{\#n_T}})
WOEi=ln(pnipyi)=ln(#nT#ni#yT#yi)
p
y
i
py_i
pyi是这个组中违约客户占所有样本中所有违约客户的比例;
p n i pn_i pni是这个组中未违约客户占样本中所有未违约客户的比例;
# y i \#y_i #yi是这个组中违约客户的数量;
# n i \#n_i #ni是这个组中未违约客户的数量;
# y T \#y_T #yT是样本中所有违约客户的数量;
# n T \#n_T #nT是样本中所有未违约客户的数量。
WOE表示的实际上是“当前分组中违约客户占所有违约客户的比例”和“当前分组中没有违约的客户占所有没有违约的客户的比例”的差异。
W
O
E
i
=
l
n
(
p
y
i
p
n
i
)
=
l
n
(
#
y
i
#
y
T
#
n
i
#
n
T
)
=
l
n
(
#
y
i
#
n
i
#
Y
T
#
n
T
)
WOE_i = ln(\frac{py_i}{pn_i}) = ln(\frac{\frac{\#y_i}{\#y_T}}{\frac{\#n_i}{\#n_T}}) = ln(\frac{\frac{\#y_i}{\#n_i}}{\frac{\#Y_T}{\#n_T}})
WOEi=ln(pnipyi)=ln(#nT#ni#yT#yi)=ln(#nT#YT#ni#yi)
变换后,WOE表示的是当前这个组中坏客户和好客户的比值a,和所有样本中这个比值b的差异。这个差异是用这两个比值的比值,再取对数来表示的。WOE越大,这种差异越大,这个分组里的样本违约的可能性就越大,WOE越小,差异越小,这个分组里的样本违约的可能性就越小。
2 IV的计算公式
IV值代表该变量包含的信息量的大小,即该特征对模型的影响程度;对于分组i,也会有一个对应的IV值,计算公式如下:
I
V
i
=
(
p
y
i
−
p
n
i
)
∗
W
O
E
i
=
(
p
y
i
−
p
n
i
)
∗
l
n
(
p
y
i
p
n
i
)
=
(
#
y
i
#
y
T
−
#
n
i
#
n
T
)
∗
l
n
(
#
y
i
#
y
T
#
n
i
#
n
T
)
IV_i = (py_i-pn_i)*WOE_i=(py_i-pn_i)*ln(\frac{py_i}{pn_i})=(\frac{\#y_i}{\#y_T}-\frac{\#n_i}{\#n_T})*ln(\frac{\frac{\#y_i}{\#y_T}}{\frac{\#n_i}{\#n_T}})
IVi=(pyi−pni)∗WOEi=(pyi−pni)∗ln(pnipyi)=(#yT#yi−#nT#ni)∗ln(#nT#ni#yT#yi)
计算整个变量的IV值,将把各分组的IV相加:
I
V
=
∑
i
n
I
V
i
IV=\sum_i^nIV_i
IV=i∑nIVi
n为变量分组个数。
def cal_IV(df, feature, target):
lst = []
cols=['Variable', 'Value', 'All', 'Bad']
for i in range(df[feature].nunique()):
val = list(df[feature].unique())[i]
lst.append([feature, val, df[df[feature] == val].count()[feature], df[(df[feature] == val) & (df[target] == 1)].count()[feature]])
data = pd.DataFrame(lst, columns=cols)
data = data[data['Bad'] > 0]
data['Share'] = data['All'] / data['All'].sum()
data['Bad Rate'] = data['Bad'] / data['All']
data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
data['WoE'] = np.log(data['Distribution Bad'] / data['Distribution Good'])
data['IV'] = (data['WoE'] * (data['Distribution Bad'] - data['Distribution Good'])).sum()
data = data.sort_values(by=['Variable', 'Value'], ascending=True)
return data['IV'].values[0]
我们把IV小于0.02的看作无效特征;IV大于0.02,小于0.1的为弱效果特征;IV大于0.1,小于0.3中等效果特征;IV大于0.3,小于0.5为强特征,大于0.5为特强特征;
因为特征不多,此处我们取所以iv值>0.02的特征
'''
循环计算每列数据的iv值,并进行如下操作方便后续使用:
1.将iv值>0.02的列名称存入列表,并输出
2.按照iv值降序排列,并输出一个dataframe,方便保存使用
'''
feature_list = []
iv_list = []
cols = [c for c in df_data1.columns.values]
for f in cols:
iv = cal_IV(df_data1,f,'SeriousDlqin2yrs')
if 1000>iv>=0.02:
iv_list.append(iv)
feature_list.append(f)
print(feature_list)
combine_data ={'特征名称':feature_list,'IV值':iv_list}
fea_iv_data = pd.DataFrame(combine_data)
fea_iv_data = fea_iv_data.sort_values(by="IV值" , ascending=False)
fea_iv_data
WOE编码
def cal_WOE(df,features,target):
df_new = df
for f in features:
df_woe = df_new.groupby(f).agg({target:['sum','count']})
df_woe.columns = list(map(''.join, df_woe.columns.values))
df_woe = df_woe.reset_index()
df_woe = df_woe.rename(columns = {target+'sum':'bad'})
df_woe = df_woe.rename(columns = {target+'count':'all'})
df_woe['good'] = df_woe['all']-df_woe['bad']
df_woe = df_woe[[f,'good','bad']]
df_woe['bad_rate'] = df_woe['bad'].mask(df_woe['bad']==0, 1)/df_woe['bad'].sum() # mask 0 to 1 to avoid log(0)
df_woe['good_rate'] = df_woe['good']/df_woe['good'].sum()
df_woe['woe'] = np.log(df_woe['bad_rate'].divide(df_woe['good_rate'],fill_value=1))
df_woe.columns = [c if c==f else c+'_'+f for c in list(df_woe.columns.values)]
df_new = df_new.merge(df_woe,on=f,how='left')
return df_new
feature_cols=feature_list
df_woe = cal_WOE(df_data1,feature_cols,'SeriousDlqin2yrs')
woe_cols = [c for c in list(df_woe.columns.values) if 'woe' in c]
df_woe[woe_cols]
相关性检验
两个特征如果相关系>0.5,保留IV值大的特征即可
df_woe[woe_cols].corr()
其实还可以采用一些其他特征筛选策略进行组合筛选,因为这个数据集特征本身就少,就不做进一步筛选了
输出每个分箱的woe编码值
df_bin_to_woe = pd.DataFrame(columns = ['features','bin','woe'])
for f in feature_cols:
b = f
w = 'woe_'+f
df = df_woe[[w,b]].drop_duplicates()
df.columns = ['woe','bin']
df['features'] = f
df=df[['features','bin','woe']]
df_bin_to_woe = pd.concat([df_bin_to_woe,df])
df_bin_to_woe
模型建立
数据集切分
X_train, X_test, y_train, y_test = train_test_split(df_woe[woe_cols], df_woe['SeriousDlqin2yrs'], test_size=0.2, random_state=42,stratify=df_woe['SeriousDlqin2yrs'])
print('坏样本占比为: ',y_train.mean())
建立逻辑回归模型
model = LogisticRegression(C=1,penalty='l2',random_state=42,class_weight ='balanced').fit(X_train,y_train)
model.score(X_test,y_test)
模型评价标准
1.AUC:AUC为ROC曲线的面积,取值为[0,1\],越接近1证明模型预测效果越好。ROC曲线的纵坐标轴为TPR;横坐标轴为FPR。一般AUC>0.8证明效果还可以,可以考虑上线。
2.KS值:衡量的是好坏样本累计分布之间的差值。好坏样本累计差异越大,KS指标越大,那么模型的风险区分能力越强,当KS值>0.4,且训练集和测试集相差不超过5%,正负样本区分度还不错。
probs = model.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)
ks=abs(fpr-tpr).max()
print('ks_test:',ks)
probs_train = model.predict_proba(X_train)
preds_train = probs_train[:,1]
fpr_train, tpr_train,_= metrics.roc_curve(y_train, preds_train)
ks_train=abs(fpr_train-tpr_train).max()
print('ks_train:',ks_train)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
可以看出,AUC=0.78效果不太好,一般可以通过特征交叉、新增特征,调整分箱等等方法去达到更好的效果。因为是示例,我们就继续向下走,暂时不做调整
model.coef_
y_pred = model.predict(X_test)
metrics.confusion_matrix(y_test,y_pred)
基于Logistic回归的评分卡转换
评分卡转换原理
客户违约概率为p,正常概率为1-p,违约概率与正常概率的比值,称为Odds,即:
O
d
d
s
=
p
1
−
p
−
−
−
−
(
1
)
Odds=\frac{p}{1-p}----(1)
Odds=1−pp−−−−(1)
逻辑回归公式为:
p
=
1
1
+
e
−
θ
T
x
−
−
−
−
(
2
)
p=\frac{1}{1+e^-\theta^Tx}---- (2)
p=1+e−θTx1−−−−(2)
同时取ln
l
n
(
p
1
−
p
)
=
θ
T
x
−
−
−
−
(
3
)
ln(\frac{p}{1-p})=\theta^Tx ---- (3)
ln(1−pp)=θTx−−−−(3)
将公式(1)代入(3),即:
l
n
(
O
d
d
s
)
=
θ
T
x
−
−
−
−
(
4
)
ln(Odds)=\theta^Tx----(4)
ln(Odds)=θTx−−−−(4)
评分卡的背后逻辑是Odds的变动与评分变动的映射(把Odds映射为评分),即通过下述公式进行映射:
S
c
o
r
e
=
A
−
B
∗
l
n
(
O
d
d
s
)
−
−
−
−
(
5
)
Score=A-B*ln(Odds)----(5)
Score=A−B∗ln(Odds)−−−−(5)
其中A与B是常数,B前面取负号的原因,是让违约概率越低,得分越高。因为实际业务里,分数也高风险越低,风控里默认高分高信用低风险
计算出A、B的方法如下:
- 设定基准分。基准分为 A − B ∗ θ 0 A-B*\theta_0 A−B∗θ0,也就是 p 0 p_0 p0,当odds无变动时, θ 0 \theta_0 θ0=1,A= p 0 p_0 p0,此处我们设定基础分为650;
- PDO(point of double),比率翻番时分数的变动值。此处设置为当odds翻倍时,分值减少50。
由上述可得:
P 0 = A − B ∗ l n ( θ 0 ) P_0=A-B*ln(\theta_0) P0=A−B∗ln(θ0)
根据PDO的定义,有下面等式:
P 0 − P D O = A − B ∗ l n ( 2 θ 0 ) P_0-PDO=A-B*ln(2\theta_0) P0−PDO=A−B∗ln(2θ0)
上述两个等式联立求解
可得:
B
=
P
D
O
l
n
2
B=\frac{PDO}{ln2}
B=ln2PDO
A
=
P
0
+
B
∗
l
n
(
θ
0
)
A=P_0+B*ln(\theta_0)
A=P0+B∗ln(θ0)
评分卡里每一个变量x的每一个分箱有一个对应分值,即将每个特征的重要性权值与特征分箱的WOE编码进行结合计算,其计算方法如下:
S
c
o
r
e
=
A
−
B
{
θ
i
ω
i
j
}
Score=A-B\{\theta_i\omega_{ij}\}
Score=A−B{θiωij}
其中,
θ
i
\theta_i
θi表示第i个特征的重要性得分,
ω
i
j
\omega_{ij}
ωij表示第i个特征的第j个分箱的WOE编码的分数。
A = 650
B =72.13
def generate_scorecard(model_coef,binning_df,features,B):
lst = []
cols = ['Variable','Binning','Score']
coef = model_coef[0]
for i in range(len(features)):
f = features[i]
df = binning_df[binning_df['features']==f]
for index,row in df.iterrows():
lst.append([f,row['bin'],int(round(-coef[i]*row['woe']*B))])
data = pd.DataFrame(lst, columns=cols)
return data
评分卡说明
一般来说,第一版评分卡出来之后就要看每个特征分箱对应的得分业务上是否能够解释,解释性差的话,可以考虑重新分箱,因为比较麻烦,这里不再做调整了
score_card = generate_scorecard(model.coef_,df_bin_to_woe,feature_cols,B)
sort_scorecard = score_card.groupby('Variable').apply(lambda x: x.sort_values('Score', ascending=False))
sort_scorecard
按照上述评分卡给建卡数据、验证数据打分
得分映射函数
因为只有五个特征,这里采用比较笨蛋的写法
def no1(a):
a = float(a)
if a <=0.0296:
return 89
elif a<=0.153:
return 82
elif a<=0.558:
return 21
else:
return -74
def no2(a):
if a <=41:
return -18
elif a<=52.0:
return -8
elif a<= 63.0:
return 8
else:
return 38
def no3(a):
if a <= 0.175:
return 4
elif a<=0.366:
return 8
elif a<=0.861:
return -10
else:
return -1
def no4(a):
if a <=3900.0:
return -12
elif a<=5400.0:
return 1
elif a<= 7400.0:
return 2
else:
return 13
def no5(a):
if a <=5.0:
return -2
elif a<=8.0:
return 2
elif a<=11.0:
return 1
else:
return 0
对建卡数据进行得分映射
df_data3['R_score'] = df_data3.apply(lambda x: no1(x.RevolvingUtilizationOfUnsecuredLines), axis = 1)
df_data3['age_score'] = df_data3.apply(lambda x: no2(x.age), axis = 1)
df_data3['D_score'] = df_data3.apply(lambda x: no3(x.DebtRatio), axis = 1)
df_data3['M_score'] = df_data3.apply(lambda x: no4(x.MonthlyIncome), axis = 1)
df_data3['N_score'] = df_data3.apply(lambda x: no5(x.NumberOfOpenCreditLinesAndLoans), axis = 1)
df_1 = df_data3.iloc[:,-5:-1]
df_2 = df_data3.iloc[:,-1]
df_s1 = pd.concat([df_1,df_2], axis=1)
df_s1['总分'] = df_s1.sum(axis = 1)
df_s1 = pd.concat([df_s1,df_data3['SeriousDlqin2yrs']], axis=1)
df_s1.head()
print ('好样本得分区间',df_s1[df_s1['SeriousDlqin2yrs'] == 0][['总分']].min(),df_s1[df_s1['SeriousDlqin2yrs'] == 0][['总分']].max())
print ('坏样本得分区间',df_s1[df_s1['SeriousDlqin2yrs'] == 1][['总分']].min(),df_s1[df_s1['SeriousDlqin2yrs'] == 1][['总分']].max())
对验证数据进行得分映射
df_data2['R_score'] = df_data2.apply(lambda x: no1(x.RevolvingUtilizationOfUnsecuredLines), axis = 1)
df_data2['age_score'] = df_data2.apply(lambda x: no2(x.age), axis = 1)
df_data2['D_score'] = df_data2.apply(lambda x: no3(x.DebtRatio), axis = 1)
df_data2['M_score'] = df_data2.apply(lambda x: no4(x.MonthlyIncome), axis = 1)
df_data2['N_score'] = df_data2.apply(lambda x: no5(x.NumberOfOpenCreditLinesAndLoans), axis = 1)
评估
首先查看建卡数据的得分分布
比较好的分布图肯定是两个单峰,且好用户和坏用户交叉不多,目前这个由于上述各种偷懒操作且特征本身也不多,效果并不理想,但是可以继续向下走流程。
# 得分分布图
def plot_score_hist(df,target,score_col,plt_size=None,cutoff=None):
"""
df:数据集
target:目标变量的字段名
score_col:最终得分的字段名
plt_size:图纸尺寸
cutoff :划分拒绝/通过的点
return :好坏用户的得分分布图
"""
plt.figure(figsize=plt_size)
x1 = df[df[target]==1][score_col]
x2 = df[df[target]==0][score_col]
sns.kdeplot(x1,shade=True,label='坏用户',color='hotpink')
sns.kdeplot(x2,shade=True,label='好用户',color ='seagreen')
plt.axvline(x=cutoff)
plt.legend()
return plt.show()
plot_score_hist(df_s1,'SeriousDlqin2yrs','总分',plt_size=(9,9),cutoff=0)
搜索CUTOFF
对于评分卡来说,需要有一条线或者说是一个得分点来对数据进行切分此出先将其称为Cutoff,高于这条线,接受,低于则拒绝
下面通过寻找KS值最大的点进行一个粗筛,实际业务还会根据通过率、坏样本累计占比等因素进行调整。
def search_cutoff(df,col_score,target):
cutoff = list(range(-150, 151))
refuse_acc=[]
tpr_ls=[]
fpr_ls=[]
KS_ls=[]
for i in cutoff:
df['result'] = df.apply(lambda x:'拒绝' if x[col_score]<=i else '接受',axis=1)
TP = df[(df['result']=='拒绝')&(df[target]==1)].shape[0]
FN = df[(df['result']=='拒绝')&(df[target]==0)].shape[0]
bad = df[df[target]==1].shape[0]
good = df[df[target]==0].shape[0]
refuse = df[df['result']=='拒绝'].shape[0]
passed = df[df['result']==10].shape[0]
tpr = round(TP/bad,3)
fpr = round(FN/good,3)
KS = abs(tpr-fpr)
KS_ls.append(KS)
maxid_KS = KS_ls.index(max(KS_ls))
co4 = cutoff[maxid_KS]
print('最大KS值:{}'.format(max(KS_ls)))
print('KS最大的分数:{}'.format(co4))
return
search_cutoff(df_s1,'总分','SeriousDlqin2yrs')
最大KS值:0.45100000000000007
KS最大的分数:-21
# 设定cutoff点,衡量有效性
def rule_verify(df,col_score,target,cutoff):
"""
df:数据集
target:目标变量的字段名
col_score:最终得分的字段名
cutoff :划分拒绝/通过的点
return :混淆矩阵
"""
df['result'] = df.apply(lambda x:'拒绝' if x[col_score]<=cutoff else '接受',axis=1)
TP = df[(df['result']=='拒绝')&(df[target]==1)].shape[0]
FN = df[(df['result']=='拒绝')&(df[target]==0)].shape[0]
bad = df[df[target]==1].shape[0]
good = df[df[target]==0].shape[0]
refuse = df[df['result']=='拒绝'].shape[0]
passed = df[df['result']==10].shape[0]
acc = round(TP/refuse,3)
tpr = round(TP/bad,3)
fpr = round(FN/good,3)
pass_rate = round(refuse/df.shape[0],3)
matrix_df = pd.pivot_table(df,index='result',columns=target,aggfunc={col_score:pd.Series.count},values=col_score)
print('拒绝准确率:{}'.format(acc))
print('查全率:{}'.format(tpr))
print('误伤率:{}'.format(fpr))
print('规则拒绝率:{}'.format(pass_rate))
return matrix_df
rule_verify(df_s1,'总分','SeriousDlqin2yrs',-21)
可以看出,这张卡可以拒绝掉大部分的坏样本,但是也会有21.9%的好样本被误伤
再看下在验证数据上的结果
plot_score_hist(df_s2,'SeriousDlqin2yrs','总分',plt_size=(9,9),cutoff=-21)
rule_verify(df_s2,'总分','SeriousDlqin2yrs',-21)
可以看出,这张评分卡在验证数据上和建卡数据效果相差不大,大概75%的通过率,同时,也能够识别出大部分的坏样本。
由于数据集原因,这张卡做到这里就结束了,个人也属于相关初学者,如有错误疏漏,欢迎指出,一起学习探讨