Scorecard Model Case Study (GiveMeSomeCredit, Kaggle Data) (Personal Practice Version)

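Introduction to Scorecard Models

A scorecard model, also called a credit scorecard model, is one of the standard tools of financial risk control. It scores a customer's creditworthiness from the customer's attributes and behavioral data. The score is then used to decide whether to grant credit and at what limit and interest rate, which reduces the risk carried in financial transactions.
Scorecards fall into three types by business stage:
Pre-loan: application scorecard, known as the A card
During the loan: behavior scorecard, known as the B card
Post-loan: collection scorecard, known as the C card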

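Scorecard models are simple, intuitive, and easy to interpret, and they are widely used in personal credit assessment, loan approval, and risk control. They can also be adjusted and re-tuned as conditions change to improve accuracy and adaptability. A well-known example is Alipay's Zhima (Sesame) Credit score.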

1. Data Preparation

Data source: Kaggle. Download page: Give Me Some Credit | Kaggle

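| Variable Name | Explanation | Description | Type |
|---|---|---|---|
| SeriousDlqin2yrs | Label to predict: the borrower was 90 days past due or worse | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Balance already drawn on credit cards and personal credit lines (e.g. Huabei, Jiebei, Weilidai) as a share of the total credit limit. The higher the ratio, the heavier the overdraft and the more likely a default | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
| age | Age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times the borrower was 30-59 days past due, but no worse, in the last 2 years | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| DebtRatio | Monthly debt payments, alimony, and living costs divided by total income. The higher the ratio, the more likely a default | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans (installment loans such as car loans or mortgages) and lines of credit (e.g. credit cards, Huabei) | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | integer |
| NumberOfTimes90DaysLate | Number of times the borrower was 90 days or more past due | Number of times borrower has been 90 days or more past due. | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | Number of mortgage and real estate loans including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times the borrower was 60-89 days past due, but no worse, in the last 2 years | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
| NumberOfDependents | Number of dependents in the family other than the borrower (spouse, children, etc.) | Number of dependents in family excluding themselves (spouse, children etc.) | integer |
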
import pandas as pd

# Load the data
train=pd.read_csv('cs-training.csv', index_col=0)
print(train.shape)  # (150000, 11)
print(train.describe().T)

The output is as follows: [output screenshot: shape and summary statistics of train]

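2. Exploratory Data Analysis (EDA)

EDA is a fairly involved topic that I am still learning, and space here is limited, so this section only looks at a few variables.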

import matplotlib.pyplot as plt
import seaborn as sns

# Class balance of the label
sns.countplot(x="SeriousDlqin2yrs", data=train)
print("Default Rate: {}".format(train["SeriousDlqin2yrs"].sum() / len(train)))  # 0.06684

# Distribution of the number of dependents
print(train["NumberOfDependents"].describe())
print(train["NumberOfDependents"].value_counts().sort_index())

# Distribution of age
print(train["age"].describe())
sns.displot(train["age"])

# Correlation between variables
corr = train.corr()
plt.subplots(figsize=(12, 12))
sns.heatmap(corr, annot=True, vmax=1, square=True, cmap='Blues')
plt.show()

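The age feature contains a value of 0, which needs to be handled later.

Correlation coefficients can also be used to filter features on this dataset; a common threshold is 0.7.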
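
The correlation filter mentioned above can be sketched roughly as follows. This is a minimal illustration on toy data, not part of the original workflow: the helper name drop_highly_correlated and the toy columns a, b, c are made up for the example, and the rule "drop the later column of each highly correlated pair" is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.7):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data (hypothetical): b is almost a copy of a, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
filtered, dropped = drop_highly_correlated(toy, threshold=0.7)
print(dropped)
print(list(filtered.columns))
```

On the real data the label column SeriousDlqin2yrs would presumably be left out before calling such a helper, so that only features are compared with each other.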

3. Data Processing

1. Missing Value Handling

# Share of missing values per column
print(train.isnull().mean())

# Drop fully duplicated records
train.drop_duplicates(inplace=True)
print(train.shape)  # (149391, 11)

# Fill nulls with the column median
df_train = train.fillna(train.median())
print(df_train.isnull().sum())
df_train.info()

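The output is as follows: [output screenshot: missing-value shares and df_train.info()]

Nulls can be handled in many ways: fill with the median or the mean, delete the affected records when only a few values are missing, or predict the missing values with a model.

MonthlyIncome and NumberOfDependents contain some nulls. Both are filled with the median here.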
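
To make the trade-off between these options concrete, here is a small sketch on made-up numbers (the values are hypothetical, not taken from cs-training.csv). It shows why the median is a reasonable default for a skewed column such as income: one very large value moves the mean a lot but barely affects the median.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the two columns that contain nulls.
toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 100000.0, np.nan],
    "NumberOfDependents": [0.0, 1.0, np.nan, 2.0, 0.0],
})

missing_share = toy.isnull().mean()        # fraction of nulls per column
median_filled = toy.fillna(toy.median())   # median ignores the size of the 100000 outlier
mean_filled = toy.fillna(toy.mean())       # mean is pulled up by the outlier
dropped = toy.dropna()                     # keeps only rows without any null

print(missing_share)
print(median_filled["MonthlyIncome"].tolist())
print(mean_filled["MonthlyIncome"].tolist())
print(len(dropped))
```

Here the median fill inserts 5000 while the mean fill inserts 36000, and dropping incomplete rows leaves only two of the five records.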

2. Outlier Handling

# One record has an age of 0
print(df_train['age'].value_counts().sort_index())
df_train = df_train[df_train['age'] > 0]

# Box plots of the three past-due count features
import matplotlib.pyplot as plt
columns = ['NumberOfTime30-59DaysPastDueNotWorse',
          'NumberOfTime60-89DaysPastDueNotWorse',
          'NumberOfTimes90DaysLate']
df_train.loc[:, columns].plot.box(vert=False)
plt.show()

# Drop records where any of these counts is 90 or more
for col in columns:
    df_train = df_train.loc[df_train[col] < 90]

print(train.shape, df_train.shape)  # (149391, 11) (149165, 11)

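The output is as follows: [output screenshot: age value counts and box plots of the three past-due counts]

An age of 0 is generally treated as an outlier. Only one such record exists in the data, so it can simply be deleted.

From a business standpoint these features should not reach such high counts, so the abnormal records are deleted as well.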
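
The two filters above can also be written as one boolean mask instead of a loop. The sketch below is only an illustration on toy data: the helper name drop_rows_at_or_above is made up, and the value 98 is a stand-in for an implausibly high count.

```python
import pandas as pd


def drop_rows_at_or_above(df, columns, upper):
    """Keep only rows where every listed column is strictly below `upper`."""
    # Note: a NaN compares as False, so rows with nulls in these columns are dropped too.
    mask = (df[columns] < upper).all(axis=1)
    return df.loc[mask].copy()


# Toy data (hypothetical): one record with age 0, one with a count of 98.
toy = pd.DataFrame({
    "age": [45, 0, 33, 60],
    "NumberOfTimes90DaysLate": [0, 1, 98, 2],
})
toy = toy[toy["age"] > 0]  # remove the age == 0 record
clean = drop_rows_at_or_above(toy, ["NumberOfTimes90DaysLate"], upper=90)
print(clean)
```

Returning a copy avoids chained-assignment warnings if new columns are added to the cleaned frame later.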

4. Feature Selection

1. Binning

Continuous variables are discretized by binning, done here by calling the toad package directly.

import matplotlib.pyplot as plt
import toad
from toad.plot import bin_plot

# Binning methods: chi (chi-square), dt (decision tree), quantile (equal frequency),
# kmeans (k-means), step (equal width)
cb = toad.transform.Combiner()
cb.fit(df_train, 'SeriousDlqin2yrs', method='dt', n_bins=10, min_samples=0.05, empty_separate=True)
cut_points = cb.export()
print(cut_points)

# Adjust the bins manually
col = "NumberOfTime60-89DaysPastDueNotWorse"
rule = {col: [0, 1, 2]}
cb.update(rule)
bin_plot(cb.transform(df_train[[col, 'SeriousDlqin2yrs']], labels=True), x=col, target='SeriousDlqin2yrs')
plt.show()

cut_points = cb.export()
print(cut_points)

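The split thresholds for each feature are as follows: [output screenshot: cut points printed by cb.export()]

The bins of NumberOfTime60-89DaysPastDueNotWorse were adjusted manually.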
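
For readers without toad installed, the idea of "a list of cut points per feature" can be illustrated with plain pandas. The sketch below uses equal-frequency binning via pd.qcut on made-up ages. It is a stand-in for illustration only and does not reproduce toad's decision-tree splits.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values).
ages = pd.Series([21, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75])

# Equal-frequency binning into 4 bins; retbins=True also returns the bin edges.
binned, edges = pd.qcut(ages, q=4, retbins=True)

# The interior edges play the role of a cut-point list for this feature.
cut_points = edges[1:-1].tolist()

print(cut_points)
print(binned.value_counts().sort_index())
```

Each of the four bins ends up with three records, which is what equal-frequency binning aims for.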

2. IV-Based Filtering

IV stands for Information Value, also called the amount of information.
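
For reference, the definition that the cal_IV function in this section implements can be written as follows, where i runs over the bins of a feature, Bad_i and Good_i are the counts of bad and good records in bin i, and Bad_T and Good_T are the totals. The symbol names are chosen here for illustration.

```latex
\mathrm{WoE}_i = \ln\!\left(\frac{\mathrm{Bad}_i / \mathrm{Bad}_T}{\mathrm{Good}_i / \mathrm{Good}_T}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{\mathrm{Bad}_i}{\mathrm{Bad}_T} - \frac{\mathrm{Good}_i}{\mathrm{Good}_T}\right)\mathrm{WoE}_i
```

Both factors in each term of the sum always have the same sign, so IV is never negative, and it grows as the bad and good distributions across the bins move further apart.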

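| IV value | Description |
|---|---|
| iv <= 0.02 | No predictive power |
| 0.02 < iv <= 0.1 | Weak predictive power |
| 0.1 < iv <= 0.3 | Moderate predictive power |
| 0.3 < iv <= 0.5 | Strong predictive power |
| iv > 0.5 | Suspiciously strong, needs checking |
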
import math
import numpy as np

# Work on a copy so the new bin_ columns are not written to a slice
df_train = df_train.copy()

# Label each record with the bin it falls into
# (pd.cut uses right-closed intervals (a, b] by default)
for key, value in cut_points.items():
    print(key, value)
    bins = [-math.inf] + value + [math.inf]
    df_train['bin_' + key] = pd.cut(df_train[key], bins=bins).astype(str)

# Compute the IV of a binned feature
def cal_IV(df, feature, target):
    lst = []
    cols = ['Variable', 'Value', 'All', 'Bad']
    for val in df[feature].unique():
        in_bin = df[feature] == val
        lst.append([feature, val, in_bin.sum(), (in_bin & (df[target] == 1)).sum()])
    data = pd.DataFrame(lst, columns=cols)
    data = data[data['Bad'] > 0]  # bins without bad records would give log(0)
    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Bad'] / data['Distribution Good'])
    data['IV'] = (data['WoE'] * (data['Distribution Bad'] - data['Distribution Good'])).sum()
    data = data.sort_values(by=['Variable', 'Value'], ascending=True)
    return data['IV'].values[0]

x_col_dev = []
bin_cols = [c for c in df_train.columns.values if c.startswith('bin_')]
for f in bin_cols:
    va = cal_IV(df_train, f, 'SeriousDlqin2yrs')
    print(f, va)
    if va > 0.1:
        x_col_dev.append(f)

# Print the features with iv > 0.1
print(x_col_dev)

The output is as follows: [output screenshot: IV of each binned feature]

This run keeps the features with iv > 0.1, namely: ['RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTime60-89DaysPastDueNotWorse', 'age']
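
As a cross-check of the IV calculation, the same quantity can be computed with a single groupby. This is a sketch under assumptions rather than a replacement for cal_IV: the function name information_value and the toy columns are made up, and unlike cal_IV it also skips bins that contain no good records.

```python
import numpy as np
import pandas as pd


def information_value(df, feature, target):
    """IV of a binned feature; target is 1 for bad and 0 for good."""
    stats = df.groupby(feature)[target].agg(bad="sum", total="count")
    stats["good"] = stats["total"] - stats["bad"]
    # Skip bins without bad or without good records, where the log is undefined.
    stats = stats[(stats["bad"] > 0) & (stats["good"] > 0)]
    dist_bad = stats["bad"] / stats["bad"].sum()
    dist_good = stats["good"] / stats["good"].sum()
    woe = np.log(dist_bad / dist_good)
    return float(((dist_bad - dist_good) * woe).sum())


# Toy data (hypothetical): 10 records per bin, 1 bad in "low" and 4 bad in "high".
toy = pd.DataFrame({
    "bin": ["low"] * 10 + ["high"] * 10,
    "bad": [1] + [0] * 9 + [1] * 4 + [0] * 6,
})
iv = information_value(toy, "bin", "bad")
print(round(iv, 4))
```

For this toy table the bad distribution is (0.2, 0.8) and the good distribution is (0.6, 0.4), so the IV works out to 0.4 * ln(3) + 0.4 * ln(2).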

5. Model Building

1. WOE Transformation

def cal_WOE(df, features, target):
    # Append good/bad counts, rates, and WOE of each binned feature to the frame
    df_new = df
    for f in features:
        df_woe = df_new.groupby(f).agg({target: ['sum', 'count']})
        df_woe.columns = list(map(''.join, df_woe.columns.values))
        df_woe = df_woe.reset_index()
        df_woe = df_woe.rename(columns={target + 'sum': 'bad'})
        df_woe = df_woe.rename(columns={target + 'count': 'all'})
        df_woe['good'] = df_woe['all'] - df_woe['bad']
        df_woe = df_woe[[f, 'good', 'bad']]
        df_woe['bad_rate'] = df_woe['bad'].mask(df_woe['bad'] == 0, 1) / df_woe['bad'].sum()  # mask 0 to 1 to avoid log(0)
        df_woe['good_rate'] = df_woe['good'] / df_woe['good'].sum()
        df_woe['woe'] = np.log(df_woe['bad_rate'].divide(df_woe['good_rate'], fill_value=1))
        # Suffix every statistic with the feature name, e.g. woe_bin_age
        df_woe.columns = [c if c == f else c + '_' + f for c in list(df_woe.columns.values)]
        df_new = df_new.merge(df_woe, on=f, how='left')
    return df_new

feature_cols = ['RevolvingUtilizationOfUnsecuredLines','NumberOfTime30-59DaysPastDueNotWorse','age','NumberOfTimes90DaysLate','NumberOfTime60-89DaysPastDueNotWorse']
bin_cols = ['bin_RevolvingUtilizationOfUnsecuredLines','bin_NumberOfTime30-59DaysPastDueNotWorse','bin_age','bin_NumberOfTimes90DaysLate','bin_NumberOfTime60-89DaysPastDueNotWorse']
df_woe = cal_WOE(df_train, bin_cols, 'SeriousDlqin2yrs')
woe_cols = [c for c in list(df_woe.columns.values) if 'woe' in c]
df_woe[woe_cols]

# Show the WOE value of each bin for every variable
parts = []
for f in feature_cols:
    b = 'bin_' + f
    w = 'woe_bin_' + f
    df = df_woe[[w, b]].drop_duplicates()
    df.columns = ['woe', 'bin']
    df['features'] = f
    parts.append(df[['features', 'bin', 'woe']])
df_bin_to_woe = pd.concat(parts)
print(df_bin_to_woe)
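
A bin-to-WOE table of this shape can also serve as a lookup for records that were not part of the fit. The sketch below shows one way to do that with Series.map. The helper name bin_to_woe, the bin labels, and the WOE numbers are all hypothetical and are not results from this dataset.

```python
import pandas as pd

# Hypothetical lookup table in the same layout as df_bin_to_woe.
lookup = pd.DataFrame({
    "features": ["age", "age", "age"],
    "bin": ["(-inf, 35.0]", "(35.0, 60.0]", "(60.0, inf]"],
    "woe": [0.45, -0.05, -0.90],
})


def bin_to_woe(binned, lookup, feature):
    """Map the bin labels of one feature to their WOE values."""
    mapping = lookup.loc[lookup["features"] == feature].set_index("bin")["woe"]
    return binned.map(mapping)


# Bin labels of three new records (hypothetical).
new_bins = pd.Series(["(35.0, 60.0]", "(-inf, 35.0]", "(60.0, inf]"])
new_woe = bin_to_woe(new_bins, lookup, "age")
print(new_woe.tolist())
```

A label that is missing from the lookup maps to NaN, which makes unseen bins easy to spot before scoring.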

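A problem I ran into, recorded here:
I manually adjusted the bins of "NumberOfTime60-89DaysPastDueNotWorse".
I then used the toad package to do the WOE transformation directly:
transer = toad.transform.WOETransformer()
data01 = transer.fit_transform(cb.transform(df_train[x_col_dev+['SeriousDlqin2yrs']]), df_train['SeriousDlqin2yrs'], exclude=['SeriousDlqin2yrs'])
The WOE values obtained this way do not match the ones I computed myself.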
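
One possible contributor to this mismatch, offered as an assumption that has not been verified against the toad version used here: pd.cut closes intervals on the right by default, (a, b], while toad's Combiner is understood to use left-closed bins, [a, b). For an integer feature with cut points [0, 1, 2] the two conventions group the records differently, as this pandas-only sketch shows. The zero-count handling in cal_WOE (replacing 0 bad records with 1) could be another source of difference.

```python
import math
import pandas as pd

# Integer counts as they might appear in a past-due feature (toy values).
values = pd.Series([0, 1, 2, 3])
edges = [-math.inf, 0, 1, 2, math.inf]

right_closed = pd.cut(values, bins=edges)              # default: (a, b]
left_closed = pd.cut(values, bins=edges, right=False)  # [a, b)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
```

With right-closed intervals the values 0, 1, 2, and 3 land in four different bins, while with left-closed intervals 2 and 3 share a bin and the first bin stays empty. Different bin membership leads to different WOE values even when the cut points are identical.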

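2. Building the Logistic Regression Model

cs-test.csv does not contain the SeriousDlqin2yrs label, so it cannot be used for validation. Instead, 70% of cs-training is used as the training set and the remaining 30% as the validation set.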

from sklearn.model_selection import train_test_split
X_train, X_vali, y_train, y_vali = train_test_split(df_woe[woe_cols], df_woe['SeriousDlqin2yrs'], test_size=0.3, random_state=40)
train_data = pd.concat([y_train, X_train], axis=1)
vali_data = pd.concat([y_vali, X_vali], axis=1)

# Sizes of the training and validation sets
print(train_data.shape, vali_data.shape)  # (104415, 6) (44750, 6)

# Fit the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=40)
model.fit(X_train, y_train)

# Accuracy on the validation set
print(model.score(X_vali, y_vali))  # 0.9368044692737431

# AUC of the model
import sklearn.metrics as metrics
probs = model.predict_proba(X_vali)

preds = probs[:, 1]  # predicted probability of the positive (bad) class
fpr, tpr, threshold = metrics.roc_curve(y_vali, preds)
roc_auc = metrics.auc(fpr, tpr)

# Plot the ROC curve
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
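
A note on reading these two metrics. With a default rate of roughly 6.7%, a model that always predicts "no default" would already be right about 93% of the time, so the accuracy of 0.9368 says little on its own and AUC is the more informative number. The sketch below illustrates both ideas on made-up labels and scores. The function names are chosen for this example, and the pairwise AUC is a plain NumPy illustration of what the metric measures, not the scikit-learn implementation.

```python
import numpy as np


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the more frequent class."""
    positive_rate = np.asarray(y_true).mean()
    return float(max(positive_rate, 1 - positive_rate))


def pairwise_auc(y_true, scores):
    """Share of (bad, good) pairs in which the bad record scores higher; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


# Toy labels and predicted bad probabilities (hypothetical).
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p = [0.1, 0.2, 0.15, 0.3, 0.05, 0.6, 0.25, 0.12, 0.7, 0.4]

baseline = majority_baseline_accuracy(y)
auc = pairwise_auc(y, p)
print(baseline)
print(auc)
```

In this toy case the baseline accuracy is 0.8 and the AUC is 15/16, because only one of the sixteen bad/good pairs is ranked the wrong way round.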

6. Scorecard Conversion

def generate_scorecard(model_coef, binning_df, features, B):
    # Points of a bin = -coefficient * WOE * B, rounded to an integer
    lst = []
    cols = ['Variable', 'Binning', 'Score']
    coef = model_coef[0]
    for i in range(len(features)):
        f = features[i]
        df = binning_df[binning_df['features'] == f]
        for index, row in df.iterrows():
            lst.append([f, row['bin'], int(round(-coef[i] * row['woe'] * B))])
    data = pd.DataFrame(lst, columns=cols)
    return data

B = 50 / np.log(2)           # 50 points to double the odds
A = 650 + B * np.log(1 / 1)  # base score of 650 at odds of 1:1

# feature_cols must be in the same order as the columns the model was trained on
score_card = generate_scorecard(model.coef_, df_bin_to_woe, feature_cols, B)
sort_scorecard = score_card.groupby('Variable').apply(lambda x: x.sort_values('Score', ascending=False))
sort_scorecard
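
The constants A and B come from the usual scaling Score = A - B * ln(odds), where odds = P(bad) / P(good). The sketch below works through that formula with the values used above (base score 650 at odds 1:1, 50 points to double the odds). The constant and function names are introduced here for illustration only.

```python
import math

PDO = 50          # points to double the odds, as in B = 50 / ln(2)
BASE_SCORE = 650  # score assigned at BASE_ODDS
BASE_ODDS = 1.0   # bad:good odds of 1:1

B = PDO / math.log(2)
A = BASE_SCORE + B * math.log(BASE_ODDS)


def score_from_odds(odds):
    """Score = A - B * ln(odds), where odds = P(bad) / P(good)."""
    return A - B * math.log(odds)


def score_from_probability(p_bad):
    """Convert a predicted bad probability to a score via its odds."""
    return score_from_odds(p_bad / (1 - p_bad))


print(round(score_from_odds(1.0)))
print(round(score_from_odds(0.5)))
print(round(score_from_probability(0.5)))
```

Halving the bad:good odds raises the score by 50 points. Because the logistic model gives ln(odds) = intercept + sum(coef_i * woe_i), the per-bin points -coef_i * woe_i * B in the scorecard are the feature-wise pieces of -B * ln(odds). A full score would additionally include the constant A - B * intercept, which the table above does not list.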

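Conclusion

This post records my own practice run through a scorecard model. Comments and discussion are welcome.

References

Zhihu article: https://zhuanlan.zhihu.com/p/148102950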
