Data Science Case 5: Logistic Regression — Building an Initial Credit Rating Model and Testing the Classifier (Code)

8 Logistic Regression

Credit Risk Modeling Case Study
Data description: this dataset records auto loan defaults.
# Field — description
# application_id — applicant ID
# account_number — account number
# bad_ind — default indicator
# vehicle_year — vehicle purchase year
# vehicle_make — vehicle manufacturer
# bankruptcy_ind — prior bankruptcy indicator
# tot_derog — number of derogatory credit events in the past five years (e.g., a phone account closed for unpaid bills)
# tot_tr — total number of accounts
# age_oldest_tr — age of the oldest account (months)
# tot_open_tr — number of open accounts
# tot_rev_tr — number of open revolving credit accounts (e.g., credit cards)
# tot_rev_debt — balance on open revolving accounts (e.g., credit card debt)
# tot_rev_line — revolving credit limit (authorized credit card line)
# rev_util — revolving credit utilization (balance / limit)
# fico_score — FICO score
# purch_price — vehicle purchase price (yuan)
# msrp — manufacturer's suggested retail price
# down_pyt — down payment on the installment plan
# loan_term — loan term (months)
# loan_amt — loan amount
# ltv — loan amount / MSRP * 100
# tot_income — average monthly income (yuan)
# veh_mileage — vehicle mileage (miles)
# used_ind — used-car indicator
# weight — sample weight

import os
import numpy as np
from scipy import stats
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
# IPython help: inspect pd.read_csv's signature and options
?pd.read_csv

1. Importing and cleaning the data

Note that the file contains two header rows; header=1 uses the second row as the column names.

accepts = pd.read_csv(r'.\data\accepts.csv',header =1,encoding='gbk')
accepts2 = accepts.drop([5845,5846,11692]).dropna()  # drop three problem rows, then any rows with missing values
accepts2.head()
(output: first five rows of accepts2 — application_id, account_number, bad_ind, vehicle_year, vehicle_make, bankruptcy_ind, tot_derog, tot_tr, age_oldest_tr, tot_open_tr, ..., purch_price, msrp, down_pyt, loan_term, loan_amt, ltv, tot_income, veh_mileage, used_ind, weight)

5 rows × 25 columns

# Convert string columns to floating point
accepts2['tot_rev_line'] = accepts2['tot_rev_line'].astype(float)
accepts2['tot_income'] = accepts2['tot_income'].astype(float)
accepts2['loan_amt'] = accepts2['loan_amt'].astype(float)
accepts2['down_pyt'] = accepts2['down_pyt'].astype(float)
accepts2['tot_rev_debt'] = accepts2['tot_rev_debt'].astype(float)
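The repeated astype calls can also be collapsed into one conversion; a minimal sketch over the same columns:

# Convert all five columns in a single call (equivalent to the lines above)
num_cols = ['tot_rev_line', 'tot_income', 'loan_amt', 'down_pyt', 'tot_rev_debt']
accepts2[num_cols] = accepts2[num_cols].astype(float)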

2. Derived variables

# Safe division: NaN propagates, division by zero maps to -1
def divMy(x, y):
    # note: the original checked `x == np.nan`, which is always False;
    # pd.isnull() is the correct missing-value test
    if pd.isnull(x) or pd.isnull(y):
        return np.nan
    elif y == 0:
        return -1
    else:
        return x / y
divMy(1, 2)
0.5
## Historical debt-to-income ratio: tot_rev_line/tot_income
accepts2["dti_hist"] = accepts2[["tot_rev_line","tot_income"]].apply(lambda x: divMy(x.iloc[0], x.iloc[1]), axis=1)
## New debt-to-income ratio for this loan: loan_amt/tot_income (identifier "dti_mew" kept as in the original; it is referenced again during variable selection)
accepts2["dti_mew"] = accepts2[["loan_amt","tot_income"]].apply(lambda x: divMy(x.iloc[0], x.iloc[1]), axis=1)
## Down-payment ratio for this loan: down_pyt/loan_amt
accepts2["fta"] = accepts2[["down_pyt","loan_amt"]].apply(lambda x: divMy(x.iloc[0], x.iloc[1]), axis=1)
## New-debt-to-existing-debt ratio: loan_amt/tot_rev_debt
accepts2["nth"] = accepts2[["loan_amt","tot_rev_debt"]].apply(lambda x: divMy(x.iloc[0], x.iloc[1]), axis=1)
## New-debt-to-credit-line ratio: loan_amt/tot_rev_line
accepts2["nta"] = accepts2[["loan_amt","tot_rev_line"]].apply(lambda x: divMy(x.iloc[0], x.iloc[1]), axis=1)

accepts2.head()
(output: first five rows of accepts2 with the five derived columns dti_hist, dti_mew, fta, nth, nta appended)

5 rows × 30 columns

3. Relationships among categorical variables

3.1 Crosstab

cross_table = pd.crosstab(accepts2.used_ind,accepts2.bad_ind, margins=True)
#cross_table = pd.crosstab(accepts.bankruptcy_ind,accepts.bad_ind, margins=True)
cross_table
bad_ind      0     1   All
used_ind
0         2914   612  3526
1         3724   960  4684
All       6638  1572  8210

3.2 Contingency table (row proportions)

# Method 1:
def percConvert(ser):
    return ser / float(ser.iloc[-1])   # divide each row by its 'All' total

cross_table = pd.crosstab(accepts2.used_ind,accepts2.bad_ind, margins=True)
cross_table.apply(percConvert, axis=1)
bad_ind          0         1  All
used_ind
0         0.826432  0.173568  1.0
1         0.795047  0.204953  1.0
All       0.808526  0.191474  1.0
print('''chisq = %6.4f 
p-value = %6.4f
dof = %i 
expected_freq = %s'''  %stats.chi2_contingency(cross_table.iloc[:2, :2]))
chisq = 12.5979 
p-value = 0.0004
dof = 1 
expected_freq = [[2850.86333739  675.13666261]
 [3787.13666261  896.86333739]]
# Method 2 (gives a different result — method 1 is the one to trust: chi2_contingency
# expects observed counts, but here it is fed row proportions, which discards the sample size)
cross_table = pd.crosstab(accepts2.used_ind,accepts2.bad_ind, margins=False)
cross_table = cross_table.div(cross_table.sum(1),axis = 0)
cross_table
print('''chisq = %6.4f 
p-value = %6.4f
dof = %i 
expected_freq = %s'''  %stats.chi2_contingency(cross_table.iloc[:2, :2]))
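The reason method 2 disagrees: stats.chi2_contingency expects observed frequencies (counts), and dividing by row sums first throws away the sample size that the test statistic depends on. A sketch of the correct pairing — proportions for display, counts for the test:

counts = pd.crosstab(accepts2.used_ind, accepts2.bad_ind)   # raw counts, no margins
row_pct = counts.div(counts.sum(axis=1), axis=0)            # row proportions, display only
chi2, p, dof, expected = stats.chi2_contingency(counts)     # the test runs on counts
print('chisq = %6.4f, p-value = %6.4f, dof = %i' % (chi2, p, dof))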

4. Logistic regression with a single predictor

4.1 Data preprocessing (convert strings to numeric, inspect the relationship between variables)

# Configure matplotlib to render Chinese characters correctly
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False
# The imported columns are strings; convert them to numeric types first
accepts2['age_oldest_tr'] = accepts2['age_oldest_tr'].astype(int)
accepts2['bad_ind'] = accepts2['bad_ind'].astype(int)

accepts2.plot(x = 'age_oldest_tr',y = 'bad_ind', kind = 'scatter')
# plt.scatter(x=accepts2.age_oldest_tr, y=accepts2.bad_ind)
<matplotlib.axes._subplots.AxesSubplot at 0x1fc07313400>

[Figure: scatter plot of bad_ind against age_oldest_tr]

4.2 Random sampling: building the training and test sets

# Preprocessing for the forward selection in section 5.2
candidates = ['bad_ind','tot_derog','age_oldest_tr','tot_open_tr','rev_util','fico_score','loan_term','ltv',
              'veh_mileage','dti_hist','dti_mew','fta','nth','nta']
accepts2[candidates] = accepts2[candidates].astype(float)

train = accepts2.sample(frac=0.7, random_state=1234).copy()
test = accepts2[~ accepts2.index.isin(train.index)].copy()
print(' Training set size: %i \n Test set size: %i' %(len(train), len(test)))
 Training set size: 5747 
 Test set size: 2463
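About 19% of the sample defaults, so a plain random split can shift the default rate between the two sets. A sketch of a stratified alternative with scikit-learn (train_s/test_s are illustrative names; the rest of the code uses the split above):

from sklearn.model_selection import train_test_split

# keep the default rate identical in both sets
train_s, test_s = train_test_split(accepts2, test_size=0.3,
                                   stratify=accepts2['bad_ind'], random_state=1234)
print(train_s['bad_ind'].mean(), test_s['bad_ind'].mean())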

4.3 Fitting the logistic regression

lg = smf.glm('bad_ind ~ age_oldest_tr', data=train, 
             family=sm.families.Binomial(sm.families.links.logit)).fit()  # logistic regression: Bernoulli response, logit link
# note: recent statsmodels versions expect a link instance, e.g. sm.families.Binomial(sm.families.links.Logit())
lg.summary()
Generalized Linear Model Regression Results
Dep. Variable:  bad_ind    No. Observations:    5747
Model:          GLM        Df Residuals:        5745
Model Family:   Binomial   Df Model:            1
Link Function:  logit      Scale:               1.0000
Method:         IRLS       Log-Likelihood:      -2730.8
Date:   Sat, 15 Feb 2020   Deviance:            5461.6
Time:           11:24:04   Pearson chi2:        5.89e+03
No. Iterations: 5          Covariance Type:     nonrobust

                   coef   std err        z    P>|z|   [0.025   0.975]
Intercept       -0.5769     0.065   -8.897    0.000   -0.704   -0.450
age_oldest_tr   -0.0057     0.000  -13.682    0.000   -0.006   -0.005

4.4 Prediction

train['proba'] = lg.predict(train)
test['proba'] = lg.predict(test)

test['proba'].head(10)
4     0.237216
10    0.155657
13    0.277338
20    0.060517
22    0.235165
25    0.098014
27    0.050966
32    0.193184
51    0.254062
52    0.318912
Name: proba, dtype: float64

4.5 Model evaluation

1. Set a threshold

# classify as default when the predicted probability exceeds 0.3
test['prediction'] = (test['proba'] > 0.3).astype('int')
2. Confusion matrix
pd.crosstab(test.bad_ind, test.prediction, margins=True)
prediction     0    1   All
bad_ind
0.0         1877  138  2015
1.0          366   82   448
All         2243  220  2463
3. Accuracy
acc = sum(test['prediction'] == test['bad_ind']) / float(len(test))   # np.float was removed from NumPy; use the builtin
print('The accuracy is %.2f' % acc)
The accuracy is 0.80
# Search a grid of thresholds for the best metrics
for i in np.arange(0.02, 0.3, 0.02):
    prediction = (test['proba'] > i).astype('int')
    confusion_matrix = pd.crosstab(prediction, test.bad_ind,
                                   margins = True)
    # rows are predictions, columns are actuals; the names below follow the original
    # code but are computed for class 0 ('good'): the quantity printed as 'precision'
    # is the class-0 recall, 'recall' is the class-0 precision, and 'Specificity'
    # is the class-1 precision
    precision = confusion_matrix.loc[0, 0] / confusion_matrix.loc['All', 0]
    recall = confusion_matrix.loc[0, 0] / confusion_matrix.loc[0, 'All']
    Specificity = confusion_matrix.loc[1, 1] / confusion_matrix.loc[1, 'All']
    f1_score = 2 * (precision * recall) / (precision + recall)
    print('threshold: %s, precision: %.2f, recall:%.2f ,Specificity:%.2f , f1_score:%.2f'%(i, precision, recall, Specificity,f1_score))
threshold: 0.02, precision: 0.00, recall:0.50 ,Specificity:0.18 , f1_score:0.00
threshold: 0.04, precision: 0.01, recall:0.94 ,Specificity:0.18 , f1_score:0.02
threshold: 0.06, precision: 0.03, recall:0.92 ,Specificity:0.18 , f1_score:0.06
threshold: 0.08, precision: 0.07, recall:0.93 ,Specificity:0.19 , f1_score:0.14
threshold: 0.1, precision: 0.13, recall:0.92 ,Specificity:0.19 , f1_score:0.23
threshold: 0.12000000000000001, precision: 0.21, recall:0.92 ,Specificity:0.21 , f1_score:0.34
threshold: 0.13999999999999999, precision: 0.27, recall:0.92 ,Specificity:0.21 , f1_score:0.42
threshold: 0.16, precision: 0.35, recall:0.90 ,Specificity:0.22 , f1_score:0.51
threshold: 0.18, precision: 0.44, recall:0.89 ,Specificity:0.23 , f1_score:0.59
threshold: 0.19999999999999998, precision: 0.56, recall:0.87 ,Specificity:0.24 , f1_score:0.68
threshold: 0.22, precision: 0.68, recall:0.87 ,Specificity:0.27 , f1_score:0.77
threshold: 0.24, precision: 0.76, recall:0.85 ,Specificity:0.28 , f1_score:0.81
threshold: 0.26, precision: 0.83, recall:0.85 ,Specificity:0.32 , f1_score:0.84
threshold: 0.28, precision: 0.88, recall:0.85 ,Specificity:0.34 , f1_score:0.86
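Instead of a manual grid, sklearn can sweep every candidate threshold at once. A sketch that picks the threshold maximizing F1 for the default class (bad_ind == 1 — note the loop above scores class 0, so the two answers are not directly comparable):

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(test.bad_ind, test.proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # guard against 0/0
best = f1[:-1].argmax()            # precision/recall have one more entry than thresholds
print('best threshold: %.3f, F1: %.3f' % (thresholds[best], f1[best]))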
4. Plot the ROC curve
import sklearn.metrics as metrics

fpr_test, tpr_test, th_test = metrics.roc_curve(test.bad_ind, test.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(train.bad_ind, train.proba)

plt.figure(figsize=[3, 3])
plt.plot(fpr_test, tpr_test, 'b--')
plt.plot(fpr_train, tpr_train, 'r-')
plt.title('ROC curve')
plt.show()
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))

[Figure: ROC curves — test set (blue dashed) vs. training set (red solid)]

AUC = 0.6488
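In credit scoring the Kolmogorov–Smirnov (KS) statistic is commonly reported next to AUC; it is the maximum vertical gap between the ROC curve and the diagonal, i.e. max(TPR − FPR), so it falls straight out of the arrays already computed above:

# KS statistic of the test set (a sketch; reuses fpr_test / tpr_test from above)
ks = (tpr_test - fpr_test).max()
print('KS = %.4f' % ks)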

5. Logistic regression

5.1 Logistic regression with a categorical predictor

formula = '''bad_ind ~ C(used_ind)'''

lg_m = smf.glm(formula=formula, data=train, 
             family=sm.families.Binomial(sm.families.links.logit)).fit()
lg_m.summary()
Generalized Linear Model Regression Results
Dep. Variable:  bad_ind    No. Observations:    5747
Model:          GLM        Df Residuals:        5745
Model Family:   Binomial   Df Model:            1
Link Function:  logit      Scale:               1.0000
Method:         IRLS       Log-Likelihood:      -2836.2
Date:   Sat, 15 Feb 2020   Deviance:            5672.4
Time:           11:24:31   Pearson chi2:        5.75e+03
No. Iterations: 4          Covariance Type:     nonrobust

                      coef   std err        z    P>|z|   [0.025   0.975]
Intercept          -1.5267     0.053  -29.060    0.000   -1.630   -1.424
C(used_ind)[T.1]    0.1927     0.068    2.838    0.005    0.060    0.326

5.2 Multiple logistic regression

1. Forward selection
def forward_select(data, response):
    """Forward stepwise selection by AIC for a binomial GLM."""
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response, ' + '.join(selected + [candidate]))
            aic = smf.glm(
                formula=formula, data=data, 
                family=sm.families.Binomial(sm.families.links.logit)
            ).fit().aic
            aic_with_candidates.append((aic, candidate))
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()  # candidate with the lowest AIC
        if current_score > best_new_score: 
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {},continuing!'.format(current_score))
        else:        
            print('forward selection over!')
            break
            
    formula = "{} ~ {} ".format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = smf.glm(
        formula=formula, data=data, 
        family=sm.families.Binomial(sm.families.links.logit)
    ).fit()
    return model
# Only continuous variables can enter this selection; categorical variables need a WOE transformation first
candidates = ['bad_ind','tot_derog','age_oldest_tr','tot_open_tr','rev_util','fico_score','loan_term','ltv',
              'veh_mileage','dti_hist','dti_mew','fta','nth','nta']
data_for_select = train[candidates]

lg_m1 = forward_select(data=data_for_select, response='bad_ind')
lg_m1.summary()

aic is 5112.102444955438,continuing!
aic is 4930.620746960663,continuing!
aic is 4867.936361578135,continuing!
aic is 4862.177989178937,continuing!
aic is 4856.0240437714365,continuing!
aic is 4853.140577715231,continuing!
aic is 4851.285414957676,continuing!
forward selection over!
final formula is bad_ind ~ fico_score + ltv + age_oldest_tr + rev_util + nth + tot_derog + fta 
Generalized Linear Model Regression Results
Dep. Variable:  bad_ind    No. Observations:    5747
Model:          GLM        Df Residuals:        5739
Model Family:   Binomial   Df Model:            7
Link Function:  logit      Scale:               1.0000
Method:         IRLS       Log-Likelihood:      -2417.6
Date:   Sat, 15 Feb 2020   Deviance:            4835.3
Time:           11:24:47   Pearson chi2:        5.60e+03
No. Iterations: 6          Covariance Type:     nonrobust

                   coef   std err        z    P>|z|    [0.025    0.975]
Intercept        5.6094     0.591    9.491    0.000     4.451     6.768
fico_score      -0.0139     0.001  -16.606    0.000    -0.015    -0.012
ltv              0.0278     0.002   11.201    0.000     0.023     0.033
age_oldest_tr   -0.0035     0.000   -7.929    0.000    -0.004    -0.003
rev_util         0.0010     0.000    2.564    0.010     0.000     0.002
nth              0.0004     0.000    2.529    0.011  8.63e-05     0.001
tot_derog        0.0253     0.011    2.271    0.023     0.003     0.047
fta             -0.6570     0.349   -1.880    0.060    -1.342     0.028
2. Variance inflation factors
# 'statsmodels.stats.outliers_influence.variance_inflation_factor' seems to give wrong
# answers when called naively — see the note after the output below
# A VIF > 10 signals severe multicollinearity for that variable
def vif(df, col_i):
    from statsmodels.formula.api import ols
    
    cols = list(df.columns)
    cols.remove(col_i)
    cols_noti = cols
    formula = col_i + '~' + '+'.join(cols_noti)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)
candidates = ['bad_ind','fico_score','ltv','age_oldest_tr','tot_derog','nth','tot_open_tr','veh_mileage','rev_util']
exog = train[candidates].drop(['bad_ind'], axis=1)

for i in exog.columns:
    print(i, '\t', vif(df=exog, col_i=i))
fico_score 	 1.6474312909303999
ltv 	 1.0312076990808654
age_oldest_tr 	 1.2398180183116478
tot_derog 	 1.387916859405689
nth 	 1.019005900596472
tot_open_tr 	 1.130385791173031
veh_mileage 	 1.0249442971334133
rev_util 	 1.0857772491871416
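On the comment above: statsmodels' built-in variance_inflation_factor only "seems wrong" because it does not add an intercept itself. Prepending a constant column makes it agree with the ols-based vif(); a sketch:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

X = add_constant(exog)                               # the missing intercept column
for k, name in enumerate(exog.columns, start=1):     # index 0 is the constant; skip it
    print(name, '\t', variance_inflation_factor(X.values, k))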
train['proba'] = lg_m1.predict(train)
test['proba'] = lg_m1.predict(test)
# ROC curve
import sklearn.metrics as metrics

fpr_test, tpr_test, th_test = metrics.roc_curve(test.bad_ind, test.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(train.bad_ind, train.proba)

plt.figure(figsize=[3, 3])
plt.plot(fpr_test, tpr_test, 'b--')
plt.plot(fpr_train, tpr_train, 'r-')
plt.title('ROC curve')
plt.show()

print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))

[Figure: ROC curves — test set (blue dashed) vs. training set (red solid)]
AUC = 0.7756

6. Handling the remaining variables

The categorical variables vehicle_year, vehicle_make, bankruptcy_ind, and used_ind cannot currently pass through the stepwise selection.

Possible solutions:

  • 1. Test each variable for significance one at a time
  • 2. Select variables with a decision tree or similar method, but multi-level categorical variables must be coarse-classed first
  • 3. Apply a WOE transformation; multi-level variables also need coarse classing first — the WOE routines in the scorecardpy package can do this automatically (see the sketch below)
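For option 3, a minimal sketch with the scorecardpy package (assuming it is installed; woebin performs the automatic coarse classing and woebin_ply applies the WOE mapping, per that package's documentation — the column subset here is illustrative):

import scorecardpy as sc

cat_cols = ['bad_ind', 'vehicle_make', 'bankruptcy_ind', 'used_ind']
bins = sc.woebin(accepts2[cat_cols], y='bad_ind')        # bin each variable, compute WOE/IV
accepts2_woe = sc.woebin_ply(accepts2[cat_cols], bins)   # replace raw values with WOE values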

Here we use the first approach.


#formula = '''bad_ind ~ fico_score+ltv+age_oldest_tr+tot_derog+nth+tot_open_tr+veh_mileage+rev_util+C(used_ind)+C(vehicle_year)+C(bankruptcy_ind)'''
formula = '''bad_ind ~ fico_score+ltv+age_oldest_tr+tot_derog+nth+tot_open_tr+veh_mileage+rev_util+C(bankruptcy_ind)'''
lg_m = smf.glm(formula=formula, data=train, 
             family=sm.families.Binomial(sm.families.links.logit)).fit()
lg_m.summary()
Generalized Linear Model Regression Results
Dep. Variable:  bad_ind    No. Observations:    5747
Model:          GLM        Df Residuals:        5737
Model Family:   Binomial   Df Model:            9
Link Function:  logit      Scale:               1.0000
Method:         IRLS       Log-Likelihood:      -2416.0
Date:   Sat, 15 Feb 2020   Deviance:            4832.0
Time:           11:44:38   Pearson chi2:        5.54e+03
No. Iterations: 6          Covariance Type:     nonrobust

                            coef    std err        z    P>|z|     [0.025    0.975]
Intercept                 5.4394      0.596    9.122    0.000      4.271     6.608
C(bankruptcy_ind)[T.Y]   -0.3261      0.136   -2.397    0.017     -0.593    -0.059
fico_score               -0.0138      0.001  -16.393    0.000     -0.015    -0.012
ltv                       0.0290      0.002   12.024    0.000      0.024     0.034
age_oldest_tr            -0.0034      0.000   -7.282    0.000     -0.004    -0.002
tot_derog                 0.0330      0.012    2.788    0.005      0.010     0.056
nth                       0.0004      0.000    2.617    0.009   9.82e-05     0.001
tot_open_tr              -0.0103      0.012   -0.866    0.387     -0.034     0.013
veh_mileage            8.365e-07   1.17e-06    0.713    0.476  -1.46e-06  3.14e-06
rev_util                  0.0011      0.000    2.610    0.009      0.000     0.002
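Since the goal is an initial credit rating, the fitted log-odds can be rescaled to score points with the usual points-to-double-the-odds convention. A sketch — the anchors (600 points at bad:good odds of 1:19, 20 points to halve the bad odds) are illustrative choices, not part of the original case:

PDO, base_score = 20, 600
B = PDO / np.log(2)                       # points that halve the bad odds
A = base_score + B * np.log(1 / 19)       # calibrate: 600 points at bad odds of 1:19
odds_bad = test['proba'] / (1 - test['proba'])
test['score'] = A - B * np.log(odds_bad)  # lower default probability -> higher score
test['score'].describe()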