sklearn: Building a Credit Scorecard with Logistic Regression (study notes from 菜菜's video course)

3.0 Introduction

The relationship between logistic regression and linear regression

Logistic regression is an algorithm for discrete (binary) labels; it is built on top of linear regression, which handles continuous labels.

| Algorithm | Linear regression | Logistic regression |
| --- | --- | --- |
| Output | Continuous (finds the linear relationship between x and y) | Continuous in (0, 1) (mapped to two classes) |
| Algorithm type | Regression algorithm | Generalized regression algorithm (can also handle regression problems) |

Logistic regression feeds the linear regression output through the sigmoid function, z => g(z), so the result lies in (0, 1): values near 0 are classified as 0 and values near 1 as 1, which turns the regression into a classifier.

| Function | sigmoid | MinMaxScaler |
| --- | --- | --- |
| Compressed range | (0, 1) | [0, 1] |

ln( g(z) / (1 - g(z)) ) = z (the linear regression equation): taking the log-odds of the sigmoid output recovers the linear model.
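The round trip between the sigmoid and the log-odds can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), squeezes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logit(g):
    # ln(g / (1 - g)) recovers the linear-regression output z
    return np.log(g / (1.0 - g))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
g = sigmoid(z)
print(np.allclose(logit(g), z))  # True: the log-odds recover z exactly
```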

Removing multicollinearity among features

Linear regression expects roughly normally distributed data and requires multicollinearity among features to be removed;
variance filtering or the variance inflation factor (VIF) can be used for this.
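As a sketch of the VIF idea (not from the course; a hand-rolled version using plain least squares): VIF_i = 1 / (1 - R²_i), where R²_i comes from regressing feature i on all the other features. A VIF far above 1 flags a collinear feature.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = samples)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        # regress column i on the remaining columns plus an intercept
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# column 2 nearly duplicates column 0, so columns 0 and 2 are collinear
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])
print(vif(X))  # the collinear pair gets a very large VIF, b stays near 1
```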

Why use logistic regression on financial data


Choosing the regularization

Regularization handles overfitting.

| | L1 norm | L2 norm |
| --- | --- | --- |
| Definition | sum of the absolute values of the parameters | square root of the sum of the squared parameters |

C: balances the loss function against the regularization term (in sklearn, a smaller C means stronger regularization).

| | L1 regularization | L2 regularization |
| --- | --- | --- |
| Effect on parameters | some become exactly 0 (sparsity) | some become close to 0 |

Feature selection methods

Apart from algorithms such as PCA and SVD, which destroy the interpretable relationship between features and the label,

we can use embedded methods, variance filtering, the chi-square test, mutual information, wrapper methods, and so on.

Mutual information: a filter method that captures any kind of relationship between a feature and the label (linear or nonlinear), similar in usage to the F-test.
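As an illustration with sklearn's filter interface (a sketch on synthetic data, not part of the course code), `mutual_info_classif` scores each feature against the label and picks up the nonlinear dependence that an F-test would miss:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x_informative = rng.normal(size=500)
x_noise = rng.normal(size=500)
# the label depends on |x_informative|: a purely nonlinear relationship
y = (np.abs(x_informative) > 1).astype(int)
X = np.column_stack([x_informative, x_noise])

scores = mutual_info_classif(X, y, random_state=0)
print(scores)  # the informative feature scores clearly higher than the noise feature
```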

What binning is for

Discretizing a feature simplifies the logistic regression model and lowers the risk of overfitting.

The number of bins is easy to change, which helps model iteration;

With few discrete values, computation is fast;

After discretization the model is more stable: within-bin differences shrink and between-bin differences grow, though samples near bin edges remain unstable.

3.1 Importing libraries

%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression as LR  # import logistic regression from the linear-model module
data = pd.read_csv(r"D:\class_file\day08_05\rankingcard.csv",index_col=0)

3.2 Data preprocessing

data.head()
| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 3 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 4 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      150000 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 2   age                                   150000 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 4   DebtRatio                             150000 non-null  float64
 5   MonthlyIncome                         120269 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 7   NumberOfTimes90DaysLate               150000 non-null  int64  
 8   NumberRealEstateLoansOrLines          150000 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 10  NumberOfDependents                    146076 non-null  float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB

3.2.1 Removing duplicates

data.drop_duplicates(inplace=True)  # drop duplicate rows and overwrite the original data
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 149391 entries, 1 to 150000
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      149391 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  149391 non-null  float64
 2   age                                   149391 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  149391 non-null  int64  
 4   DebtRatio                             149391 non-null  float64
 5   MonthlyIncome                         120170 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       149391 non-null  int64  
 7   NumberOfTimes90DaysLate               149391 non-null  int64  
 8   NumberRealEstateLoansOrLines          149391 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  149391 non-null  int64  
 10  NumberOfDependents                    145563 non-null  float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB
data.index = range(data.shape[0])  # reset the index to 0 .. n_rows-1
data.shape[0]
149391

3.2.2 Filling missing values

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      149391 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  149391 non-null  float64
 2   age                                   149391 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  149391 non-null  int64  
 4   DebtRatio                             149391 non-null  float64
 5   MonthlyIncome                         120170 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       149391 non-null  int64  
 7   NumberOfTimes90DaysLate               149391 non-null  int64  
 8   NumberRealEstateLoansOrLines          149391 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  149391 non-null  int64  
 10  NumberOfDependents                    145563 non-null  float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
data.isnull().sum()/data.shape[0]  # isnull returns booleans; summing counts the missing values, dividing by the row count gives the missing ratio per column
SeriousDlqin2yrs                        0.000000
RevolvingUtilizationOfUnsecuredLines    0.000000
age                                     0.000000
NumberOfTime30-59DaysPastDueNotWorse    0.000000
DebtRatio                               0.000000
MonthlyIncome                           0.195601
NumberOfOpenCreditLinesAndLoans         0.000000
NumberOfTimes90DaysLate                 0.000000
NumberRealEstateLoansOrLines            0.000000
NumberOfTime60-89DaysPastDueNotWorse    0.000000
NumberOfDependents                      0.025624
dtype: float64
data.isnull().mean()  # equivalent one-step way to get the missing ratio
SeriousDlqin2yrs                        0.000000
RevolvingUtilizationOfUnsecuredLines    0.000000
age                                     0.000000
NumberOfTime30-59DaysPastDueNotWorse    0.000000
DebtRatio                               0.000000
MonthlyIncome                           0.195601
NumberOfOpenCreditLinesAndLoans         0.000000
NumberOfTimes90DaysLate                 0.000000
NumberRealEstateLoansOrLines            0.000000
NumberOfTime60-89DaysPastDueNotWorse    0.000000
NumberOfDependents                      0.025624
dtype: float64
data["NumberOfDependents"].fillna(int(data["NumberOfDependents"].mean()),inplace=True)  # fill with the mean, truncated to an integer
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      149391 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  149391 non-null  float64
 2   age                                   149391 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  149391 non-null  int64  
 4   DebtRatio                             149391 non-null  float64
 5   MonthlyIncome                         120170 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       149391 non-null  int64  
 7   NumberOfTimes90DaysLate               149391 non-null  int64  
 8   NumberRealEstateLoansOrLines          149391 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  149391 non-null  int64  
 10  NumberOfDependents                    149391 non-null  float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB
data.isnull().sum()/data.shape[0]
SeriousDlqin2yrs                        0.000000
RevolvingUtilizationOfUnsecuredLines    0.000000
age                                     0.000000
NumberOfTime30-59DaysPastDueNotWorse    0.000000
DebtRatio                               0.000000
MonthlyIncome                           0.195601
NumberOfOpenCreditLinesAndLoans         0.000000
NumberOfTimes90DaysLate                 0.000000
NumberRealEstateLoansOrLines            0.000000
NumberOfTime60-89DaysPastDueNotWorse    0.000000
NumberOfDependents                      0.000000
dtype: float64
#Fill missing values with a random forest: treat the feature with missing values as the new label Y,
#and the remaining features plus the original label as the new feature matrix X;
#rows where the feature is present form the training set, rows where it is missing form the test set
def fill_missing_rf(X,y,to_fill):
    """
    Fill one feature's missing values with a random forest.
    Parameters:
    X: feature matrix to fill
    y: complete label column with no missing values
    to_fill: string, name of the column to fill
    """
    #build the new feature matrix and the new label
    df = X.copy()  # work on a copy of the feature matrix
    fill = df.loc[:,to_fill]  # the column to fill: some entries present, some NaN
    df = pd.concat([df.loc[:,df.columns != to_fill],pd.DataFrame(y)],axis=1)  # remaining features + original label = new feature matrix
    #split into training and test sets
    Ytrain = fill[fill.notnull()]  # non-missing entries become the training labels
    Ytest = fill[fill.isnull()]    # missing entries are the ones to predict
    Xtrain = df.iloc[Ytrain.index,:]  # rows of the new feature matrix matching the training labels
    Xtest = df.iloc[Ytest.index,:]    # rows matching the missing labels
    #fill the missing values with random forest regression
    from sklearn.ensemble import RandomForestRegressor as rfr
    rfr = rfr(n_estimators=100)
    rfr = rfr.fit(Xtrain, Ytrain)
    Ypredict = rfr.predict(Xtest)
    return Ypredict
data.head()
(the same first five rows as shown above, now indexed 0-4)
X = data.iloc[:,1:]
y = data["SeriousDlqin2yrs"] 
X.shape
(149391, 10)
y_pred = fill_missing_rf(X,y,"MonthlyIncome")  # fill MonthlyIncome with the random forest; if the result looks reasonable, overwrite the data
y_pred
array([0.15, 0.28, 0.12, ..., 0.24, 0.16, 0.  ])
y_pred.shape
(29221,)
data.loc[:,"MonthlyIncome"].isnull()
0         False
1         False
2         False
3         False
4         False
          ...  
149386    False
149387    False
149388     True
149389    False
149390    False
Name: MonthlyIncome, Length: 149391, dtype: bool
data.loc[data.loc[:,"MonthlyIncome"].isnull(),"MonthlyIncome"]  # select the entries whose MonthlyIncome is missing (row mask, column name)
6        NaN
8        NaN
16       NaN
32       NaN
41       NaN
          ..
149368   NaN
149369   NaN
149376   NaN
149384   NaN
149388   NaN
Name: MonthlyIncome, Length: 29221, dtype: float64
data.loc[data.loc[:,"MonthlyIncome"].isnull(),"MonthlyIncome"] = y_pred  # overwrite the missing entries with the predictions
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149391 entries, 0 to 149390
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      149391 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  149391 non-null  float64
 2   age                                   149391 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  149391 non-null  int64  
 4   DebtRatio                             149391 non-null  float64
 5   MonthlyIncome                         149391 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       149391 non-null  int64  
 7   NumberOfTimes90DaysLate               149391 non-null  int64  
 8   NumberRealEstateLoansOrLines          149391 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  149391 non-null  int64  
 10  NumberOfDependents                    149391 non-null  float64
dtypes: float64(4), int64(7)
memory usage: 12.5 MB

3.2.3 Handling outliers with descriptive statistics

#3.2.3 Handling outliers with descriptive statistics
data.describe([0.01,0.1,0.25,.5,.75,.9,.99]).T
| | count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeriousDlqin2yrs | 149391.0 | 0.066999 | 0.250021 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 149391.0 | 6.071087 | 250.263672 | 0.0 | 0.0 | 0.003199 | 0.030132 | 0.154235 | 0.556494 | 0.978007 | 1.093922 | 50708.0 |
| age | 149391.0 | 52.306237 | 14.725962 | 0.0 | 24.0 | 33.0 | 41.0 | 52.0 | 63.0 | 72.0 | 87.0 | 109.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 149391.0 | 0.393886 | 3.852953 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 98.0 |
| DebtRatio | 149391.0 | 354.436740 | 2041.843455 | 0.0 | 0.0 | 0.034991 | 0.177441 | 0.368234 | 0.875279 | 1275.0 | 4985.1 | 329664.0 |
| MonthlyIncome | 149391.0 | 5423.900169 | 13228.153266 | 0.0 | 0.0 | 0.17 | 1800.0 | 4424.0 | 7416.0 | 10800.0 | 23200.0 | 3008750.0 |
| NumberOfOpenCreditLinesAndLoans | 149391.0 | 8.480892 | 5.136515 | 0.0 | 0.0 | 3.0 | 5.0 | 8.0 | 11.0 | 15.0 | 24.0 | 58.0 |
| NumberOfTimes90DaysLate | 149391.0 | 0.238120 | 3.826165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 98.0 |
| NumberRealEstateLoansOrLines | 149391.0 | 1.022391 | 1.130196 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 4.0 | 54.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 149391.0 | 0.212503 | 3.810523 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 98.0 |
| NumberOfDependents | 149391.0 | 0.740393 | 1.108272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 | 20.0 |
data=data[data["age"]!=0]  # the minimum age is 0, which is not a valid borrower age; drop that row
data.shape
(149390, 11)
data[data.loc[:,"NumberOfTimes90DaysLate"] > 90]
(225 rows × 11 columns: in each of these rows the three past-due counters, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse, all hold the sentinel value 98, so they look like recording errors rather than real borrowers)

data=data[data.loc[:,"NumberOfTimes90DaysLate"] <90]
data.index=range(data.shape[0])  # reset the index
data.shape
(149165, 11)
data.describe([0.01,0.1,0.25,.5,.75,.9,.99]).T  # distribution at the given percentiles, transposed
| | count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeriousDlqin2yrs | 149165.0 | 0.066188 | 0.248612 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 149165.0 | 6.078770 | 250.453111 | 0.0 | 0.0 | 0.003174 | 0.030033 | 0.153615 | 0.553698 | 0.97502 | 1.094061 | 50708.0 |
| age | 149165.0 | 52.331076 | 14.714114 | 21.0 | 24.0 | 33.0 | 41.0 | 52.0 | 63.0 | 72.0 | 87.0 | 109.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 149165.0 | 0.246720 | 0.698935 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 13.0 |
| DebtRatio | 149165.0 | 354.963542 | 2043.344496 | 0.0 | 0.0 | 0.036385 | 0.178211 | 0.368619 | 0.876994 | 1277.3 | 4989.36 | 329664.0 |
| MonthlyIncome | 149165.0 | 5428.186556 | 13237.334090 | 0.0 | 0.0 | 0.17 | 1800.0 | 4436.0 | 7417.0 | 10800.0 | 23218.0 | 3008750.0 |
| NumberOfOpenCreditLinesAndLoans | 149165.0 | 8.493688 | 5.129841 | 0.0 | 1.0 | 3.0 | 5.0 | 8.0 | 11.0 | 15.0 | 24.0 | 58.0 |
| NumberOfTimes90DaysLate | 149165.0 | 0.090725 | 0.486354 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 17.0 |
| NumberRealEstateLoansOrLines | 149165.0 | 1.023927 | 1.130350 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 4.0 | 54.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 149165.0 | 0.065069 | 0.330675 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 11.0 |
| NumberOfDependents | 149165.0 | 0.740911 | 1.108534 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 | 20.0 |

3.2.4 Stay business-centric: keep the data in its original form, with no unit normalization and no standardizing of the distribution

3.2.5 Class imbalance

Here we use upsampling in logistic regression (adding minority-class samples to balance the labels).

X = data.iloc[:,1:]
y = data.iloc[:,0]
y
0         1
1         0
2         0
3         0
4         0
         ..
149160    0
149161    0
149162    0
149163    0
149164    0
Name: SeriousDlqin2yrs, Length: 149165, dtype: int64
y.value_counts()  # count each label value
0    139292
1      9873
Name: SeriousDlqin2yrs, dtype: int64
n_sample = X.shape[0]
n_1_sample = y.value_counts()[1]  # [1] gives the count of samples whose label is 1
n_0_sample = y.value_counts()[0]
print('n_samples: {}; class 1: {:.2%}; class 0: {:.2%}'.format(n_sample,n_1_sample/n_sample,n_0_sample/n_sample))
n_samples: 149165; class 1: 6.62%; class 0: 93.38%
#if the import fails, install the package first from the prompt: pip install imblearn
import imblearn
#imblearn is a library dedicated to imbalanced datasets; for this problem it performs much better than sklearn's options
#like sklearn, imblearn consists of classes that are instantiated and then fitted; the usage is similar
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)  # instantiate
X,y = sm.fit_resample(X,y)
n_sample_ = X.shape[0]
y.value_counts()
1    139292
0    139292
Name: SeriousDlqin2yrs, dtype: int64
n_1_sample = y.value_counts()[1]
n_0_sample = y.value_counts()[0]

print('n_samples: {}; class 1: {:.2%}; class 0: {:.2%}'.format(n_sample_,n_1_sample/n_sample_,n_0_sample/n_sample_))
n_samples: 278584; class 1: 50.00%; class 0: 50.00%
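Under the hood, SMOTE creates each synthetic minority sample by interpolating between a real minority sample and one of its minority-class nearest neighbours. A minimal NumPy sketch of that idea (an illustration, not imblearn's actual implementation):

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=42):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # k nearest, excluding itself
        j = rng.choice(neighbors)
        gap = rng.random()                  # random point on the connecting segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like(X_min, n_new=10)
print(synth.shape)  # (10, 2): every new point lies between existing minority samples
```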

3.2.6 Splitting into training and validation sets

#3.2.6 split into training data and a validation set
from sklearn.model_selection import train_test_split
X = pd.DataFrame(X)
y = pd.DataFrame(y)
X_train, X_vali, Y_train, Y_vali = train_test_split(X,y,test_size=0.3,random_state=420)
model_data = pd.concat([Y_train, X_train], axis=1)  # binning needs the label joined with the features, label as the first column
model_data
(model_data display: 195008 rows × 11 columns)

data.columns
Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents'],
      dtype='object')
model_data.index = range(model_data.shape[0])
model_data.columns = data.columns  # identical to data's columns, so nothing actually changes
model_data
(model_data after resetting the index: 195008 rows × 11 columns)

vali_data = pd.concat([Y_vali, X_vali], axis=1)
vali_data.index = range(vali_data.shape[0])
vali_data.columns = data.columns
model_data.to_csv(r"D:\class_file\day08_05\model_data.csv")
vali_data.to_csv(r"D:\class_file\day08_05\vali_data.csv")

3.3 Binning

3.3.1 Equal-frequency binning

model_data["age"]
0         53
1         63
2         39
3         73
4         53
          ..
195003    32
195004    50
195005    46
195006    64
195007    53
Name: age, Length: 195008, dtype: int64
model_data["qcut"],updown=pd.qcut(model_data["age"],retbins=True,q=20)
#pd.qcut performs equal-frequency binning on a 1-D column and returns the binned values
#retbins=True additionally returns the array of bin edges
#q: the number of bins
#dataframe["name"]: assigning to a column name that does not exist creates a new column with that name
model_data["qcut"]
0         (52.0, 54.0]
1         (61.0, 64.0]
2         (36.0, 39.0]
3         (68.0, 74.0]
4         (52.0, 54.0]
              ...     
195003    (31.0, 34.0]
195004    (48.0, 50.0]
195005    (45.0, 46.0]
195006    (61.0, 64.0]
195007    (52.0, 54.0]
Name: qcut, Length: 195008, dtype: category
Categories (20, interval[float64, right]): [(20.999, 28.0] < (28.0, 31.0] < (31.0, 34.0] < (34.0, 36.0] ... (61.0, 64.0] < (64.0, 68.0] < (68.0, 74.0] < (74.0, 107.0]]
updown.shape
(21,)
updown
array([ 21.,  28.,  31.,  34.,  36.,  39.,  41.,  43.,  45.,  46.,  48.,
        50.,  52.,  54.,  56.,  58.,  61.,  64.,  68.,  74., 107.])
model_data.head()
| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | qcut |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.015404 | 53 | 0 | 0.121802 | 4728.0 | 5 | 0 | 0 | 0 | 0.000000 | (52.0, 54.0] |
| 1 | 0 | 0.168311 | 63 | 0 | 0.141964 | 1119.0 | 5 | 0 | 0 | 0 | 0.000000 | (61.0, 64.0] |
| 2 | 1 | 1.063570 | 39 | 1 | 0.417663 | 3500.0 | 5 | 1 | 0 | 2 | 3.716057 | (36.0, 39.0] |
| 3 | 0 | 0.088684 | 73 | 0 | 0.522822 | 5301.0 | 11 | 0 | 2 | 0 | 0.000000 | (68.0, 74.0] |
| 4 | 1 | 0.622999 | 53 | 1 | 0.423650 | 13000.0 | 9 | 0 | 2 | 0 | 0.181999 | (52.0, 54.0] |
model_data["qcut"].value_counts()  # count the samples in each bin
(36.0, 39.0]      12647
(20.999, 28.0]    11786
(58.0, 61.0]      11386
(48.0, 50.0]      11104
(46.0, 48.0]      10968
(31.0, 34.0]      10867
(50.0, 52.0]      10529
(43.0, 45.0]      10379
(61.0, 64.0]      10197
(39.0, 41.0]       9768
(52.0, 54.0]       9726
(41.0, 43.0]       9682
(28.0, 31.0]       9498
(74.0, 107.0]      9110
(64.0, 68.0]       8917
(54.0, 56.0]       8713
(68.0, 74.0]       8655
(56.0, 58.0]       7887
(34.0, 36.0]       7521
(45.0, 46.0]       5668
Name: qcut, dtype: int64
count_y0=model_data[model_data["SeriousDlqin2yrs"]==0].groupby(by="qcut").count()["SeriousDlqin2yrs"]
#take the rows whose label is 0, group them by the "qcut" bin and count; [] extracts the target column
count_y0
qcut
(20.999, 28.0]    4243
(28.0, 31.0]      3571
(31.0, 34.0]      4075
(34.0, 36.0]      2908
(36.0, 39.0]      5182
(39.0, 41.0]      3956
(41.0, 43.0]      4002
(43.0, 45.0]      4389
(45.0, 46.0]      2419
(46.0, 48.0]      4813
(48.0, 50.0]      4900
(50.0, 52.0]      4728
(52.0, 54.0]      4681
(54.0, 56.0]      4677
(56.0, 58.0]      4483
(58.0, 61.0]      6583
(61.0, 64.0]      6968
(64.0, 68.0]      6623
(68.0, 74.0]      6753
(74.0, 107.0]     7737
Name: SeriousDlqin2yrs, dtype: int64
count_y1=model_data[model_data["SeriousDlqin2yrs"]==1].groupby(by="qcut").count()["SeriousDlqin2yrs"]
num_bins=[*zip(updown,updown[1:],count_y0,count_y1)]
#zip stops at the shortest sequence, pairing each bin as (lower edge, upper edge, count of 0s, count of 1s)
num_bins
[(21.0, 28.0, 4243, 7543),
 (28.0, 31.0, 3571, 5927),
 (31.0, 34.0, 4075, 6792),
 (34.0, 36.0, 2908, 4613),
 (36.0, 39.0, 5182, 7465),
 (39.0, 41.0, 3956, 5812),
 (41.0, 43.0, 4002, 5680),
 (43.0, 45.0, 4389, 5990),
 (45.0, 46.0, 2419, 3249),
 (46.0, 48.0, 4813, 6155),
 (48.0, 50.0, 4900, 6204),
 (50.0, 52.0, 4728, 5801),
 (52.0, 54.0, 4681, 5045),
 (54.0, 56.0, 4677, 4036),
 (56.0, 58.0, 4483, 3404),
 (58.0, 61.0, 6583, 4803),
 (61.0, 64.0, 6968, 3229),
 (64.0, 68.0, 6623, 2294),
 (68.0, 74.0, 6753, 1902),
 (74.0, 107.0, 7737, 1373)]

3.3.2 Ensure every bin contains both 0s and 1s (not done here)

3.3.3 Defining the WOE and IV functions

def get_woe(num_bins):
    columns = ["min","max","count_0","count_1"]
    df = pd.DataFrame(num_bins,columns=columns)
    #build a DataFrame from the bin tuples, then add derived columns
    df["total"] = df.count_0 + df.count_1
    df["percentage"] = df.total / df.total.sum()  # share of all samples that fall in this bin
    df["bad_rate"] = df.count_1 / df.total  # share of bad samples within this bin
    df["good%"] = df.count_0/df.count_0.sum()  # share of all good samples that fall in this bin
    df["bad%"] = df.count_1/df.count_1.sum()  # share of all bad samples that fall in this bin
    df["woe"] = np.log(df["good%"] / df["bad%"])  # weight of evidence: log ratio of good share to bad share
    return df
def get_iv(df):
    rate = df["good%"] - df["bad%"]
    iv = np.sum(rate * df.woe)
    return iv
#WOE (weight of evidence) measures the odds of default within a single bin
#IV (information value) measures how much information the feature, across all its bins, contributes to the model
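To make the formulas concrete, a tiny worked example with two hand-made bins (the counts are invented for illustration):

```python
import numpy as np
import pandas as pd

# two fake bins: (min, max, count_0, count_1)
toy_bins = [(20, 50, 90, 10), (50, 80, 30, 70)]
df = pd.DataFrame(toy_bins, columns=["min", "max", "count_0", "count_1"])
df["good%"] = df.count_0 / df.count_0.sum()  # share of all good samples in the bin
df["bad%"] = df.count_1 / df.count_1.sum()   # share of all bad samples in the bin
df["woe"] = np.log(df["good%"] / df["bad%"])
iv = np.sum((df["good%"] - df["bad%"]) * df["woe"])
print(df["woe"].round(3).tolist(), round(iv, 3))  # [1.792, -1.253] 1.903
```

The first bin is dominated by good samples (positive WOE), the second by bad samples (negative WOE), and the large IV says this toy feature separates the classes well.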

3.3.4 Chi-square test, merging bins, and plotting the IV curve

num_bins_=num_bins.copy()

import matplotlib.pyplot as plt
import scipy
x1=num_bins_[0][2:]
x1
(4243, 7543)
x2=num_bins_[1][2:]
x2
(3571, 5927)
scipy.stats.chi2_contingency([x1,x2])[0]
5.705081033738888
len(num_bins_)
20
pvs=[]
for i in range(len(num_bins_)-1):  # loop over the 19 adjacent pairs
    x1=num_bins_[i][2:]
    x2=num_bins_[i+1][2:]
    #chi2_contingency returns the chi-square statistic at [0] and the p-value at [1]
    #a large p-value means the two bins' label distributions are similar: merge them
    pv=scipy.stats.chi2_contingency([x1,x2])[1]
    pvs.append(pv)
len(pvs)
19
pvs.index(max(pvs))
1
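The merge criterion can also be reproduced by hand: for two adjacent bins the 2×2 table is [[good_i, bad_i], [good_j, bad_j]], and Pearson's statistic is χ² = Σ(O−E)²/E. A sketch without continuity correction (so the value differs slightly from `chi2_contingency`, which applies Yates' correction to 2×2 tables):

```python
import numpy as np

def pearson_chi2(table):
    """Pearson chi-square statistic for a contingency table."""
    obs = np.asarray(table, dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row @ col / obs.sum()  # expected counts under independence
    return ((obs - exp) ** 2 / exp).sum()

similar = pearson_chi2([[4243, 7543], [3571, 5927]])    # adjacent bins with close bad rates
different = pearson_chi2([[4243, 7543], [7737, 1373]])  # first bin vs last bin
print(similar < different)  # True: similar distributions give a smaller statistic
```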
num_bins_
[(21.0, 28.0, 4243, 7543),
 (28.0, 31.0, 3571, 5927),
 (31.0, 34.0, 4075, 6792),
 (34.0, 36.0, 2908, 4613),
 (36.0, 39.0, 5182, 7465),
 (39.0, 41.0, 3956, 5812),
 (41.0, 43.0, 4002, 5680),
 (43.0, 45.0, 4389, 5990),
 (45.0, 46.0, 2419, 3249),
 (46.0, 48.0, 4813, 6155),
 (48.0, 50.0, 4900, 6204),
 (50.0, 52.0, 4728, 5801),
 (52.0, 54.0, 4681, 5045),
 (54.0, 56.0, 4677, 4036),
 (56.0, 58.0, 4483, 3404),
 (58.0, 61.0, 6583, 4803),
 (61.0, 64.0, 6968, 3229),
 (64.0, 68.0, 6623, 2294),
 (68.0, 74.0, 6753, 1902),
 (74.0, 107.0, 7737, 1373)]
#demonstrate merging bins 7 and 8 into a single bin
num_bins_[7:9] = [(
num_bins_[7][0],
num_bins_[7+1][1],
num_bins_[7][2]+num_bins_[7+1][2],
num_bins_[7][3]+num_bins_[7+1][3])]
num_bins_
[(21.0, 28.0, 4243, 7543),
 (28.0, 31.0, 3571, 5927),
 (31.0, 34.0, 4075, 6792),
 (34.0, 36.0, 2908, 4613),
 (36.0, 39.0, 5182, 7465),
 (39.0, 41.0, 3956, 5812),
 (41.0, 43.0, 4002, 5680),
 (43.0, 46.0, 6808, 9239),
 (46.0, 48.0, 4813, 6155),
 (48.0, 50.0, 4900, 6204),
 (50.0, 52.0, 4728, 5801),
 (52.0, 54.0, 4681, 5045),
 (54.0, 56.0, 4677, 4036),
 (56.0, 58.0, 4483, 3404),
 (58.0, 61.0, 6583, 4803),
 (61.0, 64.0, 6968, 3229),
 (64.0, 68.0, 6623, 2294),
 (68.0, 74.0, 6753, 1902),
 (74.0, 107.0, 7737, 1373)]
len(num_bins_)
19
num_bins_ = num_bins.copy()
import matplotlib.pyplot as plt
import scipy
IV=[]
axisx=[]
while len(num_bins_)>2:  # keep merging until only two bins remain
    pvs=[]
    #compute the p-value for every pair of adjacent bins
    for i in range(len(num_bins_)-1):
        x1 = num_bins_[i][2:]
        x2 = num_bins_[i+1][2:]
        pv = scipy.stats.chi2_contingency([x1,x2])[1]
        pvs.append(pv)

    i = pvs.index(max(pvs))  # merge the adjacent pair with the largest p-value
    num_bins_[i:i+2] = [(
        num_bins_[i][0],
        num_bins_[i+1][1],
        num_bins_[i][2]+num_bins_[i+1][2],
        num_bins_[i][3]+num_bins_[i+1][3])]

    bins_df = get_woe(num_bins_)
    axisx.append(len(num_bins_))
    IV.append(get_iv(bins_df))

plt.figure()
plt.plot(axisx,IV)
plt.xticks(axisx)
plt.xlabel("number of bins")
plt.ylabel("IV")
plt.show()

(plot: IV versus the number of bins)

3.3.5 Binning with the best number of bins and validating the result

#from the IV curve, 7 bins is judged best
#define the binning function
def get_bin(num_bins_,n):
    while len(num_bins_)>n:  # keep computing pairwise p-values and merging until n bins remain
        pvs=[]
        for i in range(len(num_bins_)-1):
            x1 = num_bins_[i][2:]
            x2 = num_bins_[i+1][2:]
            pv=scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)

        i=pvs.index(max(pvs))  # merge the adjacent pair with the largest p-value
        num_bins_[i:i+2]=[(
            num_bins_[i][0],
            num_bins_[i+1][1],
            num_bins_[i][2]+num_bins_[i+1][2],
            num_bins_[i][3]+num_bins_[i+1][3])]
    return num_bins_
num_bins_ = num_bins.copy()  # copy so the original bins are not overwritten
afterbins=get_bin(num_bins_,7)
afterbins
[(21.0, 36.0, 14797, 24875),
 (36.0, 46.0, 19948, 28196),
 (46.0, 54.0, 19122, 23205),
 (54.0, 61.0, 15743, 12243),
 (61.0, 64.0, 6968, 3229),
 (64.0, 74.0, 13376, 4196),
 (74.0, 107.0, 7737, 1373)]
bins_df=get_woe(afterbins)
bins_df
#woe should be monotonic across bins (or have at most one turning point)
#bad_rate should differ clearly between bins
| | min | max | count_0 | count_1 | total | percentage | bad_rate | good% | bad% | woe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21.0 | 36.0 | 14797 | 24875 | 39672 | 0.203438 | 0.627017 | 0.151467 | 0.255608 | -0.523275 |
| 1 | 36.0 | 46.0 | 19948 | 28196 | 48144 | 0.246882 | 0.585660 | 0.204195 | 0.289734 | -0.349887 |
| 2 | 46.0 | 54.0 | 19122 | 23205 | 42327 | 0.217053 | 0.548232 | 0.195740 | 0.238448 | -0.197364 |
| 3 | 54.0 | 61.0 | 15743 | 12243 | 27986 | 0.143512 | 0.437469 | 0.161151 | 0.125805 | 0.247606 |
| 4 | 61.0 | 64.0 | 6968 | 3229 | 10197 | 0.052290 | 0.316662 | 0.071327 | 0.033180 | 0.765320 |
| 5 | 64.0 | 74.0 | 13376 | 4196 | 17572 | 0.090109 | 0.238789 | 0.136922 | 0.043117 | 1.155495 |
| 6 | 74.0 | 107.0 | 7737 | 1373 | 9110 | 0.046716 | 0.150714 | 0.079199 | 0.014109 | 1.725180 |
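Once the final bins are fixed, each raw age can be replaced by its bin's WOE via `pd.cut`; a minimal sketch using the edges and WOE values from the table above, with the outer edges widened to ±inf so that every possible age is covered:

```python
import numpy as np
import pandas as pd

# bin edges and WOE values taken from the age binning result above
edges = [-np.inf, 36, 46, 54, 61, 64, 74, np.inf]
woe_values = [-0.523275, -0.349887, -0.197364, 0.247606, 0.765320, 1.155495, 1.725180]

ages = pd.Series([25, 40, 70, 90])
# assign each age to its interval and label the interval with its WOE
age_woe = pd.cut(ages, bins=edges, labels=woe_values).astype(float)
print(age_woe.tolist())  # 25 gets the first bin's WOE, 90 the last bin's
```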

3.3.6 Wrapping the best-bin-count selection into a function

def graphforbestbin(DF, X, Y, n=5,q=20,graph=True):
    """
    Automatic optimal binning based on the chi-square test.
    Parameters:
    DF: input data
    X: name of the column to bin
    Y: name of the label column
    n: number of bins to keep
    q: number of initial bins
    graph: whether to plot the IV curve
    Intervals are open on the left and closed on the right: (]
    """
    bins_df=[] 
    DF = DF[[X,Y]].copy()
    DF["qcut"],bins = pd.qcut(DF[X], retbins=True, q=q,duplicates="drop")
    coount_y0 = DF.loc[DF[Y]==0].groupby(by="qcut").count()[Y]
    coount_y1 = DF.loc[DF[Y]==1].groupby(by="qcut").count()[Y]
    num_bins = [*zip(bins,bins[1:],coount_y0,coount_y1)]
    #merge any bin that contains no 0s or no 1s into a neighbouring bin
    for i in range(q):
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2]+num_bins[1][2],
                    num_bins[0][3]+num_bins[1][3])]
            continue

        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i-1:i+1] = [(
                    num_bins[i-1][0],
                    num_bins[i][1],
                    num_bins[i-1][2]+num_bins[i][2],
                    num_bins[i-1][3]+num_bins[i][3])]
                break
        else:
            break
    def get_woe(num_bins):
        columns = ["min","max","count_0","count_1"]
        df = pd.DataFrame(num_bins,columns=columns)
        df["total"] = df.count_0 + df.count_1
        df["percentage"] = df.total / df.total.sum()
        df["bad_rate"] = df.count_1 / df.total
        df["good%"] = df.count_0/df.count_0.sum()
        df["bad%"] = df.count_1/df.count_1.sum()
        df["woe"] = np.log(df["good%"] / df["bad%"])
        return df
    def get_iv(df):
        rate = df["good%"] - df["bad%"]
        iv = np.sum(rate * df.woe)
        return iv
    IV = []
    axisx = []
    while len(num_bins) > n:
        pvs = []
        for i in range(len(num_bins)-1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)
        i = pvs.index(max(pvs))
        num_bins[i:i+2] = [(
            num_bins[i][0],
            num_bins[i+1][1],
            num_bins[i][2]+num_bins[i+1][2],
            num_bins[i][3]+num_bins[i+1][3])]
        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        IV.append(get_iv(bins_df))
    if graph: 
        plt.figure()
        plt.plot(axisx,IV) 
        plt.xticks(axisx)
        plt.xlabel("number of bins")
        plt.ylabel("IV")
        plt.show() 
    return bins_df

3.3.7 Selecting bins for every feature

model_data.columns
Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents', 'qcut'],
      dtype='object')
for i in model_data.columns[1:-1]:
    print(i)
    graphforbestbin(model_data,i,"SeriousDlqin2yrs",n=2,q=20)
RevolvingUtilizationOfUnsecuredLines

age

NumberOfTime30-59DaysPastDueNotWorse

DebtRatio

MonthlyIncome

NumberOfOpenCreditLinesAndLoans

NumberOfTimes90DaysLate

NumberRealEstateLoansOrLines

NumberOfTime60-89DaysPastDueNotWorse

NumberOfDependents

(an IV curve is plotted for each feature)

model_data.describe([0.01,0.1,0.25,.5,.75,.9,.99]).T
| | count | mean | std | min | 1% | 10% | 25% | 50% | 75% | 90% | 99% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeriousDlqin2yrs | 195008.0 | 0.499041 | 0.500000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 195008.0 | 4.716073 | 151.934558 | 0.0 | 0.0 | 0.015492 | 0.099128 | 0.465003 | 0.874343 | 1.0 | 1.353400 | 18300.0 |
| age | 195008.0 | 49.054131 | 13.916980 | 21.0 | 24.0 | 31.0 | 39.0 | 48.0 | 58.0 | 68.0 | 84.0 | 107.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 195008.0 | 0.422429 | 0.869693 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 | 13.0 |
| DebtRatio | 195008.0 | 332.471143 | 1845.205520 | 0.0 | 0.0 | 0.076050 | 0.208088 | 0.401763 | 0.848021 | 994.0 | 5147.758259 | 329664.0 |
| MonthlyIncome | 195008.0 | 5151.967974 | 11369.838287 | 0.0 | 0.0 | 0.28 | 2000.0 | 4167.0 | 6900.0 | 10240.0 | 22000.0 | 3008750.0 |
| NumberOfOpenCreditLinesAndLoans | 195008.0 | 7.989144 | 5.015858 | 0.0 | 0.0 | 2.0 | 4.0 | 7.0 | 11.0 | 15.0 | 23.0 | 57.0 |
| NumberOfTimes90DaysLate | 195008.0 | 0.224883 | 0.718839 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 17.0 |
| NumberRealEstateLoansOrLines | 195008.0 | 0.878108 | 1.124510 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 5.0 | 32.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 195008.0 | 0.117472 | 0.423076 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 |
| NumberOfDependents | 195008.0 | 0.837991 | 1.081541 | 0.0 | 0.0 | 0.0 | 0.0 | 0.258888 | 1.434680 | 2.32316 | 4.0 | 20.0 |
#为什么有空图或直线图?可分类目标分类数不够
  
#Features suitable for automatic binning, with the target number of bins for each
auto_col_bins = {"RevolvingUtilizationOfUnsecuredLines":6,
                 "age":5,
                 "DebtRatio":4,
                 "MonthlyIncome":3,
                 "NumberOfOpenCreditLinesAndLoans":5}
#Features that cannot be auto-binned: set their bin edges by hand
hand_bins = {"NumberOfTime30-59DaysPastDueNotWorse":[0,1,2,13]
             ,"NumberOfTimes90DaysLate":[0,1,2,17]
             ,"NumberRealEstateLoansOrLines":[0,1,2,4,54]
             ,"NumberOfTime60-89DaysPastDueNotWorse":[0,1,2,8]
             ,"NumberOfDependents":[0,1,2,3]}
#To guarantee the intervals cover everything, replace each feature's max with np.inf and its min with -np.inf
hand_bins = {k:[-np.inf,*v[1:-1],np.inf] for k,v in hand_bins.items()}
auto_col_bins
{'RevolvingUtilizationOfUnsecuredLines': 6,
 'age': 5,
 'DebtRatio': 4,
 'MonthlyIncome': 3,
 'NumberOfOpenCreditLinesAndLoans': 5}
hand_bins.items()#view the (key, value) pairs
dict_items([('NumberOfTime30-59DaysPastDueNotWorse', [-inf, 1, 2, inf]), ('NumberOfTimes90DaysLate', [-inf, 1, 2, inf]), ('NumberRealEstateLoansOrLines', [-inf, 1, 2, 4, inf]), ('NumberOfTime60-89DaysPastDueNotWorse', [-inf, 1, 2, inf]), ('NumberOfDependents', [-inf, 1, 2, inf])])
hand_bins
{'NumberOfTime30-59DaysPastDueNotWorse': [-inf, 1, 2, inf],
 'NumberOfTimes90DaysLate': [-inf, 1, 2, inf],
 'NumberRealEstateLoansOrLines': [-inf, 1, 2, 4, inf],
 'NumberOfTime60-89DaysPastDueNotWorse': [-inf, 1, 2, inf],
 'NumberOfDependents': [-inf, 1, 2, inf]}
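Replacing the outermost edges with ±inf matters because `pd.cut` leaves any value outside the given edges as NaN. A minimal sketch (with made-up values) of the difference:

```python
import numpy as np
import pandas as pd

vals = pd.Series([0, 1, 5, 30])                  # 30 exceeds the hand-set max of 13
finite = pd.cut(vals, [0, 1, 2, 13])             # finite edges: out-of-range -> NaN
covered = pd.cut(vals, [-np.inf, 1, 2, np.inf])  # +/-inf edges cover every value

print(finite.isna().sum())   # 2: 0 falls on the open left edge, 30 is beyond 13
print(covered.isna().sum())  # 0
```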
bins_of_col = {}
# Generate the auto-binned intervals (and their IV values) for each feature
for col in auto_col_bins:
    #arguments: data, column to bin, label column
    bins_df = graphforbestbin(model_data,col
                             ,"SeriousDlqin2yrs"
                             ,n=auto_col_bins[col]
                             #look up this feature's target bin count in the dict
                             ,q=20
                             ,graph=False)
    #set removes duplicates, sorted orders them, union merges the two sets:
    #merge every bin's min and max edges, then dedupe and sort
    bins_list = sorted(set(bins_df["min"]).union(bins_df["max"]))
    #guarantee coverage: replace the smallest edge with -np.inf and the largest with np.inf
    bins_list[0],bins_list[-1] = -np.inf,np.inf
    bins_of_col[col] = bins_list
bins_of_col
{'RevolvingUtilizationOfUnsecuredLines': [-inf,
  0.09912842857080169,
  0.29777151691114556,
  0.4650029378240421,
  0.9824623886712799,
  0.9999999,
  inf],
 'age': [-inf, 36.0, 54.0, 61.0, 74.0, inf],
 'DebtRatio': [-inf,
  0.0174953101,
  0.5034625055732768,
  1.4722640672420035,
  inf],
 'MonthlyIncome': [-inf, 0.1, 5594.0, inf],
 'NumberOfOpenCreditLinesAndLoans': [-inf, 1.0, 3.0, 5.0, 17.0, inf]}
#Merge in the hand-set bins
bins_of_col.update(hand_bins)
bins_of_col
{'RevolvingUtilizationOfUnsecuredLines': [-inf,
  0.09912842857080169,
  0.29777151691114556,
  0.4650029378240421,
  0.9824623886712799,
  0.9999999,
  inf],
 'age': [-inf, 36.0, 54.0, 61.0, 74.0, inf],
 'DebtRatio': [-inf,
  0.0174953101,
  0.5034625055732768,
  1.4722640672420035,
  inf],
 'MonthlyIncome': [-inf, 0.1, 5594.0, inf],
 'NumberOfOpenCreditLinesAndLoans': [-inf, 1.0, 3.0, 5.0, 17.0, inf],
 'NumberOfTime30-59DaysPastDueNotWorse': [-inf, 1, 2, inf],
 'NumberOfTimes90DaysLate': [-inf, 1, 2, inf],
 'NumberRealEstateLoansOrLines': [-inf, 1, 2, 4, inf],
 'NumberOfTime60-89DaysPastDueNotWorse': [-inf, 1, 2, inf],
 'NumberOfDependents': [-inf, 1, 2, inf]}

3.4 Compute the WOE of each bin and map it onto the data

data = model_data.copy()
#pd.cut assigns each value to a bin, given the full list of bin edges
#signature: pd.cut(data, bin_edges)
data = data[["age","SeriousDlqin2yrs"]].copy()
data["cut"] = pd.cut(data["age"],[-np.inf, 48.49986200790144, 58.757170160044694, 64.0, 74.0, np.inf])
data
|  | age | SeriousDlqin2yrs | cut |
| --- | --- | --- | --- |
| 0 | 53 | 0 | (48.5, 58.757] |
| 1 | 63 | 0 | (58.757, 64.0] |
| 2 | 39 | 1 | (-inf, 48.5] |
| 3 | 73 | 0 | (64.0, 74.0] |
| 4 | 53 | 1 | (48.5, 58.757] |
| ... | ... | ... | ... |
| 195003 | 32 | 1 | (-inf, 48.5] |
| 195004 | 50 | 1 | (48.5, 58.757] |
| 195005 | 46 | 1 | (-inf, 48.5] |
| 195006 | 64 | 0 | (58.757, 64.0] |
| 195007 | 53 | 1 | (48.5, 58.757] |

195008 rows × 3 columns

data.groupby("cut")["SeriousDlqin2yrs"].value_counts()
#bin the data by the given edges, then count each label within every bin
cut             SeriousDlqin2yrs
(-inf, 48.5]    1                   59226
                0                   39558
(48.5, 58.757]  1                   24490
                0                   23469
(58.757, 64.0]  0                   13551
                1                    8032
(64.0, 74.0]    0                   13376
                1                    4196
(74.0, inf]     0                    7737
                1                    1373
Name: SeriousDlqin2yrs, dtype: int64
data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
#unstack pivots the inner index level (the label values) into columns
| cut | 0 | 1 |
| --- | --- | --- |
| (-inf, 48.5] | 39558 | 59226 |
| (48.5, 58.757] | 23469 | 24490 |
| (58.757, 64.0] | 13551 | 8032 |
| (64.0, 74.0] | 13376 | 4196 |
| (74.0, inf] | 7737 | 1373 |
bins_df = data.groupby("cut")["SeriousDlqin2yrs"].value_counts().unstack()
bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
bins_df
| cut | 0 | 1 | woe |
| --- | --- | --- | --- |
| (-inf, 48.5] | 39558 | 59226 | -0.407428 |
| (48.5, 58.757] | 23469 | 24490 | -0.046420 |
| (58.757, 64.0] | 13551 | 8032 | 0.519191 |
| (64.0, 74.0] | 13376 | 4196 | 1.155495 |
| (74.0, inf] | 7737 | 1373 | 1.725180 |
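In words, a bin's WOE is the log ratio of its share of good (label 0) samples to its share of bad (label 1) samples. A quick numeric check of the formula, using the age counts from the table above:

```python
import numpy as np
import pandas as pd

# Label counts per age bin, copied from the table above
# (column 0 = non-default, column 1 = default)
counts = pd.DataFrame(
    {0: [39558, 23469, 13551, 13376, 7737],
     1: [59226, 24490, 8032, 4196, 1373]},
    index=["(-inf, 48.5]", "(48.5, 58.757]", "(58.757, 64.0]",
           "(64.0, 74.0]", "(74.0, inf]"])

# WOE_i = ln( (good_i / good_total) / (bad_i / bad_total) )
woe = np.log((counts[0] / counts[0].sum()) / (counts[1] / counts[1].sum()))
print(woe.round(6))  # first bin about -0.407428, last bin about 1.725180, as in the table
```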
def get_woe(df,col,y,bins):
    df = df[[col,y]].copy()
    df["cut"] = pd.cut(df[col],bins)
    bins_df = df.groupby("cut")[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
    return woe
#Store every feature's WOE series in the dict woeall
woeall = {}
for col in bins_of_col:
    woeall[col] = get_woe(model_data,col,"SeriousDlqin2yrs",bins_of_col[col])
woeall
#only the WOE column is returned; the raw 0/1 count columns are not shown
{'RevolvingUtilizationOfUnsecuredLines': cut
 (-inf, 0.0991]     2.205113
 (0.0991, 0.298]    0.665610
 (0.298, 0.465]    -0.127577
 (0.465, 0.982]    -1.073125
 (0.982, 1.0]      -0.471851
 (1.0, inf]        -2.040304
 dtype: float64,
 'age': cut
 (-inf, 36.0]   -0.523275
 (36.0, 54.0]   -0.278138
 (54.0, 61.0]    0.247606
 (61.0, 74.0]    1.004098
 (74.0, inf]     1.725180
 dtype: float64,
 'DebtRatio': cut
 (-inf, 0.0175]     1.521696
 (0.0175, 0.503]   -0.011220
 (0.503, 1.472]    -0.472690
 (1.472, inf]       0.175196
 dtype: float64,
 'MonthlyIncome': cut
 (-inf, 0.1]      1.348333
 (0.1, 5594.0]   -0.238024
 (5594.0, inf]    0.232036
 dtype: float64,
 'NumberOfOpenCreditLinesAndLoans': cut
 (-inf, 1.0]   -0.842253
 (1.0, 3.0]    -0.331046
 (3.0, 5.0]    -0.055325
 (5.0, 17.0]    0.123566
 (17.0, inf]    0.464595
 dtype: float64,
 'NumberOfTime30-59DaysPastDueNotWorse': cut
 (-inf, 1.0]    0.133757
 (1.0, 2.0]    -1.377944
 (2.0, inf]    -1.548467
 dtype: float64,
 'NumberOfTimes90DaysLate': cut
 (-inf, 1.0]    0.088506
 (1.0, 2.0]    -2.256812
 (2.0, inf]    -2.414750
 dtype: float64,
 'NumberRealEstateLoansOrLines': cut
 (-inf, 1.0]   -0.146831
 (1.0, 2.0]     0.620994
 (2.0, 4.0]     0.388716
 (4.0, inf]    -0.291518
 dtype: float64,
 'NumberOfTime60-89DaysPastDueNotWorse': cut
 (-inf, 1.0]    0.028093
 (1.0, 2.0]    -1.779675
 (2.0, inf]    -1.827914
 dtype: float64,
 'NumberOfDependents': cut
 (-inf, 1.0]    0.202748
 (1.0, 2.0]    -0.531219
 (2.0, inf]    -0.477951
 dtype: float64}
#Build a new frame in which every raw value is replaced by its bin's WOE
#keep the original index
model_woe = pd.DataFrame(index=model_data.index)
#bin the raw column by its edges, then map each bin to its WOE value
model_woe["age"] = pd.cut(model_data["age"],bins_of_col["age"]).map(woeall["age"])
#repeat for every feature (the loop also redoes "age")
for col in bins_of_col:
    model_woe[col] = pd.cut(model_data[col],bins_of_col[col]).map(woeall[col])
#append the label
model_woe["SeriousDlqin2yrs"] = model_data["SeriousDlqin2yrs"]
model_woe.head()
|  | age | RevolvingUtilizationOfUnsecuredLines | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTime30-59DaysPastDueNotWorse | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | SeriousDlqin2yrs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.278138 | 2.205113 | -0.011220 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 1 | 1.004098 | 0.665610 | -0.011220 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 2 | -0.278138 | -2.040304 | -0.011220 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | -1.779675 | -0.477951 | 1 |
| 3 | 1.004098 | 2.205113 | -0.472690 | -0.238024 | 0.123566 | 0.133757 | 0.088506 | 0.620994 | 0.028093 | 0.202748 | 0 |
| 4 | -0.278138 | -1.073125 | -0.011220 | 0.232036 | 0.123566 | 0.133757 | 0.088506 | 0.620994 | 0.028093 | 0.202748 | 1 |
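The `pd.cut(...).map(...)` pattern works because the intervals produced by `pd.cut` are identical to the interval index of the WOE series, so `.map` can look each bin up directly. A small self-contained sketch (the WOE numbers here are made up, rounded from the age table):

```python
import numpy as np
import pandas as pd

ages = pd.Series([30, 45, 70])
bins = [-np.inf, 36, 54, 61, 74, np.inf]
# Hypothetical WOE per interval, keyed by the same intervals pd.cut produces
woe = pd.Series([-0.52, -0.28, 0.25, 1.00, 1.73],
                index=pd.IntervalIndex.from_breaks(bins))

encoded = pd.cut(ages, bins).map(woe)  # each age replaced by its bin's WOE
print(encoded.tolist())                # [-0.52, -0.28, 1.0]
```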

3.5 Modeling and model validation

#3.5 Modeling and model validation
#apply the same WOE encoding to the validation set
vali_woe = pd.DataFrame(index=vali_data.index)
for col in bins_of_col:
    vali_woe[col] = pd.cut(vali_data[col],bins_of_col[col]).map(woeall[col])
vali_woe["SeriousDlqin2yrs"] = vali_data["SeriousDlqin2yrs"]
vali_woe.head()
|  | RevolvingUtilizationOfUnsecuredLines | age | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTime30-59DaysPastDueNotWorse | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | SeriousDlqin2yrs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.205113 | 0.247606 | 1.521696 | -0.238024 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 1 | -1.073125 | -0.278138 | -0.011220 | 0.232036 | 0.123566 | 0.133757 | 0.088506 | 0.620994 | 0.028093 | -0.477951 | 1 |
| 2 | 2.205113 | 1.004098 | -0.011220 | 0.232036 | -0.055325 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 3 | 2.205113 | -0.278138 | -0.011220 | -0.238024 | 0.123566 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 0 |
| 4 | -1.073125 | -0.278138 | -0.011220 | -0.238024 | 0.123566 | 0.133757 | 0.088506 | -0.146831 | 0.028093 | 0.202748 | 1 |
#training set
X = model_woe.iloc[:,:-1]
y = model_woe.iloc[:,-1]
#validation set
vali_X = vali_woe.iloc[:,:-1]
vali_y = vali_woe.iloc[:,-1]
from sklearn.linear_model import LogisticRegression as LR
lr = LR().fit(X,y)
lr.score(vali_X,vali_y)
D:\py1.1\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.

  warnings.warn(message, FutureWarning)
0.7587824255767206
import matplotlib.pyplot as plt
c_1 = np.linspace(0.01,1,20)   #coarse grid for a first pass
c_2 = np.linspace(0.01,0.2,20) #finer grid around the promising region
score = []
for i in c_2: 
    lr = LR(solver='liblinear',C=i).fit(X,y)
    score.append(lr.score(vali_X,vali_y))
plt.figure()
plt.plot(c_2,score)
plt.show()
#The FutureWarning below is raised because the validation columns are not in the same order as the columns seen at fit time
D:\py1.1\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.

  warnings.warn(message, FutureWarning)

*(figure output_106_1.png: validation accuracy as a function of C; image not available)*

lr.n_iter_
array([5], dtype=int32)
score = []
for i in [1,2,3,4,5,6]: 
    lr = LR(solver='liblinear',C=0.025,max_iter=i).fit(X,y)
    score.append(lr.score(vali_X,vali_y))
plt.figure()
plt.plot([1,2,3,4,5,6],score)
plt.show()
D:\py1.1\lib\site-packages\sklearn\svm\_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
D:\py1.1\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.

  warnings.warn(message, FutureWarning)

*(figure output_108_1.png: validation accuracy as a function of max_iter; image not available)*

import scikitplot as skplt
#%%cmd
#pip install scikit-plot
vali_proba_df = pd.DataFrame(lr.predict_proba(vali_X))
skplt.metrics.plot_roc(vali_y, vali_proba_df,
                       plot_micro=False,figsize=(6,6),
                       plot_macro=False)
#the ROC curve shows how many other-class samples are misjudged while the model captures the minority class
D:\py1.1\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.

  warnings.warn(message, FutureWarning)

<AxesSubplot:title={'center':'ROC Curves'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>

*(figure output_109_2.png: ROC curve on the validation set; image not available)*
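scikit-plot is a third-party helper; if it is unavailable, roughly the same curve can be drawn with sklearn.metrics alone. A sketch on synthetic stand-in data (in the notebook you would pass lr's predicted probabilities for vali_X and the true vali_y instead):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in for the WOE-encoded validation data
rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 3)
y_demo = (X_demo[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_demo, proba)
auc = roc_auc_score(y_demo, proba)

plt.plot(fpr, tpr, label="AUC = %.3f" % auc)
plt.plot([0, 1], [0, 1], "--")           # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_demo.png")
```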

3.6 Building the scorecard

B = 20/np.log(2)          #PDO = 20: points needed to double the odds
A = 600 + B*np.log(1/60)  #anchor: a score of 600 at odds of 1:60
B,A
(28.85390081777927, 481.8621880878296)
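These two constants implement the standard scorecard scaling Score = A - B*ln(odds): B is chosen so that the score drops by PDO = 20 points whenever the odds of default double, and A anchors a score of 600 at odds of 1:60. A small sketch:

```python
import numpy as np

PDO = 20            # "points to double the odds"
base_point = 600    # score assigned at the anchor odds
base_odds = 1 / 60  # anchor odds of default

B = PDO / np.log(2)
A = base_point + B * np.log(base_odds)

def odds_to_score(odds):
    """Score = A - B * ln(odds): lower odds of default gives a higher score."""
    return A - B * np.log(odds)

print(B, A)                           # matches the (B, A) pair above
print(round(odds_to_score(1/60), 6))  # 600.0 at the anchor odds
print(round(odds_to_score(2/60), 6))  # 580.0: doubling the odds costs PDO points
```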
lr.intercept_
array([-0.00672073])
base_score = A - B*lr.intercept_
base_score
array([482.05610739])
model_data.head()
|  | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | qcut |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.015404 | 53 | 0 | 0.121802 | 4728.0 | 5 | 0 | 0 | 0 | 0.000000 | (52.0, 54.0] |
| 1 | 0 | 0.168311 | 63 | 0 | 0.141964 | 1119.0 | 5 | 0 | 0 | 0 | 0.000000 | (61.0, 64.0] |
| 2 | 1 | 1.063570 | 39 | 1 | 0.417663 | 3500.0 | 5 | 1 | 0 | 2 | 3.716057 | (36.0, 39.0] |
| 3 | 0 | 0.088684 | 73 | 0 | 0.522822 | 5301.0 | 11 | 0 | 2 | 0 | 0.000000 | (68.0, 74.0] |
| 4 | 1 | 0.622999 | 53 | 0 | 0.423650 | 13000.0 | 9 | 0 | 2 | 0 | 0.181999 | (52.0, 54.0] |
woeall  #same WOE dictionary as displayed in section 3.4
#Score for each "age" bin = WOE * (-B * the model coefficient for "age")
#Careful: the index into lr.coef_[0] must match the position of "age" in X.columns
score_age = woeall["age"] * (-B*lr.coef_[0][1])
score_age
cut
(-inf, 36.0]   -12.517538
(36.0, 54.0]    -6.653503
(54.0, 61.0]     5.923113
(61.0, 74.0]    24.019569
(74.0, inf]     41.268981
dtype: float64
lr.coef_[0]
array([-0.40528794, -0.8290577 , -0.64610093, -0.61661619, -0.37484361,
       -0.60937781, -0.5959787 , -0.80402665, -0.30654535, -0.55908073])
file = r"D:\class_file\day08_05\ScoreData.csv"  #raw string so the backslashes are not treated as escapes
#open(path, mode): "w" truncates and writes, "r" reads, "a" appends
#First write the base score,
#then loop over the features: turn each feature's WOE series into a
#(bin, score) block and append it to the file
with open(file,"w") as fdata:
    fdata.write("base_score,{}\n".format(base_score))
for i,col in enumerate(X.columns):
    score = woeall[col] * (-B*lr.coef_[0][i])
    score.name = "Score"
    score.index.name = col
    score.to_csv(file,header=True,mode="a")
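The CSV written above holds the base score plus one (bin, score) block per feature. To score a new applicant, add the base score to the score of the bin each of their feature values falls into. A sketch with a hypothetical two-feature card (the numbers are made up; the real card read back from ScoreData.csv has all ten features):

```python
import numpy as np
import pandas as pd

# Hypothetical two-feature scorecard; real values come from ScoreData.csv above
base = 482.06
card = {
    "age": pd.Series(
        [-12.5, -6.7, 5.9, 24.0, 41.3],
        index=pd.IntervalIndex.from_breaks([-np.inf, 36, 54, 61, 74, np.inf])),
    "MonthlyIncome": pd.Series(
        [24.0, -4.2, 4.1],
        index=pd.IntervalIndex.from_breaks([-np.inf, 0.1, 5594, np.inf])),
}

def score_applicant(applicant):
    """Total score = base score + the bin score each feature value falls into."""
    total = base
    for col, scores in card.items():
        total += scores.loc[applicant[col]]  # interval lookup: which bin holds the value
    return total

print(round(score_applicant({"age": 58, "MonthlyIncome": 6500}), 2))  # 492.06
```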