Data Preparation
Goal: build a scorecard that improves credit scoring by predicting the probability that a person will experience financial distress within the next two years, helping lenders make better decisions.
Preparation: in a personal-credit setting, fix the definition of default. Under the Basel II Capital Accord, a delinquency of 90 or more days counts as default; as the discriminating indicator we use the historical maximum number of days past due.
Data Acquisition and Integration
Data source: the data come from Kaggle's cs-training.csv file, which contains 150,000 credit-related records. Download: https://www.kaggle.com/c/GiveMeSomeCredit/data (note: a login is required to download; I used a Yahoo account).
Data Description
The data cover personal consumer loans. Considering only fields that will actually be available when the scorecard is deployed, data should be drawn from the following areas:
- Basic attributes: the borrower's age at the time
- Repayment capacity: the borrower's monthly income and debt ratio
- Credit history: number of times 30-59 days past due, 60-89 days past due, and 90+ days past due within the last two years
- Assets: number of open credit lines and loans, number of real-estate loans or lines
- Loan attributes: none available
- Other factors: number of dependents (excluding the borrower)
Original Variables
| Variable | Type | Description |
| --- | --- | --- |
| SeriousDlqin2yrs | Y/N | 90 days past due or worse |
| RevolvingUtilizationOfUnsecuredLines | percentage | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits |
| Age | integer | Borrower's age at the time |
| NumberOfTime30-59DaysPastDueNotWorse | integer | Number of times 30-59 days past due but no worse |
| DebtRatio | percentage | Debt ratio |
| MonthlyIncome | real | Monthly income |
| NumberOfOpenCreditLinesAndLoans | integer | Number of open loans (installment, e.g. car loans or mortgages) and lines of credit (e.g. credit cards) |
| NumberOfTimes90DaysLate | integer | Number of times the borrower has been 90 or more days past due |
| NumberRealEstateLoansOrLines | integer | Number of mortgage and real-estate loans, including home-equity lines of credit |
| NumberOfTime60-89DaysPastDueNotWorse | integer | Number of times 60-89 days past due but no worse within the last two years |
| NumberOfDependents | integer | Number of dependents, excluding the borrower |
Time windows: the observation window for the predictors is the past two years; the performance window for the response is the next two years.
Data Processing
Importing the Data
Import the cs-training.csv file downloaded from Kaggle, drop the index variable (the first column, X), and rename the variables: the response SeriousDlqin2yrs becomes y, and the predictors become x1 through x10.
> # Import the dataset
> traindata <- read.csv('C:\\Users\\yitian.z\\Desktop\\Machine Learning\\dataset\\kaggle\\cs-training.csv')
>
> # Drop the first column (the X index)
> traindata <- traindata[,-1]
> # class(traindata)
>
> # Rename the dataset's variables
> names(traindata) <- c("y", "x1" ,"x2", "x3", "x4", "x5",
+ "x6", "x7", "x8", "x9", "x10")
Missing-Value Analysis and Treatment
Once the data are loaded, the first step is to examine their distribution. Many models are sensitive to missing values, so checking for them is an important part of the analysis. Before the formal analysis begins, a plot gives a quick visual sense of which fields contain missing values.
> library(VIM)
> # Visualize the missing-value pattern
> matrixplot(traindata)
The matrixplot function visualizes the missing data: in the plot above, light cells are small values, dark cells are large values, and missing values are shown in red by default. The plot shows that x5 and x10 — MonthlyIncome and NumberOfDependents — contain missing values.
> # install.packages("mice")
> library(mice)
> md.pattern(traindata)
y x1 x2 x3 x4 x6 x7 x8 x9 x10 x5
120269 1 1 1 1 1 1 1 1 1 1 1 0
25807 1 1 1 1 1 1 1 1 1 1 0 1
3924 1 1 1 1 1 1 1 1 1 0 0 2
0 0 0 0 0 0 0 0 0 3924 29731 33655
The detailed missing-value pattern is shown above: MonthlyIncome has 29,731 missing values and NumberOfDependents has 3,924.
More missing-value methods in R: http://blog.csdn.net/c1z2w3456789/article/details/51442678
Plotting the distribution of missing values
> aggr(traindata,prop = FALSE,numbers=TRUE)
> aggr(traindata, prop = TRUE, numbers = TRUE)  # proportions instead of counts
With proportions replacing the counts on the left:
The marginplot() function draws a scatterplot of two variables and shows their missing-value information in the plot margins.
> # Show missing-value information for two variables in the plot margins
> # (marginplot expects exactly two columns, so pass x5 and x10)
> marginplot(traindata[, c("x5", "x10")], pch = c(20),
+            col = c("darkgray", "red", "blue"))
Before treating the missing values, look at the boxplot of the MonthlyIncome variable:
> boxplot(traindata$x5)
There are many ways to handle missing values — cluster-based, regression-based, mean-based — and the simplest is to drop the affected rows outright. Here, though, the share of missing values is high, so dropping them would discard many observations and is not the best choice. Instead, we impute the missing values with KNN (this takes a long time — about 117 minutes):
> library(DMwR)
> # ??knnImputation
> traindata<-knnImputation(traindata,k=10,meth = "weighAvg")
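The idea behind knnImputation can be sketched outside R: for a row with a missing field, find the k most similar complete rows and fill the gap with a distance-weighted average of their values. A minimal Python illustration of that logic (not DMwR's exact implementation; the toy data are made up):

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing values (None) in column target_idx with the
    distance-weighted average of the k nearest complete rows."""
    complete = [r for r in rows if r[target_idx] is not None]
    for r in rows:
        if r[target_idx] is not None:
            continue
        # Euclidean distance over the observed columns only
        def dist(c):
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(r, c))
                                 if i != target_idx))
        neighbors = sorted(complete, key=dist)[:k]
        # inverse-distance weights (epsilon avoids division by zero)
        w = [1.0 / (dist(c) + 1e-9) for c in neighbors]
        r[target_idx] = (sum(wi * c[target_idx] for wi, c in zip(w, neighbors))
                         / sum(w))
    return rows

# toy rows of (age, monthly income); the third row's income is missing
rows = [[25, 3000.0], [30, 4000.0], [26, None]]
imputed = knn_impute(rows, target_idx=1, k=2)
print(imputed[2][1])  # ≈ 3200, pulled toward the closer neighbor's income
```

DMwR's `meth = "weighAvg"` similarly weights neighbors by their distance rather than averaging them equally.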
After imputation, check the missing-value pattern of the dataset again:
> # Visualize the missing-value pattern after imputation
> matrixplot(traindata)
Outlier Analysis and Treatment
Before treating outliers, here is a brief overview of some detection methods:
- Univariate outlier detection: in R, boxplot.stats() performs univariate detection, and the boxplot is drawn from the same statistics. Its return value includes a component, out, which lists the outliers — concretely, the points beyond the whiskers of the boxplot. For example, looking at the monthly-income distribution: the first boxplot keeps the outliers; after removing them in the second, monthly income is mostly concentrated between 3,000 and 8,000. In this report, however, we do not know the business well enough to simply label values above 8,000 as outliers, so this variable is left untreated.
- LOF (Local Outlier Factor): a density-based outlier-detection algorithm. It compares the local density of a point with the densities of the points around it; if the former is clearly smaller, the point lies in a relatively sparse region and is flagged as an outlier. A limitation of LOF is that it only works on numeric data. The lofactor() function in the 'DMwR' and 'dprep' packages computes the local outlier factors.
- Clustering-based detection: another approach is to cluster the data and treat points that belong to no cluster as outliers. Density-based clustering such as DBSCAN groups objects in dense, connected regions into clusters, so objects separated from the rest fall out as outliers. K-means can also be used: partition the data into k groups by assigning each point to its nearest cluster center, then compute each object's distance (or similarity) to its center and flag the points with the largest distances as outliers.
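The univariate rule above can be made concrete: boxplot.stats flags points beyond 1.5×IQR from the quartiles (Tukey's fences). A rough Python equivalent of that rule (R's quantile and whisker details differ slightly):

```python
def tukey_outliers(values, k=1.5):
    """Points outside [Q1 - k*IQR, Q3 + k*IQR] — roughly what
    R's boxplot.stats()$out reports (whisker details differ)."""
    xs = sorted(values)
    def quantile(q):  # simple linear-interpolation quantile
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

# toy monthly incomes with one extreme value
out = tukey_outliers([3000, 3200, 3500, 4000, 4200, 50000])
print(out)  # → [50000]
```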
Quantitative analysis of individual variables
Start with x2, the customer's age. Its distinct values are:
> # Quantitative analysis of the age variable
> unique(traindata$x2)
[1] 45 40 38 30 49 74 57 39 27 51 46 76 64 78 53 43 25 32 58 50 69 24 28 62 42 75 26 52 41 81 31 68 70 73 29 55 35 72
[39] 60 67 36 56 37 66 83 34 44 48 61 80 47 59 77 63 54 33 79 65 86 92 23 87 71 22 90 97 84 82 91 89 85 88 21 93 96 99
[77] 94 95 101 98 103 102 107 105 0 109
Age contains zero values, which are clearly invalid, so those rows are removed:
> # Remove the rows where age is 0
> traindata <- traindata[-which(traindata$x2 == 0),]
> unique(traindata$x2)
[1] 45 40 38 30 49 74 57 39 27 51 46 76 64 78 53 43 25 32 58 50 69 24 28 62 42 75 26 52 41 81 31 68 70 73 29 55 35 72
[39] 60 67 36 56 37 66 83 34 44 48 61 80 47 59 77 63 54 33 79 65 86 92 23 87 71 22 90 97 84 82 91 89 85 88 21 93 96 99
[77] 94 95 101 98 103 102 107 105 109
The boxplots of x3, x7, and x9 all show outliers, and unique() reveals that each of them contains the two anomalous values 96 and 98, so those rows are removed. Note that removing the rows where one of these variables equals 96 or 98 also removes the 96/98 values from the other two variables.
> # Boxplots of x3, x7, x9
> boxplot(traindata$x3, traindata$x7, traindata$x9)
> # Remove the outliers from these three variables
> unique(traindata$x3)
[1] 2 0 1 3 4 5 7 10 6 98 12 8 9 96 13 11
> unique(traindata$x7)
[1] 0 1 3 2 5 4 98 10 9 6 7 8 15 96 11 13 14 17 12
> unique(traindata$x9)
[1] 0 1 2 5 3 98 4 6 7 8 96 11 9
> traindata <- traindata[-which(traindata$x3==96), ]
> traindata <- traindata[-which(traindata$x3==98), ]
> # Check the variables again
> unique(traindata$x3)
[1] 2 0 1 3 4 5 7 10 6 12 8 9 13 11
> unique(traindata$x7)
[1] 0 1 3 2 5 4 10 9 6 7 8 15 11 13 14 17 12
> unique(traindata$x9)
[1] 0 1 2 5 3 4 6 7 8 11 9
The remaining variables are left untreated for now.
Variable Analysis
Variable distributions
First, take a quick look at the distributions of a few variables — for example, age:
> library(ggplot2)
> # Distribution of age
> ggplot(traindata, aes(x=x2, y=..density..)) +
+ geom_histogram(fill="blue", colour="grey60",
+ size=0.2, alpha=0.2) +
+ geom_density()
Age is roughly normally distributed, consistent with the assumptions of the statistical analysis. Monthly income can be plotted the same way:
> # Another variable: MonthlyIncome
> ggplot(traindata, aes(x=x5, y=..density..)) +
+ geom_histogram(fill="blue", colour="grey60",
+ size=0.2, alpha=0.2) +
+ geom_density() +
+ xlim(1, 20000)
Over the truncated range, MonthlyIncome also looks roughly normal, which suits the statistical analysis.
Correlation between variables
Before modeling, check the correlations between the variables: significant correlations among the predictors can hurt the model's predictive performance. The corrplot function plots the pairwise correlations of all variables, including the response.
> library(corrplot)
> # Plot the correlation matrix of the variables
> cor1 <- cor(traindata[, 1:11])
> corrplot(cor1)
> # Show the correlations as numbers
> corrplot(cor1, method = "number")
The plot shows that the correlations between variables are very small. Logistic regression also requires checking for multicollinearity; since the correlations here are small, we can tentatively conclude there is no multicollinearity problem. After modeling, VIF (variance inflation factor) can be used to test this formally. If multicollinearity does exist — i.e., two predictors are highly correlated — dimensionality reduction or variable removal is needed.
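As a reminder of what VIF measures: for predictor j, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing x_j on the other predictors; with just two predictors this reduces to 1 / (1 − r²). An illustrative Python sketch on synthetic data (not the loan data):

```python
import math, random

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif2(xs, ys):
    """VIF of a predictor against one other predictor: 1 / (1 - r^2).
    Values above ~10 are a common multicollinearity red flag."""
    r = corr(xs, ys)
    return 1.0 / (1.0 - r ** 2)

random.seed(1)
a = [random.gauss(0, 1) for _ in range(500)]
b = [random.gauss(0, 1) for _ in range(500)]     # independent of a
c = [ai + 0.1 * random.gauss(0, 1) for ai in a]  # nearly collinear with a
low, high = vif2(a, b), vif2(a, c)
print(low, high)  # low stays near 1; high is very large
```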
Splitting the Dataset
> # Split the dataset
> table(traindata$y)
0 1
139851 9879
The table shows a clear class imbalance in the response SeriousDlqin2yrs: only 9,879 observations have SeriousDlqin2yrs = 1, about 6.6% of the total. Such imbalance can be addressed with SMOTE, an algorithm for oversampling rare events in R.
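For reference, the core of SMOTE is interpolation: a synthetic minority sample is placed at a random point on the segment between a minority observation and one of its minority-class neighbors. A toy Python sketch of that step (the full algorithm also searches the k nearest neighbors):

```python
import random

def smote_point(x, neighbor):
    """One synthetic minority-class sample, interpolated on the segment
    between a minority sample x and a minority-class neighbor."""
    gap = random.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

random.seed(42)
synthetic = smote_point([1.0, 10.0], [3.0, 14.0])
print(synthetic)  # a point on the segment between the two inputs
```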
Here we use the createDataPartition function (data splitting) from the caret package to split the data randomly into two halves of equal size.
> # Split the data randomly into two equal halves
> # using the createDataPartition function from the caret package
> # install.packages("caret")
> library(caret)
> set.seed(1234)
> splitIndex <- createDataPartition(traindata$y, times = 1,
+                                   p = 0.5, list = FALSE)
> train <- traindata[splitIndex,]
> test <- traindata[-splitIndex,]
Both the resulting training and test sets contain 74,865 observations. The class balance of each split:
> # Check the class balance of the split
> prop.table(table(train$y))  # training set
0 1
0.93314633 0.06685367
> prop.table(table(test$y))  # test set
0 1
0.93489615 0.06510385
The two splits are consistent: each retains roughly 6.6% positive cases, so the data remain representative. We can therefore proceed to modeling and prediction with this split.
Logistic Regression
Logistic regression plays a central role in credit scorecard development. Thanks to its form, combined with the Weight of Evidence (WOE) transformation of the predictors, the output of a logistic regression can be converted directly into a summary table — the standard scorecard format.
Basic formula
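The original post likely displayed the standard logistic form under this heading (probably as an image lost in extraction); reconstructed, with p the probability of default:

```latex
\[
  \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k,
  \qquad p = P(y = 1 \mid x_1, \dots, x_k)
\]
```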
Building the Model
First, fit a logistic regression on all variables with the glm function:
> # Fit the logistic regression model
> fit <- glm(y~., train, family = "binomial")
> summary(fit)
Call:
glm(formula = y ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.6144 -0.3399 -0.2772 -0.2240 3.6997
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.812e+00 6.411e-02 -28.268 < 2e-16 ***
x1 -1.846e-05 8.972e-05 -0.206 0.836948
x2 -2.861e-02 1.276e-03 -22.428 < 2e-16 ***
x3 5.767e-01 1.564e-02 36.867 < 2e-16 ***
x4 -2.321e-05 1.538e-05 -1.509 0.131224
x5 -1.355e-05 3.845e-06 -3.524 0.000425 ***
x6 -2.769e-03 3.798e-03 -0.729 0.466051
x7 8.468e-01 2.429e-02 34.855 < 2e-16 ***
x8 8.620e-02 1.599e-02 5.393 6.94e-08 ***
x9 8.294e-01 3.338e-02 24.848 < 2e-16 ***
x10 5.126e-02 1.388e-02 3.694 0.000221 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 36747 on 74864 degrees of freedom
Residual deviance: 29793 on 74854 degrees of freedom
AIC: 29815
Number of Fisher Scoring iterations: 6
With all variables included, the fit is not ideal: x1, x4, and x6 fail the significance test. We therefore drop these three variables and refit on the rest — or, equivalently, update the existing model:
> fit2 <- update(fit, y~.-x1-x4-x6)
> # same as:
> # fit2<-glm(y~x2+x3+x5+x7+x8+x9+x10,train,family = "binomial")
> summary(fit2)
Call:
glm(formula = y ~ x2 + x3 + x5 + x7 + x8 + x9 + x10, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.6223 -0.3402 -0.2777 -0.2239 3.5868
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.825e+00 6.320e-02 -28.873 < 2e-16 ***
x2 -2.894e-02 1.252e-03 -23.120 < 2e-16 ***
x3 5.742e-01 1.544e-02 37.187 < 2e-16 ***
x5 -1.185e-05 3.513e-06 -3.373 0.000744 ***
x7 8.500e-01 2.401e-02 35.397 < 2e-16 ***
x8 7.494e-02 1.420e-02 5.276 1.32e-07 ***
x9 8.306e-01 3.338e-02 24.883 < 2e-16 ***
x10 5.169e-02 1.386e-02 3.730 0.000192 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 36747 on 74864 degrees of freedom
Residual deviance: 29797 on 74857 degrees of freedom
AIC: 29813
Number of Fisher Scoring iterations: 6
All variables in the second model pass the significance test, and the AIC (Akaike information criterion) is smaller, so this model fits better.
Model Evaluation
A binary classifier is typically evaluated with the ROC (Receiver Operating Characteristic) curve and the AUC value. Many binary classifiers output a probability score rather than a hard 0/1 prediction. Choosing a cutoff (say, 0.5) converts the scores into binary predictions, from which a confusion matrix can be built: every observation falls into one of its cells, and the diagonal counts the correct predictions, i.e. true positives + true negatives. From the matrix we can compute the TPR (true positive rate, or sensitivity) and TNR (true negative rate, or specificity). We would like both to be large, but unfortunately they trade off against each other. Besides the classifier's training parameters, the choice of cutoff also strongly affects TPR and TNR, and it can be tailored to the specific problem at hand.
Sweeping over a series of cutoffs yields a series of (TPR, TNR) pairs; connecting the corresponding points produces the ROC curve, which gives a clear picture of a classifier's performance and makes it easy to compare classifiers. By convention, the x-axis is 1 − TNR, i.e. the FPR (False Positive Rate), and the y-axis is the TPR.
AUC (Area Under Curve) is defined as the area under the ROC curve. Clearly this area cannot exceed 1, and since the ROC curve normally lies above the line y = x, the AUC ranges from 0.5 to 1. AUC is used as an evaluation criterion because ROC curves themselves often do not show clearly which classifier is better, whereas as a single number, a larger AUC means a better classifier.
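The AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (the Mann-Whitney statistic), which allows computing it without drawing the curve. A small Python sketch:

```python
def auc(scores_pos, scores_neg):
    """AUC = P(a random positive outscores a random negative);
    ties count as 1/2 (the Mann-Whitney statistic)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

a = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
print(a)  # 8 of the 9 positive/negative pairs are ordered correctly
```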
First, use the model to predict on the test data and generate the probability scores:
> # Validate the model
> # Predict on the test data to generate the scores
> pre <- predict(fit2, test)
> # In R, the pROC package makes it easy to compare classifiers
> # install.packages("pROC")
> library(pROC)
The pROC package makes comparing classifiers convenient, automatically marks the optimal cutoff, and produces an attractive plot. In the figure below, the optimal cutoff gives specificity (TNR) = 0.845 and sensitivity (TPR) = 0.638, and the AUC is 0.810, indicating that the model's predictive performance is reasonably good.
> modelroc <- roc(test$y, pre)
> plot(modelroc, print.auc=TRUE, auc.polygon=TRUE,
+ grid=c(0.1, 0.2), grid.col=c("green", "red"),
+ max.auc.polygon=TRUE, auc.polygon.col="skyblue",
+ print.thres=TRUE)
Call:
roc.default(response = test$y, predictor = pre)
Data: pre in 69991 controls (test$y 0) < 4874 cases (test$y 1).
Area under the curve: 0.8102
WOE Transformation
The Weight of Evidence (WOE) transformation converts a logistic regression model into the standard scorecard format. WOE is introduced not to improve model quality, but because some variables should not enter the model — either because they add no predictive value or because the errors associated with their coefficients are large. A standard scorecard can in fact be built without WOE; in that case, the logistic regression has to handle a large number of raw predictors, which complicates the modeling, but the resulting scorecard is the same.
Each x is replaced by WOE(x), where for each bin WOE = ln[(bad_i / bad_total) / (good_i / good_total)] — the log of the bin's share of all defaulters over its share of all non-defaulters.
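A compact Python version of that bin-level formula, handy for checking the arithmetic of the getWOE R function defined later:

```python
import math

def woe(bad, good, total_bad, total_good):
    """WOE of one bin: ln of (the bin's share of all bads /
    the bin's share of all goods). Positive -> riskier than average."""
    return math.log((bad / total_bad) / (good / total_good))

# a bin holding 20% of all bads but only 10% of all goods:
w = woe(bad=200, good=1000, total_bad=1000, total_good=10000)
print(w)  # → ln(2) ≈ 0.693
```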
The logistic regression above dropped x1, x4, and x6; the remaining variables are now WOE-transformed.
Binning
The age variable (x2):
> # Convert the logistic model into the standard scorecard format
> # by WOE-transforming each variable in the model
> # Binning: age (x2); note the 45 cutpoint, needed for the ten age bins below
> cutx2 <- c(-Inf, 30, 35, 40, 45, 50, 55, 60, 65, 75, Inf)
> plot(cut(train$x2, cutx2))
The NumberOfTime30-59DaysPastDueNotWorse variable (x3):
> cutx3 <- c(-Inf, 0, 1, 3, 5, Inf)
> plot(cut(train$x3, cutx3))
The MonthlyIncome variable (x5):
> cutx5 <- c(-Inf, 1000, 2000, 3000, 4000, 5000, 6000,
+ 7500, 9500, 12000, Inf)
> plot(cut(train$x5, cutx5))
The NumberOfTimes90DaysLate variable (x7):
> cutx7 <- c(-Inf, 0, 1, 3, 5, 10, Inf)
> plot(cut(train$x7, cutx7))
The NumberRealEstateLoansOrLines variable (x8):
> cutx8 <- c(-Inf, 0, 1, 2, 3, 5, Inf)
> plot(cut(train$x8, cutx8))
The NumberOfTime60-89DaysPastDueNotWorse variable (x9):
> cutx9 <- c(-Inf, 0, 1, 3, 5, Inf)
> plot(cut(train$x9, cutx9))
The NumberOfDependents variable (x10):
> cutx10 <- c(-Inf, 0, 1, 2, 3, 5, Inf)
> plot(cut(train$x10, cutx10))
Computing the WOE Values
Write a function to compute each variable's WOE values:
> # Function to compute each bin's WOE value
> totalgood <- as.numeric(table(train$y))[1]
> totalbad <- as.numeric(table(train$y))[2]
>
> getWOE <- function(a, p, q) {
+ Good <- as.numeric(table(train$y[a>p & a<=q]))[1]
+ Bad <- as.numeric(table(train$y[a>p & a<=q]))[2]
+ WOE <- log((Bad/totalbad)/(Good/totalgood), base=exp(1))
+ return(WOE)
+ }
The age variable (x2):
> # WOE values for age
> Agelessthan30.WOE <- getWOE(train$x2, -Inf, 30)
> Age30to35.WOE <- getWOE(train$x2, 30, 35)
> Age35to40.WOE <- getWOE(train$x2, 35, 40)
> Age40to45.WOE <- getWOE(train$x2, 40, 45)
> Age45to50.WOE <- getWOE(train$x2, 45, 50)
> Age50to55.WOE <- getWOE(train$x2, 50, 55)
> Age55to60.WOE <- getWOE(train$x2, 55, 60)
> Age60to65.WOE <- getWOE(train$x2, 60, 65)
> Age65to75.WOE <- getWOE(train$x2, 65, 75)
> Agemorethan.WOE <- getWOE(train$x2, 75, Inf)
> age.WOE <- c(Agelessthan30.WOE, Age30to35.WOE, Age35to40.WOE,
+ Age40to45.WOE, Age45to50.WOE, Age50to55.WOE,
+ Age55to60.WOE, Age60to65.WOE, Age65to75.WOE,
+ Agemorethan.WOE)
> age.WOE
[1] 0.57432879 0.52063157 0.34283924 0.24251193 0.22039521 0.07194294 -0.25643603 -0.55868003 -0.94144504 -1.28914527
The NumberOfTime30-59DaysPastDueNotWorse variable (x3):
> # WOE values for x3: NumberOfTime30-59DaysPastDueNotWorse
> Worselessthan0.WOE <- getWOE(train$x3, -Inf, 0)
> Worse0to1.WOE <- getWOE(train$x3, 0, 1)
> Worse1to3.WOE <- getWOE(train$x3, 1, 3)
> Worse3to5.WOE <- getWOE(train$x3, 3, 5)
> Worsemorethan5.WOE <- getWOE(train$x3, 5, Inf)
> worse.WOE <- c(Worselessthan0.WOE, Worse0to1.WOE,
+ Worse1to3.WOE, Worse3to5.WOE,
+ Worsemorethan5.WOE)
> worse.WOE
[1] -0.5324915 0.9106018 1.7645290 2.4432903 2.5682332
The MonthlyIncome variable (x5):
> # WOE values for x5: MonthlyIncome
> Incomelessthan1000.WOE <- getWOE(train$x5, -Inf, 1000)
> Income1000to2000.WOE <- getWOE(train$x5, 1000, 2000)
> Income2000to3000.WOE <- getWOE(train$x5, 2000, 3000)
> Income3000to4000.WOE <- getWOE(train$x5, 3000, 4000)
> Income4000to5000.WOE <- getWOE(train$x5, 4000, 5000)
> Income5000to6000.WOE <- getWOE(train$x5, 5000, 6000)
> Income6000to7500.WOE <- getWOE(train$x5, 6000, 7500)
> Income7500to9500.WOE <- getWOE(train$x5, 7500, 9500)
> Income9500to12000.WOE <- getWOE(train$x5, 9500, 12000)
> Incomemorethan12000.WOE <- getWOE(train$x5, 12000, Inf)
> Income.WOE <- c(Incomelessthan1000.WOE, Income1000to2000.WOE,
+ Income2000to3000.WOE, Income3000to4000.WOE,
+ Income4000to5000.WOE, Income5000to6000.WOE,
+ Income6000to7500.WOE, Income7500to9500.WOE,
+ Income9500to12000.WOE, Incomemorethan12000.WOE)
> Income.WOE
[1] -1.128862326 0.448960482 0.312423080 0.350846777 0.247782295 0.114417168 -0.001808106 -0.237224039 -0.389158800 -0.462438653
The NumberOfTimes90DaysLate variable (x7):
> # WOE values for x7: NumberOfTimes90DaysLate
> Latelessthan0.WOE <- getWOE(train$x7, -Inf, 0)
> Late0to1.WOE <- getWOE(train$x7, 0, 1)
> Late1to3.WOE <- getWOE(train$x7, 1, 3)
> Late3to5.WOE <- getWOE(train$x7, 3, 5)
> Late5to10.WOE <- getWOE(train$x7, 5, 10)
> Latemorethan10.WOE <- getWOE(train$x7, 10, Inf)
> Late.WOE <- c(Latelessthan0.WOE, Late0to1.WOE,
+ Late1to3.WOE, Late3to5.WOE,
+ Late5to10.WOE, Latemorethan10.WOE)
> Late.WOE
[1] -0.3694044 1.9400973 2.7294448 3.3090003 3.3852925 2.3483738
The NumberRealEstateLoansOrLines variable (x8):
> # WOE values for x8: NumberRealEstateLoansOrLines
> Reallessthan0.WOE <- getWOE(train$x8, -Inf, 0)
> Real0to1.WOE <- getWOE(train$x8, 0, 1)
> Real1to2.WOE <- getWOE(train$x8, 1, 2)
> Real2to3.WOE <- getWOE(train$x8, 2, 3)
> Real3to5.WOE <- getWOE(train$x8, 3, 5)
> Realmorethan5.WOE <- getWOE(train$x8, 5, Inf)
> Real.WOE <- c(Reallessthan0.WOE, Real0to1.WOE,
+ Real1to2.WOE, Real2to3.WOE, Real3to5.WOE,
+ Realmorethan5.WOE)
> Real.WOE
[1] 0.21490691 -0.24386987 -0.15568385 0.02906876 0.41685234 1.12192809
The NumberOfTime60-89DaysPastDueNotWorse variable (x9):
> # WOE values for x9: NumberOfTime60-89DaysPastDueNotWorse
> Worse2lessthan0.WOE <- getWOE(train$x9, -Inf, 0)
> Worse20to1.WOE <- getWOE(train$x9, 0, 1)
> Worse21to3.WOE <- getWOE(train$x9, 1, 3)
> Worse23to5.WOE <- getWOE(train$x9, 3, 5)
> Worse2morethan5.WOE <- getWOE(train$x9, 5, Inf)
> worse2.WOE <- c(Worse2lessthan0.WOE, Worse20to1.WOE,
+ Worse21to3.WOE,Worse23to5.WOE,
+ Worse2morethan5.WOE)
> worse2.WOE
[1] -0.2784605 1.8329078 2.7775343 3.5805174 3.4469860
The NumberOfDependents variable (x10):
> # WOE values for x10: NumberOfDependents
> Deplessthan0.WOE <- getWOE(train$x10, -Inf, 0)
> Dep0to1.WOE <- getWOE(train$x10, 0, 1)
> Dep1to2.WOE <- getWOE(train$x10, 1, 2)
> Dep2to3.WOE <- getWOE(train$x10, 2, 3)
> Dep3to5.WOE <- getWOE(train$x10, 3, 5)
> Depmorethan5.WOE <- getWOE(train$x10, 5, Inf)
> Dep.WOE <- c(Deplessthan0.WOE, Dep0to1.WOE,
+ Dep1to2.WOE, Dep2to3.WOE, Dep3to5.WOE,
+ Depmorethan5.WOE)
> Dep.WOE
[1] -0.15525081 0.08669961 0.19618098 0.33162486 0.40469824 0.76425365
Applying the WOE Transformation
For example, the age variable (x2):
> # x2
> tmp.age <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x2[i] <= 30)
+ tmp.age[i] <- Agelessthan30.WOE
+ else if(train$x2[i] <= 35)
+ tmp.age[i] <- Age30to35.WOE
+ else if(train$x2[i] <= 40)
+ tmp.age[i] <- Age35to40.WOE
+ else if(train$x2[i] <= 45)
+ tmp.age[i] <- Age40to45.WOE
+ else if(train$x2[i] <= 50)
+ tmp.age[i] <- Age45to50.WOE
+ else if(train$x2[i] <= 55)
+ tmp.age[i] <- Age50to55.WOE
+ else if(train$x2[i] <= 60)
+ tmp.age[i] <- Age55to60.WOE
+ else if(train$x2[i] <= 65)
+ tmp.age[i] <- Age60to65.WOE
+ else if(train$x2[i] <= 75)
+ tmp.age[i] <- Age65to75.WOE
+ else
+ tmp.age[i] <- Agemorethan.WOE
+ }
>
> table(tmp.age)
tmp.age
-1.2891452711972 -0.941445039519045 -0.558680027962495 -0.256436029353835 0.0719429392949312 0.220395209955515 0.242511934081286 0.342839240194068
5063 9196 8180 8472 9009 9465 8008 6784
0.52063156705216 0.574328792863984
5390 5298
> tmp.age[1:10]
[1] 0.34283924 0.57432879 0.34283924 0.57432879 0.07194294 0.22039521 0.07194294 0.24251193 0.34283924 0.52063157
> train$x2[1:10]
[1] 38 30 39 30 51 46 53 43 39 32
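The long if/else chain above is just a lookup of which right-closed bin a value falls into; with sorted cutpoints, the same mapping is a binary search. An illustrative Python sketch using the age cuts and the (rounded) age WOE values printed above:

```python
import bisect

def bin_woe(value, cuts, woes):
    """Map a value to its bin's WOE. cuts are the finite right edges of
    right-closed bins, as in R's cut(x, c(-Inf, ..., Inf))."""
    return woes[bisect.bisect_left(cuts, value)]

# the age cutpoints and age WOE values from the session above (rounded)
cuts = [30, 35, 40, 45, 50, 55, 60, 65, 75]
woes = [0.5743, 0.5206, 0.3428, 0.2425, 0.2204,
        0.0719, -0.2564, -0.5587, -0.9414, -1.2891]
print(bin_woe(38, cuts, woes))  # → 0.3428, the (35, 40] bin
print(bin_woe(80, cuts, woes))  # → -1.2891, the over-75 bin
```

bisect_left sends a value equal to a cutpoint into the lower bin, matching the `<=` comparisons of the R loop.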
The NumberOfTime30-59DaysPastDueNotWorse variable (x3):
> # x3
> tmp.worse <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x3[i] <= 0)
+ tmp.worse[i] <- Worselessthan0.WOE
+ else if(train$x3[i] <= 1)
+ tmp.worse[i] <- Worse0to1.WOE
+ else if(train$x3[i] <= 3)
+ tmp.worse[i] <- Worse1to3.WOE
+ else if(train$x3[i] <= 5)
+ tmp.worse[i] <- Worse3to5.WOE
+ else
+ tmp.worse[i] <- Worsemorethan5.WOE
+ }
> table(tmp.worse)
tmp.worse
-0.53249146131578 0.910601840444591 1.76452904024992 2.44329031065646 2.56823323027274
62948 8077 3160 562 118
> tmp.worse[1:10]
[1] 0.9106018 -0.5324915 -0.5324915 -0.5324915 -0.5324915 -0.5324915 -0.5324915 -0.5324915 -0.5324915 -0.5324915
> train$x3[1:10]
[1] 1 0 0 0 0 0 0 0 0 0
The MonthlyIncome variable (x5):
> # x5
> tmp.income <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x5[i] <= 1000)
+ tmp.income[i] <- Incomelessthan1000.WOE
+ else if(train$x5[i] <= 2000)
+ tmp.income[i] <- Income1000to2000.WOE
+ else if(train$x5[i] <= 3000)
+ tmp.income[i] <- Income2000to3000.WOE
+ else if(train$x5[i] <= 4000)
+ tmp.income[i] <- Income3000to4000.WOE
+ else if(train$x5[i] <= 5000)
+ tmp.income[i] <- Income4000to5000.WOE
+ else if(train$x5[i] <= 6000)
+ tmp.income[i] <- Income5000to6000.WOE
+ else if(train$x5[i] <= 7500)
+ tmp.income[i] <- Income6000to7500.WOE
+ else if(train$x5[i] <= 9500)
+ tmp.income[i] <- Income7500to9500.WOE
+ else if(train$x5[i] <= 12000)
+ tmp.income[i] <- Income9500to12000.WOE
+ else
+ tmp.income[i] <- Incomemorethan12000.WOE
+ }
> table(tmp.income)
tmp.income
-1.12886232582259 -0.462438653207328 -0.389158799506996 -0.237224038650003 -0.00180810632297072 0.114417167554772 0.247782294610166
10201 5490 5486 7048 8076 7249 9147
0.312423079500641 0.350846777249291 0.448960482499888
8118 9680 4370
> tmp.income[1:10]
[1] 0.350846777 0.350846777 0.350846777 0.312423080 -0.001808106 -0.462438653 -0.237224039 0.350846777 0.312423080 -0.237224039
> train$x5[1:10]
[1] 3042 3300 3500 2500 6501 12454 8800 3280 2500 7916
The NumberOfTimes90DaysLate variable (x7):
> # x7
> tmp.Late <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x7[i] <= 0)
+ tmp.Late[i] <- Latelessthan0.WOE
+ else if(train$x7[i] <= 1)
+ tmp.Late[i] <- Late0to1.WOE
+ else if(train$x7[i] <= 3)
+ tmp.Late[i] <- Late1to3.WOE
+ else if(train$x7[i] <= 5)
+ tmp.Late[i] <- Late3to5.WOE
+ else if(train$x7[i] <= 10)
+ tmp.Late[i] <- Late5to10.WOE
+ else
+ tmp.Late[i] <- Latemorethan10.WOE
+ }
> table(tmp.Late)
tmp.Late
-0.369404425455224 1.94009728631401 2.34837375415972 2.72944477623793 3.30900029985393 3.38529247382249
70793 2669 7 1093 222 81
> tmp.Late[1:10]
[1] 1.9400973 -0.3694044 -0.3694044 -0.3694044 -0.3694044 -0.3694044 -0.3694044 -0.3694044 -0.3694044 -0.3694044
> train$x7[1:10]
[1] 1 0 0 0 0 0 0 0 0 0
The NumberRealEstateLoansOrLines variable (x8):
> # x8
> tmp.real <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x8[i] <= 0)
+ tmp.real[i] <- Reallessthan0.WOE
+ else if(train$x8[i] <= 1)
+ tmp.real[i] <- Real0to1.WOE
+ else if(train$x8[i] <= 2)
+ tmp.real[i] <- Real1to2.WOE
+ else if(train$x8[i] <= 3)
+ tmp.real[i] <- Real2to3.WOE
+ else if(train$x8[i] <= 5)
+ tmp.real[i] <- Real3to5.WOE
+ else
+ tmp.real[i] <- Realmorethan5.WOE
+ }
> table(tmp.real)
tmp.real
-0.243869874062293 -0.155683851792327 0.0290687559545721 0.214906905417014 0.416852342556507 1.12192809398173
26150 15890 3130 27901 1428 366
> tmp.real[1:10]
[1] 0.2149069 0.2149069 0.2149069 0.2149069 -0.1556839 -0.1556839 0.2149069 -0.2438699 0.2149069 0.2149069
> train$x8[1:10]
[1] 0 0 0 0 2 2 0 1 0 0
The NumberOfTime60-89DaysPastDueNotWorse variable (x9):
> # x9
> tmp.worse2 <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x9[i] <= 0)
+ tmp.worse2[i] <- Worse2lessthan0.WOE
+ else if(train$x9[i] <= 1)
+ tmp.worse2[i] <- Worse20to1.WOE
+ else if(train$x9[i] <= 3)
+ tmp.worse2[i] <- Worse21to3.WOE
+ else if(train$x9[i] <= 5)
+ tmp.worse2[i] <- Worse23to5.WOE
+ else
+ tmp.worse2[i] <- Worse2morethan5.WOE
+ }
> table(tmp.worse2)
tmp.worse2
-0.278460464730538 1.83290775083723 2.77753428092856 3.44698604282783 3.58051743545235
71150 2919 708 13 75
> tmp.worse2[1:10]
[1] -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605 -0.2784605
> train$x9[1:10]
[1] 0 0 0 0 0 0 0 0 0 0
The NumberOfDependents variable (x10):
> # x10
> tmp.dep <- 0
> for(i in 1 : nrow(train)) {
+ if(train$x10[i] <= 0)
+ tmp.dep[i] <- Deplessthan0.WOE
+ else if(train$x10[i] <= 1)
+ tmp.dep[i] <- Dep0to1.WOE
+ else if(train$x10[i] <= 2)
+ tmp.dep[i] <- Dep1to2.WOE
+ else if(train$x10[i] <= 3)
+ tmp.dep[i] <- Dep2to3.WOE
+ else if(train$x10[i] <= 5)
+ tmp.dep[i] <- Dep3to5.WOE
+ else
+ tmp.dep[i] <- Depmorethan5.WOE
+ }
> table(tmp.dep)
tmp.dep
-0.155250809857344 0.0866996065110081 0.196180980387687 0.331624863227172 0.404698242905824 0.76425364970991
43498 14544 10102 4771 1815 135
> tmp.dep[1:10]
[1] -0.1552508 -0.1552508 -0.1552508 -0.1552508 0.1961810 0.1961810 -0.1552508 0.1961810 -0.1552508 -0.1552508
> train$x10[1:10]
[1] 0 0 0 0 2 2 0 2 0 0
Building the WOE Data Frame
> # Build the WOE data frame
> trainWOE <- cbind.data.frame(tmp.age,
+ tmp.income,
+ tmp.worse,
+ tmp.Late,
+ tmp.real,
+ tmp.worse2,
+ tmp.dep)
Creating and Implementing the Scorecard
A standard scorecard is a set of IF-THEN rules, one per variable: a variable's value determines the points that variable is assigned, and the total score is the sum of the points across variables.
Regression setup: in the data, "1" denotes default, so modeling y directly predicts "the probability of default", and log(odds) is the bad-to-good odds. For scores to read more naturally — higher score, better credit — we swap "1" and "0" so that the model predicts "the probability of not defaulting", and log(odds) becomes the good-to-bad odds.
> # Fit the logistic regression
> trainWOE$y <- 1 - train$y
> glm.fit <- glm(y ~ ., data = trainWOE, family = binomial(link = logit))
> summary(glm.fit)
Call:
glm(formula = y ~ ., family = binomial(link = logit), data = trainWOE)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1791 0.1926 0.2483 0.3236 2.4823
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.60869 0.01726 151.148 < 2e-16 ***
tmp.age -0.69434 0.03599 -19.291 < 2e-16 ***
tmp.income -0.74529 0.04235 -17.599 < 2e-16 ***
tmp.worse -0.67189 0.01783 -37.685 < 2e-16 ***
tmp.Late -0.64627 0.01583 -40.833 < 2e-16 ***
tmp.real -0.63421 0.07105 -8.926 < 2e-16 ***
tmp.worse2 -0.48638 0.01963 -24.773 < 2e-16 ***
tmp.dep -0.36456 0.08883 -4.104 4.06e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 36747 on 74864 degrees of freedom
Residual deviance: 28561 on 74857 degrees of freedom
AIC: 28577
Number of Fisher Scoring iterations: 6
> coe <- -(glm.fit$coefficients)  # note: the sign is flipped here!
> # coe <- (glm.fit$coefficients)
> coe
(Intercept) tmp.age tmp.income tmp.worse tmp.Late tmp.real tmp.worse2 tmp.dep
-2.6086889 0.6943365 0.7452904 0.6718923 0.6462748 0.6342062 0.4863836 0.3645583
Why is coe negated here? With y flipped to model the probability of being good, the fitted coefficients on the (bad-oriented) WOE variables come out negative; negating them yields positive weights, so each bin's points increase with its WOE — equivalently, the score is built on the bad-to-good log odds.
Total score = base score + the points from each variable
The base score is:
> # Compute the scaling parameters
> p <- 20/log(2)
> q <- 600 - 20*log(15)/log(2)
> # Total score = base score + points per variable
> # The base score is
> base <- q + p * as.numeric(coe[1])
> base
[1] 446.5913
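The constants p and q implement the usual PDO scaling: a score of 600 is anchored at odds of 15:1, and every doubling of the odds moves the score by PDO = 20 points. A quick Python check of that arithmetic, including the base score:

```python
import math

# score = q + p * ln(odds): 600 points at odds 15:1, PDO = 20
PDO, anchor_score, anchor_odds = 20, 600, 15
p = PDO / math.log(2)                      # points per unit of ln(odds)
q = anchor_score - p * math.log(anchor_odds)

def score(odds):
    return q + p * math.log(odds)

base = q + p * (-2.6086889)   # intercept term coe[1] from the session above
print(score(15))              # → 600.0
print(score(30) - score(15))  # doubling the odds adds PDO = 20 points
print(base)                   # ≈ 446.5913, matching the R output
```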
Scoring each variable
The age variable (x2):
> # x2
> Agelessthan30.SCORE <- p * as.numeric(coe[2]) * Agelessthan30.WOE
> Age30to35.SCORE <- p * as.numeric(coe[2]) * Age30to35.WOE
> Age35to40.SCORE <- p * as.numeric(coe[2]) * Age35to40.WOE
> Age40to45.SCORE <- p * as.numeric(coe[2]) * Age40to45.WOE
> Age45to50.SCORE <- p * as.numeric(coe[2]) * Age45to50.WOE
> Age50to55.SCORE <- p * as.numeric(coe[2]) * Age50to55.WOE
> Age55to60.SCORE <- p * as.numeric(coe[2]) * Age55to60.WOE
> Age60to65.SCORE <- p * as.numeric(coe[2]) * Age60to65.WOE
> Age65to75.SCORE <- p * as.numeric(coe[2]) * Age65to75.WOE
> Agemorethan75.SCORE <- p * as.numeric(coe[2]) * Agemorethan.WOE
> Age.SCORE <- c(Agelessthan30.SCORE, Age30to35.SCORE, Age35to40.SCORE, Age40to45.SCORE,
+ Age45to50.SCORE, Age50to55.SCORE, Age55to60.SCORE, Age60to65.SCORE,
+ Age65to75.SCORE, Agemorethan75.SCORE)
> Age.SCORE
[1] 11.506284 10.430497 6.868549 4.858561 4.415467 1.441328 -5.137520 -11.192772 -18.861207 -25.827143
The calculations for the other variables are analogous and omitted here.
Computing the binned scores for each variable
To simplify the per-variable scoring above and round the results, write the following scoring function:
> # Compute each variable's binned scores, rounded to integers
> getscore <- function(i, x) {
+ score <- round(p * as.numeric(coe[i]) * x, 0)
+ return(score)
+ }
Recompute the binned scores for each variable:
The age variable (x2):
> # x2
> Agelessthan30.SCORE <- getscore(2, Agelessthan30.WOE)
> Age30to35.SCORE <- getscore(2, Age30to35.WOE)
> Age35to40.SCORE <- getscore(2, Age35to40.WOE)
> Age40to45.SCORE <- getscore(2, Age40to45.WOE)
> Age45to50.SCORE <- getscore(2, Age45to50.WOE)
> Age50to55.SCORE <- getscore(2, Age50to55.WOE)
> Age55to60.SCORE <- getscore(2, Age55to60.WOE)
> Age60to65.SCORE <- getscore(2, Age60to65.WOE)
> Age65to75.SCORE <- getscore(2, Age65to75.WOE)
> Agemorethan75.SCORE <- getscore(2, Agemorethan.WOE)
> Age.SCORE <- c(Agelessthan30.SCORE, Age30to35.SCORE, Age35to40.SCORE, Age40to45.SCORE,
+ Age45to50.SCORE, Age50to55.SCORE, Age55to60.SCORE, Age60to65.SCORE,
+ Age65to75.SCORE, Agemorethan75.SCORE)
> Age.SCORE
[1] 12 10 7 5 4 1 -5 -11 -19 -26
The NumberOfTime30-59DaysPastDueNotWorse variable (x3):
> # x3
> Worselessthan0.SCORE <- getscore(4, Worselessthan0.WOE)
> Worse0to1.SCORE <- getscore(4, Worse0to1.WOE)
> Worse1to3.SCORE <- getscore(4, Worse1to3.WOE)
> Worse3to5.SCORE <- getscore(4, Worse3to5.WOE)
> Worsemorethan5.SCORE <- getscore(4, Worsemorethan5.WOE)
> Worse.SCORE <- c(Worselessthan0.SCORE, Worse0to1.SCORE, Worse1to3.SCORE, Worse3to5.SCORE,
+ Worsemorethan5.SCORE)
> Worse.SCORE
[1] -10 18 34 47 50
The MonthlyIncome variable (x5):
> # x5
> Incomelessthan1000.SCORE <- getscore(3, Incomelessthan1000.WOE)
> Income1000to2000.SCORE <- getscore(3, Income1000to2000.WOE)
> Income2000to3000.SCORE <- getscore(3, Income2000to3000.WOE)
> Income3000to4000.SCORE <- getscore(3, Income3000to4000.WOE)
> Income4000to5000.SCORE <- getscore(3, Income4000to5000.WOE)
> Income5000to6000.SCORE <- getscore(3, Income5000to6000.WOE)
> Income6000to7500.SCORE <- getscore(3, Income6000to7500.WOE)
> Income7500to9500.SCORE <- getscore(3, Income7500to9500.WOE)
> Income9500to12000.SCORE <- getscore(3, Income9500to12000.WOE)
> Incomemorethan12000.SCORE <- getscore(3, Incomemorethan12000.WOE)
> Income.SCORE <- c(Incomelessthan1000.SCORE, Income1000to2000.SCORE, Income2000to3000.SCORE,
+ Income3000to4000.SCORE, Income4000to5000.SCORE, Income5000to6000.SCORE,
+ Income6000to7500.SCORE, Income7500to9500.SCORE, Income9500to12000.SCORE,
+ Incomemorethan12000.SCORE)
> Income.SCORE
[1] -24 10 7 8 5 2 0 -5 -8 -10
The NumberOfTimes90DaysLate variable (x7):
> # x7
> Latelessthan0.SCORE <- getscore(5, Latelessthan0.WOE)
> Late0to1.SCORE <- getscore(5, Late0to1.WOE)
> Late1to3.SCORE <- getscore(5, Late1to3.WOE)
> Late3to5.SCORE <- getscore(5, Late3to5.WOE)
> Late5to10.SCORE <- getscore(5, Late5to10.WOE)
> Latemorethan10.SCORE <- getscore(5, Latemorethan10.WOE)
> Late.SCORE <- c(Latelessthan0.SCORE, Late0to1.SCORE, Late1to3.SCORE, Late3to5.SCORE, Late5to10.SCORE, Latemorethan10.SCORE)
> Late.SCORE
[1] -7 36 51 62 63 44
The NumberRealEstateLoansOrLines variable (x8):
> # x8
> Reallessthan0.SCORE <- getscore(6, Reallessthan0.WOE)
> Real0to1.SCORE <- getscore(6, Real0to1.WOE)
> Real1to2.SCORE <- getscore(6, Real1to2.WOE)
> Real2to3.SCORE <- getscore(6, Real2to3.WOE)
> Real3to5.SCORE <- getscore(6, Real3to5.WOE)
> Realmorethan5.SCORE <- getscore(6, Realmorethan5.WOE)
> Real.SCORE <- c(Reallessthan0.SCORE, Real0to1.SCORE, Real1to2.SCORE, Real2to3.SCORE, Real3to5.SCORE,
+ Realmorethan5.SCORE)
> Real.SCORE
[1] 4 -4 -3 1 8 21
The NumberOfTime60-89DaysPastDueNotWorse variable (x9):
> # x9
> Worse2lessthan0.SCORE <- getscore(7, Worse2lessthan0.WOE)
> Worse20to1.SCORE <- getscore(7, Worse20to1.WOE)
> Worse21to3.SCORE <- getscore(7, Worse21to3.WOE)
> Worse23to5.SCORE <- getscore(7, Worse23to5.WOE)
> Worse2morethan5.SCORE <- getscore(7, Worse2morethan5.WOE)
> Worse2.SCORE <- c(Worse2lessthan0.SCORE, Worse20to1.SCORE, Worse21to3.SCORE, Worse23to5.SCORE,
+ Worse2morethan5.SCORE)
> Worse2.SCORE
[1] -4 26 39 50 48
The NumberOfDependents variable (x10):
> # x10
> Deplessthan0.SCORE <- getscore(8, Deplessthan0.WOE)
> Dep0to1.SCORE <- getscore(8, Dep0to1.WOE)
> Dep1to2.SCORE <- getscore(8, Dep1to2.WOE)
> Dep2to3.SCORE <- getscore(8, Dep2to3.WOE)
> Dep3to5.SCORE <- getscore(8, Dep3to5.WOE)
> Depmorethan5.SCORE <- getscore(8, Depmorethan5.WOE)
> Dep.SCORE <- c(Deplessthan0.SCORE, Dep0to1.SCORE, Dep1to2.SCORE, Dep2to3.SCORE, Dep3to5.SCORE,
+ Depmorethan5.SCORE)
> Dep.SCORE
[1] -2 1 2 3 4 8
The final scorecard is as follows:
| Age | <=30 | (30, 35] | (35, 40] | (40, 45] | (45, 50] | (50, 55] | (55, 60] | (60, 65] | (65, 75] | >75 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Score | 12 | 10 | 7 | 5 | 4 | 1 | -5 | -11 | -19 | -26 |

| Worse | <=0 | (0, 1] | (1, 3] | (3, 5] | >5 |
| --- | --- | --- | --- | --- | --- |
| Score | -10 | 18 | 34 | 47 | 50 |

| Income (thousands) | <=1 | (1, 2] | (2, 3] | (3, 4] | (4, 5] | (5, 6] | (6, 7.5] | (7.5, 9.5] | (9.5, 12] | >12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Score | -24 | 10 | 7 | 8 | 5 | 2 | 0 | -5 | -8 | -10 |

| Late | <=0 | (0, 1] | (1, 3] | (3, 5] | (5, 10] | >10 |
| --- | --- | --- | --- | --- | --- | --- |
| Score | -7 | 36 | 51 | 62 | 63 | 44 |

| Real | <=0 | (0, 1] | (1, 2] | (2, 3] | (3, 5] | >5 |
| --- | --- | --- | --- | --- | --- | --- |
| Score | 4 | -4 | -3 | 1 | 8 | 21 |

| Worse2 | <=0 | (0, 1] | (1, 3] | (3, 5] | >5 |
| --- | --- | --- | --- | --- | --- |
| Score | -4 | 26 | 39 | 50 | 48 |

| Dep | <=0 | (0, 1] | (1, 2] | (2, 3] | (3, 5] | >5 |
| --- | --- | --- | --- | --- | --- | --- |
| Score | -2 | 1 | 2 | 3 | 4 | 8 |
A worked scoring example:

| Variable | Value | Points |
| --- | --- | --- |
| Age | 38 | 7 |
| Worse | 4 | 47 |
| Income | 1500 | 10 |
| Late | 2 | 51 |
| Real | 1.5 | -3 |
| Worse2 | 4 | 50 |
| Dep | 1.5 | 2 |
So this person's total credit score = base score + the points for each feature:
Total score = 446.5913 + 7 + 47 + 10 + 51 - 3 + 50 + 2 = 610.5913
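The worked example can be checked by adding the bin points to the base score; a short Python sketch using the intercept from the R session above:

```python
import math

p = 20 / math.log(2)
q = 600 - 20 * math.log(15) / math.log(2)
base = q + p * (-2.6086889)   # intercept coe[1] from the session above

# the applicant's bin points, read off the scorecard
points = {"Age": 7, "Worse": 47, "Income": 10, "Late": 51,
          "Real": -3, "Worse2": 50, "Dep": 2}
total = base + sum(points.values())
print(round(total, 4))  # → 610.5913
```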
Adapted from: http://blog.csdn.net/csqazwsxedc/article/details/51225156