【R语言 评分模型】R语言建立信用评分模型

1、数据源:
我们将会使用在信用评级建模中非常常用的德国信贷数据(German credit dataset)作为建模的数据集。德国信贷数据共有1000条数据,每条数据20个特征。

2、数据源下载:
https://github.com/frankhlchi/R-scorecard

3、建模过程

这里写图片描述

4、完整版(源代码):

rm(list=ls())
gc()
library(caret)
library(smbinning)
library(ggplot2)
library(woe)

#load the data
german_credit <- read.csv("C:/pic/credit/german_credit.csv",sep = ",")
train <-createDataPartition(y=german_credit$Creditability,p=0.75,list=FALSE)
train2 <- german_credit[train, ]
test2 <- german_credit[-train, ]

#Explore data distribution 
ggplot(german_credit, aes(x = Duration,y = ..count..,)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2,binwidth = 5)
ggplot(german_credit, aes(x = CreditAmount,y = ..count..,)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2,binwidth = 1000)
ggplot(german_credit, aes(x = Age,y = ..count..,)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2,binwidth = 5)
ggplot(german_credit, aes(x =Creditability,y = ..count..,)) + geom_histogram(fill = "blue", colour = "grey60" , alpha = 0.2,binwidth = 0.5)

#Optimal Binning
Durationresult=smbinning(df=train2,y="Creditability",x="Duration",p=0.05)
CreditAmountresult=smbinning(df=train2,y="Creditability",x="CreditAmount",p=0.05) 
Ageresult=smbinning(df=train2,y="Creditability",x="Age",p=0.05) 
smbinning.plot(CreditAmountresult,option="WoE",sub="CreditAmount") 
smbinning.plot(Durationresult,option="WoE",sub="Duration")
smbinning.plot(Ageresult,option="WoE",sub="Age")



#Univariate Analysis
AccountBalancewoe=woe(train2, "AccountBalance",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
ggplot(AccountBalancewoe, aes(x = BIN, y = -WOE)) + geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+labs(title = "AccountBalance") 
ValueSavingswoe=woe(train2, "ValueSavings",Continuous = F, "Creditability",C_Bin = 5,Good = "1",Bad = "0")
ggplot(ValueSavingswoe, aes(x = BIN, y = -WOE)) + geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+labs(title = "ValueSavings") 
Lengthofcurrentemploymentwoe=woe(train2, "Lengthofcurrentemployment",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
ggplot(Lengthofcurrentemploymentwoe, aes(x = BIN, y = -WOE)) + geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+labs(title = "Lengthofcurrentemployment") 

#combine some bins
for(i in 1:750){
  if(train2$ValueSavings[i]==1){train2$ValueSavings[i]=2}
}
for(i in 1:750){
  if(train2$Lengthofcurrentemployment[i]==5){train2$Lengthofcurrentemployment[i]=4}
}

library(corrplot)
#correlation analysis
cor1<-cor(train)
corrplot(cor1,tl.cex = 0.5)

#Infomation Value calculation
for(i in 1:1000){
  if(german_credit$Duration[i]<=11){german_credit$Duration[i]=1}
  else if(german_credit$Duration[i]<=30){german_credit$Duration[i]=2}
  else{german_credit$Duration[i]=3}
}

for(i in 1:1000){
  if(german_credit$Age[i]<=25){german_credit$Age[i]=1}
  else{german_credit$Age[i]=2}
}

for(i in 1:1000){
  if(german_credit$CreditAmount[i]<=6742){german_credit$CreditAmount[i]=1}
  else{german_credit$CreditAmount[i]=2}
}


for(i in 1:1000){
  if(german_credit$ValueSavings[i]==1){german_credit$ValueSavings[i]=2}
}

for(i in 1:1000){
  if(german_credit$Lengthofcurrentemployment[i]==5){german_credit$Lengthofcurrentemployment[i]=4}
}

AccountBalancewoe=woe(train2, "AccountBalance",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Durationwoe=woe(train2, "Duration",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
PaymentStatusofPreviousCreditwoe=woe(train2, "PaymentStatusofPreviousCredit",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Purposewoe = woe(train2, "Purpose",Continuous = F, "Creditability",C_Bin = 11,Good = "1",Bad = "0")
CreditAmountwoe= woe(train2, "CreditAmount",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ValueSavingswoe =woe(train2, "ValueSavings",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Lengthofcurrentemploymentwoe=woe(train2, "Lengthofcurrentemployment",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Instalmentpercenwoet=woe(train2, "Instalmentpercent",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Sex.Marital.Statuswoe=woe(train2, "Sex.Marital.Status",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Guarantorswoe=woe(train2, "Guarantors",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
DurationinCurrentaddresswoe=woe(train2, "DurationinCurrentaddress",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Mostvaluableavailableassetwoe=woe(train2, "Mostvaluableavailableasset",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Agewoe=woe(train2, "Age",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ConcurrentCreditswoe=woe(train2, "ConcurrentCredits",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
Typeofapartmentwoe=woe(train2, "Typeofapartment",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
NoofCreditatthisBankwoe=woe(train2, "NoofCreditatthisBank",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
Occupationwoe=woe(train2, "Occupation",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Noofdependentswoe=woe(train2, "Noofdependents",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
Telephonewoe=woe(train2, "Telephone",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ForeignWorkerwoe=woe(train2, "ForeignWorker",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")

va = c("AccountBalance",  "Duration",   "PaymentStatusofPreviousCredit",    "Purpose",  "CreditAmount", "ValueSavings", "Lengthofcurrentemployment",    "Instalmentpercent","Sex.Marital.Status","Guarantors","DurationinCurrentaddress",   "Mostvaluableavailableasset",   "Age","ConcurrentCredits","Typeofapartment",    "NoofCreditatthisBank","Occupation","Noofdependents",   "Telephone" ,"ForeignWorker")
iv=c(sum(AccountBalancewoe$IV),sum(Durationwoe$IV),sum(PaymentStatusofPreviousCreditwoe$IV),sum(Purposewoe$IV),sum(CreditAmountwoe$IV),sum(ValueSavingswoe$IV),sum(Lengthofcurrentemploymentwoe$IV) ,sum(Instalmentpercenwoet$IV) ,sum(Sex.Marital.Statuswoe$IV) ,sum(Guarantorswoe$IV)  ,sum(DurationinCurrentaddresswoe$IV) ,sum(Mostvaluableavailableassetwoe$IV),sum(Agewoe$IV),sum(ConcurrentCreditswoe$IV),sum(Typeofapartmentwoe$IV),sum(NoofCreditatthisBankwoe$IV),sum(Occupationwoe$IV), sum(Noofdependentswoe$IV), sum(Telephonewoe$IV),sum(ForeignWorkerwoe$IV))
infovalue = data.frame(va,iv)
ggplot(infovalue, aes(x = va, y = iv)) + geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+labs(title = "Information value")+ theme(axis.text.x=element_text(angle=90,colour="black",size=10));

#WoE transformation
german_credit$DurationinCurrentaddress=NULL
german_credit$Guarantors=NULL
german_credit$Instalmentpercent=NULL
german_credit$NoofCreditatthisBank=NULL
german_credit$Occupation=NULL
german_credit$Noofdependents=NULL
german_credit$Telephone=NULL
AccountBalancewoe=woe(train2, "AccountBalance",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Durationwoe=woe(train2, "Duration",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
PaymentStatusofPreviousCreditwoe=woe(train2, "PaymentStatusofPreviousCredit",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Purposewoe = woe(train2, "Purpose",Continuous = F, "Creditability",C_Bin = 11,Good = "1",Bad = "0")
CreditAmountwoe= woe(train2, "CreditAmount",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ValueSavingswoe =woe(train2, "ValueSavings",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Lengthofcurrentemploymentwoe=woe(train2, "Lengthofcurrentemployment",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Sex.Marital.Statuswoe=woe(train2, "Sex.Marital.Status",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Mostvaluableavailableassetwoe=woe(train2, "Mostvaluableavailableasset",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Agewoe=woe(train2, "Age",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ConcurrentCreditswoe=woe(train2, "ConcurrentCredits",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
Typeofapartmentwoe=woe(train2, "Typeofapartment",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
ForeignWorkerwoe=woe(train2, "ForeignWorker",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")

for(i in 1:1000){

  for(s in 1:4){
    if(german_credit$AccountBalance[i]==s){
      german_credit$AccountBalance[i]=-AccountBalancewoe$WOE[s]
    }
  }

  for(s in 1:3){
    if(german_credit$Duration[i]==s){
      german_credit$Duration[i]=-Durationwoe$WOE[s]
    }
  }

  for(s in 0:4){
    if(german_credit$PaymentStatusofPreviousCredit[i]==s){
      german_credit$PaymentStatusofPreviousCredit[i]=-PaymentStatusofPreviousCreditwoe$WOE[s+1]
    }
  }

  for(s in 0:10){
    if(s<=6){
      if(german_credit$Purpose[i]==s){
        german_credit$Purpose[i]=-Purposewoe$WOE[s+1]
      }
    }else{
      if(german_credit$Purpose[i]==s){
        german_credit$Purpose[i]=-Purposewoe$WOE[s]
      }
    }
  }

  for(s in 1:2){
    if(german_credit$CreditAmount[i]==s){
      german_credit$CreditAmount[i]=-CreditAmountwoe$WOE[s]
    }
  }

  for(s in 2:5){
    if(german_credit$ValueSavings[i]==s){
      german_credit$ValueSavings[i]=-ValueSavingswoe$WOE[s-1]
    }
  }

  for(s in 1:5){
    if(german_credit$Lengthofcurrentemployment[i]==s){
      german_credit$Lengthofcurrentemployment[i]=-Lengthofcurrentemploymentwoe$WOE[s]
    }
  }

  for(s in 1:5){
    if(german_credit$Sex.Marital.Status[i]==s){
      german_credit$Sex.Marital.Status[i]=-Sex.Marital.Statuswoe$WOE[s]
    }
  }

  for(s in 1:4){
    if(german_credit$Mostvaluableavailableasset[i]==s){
      german_credit$Mostvaluableavailableasset[i]=-Mostvaluableavailableassetwoe$WOE[s]
    }
  }

  for(s in 1:2){
    if(german_credit$Age[i]==s){
      german_credit$Age[i]=-Agewoe$WOE[s]
    }
  }

  for(s in 1:5){
    if(german_credit$ConcurrentCredits[i]==s){
      german_credit$ConcurrentCredits[i]=-ConcurrentCreditswoe$WOE[s]
    }
  }

  for(s in 1:5){
    if(german_credit$Typeofapartment[i]==s){
      german_credit$Typeofapartment[i]=-Typeofapartmentwoe$WOE[s]
    }
  }

  for(s in 1:2){
    if(german_credit$ForeignWorker[i]==s){
      german_credit$ForeignWorker[i]=-ForeignWorkerwoe$WOE[s]
    }
  }
}

#Stepwise regression & Logistic model buiding
fit<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +Sex.Marital.Status+ Mostvaluableavailableasset + Age + ConcurrentCredits + Typeofapartment + ForeignWorker,train2,family = "binomial")
backwards = step(fit)
summary(backwards)
fit2<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment + Age + ConcurrentCredits  + ForeignWorker,train2,family = "binomial")
summary(fit2)
fit3<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment + Age  + ForeignWorker,train2,family = "binomial")
summary(fit3)

#VIF testing
library(car)
vif(fit3, digits =3)

#producing confusion matrix
prediction <- predict(fit3,newdata=test2)
for (i in 1:250) {
  if(prediction[i]>0.99){
    prediction[i]=1}
  else
  {prediction[i]=0}
}
confusionMatrix(prediction, test2$Creditability)

coe = (fit3$coefficients)
p <- 20/log(2)
q <- 600-20*log(2.5)/log(2)

base <- q + p*as.numeric(coe[1])

#score for AccountBalance
AccountBalanceSCORE = p*as.numeric(coe[2])*AccountBalancewoe$woe[1]*-1

p*as.numeric(coe[2])*AccountBalancewoe$WOE[1]*-1
p*as.numeric(coe[2])*AccountBalancewoe$WOE[2]*-1
p*as.numeric(coe[2])*AccountBalancewoe$WOE[3]*-1
p*as.numeric(coe[2])*AccountBalancewoe$WOE[4]*-1

#score for Duration
p*as.numeric(coe[3])*Durationwoe$WOE[1]*-1
p*as.numeric(coe[3])*Durationwoe$WOE[2]*-1
p*as.numeric(coe[3])*Durationwoe$WOE[3]*-1

#score for PaymentStatusofPreviousCredit
p*as.numeric(coe[4])*PaymentStatusofPreviousCreditwoe$WOE[1]*-1
p*as.numeric(coe[4])*PaymentStatusofPreviousCreditwoe$WOE[2]*-1
p*as.numeric(coe[4])*PaymentStatusofPreviousCreditwoe$WOE[3]*-1
p*as.numeric(coe[4])*PaymentStatusofPreviousCreditwoe$WOE[4]*-1
p*as.numeric(coe[4])*PaymentStatusofPreviousCreditwoe$WOE[5]*-1

#score for purpose 
for(i in 1:10){
  print(p*as.numeric(coe[5])*Purposewoe$WOE[i])*-1
}

p*as.numeric(coe[6])*CreditAmountwoe$WOE[1]*-1
p*as.numeric(coe[6])*CreditAmountwoe$WOE[2]*-1

p*as.numeric(coe[7])*ValueSavingswoe$WOE[1]*-1
p*as.numeric(coe[7])*ValueSavingswoe$WOE[2]*-1
p*as.numeric(coe[7])*ValueSavingswoe$WOE[3]*-1
p*as.numeric(coe[7])*ValueSavingswoe$WOE[4]*-1

p*as.numeric(coe[8])*Lengthofcurrentemploymentwoe$WOE[1]*-1
p*as.numeric(coe[8])*Lengthofcurrentemploymentwoe$WOE[2]*-1
p*as.numeric(coe[8])*Lengthofcurrentemploymentwoe$WOE[3]*-1
p*as.numeric(coe[8])*Lengthofcurrentemploymentwoe$WOE[4]*-1

p*as.numeric(coe[9])*Agewoe$WOE[1]*-1
p*as.numeric(coe[9])*Agewoe$WOE[2]*-1

p*as.numeric(coe[10])*ForeignWorkerwoe$WOE[1]*-1
p*as.numeric(coe[10])*ForeignWorkerwoe$WOE[2]*-1

建模方法总结篇:

1、变量分箱
在评分卡建模中,变量分箱(binning)是对连续变量离散化(discretization)的一种称呼。要将logistic模型转换为标准评分卡的形式,这一环节是必须完成的。信用评分卡开发中一般有常用的等距分段、等深分段、最优分段。

其中等距分段(Equval length intervals)是指分段的区间是一致的,比如年龄以十年作为一个分段;等深分段(Equal frequency intervals)是先确定分段数量,然后令每个分段中数据数量大致相等;最优分段(Optimal Binning)又叫监督离散化(supervised discretizaion),使用递归划分(Recursive Partitioning)将连续变量分为分段,背后是一种基于条件推断查找较佳分组的算法(Conditional Inference Tree)

2、单变量分析
在风险建模的过程中,变量选择可以具体细化为单变量变量筛选 (Univariate Variable Selection)和多变量变量筛选 (Multivariate Variable Selection)。多变量变量筛选一般会利用Stepwise算法在变量池中选取最优变量。 而单变量筛选,或者说单变量分析,是通过比较指标分箱和对应分箱的违约概率来确定指标是否符合经济意义。

3、相关性分析 & IV(信息值)筛选
我们会用经过清洗后的数据看一下变量间的相关性。注意,这里的相关性分析只是初步的检查,进一步检查模型的多重共线性还需要通过 VIF(variance inflation factor)也就是 方差膨胀因子进行检验。

4、StepWise多变量分析 & Logistic模型建立
在进行StepWise分析前,我们需要将筛选后的变量转换为WoE值并建立Logistic模型。
将经过WoE转换的数据放入Logistic模型中建模,并使用向后逐步回归方法
为防止多重共线性问题的出现,我们对模型进行VIF检验:

5、模型检验
到这里,我们的建模部分基本结束了。我们需要验证一下模型的预测能力如何。我们使用在建模开始阶段预留的250条数据进行检验:
模型的精度达到了0.72,模型表现一般。这同Logistic模型本身的局限性有关。传统的回归模型精度一般都会弱于决策树、SVM等机器挖掘算法。

6、打分卡转换
我们在上一部分,我们已经基本完成了建模相关的工作,并用混淆矩阵验证了模型的预测能力。接下来的步骤,就是将Logistic模型转换为标准打分卡的形式。
在建立标准评分卡之前,我们需要选取几个评分卡参数:基础分值、 PDO(比率翻倍的分值)和好坏比。 这里, 我们取600分为基础分值,PDO为20 (每高20分好坏比翻一倍),好坏比取2.5。;可得下式:

620 = q - p * log(2.5)
600 = q - p * log(2.5/2)
p = 20/log(2)
q =600-20*log(2.5)/log(2)
  • 1
    点赞
  • 63
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

东华果汁哥

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值