R Package ‘smbinning’: Optimal Binning for Scoring Modeling


by Herman Jopia

What is Binning?

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

Why Binning?

Though there is some reticence about it [1], the benefits of binning are pretty straightforward:

  • It allows missing data and other special cases (e.g. the result of a division by zero) to be included in the model.
  • It controls or mitigates the impact of outliers on the model.
  • It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.
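
The first two benefits can be sketched in base R. This is only an illustration: the values of tob and the cutpoints are made up, not taken from the package's data. Missing values get their own attribute, and an extreme value simply falls into the last bin.

```r
# Hypothetical data: time on books in months, with missing values and one outlier
tob <- c(3, 15, NA, 48, 120, NA, 7, 600)

# Bin the observed values; the cutpoints are arbitrary, for illustration only
bins <- cut(tob, breaks = c(-Inf, 12, 36, Inf),
            labels = c("<=12", "13-36", ">36"))

# Give missing values their own attribute instead of dropping the records
bins <- addNA(bins)
levels(bins)[is.na(levels(bins))] <- "Missing"

# The outlier (600) shares the ">36" bin with 48 and 120
table(bins)
```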

Unsupervised Discretization

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.

Equal length intervals

  • Objective: Understand the distribution of a variable. 
  • Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
  • Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.


Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.
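
A minimal base R sketch of equal length binning, assuming simulated right-skewed data as a stand-in for a characteristic like Time on Books (the data are not the ones behind Table 1). The number of bins comes from Sturges' rule.

```r
set.seed(1)
# Simulated, right-skewed stand-in for a characteristic (assumption, not real data)
x <- rexp(500, rate = 1 / 24)

# Number of bins by Sturges' rule: ceiling(log2(n) + 1)
k <- nclass.Sturges(x)

# k intervals of equal length over the observed range
# (cut() extends the range slightly so the extremes are included)
bins <- cut(x, breaks = k)

# Records per bin: for a skewed variable the upper bins tend to be sparse
table(bins)
```

With few records in a bin, metrics computed against a binary target become unstable or undefined, which is the disadvantage noted above.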

Equal frequency intervals

  • Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
  • Example: Quartiles or Percentiles.
  • Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.

Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).
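
Equal frequency binning can be sketched the same way in base R, again on simulated data rather than the data behind Table 2. The cutpoints are the quartiles, so each bin holds roughly a quarter of the records.

```r
set.seed(1)
# Simulated stand-in for a continuous characteristic (assumption, not real data)
x <- rexp(500, rate = 1 / 24)

# Quartile cutpoints: minimum, 25%, 50%, 75%, maximum
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Each bin holds about 25% of the records
table(bins)
```

The cutpoints depend only on the distribution of x, so nothing guarantees they separate goods from bads well.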

Supervised Discretization

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.
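
To make the idea concrete, here is a rough base R sketch that searches for a single cutpoint on simulated data. It is not the algorithm used by 'smbinning', ChiMerge, or any other method mentioned here: it just scores a handful of candidate cutpoints (the deciles, an arbitrary choice) with the chi-square statistic of the resulting 2x2 table and keeps the best one.

```r
set.seed(1)
n <- 1000
# Simulated data (assumption): characteristic x and binary target y (1 = good),
# with the good rate jumping from 0.6 to 0.9 once x exceeds 40
x <- runif(n, 0, 100)
y <- rbinom(n, 1, ifelse(x > 40, 0.9, 0.6))

# Candidate cutpoints: the deciles of x, excluding the extremes
candidates <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# Chi-square statistic of the 2x2 table (x <= cutpoint) vs. y for each candidate
chisq <- sapply(candidates, function(cp) {
  tab <- table(x <= cp, y)
  unname(chisq.test(tab, correct = FALSE)$statistic)
})

# The candidate that best separates the two groups
best <- candidates[which.max(chisq)]
best
```

Real algorithms repeat this kind of search recursively or merge adjacent intervals, and add stopping rules such as a minimum share of records per bin.
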
In the past, analysts iteratively moved from Fine Binning to Coarse Binning, a very time-consuming process of manually and visually searching for the right cutpoints (if they were ever found). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two of the several techniques available [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
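
As a reference for these two metrics, the sketch below computes them in base R from made-up counts of goods and bads per bin. It assumes the common convention WoE = ln(share of goods / share of bads) and IV = sum of (share of goods - share of bads) x WoE; some authors flip the ratio, which changes the sign of WoE but not the IV.

```r
# Hypothetical counts of goods and bads per bin (not the article's data)
good <- c(100, 200, 300, 400)
bad  <- c( 60,  50,  40,  30)

dist_good <- good / sum(good)   # share of all goods falling in each bin
dist_bad  <- bad / sum(bad)     # share of all bads falling in each bin

# Weight of Evidence per bin and Information Value of the characteristic
woe <- log(dist_good / dist_bad)
iv  <- sum((dist_good - dist_bad) * woe)

data.frame(good, bad, bad_rate = bad / (good + bad), woe = woe)
iv
```

A bin with zero bads (or zero goods) makes the ratio inside the logarithm degenerate, so the WoE and the IV are no longer finite. This is the problem flagged in the caption of Table 1.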

An Example With 'smbinning'

Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins, taking into account the Credit Performance (Good/Bad), to establish the optimal cutpoints that yield meaningful and statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train <- subset(chileancredit, FlagSample == 1)
chileancredit.test <- subset(chileancredit, FlagSample == 0)

# Run and save results
result <- smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable

# Relevant plots (2x2 Page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray", main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist", sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE", sub = "Time on Books (Months)")


Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, significantly reduces the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analysis faster.
 
For more information about binning, the package's documentation on CRAN lists references related to the algorithm behind it, and its supporting website lists references on scoring model development.

References
[1] Dinero, T. (1996) Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, Vol. 8, No. 1, pp. 63-72.
[2] Garcia, S. et al. (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.
