Source | R友舍
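Binning is a simple and widely used preprocessing method. "Binning" means dividing the range of an attribute into sub-intervals: if an attribute value falls inside a given sub-interval, it is said to be placed in the "bin" that the sub-interval represents.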
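As a minimal illustration of the idea, independent of any binning package, base R's cut() assigns each value to the sub-interval that contains it. The values and break points below are hypothetical and chosen only for the example.

```r
# Hypothetical attribute values; NA stands for a missing observation
x <- c(3, 12, 25, 40, 58, 71, NA)

# Hypothetical break points defining four sub-intervals ("bins")
breaks <- c(-Inf, 20, 40, 60, Inf)

# cut() uses right-closed intervals by default: (-Inf,20], (20,40], ...
bins <- cut(x, breaks = breaks)

# Count how many values landed in each bin; missing values are reported separately
print(table(bins, useNA = "ifany"))
```

Because the intervals are right-closed, a value that sits exactly on a break point (40 here) goes into the lower bin.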
Data analysis tools such as SAS and SPSS each provide modules for binning continuous data; in R, the smbinning package is a common choice for this task.
library(smbinning)
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.3.2
## Loading required package: partykit
## Loading required package: grid
## Loading required package: Formula
Taking the chileancredit credit-bureau data as an example, we can use the smbinning package to bin the data. First, a quick look at the data:
head(chileancredit)
##    CustomerId TOB IncomeLevel   Bal01 MaxDqBin01 MaxDqBin02 MaxDqBin03
## 9  0000000185  44           1  604.86          0          0          0
## 13 0000000238  79           1 1006.21          0          0          0
## 21 0000000346 102           1  299.23          0          0          0
## 25 0000000460  NA           1  645.19          0          0          0
## 31 0000000549 109        <NA>  218.00          0          0          0
## 32 0000000559 183        <NA>   10.32          0          0          0
##    MaxDqBin04 MaxDqBin05 MaxDqBin06 MtgBal01 NonBankTradesDq01
## 9           0          0          0        0                 0
## 13          0          0          0        0                 0
## 21          0          0          0        0                 0
## 25          0          0          0        0                 0
## 31          0          0          0        0                 0
## 32          0          0          0        0                 0
##    NonBankTradesDq02 NonBankTradesDq03 NonBankTradesDq04 NonBankTradesDq05
## 9                  0                 0                 0                 0
## 13                 0                 0                 0                 0
## 21                 0                 0                 0                 0
## 25                 0                 0                 0                 0
## 31                 0                 0                 0                 0
## 32                 0                 0                 0                 0
##    NonBankTradesDq06 FlagGB FlagSample
## 9                  0      1          1
## 13                 0      1          1
## 21                 0      1          1
## 25                 0      1          1
## 31                 0      1          1
## 32                 0      1          1
str(chileancredit)
## 'data.frame': 7702 obs. of 19 variables:
##  $ CustomerId       : chr "0000000185" "0000000238" "0000000346" "0000000460" ...
##  $ TOB              : int 44 79 102 NA 109 183 172 76 136 171 ...
##  $ IncomeLevel      : Factor w/ 6 levels "0","1","2","3",..: 2 2 2 2 NA NA 1 2 1 1 ...
##  $ Bal01            : num 605 1006 299 645 218 ...
##  $ MaxDqBin01       : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MaxDqBin02       : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MaxDqBin03       : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 1 1 ...
##  $ MaxDqBin04       : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MaxDqBin05       : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MaxDqBin06       : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MtgBal01         : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ NonBankTradesDq01: int 0 0 0 0 0 0 0 0 0 0 ...
##  $ NonBankTradesDq02: int 0 0 0 0 0 0 0 0 0 0 ...
##  $ NonBankTradesDq03: int 0 0 0 0 0 0 0 0 0 0 ...
##  $ NonBankTradesDq04: int 0 0 0 0 0 0 0 1 0 0 ...
##  $ NonBankTradesDq05: int 0 0 0 0 0 0 0 1 0 0 ...
##  $ NonBankTradesDq06: int 0 0 0 0 0 0 0 1 0 0 ...
##  $ FlagGB           : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ FlagSample       : int 1 1 1 1 1 1 1 1 1 1 ...
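Split the dataset into a training set and a test set: rows with FlagSample = 1 form the training set, and the remaining rows (FlagSample = 0) form the test set: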
chileancredit.train=subset(chileancredit,FlagSample==1)
chileancredit.test=subset(chileancredit,FlagSample==0)
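Taking the TOB variable as an example, we can try binning it:
The smbinning package performs supervised binning by building a conditional inference tree (ctree), so a target label Y has to be defined in advance. Here the good/bad customer flag FlagGB is used as the target label for binning.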
result=smbinning(df=chileancredit.train,y="FlagGB",x="TOB",p=0.05)
## Loading required package: tcltk
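To make the conditional-inference-tree idea more concrete, here is a rough sketch that fits a one-variable tree with partykit::ctree() on simulated data. It is not the package's internal code: the data, the variable names x and y, and the use of p as a minimum share of records per terminal node (via minbucket) are assumptions made for illustration.

```r
library(partykit)  # provides ctree() and ctree_control()

set.seed(1)
n <- 2000

# Simulated data (assumption): the chance of a "good" outcome (y = 1) rises with x
d <- data.frame(x = runif(n, 0, 100))
d$y <- rbinom(n, 1, plogis(-1 + 0.04 * d$x))

# Assumed meaning of p: each bin must hold at least this share of the records
p <- 0.05
ct <- ctree(y ~ x, data = d,
            control = ctree_control(minbucket = ceiling(p * n)))

# The split points on x act as bin boundaries; terminal nodes act as bins
print(ct)
node_sizes <- table(predict(ct, type = "node"))
print(node_sizes)
```

Each terminal node of the tree corresponds to one bin, and the tree only splits where the relationship between x and the target is statistically significant, which is why the number of bins is not fixed in advance.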
Once the bins are built, smbinning also provides the smbinning.plot function to visualize the binning result:
par(mfrow=c(2,2))
boxplot(chileancredit.train$TOB~chileancredit.train$FlagGB, horizontal=TRUE, frame=FALSE, col="lightgray",main="Distribution")
mtext("Time on Books (Months)",3)
smbinning.plot(result,option="dist",sub="Time on Books (Months)")
smbinning.plot(result,option="badrate",sub="Time on Books (Months)")
smbinning.plot(result,option="WoE",sub="Time on Books (Months)")
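Plots 1 and 2 show the distribution of the variable. Plot 3 shows the share of records labeled bad in each bin; apart from the missing-value group, the bad rate decreases from bin to bin. Plot 4 shows the WoE (weight of evidence) of the binned variable. WoE is a very common variable-evaluation metric in scorecard models and reflects how well a bin separates good from bad labels: the larger the WoE, the higher the share of good labels in that bin; the smaller the WoE, the higher the share of bad labels.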
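For reference, the WoE of bin i is usually written as follows in scorecard work. This is the convention consistent with the statement that a larger WoE means a higher share of good labels; the symbols are introduced here for illustration and do not appear in the original article. G_i and B_i are the numbers of good and bad records in bin i, and G and B are the totals over all bins.

```latex
\mathrm{WoE}_i = \ln\!\left(\frac{G_i / G}{B_i / B}\right)
```

A bin whose share of all goods equals its share of all bads has WoE = 0; a bin that is relatively rich in goods has WoE > 0.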
Finally, smbinning also provides the smbinning.sql function, which turns the binning logic into SQL:
smbinning.sql(result)
## [1] "case when TOB <= 17 then '01: TOB <= 17' when TOB <= 30 then '02: TOB <= 30' when TOB <= 63 then '03: TOB <= 63' when TOB > 63 then '04: TOB > 63' when TOB Is Null then 'TOB Is Null' else '99: Error' end "
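The CASE expression above can be mirrored in base R to see how individual values are labeled. This is a sketch only: the helper name tob_bin is hypothetical, and the cut points 17, 30 and 63 are copied from the generated SQL.

```r
# Hypothetical helper that reproduces the generated CASE logic for TOB
tob_bin <- function(tob) {
  out <- rep("99: Error", length(tob))
  known <- !is.na(tob)
  # Assign from the widest condition to the narrowest so that the narrowest
  # match wins, which is what the first-match rule of CASE WHEN produces
  out[known & tob > 63]  <- "04: TOB > 63"
  out[known & tob <= 63] <- "03: TOB <= 63"
  out[known & tob <= 30] <- "02: TOB <= 30"
  out[known & tob <= 17] <- "01: TOB <= 17"
  out[!known]            <- "TOB Is Null"
  out
}

# The first four values are the TOB values of the first rows shown by head()
labels <- tob_bin(c(44, 79, 102, NA, 17, 30))
print(labels)
```

Values that sit exactly on a cut point (17 and 30 here) fall into the lower bin, and missing values get their own label instead of being dropped.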
Based on the binning result, the smbinning.gen function can also be used to derive the binned variable directly on the original dataset:
chileancredit=smbinning.gen(chileancredit, result, chrname = "gTOB")
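The smbinning package also provides the IV metric. IV (Information Value) measures how well a variable separates the classes of the target label and is commonly used to evaluate scorecard variables. In general, the higher the IV, the stronger the variable's predictive power. When building a scorecard, analysts also tend to prefer variables with a high IV and a clear linear relationship as candidates for the model's selection set.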
sumivt=smbinning.sumiv(chileancredit.train,y="FlagGB")
print(sumivt)
##                 Char     IV                  Process
## 5         MaxDqBin01 2.3771        Factor binning OK
## 6         MaxDqBin02 1.8599        Factor binning OK
## 12 NonBankTradesDq01 1.8129       Numeric binning OK
## 13 NonBankTradesDq02 1.4417       Numeric binning OK
## 7         MaxDqBin03 1.3856        Factor binning OK
## 14 NonBankTradesDq03 1.1819       Numeric binning OK
## 8         MaxDqBin04 1.0729        Factor binning OK
## 15 NonBankTradesDq04 0.8948       Numeric binning OK
## 9         MaxDqBin05 0.8844        Factor binning OK
## 16 NonBankTradesDq05 0.7511       Numeric binning OK
## 10        MaxDqBin06 0.6302        Factor binning OK
## 17 NonBankTradesDq06 0.5501       Numeric binning OK
## 2                TOB 0.5025       Numeric binning OK
## 3        IncomeLevel 0.3380        Factor binning OK
## 11          MtgBal01 0.1452       Numeric binning OK
## 4              Bal01 0.1379       Numeric binning OK
## 1         CustomerId     NA   Not numeric nor factor
## 18        FlagSample     NA Uniques values of x < 10
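To show where such IV figures come from, the sketch below computes WoE and IV by hand in base R. The good/bad counts are hypothetical, and the formulas assume the common scorecard convention: WoE of a bin is the log of its share of goods over its share of bads, and IV is the sum over bins of (share of goods minus share of bads) times WoE.

```r
# Hypothetical counts of good and bad customers in four bins (illustration only)
good <- c(100, 200, 300, 400)
bad  <- c( 80,  60,  40,  20)

# Share of all goods and of all bads that falls into each bin
pct_good <- good / sum(good)
pct_bad  <- bad / sum(bad)

# Weight of evidence per bin: positive where goods are over-represented
woe <- log(pct_good / pct_bad)

# Each bin's contribution to the information value, and the total
iv_by_bin <- (pct_good - pct_bad) * woe
iv <- sum(iv_by_bin)

print(data.frame(bin = seq_along(good), good, bad,
                 woe = round(woe, 4), iv = round(iv_by_bin, 4)))
print(iv)
```

Every bin's contribution is non-negative, because the difference in shares and the WoE always have the same sign, so IV grows as the good and bad distributions across bins drift apart.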
Finally, the smbinning.sumiv.plot function can be used to sort the variables by IV and plot them, which makes variable selection easier for the analyst.
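As a rough base-R stand-in for that ranking plot, the sketch below sorts the IV values and draws them as a horizontal bar chart. The numbers are copied from the smbinning.sumiv() output printed earlier; the plotting choices are assumptions and do not reproduce the package's own chart.

```r
# IV values copied from the printed smbinning.sumiv() output
# (variables whose IV is NA are left out)
iv <- c(MaxDqBin01 = 2.3771, MaxDqBin02 = 1.8599, NonBankTradesDq01 = 1.8129,
        NonBankTradesDq02 = 1.4417, MaxDqBin03 = 1.3856, NonBankTradesDq03 = 1.1819,
        MaxDqBin04 = 1.0729, NonBankTradesDq04 = 0.8948, MaxDqBin05 = 0.8844,
        NonBankTradesDq05 = 0.7511, MaxDqBin06 = 0.6302, NonBankTradesDq06 = 0.5501,
        TOB = 0.5025, IncomeLevel = 0.3380, MtgBal01 = 0.1452, Bal01 = 0.1379)

# Sort ascending so that the strongest variable ends up at the top of the chart
iv_sorted <- sort(iv)

# Widen the left margin so the long variable names fit
op <- par(mar = c(4, 10, 2, 1))
barplot(iv_sorted, horiz = TRUE, las = 1,
        xlab = "Information Value", main = "IV by variable")
par(op)

# With smbinning loaded, the package call would presumably be:
# smbinning.sumiv.plot(sumivt)
```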