correlationfunnel
A correlation funnel plot, used to quickly explore how the predictor variables correlate with a target variable.
# install.packages(c("correlationfunnel", "rchallenge"))  # rchallenge provides the german data set
library(correlationfunnel)
library(tidyverse)
Load the data
data("german", package = "rchallenge")
german %>% glimpse()
## Rows: 1,000
## Columns: 21
## $ status <fct> no checking account, no checking account, ... …
## $ duration <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history <fct> all credits at this bank paid back duly, all c…
## $ purpose <fct> car (used), others, retraining, others, others…
## $ amount <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex <fct> female : non-single or male : single, male : m…
## $ other_debtors <fct> none, none, none, none, none, none, none, none…
## $ present_residence <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property <fct> car or other, unknown / no property, unknown /…
## $ age <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing <fct> for free, for free, for free, for free, rent, …
## $ number_credits <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job <fct> skilled employee/official, skilled employee/of…
## $ people_liable <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk <fct> good, good, good, good, good, good, good, good…
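This is the German credit data set, available from the rchallenge package. The goal is to use the 20 explanatory variables to predict the response variable credit_risk (good/bad).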
1. Convert the data to binary features
We use the binarize() function to generate a feature set of binary (0/1) variables.
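- Numeric data: binned (using n_bins) into categorical data, after which all categorical data is one-hot encoded to produce binary features. To keep low-frequency categories (common in high-cardinality variables) from inflating the dimensionality (the width of the resulting data frame), we set thresh_infreq = 0.1 and name_infreq = "OTHER" so that categories below the threshold are grouped into a single OTHER category.
- Categorical data: one-hot encoding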
german_binarized_tbl = german %>%
  correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.1, name_infreq = "OTHER", one_hot = TRUE)
german_binarized_tbl %>%
  glimpse()
## Rows: 1,000
## Columns: 74
## $ status__no_checking_account <dbl> 1, 1, 0,…
## $ `status__..._<_0_DM` <dbl> 0, 0, 1,…
## $ `status__..._>=_200_DM_/_salary_for_at_least_1_year` <dbl> 0, 0, 0,…
## $ status__OTHER <dbl> 0, 0, 0,…
## $ `duration__-Inf_12` <dbl> 0, 1, 1,…
## $ duration__12_15 <dbl> 0, 0, 0,…
## $ duration__15_24 <dbl> 1, 0, 0,…
## $ duration__24_30 <dbl> 0, 0, 0,…
## $ duration__30_Inf <dbl> 0, 0, 0,…
## $ `credit_history__no_credits_taken/all_credits_paid_back_duly` <dbl> 0, 0, 1,…
## $ credit_history__all_credits_at_this_bank_paid_back_duly <dbl> 1, 1, 0,…
## $ credit_history__OTHER <dbl> 0, 0, 0,…
## $ purpose__others <dbl> 0, 1, 0,…
## $ `purpose__car_(new)` <dbl> 0, 0, 0,…
## $ `purpose__car_(used)` <dbl> 1, 0, 0,…
## $ `purpose__furniture/equipment` <dbl> 0, 0, 0,…
## $ purpose__OTHER <dbl> 0, 0, 1,…
## $ `amount__-Inf_1262` <dbl> 1, 0, 1,…
## $ amount__1262_1906.8 <dbl> 0, 0, 0,…
## $ amount__1906.8_2852.4 <dbl> 0, 1, 0,…
## $ amount__2852.4_4720 <dbl> 0, 0, 0,…
## $ amount__4720_Inf <dbl> 0, 0, 0,…
## $ `savings__unknown/no_savings_account` <dbl> 1, 1, 0,…
## $ `savings__..._<__100_DM` <dbl> 0, 0, 1,…
## $ `savings__..._>=_1000_DM` <dbl> 0, 0, 0,…
## $ savings__OTHER <dbl> 0, 0, 0,…
## $ `employment_duration__<_1_yr` <dbl> 1, 0, 0,…
## $ `employment_duration__1_<=_..._<_4_yrs` <dbl> 0, 1, 0,…
## $ `employment_duration__4_<=_..._<_7_yrs` <dbl> 0, 0, 1,…
## $ `employment_duration__>=_7_yrs` <dbl> 0, 0, 0,…
## $ employment_duration__OTHER <dbl> 0, 0, 0,…
## $ installment_rate__1 <dbl> 0, 0, 0,…
## $ installment_rate__2 <dbl> 0, 1, 1,…
## $ installment_rate__3 <dbl> 0, 0, 0,…
## $ installment_rate__4 <dbl> 1, 0, 0,…
## $ `personal_status_sex__female_:_non-single_or_male_:_single` <dbl> 1, 0, 1,…
## $ `personal_status_sex__male_:_married/widowed` <dbl> 0, 1, 0,…
## $ personal_status_sex__OTHER <dbl> 0, 0, 0,…
## $ other_debtors__none <dbl> 1, 1, 1,…
## $ other_debtors__OTHER <dbl> 0, 0, 0,…
## $ present_residence__1 <dbl> 0, 0, 0,…
## $ present_residence__2 <dbl> 0, 1, 0,…
## $ present_residence__3 <dbl> 0, 0, 0,…
## $ present_residence__4 <dbl> 1, 0, 1,…
## $ `property__unknown_/_no_property` <dbl> 0, 1, 1,…
## $ property__car_or_other <dbl> 1, 0, 0,…
## $ `property__building_soc._savings_agr./life_insurance` <dbl> 0, 0, 0,…
## $ property__real_estate <dbl> 0, 0, 0,…
## $ `age__-Inf_26` <dbl> 1, 0, 1,…
## $ age__26_30 <dbl> 0, 0, 0,…
## $ age__30_36 <dbl> 0, 1, 0,…
## $ age__36_44 <dbl> 0, 0, 0,…
## $ age__44_Inf <dbl> 0, 0, 0,…
## $ other_installment_plans__bank <dbl> 0, 0, 0,…
## $ other_installment_plans__none <dbl> 1, 1, 1,…
## $ other_installment_plans__OTHER <dbl> 0, 0, 0,…
## $ housing__for_free <dbl> 1, 1, 1,…
## $ housing__rent <dbl> 0, 0, 0,…
## $ housing__own <dbl> 0, 0, 0,…
## $ number_credits__1 <dbl> 1, 0, 1,…
## $ `number_credits__2-3` <dbl> 0, 1, 0,…
## $ number_credits__OTHER <dbl> 0, 0, 0,…
## $ `job__unskilled_-_resident` <dbl> 0, 0, 1,…
## $ `job__skilled_employee/official` <dbl> 1, 1, 0,…
## $ `job__manager/self-empl./highly_qualif._employee` <dbl> 0, 0, 0,…
## $ job__OTHER <dbl> 0, 0, 0,…
## $ people_liable__3_or_more <dbl> 0, 1, 0,…
## $ people_liable__0_to_2 <dbl> 1, 0, 1,…
## $ telephone__no <dbl> 1, 1, 1,…
## $ `telephone__yes_(under_customer_name)` <dbl> 0, 0, 0,…
## $ foreign_worker__no <dbl> 1, 1, 1,…
## $ foreign_worker__OTHER <dbl> 0, 0, 0,…
## $ credit_risk__bad <dbl> 0, 0, 0,…
## $ credit_risk__good <dbl> 1, 1, 1,…
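To make the binarization step less of a black box, here is a rough base R sketch of the same idea: bin a numeric column, lump rare categories into "OTHER", then one-hot encode. It is not the package's implementation. The toy data frame, the one_hot() helper and the 0.2 threshold are made up for illustration, and the quantile-based cut points are an assumption suggested by the bin edges in the output above.

```r
# Toy data (made-up values, NOT the german data set)
toy <- data.frame(
  amount  = c(500, 1200, 1800, 2500, 3100, 4000, 5200, 6100, 7500, 9000),
  purpose = c("car", "car", "car", "car", "tv", "tv", "tv", "tv", "tv", "boat"),
  stringsAsFactors = FALSE
)

# Step 1: numeric -> categorical, cutting at quantiles into n_bins intervals
# (assumption: quantile-based cut points with open-ended outer bins)
n_bins <- 5
breaks <- quantile(toy$amount, probs = seq(0, 1, length.out = n_bins + 1))
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf
amount_bin <- cut(toy$amount, breaks = breaks)

# Step 2: lump categories whose relative frequency is below the threshold
# (0.2 is used here only because the toy data has just 10 rows)
thresh_infreq <- 0.2
freq <- prop.table(table(toy$purpose))
rare <- names(freq)[freq < thresh_infreq]
purpose_lumped <- ifelse(toy$purpose %in% rare, "OTHER", toy$purpose)

# Step 3: one-hot encode, one 0/1 column per level, named feature__level
# (one_hot() is a hypothetical helper, not a correlationfunnel function)
one_hot <- function(x, prefix) {
  x <- factor(x)
  m <- sapply(levels(x), function(l) as.numeric(x == l))
  colnames(m) <- paste0(prefix, "__", levels(x))
  m
}

toy_binarized <- cbind(
  one_hot(amount_bin, "amount"),
  one_hot(purpose_lumped, "purpose")
)
colnames(toy_binarized)
```

Every row ends up with exactly one 1 per original variable, which is why the 21 columns of german expand to 74 binary columns.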
2. Correlate the features with the target
customer_risk_tbl = german_binarized_tbl %>%
  correlationfunnel::correlate(credit_risk__good)
customer_risk_tbl %>% glimpse()
## Rows: 74
## Columns: 3
## $ feature <fct> credit_risk, credit_risk, status, status, credit_history, …
## $ bin <chr> "bad", "good", "..._>=_200_DM_/_salary_for_at_least_1_year…
## $ correlation <dbl> -1.00000000, 1.00000000, 0.32243570, -0.25833347, 0.181713…
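Conceptually, this step computes the correlation of every binary column with the chosen target column and splits each column name into a feature and a bin. The base R sketch below assumes plain Pearson correlation via cor(). The toy_bin data frame and its values are invented for illustration.

```r
# Toy binary features (made-up values, NOT the german data set)
toy_bin <- data.frame(
  risk__good    = c(1, 1, 1, 0, 0, 1, 0, 1),
  risk__bad     = c(0, 0, 0, 1, 1, 0, 1, 0),
  savings__high = c(1, 1, 0, 0, 0, 1, 0, 1),
  savings__low  = c(0, 0, 1, 1, 1, 0, 1, 0)
)
target <- "risk__good"

# Pearson correlation of each 0/1 column with the target column
corr <- sapply(toy_bin, function(col) cor(col, toy_bin[[target]]))

# Split "feature__bin" column names into their two parts
parts <- strsplit(names(corr), "__", fixed = TRUE)
corr_tbl <- data.frame(
  feature     = sapply(parts, `[`, 1),
  bin         = sapply(parts, `[`, 2),
  correlation = unname(corr)
)

# Strongest relationships (by absolute value) first
corr_tbl <- corr_tbl[order(-abs(corr_tbl$correlation)), ]
corr_tbl
```

The target correlates perfectly with itself (1) and with its complement (-1), which explains the first two rows of customer_risk_tbl above.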
3. Plot the correlation funnel
A correlation funnel is a tornado plot that lists the features with the highest correlation (by absolute magnitude) at the top of the plot and the features with the lowest correlation at the bottom. The resulting visualization looks like a funnel.
customer_risk_tbl %>%
  correlationfunnel::plot_correlation_funnel(interactive = FALSE)
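The funnel shape comes purely from the ordering: features are ranked by their largest absolute correlation, and each bin is drawn as a point on its feature's row. Here is a base graphics sketch of that layout. The funnel data frame and its values are hypothetical, and the result only approximates what plot_correlation_funnel() draws.

```r
# Hypothetical correlation table (made-up values)
funnel <- data.frame(
  feature     = c("risk", "risk", "savings", "savings", "housing", "housing"),
  bin         = c("good", "bad", "high", "low", "own", "rent"),
  correlation = c(1, -1, 0.45, -0.45, 0.12, -0.12)
)

# Rank features by their largest absolute correlation (weakest first,
# so the weakest feature gets y = 1 and sits at the bottom of the plot)
strength <- tapply(abs(funnel$correlation), funnel$feature, max)
feature_order <- names(strength)[order(strength)]
funnel$y <- match(funnel$feature, feature_order)

# One row per feature, one labelled point per bin, dashed line at zero
plot(funnel$correlation, funnel$y,
     xlim = c(-1, 1), ylim = c(0.5, length(feature_order) + 0.5),
     yaxt = "n", pch = 19, xlab = "correlation", ylab = "")
axis(2, at = seq_along(feature_order), labels = feature_order, las = 1)
abline(v = 0, lty = 2)
text(funnel$correlation, funnel$y, labels = funnel$bin, pos = 3, cex = 0.8)
```

Points spread widest on the top rows and narrow toward zero on the bottom rows, which gives the funnel outline.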
4. Examine the results more closely
customer_risk_tbl %>%
  filter(feature %in% c("status", "credit_history", "duration",
                        "savings", "amount", "housing", "property", "age")) %>%
  correlationfunnel::plot_correlation_funnel(interactive = FALSE, limits = c(-0.5, 0.5))
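From the plot, the following feature bins stand out as strongly related to the target credit_risk__good:
- status: ... >= 200 DM / salary for at least 1 year
- credit_history: all credits at this bank paid back duly
- duration: 12 or less (the -Inf_12 bin)
- savings: at least 1000 DM (... >= 1000 DM)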
Feel free to discuss in the comments.