Data source
https://www.kaggle.com/wendykan/lending-club-loan-data contains about 2.26 million samples and 145 fields.
Data processing
library(rio) # package for importing and exporting data
loan <- import('loan.csv')
str(loan)
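Inspecting and handling missing values
NAcol <- which(colSums(is.na(loan)) > 0) # indices of the feature columns that contain missing values
library(knitr) # provides the kable function for rendering tables
kable(sort(colSums(is.na(loan[NAcol])), decreasing = TRUE)) # missing-value count per feature, in descending order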
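The counting step can be tried on a small data frame before running it on the full file. The sketch below is only an illustration: `toy` and its columns `a`, `b`, `c` are hypothetical and not part of the Lending Club data, and the missing-value share (`na_share`) is an extra quantity not computed in the original code.

```r
# Hypothetical toy data frame standing in for `loan`
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4,
  stringsAsFactors = FALSE
)

# Count missing values per column and keep only columns that have any
na_counts <- colSums(is.na(toy))
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)

# Share of rows that are missing in each of those columns
na_share <- na_counts / nrow(toy)

print(na_counts)
print(round(na_share, 2))
```

Looking at the share as well as the raw count can make it easier to judge which columns are almost entirely empty. On the full `loan` data, the `kable` call above produces the following table: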
Feature | Missing count |
---|---|
id | 2260668 |
member_id | 2260668 |
url | 2260668 |
orig_projected_additional_accrued_interest | 2252242 |
deferral_term | 2250055 |
hardship_amount | 2250055 |
hardship_length | 2250055 |
hardship_dpd | 2250055 |
hardship_payoff_balance_amount | 2250055 |
hardship_last_payment_amount | 2250055 |
settlement_amount | 2227612 |
settlement_percentage | 2227612 |
settlement_term | 2227612 |
sec_app_mths_since_last_major_derog | 2224726 |
sec_app_revol_util | 2154484 |
revol_bal_joint | 2152648 |
sec_app_inq_last_6mths | 2152647 |
sec_app_mort_acc | 2152647 |
sec_app_open_acc | 2152647 |
sec_app_open_act_il | 2152647 |
sec_app_num_rev_accts | 2152647 |
sec_app_chargeoff_within_12_mths |