[Data Mining] DataFountain && Lending Club: Missing Values and Outliers

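This post covers how to handle missing values and outliers in data mining: deleting them outright, imputing them (with a fixed value, a summary statistic, hot-deck matching, or a fitted model), or leaving them untreated. In practice, outliers are identified by inspecting the data, computing summary statistics, applying the 3σ rule, and drawing box plots, and missing values are filled with models such as random forests.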

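General approaches to missing values

Too many missing values seriously degrade prediction accuracy and can introduce bias or outright errors, so they need to be handled early, before modeling. The usual approaches are:

  1. Deletion: when a small number of samples are missing several features, drop those rows; when most values of a feature are missing, drop that column.
  2. Imputation: fill the missing values based on how the other samples in the original dataset are distributed. Common variants:
    ----Fixed-value fill: if missingness itself carries information, mark the missing entries with a dedicated value, such as 0 or a special string, so that "missing" becomes a feature value of its own.
    ----Statistical fill: split the features into numeric and categorical; fill numeric features with the mean or median of the other samples, and categorical features with the mode.
    ----Hot-deck fill: also called nearest-neighbor imputation; find the most similar sample and copy its value.
    ----Model-based fill: use a machine learning algorithm, such as KNN, linear regression, or random forest, to predict the missing values.
  3. No treatment: some algorithms can handle missing values themselves, for example XGBoost, LightGBM, Bayesian networks, and artificial neural networks; in those cases the missing values can be left as-is.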
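
The deletion and simple imputation strategies above can be sketched with pandas. This is a minimal illustration on a small made-up frame: the values, the `"missing"` marker, and the 50% column threshold are assumptions chosen for the example, not settings taken from the analysis in this post.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real loan data (all values are made up)
df = pd.DataFrame({
    "loan_amnt": [1000.0, np.nan, 3000.0, 4000.0, np.nan],
    "grade": ["A", "B", None, "B", "B"],
    "desc": [None, None, None, "car", None],          # mostly missing
    "emp_title": ["nurse", None, "clerk", None, "driver"],
})

# 1. Deletion: drop columns whose missing ratio exceeds an assumed cutoff,
#    then drop rows that are missing two or more of the remaining features
col_threshold = 0.5
missing_ratio = df.isnull().mean()
reduced = df.loc[:, missing_ratio <= col_threshold]
reduced = reduced[reduced.isnull().sum(axis=1) < 2].copy()

# 2. Imputation
filled = reduced.copy()
# Fixed-value fill: treat "missing" as a category of its own
filled["emp_title"] = filled["emp_title"].fillna("missing")
# Statistical fill: median for a numeric feature, mode for a categorical one
filled["loan_amnt"] = filled["loan_amnt"].fillna(filled["loan_amnt"].median())
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])

print(filled)
```

The median is used here rather than the mean because loan amounts tend to be skewed; either is a valid statistical fill.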

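Hot-deck and model-based filling need a notion of similarity or a fitted model. The sketch below is one possible way to show both with NumPy and pandas only: the donor is chosen by absolute distance on a single helper feature, and the model is an ordinary least-squares line. The data, the helper functions `hot_deck_fill` and `regression_fill`, and the choice of `annual_inc` as the only predictor are hypothetical, invented for this illustration; a real pipeline would more likely use several features and a model such as KNN or random forest.

```python
import numpy as np
import pandas as pd

# Made-up data: loan_amnt is missing in rows 2 and 4
df = pd.DataFrame({
    "annual_inc": [30.0, 40.0, 52.0, 60.0, 70.0],
    "loan_amnt": [3.0, 4.0, np.nan, 6.0, np.nan],
})

def hot_deck_fill(frame, target, by):
    """Copy `target` from the complete row closest in `by` (hypothetical helper)."""
    out = frame.copy()
    donors = out[out[target].notnull()]      # only originally complete rows donate
    for idx in out.index[out[target].isnull()]:
        distance = (donors[by] - out.loc[idx, by]).abs()
        out.loc[idx, target] = donors.loc[distance.idxmin(), target]
    return out

def regression_fill(frame, target, by):
    """Predict `target` from `by` with a least-squares line (hypothetical helper)."""
    out = frame.copy()
    known = out[target].notnull()
    slope, intercept = np.polyfit(out.loc[known, by], out.loc[known, target], deg=1)
    out.loc[~known, target] = slope * out.loc[~known, by] + intercept
    return out

hot = hot_deck_fill(df, "loan_amnt", "annual_inc")
reg = regression_fill(df, "loan_amnt", "annual_inc")
print(hot)
print(reg)
```

Hot-deck filling only ever reuses values that were actually observed, while the regression fill can produce values that appear nowhere in the data.
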
Inspecting the data


```python
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Load the data
data = pd.read_csv('/Users/chouyangyu/Downloads/lc_2016_2017.csv')
# data = data.sample(int(data.shape[0]*0.01))  # quick test on a 1% sample
data.head(10)
```

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 112435993 | NaN | 2300 | 2300 | 2300.0 | 36 months | 12.62 | 77.08 | C | C1 | ... | 0.0 | NaN | 1.0 | 2.0 | 2315.0 | 55.0 | 7100.0 | 1.0 | 0.0 | 2.0 |
| 1 | 112290210 | NaN | 16000 | 16000 | 16000.0 | 60 months | 12.62 | 360.95 | C | C1 | ... | 11078.0 | 69.0 | 3.0 | 5.0 | 1962.0 | 94.0 | 31900.0 | 0.0 | 6.0 | 1.0 |
| 2 | 112436985 | NaN | 6025 | 6025 | 6025.0 | 36 months | 15.05 | 209.01 | C | C4 | ... | 0.0 | NaN | 1.0 | 2.0 | 1950.0 | 45.0 | 27700.0 | 1.0 | 5.0 | 3.0 |
| 3 | 112439006 | NaN | 20400 | 20400 | 20400.0 | 36 months | 9.44 | 652.91 | B | B1 | ... | 53566.0 | 95.0 | 1.0 | 2.0 | 4240.0 | 60.0 | 46900.0 | 1.0 | 1.0 | 1.0 |
| 4 | 112438929 | NaN | 13000 | 13000 | 13000.0 | 36 months | 11.99 | 431.73 | B | B5 | ... | 8466.0 | 72.0 | 0.0 | 1.0 | 2996.0 | 78.0 | 7800.0 | 0.0 | 0.0 | 0.0 |
| 5 | 112230200 | NaN | 12000 | 12000 | 12000.0 | 36 months | 9.44 | 384.06 | B | B1 | ... | 12438.0 | 40.0 | 2.0 | 3.0 | 5227.0 | 49.0 | 25800.0 | 0.0 | 0.0 | 2.0 |
| 6 | 112210041 | NaN | 6000 | 6000 | 6000.0 | 36 months | 10.42 | 194.79 | B | B3 | ... | 19983.0 | NaN | 1.0 | 1.0 | 3990.0 | 59.0 | 15800.0 | 0.0 | 0.0 | 0.0 |
| 7 | 112360031 | NaN | 12000 | 12000 | 12000.0 | 60 months | 15.05 | 285.80 | C | C4 | ... | 35899.0 | 54.0 | 1.0 | 2.0 | 4107.0 | 60.0 | 19100.0 | 1.0 | 0.0 | 4.0 |
| 8 | 112038251 | NaN | 11575 | 11575 | 11575.0 | 36 months | 7.35 | 359.26 | A | A4 | ... | 92315.0 | 63.0 | 2.0 | 8.0 | 1581.0 | 36.0 | 37600.0 | 1.0 | 6.0 | 2.0 |
| 9 | 112134207 | NaN | 20400 | 20400 | 20400.0 | 60 months | 7.97 | 413.35 | A | A5 | ... | 44985.0 | NaN | 0.0 | 1.0 | 5133.0 | 8.0 | 65500.0 | 0.0 | 2.0 | 1.0 |

10 rows × 72 columns

```python
# Summary statistics and overall shape
print(data.describe())
print(data.shape)

# Count missing values per feature, most-missing first
missingDf = data.isnull().sum().sort_values(ascending=False).reset_index()
missingDf.columns = ['feature', 'miss_num']

# Share of rows in which each feature is missing
missingDf['miss_percentage'] = missingDf['miss_num'] / data.shape[0]
missingDf.head(10)
```