唐宇迪《python数据分析与机器学习实战》学习笔记
30机器学习项目实战-贷款申请最大化利润
本文相关原始数据及代码:链接,密码:8v5y
一、数据清洗过滤无用特征
互联网贷款网站:https://www.lendingclub.com/info/download-data.action
通过历史数据来决定是否放款,这里下载了2007-2011年的数据,大概有4万多个样本特征非常多:
这里对其进行一些预处理:
import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)#去掉空值太多的行
#thresh=n,保留至少有 n 个非 NA 数的行
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)#去掉没用的“描述”‘链接’行
loans_2007.to_csv('loans_2007.csv', index=False)#保存为新文件
import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
print(loans_2007.iloc[0])#看一下第一行内容
print(loans_2007.shape[1])#看一下列数
id 1077501
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-2011
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-2015
last_pymnt_amnt 171.62
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
52
id数据只是个编号对结果无影响所以不当成特征,具体的特征描述网站上可以看一下,loan_amnt 为申请额,funded_amnt 、funded_amnt_inv 为实际给的这种量是预测之后的事了,对预测无用,这种类似的量都丢除,只留申请时填写的数据。term还款日期。int_rate利率,这个指标可能越高是不是越不容易还款,可作特征。高度重复的特征也要去除,比如各种打分值其实说的是一个事,存在高度相关性。还有些特征,比如公司单位,要预测的话可能需要进行分级太麻烦,也去除。这里共52个特征,就照着上面的思路去清洗:
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
打印出候选特征,发现还是有点多32个:
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
loan_amnt 5000
term 36 months
int_rate 10.65%
installment 162.87
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
32
观察数据并没有哪个项明确规定标签值,无0/1这种。下面是网站上关于贷款状态’Loamn Status’列的评判。Fully Paid全额批准,可以打个1,Charged Off 没批准,评0,三的项是没满足要求的最终也不知道到底给贷款没,这里就不用了。后面其他指标的不确定性比较强也丢弃,例如Late(16-30days)批准延期了。所以最终选择Fully Paid和Charged Off作为Label值。
#取出拥有这两个属性的行
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
#把属性和映射值做成字典
status_replace = {
"loan_status" : {
"Fully Paid": 1,
"Charged Off": 0,
}
}
loans_2007 = loans_2007.replace(status_replace)#应用字典进行替换
还剩30多列,观察发现有些列包含的值都是一样的,比如pymnt_plan列全为n、pub_rec全为0…对于预测完全无用,所以接下把这类列都去除。这里统计每列数据如果唯一值则去取,不过考虑到有些值可能为空值,所以这里先去除缺失值。
orig_columns = loans_2007.columns#取所有列
drop_colum