30机器学习项目实战-贷款申请最大化利润

本文是唐宇迪《python数据分析与机器学习实战》的学习笔记,通过30个机器学习项目实战,聚焦贷款申请如何实现利润最大化。首先进行数据清洗,去除无用特征,然后进行数据预处理,处理缺失值和非数值型数据。核心在于通过逻辑回归和随机森林模型,结合权重调整以降低假阳性(FP)和提高真阳性(TP),目标是优化FPR和TPR。最后发现随机森林模型效果较好,但仍有改进空间。
摘要由CSDN通过智能技术生成

唐宇迪《python数据分析与机器学习实战》学习笔记
30机器学习项目实战-贷款申请最大化利润

本文相关原始数据及代码:链接,密码:8v5y

一、数据清洗过滤无用特征

互联网贷款网站:https://www.lendingclub.com/info/download-data.action
在这里插入图片描述
通过历史数据来决定是否放款,这里下载了2007-2011年的数据,大概有4万多个样本特征非常多:
在这里插入图片描述
这里对其进行一些预处理:

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)#去掉空值太多的行
#thresh=n,保留至少有 n 个非 NA 数的行
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)#去掉没用的“描述”‘链接’行
loans_2007.to_csv('loans_2007.csv', index=False)#保存为新文件
import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
print(loans_2007.iloc[0])#看一下第一行内容
print(loans_2007.shape[1])#看一下列数

id 1077501
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-2011
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-2015
last_pymnt_amnt 171.62
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
52

id数据只是个编号对结果无影响所以不当成特征,具体的特征描述网站上可以看一下,loan_amnt 为申请额,funded_amnt 、funded_amnt_inv 为实际给的这种量是预测之后的事了,对预测无用,这种类似的量都丢除,只留申请时填写的数据。term还款日期。int_rate利率,这个指标可能越高是不是越不容易还款,可作特征。高度重复的特征也要去除,比如各种打分值其实说的是一个事,存在高度相关性。还有些特征,比如公司单位,要预测的话可能需要进行分级太麻烦,也去除。这里共52个特征,就照着上面的思路去清洗:

loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

打印出候选特征,发现还是有点多32个:

loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[1])

loan_amnt 5000
term 36 months
int_rate 10.65%
installment 162.87
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
32

观察数据并没有哪个项明确规定标签值,无0/1这种。下面是网站上关于贷款状态’Loamn Status’列的评判。Fully Paid全额批准,可以打个1,Charged Off 没批准,评0,三的项是没满足要求的最终也不知道到底给贷款没,这里就不用了。后面其他指标的不确定性比较强也丢弃,例如Late(16-30days)批准延期了。所以最终选择Fully Paid和Charged Off作为Label值
在这里插入图片描述

#取出拥有这两个属性的行
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
#把属性和映射值做成字典
status_replace = {
   
    "loan_status" : {
   
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

loans_2007 = loans_2007.replace(status_replace)#应用字典进行替换

还剩30多列,观察发现有些列包含的值都是一样的,比如pymnt_plan列全为n、pub_rec全为0…对于预测完全无用,所以接下把这类列都去除。这里统计每列数据如果唯一值则去取,不过考虑到有些值可能为空值,所以这里先去除缺失值。

orig_columns = loans_2007.columns#取所有列
drop_colum
lending club 贷款数据 2018年第二季度的贷款数据 "id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","open_il_12m","open_il_24m","mths_since_rcnt_il","total_bal_il","il_util","open_rv_12m","open_rv_24m","max_bal_bc","all_util","total_rev_hi_lim","inq_fi","total_cu_tl","inq_last_12m","acc_open_past_24mths","avg_cur_bal","bc_open_to_buy","bc_util","chargeoff_within_12_mths","delinq_amnt","mo_sin_old_il_acct","mo_sin_old_rev_tl_op","mo_sin_rcnt_rev_tl_op","mo_sin_rcnt_tl","mort_acc","mths_since_recent_bc","mths_since_recent_bc_dlq","mths_since_recent_inq","mths_since_recent_revol_delinq","num_accts_ever_120_pd","num_actv_bc_tl","num_actv_rev_tl","num_bc_sats","num_bc_tl","num_il_tl","num_op_rev_tl","num_rev_accts","num_rev_tl_bal_gt_0","num_sats","num_tl_120dpd_2m","num_tl_30dpd","num_tl_90g_dpd_24m","num_tl_op_past_12m","pct_tl_nvr_dlq","percent_bc_gt_75","pub_rec_bankruptcies","tax_liens","tot_hi_cred_lim","total_bal_ex_mort","total_bc_limit","total_il_high_credit_limit","revol_bal_joint","sec_app_earliest_cr_line","sec_app_inq_last_6mths","sec_app_mort_acc","sec_app_open_acc","sec_app_revol_util","sec_app_open_act_il","sec_app_num_rev
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值