Machine Learning in Practice: Maximizing Profit on Loan Applications
1. Project Overview
This project is based on borrower data published by an online lending platform. The goal is to build a model that predicts, for a new applicant, whether the lender should approve the loan.
Data source: https://www.lendingclub.com/info/download-data.action
2. Data Preprocessing
Opening the CSV file shows that the dataset is very large, with so many feature columns that we cannot use all of them as model inputs. Some features are clearly unimportant and have little effect on the result, so we remove them to reduce the computational load.
2.1 Dropping useless features to reduce dimensionality
Step 1. Drop obviously useless features such as 'desc' and 'url', and save the remaining columns to a new CSV file.
import pandas as pd

loans_2007 = pd.read_csv('initial_loans_2007.csv', skiprows=1)
# Drop any column that is more than half missing values (thresh must be an int)
half_count = len(loans_2007) // 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
Step 2. Print the first row of data and make an initial pass at identifying useless features.
import pandas as pd

loans_2007 = pd.read_csv("E:\\machineLearning\\loans_2007.csv")
loans_2007 = loans_2007.drop_duplicates()  # remove duplicate rows
print(loans_2007.iloc[0])   # print the first row
print(loans_2007.shape[1])  # print the number of feature columns
Output:
id 1077501
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-2011
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-2015
last_pymnt_amnt 171.62
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
52
Common sense says that 'id' and 'member_id' have nothing to do with whether the loan should be approved. 'funded_amnt' and 'funded_amnt_inv' record the amount actually funded after the decision was made, so they leak the outcome and cannot be used for prediction either. Deciding which features are useful is genuinely worth discussing: many companies hold entire meetings about a single feature, so we will not go deeper here. The actual meaning of each field is documented at the data source linked above. Following this reasoning, the features dropped in this article are:
loans_2007 = loans_2007.drop(['id','member_id','funded_amnt','funded_amnt_inv','grade','sub_grade','emp_title','last_pymnt_d','last_pymnt_amnt'],axis=1)
loans_2007 = loans_2007.drop(['zip_code','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp'],axis=1)
loans_2007 = loans_2007.drop(['total_rec_int','total_rec_late_fee','recoveries','collection_recovery_fee','issue_d'],axis=1)
print (loans_2007.iloc[0])
print (loans_2007.shape[1])
Output:
loan_amnt 5000
term 36 months
int_rate 10.65%
installment 162.87
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
32
Summary: the data is now down to 32 columns.
Step 3. Determine the current loan status (the label).
The label in this article is 'loan_status'. First, look at its categories and their counts:
print (loans_2007["loan_status"].value_counts())
Output:
Fully Paid 33902
Charged Off 5658
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Current 201
Late (31-120 days) 10
In Grace Period 9
Late (16-30 days) 5
Default 1
Name: loan_status, dtype: int64
From the output, 'Fully Paid' means the loan was repaid in full and 'Charged Off' means it was written off as a loss. The remaining statuses are ambiguous (for example, loans still in progress), so we keep only these two classes, turning this into a binary classification problem. To make the modeling easier, we map the two classes to 1 and 0:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]
# Map the string labels to numbers via a replacement dictionary
status_replace = {
    'loan_status': {
        'Fully Paid': 1,
        'Charged Off': 0,
    }
}
loans_2007 = loans_2007.replace(status_replace)
Step 4. Drop columns that contain only a single distinct value.
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    # Drop missing values before checking uniqueness, otherwise NaN counts as a second value
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        # Every row has the same value, so the column carries no information
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print(loans_2007.shape)
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
Output:
['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type',
'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
(39560, 24)
Step 5. Handle missing values.
First count the missing values:
import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)
Output:
loan_amnt 0
term 0
int_rate 0
installment 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
pymnt_plan 0
purpose 0
title 10
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
last_credit_pull_d 2
pub_rec_bankruptcies 697
dtype: int64
The counts show that 'title' and 'revol_util' have very few missing values relative to the size of the dataset, so we can simply drop those rows. 'pub_rec_bankruptcies' has many more missing values, which suggests the field was poorly collected, so in this article we drop the column entirely.
loans = loans.drop("pub_rec_bankruptcies", axis=1)  # drop the sparse feature column
loans = loans.dropna(axis=0)                        # drop rows with missing values
print(loans.dtypes.value_counts())
Output:
object 12
float64 10
int64 1
dtype: int64
Step 6. Convert data types.
Since scikit-learn does not accept string-valued features, the 12 object-typed columns above still need to be processed.
# select_dtypes selects columns by their dtype
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])
Output:
term 36 months
int_rate 10.65%
emp_length 10+ years
home_ownership RENT
verification_status Verified
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
earliest_cr_line Jan-1985
revol_util 83.7%
last_credit_pull_d Nov-2016
Name: 0, dtype: object
Inspect the distinct values of selected columns and count them.
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())
Output:
RENT 18780
MORTGAGE 17574
OWN 3045
OTHER 96
NONE 3
Name: home_ownership, dtype: int64
Not Verified 16856
Verified 12705
Source Verified 9937
Name: verification_status, dtype: int64
10+ years 8821
< 1 year 4563
2 years 4371
3 years 4074
4 years 3409
5 years 3270
1 year 3227
6 years 2212
7 years 1756
8 years 1472
9 years 1254
n/a 1069
Name: emp_length, dtype: int64
36 months 29041
60 months 10457
Name: term, dtype: int64
CA 7070
NY 3788
FL 2856
TX 2714
NJ 1838
IL 1517
PA 1504
VA 1400
GA 1393
MA 1336
OH 1208
MD 1049
AZ 874
WA 834
CO 786
NC 780
CT 747
MI 722
MO 682
MN 611
NV 492
SC 470
WI 453
AL 446
OR 445
LA 435
KY 325
OK 298
KS 269
UT 256
AR 243
DC 211
RI 198
NM 188
WV 176
HI 172
NH 172
DE 113
MT 84
WY 83
AK 79
SD 63
VT 54
MS 19
TN 17
IN 9
ID 6
IA 5
NE 5
ME 3
Name: addr_state, dtype: int64
"purpose" and "title" express roughly the same information, and as the output below shows, "title" contains far more distinct values (it is free text), so it can be discarded.
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
Output:
debt_consolidation 18533
credit_card 5099
other 3963
home_improvement 2965
major_purchase 2181
small_business 1815
car 1544
wedding 945
medical 692
moving 581
vacation 379
house 378
educational 320
renewable_energy 103
Name: purpose, dtype: int64
Debt Consolidation 2168
Debt Consolidation Loan 1706
Personal Loan 658
Consolidation 509
debt consolidation 502
Credit Card Consolidation 356
Home Improvement 354
Debt consolidation 333
Small Business Loan 322
Credit Card Loan 313
Personal 308
Consolidation Loan 255
Home Improvement Loan 246
personal loan 234
personal 220
Loan 212
Wedding Loan 209
consolidation 200
Car Loan 200
Other Loan 190
Credit Card Payoff 155
Wedding 152
Major Purchase Loan 144
Credit Card Refinance 143
Consolidate 127
Medical 122
Credit Card 117
home improvement 111
My Loan 94
Credit Cards 93
...
DebtConsolidationn 1
Freedom 1
Credit Card Consolidation Loan - SEG 1
SOLAR PV 1
Pay on Credit card 1
To pay off balloon payments due
Paying off the debt 1
Payoff ING PLOC 1
Josh CC Loan 1
House payoff 1
Taking care of Business 1
Gluten Free Bakery in ideal town for it 1
Startup Money for Small Business 1
FundToFinanceCar 1
getting ready for Baby 1
Dougs Wedding Loan 1
d rock 1
LC Loan 2 1
swimming pool repair 1
engagement 1
Cut the credit cards Loan 1
vinman 1
working hard to get out of debt 1
consolidate the rest of my debt 1
Medical/Vacation 1
2BDebtFree 1
Paying Off High Interest Credit Cards! 1
Baby on the way! 1
cart loan 1
Consolidaton 1
Name: title, dtype: int64
Now we can finally process these features:
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
}
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
For the remaining categorical features, we use pandas' get_dummies() function to map them directly to numeric indicator columns.
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans = loans.drop("pymnt_plan", axis=1)
At this point, data preprocessing is complete. Save the cleaned data; next comes building models with machine learning algorithms.
3. Model Training
We spent a great deal of time on data processing, which shows how important data preparation is in machine learning: only good data can yield good predictions. For a binary classification problem, logistic regression is usually the first model to try.
First, define how model performance will be judged. Based on how the lending business actually works, assume that lending to someone who cannot repay loses 1,000 per loan, while lending to someone who does repay earns a profit of 0.1 on the same scale (about 100 per loan), and all other cases yield zero. In other words, roughly ten correct approvals are needed to offset a single wrong one, so plain accuracy is not a suitable metric for this model. To maximize profit, the model needs both a high recall and a low fall-out, so we evaluate it with two metrics: TPR (true positive rate) and FPR (false positive rate).
TPR = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}

FPR = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}
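The profit assumptions above can be made concrete with a small calculation. The per-loan figures below (a gain of 100 per repaid loan, a loss of 1,000 per charged-off loan) are the illustrative numbers from the text, not measured values:

```python
# Sketch of the profit model described above. The per-loan gain and loss
# are illustrative assumptions from the text, not real business figures.
def expected_profit(tp, fp, gain_per_loan=100, loss_per_default=1000):
    # tp: loans approved and repaid; fp: loans approved but charged off
    return tp * gain_per_loan - fp * loss_per_default

# Ten correct approvals are needed just to offset one default:
print(expected_profit(tp=10, fp=1))   # 10*100 - 1*1000 = 0
print(expected_profit(tp=100, fp=5))  # 100*100 - 5*1000 = 5000
```

This asymmetry is why a model that simply approves everyone can look fine on accuracy while still losing money.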
First write the cleaned data to a new file, then call info() to inspect it.
loans.to_csv('cleaned_loans2007.csv', index=False)
import pandas as pd
loans = pd.read_csv("cleaned_loans2007.csv")
print(loans.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39498 entries, 0 to 39497
Data columns (total 37 columns):
loan_amnt 39498 non-null float64
int_rate 39498 non-null float64
installment 39498 non-null float64
annual_inc 39498 non-null float64
loan_status 39498 non-null int64
dti 39498 non-null float64
delinq_2yrs 39498 non-null float64
inq_last_6mths 39498 non-null float64
open_acc 39498 non-null float64
pub_rec 39498 non-null float64
revol_bal 39498 non-null float64
revol_util 39498 non-null float64
total_acc 39498 non-null float64
home_ownership_MORTGAGE 39498 non-null int64
home_ownership_NONE 39498 non-null int64
home_ownership_OTHER 39498 non-null int64
home_ownership_OWN 39498 non-null int64
home_ownership_RENT 39498 non-null int64
verification_status_Not Verified 39498 non-null int64
verification_status_Source Verified 39498 non-null int64
verification_status_Verified 39498 non-null int64
purpose_car 39498 non-null int64
purpose_credit_card 39498 non-null int64
purpose_debt_consolidation 39498 non-null int64
purpose_educational 39498 non-null int64
purpose_home_improvement 39498 non-null int64
purpose_house 39498 non-null int64
purpose_major_purchase 39498 non-null int64
purpose_medical 39498 non-null int64
purpose_moving 39498 non-null int64
purpose_other 39498 non-null int64
purpose_renewable_energy 39498 non-null int64
purpose_small_business 39498 non-null int64
purpose_vacation 39498 non-null int64
purpose_wedding 39498 non-null int64
term_ 36 months 39498 non-null int64
term_ 60 months 39498 non-null int64
dtypes: float64(12), int64(25)
memory usage: 11.1 MB
None
The output confirms that all missing values have been handled and every string column has been converted to numeric, so we can now apply scikit-learn.
3.1 Training with logistic regression
Logistic regression is a very practical binary classification algorithm, and it is also fast to train.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

lr = LogisticRegression(max_iter=1000)  # larger max_iter helps the solver converge
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
print(predictions[:20])
Output:
0.999084438406
0.998049299521
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
dtype: int64
Summary: both TPR and FPR are close to 1, which means the model approves essentially every applicant; it is doing no useful classification at all.
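This failure mode can be reproduced with a trivial baseline: on an imbalanced label vector, a classifier that predicts 1 for everyone reaches TPR = FPR = 1. A minimal sketch with a hypothetical label vector:

```python
import pandas as pd

# Hypothetical imbalanced labels: 9 positives, 1 negative
actual = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
predictions = pd.Series([1] * len(actual))  # "approve everyone" baseline

tp = ((predictions == 1) & (actual == 1)).sum()
fp = ((predictions == 1) & (actual == 0)).sum()
fn = ((predictions == 0) & (actual == 1)).sum()
tn = ((predictions == 0) & (actual == 0)).sum()

tpr = tp / (tp + fn)  # 1.0: every positive is caught
fpr = fp / (fp + tn)  # 1.0: but every negative is approved too
print(tpr, fpr)
```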
3.2 Logistic regression with balanced class weights
Why does the model behave so absurdly? The samples are highly imbalanced, which pushes the classifier to assign almost everything to the majority class. There are several remedies. One is data augmentation, increasing the number of minority-class samples, but the extra samples must either be collected or synthesized, which is difficult. In this article we instead adjust class weights, giving the minority class a larger weight in the hope that the model reaches a more balanced state.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# class_weight="balanced" weights each class inversely to its frequency
lr = LogisticRegression(class_weight="balanced", max_iter=1000)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
print(predictions[:20])
Output:
0.670781771464
0.400780280192
0 1
1 0
2 0
3 1
4 1
5 0
6 0
7 0
8 0
9 0
10 1
11 0
12 1
13 1
14 0
15 0
16 1
17 1
18 1
19 0
dtype: int64
3.3 Logistic regression with custom class weights
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Penalize mistakes on class 0 (charged off) five times as heavily as class 1
penalty = {
    0: 5,
    1: 1
}
lr = LogisticRegression(class_weight=penalty, max_iter=1000)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
Output:
0.731799521545
0.478985635751
3.4 Training with a random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

# A forest of 10 trees
rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
Output:
0.973862193213
0.940946976414
Summary: none of the models above produced a truly good result. I also tried increasing the number of trees in the random forest, and the improvement was limited. The main purpose of this project is to demonstrate the typical machine learning workflow, which splits into two parts: data processing and model training.
When a model underperforms, strategies worth considering include:
1. Tuning the class-weight parameters for the positive and negative samples.
2. Switching algorithms, e.g. SVM or AdaBoost.
3. Using several models together and combining their predictions by voting.
4. Engineering new features from the raw data.
5. Tuning the model's hyperparameters.
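As a sketch of strategy 3, scikit-learn's VotingClassifier trains several models and takes a majority vote over their predictions. The synthetic data and the particular estimators below are illustrative choices, not a tuned configuration; in practice you would fit on the `features`/`target` frames from above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan features
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.15, 0.85], random_state=1)

# Majority ("hard") vote over three different model families
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                  random_state=1)),
    ("dt", DecisionTreeClassifier(class_weight="balanced", random_state=1)),
], voting="hard")
ensemble.fit(X, y)
preds = ensemble.predict(X)
print(preds[:10])
```

Voting tends to help when the individual models make different kinds of mistakes, so it is worth pairing with the class-weight tuning from strategy 1 rather than replacing it.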