Machine Learning in Practice: Maximizing Profit on Loan Applications
1. Project Overview
This project is based on borrower data published by an online lending platform. The goal is to build a model that predicts, for a new applicant, whether the lender should approve the loan.
Data source: https://www.lendingclub.com/info/download-data.action
2. Data Preprocessing
Opening the CSV file shows that the dataset is very large, with so many feature columns that we cannot use all of them as model inputs. Some features are clearly unimportant and have little effect on the result, so we remove them to reduce the computational load.
2.1 Dropping useless features to reduce dimensionality
Step 1. Drop obviously useless features such as 'desc' and 'url', and save the remaining columns to a new CSV file.
import pandas as pd

loans_2007 = pd.read_csv('initial_loans_2007.csv', skiprows=1)
# Drop any column that is more than half missing values (thresh must be an int)
half_count = len(loans_2007) // 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
Step 2. Print the first row of data and make an initial pass at identifying useless features.
import pandas as pd

loans_2007 = pd.read_csv("E:\\machineLearning\\loans_2007.csv")
loans_2007 = loans_2007.drop_duplicates()  # remove duplicate rows
print(loans_2007.iloc[0])   # print the first row
print(loans_2007.shape[1])  # print the number of feature columns
Output:
id 1077501
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-2011
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-2015
last_pymnt_amnt 171.62
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
52
Common sense says that 'id' and 'member_id' have nothing to do with whether the loan should be approved. 'funded_amnt' and 'funded_amnt_inv' record the amount actually funded after the decision was made, so they leak the outcome and cannot be used for prediction either. Deciding which features are useful is genuinely worth discussing: many companies hold entire meetings about a single feature, so we will not go deeper here. The actual meaning of each field is documented at the data source linked above. Following this reasoning, the features dropped in this article are:
loans_2007 = loans_2007.drop(['id','member_id','funded_amnt','funded_amnt_inv','grade','sub_grade','emp_title','last_pymnt_d','last_pymnt_amnt'],axis=1)
loans_2007 = loans_2007.drop(['zip_code','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp'],axis=1)
loans_2007 = loans_2007.drop(['total_rec_int','total_rec_late_fee','recoveries','collection_recovery_fee','issue_d'],axis=1)
print (loans_2007.iloc[0])
print (loans_2007.shape[1])
Output:
loan_amnt 5000
term 36 months
int_rate 10.65%
installment 162.87
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-1985
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7%
total_acc 9
initial_list_status f
last_credit_pull_d Nov-2016
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
32
Summary: the data is now down to 32 columns.
Step 3. Determine the current loan status (the label).
The label in this article is 'loan_status'. First, look at its categories and their counts:
print (loans_2007["loan_status"].value_counts())
Output:
Fully Paid 33902
Charged Off 5658
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Current 201
Late (31-120 days) 10
In Grace Period 9
Late (16-30 days) 5
Default 1
Name: loan_status, dtype: int64
From the output, 'Fully Paid' means the loan was repaid in full and 'Charged Off' means it was written off as a loss. The remaining statuses are ambiguous (for example, loans still in progress), so we keep only these two classes, turning this into a binary classification problem. To make the modeling easier, we map the two classes to 1 and 0:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]
# Map the string labels to numbers via a replacement dictionary
status_replace = {
    'loan_status': {
        'Fully Paid': 1,
        'Charged Off': 0,
    }
}
loans_2007 = loans_2007.replace(status_replace)
Step 4. Drop columns that contain only a single distinct value.
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    # Drop missing values before checking uniqueness, otherwise NaN counts as a second value
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        # Every row has the same value, so the column carries no information
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print(loans_2007.shape)
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
Output:
['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type',
'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
(39560, 24)
Step 5. Handle missing values.
First count the missing values:
import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)
Output:
loan_amnt 0
term 0
int_rate 0
installment 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
pymnt_plan 0
purpose 0
title 10
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
last_credit_pull_d 2
pub_rec_bankruptcies 697
dtype: int64
The counts show that 'title' and 'revol_util' have very few missing values relative to the size of the dataset, so we can simply drop those rows. 'pub_rec_bankruptcies' has many more missing values, which suggests the field was poorly collected, so in this article we drop the column entirely.
loans = loans.drop("pub_rec_bankruptcies", axis=1)  # drop the sparse feature column
loans = loans.dropna(axis=0)                        # drop rows with missing values
print(loans.dtypes.value_counts())
Output:
object 12
float64 10
int64 1
dtype: int64
Step 6. Convert data types.
Since scikit-learn does not accept string-valued features, the 12 object-typed columns above still need to be processed.
# select_dtypes selects columns by their dtype
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])
Output:
term 36 months
int_rate 10.65%
emp_length 10+ years
home_ownership RENT
verification_status Verified
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
earliest_cr_line Jan-1985
revol_util 83.7%
last_credit_pull_d Nov-2016
Name: 0, dtype: object
Inspect the distinct values of selected columns and count them.
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())
Output:
RENT 18780
MORTGAGE 17574
OWN 3045
OTHER 96
NONE 3
Name: home_ownership, dtype: int64
Not Verified 16856
Verified 12705
Source Verified 9937
Name: verification_status, dtype: int64
10+ years 8821
< 1 year 4563
2 years 4371
3 years 4074
4 years 3409
5 years 3270
1 year 3227
6 years 2212
7 years 1756
8 years 1472
9 years 1254
n/a 1069
Name: emp_length, dtype: int64
36 months 29041
60 months 10457
Name: term, dtype: int64
CA 7070
NY 3788
FL 2856
TX 2714
NJ 1838
IL 1517
PA 1504
VA 1400
GA 1393
MA 1336
OH 1208
MD 1049
AZ 874
WA 834
CO 786
NC 780
CT 747
MI 722
MO 682
MN 611
NV 492
SC 470
WI 453
AL 446
OR 445
LA 435
KY 325
OK 298
KS 269
UT 256
AR 243
DC 211
RI 198
NM 188
WV 176
HI 172
NH 172
DE 113
MT 84
WY 83
AK 79
SD 63
VT 54
MS 19
TN 17
IN 9
ID 6
IA 5
NE 5
ME 3
Name: addr_state, dtype: int64
"purpose" and "title" express roughly the same information, and as the output below shows, "title" contains far more distinct values (it is free text), so it can be discarded.
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
Output:
debt_consolidation 18533
credit_card 5099
other 3963
home_improvement 2965
major_purchase 2181
small_business 1815
car 1544
wedding 945
medical 692
moving 581
vacation 379
house 378
educational 320
renewable_energy 103
Name: purpose, dtype: int64
Debt Consolidation 2168
Debt Consolidation Loan 1706
Personal Loan 658
Consolidation 509
debt consolidation 502
Credit Card Consolidation 356
Home Improvement 354
Debt consolidation 333
Small Business Loan 322
Credit Card Loan 313
Personal 308
Consolidation Loan 255
Home Improvement Loan 246
personal loan 234
personal 220
Loan 212
Wedding Loan 209
consolidation 200
Car Loan 200
Other Loan 190
Credit Card Payoff 155
Wedding 152
Major Purchase Loan 144
Credit Card Refinance 143
Consolidate 127
Medical 122
Credit Card 117
home improvement 111
My Loan 94
Credit Cards 93
...
DebtConsolidationn 1
Freedom 1
Credit Card Consolidation Loan - SEG 1
SOLAR PV 1
Pay on Credit card 1
To pay off balloon payments due
Paying off the debt 1
Payoff ING PLOC 1
Josh CC Loan 1
House payoff 1
Taking care of Business 1
Gluten Free Bakery in ideal town for it 1
Startup Money for Small Business 1
FundToFinanceCar 1
getting ready for Baby 1
Dougs Wedding Loan 1
d rock 1
LC Loan 2 1
swimming pool repair 1
engagement 1
Cut the credit cards Loan 1
vinman 1
working hard to get out of debt 1
consolidate the rest of my debt 1
Medical/Vacation 1
2BDebtFree 1
Paying Off High Interest Credit Cards! 1
Baby on the way! 1
cart loan 1
Consolidaton 1
Name: title, dtype: int64
Now we can finally process these features:
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
}
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
For the remaining categorical features, we use pandas' get_dummies() function to map them directly to numeric indicator columns.
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans = loans.drop("pymnt_plan", axis=1)
At this point, data preprocessing is complete. Save the cleaned data; next comes building models with machine learning algorithms.
3. Model Training
We spent a great deal of time on data processing, which shows how important data preparation is in machine learning: only good data can yield good predictions. For a binary classification problem, logistic regression is usually the first model to try.
First, define how model performance will be judged. Based on how the lending business actually works, assume that lending to someone who cannot repay loses 1,000 per loan, while lending to someone who does repay earns a profit of 0.1 on the same scale (about 100 per loan), and all other cases yield zero. In other words, roughly ten correct approvals are needed to offset a single wrong one, so plain accuracy is not a suitable metric for this model. To maximize profit, the model needs both a high recall and a low fall-out, so we evaluate it with two metrics: TPR (true positive rate) and FPR (false positive rate).
TPR = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}

FPR = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}
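The profit assumptions above can be made concrete with a small calculation. The per-loan figures below (a gain of 100 per repaid loan, a loss of 1,000 per charged-off loan) are the illustrative numbers from the text, not measured values:

```python
# Sketch of the profit model described above. The per-loan gain and loss
# are illustrative assumptions from the text, not real business figures.
def expected_profit(tp, fp, gain_per_loan=100, loss_per_default=1000):
    # tp: loans approved and repaid; fp: loans approved but charged off
    return tp * gain_per_loan - fp * loss_per_default

# Ten correct approvals are needed just to offset one default:
print(expected_profit(tp=10, fp=1))   # 10*100 - 1*1000 = 0
print(expected_profit(tp=100, fp=5))  # 100*100 - 5*1000 = 5000
```

This asymmetry is why a model that simply approves everyone can look fine on accuracy while still losing money.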
First write the cleaned data to a new file, then call info() to inspect it.
loans.to_csv('cleaned_loans2007.csv', index=False)
import pandas as pd
loans = pd.read_csv("cleaned_loans2007.csv")
print(loans.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39498 entries, 0 to 39497
Data columns (total 37 columns):
loan_amnt 39498 non-null float64
int_rate 39498 non-null float64
installment 39498 non-null float64
annual_inc 39498 non-null float64
loan_status 39498 non-null int64
dti 39498 non-null float64
delinq_2yrs 39498 non-null float64
inq_last_6mths 39498 non-null float64
open_acc 39498 non-null float64
pub_rec 39498 non-null float64
revol_bal 39498 non-null float64
revol_util 39498 non-null float64
total_acc 39498 non-null float64
home_ownership_MORTGAGE 39498 non-null int64
home_ownership_NONE 39498 non-null int64
home_ownership_OTHER 39498 non-null int64
home_ownership_OWN 39498 non-null int64
home_ownership_RENT 39498 non-null int64
verification_status_Not Verified 39498 non-null int64
verification_status_Source Verified 39498 non-null int64
verification_status_Verified 39498 non-null int64
purpose_car 39498 non-null int64
purpose_credit_card 39498 non-null int64
purpose_debt_consolidation 39498 non-null int64
purpose_educational 39498 non-null int64
purpose_home_improvement 39498 non-null int64
purpose_house 39498 non-null int64
purpose_major_purchase 39498 non-null int64
purpose_medical 39498 non-null int64
purpose_moving 39498 non-null int64
purpose_other 39498 non-null int64
purpose_renewable_energy 39498 non-null int64
purpose_small_business 39498 non-null int64
purpose_vacation 39498 non-null int64
purpose_wedding 39498 non-null int64
term_ 36 months 39498 non-null int64
term_ 60 months 39498 non-null int64
dtypes: float64(12), int64(25)
memory usage: 11.1 MB
None
The output confirms that all missing values have been handled and every string column has been converted to numeric, so we can now apply scikit-learn.
3.1 Training with logistic regression
Logistic regression is a very practical binary classification algorithm, and it is also fast to train.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

lr = LogisticRegression(max_iter=1000)  # larger max_iter helps the solver converge
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
print(predictions[:20])
Output:
0.999084438406
0.998049299521
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
dtype: int64
Summary: both TPR and FPR are close to 1, which means the model approves essentially every applicant; it is doing no useful classification at all.
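This failure mode can be reproduced with a trivial baseline: on an imbalanced label vector, a classifier that predicts 1 for everyone reaches TPR = FPR = 1. A minimal sketch with a hypothetical label vector:

```python
import pandas as pd

# Hypothetical imbalanced labels: 9 positives, 1 negative
actual = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
predictions = pd.Series([1] * len(actual))  # "approve everyone" baseline

tp = ((predictions == 1) & (actual == 1)).sum()
fp = ((predictions == 1) & (actual == 0)).sum()
fn = ((predictions == 0) & (actual == 1)).sum()
tn = ((predictions == 0) & (actual == 0)).sum()

tpr = tp / (tp + fn)  # 1.0: every positive is caught
fpr = fp / (fp + tn)  # 1.0: but every negative is approved too
print(tpr, fpr)
```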
3.2 Logistic regression with balanced class weights
Why does the model behave so absurdly? The samples are highly imbalanced, which pushes the classifier to assign almost everything to the majority class. There are several remedies. One is data augmentation, increasing the number of minority-class samples, but the extra samples must either be collected or synthesized, which is difficult. In this article we instead adjust class weights, giving the minority class a larger weight in the hope that the model reaches a more balanced state.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# class_weight="balanced" weights each class inversely to its frequency
lr = LogisticRegression(class_weight="balanced", max_iter=1000)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
print(predictions[:20])
Output:
0.670781771464
0.400780280192
0 1
1 0
2 0
3 1
4 1
5 0
6 0
7 0
8 0
9 0
10 1
11 0
12 1
13 1
14 0
15 0
16 1
17 1
18 1
19 0
dtype: int64
3.3 Logistic regression with custom class weights
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Penalize mistakes on class 0 (charged off) five times as heavily as class 1
penalty = {
    0: 5,
    1: 1
}
lr = LogisticRegression(class_weight=penalty, max_iter=1000)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
Output:
0.731799521545
0.478985635751
3.4 Training with a random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

# A forest of 10 trees
rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)
print(fpr)
Output:
0.973862193213
0.940946976414
Summary: none of the models above produced a truly good result. I also tried increasing the number of trees in the random forest, and the improvement was limited. The main purpose of this project is to demonstrate the typical machine learning workflow, which splits into two parts: data processing and model training.
When a model underperforms, strategies worth considering include:
1. Tuning the class-weight parameters for the positive and negative samples.
2. Switching algorithms, e.g. SVM or AdaBoost.
3. Using several models together and combining their predictions by voting.
4. Engineering new features from the raw data.
5. Tuning the model's hyperparameters.
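As a sketch of strategy 3, scikit-learn's VotingClassifier trains several models and takes a majority vote over their predictions. The synthetic data and the particular estimators below are illustrative choices, not a tuned configuration; in practice you would fit on the `features`/`target` frames from above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan features
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.15, 0.85], random_state=1)

# Majority ("hard") vote over three different model families
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                  random_state=1)),
    ("dt", DecisionTreeClassifier(class_weight="balanced", random_state=1)),
], voting="hard")
ensemble.fit(X, y)
preds = ensemble.predict(X)
print(preds[:10])
```

Voting tends to help when the individual models make different kinds of mistakes, so it is worth pairing with the class-weight tuning from strategy 1 rather than replacing it.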