Maximizing Profit on Loan Applications (Binary Classification, Logistic Regression, Random Forest)

1. Data cleaning: filtering out useless features

1.1 Keep only columns with enough non-NaN values

  • dropna(thresh=n): a column is kept only if it has at least n non-NaN values
import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
half_count = len(loans_2007) // 2  # integer threshold for dropna(thresh=...)
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
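
As a small illustration of how thresh behaves, the sketch below applies the same call to a made-up three-column frame. The column names and values are hypothetical and are not taken from LoanStats3a.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" has only 1 non-NaN value out of 4.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 9.0],
    "c": [1.0, np.nan, 3.0, np.nan],
})

# Keep a column only if at least half of its values are non-NaN.
min_non_nan = len(toy) // 2  # integer threshold: 2
kept = toy.dropna(thresh=min_non_nan, axis=1)
print(list(kept.columns))  # ['a', 'c']
```

Column "c" survives because it has exactly 2 non-NaN values, which meets the threshold; column "b" has only 1 and is dropped.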

1.2 Inspect the data: first row and column count

import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
#loans_2007.drop_duplicates()
print(loans_2007.iloc[0])
print(loans_2007.shape[1])  # number of columns
id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
out_prncp                               0
out_prncp_inv                           0
total_pymnt                       5863.16
total_pymnt_inv                   5833.84
total_rec_prncp                      5000
total_rec_int                      863.16
total_rec_late_fee                      0
recoveries                              0
collection_recovery_fee                 0
last_pymnt_d                     Jan-2015
last_pymnt_amnt                    171.62
last_credit_pull_d               Nov-2016
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object
52

1.3 Drop useless columns

loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               Nov-2016
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object
32

2. Data preprocessing

2.1 Keep two loan_status values and map them to 0 and 1 (the target)

print(loans_2007['loan_status'].value_counts())

loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans_2007 = loans_2007.replace(status_replace)
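
The same filter-and-map step can also be written with isin and map. This is an equivalent sketch on hypothetical toy rows, not the author's code; the "Current" row is included only to show a status being filtered out.

```python
import pandas as pd

# Hypothetical toy rows for illustration.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
    "loan_amnt": [5000, 2500, 2400, 10000],
})

status_map = {"Fully Paid": 1, "Charged Off": 0}

# Keep only the two final outcomes, then map them to the binary target.
toy = toy[toy["loan_status"].isin(list(status_map))].copy()
toy["loan_status"] = toy["loan_status"].map(status_map)
print(toy["loan_status"].tolist())  # [1, 0, 1]
```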

2.2 Drop columns with a single unique value

orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique()  # drop missing values first; otherwise NaN counts as an extra unique value
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print(loans_2007.shape)
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
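
A more compact way to find single-valued columns is Series.nunique, which ignores NaN by default. The sketch below uses a hypothetical toy frame; the column names only mimic the real ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "policy_code": [1, 1, 1],
    "pymnt_plan": ["n", np.nan, "n"],  # one real value plus a NaN
    "loan_amnt": [5000, 2500, 2400],
})

# nunique() skips NaN by default, matching dropna().unique() in the loop above.
single_valued = [col for col in toy.columns if toy[col].nunique() == 1]
print(single_valued)  # ['policy_code', 'pymnt_plan']
```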

2.3 Count the null values in each column and remove them

import pandas as pd
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)

loans = loans.drop("pub_rec_bankruptcies", axis=1)  # drop the column with many nulls
loans = loans.dropna(axis=0)  # drop rows that contain any null
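
The order of the two steps matters: dropping the sparse column first means far fewer rows are lost afterwards. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is mostly empty.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "y": [1.0, np.nan, 3.0, 4.0],
})

null_counts = toy.isnull().sum()
print(null_counts.to_dict())

# Dropping rows straight away keeps only the single fully populated row.
rows_only = toy.dropna(axis=0)

# Dropping the sparse column first keeps three of the four rows.
cleaned = toy.drop("mostly_missing", axis=1).dropna(axis=0)
print(rows_only.shape, cleaned.shape)  # (1, 3) (3, 2)
```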

2.4 Count the columns of each dtype and inspect the string-typed columns to convert

print(loans.dtypes.value_counts())
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

2.5 Check the value counts of each column to decide whether to convert or drop it

cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())
RENT        18780
MORTGAGE    17574
OWN          3045
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16856
Verified           12705
Source Verified     9937
Name: verification_status, dtype: int64
10+ years    8821
< 1 year     4563
2 years      4371
3 years      4074
4 years      3409
5 years      3270
1 year       3227
6 years      2212
7 years      1756
8 years      1472
9 years      1254
n/a          1069
Name: emp_length, dtype: int64
 36 months    29041
 60 months    10457
Name: term, dtype: int64
CA    7070
NY    3788
FL    2856
TX    2714
NJ    1838
IL    1517
PA    1504
VA    1400
GA    1393
MA    1336
OH    1208
MD    1049
AZ     874
WA     834
CO     786
NC     780
CT     747
MI     722
MO     682
MN     611
NV     492
SC     470
WI     453
AL     446
OR     445
LA     435
KY     325
OK     298
KS     269
UT     256
AR     243
DC     211
RI     198
NM     188
WV     176
HI     172
NH     172
DE     113
MT      84
WY      83
AK      79
SD      63
VT      54
MS      19
TN      17
IN       9
ID       6
IA       5
NE       5
ME       3
Name: addr_state, dtype: int64
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
debt_consolidation    18533
credit_card            5099
other                  3963
home_improvement       2965
major_purchase         2181
small_business         1815
car                    1544
wedding                 945
medical                 692
moving                  581
vacation                379
house                   378
educational             320
renewable_energy        103
Name: purpose, dtype: int64
Debt Consolidation                         2168
Debt Consolidation Loan                    1706
Personal Loan                               658
Consolidation                               509
debt consolidation                          502
Credit Card Consolidation                   356
Home Improvement                            354
Debt consolidation                          333
Small Business Loan                         322
Credit Card Loan                            313
Personal                                    308
Consolidation Loan                          255
Home Improvement Loan                       246
personal loan                               234
personal                                    220
Loan                                        212
Wedding Loan                                209
consolidation                               200
Car Loan                                    200
...
Name: title, dtype: int64
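
The counts above show purpose with 14 fixed categories, while title is free text with many spelling and case variants. One rough way to flag such columns is the ratio of distinct values to rows. This is a sketch on hypothetical toy data, and any cut-off applied to the ratio would be an assumption rather than a rule from the original.

```python
import pandas as pd

# Hypothetical toy data mimicking a categorical column and a free-text column.
toy = pd.DataFrame({
    "purpose": ["car", "car", "wedding", "car", "wedding", "car"],
    "title": ["Car Loan", "car loan", "Wedding Loan", "My car", "wedding", "Car"],
})

# Ratio of distinct values to rows: near 1.0 means almost every row is its own category.
cardinality = toy.nunique() / len(toy)
print(cardinality.to_dict())
```

A column whose ratio is close to 1.0 is a candidate for dropping, while a low ratio suggests it can be encoded.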

2.6 Convert string values to numeric types

mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)  # drop columns with too many distinct values
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
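
For a single column, Series.map is an alternative to the nested replace dictionary. It returns NaN for any label missing from the mapping, which makes unexpected values easy to spot. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows for illustration.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "7.90%"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Strip the trailing percent sign and convert to float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype("float")

# Map the employment-length strings to integers.
emp_map = {"10+ years": 10, "< 1 year": 0}
toy["emp_length"] = toy["emp_length"].map(emp_map)
print(toy["int_rate"].tolist(), toy["emp_length"].tolist())
```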

2.7 One-hot encode the string columns

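# Note: emp_length was mapped to integers in 2.6. If its dtype is numeric here,
# get_dummies passes it through unchanged and the drop below removes it from the features.
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]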
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans = loans.drop("pymnt_plan", axis=1)
loans.to_csv('cleaned_loans2007.csv', index=False)
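
To make the encoding step concrete, here is a sketch on a hypothetical two-column frame. Note that the term values in this dataset carry a leading space, which ends up in the generated column names.

```python
import pandas as pd

# Hypothetical toy rows; the leading space in "term" mirrors the real data.
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "loan_amnt": [5000, 2500, 2400],
})

# Each distinct string becomes its own indicator column named <column>_<value>.
dummies = pd.get_dummies(toy[["term"]])
encoded = pd.concat([toy.drop("term", axis=1), dummies], axis=1)
print(list(encoded.columns))  # ['loan_amnt', 'term_ 36 months', 'term_ 60 months']
```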

3. Conditions and approach for maximizing profit

3.1 Check the data types

import pandas as pd
loans = pd.read_csv("cleaned_loans2007.csv")
print(loans.info())

All columns are numeric.

3.2 Build the dataset

cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]

3.3 A simple LogisticRegression model

  • Cross-validation
  • cross_val_predict generates a cross-validated prediction for every input data point
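
The reason every row can receive a prediction is that the folds partition the data: each sample lands in the held-out fold exactly once. A minimal sketch with 10 hypothetical samples and the same KFold settings used in this section:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples
kf = KFold(5, shuffle=True, random_state=1)

# Collect the held-out indices of every fold.
test_indices = np.concatenate([test_idx for _, test_idx in kf.split(X)])

# Every sample is held out exactly once, so one out-of-fold
# prediction per row is possible.
print(sorted(test_indices.tolist()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```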
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression()
kf = KFold(5,shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
print(fp_filter)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)  # actually repaid, but predicted not to
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))  # true positive rate: good loans approved (profit)
fpr = fp / float((fp + tn))  # false positive rate: bad loans approved (loss)

print(tpr)
print(fpr)
print(predictions[:20])

Output:

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
0.6501668684840072
0.36815038127327543

Built directly like this, the model performs poorly: the first 20 predictions are all 1.
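
The four confusion-matrix counts and the two rates are recomputed in every model section. A small helper could compute them once. The function name rates and the toy labels below are hypothetical, not part of the original code.

```python
def rates(y_true, y_pred):
    """Return (tpr, fpr) for binary labels where 1 = repaid and 0 = charged off."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels: 4 good loans, 2 bad loans.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
tpr, fpr = rates(y_true, y_pred)
print(tpr, fpr)  # 0.75 0.5
```

With the data in this post, it could be called as rates(loans["loan_status"], predictions), assuming both have the same length and row order.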

3.4 LogisticRegression with class weights

  • class_weight="balanced"
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression(class_weight="balanced")  # balance the two classes
kf = KFold(5,shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print(tpr)
print(fpr)
print(predictions[:20])
loans['predicted_label']=predictions
matches = loans["predicted_label"] == loans["loan_status"]
#print('matches',matches)
correct_predictions = loans[matches]
print('len(correct_predictions)',len(correct_predictions))
print('len(loans)',float(len(loans)))
accuracy = len(correct_predictions) / float(len(loans))
print('accuracy',accuracy)

Output:

0.6501668684840072
0.36815038127327543
0     1
1     0
2     0
3     1
4     1
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    1
13    1
14    0
15    0
16    1
17    1
18    1
19    0
len(correct_predictions) 25577
len(loans) 39498.0
accuracy 0.6475517747734062

The model's performance is mediocre: TPR is about 0.65, FPR about 0.37, and accuracy about 0.65.
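
For reference, scikit-learn documents the "balanced" mode as weighting each class by n_samples / (n_classes * count_of_class). The sketch below reproduces that formula in plain Python on hypothetical labels. The function name balanced_weights is illustrative, not a library call.

```python
from collections import Counter

def balanced_weights(y):
    """Weight per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: 8 repaid loans (1) and 2 charged-off loans (0).
y = [1] * 8 + [0] * 2
weights = balanced_weights(y)
print(weights)  # {1: 0.625, 0: 2.5}
```

The rarer class gets the larger weight, so misclassifying a charged-off loan costs the model more during training.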

  • class_weight=penalty
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
penalty = {
    0: 5,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
kf = KFold(5,shuffle=True, random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print('custom class weights TPR',tpr)
print('custom class weights FPR',fpr)
loans['predicted_label']=predictions
matches = loans["predicted_label"] == loans["loan_status"]
#print('matches',matches)
correct_predictions = loans[matches]
print('len(correct_predictions)',len(correct_predictions))
print('len(loans)',float(len(loans)))
accuracy = len(correct_predictions) / float(len(loans))
print('custom class weights accuracy',accuracy)

Output:

custom class weights TPR 0.6933459346111817
custom class weights FPR 0.45664124844830645
len(correct_predictions) 26540
len(loans) 39498.0
custom class weights accuracy 0.6719327560889159

3.5 Random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold
rf = RandomForestClassifier(n_estimators=10,class_weight="balanced", random_state=1)
# help(RandomForestClassifier)
kf = KFold(5,shuffle=True, random_state=1)
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print('random forest TPR',tpr)
print('random forest FPR',fpr)
print(predictions[:20])
loans['predicted_label']=predictions
matches = loans["predicted_label"] == loans["loan_status"]
#print('matches',matches)
correct_predictions = loans[matches]
print('len(correct_predictions)',len(correct_predictions))
print('len(loans)',float(len(loans)))
accuracy = len(correct_predictions) / float(len(loans))
print('random forest accuracy',accuracy)

Output:

random forest TPR 0.9744824123571281
random forest FPR 0.9313708104273808
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
len(correct_predictions) 33382
len(loans) 39498.0
random forest accuracy 0.8451567167957871
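
The accuracy above (about 0.85) sits next to a false positive rate of about 0.93. The toy example below, with hypothetical labels chosen only for illustration, shows how that combination can arise when nearly every loan is predicted as 1 on imbalanced data.

```python
# Hypothetical imbalanced labels: 85 repaid loans (1), 15 charged-off loans (0).
y_true = [1] * 85 + [0] * 15

# A "model" that approves every loan.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fpr = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1) / y_true.count(0)
print(accuracy, fpr)  # 0.85 1.0
```

Accuracy looks respectable even though every bad loan is approved, which is why TPR and FPR are tracked separately throughout this section.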

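The predictions are still not ideal: the false positive rate remains very high. The weights of the positive and negative classes and the model parameters can be tuned further, and other models such as SVM can be tried for comparison.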
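
Since the stated goal is profit rather than accuracy, candidate models could be compared with a simple payoff estimate. The sketch below rests on loud assumptions: the function name, the per-loan gain and loss, and the counts are all hypothetical and do not come from the data above.

```python
def estimated_profit(tp, fp, gain_per_good_loan, loss_per_bad_loan):
    """Payoff of approved loans: repaid ones earn money, charged-off ones lose money."""
    return tp * gain_per_good_loan - fp * loss_per_bad_loan

# Hypothetical figures, for illustration only.
cautious = estimated_profit(tp=650, fp=50, gain_per_good_loan=100, loss_per_bad_loan=1000)
lenient = estimated_profit(tp=970, fp=130, gain_per_good_loan=100, loss_per_bad_loan=1000)
print(cautious, lenient)  # 15000 -33000
```

Under these made-up numbers the model that approves fewer loans comes out ahead, because each bad loan costs far more than a good loan earns.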
