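Background
The dataset contains records for nearly 10,000 borrowers who took out loans (9,578 rows); some loans have been repaid and others are still outstanding. It was extracted from lendingclub.com, a platform that connects borrowers with investors. The task is to predict whether a loan will be repaid in full.
The dataset has the following fields: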
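- credit.policy: 1 if the customer meets LendingClub.com's credit underwriting criteria, 0 otherwise.
- purpose: the purpose of the loan. Values include "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other".
- int.rate: the interest rate of the loan, stored as a proportion (for example, a rate of 11% is stored as 0.11). Borrowers that LendingClub.com judges to be riskier are assigned higher rates.
- installment: the monthly installment owed by the borrower if the loan is funded.
- log.annual.inc: the natural logarithm of the borrower's self-reported annual income.
- dti: the borrower's debt-to-income ratio (amount of debt divided by annual income).
- fico: the borrower's FICO credit score.
- days.with.cr.line: the number of days the borrower has had a credit line.
- revol.bal: the borrower's revolving balance (the amount left unpaid at the end of the credit card billing cycle).
- revol.util: the borrower's revolving line utilization rate (credit used relative to total credit available).
- inq.last.6mths: the number of inquiries by creditors about the borrower in the last 6 months.
- delinq.2yrs: the number of times the borrower was more than 30 days past due on a payment in the past 2 years.
- pub.rec: the borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
- not_fully_paid: whether the loan was not repaid in full (1 means not fully repaid, 0 means fully repaid).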
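Two of the fields above are stored in transformed form, so it can help to convert them back when reading individual rows. The following is a minimal sketch using the `int.rate` and `log.annual.inc` values from row 0 of the data preview; the helper names `rate_to_percent` and `log_income_to_income` are illustrative only and are not part of the dataset or of any library.

```python
import math


def rate_to_percent(int_rate: float) -> float:
    """Convert a rate stored as a proportion (0.11) to a percentage (11.0)."""
    return int_rate * 100


def log_income_to_income(log_annual_inc: float) -> float:
    """Invert the natural log to recover the self-reported annual income."""
    return math.exp(log_annual_inc)


# Values taken from row 0 of the dataset preview
percent = rate_to_percent(0.1189)
income = log_income_to_income(11.350407)

print(f"interest rate: {percent:.2f}%")
print(f"annual income: {income:,.0f}")
```

Working on the log scale keeps very large incomes from dominating, which is presumably why the dataset stores income this way.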
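Workflow
- Data import
- Exploratory data analysis
- Data preprocessing
- Initial modeling
- PCA analysis and prediction
- Feature selection with RFE
- Model selection and prediction
- Summary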
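The modelling steps in this list can be sketched end to end on synthetic data. This is only an illustration under assumptions: the data comes from scikit-learn's `make_classification` rather than the loan file, and the 5 PCA components, 5 RFE features, class weights, and the logistic regression model are placeholders, not the settings used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: 12 numeric features, imbalanced target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scaling, PCA, and the classifier chained in one pipeline
pca_model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pca_model.fit(X_train, y_train)
pca_recall = recall_score(y_test, pca_model.predict(X_test))

# RFE: refit the model and drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(StandardScaler().fit_transform(X_train), y_train)
selected = rfe.support_  # boolean mask over the 12 features

print("recall with PCA:", round(pca_recall, 3))
print("features kept by RFE:", selected.sum())
```

PCA builds new features as combinations of all inputs, while RFE keeps a subset of the original columns, so the RFE result is easier to interpret field by field.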
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Resampling for class imbalance
from imblearn.over_sampling import SMOTE

# Preprocessing, models, feature selection, and evaluation
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, classification_report,
                             confusion_matrix, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
Load the data
# Read the CSV and replace the dots in the column names with underscores
df = pd.read_csv('loan_data.csv')
df.columns = [i.replace('.', '_') for i in df.columns]
df.head()
| | credit_policy | purpose | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec | not_fully_paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
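Renaming the dotted columns matters because names such as `int.rate` cannot be used with attribute access or inside `query` strings. The sketch below repeats the same renaming on a toy frame with the vectorised string accessor; the frame `toy` and its two rows are illustrative, with values copied from the preview.

```python
import pandas as pd

# Toy frame using two of the original dotted column names
toy = pd.DataFrame({"int.rate": [0.1189, 0.1071], "revol.bal": [28854, 33623]})

# Same renaming as in the loading step, written with the string accessor;
# regex=False treats '.' as a literal dot rather than "any character"
toy.columns = toy.columns.str.replace(".", "_", regex=False)

# Underscored names work with attribute access and in query strings
high_rate = toy.query("int_rate > 0.11")
print(toy.columns.tolist())
print(toy.int_rate.mean(), len(high_rate))
```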
Exploratory data analysis
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 credit_policy 9578 non-null int64
1 purpose 9578 non-null object
2 int_rate 9578 non-null float64
3 installment 9578 non-null float64
4 log_annual_inc 9578 non-null float64
5 dti 9578 non-null float64
6 fico 9578 non-null int64
7 days_with_cr_line 9578 non-null float64
8 revol_bal 9578 non-null int64
9 revol_util 9578 non-null float64
10 inq_last_6mths 9578 non-null int64
11 delinq_2yrs 9578 non-null int64
12 pub_rec 9578 non-null int64
13 not_fully_paid 9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
df.isnull().sum()
credit_policy 0
purpose 0
int_rate 0
installment 0
log_annual_inc 0
dti 0
fico 0
days_with_cr_line 0
revol_bal 0
revol_util 0
inq_last_6mths 0
delinq_2yrs 0
pub_rec 0
not_fully_paid 0
dtype: int64
df.shape
(9578, 14)
# Separate the features (X) from the target (y); pop removes the column from X
X = df.copy()
y = X.pop('not_fully_paid')
X
| | credit_policy | purpose | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 |
| 1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 |
| 2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 |
| 4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9573 | 0 | all_other | 0.1461 | 344.76 | 12.180755 | 10.39 | 672 | 10474.000000 | 215372 | 82.1 | 2 | 0 | 0 |
| 9574 | 0 | all_other | 0.1253 | 257.70 | 11.141862 | 0.21 | 722 | 4380.000000 | 184 | 1.1 | 5 | 0 | 0 |
| 9575 | 0 | debt_consolidation | 0.1071 | 97.81 | 10.596635 | 13.09 | 687 | 3450.041667 | 10036 | 82.9 | 8 | 0 | 0 |
| 9576 | 0 | home_improvement | 0.1600 | 351.58 | 10.819778 | 19.18 | 692 | 1800.000000 | 0 | 3.2 | 5 | 0 | 0 |
| 9577 | 0 | debt_consolidation | 0.1392 | 853.43 | 11.264464 | 16.28 | 732 | 4740.000000 | 37879 | 57.0 | 6 | 0 | 0 |
9578 rows × 13 columns
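The `copy` and `pop` calls used to build `X` and `y` can be illustrated on a toy frame: `pop` removes the target column from the copy in place and returns it, while the original frame keeps all of its columns. The names `toy_df`, `features`, and `target` and the four rows are illustrative. Checking the class proportions of the target right after the split is a common next step, shown here as a suggestion rather than as part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the loan frame (values are illustrative)
toy_df = pd.DataFrame({
    "fico": [737, 707, 682, 667],
    "not_fully_paid": [0, 0, 1, 0],
})

features = toy_df.copy()                 # copy so the original frame is untouched
target = features.pop("not_fully_paid")  # removes the column in place and returns it

print(features.columns.tolist())
print(target.value_counts(normalize=True))  # share of each class
```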
# Split the feature names into numeric columns and the single categorical column
num = X.select_dtypes(['int64', 'float64']).columns.tolist()
non_num = ['purpose']
len(num)
# Display the numeric feature columns
X[num]
| | credit_policy | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 |
| 1 | 1 | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 |
| 2 | 1 | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 |
| 3 | 1 | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 |
| 4 | 1 | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9573 | 0 | 0.1461 | 344.76 | 12.180755 | 10.39 | 672 | 10474.000000 | 215372 | 82.1 | 2 | 0 | 0 |
| 9574 | 0 | 0.1253 | 257.70 | 11.141862 | 0.21 | 722 | 4380.000000 | 184 | 1.1 | 5 | 0 | 0 |
| 9575 | 0 | 0.1071 | 97.81 | 10.596635 | 13.09 | 687 | 3450.041667 | 10036 | 82.9 | 8 | 0 | 0 |
| 9576 | 0 | 0.1600 | 351.58 | 10.819778 | 19.18 | 692 | 1800.000000 | 0 | 3.2 | 5 | 0 | 0 |
| 9577 | 0 | 0.1392 | 853.43 | 11.264464 | 16.28 | 732 | 4740.000000 | 37879 | 57.0 | 6 | 0 | 0 |
9578 rows × 12 columns
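The `num` and `non_num` lists separate the columns that need scaling from the one that needs encoding. Since `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder` are imported at the top, one plausible use of the two lists is sketched below. This is an assumption about the preprocessing, not the article's confirmed pipeline, and the frame `toy` with its four rows is illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric columns and the categorical `purpose` column
toy = pd.DataFrame({
    "int_rate": [0.1189, 0.1071, 0.1357, 0.1008],
    "fico": [737, 707, 682, 712],
    "purpose": ["debt_consolidation", "credit_card",
                "debt_consolidation", "all_other"],
})
num_cols = toy.select_dtypes(["int64", "float64"]).columns.tolist()
cat_cols = ["purpose"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),                        # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one column per category
    ],
    sparse_threshold=0,  # always return a dense array
)
features = preprocess.fit_transform(toy)
print(features.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

Keeping both transformations in a single `ColumnTransformer` means the same fitted scaler and encoder can later be applied to the test split without refitting.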
plt.figure(figsize=(15,18))
for i, col in enumerate(num):
    ax = plt.subplot(4, 3, i+1)
    sn