From User Profiles to Application Outcome Prediction: Building a Credit Card Model

Latest recommended article published on 2024-05-19 23:31:32

蓝皮怪

Tags: python, machine learning, deep learning

Article link: https://blog.csdn.net/m0_53814833/article/details/138770907

Project link: https://www.heywhale.com/mw/project/652634b61ca35e635cde8015

Data source: https://www.kaggle.com/datasets/rohitudageri/credit-card-details

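1. Project background

This dataset contains the application records of credit card applicants that a bank received over a past period. The records cover each applicant's demographics, employment, income, and related information.
To improve its credit risk management, the bank wants to understand its customer base better and identify the key drivers of credit risk. These attributes will be evaluated to develop a credit scoring model that supports application decisions and guards against future losses.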

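2. Data description

Field	Description
Ind_ID	Customer ID
GENDER	Gender (M: male, F: female)
Car_Owner	Owns a car (Y: yes, N: no)
Propert_Owner	Owns property (Y: yes, N: no)
CHILDREN	Number of children
Annual_income	Annual income
Type_Income	Income type
EDUCATION	Education level
Marital_status	Marital status
Housing_type	Housing type
Birthday_count	Birthday count: days counted backward from the current day (0); -1 means yesterday
Employed_days	Days employed, counted backward from the current day (0); a positive value means the person is currently not employed
Mobile_phone	Has a mobile phone
Work_Phone	Has a work phone
Phone	Has a phone
EMAIL_ID	Has an email ID
Type_Occupation	Occupation type
Family_Members	Number of family members
Label	0 means the application was approved, 1 means it was rejected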

3. Importing Python libraries and reading the data

# Import the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report,confusion_matrix,roc_curve, auc
import warnings
warnings.filterwarnings('ignore')

# Read the data
data = pd.read_csv(r"D:\Desktop\商业数据分析案例\信用卡申请用户\Credit_card.csv")

4. Data preview and processing

4.1 Data preview

# Check the data dimensions
data.shape

(1548, 19)

# Inspect column dtypes and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1548 entries, 0 to 1547
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ind_ID           1548 non-null   int64  
 1   GENDER           1541 non-null   object 
 2   Car_Owner        1548 non-null   object 
 3   Propert_Owner    1548 non-null   object 
 4   CHILDREN         1548 non-null   int64  
 5   Annual_income    1525 non-null   float64
 6   Type_Income      1548 non-null   object 
 7   EDUCATION        1548 non-null   object 
 8   Marital_status   1548 non-null   object 
 9   Housing_type     1548 non-null   object 
 10  Birthday_count   1526 non-null   float64
 11  Employed_days    1548 non-null   int64  
 12  Mobile_phone     1548 non-null   int64  
 13  Work_Phone       1548 non-null   int64  
 14  Phone            1548 non-null   int64  
 15  EMAIL_ID         1548 non-null   int64  
 16  Type_Occupation  1060 non-null   object 
 17  Family_Members   1548 non-null   int64  
 18  Label            1548 non-null   int64  
dtypes: float64(2), int64(9), object(8)
memory usage: 229.9+ KB

# Count missing values per column
data.isna().sum()

Ind_ID               0
GENDER               7
Car_Owner            0
Propert_Owner        0
CHILDREN             0
Annual_income       23
Type_Income          0
EDUCATION            0
Marital_status       0
Housing_type         0
Birthday_count      22
Employed_days        0
Mobile_phone         0
Work_Phone           0
Phone                0
EMAIL_ID             0
Type_Occupation    488
Family_Members       0
Label                0
dtype: int64

# Count duplicate rows
data.duplicated().sum()

# Draw boxplots to check 'Annual_income', 'CHILDREN', 'Birthday_count', 'Employed_days', and 'Family_Members' for outliers
columns = ['Annual_income','CHILDREN','Birthday_count','Employed_days','Family_Members']
plt.figure(figsize=(20,15))
for i,col in enumerate(columns,1):
    plt.subplot(2,3,i)
    sns.boxplot(y=data[col])
    plt.title(f'Boxplot of {col}')

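Looking at the data format and the boxplots, we can see:
1. Ind_ID should be of object dtype;
2. GENDER, Annual_income, Birthday_count, and Type_Occupation have missing values. GENDER is filled with the mode, Annual_income and Birthday_count with the median, and Type_Occupation with "Unknown";
3. Outlier analysis:
3.1 Annual_income: some incomes are well above the majority, but such values are plausible in real life, so we do not treat them as outliers;
3.2 CHILDREN: one data point is extremely high and may be an outlier;
3.3 Birthday_count: this column shows no obvious outliers;
3.4 Employed_days: one value lies far above the rest. It may be a placeholder or stand for a special case (for example, retirement), so we need to think further about how to handle it;
3.5 Family_Members: one data point is extremely high and may be an outlier. Combined with 3.2 it could be a genuine record, but with the later modeling in mind we correct the extreme value in both 3.2 and 3.5.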
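
The visual reading of the boxplots can be cross-checked numerically. Below is a minimal sketch, assuming the usual 1.5 x IQR whisker rule that boxplots apply by default. The helper name `iqr_outliers` and the toy values are illustrative and do not come from the original project.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy stand-in for a column such as CHILDREN (hypothetical values)
children = pd.Series([0, 0, 1, 2, 0, 1, 3, 0, 14])
flagged = iqr_outliers(children)
print(flagged.tolist())
```

On the real data, the same helper could be applied to each column in `columns` to list the points that the boxplots draw beyond the whiskers.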

4.2 Data processing

data['Ind_ID'] = data['Ind_ID'].astype(str)  # Convert Ind_ID to object (string) dtype
# Handle missing values
# Fill gender with the mode
gender_mode = data['GENDER'].mode()[0]
data['GENDER'] = data['GENDER'].fillna(gender_mode)
# Fill annual income and birthday count with the median
annual_income_median = data['Annual_income'].median()
birthday_count_median = data['Birthday_count'].median()
data['Annual_income'] = data['Annual_income'].fillna(annual_income_median)
data['Birthday_count'] = data['Birthday_count'].fillna(birthday_count_median)
# Fill occupation type with 'Unknown'
data['Type_Occupation'] = data['Type_Occupation'].fillna('Unknown')
# Check the remaining missing values
data.isna().sum()

Ind_ID             0
GENDER             0
Car_Owner          0
Propert_Owner      0
CHILDREN           0
Annual_income      0
Type_Income        0
EDUCATION          0
Marital_status     0
Housing_type       0
Birthday_count     0
Employed_days      0
Mobile_phone       0
Work_Phone         0
Phone              0
EMAIL_ID           0
Type_Occupation    0
Family_Members     0
Label              0
dtype: int64

# Handle the extreme values in 'CHILDREN' (14) and 'Family_Members' (15) by capping them at the second-largest value
second_largest_children = data['CHILDREN'].nlargest(2).iloc[-1]
data.loc[data['CHILDREN'] == 14, 'CHILDREN'] = second_largest_children
second_largest_family = data['Family_Members'].nlargest(2).iloc[-1]
data.loc[data['Family_Members'] == 15, 'Family_Members'] = second_largest_family

# Convert 'Birthday_count' (negative day counts) into 'Age' in whole years
data['Age'] = (data['Birthday_count'] / 365).astype(int)
data['Age'] = data['Age'].abs()
# Drop 'Birthday_count'
data.drop('Birthday_count',axis=1, inplace=True)

# Inspect the positive values of 'Employed_days'
positive_employed_days = data[data['Employed_days'] > 0]['Employed_days']
positive_employed_days.describe()

count       261.0
mean     365243.0
std           0.0
min      365243.0
25%      365243.0
50%      365243.0
75%      365243.0
max      365243.0
Name: Employed_days, dtype: float64

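Every positive Employed_days value equals 365243 (roughly 1,000 years), so this is most likely a placeholder for a special case rather than a data error. Next, we look more closely at these records.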

positive_employed_data = data[data['Employed_days'] == 365243][['Type_Income', 'Type_Occupation', 'Annual_income']]
type_income_counts = positive_employed_data['Type_Income'].value_counts()
type_occupation_counts = positive_employed_data['Type_Occupation'].value_counts()
type_income_counts, type_occupation_counts

(Pensioner    261
 Name: Type_Income, dtype: int64,
 Unknown    261
 Name: Type_Occupation, dtype: int64)

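Type_Income: all 261 records with a positive value have Type_Income "Pensioner", that is, retirees;
Type_Occupation: all 261 records with a positive value have Type_Occupation "Unknown";
We can therefore give this group of records its own treatment.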

# For the placeholder rows only (Employed_days == 365243), replace 'Unknown' in 'Type_Occupation' with 'Pensioner'
data.loc[data['Employed_days'] == 365243, 'Type_Occupation'] = 'Pensioner'
# Create a binary feature "Is_Retired" (1 for retirees, 0 otherwise), then set the positive Employed_days values to 0 because retirees are not working
data['Is_Retired'] = (data['Employed_days'] == 365243).astype(int)
data.loc[data['Employed_days'] == 365243, 'Employed_days'] = 0

# Employed_days spans a very wide range, so convert it to Employed_years for the later modeling
data['Employed_years'] = (data['Employed_days'] / 365).astype(int)
data['Employed_years'] = data['Employed_years'].abs()
# Drop 'Employed_days'
data.drop('Employed_days',axis=1, inplace=True)

# Draw the boxplots again for 'Annual_income', 'CHILDREN', 'Age', 'Employed_years', and 'Family_Members'
columns = ['Annual_income','CHILDREN','Age','Employed_years','Family_Members']
plt.figure(figsize=(20,15))
for i,col in enumerate(columns,1):
    plt.subplot(2,3,i)
    sns.boxplot(y=data[col])
    plt.title(f'Boxplot of {col}')

5. Data exploration

5.1 Distribution of credit card application results

sns.set_style("whitegrid")
plt.figure(figsize=(10, 7.5))
ax = sns.countplot(data=data, x='Label', palette="Blues_d")
plt.title('Distribution of Credit Card Application Results', fontsize=15)
plt.xlabel('Application Result', fontsize=12)
plt.ylabel('Count', fontsize=12)
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.show()

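The chart shows that most credit card applications were approved and only a small share were rejected. Because the classes are imbalanced, we need to account for this when building the prediction models later.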
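
To put a number on the imbalance, here is a short sketch. The counts (1373 approved, 175 rejected) are an assumption: they are inferred from the proportions printed in the user-profile section and the dataset size of 1548, not stated directly in the article. On the real data, `data['Label'].value_counts()` gives the exact figures.

```python
import pandas as pd

# Assumed counts, inferred from the printed proportions (not stated explicitly)
label = pd.Series([0] * 1373 + [1] * 175, name="Label")

counts = label.value_counts()
shares = label.value_counts(normalize=True)
imbalance_ratio = counts[0] / counts[1]  # majority size / minority size

print(counts.to_dict())
print(shares.round(3).to_dict())
print(round(imbalance_ratio, 2))
```

A ratio of this size means a model that always predicts "approved" would already look accurate, which is why accuracy alone is a weak yardstick here.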

5.2 Distribution of numerical variables

numerical_columns = ['Annual_income', 'Age', 'Employed_years']
plt.figure(figsize=(20, 15))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 2, i)
    sns.histplot(data[column], kde=True, bins=30, color="skyblue") # Heywhale (和鲸) runs an older seaborn that uses distplot instead
    plt.title(f'Distribution of {column}', fontsize=15)
    plt.xlabel(column, fontsize=12)
    plt.ylabel('Count', fontsize=12)
plt.tight_layout()
plt.show()

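1. Annual_income: annual income is right-skewed. Most people sit in the lower income range, with a small number of high earners.
2. Age: the age distribution is close to normal, with relatively few people around 20 or around 70.
3. Employed_years: there are many zeros, partly because retirees were set to 0 earlier.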

5.3 Distribution of categorical variables

categorical_columns = ['GENDER', 'Type_Income', 'EDUCATION']

plt.figure(figsize=(20, 15))

for i, column in enumerate(categorical_columns, 1):
    plt.subplot(2, 2, i)
    sns.countplot(data=data, x=column, palette="Blues_d", order=data[column].value_counts().index)
    plt.title(f'Distribution of {column}', fontsize=15)
    plt.xlabel(column, fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

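1. GENDER: there are more women than men, consistent with the user profiles in Section 6, where women account for 64.09% of approved and 57.14% of rejected applicants.
2. Type_Income: most people have the income type "Working", followed by "Pensioner" and "Commercial associate".
3. EDUCATION: most people have a secondary education; few have either a very high or a very low level of education.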

5.4 Selected features versus the application result

relationship_columns = ['Type_Income', 'EDUCATION', 'GENDER', 'Is_Retired']

plt.figure(figsize=(20, 15))

for i, column in enumerate(relationship_columns, 1):
    plt.subplot(2, 2, i)
    sns.countplot(data=data, x=column, hue='Label', palette="Blues_d", order=data[column].value_counts().index)
    plt.title(f'Relationship between {column} and Credit Card Application Result', fontsize=15)
    plt.xlabel(column, fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=0)
    plt.legend(title='Application Result')

plt.tight_layout()
plt.show()

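1. Type_Income: the income type has a visible effect on the result. For example, most "Working" and "Commercial associate" applicants are approved, while "Pensioner" applicants have a higher rejection rate.
2. EDUCATION: educational background also appears to affect the result, with more highly educated applicants seeming more likely to be approved.
3. GENDER: the result distribution is similar for men and women, so gender does not look like a key factor in the decision.
4. Is_Retired: retirees are more likely to be rejected.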
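
Count plots show absolute numbers, so statements about a "higher rejection rate" are easier to verify with a per-group rate. The sketch below assumes only that `Label` is coded 0/1 as described in the data dictionary; the toy frame is hypothetical and merely reuses the article's column names.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the article's data
toy = pd.DataFrame({
    "Type_Income": ["Working", "Working", "Working", "Working",
                    "Pensioner", "Pensioner", "Pensioner", "Pensioner"],
    "Label":       [0, 0, 0, 1, 0, 0, 1, 1],
})

# Label is 0/1, so the group mean equals the share of rejected applications
rejection_rate = (
    toy.groupby("Type_Income")["Label"]
       .agg(rejection_rate="mean", applicants="size")
       .sort_values("rejection_rate", ascending=False)
)
print(rejection_rate)
```

Replacing `toy` with `data` and looping over `relationship_columns` would give the rates behind each panel of the chart.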

6. User profiles

6.1 Profile of approved applicants

approved_data = data[data['Label'] == 0] # Select the applicants whose application was approved
approved_profile = {
    'Gender Distribution': approved_data['GENDER'].value_counts(normalize=True),
    'Average Annual Income': approved_data['Annual_income'].mean(),
    'Education Distribution': approved_data['EDUCATION'].value_counts(normalize=True),
    'Car Ownership Distribution': approved_data['Car_Owner'].value_counts(normalize=True),
    'Property Ownership Distribution': approved_data['Propert_Owner'].value_counts(normalize=True),
    'Average Age': approved_data['Age'].mean(),
    'Occupation Distribution': approved_data['Type_Occupation'].value_counts(normalize=True),
    'Marital Status Distribution': approved_data['Marital_status'].value_counts(normalize=True)
}
print(approved_profile)

{'Gender Distribution': F    0.640932
M    0.359068
Name: GENDER, dtype: float64, 'Average Annual Income': 190049.14238892935, 'Education Distribution': Secondary / secondary special    0.671522
Higher education                 0.270211
Incomplete higher                0.045885
Lower secondary                  0.010925
Academic degree                  0.001457
Name: EDUCATION, dtype: float64, 'Car Ownership Distribution': N    0.594319
Y    0.405681
Name: Car_Owner, dtype: float64, 'Property Ownership Distribution': Y    0.655499
N    0.344501
Name: Propert_Owner, dtype: float64, 'Average Age': 43.2367079388201, 'Occupation Distribution': Laborers                 0.174800
Pensioner                0.164603
Unknown                  0.149308
Core staff               0.109978
Managers                 0.088857
Sales staff              0.081573
Drivers                  0.056082
High skill tech staff    0.042243
Medicine staff           0.034232
Accountants              0.028405
Cleaning staff           0.014567
Security staff           0.012382
Cooking staff            0.012382
Private service staff    0.012382
Secretaries              0.006555
Low-skill Laborers       0.005098
Waiters/barmen staff     0.002913
HR staff                 0.002185
Realty agents            0.001457
Name: Type_Occupation, dtype: float64, 'Marital Status Distribution': Married                 0.680991
Single / not married    0.139840
Civil marriage          0.070648
Separated               0.059723
Widow                   0.048798
Name: Marital_status, dtype: float64}

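Among applicants whose credit card application was approved:
1. Gender: 64.09% female, 35.91% male.
2. Average annual income: 190049.
3. Education: Lower secondary 1.09%, Secondary / secondary special 67.15%, Incomplete higher 4.59%, Higher education 27.02%, Academic degree 0.15%.
4. Car ownership: yes 40.57%, no 59.43%.
5. Property ownership: yes 65.55%, no 34.45%.
6. Average age: 43.24.
7. Occupation (top five): Laborers 17.48%, Pensioner 16.46%, Unknown 14.93%, Core staff 11.00%, Managers 8.89%.
8. Marital status: Married 68.10%, Single / not married 13.98%, Civil marriage 7.06%, Separated 5.97%, Widow 4.88%.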

6.2 Profile of rejected applicants

rejected_data = data[data['Label'] == 1] # Select the applicants whose application was rejected
rejected_profile = {
    'Gender Distribution': rejected_data['GENDER'].value_counts(normalize=True),
    'Average Annual Income': rejected_data['Annual_income'].mean(),
    'Education Distribution': rejected_data['EDUCATION'].value_counts(normalize=True),
    'Car Ownership Distribution': rejected_data['Car_Owner'].value_counts(normalize=True),
    'Property Ownership Distribution': rejected_data['Propert_Owner'].value_counts(normalize=True),
    'Average Age': rejected_data['Age'].mean(),
    'Occupation Distribution': rejected_data['Type_Occupation'].value_counts(normalize=True),
    'Marital Status Distribution': rejected_data['Marital_status'].value_counts(normalize=True)
}
print(rejected_profile)

{'Gender Distribution': F    0.571429
M    0.428571
Name: GENDER, dtype: float64, 'Average Annual Income': 198720.0, 'Education Distribution': Secondary / secondary special    0.622857
Higher education                 0.314286
Lower secondary                  0.034286
Incomplete higher                0.028571
Name: EDUCATION, dtype: float64, 'Car Ownership Distribution': N    0.617143
Y    0.382857
Name: Car_Owner, dtype: float64, 'Property Ownership Distribution': Y    0.628571
N    0.371429
Name: Propert_Owner, dtype: float64, 'Average Age': 44.85142857142857, 'Occupation Distribution': Pensioner                0.200000
Laborers                 0.160000
Core staff               0.131429
Unknown                  0.125714
Managers                 0.080000
Sales staff              0.057143
Drivers                  0.051429
Security staff           0.045714
High skill tech staff    0.040000
Accountants              0.028571
Cooking staff            0.022857
Medicine staff           0.017143
Cleaning staff           0.011429
Low-skill Laborers       0.011429
IT staff                 0.011429
Waiters/barmen staff     0.005714
Name: Type_Occupation, dtype: float64, 'Marital Status Distribution': Married                 0.651429
Single / not married    0.200000
Separated               0.080000
Widow                   0.045714
Civil marriage          0.022857
Name: Marital_status, dtype: float64}

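Among applicants whose credit card application was rejected:
1. Gender: 57.14% female, 42.86% male.
2. Average annual income: 198720.
3. Education: Lower secondary 3.43%, Secondary / secondary special 62.29%, Incomplete higher 2.86%, Higher education 31.43%.
4. Car ownership: yes 38.29%, no 61.71%.
5. Property ownership: yes 62.86%, no 37.14%.
6. Average age: 44.85.
7. Occupation (top five): Pensioner 20.00%, Laborers 16.00%, Core staff 13.14%, Unknown 12.57%, Managers 8.00%.
8. Marital status: Married 65.14%, Single / not married 20.00%, Separated 8.00%, Widow 4.57%, Civil marriage 2.29%.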

6.3 Difference analysis

sns.set_style("whitegrid")
plt.figure(figsize=(20, 20))

# Helper that draws a bar chart and labels each bar with its percentage
def bar_plot_with_annotation(ax, series, title, color):
    sns.barplot(x=series.index, y=series.values, ax=ax, color=color)
    ax.set_title(title, fontsize=15)
    ax.set_ylabel('')
    ax.set_xlabel('')
    for p in ax.patches:
        ax.annotate(f'{p.get_height()*100:.2f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=11, color='black', xytext=(0, 10),
                    textcoords='offset points')
    ax.set_ylim(0, 1)

# Gender
ax1 = plt.subplot(4, 2, 1)
approved_profile['Gender Distribution'].plot(kind='pie', ax=ax1, autopct='%1.1f%%', startangle=140, colors=['#66b2b2', '#ff9999'])
ax1.set_title('Gender Distribution (Approved)', fontsize=15)
ax1.set_ylabel('')

ax2 = plt.subplot(4, 2, 2)
rejected_profile['Gender Distribution'].plot(kind='pie', ax=ax2, autopct='%1.1f%%', startangle=140, colors=['#66b2b2', '#ff9999'])
ax2.set_title('Gender Distribution (Rejected)', fontsize=15)
ax2.set_ylabel('')

# Education
ax3 = plt.subplot(4, 2, 3)
bar_plot_with_annotation(ax3, approved_profile['Education Distribution'], 'Education Distribution (Approved)', '#66b2b2')

ax4 = plt.subplot(4, 2, 4)
bar_plot_with_annotation(ax4, rejected_profile['Education Distribution'], 'Education Distribution (Rejected)', '#ff9999')

# Car ownership
ax5 = plt.subplot(4, 2, 5)
bar_plot_with_annotation(ax5, approved_profile['Car Ownership Distribution'], 'Car Ownership Distribution (Approved)', '#66b2b2')

ax6 = plt.subplot(4, 2, 6)
bar_plot_with_annotation(ax6, rejected_profile['Car Ownership Distribution'], 'Car Ownership Distribution (Rejected)', '#ff9999')

# Property ownership
ax7 = plt.subplot(4, 2, 7)
bar_plot_with_annotation(ax7, approved_profile['Property Ownership Distribution'], 'Property Ownership Distribution (Approved)', '#66b2b2')

ax8 = plt.subplot(4, 2, 8)
bar_plot_with_annotation(ax8, rejected_profile['Property Ownership Distribution'], 'Property Ownership Distribution (Rejected)', '#ff9999')

plt.tight_layout()
plt.show()

plt.figure(figsize=(20, 20))

# Annual income
ax1 = plt.subplot(3, 2, 1)
sns.histplot(approved_data['Annual_income'], ax=ax1, color='#66b2b2', bins=30, kde=True) # Heywhale (和鲸) uses distplot here
ax1.set_title('Annual Income Distribution (Approved)', fontsize=15)
ax1.set_xlabel('Annual Income')
ax1.set_ylabel('Frequency')

ax2 = plt.subplot(3, 2, 2)
sns.histplot(rejected_data['Annual_income'], ax=ax2, color='#ff9999', bins=30, kde=True) # Heywhale (和鲸) uses distplot here
ax2.set_title('Annual Income Distribution (Rejected)', fontsize=15)
ax2.set_xlabel('Annual Income')
ax2.set_ylabel('Frequency')

# Occupation type (top five)
ax3 = plt.subplot(3, 2, 3)
top_occupations_approved = approved_profile['Occupation Distribution'].head(5)
bar_plot_with_annotation(ax3, top_occupations_approved, 'Occupation Distribution (Approved)', '#66b2b2')

ax4 = plt.subplot(3, 2, 4)
top_occupations_rejected = rejected_profile['Occupation Distribution'].head(5)
bar_plot_with_annotation(ax4, top_occupations_rejected, 'Occupation Distribution (Rejected)', '#ff9999')

# Marital status
ax5 = plt.subplot(3, 2, 5)
bar_plot_with_annotation(ax5, approved_profile['Marital Status Distribution'], 'Marital Status Distribution (Approved)', '#66b2b2')

ax6 = plt.subplot(3, 2, 6)
bar_plot_with_annotation(ax6, rejected_profile['Marital Status Distribution'], 'Marital Status Distribution (Rejected)', '#ff9999')

plt.tight_layout()
plt.show()

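Based on the analysis and charts above, the approved and rejected groups compare as follows:
1. Gender: women make up about 64.09% of approved applicants but 57.14% of rejected applicants, so male applicants are relatively more likely to be rejected.
2. Annual income: the average is 190049 for approved applicants and 198720 for rejected applicants. The gap is small, which suggests annual income is not the sole or main factor in the decision.
3. Education: no applicant with an Academic degree was rejected, which hints that more education lowers the chance of rejection. Yet, surprisingly, 31.43% of rejected applicants have Higher education.
4. The other factors differ only slightly. Although the charts show differences between the two groups on some features, those differences are not always significant. The decision therefore probably depends on a combination of factors rather than any single feature, and deeper insight requires further analysis and modeling.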
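
Whether a difference such as the gender split is statistically significant can be checked with a chi-square test of independence. The sketch below uses `scipy.stats.chi2_contingency`, which is not part of the original notebook. The counts are an assumption reconstructed from the printed proportions (1373 approved with 64.09% women, 175 rejected with 57.14% women), not figures stated in the article.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: approved, rejected; columns: female, male.
# Counts reconstructed from the printed proportions (assumption).
observed = np.array([
    [880, 493],
    [100, 75],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A p-value above the chosen threshold (commonly 0.05) would support the reading that the gender difference is not clearly significant. On the real data, `pd.crosstab(data['Label'], data['GENDER'])` would supply the observed table directly.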

7. Predicting the application result

7.1 Data processing

# Drop Ind_ID because it carries no information for the model
data.drop(columns=['Ind_ID'], inplace=True)
# Replace every Y with 1 and every N with 0, and encode gender the same way (M -> 0, F -> 1)
data.replace({'Y': 1, 'N': 0}, inplace=True)
data.replace({'M': 0, 'F': 1}, inplace=True)
# Encode the education column as an ordinal variable
education_mapping = {
    'Lower secondary': 0,
    'Secondary / secondary special': 1,
    'Incomplete higher': 2,
    'Higher education': 3,
    'Academic degree': 4
}
data['EDUCATION'] = data['EDUCATION'].map(education_mapping)

# One-hot encode the remaining categorical columns (everything except the binary and ordinal-encoded ones):
categorical_cols = data.select_dtypes(include=['object']).columns
new_data = pd.get_dummies(data, columns=categorical_cols)
# Standardize the numeric variables
numerical_cols = ['Annual_income', 'Age', 'Employed_years', 'Family_Members', 'CHILDREN','EDUCATION']
scaler = StandardScaler()
new_data[numerical_cols] = scaler.fit_transform(new_data[numerical_cols])

x = new_data.drop('Label', axis=1)
y = new_data['Label']
# Use stratified sampling so the Label distribution in the training and test sets matches the full dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y) # 70/30 split
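
One caveat about the step above: `StandardScaler` is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A minimal sketch of a leakage-free ordering follows. It runs on synthetic data, and the values are hypothetical stand-ins that only reuse two of the article's column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the real features (hypothetical values)
x = pd.DataFrame({
    "Annual_income": rng.normal(190_000, 60_000, 200),
    "Age": rng.integers(21, 68, 200),
})
y = pd.Series(rng.integers(0, 2, 200), name="Label")

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=10, stratify=y
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean/std from training rows only
x_test_scaled = scaler.transform(x_test)        # reuse the training statistics

print(x_train_scaled.mean(axis=0).round(6))
```

With only about 1,500 rows the practical effect is probably small, but splitting first keeps the test set untouched by any fitted step.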

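Because the labels are imbalanced, a model trained on the raw data may perform poorly, so we can balance the classes with oversampling or undersampling. In this case undersampling would discard a large share of the samples, so I chose oversampling.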

# Separate the minority and majority classes
x_minority = x_train[y_train == 1]
y_minority = y_train[y_train == 1]
x_majority = x_train[y_train == 0]
y_majority = y_train[y_train == 0]
# Oversample the minority class with replacement up to the majority size;
# resampling x and y in one call keeps features and labels aligned
x_minority_resampled, y_minority_resampled = resample(
    x_minority, y_minority, replace=True, n_samples=len(x_majority), random_state=15)
new_x_train = pd.concat([x_majority, x_minority_resampled])
new_y_train = pd.concat([y_majority, y_minority_resampled])
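
Duplicating minority rows is one way to rebalance; class weighting is a common alternative that leaves the training set unchanged. The sketch below is not the approach used in this article. It runs on synthetic data from `make_classification` and uses scikit-learn's `class_weight='balanced'` option purely as an illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 11% positives), a stand-in for the real features
x, y = make_classification(n_samples=1500, n_features=10, weights=[0.89, 0.11],
                           random_state=15)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=10, stratify=y
)

# class_weight='balanced' reweights the loss inversely to class frequency,
# so no rows need to be duplicated
clf = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=15)
clf.fit(x_train, y_train)

minority_recall = recall_score(y_test, clf.predict(x_test))
print("minority recall:", round(minority_recall, 3))
```

Either way, only the training split should be rebalanced, as the code above does; the test set keeps its natural class ratio.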

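The data is now ready, so we can start modeling. I plan to build five models (logistic regression, random forest, SVM, XGBoost, and a neural network), evaluate each with the same metrics, and then draw a conclusion.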

7.2 Logistic regression

logreg = LogisticRegression(random_state=15)
logreg.fit(new_x_train, new_y_train)

y_pred = logreg.predict(x_test)
class_report = classification_report(y_test, y_pred)
print(class_report)

              precision    recall  f1-score   support

           0       0.90      0.63      0.74       412
           1       0.13      0.43      0.20        53

    accuracy                           0.61       465
   macro avg       0.51      0.53      0.47       465
weighted avg       0.81      0.61      0.68       465

# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Logistic Regression Model')
plt.show()

# Plot the ROC curve
fpr, tpr, _ = roc_curve(y_test, logreg.predict_proba(x_test)[:, 1])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

The logistic regression model scores as follows:
1. Precision: 0.90 for class 0 and 0.13 for class 1.
2. Recall: 0.63 for class 0 and 0.43 for class 1.
3. F1 score: 0.74 for class 0 and 0.20 for class 1.
4. Accuracy: 0.61.
5. ROC AUC: 0.60.
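
For readers less familiar with these metrics, the sketch below derives them by hand from a confusion matrix. The matrix is hypothetical: it was chosen so that the rounded results agree with the report above (support 412 and 53), and it is not the author's actual matrix, which appears only as an image.

```python
import numpy as np

# Hypothetical confusion matrix (rows: actual 0/1, columns: predicted 0/1),
# chosen to agree with the rounded report; not the author's actual matrix
cm = np.array([[260, 152],
               [30, 23]])

tn, fp, fn, tp = (int(v) for v in cm.ravel())
precision_1 = tp / (tp + fp)   # of the predicted rejections, the share that were real
recall_1 = tp / (tp + fn)      # of the real rejections, the share that were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The low class-1 precision comes from the large number of approved applicants predicted as rejected (the `fp` cell), which is the main weakness of this model.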

7.3 Random forest

rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(new_x_train, new_y_train)

y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)

              precision    recall  f1-score   support

           0       0.93      0.97      0.95       412
           1       0.67      0.45      0.54        53

    accuracy                           0.91       465
   macro avg       0.80      0.71      0.75       465
weighted avg       0.90      0.91      0.90       465

# Plot the confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Random Forest Model')
plt.show()

![Confusion matrix for the random forest model](https://img-blog.csdnimg.cn/direct/4e054be866ca4b7aaed348d8aed0c055.png#pic_center)

# Plot the ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_clf.predict_proba(x_test)[:, 1])
roc_auc_rf = auc(fpr_rf, tpr_rf)

plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_rf)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Random Forest')
plt.legend(loc="lower right")
plt.show()

![ROC curve for the random forest model](https://img-blog.csdnimg.cn/direct/87c8aadc05e04366bc1fbcde465f260b.png#pic_center)

The random forest model scores as follows:
1. Precision: 0.93 for class 0 and 0.67 for class 1.
2. Recall: 0.97 for class 0 and 0.45 for class 1.
3. F1 score: 0.95 for class 0 and 0.54 for class 1.
4. Accuracy: 0.91.
5. ROC AUC: 0.77.

7.4 Support vector machine

svm_clf = SVC(kernel='rbf', probability=True, random_state=15)
svm_clf.fit(new_x_train, new_y_train)

y_pred_svm = svm_clf.predict(x_test)
class_report_svm = classification_report(y_test, y_pred_svm)
print(class_report_svm)

              precision    recall  f1-score   support

           0       0.93      0.79      0.85       412
           1       0.25      0.55      0.34        53

    accuracy                           0.76       465
   macro avg       0.59      0.67      0.60       465
weighted avg       0.85      0.76      0.80       465

# Plot the confusion matrix
cm_svm = confusion_matrix(y_test, y_pred_svm)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_svm, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for SVM Model')
plt.show()

# Plot the ROC curve
fpr_svm, tpr_svm, _ = roc_curve(y_test, svm_clf.predict_proba(x_test)[:, 1])
roc_auc_svm = auc(fpr_svm, tpr_svm)

plt.figure(figsize=(8, 6))
plt.plot(fpr_svm, tpr_svm, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_svm)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for SVM')
plt.legend(loc="lower right")
plt.show()

The support vector machine model scores as follows:
1. Precision: 0.93 for class 0 and 0.25 for class 1.
2. Recall: 0.79 for class 0 and 0.55 for class 1.
3. F1 score: 0.85 for class 0 and 0.34 for class 1.
4. Accuracy: 0.76.
5. ROC AUC: 0.71.

7.5 XGBoost

xgb_clf = xgb.XGBClassifier(random_state=15, use_label_encoder=False, eval_metric='logloss')
xgb_clf.fit(new_x_train, new_y_train)

y_pred_xgb = xgb_clf.predict(x_test)
class_report_xgb = classification_report(y_test, y_pred_xgb)
print(class_report_xgb)

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       412
           1       0.51      0.51      0.51        53

    accuracy                           0.89       465
   macro avg       0.72      0.72      0.72       465
weighted avg       0.89      0.89      0.89       465

# Plot the confusion matrix
cm_xgb = confusion_matrix(y_test, y_pred_xgb)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_xgb, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for XGBoost Model')
plt.show()

# Plot the ROC curve
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, xgb_clf.predict_proba(x_test)[:, 1])
roc_auc_xgb = auc(fpr_xgb, tpr_xgb)

plt.figure(figsize=(8, 6))
plt.plot(fpr_xgb, tpr_xgb, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_xgb)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for XGBoost')
plt.legend(loc="lower right")
plt.show()

The XGBoost model scores as follows:
1. Precision: 0.94 for class 0 and 0.51 for class 1.
2. Recall: 0.94 for class 0 and 0.51 for class 1.
3. F1 score: 0.94 for class 0 and 0.51 for class 1.
4. Accuracy: 0.89.
5. ROC AUC: 0.74.

7.6 Neural network

mlp_clf = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=200, random_state=15)
mlp_clf.fit(new_x_train, new_y_train)

y_pred_mlp = mlp_clf.predict(x_test)
class_report_mlp = classification_report(y_test, y_pred_mlp)
print(class_report_mlp)

              precision    recall  f1-score   support

           0       0.93      0.92      0.92       412
           1       0.42      0.45      0.44        53

    accuracy                           0.87       465
   macro avg       0.67      0.69      0.68       465
weighted avg       0.87      0.87      0.87       465

# Plot the confusion matrix (use cm_mlp here, not the XGBoost matrix)
cm_mlp = confusion_matrix(y_test, y_pred_mlp)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_mlp, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for MLP Model')
plt.show()

# Plot the ROC curve
fpr_mlp, tpr_mlp, _ = roc_curve(y_test, mlp_clf.predict_proba(x_test)[:, 1])
roc_auc_mlp = auc(fpr_mlp, tpr_mlp)

plt.figure(figsize=(8, 6))
plt.plot(fpr_mlp, tpr_mlp, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_mlp)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for MLP')
plt.legend(loc="lower right")
plt.show()

The neural network model scores as follows:
1. Precision: 0.93 for class 0 and 0.42 for class 1.
2. Recall: 0.92 for class 0 and 0.45 for class 1.
3. F1 score: 0.92 for class 0 and 0.44 for class 1.
4. Accuracy: 0.87.
5. ROC AUC: 0.73.

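In summary:
1. Before parameter tuning, the random forest performs best, with the highest accuracy and ROC AUC and very strong predictions for class 0.
2. XGBoost has the best class-1 recall among the stronger models (0.51; SVM reaches 0.55, but with a class-1 precision of only 0.25), so it is also a good candidate.
3. The neural network is solid overall and close to the random forest and XGBoost, so it is another reasonable choice.
4. Logistic regression performs clearly worse than the other four models and is only slightly better than random guessing (AUC = 0.5), so we drop it.
5. Next we tune the random forest, XGBoost, and neural network models to get their parameters as close to optimal as practical. Tuning is slow, so we only try a few common values, which are not guaranteed to be the best. We then settle on a final prediction model.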
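
To make the comparison easier to scan, the scores reported in Sections 7.2 to 7.6 can be collected into one table. This is a minimal sketch: the numbers are copied from the reports above, while the variable and column names are illustrative.

```python
import pandas as pd

# Metrics copied from the classification reports and ROC plots above
summary = pd.DataFrame(
    {
        "precision_1": [0.13, 0.67, 0.25, 0.51, 0.42],
        "recall_1":    [0.43, 0.45, 0.55, 0.51, 0.45],
        "f1_1":        [0.20, 0.54, 0.34, 0.51, 0.44],
        "accuracy":    [0.61, 0.91, 0.76, 0.89, 0.87],
        "roc_auc":     [0.60, 0.77, 0.71, 0.74, 0.73],
    },
    index=["Logistic regression", "Random forest", "SVM", "XGBoost", "MLP"],
)

print(summary.sort_values("f1_1", ascending=False))
print("best class-1 recall:", summary["recall_1"].idxmax())
print("best ROC AUC:", summary["roc_auc"].idxmax())
```

Sorting by class-1 F1 puts the random forest, XGBoost, and the neural network at the top, which matches the three models carried forward into tuning.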

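7.7 Parameter tuning

7.7.1 Random forest

# Define the parameter grid
param_grid = {
    'n_estimators':[10,50,100], # Number of trees; more trees usually help, at the cost of longer training
    'max_depth':[None,10,20,30], # Maximum tree depth; controls model complexity
    'min_samples_split':[2,5,10], # Minimum number of samples required to split a node
    'min_samples_leaf':[1,2,4], # Minimum number of samples required at a leaf
    'max_features':['auto','sqrt','log2'] # Number of features considered per split ('auto' was removed in scikit-learn 1.3; drop it on newer versions)
}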

rf = RandomForestClassifier()
grid_search = GridSearchCV(estimator=rf,param_grid=param_grid,cv=5,n_jobs=-1,verbose=1) # verbose=0: no log output; verbose=1: overall progress; verbose=2: more detailed progress
grid_search.fit(new_x_train,new_y_train)
print('Best parameters found:',grid_search.best_params_)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best parameters found: {'max_depth': 30, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

best_rf = grid_search.best_estimator_
y_pred_orf = best_rf.predict(x_test)
class_report_orf = classification_report(y_test, y_pred_orf)
print(class_report_orf)

              precision    recall  f1-score   support

           0       0.93      0.97      0.95       412
           1       0.69      0.47      0.56        53

    accuracy                           0.92       465
   macro avg       0.81      0.72      0.76       465
weighted avg       0.91      0.92      0.91       465

# Plot the confusion matrix
cm_orf = confusion_matrix(y_test, y_pred_orf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_orf, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Optimized Random Forest Model')
plt.show()

# Plot the ROC curve
fpr_orf, tpr_orf, _ = roc_curve(y_test, best_rf.predict_proba(x_test)[:, 1])
roc_auc_orf = auc(fpr_orf, tpr_orf)

plt.figure(figsize=(8, 6))
plt.plot(fpr_orf, tpr_orf, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_orf)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Optimized Random Forest')
plt.legend(loc="lower right")
plt.show()

The tuned random forest model scores as follows:
1. Precision: 0.93 for class 0 and 0.69 for class 1.
2. Recall: 0.97 for class 0 and 0.47 for class 1.
3. F1 score: 0.95 for class 0 and 0.56 for class 1.
4. Accuracy: 0.92.
5. ROC AUC: 0.77.
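
The grid search above uses `GridSearchCV`'s default scoring, which for a classifier is accuracy. When the minority class is the one that matters, the `scoring` argument can target it directly. The sketch below is an illustration on synthetic data with a deliberately tiny grid, not a rerun of the article's search. The `class_weight="balanced"` setting and the fixed `random_state` are assumptions added for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data as a stand-in for the real training set
x, y = make_classification(n_samples=400, n_features=10, weights=[0.85, 0.15],
                           random_state=15)

param_grid = {  # deliberately tiny grid so the sketch runs quickly
    "n_estimators": [20, 50],
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=15)
search = GridSearchCV(
    RandomForestClassifier(random_state=15, class_weight="balanced"),
    param_grid,
    scoring="f1",   # optimise minority-class F1 instead of the default accuracy
    cv=cv,
    n_jobs=1,
)
search.fit(x, y)
print(search.best_params_, round(search.best_score_, 3))
```

Fixing `random_state` on the estimator also makes the selected parameters reproducible from run to run.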

7.7.2 XGBoost

param_grid = {
    'learning_rate':[0.01,0.05,0.1],
    'n_estimators':[100,500,1000],
    'max_depth':[3,5,10],
    'min_child_weight':[1,3,5],
    'gamma':[0,0.1,0.2],
    'subsample':[0.8,1.0],
    'colsample_bytree':[0.8,1.0],
    'objective':['binary:logistic']
}
best_xgb = xgb.XGBClassifier()
grid_search = GridSearchCV(estimator=best_xgb,param_grid=param_grid,cv=5,n_jobs=-1,verbose=1)
grid_search.fit(new_x_train,new_y_train)
print('Best parameters found:',grid_search.best_params_)

Fitting 5 folds for each of 972 candidates, totalling 4860 fits
Best parameters found: {'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 10, 'min_child_weight': 1, 'n_estimators': 1000, 'objective': 'binary:logistic', 'subsample': 0.8}

best_xgb = grid_search.best_estimator_
y_pred_oxgb = best_xgb.predict(x_test)
class_report_oxgb = classification_report(y_test, y_pred_oxgb)
print(class_report_oxgb)

              precision    recall  f1-score   support

           0       0.94      0.95      0.94       412
           1       0.57      0.53      0.55        53

    accuracy                           0.90       465
   macro avg       0.76      0.74      0.75       465
weighted avg       0.90      0.90      0.90       465

# Plot the confusion matrix
cm_oxgb = confusion_matrix(y_test, y_pred_oxgb)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_oxgb, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Optimized XGBoost Model')
plt.show()


# Plot the ROC curve
fpr_oxgb, tpr_oxgb, _ = roc_curve(y_test, best_xgb.predict_proba(x_test)[:, 1])
roc_auc_oxgb = auc(fpr_oxgb, tpr_oxgb)

plt.figure(figsize=(8, 6))
plt.plot(fpr_oxgb, tpr_oxgb, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_oxgb)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Optimized XGBoost')
plt.legend(loc="lower right")
plt.show()

在这里插入图片描述

The tuned XGBoost model scores as follows:
1. Precision: 0.94 for class 0 and 0.57 for class 1.
2. Recall: 0.95 for class 0 and 0.53 for class 1.
3. F1 score: 0.94 for class 0 and 0.55 for class 1.
4. Accuracy: 0.90.
5. ROC AUC: 0.76.
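The ROC figure reported here is the area under the ROC curve, which `auc(fpr, tpr)` computes with the trapezoidal rule. The sketch below shows that calculation by hand. The ROC points are made up for illustration and are not taken from any of the models in this article; `trapezoid_auc` is a hypothetical helper name.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a ROC curve whose points are sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area

# Made-up ROC points (illustration only).
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.5, 0.8, 1.0]

example_auc = trapezoid_auc(fpr, tpr)
# The dashed diagonal in the plots corresponds to random guessing.
diagonal_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])

print(round(example_auc, 2), round(diagonal_auc, 2))  # 0.76 0.5
```

A value of 0.5 means the scores rank positives no better than chance, and 1.0 means every positive is ranked above every negative.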

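7.7.3 Neural network (MLP)

# This step takes a long time to run. Leave it running if you have the time, or skip straight to the conclusion if you do not. You can also tune it yourself over a wider parameter range to look for a better configuration.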
param_grid = {
    'hidden_layer_sizes': [(50,), (100,)], # hidden layer sizes
    'activation': ['relu', 'tanh', 'logistic'], # activation function
    'solver': ['adam', 'sgd'], # optimization algorithm
    'alpha': [0.0001, 0.001, 0.01], # L2 regularization strength
    'learning_rate': ['constant', 'adaptive'] # learning rate schedule
}
mlp = MLPClassifier(random_state=15, max_iter=1000)
grid_search = GridSearchCV(mlp, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(new_x_train, new_y_train)
print('Best parameters found:',grid_search.best_params_)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best parameters found: {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}

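best_params = grid_search.best_params_
# Note: this rebuilds the model from the searched parameters only, so random_state=15 and
# max_iter=1000 used during the search are not carried over (grid_search.best_estimator_ keeps them).
best_mlp = MLPClassifier(**best_params)
best_mlp.fit(new_x_train, new_y_train)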
y_pred_omlp = best_mlp.predict(x_test)
class_report_omlp = classification_report(y_test, y_pred_omlp)
print(class_report_omlp)

              precision    recall  f1-score   support

           0       0.93      0.90      0.92       412
           1       0.38      0.47      0.42        53

    accuracy                           0.85       465
   macro avg       0.66      0.69      0.67       465
weighted avg       0.87      0.85      0.86       465

# Plot the confusion matrix
cm_omlp = confusion_matrix(y_test, y_pred_omlp)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_omlp, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Optimized MLP Model')
plt.show()


# Plot the ROC curve
fpr_omlp, tpr_omlp, _ = roc_curve(y_test, best_mlp.predict_proba(x_test)[:, 1])
roc_auc_omlp = auc(fpr_omlp, tpr_omlp)

plt.figure(figsize=(8, 6))
plt.plot(fpr_omlp, tpr_omlp, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_omlp)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Optimized MLP Model')
plt.legend(loc="lower right")
plt.show()


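The tuned neural network model scores as follows:
1. Precision: 0.93 for class 0 and 0.38 for class 1.
2. Recall: 0.90 for class 0 and 0.47 for class 1.
3. F1 score: 0.92 for class 0 and 0.42 for class 1.
4. Accuracy: 0.85.
5. ROC AUC: 0.71.
Surprisingly, the tuned neural network performs worse than the untuned one. Possible reasons include overfitting and a parameter grid that does not contain the best configuration. Note also that the refit uses MLPClassifier(**best_params), so it does not reuse the max_iter=1000 and random_state=15 settings from the search. Because the search is so slow, we do not recommend another round of tuning here: a tuned neural network is unlikely to beat the random forest or XGBoost by much, and it would cost a lot of time.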
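One detail worth checking is that the refit above calls `MLPClassifier(**best_params)`, while the grid search itself ran on `MLPClassifier(random_state=15, max_iter=1000)`. The sketch below builds both variants without training anything and only inspects their settings. The names `base_mlp`, `rebuilt` and `tuned` are illustrative, and whether this difference explains the weaker scores is an assumption that is not tested here.

```python
from sklearn.base import clone
from sklearn.neural_network import MLPClassifier

# Estimator that was handed to GridSearchCV in the search above.
base_mlp = MLPClassifier(random_state=15, max_iter=1000)

# Best parameters as printed by the grid search.
best_params = {
    'activation': 'relu',
    'alpha': 0.001,
    'hidden_layer_sizes': (100,),
    'learning_rate': 'constant',
    'solver': 'adam',
}

# What the refit above builds: searched parameters on top of library defaults.
rebuilt = MLPClassifier(**best_params)
# Searched parameters on top of the base estimator's fixed settings.
tuned = clone(base_mlp).set_params(**best_params)

for name, model in [("rebuilt", rebuilt), ("tuned", tuned)]:
    params = model.get_params()
    print(name, params["max_iter"], params["random_state"])
```

The rebuilt model falls back to the defaults (`max_iter=200`, no fixed seed), whereas the tuned one keeps `max_iter=1000` and `random_state=15`. With the default `refit=True`, `grid_search.best_estimator_` is already a model of the second kind fitted on the full training set, so it could be used directly, although the scores would then differ from the report above.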

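7.8 Key feature importances

We use the tuned random forest here because it performs best overall (accuracy 0.92, ROC AUC 0.77). The tuned XGBoost model would also be a reasonable choice, since after tuning it performs almost as well (accuracy 0.90, ROC AUC 0.76).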

rf_feature_importance = best_rf.feature_importances_
feature_names = new_x_train.columns
rf_feature_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf_feature_importance
})
sorted_rf_feature_df = rf_feature_df.sort_values(by='Importance', ascending=False).head() # keep the five most important features

sorted_rf_feature_df

	Feature	Importance
4	Annual_income	0.165538
11	Age	0.161521
13	Employed_years	0.113815
10	Family_Members	0.040307
5	EDUCATION	0.040090

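In the random forest model, the features rank by importance as: annual income > age > years employed > number of family members > education.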
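Since random forest importances in scikit-learn are normalized to sum to 1 across all features, the table can also be read as shares of the total. The short sketch below copies the five printed values into a plain dictionary (so it runs without the fitted model) and sums them:

```python
# Importance values copied from the table above.
top5 = {
    "Annual_income": 0.165538,
    "Age": 0.161521,
    "Employed_years": 0.113815,
    "Family_Members": 0.040307,
    "EDUCATION": 0.040090,
}

# Rank features from most to least important.
ranked = sorted(top5.items(), key=lambda item: item[1], reverse=True)

top3_share = sum(value for _, value in ranked[:3])
top5_share = sum(value for _, value in ranked)

for feature, value in ranked:
    print(f"{feature:<15} {value:.1%}")
print(f"top 3 share: {top3_share:.1%}, top 5 share: {top5_share:.1%}")
```

The first three features together carry about 44% of the total importance and the top five about 52%, so income, age and years employed stand clearly apart from the rest.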

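8. Summary

1. We did an initial cleanup of the data, handling missing values and outliers, and built new features such as retirement status, age and years employed.
2. We explored the data and visualized the results. In this dataset most credit card applications were approved and only a small share were rejected; annual income is right-skewed, with most incomes concentrated in the lower range and a few high earners; age is approximately normally distributed, with few applicants around 20 or around 70; men outnumber women; most applicants have the income type "Working", followed by "Pensioner" and "Commercial associate"; most have a secondary education, with fewer at the higher or lower ends.
3. We built profiles of approved and rejected applicants and compared them. The share of women is slightly higher among approved applicants and lower among rejected applicants, which suggests that male applicants are relatively more likely to be rejected; no applicant with a bachelor's degree was rejected. These are only preliminary findings, and a model is needed to explore them further.
4. We built several models and tuned their hyperparameters. The tuned random forest performed best overall, so we selected it. According to this model, the features rank by importance as: annual income > age > years employed > number of family members > education. This can support the bank's future work: staff should pay closer attention to these features, and the model can be used to predict whether an application will be approved, since the tuned random forest reaches 92% accuracy on the test set (with a recall of 0.47 for rejected applications).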