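Loan Approval Prediction Analysis (Machine Learning: Logistic Regression)

Problem Background

Dream Housing Finance deals in all kinds of home loans. It operates in three types of areas: urban, semi-urban, and rural.
Process: a customer first applies for a home loan, and the company then validates the customer's eligibility. The company wants to automate this eligibility check in real time, based on the details the customer provides in the online application form (gender, marital status, education, number of dependents, income, loan amount, credit history, and so on).
To automate the process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

Dataset link: Loan Prediction

Data Attribute Description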

Variable            Description
Loan_ID             Unique loan ID
Gender              Male / Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant education (Graduate / Not Graduate)
Self_Employed       Self-employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Co-applicant income
LoanAmount          Loan amount, in thousands
Loan_Amount_Term    Term of the loan, in months
Credit_History      Credit history meets guidelines
Property_Area       Urban / Semi-urban / Rural
Loan_Status         (Target) Loan approved (Y/N)

Load the Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

Load the dataset

train=pd.read_csv('train_ctrUa4K.csv')
test=pd.read_csv('test_lAUu6dG.csv')
ss=pd.read_csv('sample_submission_49d68Cx.csv')

Data exploration

## Training set size
train.shape
(614, 13)
## Test set size
test.shape
(367, 12)
train.head()

Categorical columns:
Gender (Male/Female), Married (Yes/No), Dependents (possible values: 0, 1, 2, 3+), Education (Graduate/Not Graduate), Self_Employed (No/Yes), Credit_History (1/0), Property_Area (Rural/Semiurban/Urban), and Loan_Status (Y/N), the target variable.

Numerical columns:
ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. Loan_ID is an identifier rather than a numerical feature.

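Data preprocessing

Concatenate the training and test data so that both are preprocessed together:

# ignore_index=True gives the combined frame a unique 0..n-1 index,
# so later column assignments line up row by row
data = pd.concat([train, test], ignore_index=True)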

Drop the column that is not needed:

data.drop("Loan_ID",axis=1,inplace=True)

Check for missing values:

data.isnull().sum()
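
The raw counts are easier to read next to percentages. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative stand-ins for `data`, not taken from the dataset) that lists only the columns with gaps, worst first:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, np.nan, 66.0],
    "ApplicantIncome": [5000, 3000, 4000, 2500],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
# Keep only the columns that actually have gaps, largest count first.
missing = missing[missing["n_missing"] > 0].sort_values(
    "n_missing", ascending=False)
print(missing)
```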

Impute the missing values:

# Fill each categorical column with its most frequent value (the mode);
# Series.mode() ignores NaN by default
for col in ["Gender", "Married", "Dependents", "Self_Employed", "Credit_History"]:
    data[col] = data[col].fillna(data[col].mode()[0])
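
To see what mode imputation does, here is a small sketch on invented values. The helper name `fill_with_mode` is hypothetical and is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Invented values; two cells are missing.
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", None, "No"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})


def fill_with_mode(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's mode."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out


filled = fill_with_mode(toy, ["Married", "Credit_History"])
print(filled)
```

Working on a copy leaves the input frame untouched, which makes it easy to compare the data before and after imputation.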

Use IterativeImputer to fill the missing values in LoanAmount and Loan_Amount_Term:

from sklearn.ensemble import RandomForestRegressor

data1 = data.loc[:, ['LoanAmount', 'Loan_Amount_Term']]

# Run the imputer with a random forest estimator
imp = IterativeImputer(RandomForestRegressor(random_state=0), max_iter=10, random_state=0)
# Keep the original index so the imputed values are assigned back to the right rows
data1 = pd.DataFrame(imp.fit_transform(data1), columns=data1.columns, index=data1.index)
data['LoanAmount'] = data1['LoanAmount']
data['Loan_Amount_Term'] = data1['Loan_Amount_Term']
data
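
One detail worth checking when wrapping the imputer output in a new DataFrame is the index: `fit_transform` returns a plain array, and pandas aligns column assignments by index label, not by position. The sketch below uses invented numbers and the imputer's default estimator (not the random forest) to keep it fast. It simulates the repeated index labels that `pd.concat` produces without `ignore_index=True` and passes the original index through explicitly:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Repeated index labels, as in a frame concatenated without ignore_index.
toy = pd.DataFrame(
    {
        "LoanAmount": [100.0, np.nan, 150.0, 200.0, 120.0],
        "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0, 360.0],
    },
    index=[0, 1, 2, 0, 1],
)

imp = IterativeImputer(max_iter=10, random_state=0)  # default estimator
# Reuse toy.index so every imputed row keeps its original label.
imputed = pd.DataFrame(
    imp.fit_transform(toy), columns=toy.columns, index=toy.index)
print(imputed)
```

Observed values pass through unchanged; only the NaN cells are filled.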

Convert the categorical variables to integers:

# Binary columns
data["Gender"] = data["Gender"].map({'Male': 0, 'Female': 1}).astype(int)
data["Married"] = data["Married"].map({'No': 0, 'Yes': 1}).astype(int)
data["Education"] = data["Education"].map(
    {'Not Graduate': 0, 'Graduate': 1}).astype(int)
data["Self_Employed"] = data["Self_Employed"].map(
    {'No': 0, 'Yes': 1}).astype(int)
data["Credit_History"] = data["Credit_History"].astype(int)
# Multi-valued columns
data["Property_Area"] = data["Property_Area"].map(
    {'Urban': 0, 'Rural': 1, 'Semiurban': 2}).astype(int)
data["Dependents"] = data["Dependents"].map({'0': 0, '1': 1, '2': 2, '3+': 3})
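Exploratory Data Analysis (EDA)

Split the data back into new_train and new_test:

# .copy() avoids a SettingWithCopyWarning when Loan_Status is modified below
new_train = data.iloc[:614].copy()
new_test = data.iloc[614:].copy()

Encode the target attribute as integers:

new_train["Loan_Status"] = new_train["Loan_Status"].map(
    {'N': 0, 'Y': 1}).astype(int)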

Univariate analysis

fig, axes = plt.subplots(2, 4, figsize=(16, 10))
columns = ['Loan_Status', 'Gender', 'Married', 'Education',
           'Self_Employed', 'Property_Area', 'Credit_History', 'Dependents']
# Pass the column as x=...; recent seaborn versions do not accept it positionally
for col, ax in zip(columns, axes.ravel()):
    sns.countplot(x=col, data=new_train, ax=ax)

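Observations from the univariate analysis

1. More loans are approved than rejected.
2. There are more male applicants than female applicants.
3. There are more married applicants than unmarried applicants.
4. There are more graduates than non-graduates.
5. There are fewer self-employed applicants than applicants who are not self-employed.
6. Most applicants come from semi-urban areas.
7. Most applicants have a credit history.
8. Applicants with Dependents = 0 are the largest group.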
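
Observations like these can be backed with numbers by computing class shares instead of reading them off the bars. A minimal sketch on invented rows (the `toy` frame is a stand-in for `new_train`):

```python
import pandas as pd

# Invented rows; 1 = approved / female, 0 = rejected / male, as encoded above.
toy = pd.DataFrame({
    "Loan_Status": [1, 1, 1, 0],
    "Gender": [0, 0, 1, 0],
})

# Share of each category per column (fractions that sum to 1).
shares = {col: toy[col].value_counts(normalize=True) for col in toy.columns}
for col, dist in shares.items():
    print(col, dist.to_dict())
```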

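Bivariate analysis

sns.boxplot(x='Loan_Status', y='ApplicantIncome', data=new_train)

Applicant income looks almost the same for approved and rejected loans.

sns.boxplot(x='Loan_Status', y='CoapplicantIncome', data=new_train)

Co-applicant income is, on average, slightly higher for Loan_Status = 1 than for 0 (0: rejected, 1: approved).

sns.catplot(x='Gender', y='LoanAmount', data=new_train, kind='box')

Male applicants (0) request slightly higher loan amounts on average than female applicants (1).

sns.catplot(x='Gender', y='LoanAmount', data=data,
            kind='box', hue='Loan_Status', col='Married')

Married applicants request slightly higher loan amounts than unmarried applicants.

sns.catplot(x='Gender', y='CoapplicantIncome', data=data,
            kind='boxen', hue='Loan_Status', col='Property_Area')

In all three property areas, male applicants have higher co-applicant income than female applicants.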
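
Box plots show medians, so statements about averages are worth confirming with a grouped summary. The sketch below uses invented incomes (`toy` stands in for `new_train`) and reports both the mean and the median per loan status:

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Loan_Status": [0, 0, 1, 1, 1],
    "CoapplicantIncome": [0, 1000, 0, 1500, 3000],
})

# Mean and median co-applicant income for rejected (0) and approved (1) loans.
summary = toy.groupby("Loan_Status")["CoapplicantIncome"].agg(["mean", "median"])
print(summary)
```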

Heatmap

plt.figure(figsize=(10, 10))
correlation_matrix = new_train.corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
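
A full heatmap can be hard to scan for the one row that matters most, the correlation of each feature with the target. As a sketch on invented data (the `toy` frame is a stand-in for `new_train`), the target column of the correlation matrix can be ranked by absolute value:

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 1],
    "ApplicantIncome": [5000, 3000, 4000, 6000, 4500, 5200],
    "Loan_Status": [1, 1, 0, 1, 0, 0],
})

# Pearson correlation of every feature with the target.
corr_with_target = toy.corr()["Loan_Status"].drop("Loan_Status")
# Reorder so the strongest relationship (by absolute value) comes first.
order = corr_with_target.abs().sort_values(ascending=False).index
corr_with_target = corr_with_target.loc[order]
print(corr_with_target)
```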

Feature engineering

Total income

data["TotalIncome"] = data['ApplicantIncome'] + data['CoapplicantIncome']

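EMI

Assume an annual interest rate of 10%, which gives a monthly rate r = (10/12)/100 ≈ 0.00833. The EMI (equated monthly installment) for a loan amount P repaid over n months is P*r*(1+r)^n / ((1+r)^n - 1).

r = 0.00833
# (1 + r)^n, where n is the loan term in months
growth = (1 + r) ** data['Loan_Amount_Term']
data['EMI'] = data['LoanAmount'] * r * growth / (growth - 1)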
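
For reference, the standard amortization formula for an equated monthly installment is EMI = P * r * (1 + r)^n / ((1 + r)^n - 1), where P is the principal, r the monthly rate, and n the number of monthly payments. The sketch below wraps it in a small function; the name `emi` and its parameters are illustrative and not part of the original notebook:

```python
def emi(principal, annual_rate_pct, n_months):
    """Equated monthly installment for a fixed-rate loan.

    principal       -- amount borrowed
    annual_rate_pct -- yearly interest rate in percent, e.g. 10 for 10%
    n_months        -- number of monthly payments
    """
    r = annual_rate_pct / 12 / 100      # monthly interest rate
    growth = (1 + r) ** n_months        # (1 + r)^n
    return principal * r * growth / (growth - 1)


# A loan of 100 (thousand) at 10% a year over 360 months.
payment = emi(100, 10, 360)
print(round(payment, 4))        # about 0.8776 (thousand) per month
print(round(payment * 360, 1))  # total repaid over the full term
```

Note that the denominator subtracts 1 from the whole power, (1 + r)^n - 1; writing it as (1 + r)^(n - 1) collapses the expression to roughly P * r * (1 + r), which no longer depends on the loan term.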

Other features

data['Dependents_EMI_mean'] = data.groupby(
    ['Dependents'])['EMI'].transform('mean')

# LoanAmount_per_TotalIncome
data['LoanAmount_per_TotalIncome'] = data['LoanAmount']/data['TotalIncome']

# Loan_Amount_Term_per_TotalIncome
data['Loan_Amount_Term_per_TotalIncome'] = data['Loan_Amount_Term'] / \
    data['TotalIncome']

# EMI_per_Loan_Amount_Term
data['EMI_per_Loan_Amount_Term'] = data['EMI']/data['Loan_Amount_Term']

# EMI_per_LoanAmount
data['EMI_per_LoanAmount'] = data['EMI']/data['LoanAmount']

# Property_Area-wise mean of LoanAmount_per_TotalIncome
data['Property_Area_LoanAmount_per_TotalIncome_mean'] = data.groupby(
    ['Property_Area'])['LoanAmount_per_TotalIncome'].transform('mean')

# Credit_History wise sum of TotalIncome
data['Credit_History_Income_sum'] = data.groupby(
    ['Credit_History'])['TotalIncome'].transform('sum')

# Dependents wise sum of LoanAmount
data['Dependents_LoanAmount_sum'] = data.groupby(
    ['Dependents'])['LoanAmount'].transform('sum')

Bin information

from sklearn.preprocessing import KBinsDiscretizer
Loan_Amount_Term_discretizer = KBinsDiscretizer(
    n_bins=5, encode='ordinal', strategy='quantile')
data['Loan_Amount_Term_Bins'] = Loan_Amount_Term_discretizer.fit_transform(
    data['Loan_Amount_Term'].values.reshape(-1, 1)).astype(float)

TotalIncome_discretizer = KBinsDiscretizer(
    n_bins=5, encode='ordinal', strategy='quantile')
data['TotalIncome_Bins'] = TotalIncome_discretizer.fit_transform(
    data['TotalIncome'].values.reshape(-1, 1)).astype(float)

LoanAmount_per_TotalIncome_discretizer = KBinsDiscretizer(
    n_bins=5, encode='ordinal', strategy='quantile')
data['LoanAmount_per_TotalIncome_Bins'] = LoanAmount_per_TotalIncome_discretizer.fit_transform(
    data['LoanAmount_per_TotalIncome'].values.reshape(-1, 1)).astype(float)
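
To make the quantile binning concrete, here is a sketch on ten invented values. With `strategy='quantile'` the bin edges are chosen so that each bin receives roughly the same number of samples, and `encode='ordinal'` returns the bin number (0 to 4) rather than one-hot columns:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Ten made-up values, shaped as a single feature column.
incomes = np.arange(1, 11, dtype=float).reshape(-1, 1)

disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
bins = disc.fit_transform(incomes).ravel().astype(int)

print(bins)                # bin number assigned to each value
print(disc.bin_edges_[0])  # the learned bin edges
```

On a column with many repeated values, such as a loan term that is mostly 360 months, several quantile edges can coincide, so fewer than five distinct bins may result.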

Drop the redundant columns

data = data.drop(['EMI'], axis=1)
data = data.drop(['TotalIncome'], axis=1)
data = data.drop(['LoanAmount_per_TotalIncome'], axis=1)
data.shape

new_train1 = data.iloc[:614]
new_test1 = data.iloc[614:]
new_train1.shape

Build the machine learning model

# input
x = new_train1.drop('Loan_Status', axis=1)
# output
y = new_train1['Loan_Status']

Split the dataset

from sklearn.model_selection import train_test_split
# random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)
x_train.shape

x_test.shape
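
Because approved loans outnumber rejected ones, a purely random split can shift the class ratio between the two parts. One common option, shown here as a sketch on synthetic labels (this is not what the original notebook does), is to pass `stratify` so both parts keep the same ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 70 approved (1) and 30 rejected (0) applications.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Both parts keep the 70/30 class ratio.
print(y_tr.mean(), y_te.mean())
```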

Train with ML algorithms

Logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
log_clf = LogisticRegression()
# 3-fold cross-validated accuracy on the training split
cross_val_score(log_clf, x_train, y_train, scoring='accuracy', cv=3)

predo = log_clf.fit(x_train, y_train).predict(x_test)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, predo)
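
Logistic regression solvers tend to converge more reliably when features share a similar scale, and this dataset mixes incomes in the thousands with 0/1 flags. A possible variant, sketched here on synthetic data from `make_classification` (the pipeline and the `max_iter` value are assumptions, not part of the original notebook), puts a `StandardScaler` in front of the classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0)
# Stretch one feature so the columns are on very different scales.
X_demo[:, 0] *= 5000

# The scaler is fitted inside each cross-validation fold, on that fold's
# training part only, so no information leaks from the held-out part.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_clf, X_demo, y_demo, scoring="accuracy", cv=3)
print(scores.mean())
```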

 

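Fine-tune with GridSearchCV to improve accuracy

from sklearn.model_selection import GridSearchCV
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
max_iters = list(range(100, 800, 100))
# Two sub-grids, because newton-cg, lbfgs and sag support only the l2 penalty
LRparam_grid = [
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2'],
     'C': C_values, 'max_iter': max_iters},
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'],
     'C': C_values, 'max_iter': max_iters},
]
LR_search = GridSearchCV(LogisticRegression(), LRparam_grid, refit=True, verbose=3, cv=5)
LR_search.fit(x_train, y_train)
LR_search.best_params_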

# summarize
print("Mean Accuracy:%.3f" % LR_search.best_score_)
print('config:%s' % LR_search.best_params_)

grid_pred = LR_search.predict(x_test)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, grid_pred)
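
Accuracy alone does not show which kind of mistake the model makes, and for loan approval, approving a bad loan and rejecting a good one are different errors. A confusion matrix separates them. The sketch below uses invented label lists in place of `y_test` and the model's predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels standing in for y_test and the model's predictions.
y_true = ["Y", "Y", "Y", "N", "N", "Y"]
y_pred = ["Y", "Y", "N", "Y", "N", "Y"]

# Rows are true classes, columns are predicted classes, in the order N, Y.
cm = confusion_matrix(y_true, y_pred, labels=["N", "Y"])
print(cm)
print(accuracy_score(y_true, y_pred))
```

Here `cm[0, 1]` counts rejected applications that were predicted as approved, and `cm[1, 0]` counts approved applications that were predicted as rejected.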

 

Original article: Loan Approval Prediction Machine Learning - Analytics Vidhya
