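Problem background
Dream Housing Finance deals in all kinds of home loans. The areas it serves are classified as urban, semi-urban, and rural.
Process: a customer first applies for a home loan, and the company then validates the customer's eligibility. The company wants to automate this eligibility check in real time, based on the details the customer provides in the online application form (gender, marital status, education, number of dependents, income, loan amount, credit history, and so on).
To automate the process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.
Dataset link: Loan Prediction
Data attribute descriptions
The dataset contains the following attributes: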
Variable | Description |
---|---|
Loan_ID | Unique Loan ID |
Gender | Male/ Female |
Married | Applicant married (Y/N) |
Dependents | Number of dependents |
Education | Applicant Education (Graduate/ Under Graduate) |
Self_Employed | Self employed (Y/N) |
ApplicantIncome | Applicant income |
CoapplicantIncome | Coapplicant income |
LoanAmount | Loan amount in thousands |
Loan_Amount_Term | Term of loan in months |
Credit_History | credit history meets guidelines |
Property_Area | Urban/ Semi Urban/ Rural |
Loan_Status | (Target) Loan approved (Y/N) |
Load the Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
Load the dataset
train=pd.read_csv('train_ctrUa4K.csv')
test=pd.read_csv('test_lAUu6dG.csv')
ss=pd.read_csv('sample_submission_49d68Cx.csv')
Data exploration
## Training set size
train.shape
(614, 13)
## Test set size
test.shape
(367, 12)
train.head()
Categorical columns:
Gender (Male/Female), Married (Yes/No), Dependents (possible values: 0, 1, 2, 3+), Education (Graduate/Not Graduate), Self_Employed (No/Yes), Credit_History (1/0, i.e. Yes/No), Property_Area (Rural/Semiurban/Urban), and Loan_Status (Y/N), which is the target variable
Numerical columns:
Applicant Income, Co-applicant Income, Loan Amount, and Loan Amount Term. Loan ID is a unique identifier rather than a numerical feature.
Data preprocessing
Concatenate the training and test data so that both are preprocessed together:
# ignore_index=True gives the combined frame a unique 0..980 index
data = pd.concat([train, test], ignore_index=True)
Drop the column that is not needed:
data.drop("Loan_ID", axis=1, inplace=True)
Inspect the missing values:
data.isnull().sum()
Impute the missing values:
# Fill each categorical column with its most frequent value (the mode)
for col in ["Gender", "Married", "Dependents", "Self_Employed", "Credit_History"]:
    data[col] = data[col].fillna(data[col].mode()[0])
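As a minimal sketch of what mode imputation does, the following example fills the gaps in a small, made-up frame. The frame `toy` and its values are hypothetical and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})

missing_before = toy.isnull().sum()

# mode() ignores NaN by default, so [0] is the most frequent observed value.
for col in ["Gender", "Credit_History"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Mode imputation is simple, but it pushes every missing entry towards the majority class, which may matter for a strongly predictive column such as Credit_History.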
Use an iterative imputer to fill the missing values in LoanAmount and Loan_Amount_Term:
from sklearn.ensemble import RandomForestRegressor
cols = ['LoanAmount', 'Loan_Amount_Term']
# Run the imputer with a random forest estimator
imp = IterativeImputer(RandomForestRegressor(), max_iter=10, random_state=0)
# fit_transform returns a NumPy array, so the values are written back by
# position; this avoids misaligned rows if the index of data is not unique
data[cols] = imp.fit_transform(data[cols])
data
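One thing to watch when writing imputed columns back: if the combined frame keeps the original row labels of the training and test sets (the default behaviour of `pd.concat`), assigning a column by label can silently misalign rows. The sketch below illustrates this on two hypothetical two-row frames; `train_part` and `test_part` are made-up names.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test frames.
train_part = pd.DataFrame({"LoanAmount": [100.0, 200.0]})
test_part = pd.DataFrame({"LoanAmount": [300.0, 400.0]})

# Default concat keeps each frame's own labels, so the index is 0, 1, 0, 1.
stacked = pd.concat([train_part, test_part])

# A frame rebuilt from a NumPy array gets a fresh 0..3 index.
rebuilt = pd.DataFrame(stacked.to_numpy(), columns=stacked.columns)

# Label-based assignment: both rows labelled 0 receive rebuilt's row 0, and so on.
by_label = stacked.copy()
by_label["LoanAmount"] = rebuilt["LoanAmount"]

# Position-based assignment keeps every row's own value.
by_position = stacked.copy()
by_position["LoanAmount"] = rebuilt["LoanAmount"].to_numpy()

print(by_label["LoanAmount"].tolist())
print(by_position["LoanAmount"].tolist())
```

Either passing `ignore_index=True` to `pd.concat` or assigning the raw array avoids the problem.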
Convert the categorical variables to integers:
data["Gender"] = data["Gender"].map({'Male': 0, 'Female': 1}).astype(int)
data["Married"] = data["Married"].map({'No': 0, 'Yes': 1}).astype(int)
data["Education"] = data["Education"].map(
    {'Not Graduate': 0, 'Graduate': 1}).astype(int)
data["Self_Employed"] = data["Self_Employed"].map(
    {'No': 0, 'Yes': 1}).astype(int)
data["Credit_History"] = data["Credit_History"].astype(int)
data["Property_Area"] = data["Property_Area"].map(
    {'Urban': 0, 'Rural': 1, 'Semiurban': 2}).astype(int)
data["Dependents"] = data["Dependents"].map({'0': 0, '1': 1, '2': 2, '3+': 3})
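`Series.map` with a dictionary turns any label that is missing from the dictionary into NaN, and the following `astype(int)` then fails. A small check before casting can make such mismatches visible. The sketch below uses a hypothetical series in which the last label, "Semi Urban", is a deliberately misspelled variant of "Semiurban".

```python
import pandas as pd

# Hypothetical labels; "Semi Urban" is an intentional mismatch for illustration.
area = pd.Series(["Urban", "Rural", "Semiurban", "Semi Urban"])
mapping = {"Urban": 0, "Rural": 1, "Semiurban": 2}

codes = area.map(mapping)

# Labels that were not covered by the mapping end up as NaN in codes.
unmapped = area[codes.isna()].unique().tolist()
print(unmapped)
```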
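Exploratory data analysis (EDA)
Split the data back into new_train and new_test:
# .copy() makes these independent frames, so the assignment below does not
# trigger pandas' SettingWithCopyWarning
new_train = data.iloc[:614].copy()
new_test = data.iloc[614:].copy()
Encode the target column as integers:
new_train["Loan_Status"] = new_train["Loan_Status"].map(
    {'N': 0, 'Y': 1}).astype(int)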
Univariate analysis
fig, ax = plt.subplots(2, 4, figsize=(16, 10))
# Pass the column as x=...; recent seaborn versions reject it as a positional argument
sns.countplot(x='Loan_Status', data=new_train, ax=ax[0][0])
sns.countplot(x='Gender', data=new_train, ax=ax[0][1])
sns.countplot(x='Married', data=new_train, ax=ax[0][2])
sns.countplot(x='Education', data=new_train, ax=ax[0][3])
sns.countplot(x='Self_Employed', data=new_train, ax=ax[1][0])
sns.countplot(x='Property_Area', data=new_train, ax=ax[1][1])
sns.countplot(x='Credit_History', data=new_train, ax=ax[1][2])
sns.countplot(x='Dependents', data=new_train, ax=ax[1][3])
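Observations from the univariate analysis
1. More loans are approved than rejected
2. There are more male applicants than female applicants
3. More applicants are married than unmarried
4. Graduates outnumber non-graduates
5. Self-employed applicants are fewer than non-self-employed applicants
6. The largest share of applicants comes from semi-urban areas
7. Most applicants have a credit history
8. Applicants with Dependents=0 form the largest group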
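Observations like these are read off the count plots by eye. They can also be backed by numbers using `value_counts(normalize=True)`. The following is a sketch on a hypothetical four-row frame, not on the real training data.

```python
import pandas as pd

# Hypothetical encoded columns; the values are invented for illustration only.
toy = pd.DataFrame({
    "Loan_Status": [1, 1, 1, 0],
    "Gender": [0, 0, 1, 0],
})

# Share of each category per column, rounded to two decimals.
shares = {
    col: toy[col].value_counts(normalize=True).round(2).to_dict()
    for col in toy.columns
}
print(shares)
```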
Bivariate analysis
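sns.boxplot(x='Loan_Status', y='ApplicantIncome', data=new_train)
Applicant income is distributed almost identically for approved and rejected loans, so on its own it has little influence on the outcome
sns.boxplot(x='Loan_Status', y='CoapplicantIncome', data=new_train)
Co-applicant income is slightly higher on average for approved loans (1) than for rejected loans (0)
sns.catplot(x='Gender', y='LoanAmount', data=new_train, kind='box')
On average, male applicants (0) request slightly larger loans than female applicants (1)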
sns.catplot(x='Gender', y='LoanAmount', data=data,
kind='box', hue='Loan_Status', col='Married')
Married applicants request slightly larger loans than unmarried applicants
sns.catplot(x='Gender', y='CoapplicantIncome', data=data,
kind='boxen', hue='Loan_Status', col='Property_Area')
In all three property areas, co-applicant income is higher for male applicants than for female applicants
Heatmap
plt.figure(figsize=(10, 10))
correlation_matrix = new_train.corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
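A heatmap can be hard to read when there are many columns. As a complement, the sketch below ranks features by the absolute value of their correlation with the target. It runs on a hypothetical toy frame in which Credit_History is deliberately identical to Loan_Status, so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "ApplicantIncome": [50, 40, 45, 60, 55, 42],
    "Credit_History": [1, 1, 0, 0, 1, 0],
    "Loan_Status": [1, 1, 0, 0, 1, 0],
})

# Take the target's column of the correlation matrix, drop the target itself,
# and sort by absolute correlation, strongest first.
corr_with_target = (
    toy.corr()["Loan_Status"]
    .drop("Loan_Status")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```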
Feature engineering
Total income
data["TotalIncome"] = data['ApplicantIncome'] + data['CoapplicantIncome']
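EMI (equated monthly installment)
Assume an annual interest rate of 10%, so the monthly rate is r = (10 / 12) / 100 ≈ 0.00833
r = 0.00833
# EMI = P * r * (1 + r)^n / ((1 + r)^n - 1); the "- 1" belongs outside the exponent
data['EMI'] = data.apply(lambda x: (
    x['LoanAmount']*r*((1+r)**x['Loan_Amount_Term']))/(((1+r)**x['Loan_Amount_Term'])-1), axis=1)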
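For reference, the standard EMI formula is EMI = P * r * (1 + r)^n / ((1 + r)^n - 1), where P is the principal, r the monthly interest rate, and n the number of monthly payments. The sketch below is a worked example of that formula. The helper name `emi` and the sample inputs are hypothetical, and the 10% annual rate is the same assumption as in the text.

```python
def emi(principal, annual_rate_pct, n_months):
    """Equated monthly installment for a fixed-rate loan (assumes rate > 0)."""
    r = annual_rate_pct / 12 / 100          # monthly rate as a fraction
    growth = (1 + r) ** n_months            # (1 + r)^n
    return principal * r * growth / (growth - 1)


# Hypothetical loan: 100 (in thousands) over 360 months at 10% per year.
monthly = emi(principal=100, annual_rate_pct=10.0, n_months=360)
total_paid = monthly * 360

print(round(monthly, 4))     # roughly 0.88 (thousand) per month
print(round(total_paid, 1))  # total repaid over the full term
```

A zero interest rate would divide by zero in this form, so that case needs separate handling (principal / n_months).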
Other features
data['Dependents_EMI_mean'] = data.groupby(
['Dependents'])['EMI'].transform('mean')
# LoanAmount_per_TotalIncome
data['LoanAmount_per_TotalIncome'] = data['LoanAmount']/data['TotalIncome']
# Loan_Amount_Term_per_TotalIncome
data['Loan_Amount_Term_per_TotalIncome'] = data['Loan_Amount_Term'] / \
data['TotalIncome']
# EMI_per_Loan_Amount_Term
data['EMI_per_Loan_Amount_Term'] = data['EMI']/data['Loan_Amount_Term']
# EMI_per_LoanAmount
data['EMI_per_LoanAmount'] = data['EMI']/data['LoanAmount']
# Property_Area wise mean of LoanAmount_per_TotalIncome
data['Property_Area_LoanAmount_per_TotalIncome_mean'] = data.groupby(
    ['Property_Area'])['LoanAmount_per_TotalIncome'].transform('mean')
# Credit_History wise sum of TotalIncome
data['Credit_History_Income_sum'] = data.groupby(
['Credit_History'])['TotalIncome'].transform('sum')
# Dependents wise sum of LoanAmount
data['Dependents_LoanAmount_sum'] = data.groupby(
['Dependents'])['LoanAmount'].transform('sum')
Bin information
from sklearn.preprocessing import KBinsDiscretizer
Loan_Amount_Term_discretizer = KBinsDiscretizer(
n_bins=5, encode='ordinal', strategy='quantile')
data['Loan_Amount_Term_Bins'] = Loan_Amount_Term_discretizer.fit_transform(
data['Loan_Amount_Term'].values.reshape(-1, 1)).astype(float)
TotalIncome_discretizer = KBinsDiscretizer(
n_bins=5, encode='ordinal', strategy='quantile')
data['TotalIncome_Bins'] = TotalIncome_discretizer.fit_transform(
data['TotalIncome'].values.reshape(-1, 1)).astype(float)
LoanAmount_per_TotalIncome_discretizer = KBinsDiscretizer(
n_bins=5, encode='ordinal', strategy='quantile')
data['LoanAmount_per_TotalIncome_Bins'] = LoanAmount_per_TotalIncome_discretizer.fit_transform(
data['LoanAmount_per_TotalIncome'].values.reshape(-1, 1)).astype(float)
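To make the idea behind quantile binning concrete, the sketch below computes the bin edges as quantiles and assigns each value an ordinal bin code with NumPy. It approximates what a quantile strategy with ordinal encoding does and is not scikit-learn's implementation. The array `values` is made up.

```python
import numpy as np

# Hypothetical values to be binned.
values = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
n_bins = 5

# n_bins bins need n_bins + 1 edges, placed at evenly spaced quantiles.
edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))

# Only the inner edges are needed to assign codes 0 .. n_bins - 1.
bin_codes = np.digitize(values, edges[1:-1], right=False)
counts = np.bincount(bin_codes, minlength=n_bins)

print(edges)
print(bin_codes)
print(counts)
```

With quantile edges, each bin receives about the same number of rows. When a column has many repeated values (as Loan_Amount_Term does, being mostly a handful of standard terms), several quantile edges can coincide and fewer distinct bins result.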
Drop redundant columns
data = data.drop(['EMI'], axis=1)
data = data.drop(['TotalIncome'], axis=1)
data = data.drop(['LoanAmount_per_TotalIncome'], axis=1)
data.shape
new_train1 = data.iloc[:614]
new_test1 = data.iloc[614:]
new_train1.shape
Build the machine learning model
# input
x = new_train1.drop('Loan_Status', axis=1)
# output
y = new_train1['Loan_Status']
Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
x_train.shape
x_test.shape
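Without a fixed seed, every run produces a different split, so accuracy figures are not reproducible. Because approvals outnumber rejections, it can also help to keep the class ratio equal in both parts. The sketch below shows both options (`random_state` and `stratify`) on synthetic arrays. The data and the 70/30 class ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70 positives and 30 negatives (hypothetical ratio).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 70 + [0] * 30)

# random_state fixes the shuffle; stratify=y preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Repeating the call with the same seed yields the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```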
Train with ML algorithms
Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import cross_val_score
log_clf = LogisticRegression()
cross_val_score(log_clf, x_train, y_train,
scoring=make_scorer(accuracy_score), cv=3)
predo = log_clf.fit(x_train, y_train).predict(x_test)
# scikit-learn's convention is accuracy_score(y_true, y_pred)
accuracy_score(y_test, predo)
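An accuracy figure is easier to judge against a baseline. Since approved loans are the majority class, a model that always predicts "approved" already scores well. The sketch below computes that majority-class baseline in plain Python. The helper name and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)


# Hypothetical labels, invented for illustration only.
y_train_toy = ["Y"] * 7 + ["N"] * 3
y_test_toy = ["Y", "Y", "N", "Y", "N"]

label, baseline = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(label, baseline)
```

A classifier is only useful to the extent that it beats this number on the same test split.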
Fine-tune it with GridSearchCV to improve accuracy
from sklearn.model_selection import GridSearchCV
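# 'l1' is only supported by the liblinear and saga solvers, so the grid is
# split in two to avoid fitting invalid penalty/solver combinations
LRparam_grid = [
    {
        'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        'penalty': ['l2'],
        'max_iter': list(range(100, 800, 100)),
        'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    },
    {
        'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        'penalty': ['l1'],
        'max_iter': list(range(100, 800, 100)),
        'solver': ['liblinear', 'saga'],
    },
]
LR_search = GridSearchCV(LogisticRegression(), LRparam_grid, refit=True, verbose=3, cv=5)
LR_search.fit(x_train, y_train)
LR_search.best_params_
# Summarize the search results
print("Mean accuracy: %.3f" % LR_search.best_score_)
print('Config: %s' % LR_search.best_params_)
# Evaluate the refitted best model on the held-out split
lr_pred = LR_search.predict(x_test)
accuracy_score(y_test, lr_pred)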
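Grid size multiplies quickly, and not every penalty/solver pair is valid for scikit-learn's LogisticRegression. The sketch below counts how many combinations a single flat grid over the C, penalty, max_iter, and solver values used here would contain, and how many of them are valid. The `supported` table reflects my understanding of the scikit-learn documentation (l1 works with liblinear and saga only) and should be checked against the installed version.

```python
from itertools import product

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
penalties = ["l1", "l2"]
max_iters = list(range(100, 800, 100))
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]

# Assumption: solvers that accept each penalty, per the scikit-learn docs.
supported = {"l1": {"liblinear", "saga"}, "l2": set(solvers)}

# Every combination of the four value lists, as (C, penalty, max_iter, solver).
all_combos = list(product(C_values, penalties, max_iters, solvers))
valid_combos = [c for c in all_combos if c[3] in supported[c[1]]]

cv_folds = 5
print(len(all_combos), len(all_combos) * cv_folds)      # candidates, fits
print(len(valid_combos), len(valid_combos) * cv_folds)  # after filtering
```

Passing GridSearchCV a list of dictionaries, one per compatible penalty/solver group, is one way to search only the valid combinations.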
Original article: Loan Approval Prediction Machine Learning - Analytics Vidhya