#Loan Prediction
项目地址:https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
import pandas as pd
import numpy as np
import matplotlib as plt
%matplotlib inline
df = pd.read_csv("/Users/mac/Desktop/kaggle/P2/train.csv") #导入数据
##数据探索
###变量识别
df.head(10)
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
5 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
6 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
7 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
8 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
9 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |
df.describe()
ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | |
---|---|---|---|---|---|
count | 614.000000 | 614.000000 | 592.000000 | 600.00000 | 564.000000 |
mean | 5403.459283 | 1621.245798 | 146.412162 | 342.00000 | 0.842199 |
std | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
min | 150.000000 | 0.000000 | 9.000000 | 12.00000 | 0.000000 |
25% | 2877.500000 | 0.000000 | 100.000000 | 360.00000 | 1.000000 |
50% | 3812.500000 | 1188.500000 | 128.000000 | 360.00000 | 1.000000 |
75% | 5795.000000 | 2297.250000 | 168.000000 | 360.00000 | 1.000000 |
max | 81000.000000 | 41667.000000 | 700.000000 | 480.00000 | 1.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID 614 non-null object
Gender 601 non-null object
Married 611 non-null object
Dependents 599 non-null object
Education 614 non-null object
Self_Employed 582 non-null object
ApplicantIncome 614 non-null int64
CoapplicantIncome 614 non-null float64
LoanAmount 592 non-null float64
Loan_Amount_Term 600 non-null float64
Credit_History 564 non-null float64
Property_Area 614 non-null object
Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB
注意到有数据缺失的情况,在数据整理阶段进行处理。
###单一变量分析
df['Loan_Status'].value_counts(normalize=True).plot.bar(title= 'Loan_Status')
<matplotlib.axes._subplots.AxesSubplot at 0x111b343c8>
分析预测项Loan_Status,约69%的贷款得到批准。
df['Gender'].value_counts(normalize=True)
Male 0.813644
Female 0.186356
Name: Gender, dtype: float64
分析性别特征,发现申请贷款的人多数为男性。
df['Married'].value_counts(normalize=True)
Yes 0.651391
No 0.348609
Name: Married, dtype: float64
分析结婚状态,有65%的贷款人是已婚。
df['Dependents'].value_counts().plot.bar(title= 'Dependents')
<matplotlib.axes._subplots.AxesSubplot at 0x111b4fe10>
分析亲属数量,大多数贷款人没有亲属。
df['Loan_Amount_Term'].value_counts()
360.0 512
180.0 44
480.0 15
300.0 13
84.0 4
240.0 4
120.0 3
36.0 2
60.0 2
12.0 1
Name: Loan_Amount_Term, dtype: int64
分析贷款时间,绝大多数为360天。
df['Credit_History'].value_counts(normalize=True)
1.0 0.842199
0.0 0.157801
Name: Credit_History, dtype: float64
分析信用历史,84%的贷款人已偿还债务,信用良好。
df['Education'].value_counts(normalize=True)