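A Study of Personal Credit Assessment Models
Initial Data Exploration and Visual Analysis
This part is mainly a visual analysis of the data: common sense and expert knowledge are used to find the rough relationships between key features and the quantity to be predicted. Along the way it covers the main uses of pandas, the visualization methods of seaborn and matplotlib, and the general approach to data analysis.
Introduction
The data in this post comes from the Lending Club platform. The main goal is to assess each customer's credit status; the possible statuses are listed in the table below:
The 7 statuses are manually regrouped into two states, good and bad. The main tools are pandas, sklearn, keras, seaborn, and matplotlib: pandas for data cleaning and wrangling, sklearn for feature engineering, keras for classification, and seaborn and matplotlib for visualization. The required packages are listed below.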
# Import our libraries we are going to use for our data analysis.
import keras
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Plotly visualizations
from plotly import tools
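# Note: plotly.plotly moved to the separate chart_studio package in plotly 4;
# on those versions use `import chart_studio.plotly as py` instead.
import plotly.plotly as py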
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
# plotly.tools.set_credentials_file(username='AlexanderBach', api_key='o4fx6i1MtEIJQxfWYvU1')
# For oversampling Library (Dealing with Imbalanced Datasets)
from imblearn.over_sampling import SMOTE
from collections import Counter
# Other Libraries
import time
General Information and Statistics
Here pandas is used to read the data and inspect its basic information.
%matplotlib inline
df = pd.read_csv('../input/loan.csv', low_memory=False)
# Copy of the dataframe
original_df = df.copy()
# Look at the first few rows and at the overall summary of the table
df.head()
df.info()
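Next, rename columns as preferred and drop information that is not useful (such as the member ID; the call below drops three free-text and location columns).
df = df.rename(columns={
    "loan_amnt": "loan_amount", "funded_amnt": "funded_amount",
    # Assumption: investor_funds, which is used in the histograms, is the renamed funded_amnt_inv column
    "funded_amnt_inv": "investor_funds"})
df.drop(['emp_title', 'zip_code', 'title'], axis=1, inplace=True)  # inplace=True modifies df itself instead of returning a copy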
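The difference between reassigning the result of `rename` and calling `drop` with `inplace=True` can be seen in a minimal sketch on a toy DataFrame. The values below are invented for illustration only; the column names follow the loan data.

```python
import pandas as pd

# Toy data; values are invented, column names follow the loan data.
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000],
    "funded_amnt": [5000, 11500],
    "zip_code": ["860xx", "309xx"],
})

# rename returns a new DataFrame, so the result has to be assigned.
renamed = toy.rename(columns={"loan_amnt": "loan_amount",
                              "funded_amnt": "funded_amount"})

# With inplace=True, drop changes `renamed` itself and returns None.
result = renamed.drop(["zip_code"], axis=1, inplace=True)

print(list(toy.columns))      # the original frame is unchanged
print(list(renamed.columns))  # renamed, with zip_code removed
print(result)
```

Assigning the result of an `inplace=True` call back to the variable would replace the frame with `None`, which is a common mistake with this pattern.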
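Data Distribution
Plotting Histograms
Look at the histograms of a few variables. The code here uses seaborn's distplot function (deprecated as of seaborn 0.11; histplot and displot are its replacements).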
fig, ax = plt.subplots(1, 3, figsize=(16,5))
loan_amount = df["loan_amount"].values
funded_amount = df["funded_amount"].values
investor_funds = df["investor_funds"].values
sns.distplot(loan_amount, ax=ax[0], color="#F7522F")
ax[0].set_title("Loan Applied by the Borrower", fontsize=14)
sns.distplot(funded_amount, ax=ax[1], color="#2F8FF7")
ax[1].set_title("Amount Funded by the Lender", fontsize=14)
sns.distplot(investor_funds, ax=ax[2], color="#2EAD46")
ax[2].set_title("Total committed by Investors", fontsize=14)
Plotting a Pie Chart
First, regroup the loan_status feature into two classes.
bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period",
"Late (16-30 days)", "Late (31-120 days)"]
df['loan_condition'] = np.nan
def loan_condition(status):
    if status in bad_loan:
        return 'Bad Loan'
    else:
        return 'Good Loan'
df['loan_condition'] = df['loan_status'].apply(loan_condition)
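The same two-class regrouping can also be written without a row-wise function. The following is a minimal sketch on toy data, not the post's own method: it swaps `apply` for `Series.isin` combined with `np.where`, which is typically faster on large frames.

```python
import numpy as np
import pandas as pd

# The bad-loan list mirrors the one above; the status rows are toy data.
bad_loan = ["Charged Off", "Default",
            "Does not meet the credit policy. Status:Charged Off",
            "In Grace Period", "Late (16-30 days)", "Late (31-120 days)"]
toy = pd.DataFrame({"loan_status": ["Fully Paid", "Charged Off",
                                    "Current", "Late (16-30 days)"]})

# isin builds a boolean mask; np.where picks a label per row from it.
toy["loan_condition"] = np.where(toy["loan_status"].isin(bad_loan),
                                 "Bad Loan", "Good Loan")
print(toy["loan_condition"].tolist())
```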
Draw the pie chart with the pandas plot interface.
colors = ["#3791D7", "#D72626"]
labels ="Good Loans", "Bad Loans"
df["loan_condition"].value_counts().plot.pie(explode=[0,0.25],
autopct='%1.2f%%',
shadow=True,
colors=colors,
labels=labels,
fontsize=12, startangle=70)
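# x: the size of each wedge; if sum(x) > 1 the sizes are normalized by sum(x)
# labels: the text shown outside each wedge
# explode: how far each wedge is offset from the center
# shadow: draw a shadow beneath the pie; the default is False (no shadow)
# autopct: controls the percentage text inside each wedge; accepts a format string or a format function
# in '%1.1f' the two digits are the minimum field width and the number of decimal places (no space padding here)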
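To make the `autopct` format string concrete, here is a minimal sketch in plain Python. The wedge sizes are hypothetical, and the sketch assumes the default behavior in which wedge sizes are normalized by their sum before the format string is applied to each percentage.

```python
# Hypothetical wedge sizes (counts), chosen only for illustration.
sizes = [920, 80]

# Normalize the sizes to percentages of the total.
total = sum(sizes)
percents = [100 * s / total for s in sizes]

# '%1.2f%%' formats each percentage with two decimals;
# '%%' yields a literal percent sign.
labels = ['%1.2f%%' % p for p in percents]
print(labels)
```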
Plotting a Bar Chart
Convert the issue date to a datetime variable.
# Let's transform the issue dates by year.
df['issue_d'].head()
dt_series = pd.to_datetime(df['issue_d'])
df['year'] = dt_series.dt.year
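When the date strings share one layout, passing an explicit `format` to `pd.to_datetime` avoids ambiguous parsing. The sketch below assumes the dates look like `Dec-2011` (month abbreviation and year, with English month names); that layout is an assumption, so check it against `df['issue_d'].head()` before relying on it.

```python
import pandas as pd

# Assumed layout: month abbreviation and year, e.g. 'Dec-2011'.
issue_d = pd.Series(["Dec-2011", "Jan-2015", "Mar-2015"])

# An explicit format makes the parsing unambiguous.
dt_series = pd.to_datetime(issue_d, format="%b-%Y")
years = dt_series.dt.year
print(years.tolist())
```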
Plot the loan amount by year, here with seaborn's barplot.
plt.figure(figsize=(12,8))
# A convenient calling style: plot two columns of a DataFrame directly. An optional "hue" argument adds another dimension, splitting each year by that variable.
sns.barplot(x='year', y='loan_amount', data=df, palette='tab10')
plt.title('Issuance of Loans', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average loan amount issued', fontsize=14)
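By default `sns.barplot` draws the mean of the y variable for each x category, which is why the y-axis is labeled as an average. The numbers behind the bars can be checked with a `groupby`, sketched here on invented toy data.

```python
import pandas as pd

# Toy data; the amounts are invented for illustration.
toy = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015, 2015],
    "loan_amount": [10000, 14000, 9000, 12000, 18000],
})

# Mean loan amount per year: the quantity barplot draws as bar heights.
mean_by_year = toy.groupby("year")["loan_amount"].mean()
print(mean_by_year)
```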
Good Loans vs. Bad Loans
Loan Types
Use pandas to count how often each value occurs in a column.
df['loan_status'].value_counts()
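`value_counts` can return shares as well as absolute counts. A minimal sketch on a toy Series (the status values are illustrative only):

```python
import pandas as pd

# Toy status values, for illustration only.
status = pd.Series(["Current", "Fully Paid", "Current",
                    "Charged Off", "Current"])

# Absolute counts, sorted from most to least frequent.
counts = status.value_counts()

# normalize=True returns relative frequencies that sum to 1.
shares = status.value_counts(normalize=True)

print(counts)
print(shares)
```

The shares are useful with imbalanced classes, where the raw counts alone make the proportion of bad loans hard to judge.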
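Loans Issued by Region
Group the states into regions. This is a good place to get a feel for how the pandas apply function is used.
df['addr_state'].unique()  # list the distinct values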
# Make a list with each of the regions by state.
west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']
df['region'] = np.nan
def finding_regions(state):
    if state in west:
        return 'West'
    elif state in south_west:
        return 'SouthWest'
    elif state in south_east:
        return 'SouthEast'
    elif state in mid_west:
        return 'MidWest'
    elif state in north_east:
        return 'NorthEast'
df['region'] = df['addr_state'].apply(finding_regions)
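An alternative to the if/elif chain is a lookup table used with `Series.map`. This is a sketch of a different technique, not the post's own code; the region lists are shortened for illustration, and `state_to_region` is a hypothetical helper name.

```python
import pandas as pd

# Shortened region lists for illustration; the full lists are given above.
regions = {
    "West": ["CA", "OR", "WA"],
    "SouthWest": ["AZ", "TX"],
    "NorthEast": ["NY", "NJ"],
}

# Invert to a state -> region lookup table (hypothetical helper name).
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

toy = pd.DataFrame({"addr_state": ["CA", "TX", "NY", "ZZ"]})

# map looks each state up; states missing from the table become NaN.
toy["region"] = toy["addr_state"].map(state_to_region)
print(toy)
```

With `apply` and the function above, a state found in none of the lists yields `None`; with `map` it yields `NaN`. Either way, such rows are worth checking before grouping by region.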
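A Closer Look at Bad Loans
Count the bad loans in each region, broken down by loan status.
First select the bad loans, then group them by region.
Key point 1: pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
pd.crosstab() builds a cross-tabulation: the first argument supplies the row index and the second supplies the column index. The apply call that follows uses the default axis=0, so x is one column of the table at a time (a Series), not the whole DataFrame. Each loan_status column is therefore converted to percentages that sum to 100 across the regions.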
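The column-wise behavior of `apply` on a crosstab can be checked with a minimal sketch on toy data (the rows are invented). It also shows `crosstab`'s own `normalize="columns"` option, which should give the same percentages.

```python
import pandas as pd

# Toy bad-loan rows, invented for illustration.
toy = pd.DataFrame({
    "region": ["West", "West", "SouthEast", "SouthEast", "West"],
    "loan_status": ["Charged Off", "Default", "Charged Off",
                    "Charged Off", "Charged Off"],
})

# Raw counts: rows are regions, columns are loan statuses.
counts = pd.crosstab(toy["region"], toy["loan_status"])

# apply with the default axis=0 passes one column (a Series) at a time,
# so each status column is rescaled to sum to 100.
pct = counts.apply(lambda col: col / col.sum() * 100)

# crosstab can do the same normalization itself.
pct_builtin = pd.crosstab(toy["region"], toy["loan_status"],
                          normalize="columns") * 100

print(pct)
```

Reading the result: each cell is the share of a given loan status that falls in that region, not the share of a region's loans that have that status. For the latter, normalize over rows instead.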
badloans_df = df.loc[df["loan_condition"] == "Bad Loan"]
# loan_status cross
loan_status_cross = pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
number_of_loanstatus = pd.crosstab(badloans_df['region'], badloans_df['loan_status'])
# Round our values
loan_status_cross['Charged Off'] = loan_status_cross['Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['Default'] = loan_status_cross['Default'].apply(lambda x: round(x, 2))
# loan_status_cross['Does not meet the credit policy. Status:Charged Off'] = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['In Grace Period'] = loan_status_cross['In Grace Period'].apply(lambda x: round(x,