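A Study of Personal Credit Assessment Models

Initial Data Exploration and Visual Analysis

This part is mainly a visual analysis of the data: it uses common sense and domain expertise to find the rough relationships between key features and the quantity to be predicted. Along the way it practices core pandas usage, visualization with seaborn and matplotlib, and a general approach to data analysis.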

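Introduction

The data in this article comes from the Lending Club platform. The main goal is to assess each customer's credit status; the possible statuses are listed in the table below:
[Image: table of loan statuses]
The seven statuses are then manually regrouped into two classes, good and bad. The main tools are pandas, sklearn, keras, seaborn, and matplotlib: pandas for data cleaning and wrangling, sklearn for feature engineering, keras for classification, and seaborn and matplotlib for visualization. The required packages are: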

# Import our libraries we are going to use for our data analysis.
import keras 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Plotly visualizations
from plotly import tools
# Note: in plotly >= 4 this module moved to the separate chart-studio package
# (use `import chart_studio.plotly as py` there).
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
# plotly.tools.set_credentials_file(username='<username>', api_key='<api_key>')


# For oversampling Library (Dealing with Imbalanced Datasets)
from imblearn.over_sampling import SMOTE
from collections import Counter

# Other Libraries
import time

General Statistics

Read the data with pandas and inspect its basic information.

%matplotlib inline

df = pd.read_csv('../input/loan.csv', low_memory=False)

# Copy of the dataframe
original_df = df.copy()
# Look at a few sample rows and at the overall summary of the table
df.head()
df.info()
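
A common follow-up to head() and info() is to check how much of each column is missing. The snippet below is a minimal sketch on a made-up toy table (the values are not from loan.csv), assuming missing entries are stored as NaN/None:

```python
import pandas as pd

# Toy stand-in for loan.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000],
    "emp_title": ["teacher", None, "nurse", None],
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Current"],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

print(toy.shape)
print(missing_share)
```

Columns with a very high share of missing values are natural candidates for dropping or imputing in the cleaning step.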

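Next, rename columns to taste and drop the ones that are not useful for the analysis, such as identifier-like or free-text fields (the code below drops the job title, ZIP code, and loan title).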

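df = df.rename(columns={
    "loan_amnt": "loan_amount",
    "funded_amnt": "funded_amount",
    # Needed by the histogram code below, which reads df["investor_funds"]
    "funded_amnt_inv": "investor_funds"})
# inplace=True modifies df directly instead of returning a new DataFrame
df.drop(['emp_title', 'zip_code', 'title'], axis=1, inplace=True)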

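Data Distributions

Histograms
Look at the histograms of a few variables, using seaborn's distplot function (deprecated since seaborn 0.11 in favor of histplot and displot).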

fig, ax = plt.subplots(1, 3, figsize=(16,5))

loan_amount = df["loan_amount"].values
funded_amount = df["funded_amount"].values
investor_funds = df["investor_funds"].values

sns.distplot(loan_amount, ax=ax[0], color="#F7522F")
ax[0].set_title("Loan Applied by the Borrower", fontsize=14)
sns.distplot(funded_amount, ax=ax[1], color="#2F8FF7")
ax[1].set_title("Amount Funded by the Lender", fontsize=14)
sns.distplot(investor_funds, ax=ax[2], color="#2EAD46")
ax[2].set_title("Total committed by Investors", fontsize=14)

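[Figure: histograms of the loan amount applied for, the amount funded by the lender, and the total committed by investors]
Pie Chart
First, regroup the loan_status feature into two classes.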

bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period", 
           "Late (16-30 days)", "Late (31-120 days)"]


df['loan_condition'] = np.nan

def loan_condition(status):
   if status in bad_loan:
       return 'Bad Loan'
   else:
       return 'Good Loan'
   
   
df['loan_condition'] = df['loan_status'].apply(loan_condition)
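
The same two-class labelling can also be written without a Python-level function. This is a sketch of a vectorized alternative using Series.isin and np.where, shown on a made-up toy column; it is not the author's code, only an equivalent formulation:

```python
import numpy as np
import pandas as pd

bad_loan = ["Charged Off", "Default",
            "Does not meet the credit policy. Status:Charged Off",
            "In Grace Period", "Late (16-30 days)", "Late (31-120 days)"]

# Toy statuses, invented for illustration.
toy = pd.DataFrame({"loan_status": ["Fully Paid", "Charged Off",
                                    "Current", "Late (16-30 days)"]})

# isin builds a boolean mask; np.where picks a label for each row from it.
is_bad = toy["loan_status"].isin(bad_loan)
toy["loan_condition"] = np.where(is_bad, "Bad Loan", "Good Loan")

print(toy)
```

On large tables this usually runs faster than apply, because the membership test is done in one pass rather than once per row in Python.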

Draw the pie chart with the pandas plot interface.

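colors = ["#3791D7", "#D72626"]
labels = "Good Loans", "Bad Loans"
# Parameters of the pie plot:
# x       : the size of each wedge; if sum(x) > 1 the values are normalized by sum(x)
# labels  : the text shown outside each wedge
# explode : how far each wedge is offset from the center
# shadow  : draw a shadow beneath the pie (default: False, no shadow)
# autopct : format of the percentage shown inside each wedge; accepts a format string or a function
#           (in '%1.1f' the digits give the width and the number of decimals; no space padding)
df["loan_condition"].value_counts().plot.pie(explode=[0, 0.25],
                                             autopct='%1.2f%%',
                                             shadow=True,
                                             colors=colors,
                                             labels=labels,
                                             fontsize=12, startangle=70)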
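
To see what the autopct string produces, the percentages can be computed by hand. When autopct is a string, matplotlib formats each wedge's percentage with the % operator, so the sketch below mimics that on a made-up toy series (three good loans, one bad loan):

```python
import pandas as pd

# Toy labels, invented for illustration: 3 good loans and 1 bad loan.
toy = pd.Series(["Good Loan", "Good Loan", "Good Loan", "Bad Loan"])

# value_counts(normalize=True) returns shares; multiply by 100 for percentages.
shares = toy.value_counts(normalize=True) * 100

# Apply the same format string that is passed as autopct above.
wedge_text = ["%1.2f%%" % pct for pct in shares]

print(wedge_text)
```

The doubled %% in the format string is how a literal percent sign is written.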

[Figure: pie chart of good loans vs. bad loans]
Bar Chart
Convert the issue date into a datetime variable.

# Let's transform the issue dates by year.
df['issue_d'].head()
dt_series = pd.to_datetime(df['issue_d'])
df['year'] = dt_series.dt.year
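
Passing an explicit format to pd.to_datetime avoids per-value format guessing and fails loudly on unexpected strings. The sketch below assumes issue_d holds month-year strings such as "Dec-2011"; that format is an assumption here, so check it against df['issue_d'].head() first:

```python
import pandas as pd

# Toy dates in the assumed "Mon-YYYY" layout, invented for illustration.
toy = pd.DataFrame({"issue_d": ["Dec-2011", "Jan-2012", "Mar-2015"]})

# %b is the abbreviated month name, %Y the four-digit year.
issue_dates = pd.to_datetime(toy["issue_d"], format="%b-%Y")
toy["year"] = issue_dates.dt.year

print(toy)
```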

Plot the loan amount by year, using seaborn's barplot.

plt.figure(figsize=(12,8))
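# A convenient calling style: plot two columns straight from the DataFrame.
# An optional "hue" argument adds another dimension, splitting each year by that column.
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally.
sns.barplot(x='year', y='loan_amount', data=df, palette='tab10')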
plt.title('Issuance of Loans', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average loan amount issued', fontsize=14)
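
By default barplot draws the mean of y within each x category, so the bar heights can be reproduced with a groupby. The sketch below uses a made-up toy table and also shows the tabular counterpart of the hue argument (grouping by a second column); the column names follow the ones used above:

```python
import pandas as pd

# Toy loans, invented for illustration.
toy = pd.DataFrame({
    "year": [2012, 2012, 2013, 2013],
    "loan_amount": [10000, 20000, 12000, 24000],
    "loan_condition": ["Good Loan", "Bad Loan", "Good Loan", "Good Loan"],
})

# One bar per year: the mean loan amount.
mean_by_year = toy.groupby("year")["loan_amount"].mean()

# With hue="loan_condition": one bar per (year, condition) pair.
mean_by_year_and_condition = (
    toy.groupby(["year", "loan_condition"])["loan_amount"].mean().unstack()
)

print(mean_by_year)
print(mean_by_year_and_condition)
```

Combinations that never occur (here, bad loans in 2013) show up as NaN in the unstacked table and as a missing bar in the plot.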

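[Figure: bar chart of the average loan amount issued per year]

Good Loans vs. Bad Loans

Loan Types

Use pandas to count how often each value occurs in a column.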

df['loan_status'].value_counts()

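Loans Issued by Region

Group the states into regions. This is a good place to get a feel for how the pandas apply function works.

df['addr_state'].unique()  # list the distinct values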

# Make a list with each of the regions by state.

west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']

df['region'] = np.nan

def finding_regions(state):
    if state in west:
        return 'West'
    elif state in south_west:
        return 'SouthWest'
    elif state in south_east:
        return 'SouthEast'
    elif state in mid_west:
        return 'MidWest'
    elif state in north_east:
        return 'NorthEast'
    
df['region'] = df['addr_state'].apply(finding_regions)
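
The if/elif chain can also be expressed as a lookup table. The sketch below builds a state-to-region dictionary from the same lists and applies it with Series.map; the toy column and the "XX" code are made up to show what happens to an unknown state:

```python
import pandas as pd

regions = {
    "West": ['CA', 'OR', 'UT', 'WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID'],
    "SouthWest": ['AZ', 'TX', 'NM', 'OK'],
    "SouthEast": ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC',
                  'AR', 'DE', 'MS', 'TN'],
    "MidWest": ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE',
                'IN', 'ND'],
    "NorthEast": ['CT', 'NY', 'PA', 'NJ', 'RI', 'MA', 'MD', 'VT', 'NH', 'ME'],
}

# Invert the region -> states lists into a state -> region lookup.
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

# Toy states, invented for illustration; "XX" is deliberately unknown.
toy = pd.DataFrame({"addr_state": ["CA", "TX", "NY", "XX"]})
toy["region"] = toy["addr_state"].map(state_to_region)

print(toy)
```

States missing from the dictionary map to NaN, which makes gaps in the lists easy to spot with isnull().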

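A Closer Look at Bad Loans

Count the bad loans in each region, broken down by loan status.
First select the bad loans, then group them by region.
Key point 1: pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
pd.crosstab() builds a cross-tabulation: the first argument supplies the row index and the second the column index. The apply call that follows runs on each column in turn (axis=0 is the default), so x is one column as a Series rather than the whole DataFrame, and each column is rescaled to percentages that sum to 100.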

badloans_df = df.loc[df["loan_condition"] == "Bad Loan"]

# loan_status cross
loan_status_cross = pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
number_of_loanstatus = pd.crosstab(badloans_df['region'], badloans_df['loan_status'])


# Round our values
loan_status_cross['Charged Off'] = loan_status_cross['Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['Default'] = loan_status_cross['Default'].apply(lambda x: round(x, 2))
# loan_status_cross['Does not meet the credit policy. Status:Charged Off'] = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['In Grace Period'] = loan_status_cross['In Grace Period'].apply(lambda x: round(x, 2))
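
The column-wise percentages and the per-column rounding can be written more compactly. The sketch below, on a made-up toy table, compares the apply-based version with pd.crosstab's normalize="columns" option and rounds the whole table at once with DataFrame.round; it is an alternative formulation, not the author's code:

```python
import pandas as pd

# Toy bad loans, invented for illustration.
toy = pd.DataFrame({
    "region": ["West", "West", "SouthEast", "SouthEast", "West", "MidWest"],
    "loan_status": ["Charged Off", "Default", "Charged Off",
                    "Charged Off", "Charged Off", "Default"],
})

counts = pd.crosstab(toy["region"], toy["loan_status"])

# Version used above: each column divided by its own total.
by_apply = counts.apply(lambda col: col / col.sum() * 100)

# Same result via the normalize option, rounded in one call.
by_normalize = (
    pd.crosstab(toy["region"], toy["loan_status"], normalize="columns") * 100
).round(2)

print(by_normalize)
```

Each column of the result adds up to 100, so a cell reads as "the share of loans with this status that come from this region".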