A Study of Personal Credit Assessment Models

格拉迪沃

Category: Data Competitions. Tags: personal credit, data mining, machine learning, hands-on practice

Article link: https://blog.csdn.net/qq_32796253/article/details/89249670

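Using data from the Lending Club platform, this study aims to build a personal credit assessment model. Through initial exploration and visualization, loan statuses are classified, with attention to the ratio of good to bad loans, their regional distribution, and the factors that determine bad loans. pandas, sklearn, keras and other tools are used for data cleaning, feature engineering and model training, ultimately yielding an effective assessment of personal credit risk.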

Initial Data Exploration and Visual Analysis

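This part focuses on visual analysis of the data, using common sense and domain expertise to look for rough relationships between key features and the quantity being predicted. Along the way it covers core pandas usage, visualization with seaborn and matplotlib, and a general approach to data analysis.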

Introduction

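The data come from the Lending Club platform, and the main goal is to assess each customer's credit status. The possible statuses are listed in the table below:
[Table: loan status categories (image not preserved)]
The seven statuses are manually regrouped into two classes, good and bad. The main tools are pandas, sklearn, keras, seaborn and matplotlib: pandas for data cleaning and wrangling, sklearn for feature engineering, keras for classification, and seaborn and matplotlib for visualization. The required packages are imported below.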

# Import our libraries we are going to use for our data analysis.
import keras 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Plotly visualizations
# Note: in plotly >= 4 the online module plotly.plotly moved to the separate
# chart_studio package (import chart_studio.plotly as py).
from plotly import tools
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)


# For oversampling Library (Dealing with Imbalanced Datasets)
from imblearn.over_sampling import SMOTE
from collections import Counter

# Other Libraries
import time

General Summary Statistics

Read the data with pandas and inspect its basic information.

%matplotlib inline

df = pd.read_csv('../input/loan.csv', low_memory=False)

# Copy of the dataframe
original_df = df.copy()
# Look at the first few rows and at the overall column summary
df.head()
df.info()

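Next, rename columns to taste and drop fields that carry no useful information, such as the member ID.

df = df.rename(columns={
    "loan_amnt": "loan_amount", "funded_amnt": "funded_amount",
    "funded_amnt_inv": "investor_funds"})  # investor_funds is used by the histograms below
df.drop(['member_id', 'emp_title', 'zip_code', 'title'], axis=1, inplace=True)  # inplace=True modifies df itself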
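
As a minimal sketch of the same rename-and-drop step, the snippet below runs on a tiny hand-made frame instead of the real loan.csv. The toy values are invented for illustration, and passing errors="ignore" is an optional safeguard assumed here (not part of the original code) so that a column missing from a given data release does not raise an error.

```python
import pandas as pd

# Toy stand-in for the loan table; the values are illustrative only.
toy = pd.DataFrame({
    "loan_amnt": [1000],
    "funded_amnt": [900],
    "zip_code": ["941xx"],
})

# Rename returns a new frame with friendlier column names.
toy = toy.rename(columns={"loan_amnt": "loan_amount",
                          "funded_amnt": "funded_amount"})

# "title" does not exist in this toy frame; errors="ignore" skips it
# instead of raising a KeyError.
toy = toy.drop(columns=["zip_code", "title"], errors="ignore")

remaining = list(toy.columns)
print(remaining)
```

Reassigning the result, as done here, is an alternative to inplace=True and keeps each step easy to rerun in a notebook.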

Data Distributions

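Histograms
Look at the histograms of a few variables. Here seaborn's distplot function is used (deprecated since seaborn 0.11 in favor of histplot and displot).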

fig, ax = plt.subplots(1, 3, figsize=(16,5))

loan_amount = df["loan_amount"].values
funded_amount = df["funded_amount"].values
investor_funds = df["investor_funds"].values

sns.distplot(loan_amount, ax=ax[0], color="#F7522F")
ax[0].set_title("Loan Applied by the Borrower", fontsize=14)
sns.distplot(funded_amount, ax=ax[1], color="#2F8FF7")
ax[1].set_title("Amount Funded by the Lender", fontsize=14)
sns.distplot(investor_funds, ax=ax[2], color="#2EAD46")
ax[2].set_title("Total committed by Investors", fontsize=14)

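[Figure: distributions of the loan amount applied for, the amount funded by the lender, and the total committed by investors]
Pie Chart
First regroup the loan_status feature into two classes.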

bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period", 
           "Late (16-30 days)", "Late (31-120 days)"]


df['loan_condition'] = np.nan

def loan_condition(status):
   if status in bad_loan:
       return 'Bad Loan'
   else:
       return 'Good Loan'
   
   
df['loan_condition'] = df['loan_status'].apply(loan_condition)
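
The same labeling can also be done without a row-by-row apply. The sketch below is an alternative formulation, not the original author's code: it assumes a small invented frame in place of the real data, and uses Series.isin together with np.where to assign both labels in one vectorized step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real loan table (values are illustrative only).
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current",
                    "Late (31-120 days)", "Default"]
})

# Same bad-loan list as in the article.
bad_loan = ["Charged Off", "Default",
            "Does not meet the credit policy. Status:Charged Off",
            "In Grace Period", "Late (16-30 days)", "Late (31-120 days)"]

# isin() builds a boolean mask; np.where picks a label for each row.
toy["loan_condition"] = np.where(toy["loan_status"].isin(bad_loan),
                                 "Bad Loan", "Good Loan")

# The apply-based version should produce identical labels.
via_apply = toy["loan_status"].apply(
    lambda s: "Bad Loan" if s in bad_loan else "Good Loan")
labels_match = bool((toy["loan_condition"] == via_apply).all())
print(toy)
```

On a table with hundreds of thousands of rows the vectorized form is usually noticeably faster, although both give the same result.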

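Draw the pie chart with the pandas plot interface.

colors = ["#3791D7", "#D72626"]
labels = "Good Loans", "Bad Loans"
df["loan_condition"].value_counts().plot.pie(explode=[0, 0.25],
                                             autopct='%1.2f%%',
                                             shadow=True,
                                             colors=colors,
                                             labels=labels,
                                             fontsize=12, startangle=70)
# x       : the size of each wedge; normalized by sum(x) when sum(x) > 1
# labels  : the text shown outside each wedge
# explode : how far each wedge is offset from the center
# shadow  : draw a shadow beneath the pie (default: False, no shadow)
# autopct : format of the percentage inside each wedge; a format string or a function
#           in '%1.2f%%' the 1 is the minimum field width and the 2 the number of decimals (no space padding)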
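
To make the autopct format concrete, the sketch below computes the same percentages that the pie chart prints, using an invented series of nine good loans and one bad loan. The data and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Invented example: 9 good loans and 1 bad loan.
toy = pd.Series(["Good Loan"] * 9 + ["Bad Loan"] * 1, name="loan_condition")

# value_counts() sorts by frequency, so "Good Loan" comes first here,
# which is why the labels tuple lists "Good Loans" before "Bad Loans".
counts = toy.value_counts()

# normalize=True gives the wedge shares; multiply by 100 for percentages.
shares = toy.value_counts(normalize=True) * 100

# Apply the same format string that autopct receives.
pie_labels = ['%1.2f%%' % v for v in shares]
print(pie_labels)
```

Because the label order depends on value_counts(), it is worth checking the index before hard-coding labels and colors.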

[Figure: pie chart of good versus bad loans]
Bar Chart
Convert the issue date into a datetime variable.

# Let's transform the issue dates into years.
df['issue_d'].head()
dt_series = pd.to_datetime(df['issue_d'])
df['year'] = dt_series.dt.year
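
The sketch below shows the date conversion on a few sample strings. It assumes that issue_d is stored as month-year text such as "Dec-2011"; that format is an assumption about the file, so the explicit format argument should be adjusted if the real column looks different.

```python
import pandas as pd

# Sample values in the assumed "Mon-YYYY" layout.
issue_d = pd.Series(["Dec-2011", "Jan-2013", "Jul-2015"])

# An explicit format avoids per-value format guessing.
parsed = pd.to_datetime(issue_d, format="%b-%Y")

# The .dt accessor exposes datetime fields such as the year.
years = parsed.dt.year.tolist()
print(years)
```

The month abbreviations are parsed with the current locale, so this sketch presumes an English locale.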

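Plot the loan amount by year, here with seaborn's barplot.

plt.figure(figsize=(12,8))
# Convenient calling style: plot two columns straight from the DataFrame. An optional `hue` argument adds another dimension that splits each year's bar.
sns.barplot(x='year', y='loan_amount', data=df, palette='tab10')  # keyword x/y are required by recent seaborn versions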
plt.title('Issuance of Loans', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average loan amount issued', fontsize=14)
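
seaborn's barplot aggregates with the mean by default, which is why the y-axis is labeled as an average. The numbers behind the bars can be checked with a plain groupby, sketched here on invented data.

```python
import pandas as pd

# Invented loans: two issued in 2012 and three in 2013.
toy = pd.DataFrame({
    "year": [2012, 2012, 2013, 2013, 2013],
    "loan_amount": [10000, 20000, 5000, 10000, 15000],
})

# Mean loan amount per year, i.e. the height of each bar.
avg_by_year = toy.groupby("year")["loan_amount"].mean()
print(avg_by_year)
```

Printing this table alongside the chart is a quick way to confirm that the plot shows averages and not totals.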

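[Figure: bar chart of the average loan amount issued per year]

Good Loans versus Bad Loans

Loan Types

Use pandas to count the values in a column.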

df['loan_status'].value_counts()

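Loans Issued by Region

Group the states into regions. This is a good place to get a feel for how the pandas apply function works.

df['addr_state'].unique()  # list the distinct values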

# Make a list with each of the regions by state.

west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']

df['region'] = np.nan

def finding_regions(state):
    if state in west:
        return 'West'
    elif state in south_west:
        return 'SouthWest'
    elif state in south_east:
        return 'SouthEast'
    elif state in mid_west:
        return 'MidWest'
    elif state in north_east:
        return 'NorthEast'
    
df['region'] = df['addr_state'].apply(finding_regions)
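
An alternative to the if/elif chain is a dictionary lookup with Series.map. The sketch below is one possible formulation under that assumption, not the original code; it reuses the article's state lists and applies them to a few sample states.

```python
import pandas as pd

# Region lists copied from the article.
regions = {
    "West": ['CA', 'OR', 'UT', 'WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID'],
    "SouthWest": ['AZ', 'TX', 'NM', 'OK'],
    "SouthEast": ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV',
                  'DC', 'AR', 'DE', 'MS', 'TN'],
    "MidWest": ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA',
                'NE', 'IN', 'ND'],
    "NorthEast": ['CT', 'NY', 'PA', 'NJ', 'RI', 'MA', 'MD', 'VT', 'NH', 'ME'],
}

# Invert to a flat state -> region lookup table.
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

# map() looks up every value; unknown states would become NaN.
toy_states = pd.Series(["CA", "TX", "NY", "FL", "OH"])
toy_regions = toy_states.map(state_to_region)
print(toy_regions.tolist())
```

With this layout the region definitions live in one data structure, and states missing from every list show up as NaN instead of silently returning None.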

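A Closer Look at Bad Loans

This step counts the bad loans in each region, broken down by loan status.
First select the bad loans, then group them by region.
Key point 1: pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
pd.crosstab() builds a cross-tabulation: its first argument supplies the row index and its second the column index. The apply call that follows uses the default axis=0, so x is one column of the table at a time (a Series holding a single loan status), not the whole DataFrame. Each count is therefore converted into that region's percentage share of the column total.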
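
To see what the apply call in key point 1 does to each column, the sketch below runs it on six invented bad loans and compares the result with crosstab's built-in normalize="columns" option. The toy data are an assumption for illustration.

```python
import pandas as pd

# Six invented bad loans across two regions.
toy = pd.DataFrame({
    "region": ["West", "West", "NorthEast", "NorthEast", "West", "NorthEast"],
    "loan_status": ["Default", "Charged Off", "Charged Off",
                    "Charged Off", "Default", "Default"],
})

# Raw counts: rows are regions, columns are loan statuses.
counts = pd.crosstab(toy["region"], toy["loan_status"])

# apply() with the default axis=0 hands the lambda one column at a time.
by_apply = counts.apply(lambda col: col / col.sum() * 100)

# Built-in equivalent: normalize each column to sum to 1, then scale.
by_normalize = pd.crosstab(toy["region"], toy["loan_status"],
                           normalize="columns") * 100

print(by_apply.round(2))
```

Every column of the result sums to 100, which confirms that the percentages are shares within a loan status and not within a region.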

badloans_df = df.loc[df["loan_condition"] == "Bad Loan"]

# loan_status cross
loan_status_cross = pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
number_of_loanstatus = pd.crosstab(badloans_df['region'], badloans_df['loan_status'])


# Round our values
loan_status_cross['Charged Off'] = loan_status_cross['Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['Default'] = loan_status_cross['Default'].apply(lambda x: round(x, 2))
# loan_status_cross['Does not meet the credit policy. Status:Charged Off'] = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['In Grace Period'] = loan_status_cross['In Grace Period'].apply(lambda x: round(x,