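1. Introduction
Lending Club is the world's largest online financial platform matching borrowers with investors. Using an internet-based model, it has built a mechanism for allocating capital freely between borrowers and investors that is more efficient than the traditional banking system. The source data for this analysis is Lending Club's public data for all of 2017 and the first two quarters of 2018, and the goal is to build a pre-loan (application) scorecard. Data source: https://www.lendingclub.com/info/download-data.action .
2. Data Cleaning
2.1 Importing the analysis modules and source data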
import numpy as np
import pandas as pd
from scipy.stats import mode
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import chisqbin
import warnings
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,auc
from imblearn.over_sampling import SMOTE
pd.set_option('display.max_columns',200)
warnings.filterwarnings('ignore')
%matplotlib inline
# One CSV file per quarter: 2017Q1 through 2018Q2
names=['2017Q1','2017Q2','2017Q3','2017Q4','2018Q1','2018Q2']
data_list=[]
for name in names:
    data=pd.read_table('C:/Users/H/Desktop/lending_club/LoanStats_'+name+'.csv',sep=',',low_memory=False)
    data_list.append(data)
# Stack the quarterly tables into one DataFrame with a fresh index
loan=pd.concat(data_list,ignore_index=True)
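The loading loop can also be wrapped in a small helper so the directory is not hard-coded. This is only a minimal sketch, not the author's code: `load_quarters` and `data_dir` are hypothetical names, and it assumes every file follows the `LoanStats_<quarter>.csv` naming used in the path above.

```python
from pathlib import Path

import pandas as pd


def load_quarters(data_dir, quarters):
    """Read one CSV per quarter and stack them into a single DataFrame.

    Hypothetical helper: assumes files are named LoanStats_<quarter>.csv.
    """
    frames = []
    for quarter in quarters:
        path = Path(data_dir) / f"LoanStats_{quarter}.csv"
        # low_memory=False parses each file as a whole, which avoids
        # mixed-type inference across internal chunks.
        frames.append(pd.read_csv(path, low_memory=False))
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

With the files in place, a call such as `load_quarters('C:/Users/H/Desktop/lending_club', names)` should produce the same kind of combined table as the loop.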
Next, let's look at the target variable loan_status.
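# The labels are assigned by position, so they must be listed in the same order
# in which value_counts() returns the statuses.
loan_status=pd.DataFrame({
    'meaning':['repaying','approved','fully paid','grace period','late (31-120 days)','late (16-30 days)','charged off','default'],'count':loan['loan_status'].value_counts().values},
    index=loan['loan_status'].value_counts().index)
print(loan_status)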
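Because the summary table pairs descriptions with statuses purely by position, a lookup keyed on the status string is a less fragile alternative. The sketch below runs on a toy series rather than the real data; the `meaning` dictionary and its English descriptions are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for loan['loan_status']; the real column comes from the CSV files.
status = pd.Series(
    ["Current", "Current", "Fully Paid", "Charged Off", "Current", "Fully Paid"],
    name="loan_status",
)

# Hypothetical lookup table: each status is tied to its own description,
# so the labels stay correct whatever order value_counts() returns.
meaning = {
    "Current": "repaying",
    "Fully Paid": "fully paid",
    "In Grace Period": "grace period",
    "Late (16-30 days)": "late (16-30 days)",
    "Late (31-120 days)": "late (31-120 days)",
    "Charged Off": "charged off",
    "Default": "default",
}

counts = status.value_counts()
summary = pd.DataFrame(
    {
        # Statuses missing from the lookup are flagged instead of mislabeled.
        "meaning": [meaning.get(s, "unknown") for s in counts.index],
        "count": counts.values,
    },
    index=counts.index,
)
print(summary)
```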
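Because the main goal is pre-loan scoring, we only consider loans that were fully paid and loans that were not repaid on time. Loans that are still being repaid are set aside for now.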
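# 0 = fully paid (good); 1 = in grace period, late, charged off, or default (bad)
loan['loan_status']=loan['loan_status'].replace(['Fully Paid','In Grace Period','Late (31-120 days)','Late (16-30 days)','Charged Off','Default'],
                                                ['0','1','1','1','1','1'])
# Keep only the rows that were mapped; all other statuses are dropped
loan=loan[loan['loan_status'].isin(['0','1'])].copy()
loan['loan_status']=loan['loan_status'].astype('int')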
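The same good/bad encoding can be expressed with a dictionary and `Series.map`, which makes it explicit that unmapped statuses fall out of the sample. The following is a sketch on a toy frame, assuming the same status strings as above; `toy`, `target_map`, and the `target` column are illustrative names.

```python
import pandas as pd

# Toy data standing in for the real loan table.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Late (16-30 days)", "Fully Paid"],
    "loan_amnt": [1000, 2000, 3000, 4000, 5000],
})

bad_statuses = ["In Grace Period", "Late (16-30 days)", "Late (31-120 days)",
                "Charged Off", "Default"]
target_map = {"Fully Paid": 0, **{s: 1 for s in bad_statuses}}

# map() returns NaN for statuses missing from target_map (here "Current"),
# so dropping NaN removes loans whose outcome is not yet known.
toy["target"] = toy["loan_status"].map(target_map)
toy = toy.dropna(subset=["target"]).copy()
toy["target"] = toy["target"].astype(int)

# Share of bad loans in the remaining sample.
bad_rate = toy["target"].mean()
print(toy)
print("bad rate:", bad_rate)
```

Checking the bad rate right after encoding is a quick way to see how imbalanced the target is before any modeling.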
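2.2 Handling missing values
Although the dataset has more than 100 variables, many of them contain a large share of missing values (over 50% in some cases), and some others have little to do with our target variable. All of these are dropped together. The code below removes every column in which more than 30% of the values are missing.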
null_cols=loan.isna().sum().sort_values(ascending=False)/float(loan.shape[0])
null_cols[null_cols > .3]
loan=loan.dropna(thresh=loan.shape[0]*.7,axis=1)
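The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a column must have to be kept, not a missing-value ratio. The toy example below is only an illustration with made-up column names, showing that a 0.7 factor drops columns with more than 30% missing.

```python
import numpy as np
import pandas as pd

# Made-up columns with known missing shares (10 rows each).
toy = pd.DataFrame({
    "full": range(10),                               # 0% missing
    "some_missing": [1.0, 2.0, np.nan] + [3.0] * 7,  # 10% missing
    "mostly_missing": [1.0] * 4 + [np.nan] * 6,      # 60% missing
})

# Share of missing values per column, largest first.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# thresh is the minimum number of NON-missing values a column needs to survive.
min_non_null = int(len(toy) * 0.7)
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Casting the threshold to `int` is a defensive choice here, since `thresh` is documented as an integer count.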
names=['sub_grade','emp_title','pymnt_plan','title','zip_code','total_rec_late_fee','recoveries','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt',