信用卡评分模型构建
背景描述
目前拥有用户年龄,信用卡和个人信贷额度的总余额,过去2年借款人逾期,预测借款人是否会预期次数,月收入,负债比率,家属等信息,通过这些信息建立风控,信用评分模型,预测预测借款人是否会预期。
一.导入数据和库
导入相应库
import datetime
import pandas as pd
import numpy as np
import os
import seaborn as sns
import re
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')
sns.set(style="darkgrid")
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示
/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
time: 1.57 s
导入数据
train = pd.read_csv('/home/kesci/input/kaggle4396/cs-training.csv')
test = pd.read_csv('/home/kesci/input/kaggle4396/cs-test.csv')
time: 248 ms
train.drop(columns=["Unnamed: 0"], inplace=True)
test.drop(columns=["Unnamed: 0"], inplace=True)
time: 9.83 ms
数据维度
train.shape
(150000, 11)
time: 3.97 ms
有无缺失值
train.isnull().sum()
SeriousDlqin2yrs 0
RevolvingUtilizationOfUnsecuredLines 0
age 0
NumberOfTime30-59DaysPastDueNotWorse 0
DebtRatio 0
MonthlyIncome 29731
NumberOfOpenCreditLinesAndLoans 0
NumberOfTimes90DaysLate 0
NumberRealEstateLoansOrLines 0
NumberOfTime60-89DaysPastDueNotWorse 0
NumberOfDependents 3924
dtype: int64
time: 30.9 ms
有无重复值
train.duplicated().sum()
609
time: 61.2 ms
整体分布
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 11 columns):
SeriousDlqin2yrs 150000 non-null int64
RevolvingUtilizationOfUnsecuredLines 150000 non-null float64
age 150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse 150000 non-null int64
DebtRatio 150000 non-null float64
MonthlyIncome 120269 non-null float64
NumberOfOpenCreditLinesAndLoans 150000 non-null int64
NumberOfTimes90DaysLate 150000 non-null int64
NumberRealEstateLoansOrLines 150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse 150000 non-null int64
NumberOfDependents 146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.6 MB
time: 32.2 ms
看下数据
train.head()
SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
time: 12.1 ms
cor=train.corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cor, xticklabels=cor.columns, yticklabels=cor.columns, annot=True, ax=ax);
time: 1.2 s
二.数据前处理
train_clean = train.copy()
time: 6.31 ms
去重
train_clean.drop_duplicates(inplace=True)
time: 198 ms
缺失值处理
通过众数填充缺失值
def fill_na(df):
na_list = [i for i in df.isnull().sum().index if df.isnull().sum()[i] > 0]
for n in na_list:
train_fillna = train_clean[n][train_clean[n].isna() == False]
train_clean[n].fillna(train_fillna.median(), inplace=True)
time: 1.13 ms
fill_na(train_clean)
train_clean.isnull().sum()
SeriousDlqin2yrs 0
RevolvingUtilizationOfUnsecuredLines 0
age 0
NumberOfTime30-59DaysPastDueNotWorse 0
DebtRatio 0
MonthlyIncome 0
NumberOfOpenCreditLinesAndLoans 0
NumberOfTimes90DaysLate 0
NumberRealEstateLoansOrLines 0
NumberOfTime60-89DaysPastDueNotWorse 0
NumberOfDependents 0
dtype: int64
time: 360 ms
贷款人的年龄分布
plt.figure(figsize=(16, 6))
sns.distplot(train_clean["age"], color = "black");
time: 665 ms
train_clean["age_label"] = pd.cut(train_clean["age"], np.arange(20, 110, 10))
time: 9.82 ms
# 重新分组,合并样本太少或者违约率过于接近的分组
bins = [0, 30, 40, 50, 60, 70, 110