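Credit Card Scoring Model: Data

Building a Credit Card Scoring Model

Background

The dataset describes each borrower's age, total balance on credit cards and personal lines of credit, number of past-due events over the past two years, monthly income, debt ratio, number of dependents, and related fields. The goal is to use this information to build a risk-control credit scoring model that predicts whether a borrower will become delinquent.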

1. Importing libraries and data

Import the required libraries
import datetime
import pandas as pd
import numpy as np
import os
import seaborn as sns
import re
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output
sns.set(style="darkgrid")
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly with this font

/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
time: 1.57 s
Load the data
train = pd.read_csv('/home/kesci/input/kaggle4396/cs-training.csv')
test = pd.read_csv('/home/kesci/input/kaggle4396/cs-test.csv')
time: 248 ms
train.drop(columns=["Unnamed: 0"], inplace=True)
test.drop(columns=["Unnamed: 0"], inplace=True)
time: 9.83 ms
Data dimensions
train.shape
(150000, 11)



time: 3.97 ms
Check for missing values
train.isnull().sum()
SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64



time: 30.9 ms
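Raw counts are easier to judge as a share of the rows. With 150,000 rows, the counts above work out to roughly 19.8% missing for MonthlyIncome and 2.6% for NumberOfDependents. A minimal sketch of that calculation follows; `missing_summary` is a hypothetical helper, and the toy frame uses made-up values rather than the competition data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values (assumption: not the real data)
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0, np.nan],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
    "age": [45, 40, 38, 30],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same helper would be called as `missing_summary(train)`.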
Check for duplicate rows
train.duplicated().sum()
609



time: 61.2 ms
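It can help to see exactly what `duplicated()` counts before dropping anything. The sketch below uses a small made-up frame (an assumption, not the competition data) to show that only the repeats are counted, and how `keep=False` exposes every member of a duplicate group for inspection.

```python
import pandas as pd

# Toy frame with made-up values; rows 0 and 2 are identical
toy = pd.DataFrame({
    "age": [45, 40, 45, 30],
    "DebtRatio": [0.80, 0.12, 0.80, 0.04],
})

# duplicated() flags repeats of an earlier row; the first occurrence is not flagged
n_dup = int(toy.duplicated().sum())

# keep=False flags all members of each duplicate group, useful for inspection
dup_rows = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each group by default
deduped = toy.drop_duplicates()

print(n_dup, len(dup_rows), len(deduped))
```

Whether identical rows are true duplicates or distinct borrowers who happen to share the same values cannot be decided from the data alone, since the file has no borrower identifier after the index column is dropped.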
Overall summary
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 11 columns):
SeriousDlqin2yrs                        150000 non-null int64
RevolvingUtilizationOfUnsecuredLines    150000 non-null float64
age                                     150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null int64
DebtRatio                               150000 non-null float64
MonthlyIncome                           120269 non-null float64
NumberOfOpenCreditLinesAndLoans         150000 non-null int64
NumberOfTimes90DaysLate                 150000 non-null int64
NumberRealEstateLoansOrLines            150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null int64
NumberOfDependents                      146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.6 MB
time: 32.2 ms
Preview the data
train.head()
   SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age
0                 1                              0.766127   45
1                 0                              0.957151   40
2                 0                              0.658180   38
3                 0                              0.233810   30
4                 0                              0.907239   49

   NumberOfTime30-59DaysPastDueNotWorse  DebtRatio  MonthlyIncome
0                                     2   0.802982         9120.0
1                                     0   0.121876         2600.0
2                                     1   0.085113         3042.0
3                                     0   0.036050         3300.0
4                                     1   0.024926        63588.0

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  NumberRealEstateLoansOrLines
0                               13                        0                             6
1                                4                        0                             0
2                                2                        1                             0
3                                5                        0                             0
4                                7                        0                             1

   NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents
0                                     0                 2.0
1                                     0                 1.0
2                                     0                 0.0
3                                     0                 0.0
4                                     0                 0.0
time: 12.1 ms
# Pairwise Pearson correlations between all columns, drawn as an annotated heatmap
cor = train.corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cor, xticklabels=cor.columns, yticklabels=cor.columns, annot=True, ax=ax);
time: 1.2 s
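The heatmap shows every pair at once, but for a scoring model the column of correlations against the target is usually the part to read first. The sketch below ranks features by the absolute value of that correlation. `target_correlations` and the `late_count` column are hypothetical names, and the toy values are made up. Pearson correlation against a binary target is only a rough screen, so treat the ranking as a hint rather than a feature-selection rule.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every other column with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute strength while keeping the sign of each value
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame with made-up values (assumption: not the real data)
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 0, 1, 1, 0, 1],
    "late_count": [0, 1, 3, 4, 0, 2],
    "age": [60, 55, 30, 28, 45, 50],
})
ranked = target_correlations(toy, "SeriousDlqin2yrs")
print(ranked)
```

On the real data the call would be `target_correlations(train, "SeriousDlqin2yrs")`.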

2. Data preprocessing

train_clean = train.copy()
time: 6.31 ms
Remove duplicates
train_clean.drop_duplicates(inplace=True)
time: 198 ms
Handling missing values
Fill missing values with the median
def fill_na(df):
    # Columns that contain at least one missing value
    na_counts = df.isnull().sum()
    na_list = na_counts[na_counts > 0].index
    for n in na_list:
        # Series.median() skips NaN by default; operate on df, not the global frame
        df[n] = df[n].fillna(df[n].median())
time: 1.13 ms
fill_na(train_clean)
train_clean.isnull().sum()
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64



time: 360 ms
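The filling code above uses each column's median. One likely reason to prefer it over the mean is that income is heavily skewed, so a few large values pull the mean far from a typical borrower. The sketch below illustrates this with the five MonthlyIncome values from the preview earlier, plus one NaN appended for the demonstration (that NaN is an assumption, not a row from the data).

```python
import numpy as np
import pandas as pd

# MonthlyIncome values of the first five rows, plus one artificial gap
income = pd.Series([9120.0, 2600.0, 3042.0, 3300.0, 63588.0, np.nan])

# Both statistics skip NaN by default
median_value = income.median()
mean_value = income.mean()

# Fill the gap with the median, as the preprocessing step does
filled = income.fillna(median_value)

print(median_value, mean_value)
```

Here the single 63588.0 entry lifts the mean well above four of the five observed incomes, while the median stays at a value that actually occurs in the sample.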
Age distribution of borrowers
plt.figure(figsize=(16, 6))
sns.distplot(train_clean["age"], color = "black");
time: 665 ms
train_clean["age_label"] = pd.cut(train_clean["age"], np.arange(20, 110, 10))
time: 9.82 ms
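To decide which age bins to merge, it helps to look at the sample count and default rate of each bin. A minimal sketch follows, assuming the default rate is simply the mean of SeriousDlqin2yrs within a bin. The toy ages and labels are made up, and `stats` and `bad_rate` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (assumption: not the real data)
toy = pd.DataFrame({
    "age": [25, 28, 35, 38, 45, 47, 62, 66],
    "SeriousDlqin2yrs": [1, 0, 1, 0, 0, 0, 0, 0],
})

# Same binning as above: right-closed intervals (20, 30], (30, 40], ..., (90, 100]
toy["age_label"] = pd.cut(toy["age"], np.arange(20, 110, 10))

# Sample count and share of defaulters per bin; observed=True skips empty bins
stats = (
    toy.groupby("age_label", observed=True)["SeriousDlqin2yrs"]
       .agg(count="size", bad_rate="mean")
)
print(stats)
```

Because `pd.cut` intervals are closed on the right by default, an age of exactly 20 or less, or anything above 100, falls outside these bins and is labeled NaN. Bins with very few rows or nearly equal `bad_rate` values are the natural candidates for merging.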
# Regroup: merge bins that have too few samples or very similar default rates
bins = [0, 30, 40, 50, 60, 70, 110