Credit Scorecard: Give Me Some Credit

Madeinray

Category: Risk Control Model Cases
Tags: credit scorecard, Kaggle case study, Give Me Some Credit, data mining, risk control models

Article link: https://blog.csdn.net/qq_24520431/article/details/102614761

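This article discusses the modeling objective and business background of a credit scorecard, focusing on Kaggle's Give Me Some Credit dataset. Through data understanding, preprocessing, and exploration, it analyzes how key fields such as age, debt ratio, and repayment capacity affect the credit score, with the aim of predicting the probability that a customer will experience financial distress within the next two years.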

Credit Scorecard

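1. Business Understanding

1.1 Modeling Objective

Using customer data, predict the probability that a customer will fall into financial distress within the next two years, in order to improve the quality of the bank's credit scoring.

1.2 Business Background

a. Credit scoring takes a bank customer's historical credit records and applies a credit scoring model to produce a credit score, graded into levels. From that score the lender can assess how likely the customer is to repay on time, and decide accordingly whether to extend credit and at what limit and interest rate. A lender could reach a similar assessment by manually reviewing the customer's credit history, but credit scoring is faster, more objective, and more consistent.

b. The scorecard built here is an A card (application scorecard): customers are scored before a loan is granted, to judge whether the applicant is a good credit risk.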
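
To make the idea of turning a model output into a score concrete, here is a minimal sketch of one common scaling convention, in which the score is linear in the log of the good-to-bad odds. The function name and the parameters `base_score`, `base_odds`, and `pdo` (points to double the odds) are illustrative assumptions; the article does not state which scaling it uses.

```python
import math


def probability_to_score(p_bad, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a predicted probability of default to a scorecard-style score.

    base_score, base_odds and pdo are assumed example values,
    not parameters taken from the article.
    """
    if not 0.0 < p_bad < 1.0:
        raise ValueError("p_bad must be strictly between 0 and 1")
    # Points added each time the good:bad odds double
    factor = pdo / math.log(2)
    # Chosen so that odds of base_odds:1 map exactly to base_score
    offset = base_score - factor * math.log(base_odds)
    odds_good = (1.0 - p_bad) / p_bad
    return offset + factor * math.log(odds_good)


# A lower default probability yields a higher score
print(round(probability_to_score(0.02), 1))
print(round(probability_to_score(0.20), 1))
```

Under these assumed parameters, a customer whose good-to-bad odds are 50:1 scores 600, and every doubling of the odds adds 20 points.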

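2. Data Understanding

2.1 Data Source

The modeling data comes from the Give Me Some Credit dataset on Kaggle. It contains basic information on 150,000 individual customers, of which 120,000 records form the training set and 30,000 the test set. Besides the CustomerID column, the data has 11 fields: 10 numeric features and one target field, SeriousDlqin2yrs. The table below describes each field.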
[Table image unavailable: field descriptions for the dataset]

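2.2 Field Understanding

Based on the business background, the customer information can be grouped into the following categories:

Basic information: age

Debt information: RevolvingUtilizationOfUnsecuredLines, DebtRatio, NumberOfOpenCreditLinesAndLoans, NumberRealEstateLoansOrLines

Repayment capacity: MonthlyIncome

Credit history: NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTimes90DaysLate

Social and family information: NumberOfDependents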
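
The grouping above can be written down as a small lookup structure, which is handy later for exploring one category at a time. This is only a sketch: the names `FIELD_GROUPS`, `TARGET`, and `columns_missing_from` are hypothetical helpers, not part of the original article.

```python
# Column names grouped by business meaning, following the categories above.
FIELD_GROUPS = {
    "basic": ["age"],
    "debt": [
        "RevolvingUtilizationOfUnsecuredLines",
        "DebtRatio",
        "NumberOfOpenCreditLinesAndLoans",
        "NumberRealEstateLoansOrLines",
    ],
    "repayment_capacity": ["MonthlyIncome"],
    "credit_history": [
        "NumberOfTime30-59DaysPastDueNotWorse",
        "NumberOfTime60-89DaysPastDueNotWorse",
        "NumberOfTimes90DaysLate",
    ],
    "social": ["NumberOfDependents"],
}
TARGET = "SeriousDlqin2yrs"


def columns_missing_from(columns):
    """Return the expected feature columns that are absent from `columns`."""
    present = set(columns)
    expected = [c for cols in FIELD_GROUPS.values() for c in cols]
    return [c for c in expected if c not in present]


# Example: a frame that only has two of the ten features
print(columns_missing_from(["age", "DebtRatio"]))
```

Once the data is loaded, calling `columns_missing_from(train.columns)` should return an empty list if every grouped field is present.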

3. Data Preprocessing

3.1 Loading the Data

# Import the packages needed for data preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import common statistics utilities
from scipy import stats
from scipy.stats import norm, skew

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Display floats in pandas output with 3 decimal places
pd.set_option('display.float_format', '{:.3f}'.format)

# Load the training and test sets
train = pd.read_csv('cs-training.csv')
test = pd.read_csv('cs-test v2.csv')

# Basic information about the data
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 12 columns):
CustomerID                              120000 non-null int64
RevolvingUtilizationOfUnsecuredLines    120000 non-null float64
age                                     120000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    120000 non-null int64
DebtRatio                               120000 non-null float64
MonthlyIncome                           96224 non-null float64
NumberOfOpenCreditLinesAndLoans         120000 non-null int64
NumberOfTimes90DaysLate                 120000 non-null int64
NumberRealEstateLoansOrLines            120000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    120000 non-null int64
NumberOfDependents                      116860 non-null float64
SeriousDlqin2yrs                        120000 non-null int64
dtypes: float64(4), int64(8)
memory usage: 11.0 MB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 12 columns):
CustomerID                              30000 non-null int64
RevolvingUtilizationOfUnsecuredLines    30000 non-null float64
age                                     30000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    30000 non-null int64
DebtRatio                               30000 non-null float64
MonthlyIncome                           24045 non-null float64
NumberOfOpenCreditLinesAndLoans         30000 non-null int64
NumberOfTimes90DaysLate                 30000 non-null int64
NumberRealEstateLoansOrLines            30000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    30000 non-null int64
NumberOfDependents                      29216 non-null float64
SeriousDlqin2yrs                        30000 non-null int64
dtypes: float64(4), int64(8)
memory usage: 2.7 MB
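
The `info()` output shows that only MonthlyIncome and NumberOfDependents have missing values. In the training set that is 120000 - 96224 = 23,776 missing incomes (about 19.8%) and 120000 - 116860 = 3,140 missing dependents (about 2.6%). A small helper like the one below reports this directly; it is a sketch run on a toy frame, and `missing_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing / len(df) * 100,
    })
    # Keep only columns with gaps, largest gap first
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy data standing in for the real training set
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3200.0, np.nan],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0],
    "age": [45, 38, 52, 29],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

On the real data the same call would be `missing_summary(train)` and `missing_summary(test)`.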

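3.2 Merging the Training and Test Sets

The training and test sets contain exactly the same fields: 10 numeric variables and one target variable (plus CustomerID), so they can be preprocessed together.

# Save the ID columns
train_ID = train['CustomerID']
test_ID = test['CustomerID']

# Save the targets, then merge the training and test sets
y_train = train.SeriousDlqin2yrs
y_test = test.SeriousDlqin2yrs
all_data = pd.concat((train, test)).reset_index(drop=True)
print("all_data size is : {}".format(all_data.shape))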

all_data size is : (150000, 12)
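
Once rows are dropped from the combined frame, positions no longer tell you which rows came from the training set. One way to keep that information, sketched here under the assumption that the frames will need to be separated again after preprocessing, is to tag each row before concatenating. The helper names and the `is_train` column are hypothetical and do not appear in the original code.

```python
import pandas as pd


def combine_with_flag(train_df, test_df):
    """Concatenate two frames, recording each row's origin in `is_train`."""
    return pd.concat(
        [train_df.assign(is_train=True), test_df.assign(is_train=False)],
        ignore_index=True,
    )


def split_by_flag(combined):
    """Recover the training and test parts and drop the helper column."""
    mask = combined["is_train"]
    train_part = combined[mask].drop(columns="is_train")
    test_part = combined[~mask].drop(columns="is_train")
    return train_part, test_part


# Toy frames standing in for train and test
toy_train = pd.DataFrame({"CustomerID": [1, 2, 3], "age": [45, 38, 52]})
toy_test = pd.DataFrame({"CustomerID": [4, 5], "age": [29, 61]})

combined = combine_with_flag(toy_train, toy_test)
combined = combined[combined["age"] != 38]  # simulate dropping a row
train_part, test_part = split_by_flag(combined)
print(train_part.shape, test_part.shape)
```

If this flag is used, it should be left out of any column list passed to a duplicate check, otherwise identical rows that appear in both sets would no longer count as duplicates.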

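3.3 Handling Duplicates

The training set alone has 120,000 records, so it very likely contains duplicate rows; remove them first.

# Check the data size before deduplication
print("Size before processing: {}".format(all_data.shape))

# Columns used to detect duplicates (everything except the ID)
cols_df = all_data.columns.values.tolist()
cols_df.remove('CustomerID')

# Drop duplicate rows from the combined data
all_data.drop_duplicates(subset=cols_df, inplace=True)

# Check the data size after deduplication
print("Size after processing: {}".format(all_data.shape))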
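
The reason for excluding CustomerID is that every row has a unique ID, so a duplicate check over all columns would never find anything. The toy example below illustrates the effect; the data is made up for illustration and `feature_cols` is a hypothetical name.

```python
import pandas as pd

# Rows 1 and 2 share the same feature values but have different IDs
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "age": [45, 45, 38, 45],
    "DebtRatio": [0.30, 0.30, 0.12, 0.55],
})

# Compare rows on every column except the ID
feature_cols = [c for c in toy.columns if c != "CustomerID"]

n_dupes_with_id = toy.duplicated().sum()
n_dupes = toy.duplicated(subset=feature_cols).sum()

# Keep the first occurrence of each duplicated feature combination
deduped = toy.drop_duplicates(subset=feature_cols).reset_index(drop=True)

print(n_dupes_with_id, n_dupes, deduped.shape)
```

Counting duplicates with `duplicated(...).sum()` before dropping them gives a quick check that the before and after sizes differ by the expected amount.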
