Credit Scorecard (A-Card): Data Processing and Modeling with an LR Model

Latest recommended article published on 2023-04-25 21:18:01

萝莉巴索小布丁


Categories: Credit Scorecard, Logistic Regression. Tags: credit scorecard, data practice

Article link: https://blog.csdn.net/axy_shelly/article/details/83274534


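This article describes how to build a credit scorecard with Logistic Regression. It covers data preprocessing (handling Log_Info, missing values, and outliers), feature engineering (variable binning and encoding), and scaling. Features are screened by information value (IV), linear correlation is dealt with, and the result is a working credit scoring model.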

Summary generated automatically by CSDN.

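Data source: the 魔镜杯 (Magic Mirror Cup) risk-control algorithm competition held by 拍拍贷 (PPDai). See the competition page for a detailed description of the data.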

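0. Key fields of the dataset and their descriptions:

Master: each row is one sample (a successfully funded loan), and each sample has more than 200 fields of various kinds.

idx: the unique key of each loan; it can be matched against idx in the other two files.
UserInfo_*: borrower attribute fields
WeblogInfo_*: web behavior fields
Education_Info*: education and enrollment fields
ThirdParty_Info_PeriodN_*: third-party data fields for period N
SocialNetwork_*: social network fields
ListingInfo: the date the loan was funded
Target: default label (1 = loan defaulted, 0 = repaid normally). The test set does not contain the target field.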

Log_Info: the borrower's login records.

ListingInfo: the date the loan was funded
LogInfo1: operation code
LogInfo2: operation category
LogInfo3: login time
idx: the unique key of each loan

Userupdate_Info: the borrower's profile updates

ListingInfo1: the date the loan was funded
UserupdateInfo1: what was modified
UserupdateInfo2: when it was modified
idx: the unique key of each loan

Logistic Regression is simple, stable, and interpretable, which makes it a good model to start with for a first hands-on project.

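1. Data preprocessing

Features can be derived by summing, taking ratios, counting frequencies, and averaging.

Handling Log_Info

In this dataset, the gap in days between the login date and the funding date is within 180 days for most records.

Time slices within half a year are chosen: 30, 60, 90, 120, 150, and 180 days (the code in the appendix also adds a 7-day slice).

For each time slice, the following can be computed:

the number of logins
the number of distinct login methods
the average number of logins per login method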
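
The three statistics above can be sketched compactly with pandas. This is only an illustration, not the appendix code: the toy `logs` table, the `gap_days` column (days between login and funding), and the function name `window_features` are assumptions made for the example.

```python
import pandas as pd

# Toy login records; gap_days = funding date minus login date, in days (assumed column).
logs = pd.DataFrame({
    'Idx':      [1, 1, 1, 2, 2, 3],
    'LogInfo1': [4, 4, 10, 4, -4, 1],
    'gap_days': [3, 20, 100, 5, 200, 200],
})

def window_features(logs, var, windows):
    """For each time window, compute per-loan count, distinct count and their ratio."""
    out = pd.DataFrame(index=pd.Index(logs['Idx'].unique(), name='Idx'))
    for tw in windows:
        recent = logs[logs['gap_days'] <= tw]          # keep logins inside the window
        grouped = recent.groupby('Idx')[var]
        count = grouped.count().reindex(out.index, fill_value=0)
        unique = grouped.nunique().reindex(out.index, fill_value=0)
        out[f'{var}_{tw}_count'] = count
        out[f'{var}_{tw}_unique'] = unique
        # average logins per distinct value; 0 when the loan has no logins in the window
        out[f'{var}_{tw}_avg_count'] = (count / unique.where(unique > 0)).fillna(0.0)
    return out

features = window_features(logs, 'LogInfo1', [30, 180])
print(features)
```

Loans with no activity inside a window get 0 for all three features, which mirrors the divide-by-zero guard used in the appendix.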

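Missing values

Variables with more than 80% missing values are dropped; otherwise the missing values are treated as a special value
Missing values in the UserInfo_ fields of Master are filled in using highly correlated fields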
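
A minimal sketch of the 80% rule, assuming numeric columns and using -1 as the special value (the article does not say which value was used, so -1 and the function name `handle_missing` are placeholders):

```python
import numpy as np
import pandas as pd

def handle_missing(df, max_missing=0.8, special_value=-1):
    """Drop columns whose missing rate exceeds max_missing; flag the rest with a special value."""
    missing_rate = df.isna().mean()                 # share of missing rows per column
    kept = df.loc[:, missing_rate <= max_missing]   # columns at or below the threshold
    return kept.fillna(special_value), missing_rate

sample = pd.DataFrame({
    'mostly_missing': [np.nan] * 9 + [1.0],         # 90% missing, will be dropped
    'some_missing':   [1.0, np.nan] + [2.0] * 8,    # 10% missing, will be kept
})
cleaned, missing_rate = handle_missing(sample)
print(missing_rate)
print(cleaned.columns.tolist())
```

Keeping the missing values as their own special value lets the later binning step decide whether they behave differently from the observed values.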

Outliers

To avoid losing important information, outliers are left untouched at this stage and are handled during binning.

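Data consistency

Format differences: ListingInfo in Master is converted uniformly to timestamps; values that differ only in letter case are normalized; phone number formats are unified; and so on.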
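
As an illustration of these consistency fixes, the sketch below parses dates written in either of the two formats that appear in the appendix code and normalizes field names the way `ChangeContent` does. The helper names `parse_date` and `normalize_field` are made up for this example.

```python
from datetime import datetime

def parse_date(text):
    """Parse a date written as YYYY-MM-DD or YYYY/MM/DD (the two formats assumed here)."""
    for fmt in ('%Y-%m-%d', '%Y/%m/%d'):
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError('unrecognized date format: ' + repr(text))

def normalize_field(name):
    """Upper-case a field name and map a known alias to a single spelling."""
    name = name.strip().upper()
    return '_PHONE' if name == '_MOBILEPHONE' else name

print(parse_date('2014/03/05') == parse_date('2014-03-05'))
print(normalize_field('_mobilePhone'), normalize_field('_qQ'))
```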

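2. Feature engineering

Variables are binned with the chi-square binning (ChiMerge) method, after which the evenness of the resulting bin distribution is checked.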
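
The idea behind chi-square binning can be sketched as follows: start from fine-grained bins sorted by value, repeatedly compute the chi-square statistic of each adjacent pair, and merge the pair with the smallest statistic (the two bins whose good/bad mix is most alike). This is a simplified sketch, not the implementation in `scorecard_functions` (which is not shown in the article); bins are assumed to be given as `(good_count, bad_count)` tuples.

```python
import numpy as np

def chi2_pair(bin_a, bin_b):
    """Chi-square statistic of the 2x2 table formed by two adjacent bins."""
    observed = np.array([bin_a, bin_b], dtype=float)
    row_tot = observed.sum(axis=1, keepdims=True)
    col_tot = observed.sum(axis=0, keepdims=True)
    expected = row_tot * col_tot / observed.sum()
    mask = expected > 0                              # skip cells with zero expectation
    return float(((observed - expected)[mask] ** 2 / expected[mask]).sum())

def chi_merge(bins, max_bins):
    """Merge the most similar adjacent bins until at most max_bins remain."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(scores))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

initial = [(90, 10), (88, 12), (50, 50), (20, 80)]   # (good, bad) per bin, sorted by value
merged_bins = chi_merge(initial, max_bins=3)
print(merged_bins)
```

In the toy data the first two bins have nearly the same bad rate, so they are the ones merged.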

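In addition:

Outliers: when the special values make up less than 5% of the samples, they are merged into the largest bin of normal values.
Binning categorical variables:
- Ordered ones such as education level: assign values according to their order
- Unordered ones such as province and city: replace each category with its bad-sample rate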
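
Bad-rate encoding of an unordered category can be sketched in a few lines of pandas. The toy table and the function name `bad_rate_encode` are assumptions for illustration only.

```python
import pandas as pd

def bad_rate_encode(df, col, target):
    """Replace each category in col with the share of bad samples (target == 1) it contains."""
    rates = df.groupby(col)[target].mean()
    return df[col].map(rates), rates

sample = pd.DataFrame({
    'province': ['A', 'A', 'B', 'B', 'B', 'C'],
    'target':   [1,   0,   0,   0,   1,   0],
})
encoded, rates = bad_rate_encode(sample, 'province', 'target')
print(rates)
print(encoded.tolist())
```

Once encoded this way, the variable is numeric and can go through the same binning procedure as the continuous variables.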

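Encoding after binning: WOE = ln(GoodPercent/BadPercent), where GoodPercent and BadPercent are the bin's shares of all good samples and of all bad samples, respectively.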

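Feature selection:

Information value of a feature: IV = Σ (GoodPercent - BadPercent) * WOE, summed over all bins of the feature

IV measures the overall importance of a feature, and it also depends on how the feature is binned.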
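
A small worked example of WOE and IV, using the definitions above. It assumes every bin contains at least one good and one bad sample (otherwise the logarithm is undefined); the function name `woe_iv` is hypothetical.

```python
import math
import numpy as np

def woe_iv(good_counts, bad_counts):
    """Return the per-bin WOE values and the feature's IV from per-bin good/bad counts."""
    good = np.asarray(good_counts, dtype=float)
    bad = np.asarray(bad_counts, dtype=float)
    good_pct = good / good.sum()          # each bin's share of all good samples
    bad_pct = bad / bad.sum()             # each bin's share of all bad samples
    woe = np.log(good_pct / bad_pct)
    iv = float(np.sum((good_pct - bad_pct) * woe))
    return woe, iv

woe, iv = woe_iv([80, 20], [20, 80])
print(woe, iv)                            # WOE is +ln(4) and -ln(4); IV is 1.2 * ln(4)
```

Because every term (GoodPercent - BadPercent) * WOE is non-negative, IV can only grow as bins become more discriminating, which is why it depends on the binning.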

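As the IV figure shows, the IVs of the variables are generally low, so the selection criterion is relaxed slightly and a threshold of 0.02 is used for coarse screening.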

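Linear correlation: checked with the correlation matrix

Multicollinearity: a VIF (variance inflation factor) above 10 indicates that multicollinearity is present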
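
Both checks can be sketched with NumPy alone. The example relies on the identity that the VIF of variable j equals the j-th diagonal element of the inverse correlation matrix, which is equivalent to 1/(1 - R²) from regressing variable j on the others. The data is synthetic and the function name `vif` is chosen for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X (rows are samples)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
x3 = x1 + x2 + 0.05 * rng.normal(size=1000)   # almost a linear combination of x1 and x2

independent = vif(np.column_stack([x1, x2]))
collinear = vif(np.column_stack([x1, x2, x3]))
print(independent)   # close to 1: no multicollinearity
print(collinear)     # far above 10: multicollinearity is present
```

Pairwise correlations alone would miss this case less clearly, since x3 is tied to x1 and x2 jointly rather than to either one by itself.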

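Some variables have non-significant p-values, and some coefficients on the WOE-encoded variables are positive, so both the significance and the correctness of the signs need to be checked.

For every variable with a p-value above 0.1, a separate univariate logistic regression is fitted. All of the p-values then fall below 0.1, which shows that the non-significant p-values were caused by linear correlation among the variables.

For every variable with a positive coefficient, a separate univariate logistic regression is fitted. All of the coefficients come out as -1. This is the expected sign: a higher WOE means a larger share of good samples and therefore a lower default probability.

The variables are then sorted by IV in descending order and added one at a time, starting with the highest IV. If the p-values remain below 0.1 after a variable is added, it is kept and the process continues; otherwise the newly added variable is dropped.

After variable selection, all coefficients are negative and all p-values are below the 0.1 threshold.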
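
The IV-ordered forward selection described above might look roughly like the sketch below. To stay self-contained it uses a small hand-written Newton-Raphson logistic regression with Wald p-values as a stand-in for a library fit such as statsmodels; `fit_logit`, `forward_select`, and the synthetic data are all assumptions for illustration, and only the p-value check from the text is applied.

```python
import math
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must already include an intercept column.
    Returns the coefficients and their two-sided Wald p-values."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))      # asymptotic covariance of beta
    z = beta / np.sqrt(np.diag(cov))
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, pvals

def forward_select(columns, y, iv, p_max=0.1):
    """Add variables in descending IV order; keep a variable only if all p-values stay below p_max."""
    selected = []
    for name in sorted(iv, key=iv.get, reverse=True):
        trial = selected + [name]
        X = np.column_stack([np.ones(len(y))] + [columns[c] for c in trial])
        _, pvals = fit_logit(X, y)
        if np.all(pvals[1:] < p_max):                # ignore the intercept's p-value
            selected = trial
    return selected

# Synthetic data: x1 drives the default probability, x2 is pure noise.
rng = np.random.default_rng(0)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(1.0 + x1))).astype(float)
selected = forward_select({'x1': x1, 'x2': x2}, y, iv={'x1': 0.5, 'x2': 0.01})
print(selected)
```

In the real workflow the columns would be the WOE-encoded variables and `iv` their computed information values; a sign check on the coefficients could be added to the same condition.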

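3. Scaling

The probability is converted into a score: the lower the default probability, the better the applicant's credit quality and the higher the score.

y = log(p/(1-p))

PDO: when the good-to-bad odds double, the score increases by PDO points.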
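
One common way to turn y into a score that satisfies the PDO property is score = base_score - PDO/ln(2) * y, where p is taken to be the predicted default probability. The sketch below assumes PDO = 20 and a base score of 600 at even odds; the article does not state which values were actually used.

```python
import math

def probability_to_score(p_default, pdo=20.0, base_score=600.0):
    """Map a default probability to a score; pdo and base_score are assumed example values."""
    y = math.log(p_default / (1.0 - p_default))   # log-odds of default
    factor = pdo / math.log(2.0)                  # points gained per doubling of good:bad odds
    return base_score - factor * y

even_odds = probability_to_score(0.5)             # good:bad odds of 1:1
doubled_odds = probability_to_score(1.0 / 3.0)    # good:bad odds of 2:1
print(even_odds, doubled_odds)
```

Going from a default probability of 1/2 to 1/3 doubles the good-to-bad odds, so the score rises by exactly PDO points.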

The resulting scores are fairly evenly distributed.

Appendix:

Code 1: data processing and modeling code

# -*- coding: utf-8 -*-
import pandas as pd
import datetime
import collections
import numpy as np
import numbers
import random
import sys
import pickle
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
from matplotlib import pyplot as plt
from scorecard_functions import *
from sklearn.linear_model import LogisticRegressionCV

################################
### UDF: user-defined functions ###
################################
### For each time window, count the records whose gap falls within it (cumulative frequency) ###
def TimeWindowSelection(df, daysCol, time_windows):
    '''
    :param df: the dataset containing the days variable
    :param daysCol: the name of the column holding the gap in days
    :param time_windows: the list of time windows, in days
    :return: dict mapping each time window to the number of rows with daysCol <= window
    '''
    freq_tw = {}
    for tw in time_windows:
        freq = int((df[daysCol] <= tw).sum())
        freq_tw[tw] = freq
    return freq_tw


def DeivdedByZero(nominator, denominator):
    '''
    Return 0 when the denominator is 0; otherwise return the ordinary quotient
    '''
    if denominator == 0:
        return 0
    else:
        return nominator*1.0/denominator


# Normalize field names that refer to the same item (case differences, aliases)
def ChangeContent(x):
    y = x.upper()
    if y == '_MOBILEPHONE':
        y = '_PHONE'
    return y

def MissingCategorial(df,x):
    missing_vals = df[x].map(lambda x: int(x!=x))
    return sum(missing_vals)*1.0/df.shape[0]

def MissingContinuous(df,x):
    missing_vals = df[x].map(lambda x: int(np.isnan(x)))
    return sum(missing_vals) * 1.0 / df.shape[0]

def MakeupRandom(x, sampledList):
    if x==x:
        return x
    else:
        randIndex = random.randint(0, len(sampledList)-1)
        return sampledList[randIndex]



############################################################
# Step 0: initial work: read the data files, check the consistency of user Ids, etc. #
############################################################

folderOfData = '/Users/Code/Data Collections/bank default/'
data1 = pd.read_csv(folderOfData+'PPD_LogInfo_3_1_Training_Set.csv', header = 0)
data2 = pd.read_csv(folderOfData+'PPD_Training_Master_GBK_3_1_Training_Set.csv', header = 0,encoding = 'gbk')
data3 = pd.read_csv(folderOfData+'PPD_Userupdate_Info_3_1_Training_Set.csv', header = 0)

#############################################################################################
# Step 1: derive features from PPD_LogInfo_3_1_Training_Set & PPD_Userupdate_Info_3_1_Training_Set #
#############################################################################################
# compare whether the four city variables match
data2['city_match'] = data2.apply(lambda x: int(x.UserInfo_2 == x.UserInfo_4 == x.UserInfo_8 == x.UserInfo_20),axis = 1)
del data2['UserInfo_2']
del data2['UserInfo_4']
del data2['UserInfo_8']
del data2['UserInfo_20']

### Parse the login and listing dates, compute the gap in days, and inspect its distribution
data1['logInfo'] = data1['LogInfo3'].map(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
data1['Listinginfo'] = data1['Listinginfo1'].map(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
# vectorized date difference; avoids positional x[0]/x[1] indexing on a labelled row
data1['ListingGap'] = (data1['Listinginfo'] - data1['logInfo']).dt.days
plt.hist(data1['ListingGap'],bins=200)
plt.title('Days between login date and listing date')
ListingGap2 = data1['ListingGap'].map(lambda x: min(x,365))
plt.hist(ListingGap2,bins=200)

timeWindows = TimeWindowSelection(data1, 'ListingGap', range(30,361,30))

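'''
Use 180 days as the largest time window when computing the new features.
The available time windows are 7 days, 30 days, 60 days, 90 days, 120 days, 150 days and 180 days.
Within each time window, compute the total number of logins, the number of distinct login methods,
and the average number of logins per login method.
'''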
time_window = [7, 30, 60, 90, 120, 150, 180]
var_list = ['LogInfo1','LogInfo2']
data1GroupbyIdx = pd.DataFrame({'Idx':data1['Idx'].drop_duplicates()})

for tw in time_window:
    data1['TruncatedLogInfo'] = data1['Listinginfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data1.loc[data1['logInfo'] >= data1['TruncatedLogInfo']]
    for var in var_list:
        # count the occurrences of LogInfo1 and LogInfo2
        count_stats = temp.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var)+'_'+str(tw)+'_count'] = data1GroupbyIdx['Idx'].map(lambda x: count_stats.get(x,0))

        # count the distinct values of LogInfo1 and LogInfo2
        Idx_UserupdateInfo1 = temp[['Idx', var]].drop_duplicates()
        uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_unique'] = data1GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x,0))

        # calculate the average count of each value in LogInfo1 and LogInfo2
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_avg_count'] = data1GroupbyIdx[[str(var)+'_'+str(tw)+'_count',str(var) + '_' + str(tw) + '_unique']].\
            apply(lambda x: DeivdedByZero(x.iloc[0],x.iloc[1]), axis=1)


data3['ListingInfo'] = data3['ListingInfo1'].map(lambda x: datetime.datetime.strptime(x,'%Y/%m/%d'))
data3['UserupdateInfo'] = data3['UserupdateInfo2'].map(lambda x: datetime.datetime.strptime(x,'%Y/%m/%d'))
data3['ListingGap'] = (data3['ListingInfo'] - data3['UserupdateInfo']).dt.days
collections.Counter(data3['ListingGap'])
hist_ListingGap = np.histogram(data3['ListingGap'])
hist_ListingGap = pd.DataFrame({'Freq':hist_ListingGap[0],'gap':hist_ListingGap[1][1:]})
hist_ListingGap['CumFreq'] = hist_ListingGap['Freq'].cumsum()
hist_ListingGap['CumPercent'] = hist_ListingGap['CumFreq'].map(lambda x: x*1.0/hist_ListingGap.iloc[-1]['CumFreq'])

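'''
Unify QQ and qQ, Idnumber and idNumber, MOBILEPHONE and PHONE.
Within each time slice, compute
 (1) the number of updates
 (2) the number of distinct items that were updated
 (3) whether key items such as IDNUMBER, HASBUYCAR, MARRIAGESTATUSID, PHONE were updated
'''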
data3['UserupdateInfo1'] = data3['UserupdateInfo1'].map(ChangeContent)
data3GroupbyIdx = pd.DataFrame({'Idx':data3['Idx'].drop_duplicates()})

time_window = [7, 30, 60, 90, 120, 150, 180]
for tw in time_window:
    data3['TruncatedLogInfo'] = data3['ListingInfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data3.loc[data3['UserupdateInfo'] &g