Risk Control Models: Risk Early-Warning Model

A friend of mine recently interviewed for an algorithm role at a bank. The first round was a take-home test: the company sends out a problem set, the candidate completes it at home and then walks through it in the interview. The market really does seem soft this year; it is rare for a senior algorithm role to include such a round. I had a look at the data and found it quite interesting, so this post walks through the modeling pipeline and explains how WOE and LR are used together.

What WOE is good for

  • Handling missing values: binning lets nulls be treated as their own bin, so a field with only 30% effective coverage can still be put to use.
  • Handling outliers: binning isolates outliers and makes the variable more robust. For example, age is user-entered and may contain values like 200; such cases can simply be folded into the age > 60 bin.
  • Business interpretability: the business likes to reason about variables linearly (as x grows, y grows), but in reality the relationship between x and y is often nonlinear; a WOE transform restores that linear reading (see the formulas below).
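For reference, the standard definitions (with "good" = non-fraud and "bad" = fraud, matching the code later in this post): for bin $i$,

$$\mathrm{WOE}_i=\ln\frac{\mathrm{good}_i/\mathrm{good}_{\mathrm{total}}}{\mathrm{bad}_i/\mathrm{bad}_{\mathrm{total}}},\qquad \mathrm{IV}=\sum_i\Big(\frac{\mathrm{good}_i}{\mathrm{good}_{\mathrm{total}}}-\frac{\mathrm{bad}_i}{\mathrm{bad}_{\mathrm{total}}}\Big)\cdot\mathrm{WOE}_i$$

A bin where good and bad are equally represented gets WOE 0; the further a bin's WOE is from 0, the better it separates the classes, and IV sums that separation over all bins.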

Modeling pipeline

The pipeline:

Load data -> EDA -> Feature Generation -> Model Establishment -> Release online

EDA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from scipy.stats import chi2_contingency
pd.set_option('display.max_columns', None)
sns.set_theme(style="darkgrid")
sns.set(rc = {'figure.figsize':(20,15)})
# the load step was not shown in the original; hypothetically something like
# dataset = pd.read_csv('transactions.csv')
dataset.head()
(dataset.head(), transposed here for readability)
                                          0                    1                    2                    3                    4
accountNumber                     737265056            737265056            737265056            737265056            830329091
customerId                        737265056            737265056            737265056            737265056            830329091
creditLimit                          5000.0               5000.0               5000.0               5000.0               5000.0
availableMoney                       5000.0               5000.0               5000.0               5000.0               5000.0
transactionDateTime     2016-08-13 14:27:32  2016-10-11 05:05:54  2016-11-08 09:18:39  2016-12-10 02:14:50  2016-03-24 21:04:46
transactionAmount                     98.55                74.51                 7.47                 7.47                71.18
merchantName                           Uber          AMC #191138           Play Store           Play Store  Tim Hortons #947751
acqCountry                               US                   US                   US                   US                   US
merchantCountryCode                      US                   US                   US                   US                   US
posEntryMode                             02                   09                   09                   09                   02
posConditionCode                         01                   01                   01                   01                   01
merchantCategoryCode              rideshare        entertainment           mobileapps           mobileapps             fastfood
currentExpDate                      06/2023              02/2024              08/2025              08/2025              10/2029
accountOpenDate                  2015-03-14           2015-03-14           2015-03-14           2015-03-14           2015-08-06
dateOfLastAddressChange          2015-03-14           2015-03-14           2015-03-14           2015-03-14           2015-08-06
cardCVV                                 414                  486                  486                  486                  885
enteredCVV                              414                  486                  486                  486                  885
cardLast4Digits                        1803                  767                  767                  767                 3143
transactionType                    PURCHASE             PURCHASE             PURCHASE             PURCHASE             PURCHASE
echoBuffer                              NaN                  NaN                  NaN                  NaN                  NaN
currentBalance                          0.0                  0.0                  0.0                  0.0                  0.0
merchantCity                            NaN                  NaN                  NaN                  NaN                  NaN
merchantState                           NaN                  NaN                  NaN                  NaN                  NaN
merchantZip                             NaN                  NaN                  NaN                  NaN                  NaN
cardPresent                           False                 True                False                False                 True
posOnPremises                           NaN                  NaN                  NaN                  NaN                  NaN
recurringAuthInd                        NaN                  NaN                  NaN                  NaN                  NaN
expirationDateKeyInMatch              False                False                False                False                False
isFraud                               False                False                False                False                False
transactionDate                  2016-08-13           2016-10-11           2016-11-08           2016-12-10           2016-03-24
transactionHour                          14                    5                    9                    2                   21
transactionMonth                          8                   10                   11                   12                    3
fraud = dataset['isFraud'].value_counts().to_frame()
fraud['pct'] = fraud['isFraud']/fraud['isFraud'].sum()
display(fraud)
        isFraud      pct
False    773946  0.98421
True      12417  0.01579

This is an extremely imbalanced dataset. For imbalanced data there are two broad approaches, upsampling and downsampling; I'll save those for another post and won't cover them here.

Card swipes are sometimes charged more than once because of network issues and the like, which leaves duplicate records in the dataset; these duplicates need to be removed.

dataset = dataset[~(dataset['transactionType'].isin(['REVERSAL']))]
mult_swipe = dataset[dataset.duplicated(keep='first',subset=['customerId','transactionDate','transactionAmount','cardLast4Digits','transactionHour'])]
print('multi-swipe transaction number:{0},amount:{1}'.format(len(mult_swipe),sum(mult_swipe['transactionAmount'])))
dataset = dataset[~(dataset.index.isin(mult_swipe.index))]
multi-swipe transaction number:7565,amount:1076660.0299999956

First, a look at the distribution over time.

sns.set_theme(style="darkgrid")
sns.set(rc = {'figure.figsize':(20,15)})
plt.figure(figsize = (15,8))
sns.barplot(data = dataset, x='transactionMonth',y='transactionAmount',estimator=sum)

[Figure: total transactionAmount by transactionMonth]

sns.set_theme(style="darkgrid")
sns.set(rc = {'figure.figsize':(20,15)})
plt.figure(figsize = (15,8))
sns.barplot(data = dataset, x='transactionMonth',y='transactionAmount',estimator=len)

[Figure: transaction count by transactionMonth]

Transaction amount and transaction count are roughly flat across months.

plt.figure(figsize = (15,8))
sns.boxplot(data = dataset,x='transactionMonth',y='transactionAmount',notch=True,showcaps=True,
           flierprops={'marker':'x'},medianprops={'color':'coral'})

[Figure: boxplot of transactionAmount by transactionMonth]

The boxplots show that the outliers are mainly high-amount transactions.

Next, deal with the fields that are mostly null.
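The data_null frame is shown below without the code that builds it; a minimal sketch of one way to construct it:

# count nulls per column and express them as a fraction of all rows
data_null = pd.DataFrame({'null_num': dataset.isnull().sum(),
                          'total': len(dataset)})
data_null['pct'] = data_null['null_num'] / data_null['total']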

data_null
                          null_num   total       pct
accountNumber                    0  758495  0.000000
customerId                       0  758495  0.000000
creditLimit                      0  758495  0.000000
availableMoney                   0  758495  0.000000
transactionDateTime              0  758495  0.000000
transactionAmount                0  758495  0.000000
merchantName                     0  758495  0.000000
acqCountry                    4401  758495  0.005802
merchantCountryCode            703  758495  0.000927
posEntryMode                  3904  758495  0.005147
posConditionCode               396  758495  0.000522
merchantCategoryCode             0  758495  0.000000
currentExpDate                   0  758495  0.000000
accountOpenDate                  0  758495  0.000000
dateOfLastAddressChange          0  758495  0.000000
cardCVV                          0  758495  0.000000
enteredCVV                       0  758495  0.000000
cardLast4Digits                  0  758495  0.000000
transactionType                690  758495  0.000910
echoBuffer                  758495  758495  1.000000
currentBalance                   0  758495  0.000000
merchantCity                758495  758495  1.000000
merchantState               758495  758495  1.000000
merchantZip                 758495  758495  1.000000
cardPresent                      0  758495  0.000000
posOnPremises               758495  758495  1.000000
recurringAuthInd            758495  758495  1.000000
expirationDateKeyInMatch         0  758495  0.000000
isFraud                          0  758495  0.000000
transactionDate                  0  758495  0.000000
transactionHour                  0  758495  0.000000
transactionMonth                 0  758495  0.000000

Remove the columns whose null percentage is >= 0.5:

data_df = dataset[data_null[data_null['pct']<0.5].index.tolist()]
data_df = data_df[['customerId', 'creditLimit', 'availableMoney','transactionDateTime','transactionAmount', 'merchantName','acqCountry', 'merchantCountryCode', 'posEntryMode', 'posConditionCode','merchantCategoryCode','accountOpenDate','dateOfLastAddressChange', 'cardCVV', 'enteredCVV', 'cardLast4Digits',
       'transactionType', 'currentBalance','expirationDateKeyInMatch', 'isFraud', 'transactionDate','transactionHour','transactionMonth']]

Next, compare how fraud and normal transactions are distributed over time.

fig, axes = plt.subplots(3, 4)
for month in range(1, 13):
    ax = axes[(month - 1) // 4, (month - 1) % 4]
    sns.histplot(ax=ax,
                 data=data_df[(data_df['transactionMonth'] == month) & (data_df['isFraud'] == True)],
                 x='transactionAmount', binwidth=100, stat='probability').set(title=str(month))
plt.tight_layout()

[Figure: monthly histograms of transactionAmount for fraud transactions]

fig, axes = plt.subplots(3, 4)
for month in range(1, 13):
    ax = axes[(month - 1) // 4, (month - 1) % 4]
    sns.histplot(ax=ax,
                 data=data_df[(data_df['transactionMonth'] == month) & (data_df['isFraud'] == False)],
                 x='transactionAmount', binwidth=100, stat='probability').set(title=str(month))
plt.tight_layout()

[Figure: monthly histograms of transactionAmount for normal transactions]

For fraud transactions, the 0-100 and 100-200 amount ranges account for a high share. For normal transactions, the 0-100 range clearly dominates, and its share in Q4 is slightly higher than in other quarters.
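The transactionTime_ column used below is never constructed in the snippets shown; presumably it buckets transactionHour into periods of the day. A hypothetical construction:

# hypothetical: bucket transactionHour into time-of-day periods
def hour_to_period(h):
    if 6 <= h < 12:
        return 'morning'
    elif 12 <= h < 18:
        return 'afternoon'
    elif 18 <= h < 24:
        return 'evening'
    return 'night'

data_df['transactionTime_'] = data_df['transactionHour'].map(hour_to_period)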

plt.figure(figsize = (15,8))
g = sns.FacetGrid(data_df,col='transactionTime_',row='isFraud',margin_titles=True)
g.map(sns.histplot,'transactionAmount',stat='probability',binwidth = 50)
plt.tight_layout()

[Figure: histograms of transactionAmount faceted by transactionTime_ and isFraud]

Judging from these proportions, normal transactions are insensitive to time of day, with no visible difference. Fraud transactions look different in the afternoon, where the 100-200 range has the highest share.
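The cus_df aggregate shown next is also not built in the original snippets; a plausible reconstruction, sorting customers by total spend and accumulating their share (cus_pct and cumpct are the column names assumed by the plot below):

# hypothetical reconstruction of the customer-level aggregate
cus_df = data_df.groupby('customerId', as_index=False).agg(
    {'transactionAmount': 'sum', 'transactionDate': 'count'})
cus_df = cus_df.sort_values('transactionAmount', ascending=False).reset_index(drop=True)
cus_df['cumpct'] = cus_df['transactionAmount'].cumsum() / cus_df['transactionAmount'].sum()
cus_df['cus_pct'] = (cus_df.index + 1) / len(cus_df)
# average ticket size per customer, used for the TA histogram further below
cus_df['TA'] = cus_df['transactionAmount'] / cus_df['transactionDate']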

cus_df.head()
   customerId  transactionAmount  transactionDate    cumpct
0   380680241         4589985.93            31554  0.044210
1   882815134         1842601.52            12665  0.061958
2   570884863         1514931.43            10452  0.076549
3   246251253         1425588.84             9806  0.090280
4   369308035         1012414.42             6928  0.100032
sns.lineplot(x='cus_pct',y='cumpct',data=cus_df )

[Figure: cumulative share of transaction amount vs. share of customers]

Roughly 80% of the transaction amount comes from the top 20% of customers.

sns.histplot(data = cus_df, x='TA',stat='count',binwidth=10).set(title='TA')

[Figure: histogram of average ticket size (TA)]

Average ticket size clusters around 150.
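Several engineered columns appear below (days_tranz_open, days_address, CVV, merchantName_, label) without their construction; a hedged guess at their semantics, based purely on the names and the raw fields:

# hypothetical feature engineering (the names suggest these semantics)
df = data_df.copy()
df['transactionDateTime'] = pd.to_datetime(df['transactionDateTime'])
df['days_tranz_open'] = (df['transactionDateTime'] - pd.to_datetime(df['accountOpenDate'])).dt.days
df['days_address'] = (df['transactionDateTime'] - pd.to_datetime(df['dateOfLastAddressChange'])).dt.days
df['CVV'] = (df['cardCVV'] == df['enteredCVV'])                   # entered CVV matches card CVV
df['merchantName_'] = df['merchantName'].str.split(' #').str[0]   # strip per-store numbers
df['label'] = df['isFraud'].astype(int)                           # 0/1 target used by the WOE code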

df_ttl = df[['customerId', 'creditLimit','transactionAmount','acqCountry','merchantCategoryCode','transactionType','currentBalance','transactionMonth','transactionTime_',
        'days_tranz_open','merchantName_', 'CVV', 'days_address', 'label']]  # use the 0/1 'label' column as the target
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
trained_cols = ['customerId', 'creditLimit','transactionAmount','acqCountry','merchantCategoryCode','transactionType','currentBalance','transactionMonth','transactionTime_',
        'days_tranz_open','merchantName_', 'CVV', 'days_address','label']  # 'label' stays inside X because the WOE classes below expect it

s_train_X,s_test_X,s_train_y,s_test_y = train_test_split(df_ttl[trained_cols],\
                                                 df_ttl['label'],train_size=0.8,random_state=123)
cus_df = s_train_X.groupby('customerId',as_index=False).agg({'transactionAmount':'sum','acqCountry':'count'})
cus_df = cus_df.rename(columns = {'acqCountry':'Frequency'})
cus_df['TA'] = cus_df['transactionAmount']/cus_df['Frequency']
s_train_X = s_train_X.merge(cus_df[['customerId','Frequency','TA']],on='customerId',how='left')
s_test_X = s_test_X.merge(cus_df[['customerId','Frequency','TA']],on='customerId',how='left')
s_train_X['TA_Tranz'] = s_train_X['transactionAmount'] - s_train_X['TA']
s_test_X['TA_Tranz'] = s_test_X['transactionAmount'] - s_test_X['TA']

Feature generation
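The WOE class below inherits from an Analysis base class that the original never shows. A minimal sketch of what it must provide (a group_by_feature method returning good/bad counts per bin), assuming "good" = label 0 and "bad" = label 1:

class Analysis:
    # hypothetical base class: aggregate good/bad counts per bin
    def group_by_feature(self, feat):
        df = feat._df_woe
        return df.groupby('bin', as_index=False).agg(
            good=('label', lambda s: (s == 0).sum()),
            bad=('label', lambda s: (s == 1).sum()))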

class WOE(Analysis):
    @staticmethod
    def __perc_share(df,group_name):
        return df[group_name]/df[group_name].sum()
    def __calculate_perc_share(self,feat):
        df = self.group_by_feature(feat)
        df['perc_good'] = self.__perc_share(df,'good')
        df['perc_bad'] = self.__perc_share(df,'bad')
        df['perc_diff'] = df['perc_good'] - df['perc_bad']
        return df
    def calculate_woe(self,feat):
        df = self.__calculate_perc_share(feat)
        df['woe'] = np.log(df['perc_good']/df['perc_bad'])
        df['woe'] = df['woe'].replace([np.inf,-np.inf],np.nan).fillna(0)
        df['iv'] = df['perc_diff']*df['woe']  # per-bin IV; print_iv below sums this column
        return df
    
    
    
class CategoricalFeature():
    def __init__(self,df,feature):
        self.df = df
        self.feature = feature
    
    @property
    def _df_woe(self):
        df_woe = self.df.copy()
        df_woe['bin'] = df_woe[self.feature].fillna('missing')
        return df_woe[['bin','label']]
        
def draw_woe(woe_df, feature=''):
    # plot WOE per bin; pass the feature name explicitly for the title
    fig, ax = plt.subplots(figsize=(10,6))
    sns.barplot(x=woe_df.columns[0], y='woe', data=woe_df,
                palette=sns.cubehelix_palette(len(woe_df), start=0.5, rot=0.75, reverse=True))
    ax.set_title('WOE visualization for: ' + feature)
    plt.xticks(rotation=30)
    plt.show()
def print_iv(woe_df):
    iv = woe_df['iv'].sum()
    if iv < 0.02:
        interpre = 'useless'
    elif iv < 0.1:
        interpre = 'weak'
    elif iv < 0.3:
        interpre = 'medium'
    elif iv < 0.5:
        interpre = 'strong'
    else:
        interpre = 'toogood'
    return iv,interpre
        
Categorical features:
feature_cat = ['creditLimit','acqCountry', 'transactionType', 
               'transactionMonth', 'transactionTime_', 'CVV','merchantName_','merchantCategoryCode']
# feature_cat = ['creditLimit']
iv_dic = {}
iv_dic['feature'] = []
iv_dic['iv'] = []
iv_dic['interpretation'] = []
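# The loop that fills iv_dic was elided in the original; a plausible
# reconstruction using the WOE / CategoricalFeature classes above:
for feature in feature_cat:
    woe_df = WOE().calculate_woe(CategoricalFeature(s_train_X, feature))
    iv, interpre = print_iv(woe_df)
    iv_dic['feature'].append(feature)
    iv_dic['iv'].append(iv)
    iv_dic['interpretation'].append(interpre)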
    
iv_df = pd.DataFrame(iv_dic)
display(iv_df)
                feature        iv interpretation
0           creditLimit  0.019578        useless
1            acqCountry  0.000674        useless
2       transactionType  0.016641        useless
3      transactionMonth  0.003295        useless
4      transactionTime_  0.000690        useless
5                   CVV  0.004968        useless
6         merchantName_  0.766754        toogood
7  merchantCategoryCode  0.222249         medium
Continuous features:
import scipy.stats as stats
feature_conti = ['transactionAmount','currentBalance','days_tranz_open','days_address', 'Frequency', 'TA','TA_Tranz']
# feature_conti = ['Frequency']
iv_con_dic = {}
iv_con_dic['feature'] = []
iv_con_dic['iv'] = []
iv_con_dic['interpretation'] = []
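# Continuous features must be binned before WOE applies. The original
# binning class was not shown; a hypothetical quantile-binning analogue
# of CategoricalFeature, plus the elided IV loop:
class ContinuousFeature():
    def __init__(self, df, feature, bins=10):
        self.df = df
        self.feature = feature
        self.bins = bins
    @property
    def _df_woe(self):
        df_woe = self.df.copy()
        df_woe['bin'] = pd.qcut(df_woe[self.feature], q=self.bins, duplicates='drop')
        return df_woe[['bin', 'label']]

for feature in feature_conti:
    woe_df = WOE().calculate_woe(ContinuousFeature(s_train_X, feature))
    iv, interpre = print_iv(woe_df)
    iv_con_dic['feature'].append(feature)
    iv_con_dic['iv'].append(iv)
    iv_con_dic['interpretation'].append(interpre)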

iv_con_df = pd.DataFrame(iv_con_dic)
iv_con_df
             feature        iv interpretation
0  transactionAmount  0.377574         strong
1     currentBalance  0.003375        useless
2    days_tranz_open  0.000420        useless
3       days_address  0.003841        useless
4          Frequency  0.023392           weak
5                 TA  0.022107           weak
6           TA_Tranz  0.322500         strong
draw_woe(woe_df, 'TA_Tranz')  # woe_df here is the TA_Tranz frame from the loop above

[Figure: WOE per bin for TA_Tranz]

Here TA_Tranz, the difference between a single transaction's amount and that customer's average ticket size, serves as the example: once the difference exceeds 10, the transaction leans toward fraud.

col_model = iv_df[~(iv_df['interpretation'].isin(['useless']))]['feature'].values.tolist()
col_model.extend(iv_con_df[~(iv_con_df['interpretation'].isin(['useless']))]['feature'].values.tolist())
trained_col = []
for col in col_model:
    trained_col.append('woe'+col)
trained_col
['woemerchantName_',
 'woemerchantCategoryCode',
 'woetransactionAmount',
 'woeFrequency',
 'woeTA',
 'woeTA_Tranz']

The modeling below uses these 6 features.
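The step that actually creates the woe* columns is not shown in the original. A sketch of a mapping learned on the training set and applied to both splits; for the continuous features the lookup would go through the bin intervals rather than raw values, and unseen values fall back to WOE 0:

# hypothetical WOE encoding for the two categorical features
for col in ['merchantName_', 'merchantCategoryCode']:
    woe_map = WOE().calculate_woe(CategoricalFeature(s_train_X, col)).set_index('bin')['woe']
    s_train_X['woe'+col] = s_train_X[col].fillna('missing').map(woe_map).fillna(0)
    s_test_X['woe'+col] = s_test_X[col].fillna('missing').map(woe_map).fillna(0)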

Model

Use GridSearchCV to sweep the hyperparameters.

lr_paras = [{'penalty':['l2'],
             'C':[0.01,0.05,0.1,0.5,1,5,10,50,100],
             'class_weight':[{0:0.1,1:0.9},{0:0.2,1:0.8},{0:0.3,1:0.7},{0:0.4,1:0.6},{0:0.5,1:0.5},{0:0.6,1:0.4},{0:0.7,1:0.3},{0:0.8,1:0.2},{0:0.9,1:0.1}],
             'solver':['liblinear'],
             'multi_class':['ovr']}]

modelLR = GridSearchCV(LogisticRegression(tol=1e-6),lr_paras,cv=5,verbose=1)
modelLR.fit(s_train_X[trained_col], s_train_y)  # fit on the 6 WOE-encoded columns only
Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  7.0min finished





GridSearchCV(cv=5, estimator=LogisticRegression(tol=1e-06),
             param_grid=[{'C': [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100],
                          'class_weight': [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8},
                                           {0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6},
                                           {0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4},
                                           {0: 0.7, 1: 0.3}, {0: 0.8, 1: 0.2},
                                           {0: 0.9, 1: 0.1}],
                          'multi_class': ['ovr'], 'penalty': ['l2'],
                          'solver': ['liblinear']}],
             verbose=1)
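LR_ and lr_score are helpers the original never defines. From how they are called, LR_ evidently returns the fitted coefficients and caches a pred column (which explains the drop(columns='pred') below), while lr_score applies the model manually via the sigmoid; a hedged sketch:

def LR_(model, X, y):
    # hypothetical helper: cache predictions and return fitted parameters
    # (y kept only for signature compatibility with the call below)
    X['pred'] = model.predict_proba(X[trained_col])[:, 1]
    return model.coef_[0], model.intercept_[0]

def lr_score(X, coef_, intercept_):
    # manual logistic scoring: p = 1 / (1 + exp(-(Xw + b)))
    z = X[trained_col].values @ coef_ + intercept_
    return 1.0 / (1.0 + np.exp(-z))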
coef_,intercept_ = LR_(modelLR.best_estimator_,s_train_X,s_train_y)

s_test_X.drop(columns='pred', inplace=True, errors='ignore')  # remove cached predictions before re-scoring
s_train_X.drop(columns='pred', inplace=True, errors='ignore')
s_y_pred_prob = lr_score(s_test_X,coef_,intercept_)

s_y_pred_prob_train = lr_score(s_train_X,coef_,intercept_)

LR outputs a probability, so sweep candidate thresholds to find the best one.

from sklearn.metrics import f1_score

thre = np.linspace(0,0.5,50)
score_dic = {}
score_dic['thre'] = []
score_dic['score'] = []
for item in thre:
    s_y_pred_train = [1 if i >= item else 0 for i in s_y_pred_prob_train]
    score_f1 = f1_score(s_train_y,s_y_pred_train,average='macro')
    score_dic['thre'].append(item)
    score_dic['score'].append(score_f1)
thresh = score_dic['thre'][int(np.argmax(score_dic['score']))]  # threshold with the best macro-F1 on train
s_y_pred_train = [1 if i >= thresh else 0 for i in s_y_pred_prob_train]
s_y_pred = [1 if i >= thresh else 0 for i in s_y_pred_prob]
print('Training set performance')
print(metrics.confusion_matrix(s_train_y, s_y_pred_train))
print(metrics.classification_report(s_train_y, s_y_pred_train))

print('Test set performance')
print(metrics.confusion_matrix(s_test_y, s_y_pred))
print(metrics.classification_report(s_test_y, s_y_pred))
Training set performance
[[576091  14082]
 [  7969   1162]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.98    590173
           1       0.08      0.13      0.10      9131

    accuracy                           0.96    599304
   macro avg       0.53      0.55      0.54    599304
weighted avg       0.97      0.96      0.97    599304

Test set performance
[[144053   3385]
 [  2091    297]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.98    147438
           1       0.08      0.12      0.10      2388

    accuracy                           0.96    149826
   macro avg       0.53      0.55      0.54    149826
weighted avg       0.97      0.96      0.97    149826
fpr, tpr, _ = metrics.roc_curve(s_test_y, s_y_pred_prob)
auc = metrics.roc_auc_score(s_test_y, s_y_pred_prob)

plt.figure(figsize=(8,6))
sns.lineplot(x=fpr, y=tpr, label='Model AUC %0.2f' % auc, color='palevioletred', lw=2)
plt.plot([0, 1], [0, 1], color='lightgrey', lw=1.5, linestyle='--')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate',fontsize=12)
plt.ylabel('True Positive Rate',fontsize=12)
plt.title('ROC - Test Set',fontsize=13)
plt.legend(loc="lower right",fontsize=12)
plt.rc_context({'axes.edgecolor':'darkgrey','xtick.color':'black','ytick.color':'black','figure.facecolor':'white'})
plt.show() 

[Figure: ROC curve on the test set]

Although the AUC reaches 0.75, the macro F1-score is only 0.54, so the model is actually quite poor. This is mostly the effect of the extreme class imbalance, and it also shows how important it is to choose suitable metrics for such extremely imbalanced datasets.
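One concrete alternative worth reporting on data this skewed (a suggestion, not part of the original test): average precision, the area under the precision-recall curve, which ignores the overwhelming true-negative count:

from sklearn.metrics import average_precision_score
# PR-AUC is usually far more informative than ROC-AUC at a ~1.6% positive rate
pr_auc = average_precision_score(s_test_y, s_y_pred_prob)
print('PR-AUC (average precision): %.4f' % pr_auc)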
