Kaggle Project: Credit Card Fraud Detection

Background: explore all possibilities when approving a customer's loan.

Data contents: the dataset consists of 3 CSV tables. 'application_data.csv' is the main table, with 307,511 rows and 122 columns, including the target column; 'columns_description.csv' explains the meaning of each field in the other tables; 'previous_application.csv' contains previous application data.

Data source: https://www.kaggle.com/datasets/mishra5001/credit-card?datasetId=263888&sortBy=voteCount&select=application_data.csv

Below, Python is used to clean the credit data, perform exploratory data analysis (EDA), and attempt to predict credit fraud:

1. Data Preparation and Overview

# Import libraries
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
# Import machine-learning libraries
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score,f1_score,precision_score, recall_score
from sklearn.metrics import confusion_matrix

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
# Render CJK characters and minus signs correctly in matplotlib
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
data = pd.read_csv('application_data.csv')
origin = data.copy()
data.head()
   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  ...  AMT_REQ_CREDIT_BUREAU_YEAR
0      100002       1         Cash loans           M            N               Y             0          202500.0    406597.5      24700.5  ...                         1.0
1      100003       0         Cash loans           F            N               N             0          270000.0   1293502.5      35698.5  ...                         0.0
2      100004       0    Revolving loans           M            Y               Y             0           67500.0    135000.0       6750.0  ...                         0.0
3      100006       0         Cash loans           F            N               Y             0          135000.0    312682.5      29686.5  ...                         NaN
4      100007       0         Cash loans           M            N               Y             0          121500.0    513000.0      21865.5  ...                         0.0

5 rows × 122 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
# Check whether the dataset is balanced
fraud = len(data[data.TARGET == 1])
legit = len(data[data.TARGET == 0])
print("Number of fraud records:", fraud)
print("Fraud ratio: {:.4f} %".format(fraud/(fraud+legit)*100))
Number of fraud records: 24825
Fraud ratio: 8.0729 %
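The same imbalance check can be written more compactly with value_counts; a minimal sketch on made-up labels:

```python
import pandas as pd

# Toy target column: three legitimate (0) records and one fraud (1) record
target = pd.Series([0, 0, 0, 1])

# normalize=True returns class proportions directly
fraud_ratio = target.value_counts(normalize=True)[1]
print("Fraud ratio: {:.4f} %".format(fraud_ratio * 100))  # 25.0000 %
```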

2. Data Cleaning

# Check for duplicate records
print('Number of duplicate records in the dataset: ', data.duplicated().sum())
Number of duplicate records in the dataset:  0
# New feature: whether the client has applied before
data2 = pd.read_csv('previous_application.csv')
id_num = data2['SK_ID_CURR'][data2['AMT_APPLICATION']!=0.0].unique()
data['HAS_APPLIED'] = data['SK_ID_CURR'].isin(id_num).astype(int)
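To make the HAS_APPLIED construction concrete, here is a sketch on made-up IDs (100002 is the only client with a non-zero previous application):

```python
import pandas as pd

# Hypothetical previous applications
prev = pd.DataFrame({'SK_ID_CURR': [100002, 100003],
                     'AMT_APPLICATION': [50000.0, 0.0]})
# Hypothetical current applications
curr = pd.DataFrame({'SK_ID_CURR': [100002, 100003, 100004]})

# IDs that have at least one previous application with a non-zero amount
id_num = prev['SK_ID_CURR'][prev['AMT_APPLICATION'] != 0.0].unique()
curr['HAS_APPLIED'] = curr['SK_ID_CURR'].isin(id_num).astype(int)
print(curr['HAS_APPLIED'].tolist())  # [1, 0, 0]
```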
# Drop the ID column
data.drop('SK_ID_CURR',axis=1,inplace=True)
# Find columns with missing values and compute their missing ratios
per_null = data.isnull().sum().sort_values(ascending = False)/len(data)
per_null[per_null>0].head()
COMMONAREA_MODE             0.698723
COMMONAREA_AVG              0.698723
COMMONAREA_MEDI             0.698723
NONLIVINGAPARTMENTS_MEDI    0.694330
NONLIVINGAPARTMENTS_AVG     0.694330
dtype: float64
# Visualize the missing-value ratios
fig, ax = plt.subplots(figsize=(6, 12))
per_null[per_null>0].plot(kind='barh',ax=ax,title='Missing-value ratio')
<AxesSubplot:title={'center':'Missing-value ratio'}>

(Figure: horizontal bar chart of missing-value ratios)

# Drop columns with more than 40% missing values
null_col_name = per_null[per_null>0.4].index
data.drop(null_col_name,axis=1,inplace=True)

3. Exploratory Data Analysis (EDA)

(1) Object-type columns

# Object-type columns
print("Object-type columns:",np.count_nonzero(data.select_dtypes('object').columns))
print(data.select_dtypes('object').columns)
Object-type columns: 12
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'],
      dtype='object')
# Function that returns missing-value counts and percentages per column
def missing(df):
    total = df.isnull().sum().sort_values(ascending = False)
    total = total[total>0]
    percent = df.isnull().sum().sort_values(ascending = False)/len(df)*100
    percent = percent[percent>0]
    return pd.concat([total, percent], axis=1, keys=['Total','Percentage'])
# Check for missing values
missing(data.select_dtypes('object'))

                 Total  Percentage
OCCUPATION_TYPE  96391   31.345545
NAME_TYPE_SUITE   1292    0.420148
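As a quick sanity check, missing() behaves as expected on a toy frame (column names are made up):

```python
import numpy as np
import pandas as pd

def missing(df):
    # Per-column count of missing values, descending, nonzero only
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total > 0]
    # Per-column percentage of missing values
    percent = df.isnull().sum().sort_values(ascending=False) / len(df) * 100
    percent = percent[percent > 0]
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percentage'])

df = pd.DataFrame({'a': [1, np.nan, 3, np.nan], 'b': [1, 2, 3, 4]})
report = missing(df)
print(report)  # only column 'a' appears: Total 2, Percentage 50.0
```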
# Fill missing values with the string 'NaN', treating it as its own category
data['OCCUPATION_TYPE'].fillna('NaN',inplace=True)
data['NAME_TYPE_SUITE'].fillna('NaN',inplace=True)
# Check again for missing values
missing(data.select_dtypes('object'))

(empty: no missing values remain)
# Chi-square test of independence
for x in ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY','NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 
 'OCCUPATION_TYPE','WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE']:
    
    # Null hypothesis H0: the variable is independent of TARGET,
    # i.e. the observed frequencies match the expected frequencies.
    c_table = np.array(pd.crosstab(data['TARGET'],data[x]))
    stat,p,dof,expected = stats.chi2_contingency(c_table)  # chi-square statistic, p-value, degrees of freedom, expected frequencies
    prob = 0.95  # 95% confidence level
    critical = stats.chi2.ppf(prob,dof)  # critical value
    print('probability=%.3f,critical=%.3f,stat=%.3f '%(prob,critical,stat))
    if abs(stat)>=critical:
        print(x,': reject H0, the variables are not independent')
    else:
        print(x,': fail to reject H0, the variables are independent')
probability=0.950,critical=3.841,stat=293.151 
NAME_CONTRACT_TYPE : reject H0, the variables are not independent
probability=0.950,critical=5.991,stat=920.791 
CODE_GENDER : reject H0, the variables are not independent
probability=0.950,critical=3.841,stat=146.656 
FLAG_OWN_CAR : reject H0, the variables are not independent
probability=0.950,critical=3.841,stat=11.576 
FLAG_OWN_REALTY : reject H0, the variables are not independent
probability=0.950,critical=14.067,stat=45.190 
NAME_TYPE_SUITE : reject H0, the variables are not independent
probability=0.950,critical=14.067,stat=1253.471 
NAME_INCOME_TYPE : reject H0, the variables are not independent
probability=0.950,critical=9.488,stat=1019.213 
NAME_EDUCATION_TYPE : reject H0, the variables are not independent
probability=0.950,critical=11.070,stat=504.694 
NAME_FAMILY_STATUS : reject H0, the variables are not independent
probability=0.950,critical=11.070,stat=420.556 
NAME_HOUSING_TYPE : reject H0, the variables are not independent
probability=0.950,critical=28.869,stat=1975.083 
OCCUPATION_TYPE : reject H0, the variables are not independent
probability=0.950,critical=12.592,stat=15.388 
WEEKDAY_APPR_PROCESS_START : reject H0, the variables are not independent
probability=0.950,critical=75.624,stat=1609.241 
ORGANIZATION_TYPE : reject H0, the variables are not independent
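The same decision can be made from the p-value that chi2_contingency already returns, which avoids looking up critical values; a sketch on a made-up 2×2 contingency table:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: TARGET (rows) vs. a binary feature (columns)
table = np.array([[90, 60],
                  [10, 40]])
stat, p, dof, expected = stats.chi2_contingency(table)
print('dof =', dof)  # 1 for a 2x2 table
# Comparing p to 0.05 is equivalent to comparing stat to the 0.95 critical value
if p < 0.05:
    print('reject H0: the variables are not independent')
```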
# Summary statistics
data.describe(include=['O'])

(output transposed for readability)

Column                      count   unique  top                            freq
NAME_CONTRACT_TYPE          307511       2  Cash loans                    278232
CODE_GENDER                 307511       3  F                             202448
FLAG_OWN_CAR                307511       2  N                             202924
FLAG_OWN_REALTY             307511       2  Y                             213312
NAME_TYPE_SUITE             307511       8  Unaccompanied                 248526
NAME_INCOME_TYPE            307511       8  Working                       158774
NAME_EDUCATION_TYPE         307511       5  Secondary / secondary special 218391
NAME_FAMILY_STATUS          307511       6  Married                       196432
NAME_HOUSING_TYPE           307511       6  House / apartment             272868
OCCUPATION_TYPE             307511      19  NaN                            96391
WEEKDAY_APPR_PROCESS_START  307511       7  TUESDAY                        53901
ORGANIZATION_TYPE           307511      58  Business Entity Type 3         67992

(2) Float-type columns

# Float-type columns
print("Float-type columns:",np.count_nonzero(data.select_dtypes('float').columns))
print(data.select_dtypes('float').columns)
Float-type columns: 20
Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'CNT_FAM_MEMBERS',
       'EXT_SOURCE_2', 'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE',
       'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
       'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object')
# Check for missing values
missing(data.select_dtypes('float'))

                             Total  Percentage
EXT_SOURCE_3                 60965   19.825307
AMT_REQ_CREDIT_BUREAU_YEAR   41519   13.501631
AMT_REQ_CREDIT_BUREAU_QRT    41519   13.501631
AMT_REQ_CREDIT_BUREAU_MON    41519   13.501631
AMT_REQ_CREDIT_BUREAU_WEEK   41519   13.501631
AMT_REQ_CREDIT_BUREAU_DAY    41519   13.501631
AMT_REQ_CREDIT_BUREAU_HOUR   41519   13.501631
OBS_60_CNT_SOCIAL_CIRCLE      1021    0.332021
DEF_60_CNT_SOCIAL_CIRCLE      1021    0.332021
DEF_30_CNT_SOCIAL_CIRCLE      1021    0.332021
OBS_30_CNT_SOCIAL_CIRCLE      1021    0.332021
EXT_SOURCE_2                   660    0.214626
AMT_GOODS_PRICE                278    0.090403
AMT_ANNUITY                     12    0.003902
CNT_FAM_MEMBERS                  2    0.000650
DAYS_LAST_PHONE_CHANGE           1    0.000325
# Fill missing values with the mode; 'AMT_GOODS_PRICE' and 'AMT_ANNUITY' will be handled later
for col in ['EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_YEAR',
           'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_MON',
           'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_DAY',
           'AMT_REQ_CREDIT_BUREAU_HOUR', 'OBS_60_CNT_SOCIAL_CIRCLE',
           'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
           'OBS_30_CNT_SOCIAL_CIRCLE', 'EXT_SOURCE_2', 'CNT_FAM_MEMBERS', 
           'DAYS_LAST_PHONE_CHANGE']:
    data[col].fillna(data[col].mode()[0],inplace=True)
# Check again for missing values
missing(data.select_dtypes('float'))

                 Total  Percentage
AMT_GOODS_PRICE    278    0.090403
AMT_ANNUITY         12    0.003902
# Convert columns whose values are whole numbers to int
ct_cols = ['DAYS_REGISTRATION', 'CNT_FAM_MEMBERS', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
           'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
           'AMT_REQ_CREDIT_BUREAU_YEAR','AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_MON',
           'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_HOUR']
for col in ct_cols:
    data[col]=data[col].astype(int)
# Summary statistics
data.describe(include=['float'])

(output transposed for readability)

Column                      count    mean          std           min       25%       50%       75%       max
AMT_INCOME_TOTAL            307511   1.687979e+05  2.371231e+05  25650.0   112500.0  147150.0  202500.0  1.170000e+08
AMT_CREDIT                  307511   5.990260e+05  4.024908e+05  45000.0   270000.0  513531.0  808650.0  4.050000e+06
AMT_ANNUITY                 307499   27108.573909  14493.737315  1615.5    16524.0   24903.0   34596.0   258025.5
AMT_GOODS_PRICE             307233   5.383962e+05  3.694465e+05  40500.0   238500.0  450000.0  679500.0  4.050000e+06
REGION_POPULATION_RELATIVE  307511   0.020868      0.013831      0.000290  0.010006  0.018850  0.028663  0.072508
EXT_SOURCE_2                307511   0.513902      0.191148      0.000000  0.390782  0.565467  0.663422  0.855000
EXT_SOURCE_3                307511   0.557531      0.198114      0.000527  0.417100  0.604113  0.746300  0.896010
# Box-plot helper function
def boxplots_custom(dataset, columns_list, rows, cols, suptitle):
    fig, axs = plt.subplots(rows, cols, sharey=True, figsize=(12,4))
    fig.suptitle(suptitle,y=1, size=25)
    axs = axs.flatten()
    # Loop variable renamed from 'data' to 'col' to avoid shadowing the main DataFrame
    for i, col in enumerate(columns_list):
        sns.boxplot(data=dataset[col], orient='h', ax=axs[i])
        axs[i].set_title(col + ', skewness is: '+str(round(dataset[col].skew(axis = 0, skipna = True),2)))
# Outliers are left untreated for now
boxplots_custom(dataset=data.select_dtypes('float'), columns_list=data.select_dtypes('float').columns, 
                rows=2, cols=4, suptitle='Box plots')
plt.tight_layout()

(Figure: box plots of the float-type columns with skewness annotations)

# Use a correlation heatmap to spot multicollinearity among features
plt.figure(figsize = (4,4))
corr = data.select_dtypes('float').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask = mask, center = 0,square = True, cmap="viridis",linewidths = .6,vmin=-1, vmax=1)
plt.title('Correlation Table')
Text(0.5, 1.0, 'Correlation Table')
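Beyond eyeballing the heatmap, strongly correlated pairs can also be listed programmatically. A sketch on synthetic columns; the 0.9 threshold and the column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({'credit': x,
                   'goods_price': x + rng.normal(scale=0.05, size=200),  # nearly collinear with 'credit'
                   'income': rng.normal(size=200)})                      # independent

corr = df.corr().abs()
# Keep the upper triangle only so each pair is reported once
upper = corr.where(np.triu(np.ones_like(corr, dtype=bool), k=1))
pairs = [(r, c) for r in upper.index for c in upper.columns if upper.loc[r, c] > 0.9]
print(pairs)  # [('credit', 'goods_price')]
```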

(Figure: correlation heatmap of the float-type columns)

# Drop redundant features
data.drop(['AMT_GOODS_PRICE','AMT_ANNUITY'],axis=1,inplace=True)
missing(data.select_dtypes('float'))

(empty: no missing values remain)

(3) Int-type columns

# Int-type columns
print("Int-type columns:",np.count_nonzero(data.select_dtypes('int').columns))
print(data.select_dtypes('int').columns)
Int-type columns: 54
Index(['TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
       'HAS_APPLIED'],
      dtype='object')
# Check for missing values
missing(data.select_dtypes('int'))

(empty: no missing values)
# Use a correlation heatmap to spot multicollinearity among features
plt.figure(figsize = (11,11))
corr = data.select_dtypes('int').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask = mask, center = 0,square = True, cmap="viridis",linewidths = .6,vmin=-1, vmax=1)
plt.title('Correlation Table')
Text(0.5, 1.0, 'Correlation Table')

(Figure: correlation heatmap of the int-type columns)

# Drop redundant features
data.drop(['CNT_FAM_MEMBERS', 'FLAG_EMP_PHONE', 'REGION_RATING_CLIENT_W_CITY','LIVE_REGION_NOT_WORK_REGION',\
           'LIVE_CITY_NOT_WORK_CITY', 'OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'],axis=1,inplace=True)
# Convert day-count columns to years
for col in data.select_dtypes('int').columns:
    if col[:4]=="DAYS":
        data[col]=round(-data[col]/365,0).astype('int')
# Convert qualitative (categorical) columns to object type
convert_cols = ['FLAG_MOBIL', 'FLAG_WORK_PHONE','FLAG_CONT_MOBILE', 'FLAG_PHONE',
 'FLAG_EMAIL','REGION_RATING_CLIENT','HOUR_APPR_PROCESS_START', 
 'REG_REGION_NOT_LIVE_REGION','REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY', 
'FLAG_DOCUMENT_2','FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21', 'HAS_APPLIED']
data[convert_cols] = data[convert_cols].astype('object')
# Summary statistics
data.describe(include=['int'])

(output transposed for readability; all counts are 307511)

Column                      mean         std         min    25%  50%  75%  max
TARGET                      0.080729     0.272419    0      0    0    0    1
CNT_CHILDREN                0.417052     0.722121    0      0    0    1    19
DAYS_BIRTH                  43.938646    11.964047   21     34   43   54   69
DAYS_EMPLOYED               -174.895337  387.185661  -1001  1    3    8    49
DAYS_REGISTRATION           13.659440    9.659369    0      6    12   20   68
DAYS_ID_PUBLISH             8.198575     4.151520    0      5    9    12   20
REG_REGION_NOT_WORK_REGION  0.050769     0.219526    0      0    0    0    1
OBS_30_CNT_SOCIAL_CIRCLE    1.417523     2.398395    0      0    0    2    348
DEF_30_CNT_SOCIAL_CIRCLE    0.142944     0.446033    0      0    0    0    34
DAYS_LAST_PHONE_CHANGE      2.632862     2.281346    0      1    2    4    12
AMT_REQ_CREDIT_BUREAU_HOUR  0.005538     0.078014    0      0    0    0    4
AMT_REQ_CREDIT_BUREAU_DAY   0.006055     0.103037    0      0    0    0    9
AMT_REQ_CREDIT_BUREAU_WEEK  0.029723     0.190728    0      0    0    0    8
AMT_REQ_CREDIT_BUREAU_MON   0.231293     0.856810    0      0    0    0    27
AMT_REQ_CREDIT_BUREAU_QRT   0.229631     0.744059    0      0    0    0    261
AMT_REQ_CREDIT_BUREAU_YEAR  1.643447     1.855821    0      0    1    3    25
# Box-plot helper function (redefined with a taller figure for the larger grid)
def boxplots_custom(dataset, columns_list, rows, cols, suptitle):
    fig, axs = plt.subplots(rows, cols, sharey=True, figsize=(12,7))
    fig.suptitle(suptitle,y=1, size=25)
    axs = axs.flatten()
    for i, col in enumerate(columns_list):
        sns.boxplot(data=dataset[col], orient='h', ax=axs[i])
        axs[i].set_title(col + ', skewness is: '+str(round(dataset[col].skew(axis = 0, skipna = True),2)))
# Outliers are left untreated for now
boxplots_custom(dataset=data.select_dtypes('int'), columns_list=data.select_dtypes('int').columns, 
                rows=4, cols=4, suptitle='Box plots')
plt.tight_layout()

(Figure: box plots of the int-type columns with skewness annotations)

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 64 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   TARGET                      307511 non-null  int64  
 1   NAME_CONTRACT_TYPE          307511 non-null  object 
 2   CODE_GENDER                 307511 non-null  object 
 3   FLAG_OWN_CAR                307511 non-null  object 
 4   FLAG_OWN_REALTY             307511 non-null  object 
 5   CNT_CHILDREN                307511 non-null  int64  
 6   AMT_INCOME_TOTAL            307511 non-null  float64
 7   AMT_CREDIT                  307511 non-null  float64
 8   NAME_TYPE_SUITE             307511 non-null  object 
 9   NAME_INCOME_TYPE            307511 non-null  object 
 10  NAME_EDUCATION_TYPE         307511 non-null  object 
 11  NAME_FAMILY_STATUS          307511 non-null  object 
 12  NAME_HOUSING_TYPE           307511 non-null  object 
 13  REGION_POPULATION_RELATIVE  307511 non-null  float64
 14  DAYS_BIRTH                  307511 non-null  int32  
 15  DAYS_EMPLOYED               307511 non-null  int32  
 16  DAYS_REGISTRATION           307511 non-null  int32  
 17  DAYS_ID_PUBLISH             307511 non-null  int32  
 18  FLAG_MOBIL                  307511 non-null  object 
 19  FLAG_WORK_PHONE             307511 non-null  object 
 20  FLAG_CONT_MOBILE            307511 non-null  object 
 21  FLAG_PHONE                  307511 non-null  object 
 22  FLAG_EMAIL                  307511 non-null  object 
 23  OCCUPATION_TYPE             307511 non-null  object 
 24  REGION_RATING_CLIENT        307511 non-null  object 
 25  WEEKDAY_APPR_PROCESS_START  307511 non-null  object 
 26  HOUR_APPR_PROCESS_START     307511 non-null  object 
 27  REG_REGION_NOT_LIVE_REGION  307511 non-null  object 
 28  REG_REGION_NOT_WORK_REGION  307511 non-null  int64  
 29  REG_CITY_NOT_LIVE_CITY      307511 non-null  object 
 30  REG_CITY_NOT_WORK_CITY      307511 non-null  object 
 31  ORGANIZATION_TYPE           307511 non-null  object 
 32  EXT_SOURCE_2                307511 non-null  float64
 33  EXT_SOURCE_3                307511 non-null  float64
 34  OBS_30_CNT_SOCIAL_CIRCLE    307511 non-null  int32  
 35  DEF_30_CNT_SOCIAL_CIRCLE    307511 non-null  int32  
 36  DAYS_LAST_PHONE_CHANGE      307511 non-null  int32  
 37  FLAG_DOCUMENT_2             307511 non-null  object 
 38  FLAG_DOCUMENT_3             307511 non-null  object 
 39  FLAG_DOCUMENT_4             307511 non-null  object 
 40  FLAG_DOCUMENT_5             307511 non-null  object 
 41  FLAG_DOCUMENT_6             307511 non-null  object 
 42  FLAG_DOCUMENT_7             307511 non-null  object 
 43  FLAG_DOCUMENT_8             307511 non-null  object 
 44  FLAG_DOCUMENT_9             307511 non-null  object 
 45  FLAG_DOCUMENT_10            307511 non-null  object 
 46  FLAG_DOCUMENT_11            307511 non-null  object 
 47  FLAG_DOCUMENT_12            307511 non-null  object 
 48  FLAG_DOCUMENT_13            307511 non-null  object 
 49  FLAG_DOCUMENT_14            307511 non-null  object 
 50  FLAG_DOCUMENT_15            307511 non-null  object 
 51  FLAG_DOCUMENT_16            307511 non-null  object 
 52  FLAG_DOCUMENT_17            307511 non-null  object 
 53  FLAG_DOCUMENT_18            307511 non-null  object 
 54  FLAG_DOCUMENT_19            307511 non-null  object 
 55  FLAG_DOCUMENT_20            307511 non-null  object 
 56  FLAG_DOCUMENT_21            307511 non-null  object 
 57  AMT_REQ_CREDIT_BUREAU_HOUR  307511 non-null  int32  
 58  AMT_REQ_CREDIT_BUREAU_DAY   307511 non-null  int32  
 59  AMT_REQ_CREDIT_BUREAU_WEEK  307511 non-null  int32  
 60  AMT_REQ_CREDIT_BUREAU_MON   307511 non-null  int32  
 61  AMT_REQ_CREDIT_BUREAU_QRT   307511 non-null  int32  
 62  AMT_REQ_CREDIT_BUREAU_YEAR  307511 non-null  int32  
 63  HAS_APPLIED                 307511 non-null  object 
dtypes: float64(5), int32(13), int64(3), object(43)
memory usage: 134.9+ MB

(4) Outlier handling

# Function that returns the indices of records containing outliers (1.5*IQR rule)
def IQR_method (df,n,features):
    outlier_list = []
    
    for column in features:
        # First quartile (25%)
        Q1 = np.percentile(df[column], 25)
        # Third quartile (75%)
        Q3 = np.percentile(df[column],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = 1.5 * IQR
        # Indices of records outside the IQR fences for this column
        outlier_list_index = df[(df[column] < Q1 - outlier_step) | (df[column] > Q3 + outlier_step )].index
        # Append to the list
        outlier_list.extend(outlier_list_index)
        
    # Keep records flagged as outliers in more than n columns
    outlier_list = Counter(outlier_list)        
    multiple_outliers = list( k for k, v in outlier_list.items() if v > n )
     
    return multiple_outliers
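A quick sanity check of the IQR rule on a single made-up column (a compact re-statement of the function above):

```python
import numpy as np
import pandas as pd
from collections import Counter

def IQR_method(df, n, features):
    # Collect indices of rows outside the 1.5*IQR fences, per column
    outlier_list = []
    for column in features:
        Q1 = np.percentile(df[column], 25)
        Q3 = np.percentile(df[column], 75)
        step = 1.5 * (Q3 - Q1)
        idx = df[(df[column] < Q1 - step) | (df[column] > Q3 + step)].index
        outlier_list.extend(idx)
    # Keep rows flagged in more than n columns
    counts = Counter(outlier_list)
    return [k for k, v in counts.items() if v > n]

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]})
print(IQR_method(df, 0, ['x']))  # [10]: only the row holding the value 100
```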
# Indices of records that contain outliers
outliers_indexs = IQR_method(data,0,data.select_dtypes(['float','int']).columns)
# Comparing TARGET counts shows that all fraud records fall inside the outlier subset
outliers_tar = data.iloc[outliers_indexs]['TARGET'].value_counts()
origin_tar = data['TARGET'].value_counts()
pd.concat([outliers_tar,origin_tar],axis=1,keys=['outlier', 'origin'])

   outlier  origin
0   179967  282686
1    24825   24825
# Build a dataset containing only outlier records for the modeling that follows
data2 = data.iloc[outliers_indexs].reset_index(drop=True)

4. Building a Model to Predict Credit Fraud

# One-hot encode the categorical (object) columns
oh = OneHotEncoder(handle_unknown='ignore')
X_obj = oh.fit_transform(data2.select_dtypes('object')).toarray()
# Standardize the quantitative (numeric) columns
scaler = StandardScaler()
X_num = scaler.fit_transform(data2.select_dtypes(['float','int']).drop('TARGET',axis=1))
# Build the feature matrix
X = np.concatenate([X_num,X_obj],axis=1)
y = data2['TARGET']
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size = 0.3, random_state = 24)
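stratify=y matters here: with roughly 8% positives, it keeps the class ratio identical in the training and test sets. A sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90 negatives and 10 positives, mirroring an imbalanced target
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=24)
print(y_tr.mean(), y_te.mean())  # both 0.1: the 10% positive rate is preserved in each split
```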
# Train a classifier (in trials, logistic regression and AdaBoost underperformed the random forest classifier, so they are not shown)
rf = RandomForestClassifier(random_state = 12, criterion = 'gini',n_estimators=50)
rf.fit(X_train,y_train)
y_predict = rf.predict(X_test)
# Compute precision, recall, F1, and the confusion matrix
cm = confusion_matrix(y_test, y_predict)
rf_Precision = precision_score(y_test, y_predict)
rf_Recall = recall_score(y_test, y_predict)
rf_f1 = f1_score(y_test, y_predict)
rf_accuracy = accuracy_score(y_test, y_predict)
# Display evaluation scores
print("Random forest classifier:")
print('Precision: {:.3f}'.format(rf_Precision))
print('Recall: {:.3f}'.format(rf_Recall))
print('F1: {:.3f}'.format(rf_f1))
print('Accuracy: {:.3f}'.format(rf_accuracy))
print('Confusion matrix:')
print(cm)
Random forest classifier:
Precision: 0.939
Recall: 0.139
F1: 0.243
Accuracy: 0.895
Confusion matrix:
[[53923    67]
 [ 6411  1037]]
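These scores follow directly from the confusion matrix above, which has TN=53923, FP=67, FN=6411, TP=1037:

```python
# Entries read off the confusion matrix printed above
tn, fp, fn, tp = 53923, 67, 6411, 1037

precision = tp / (tp + fp)                  # fraction of flagged cases that are truly fraud
recall = tp / (tp + fn)                     # fraction of fraud cases that are caught
accuracy = (tp + tn) / (tn + fp + fn + tp)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))  # 0.939 0.139 0.895
```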

5. Key Factors for Predicting Fraud

# Feature importances
importances = rf.feature_importances_
# Recover the feature names
print('Number of features:',len(importances))
df_num=data2.select_dtypes(['float','int']).drop('TARGET',axis=1)
print('Number of quantitative features:',df_num.shape[1])
df_obj=pd.get_dummies(data2.select_dtypes('object'))
print('Number of one-hot categorical features:',df_obj.shape[1])
Number of features: 230
Number of quantitative features: 20
Number of one-hot categorical features: 210
# Feature labels
feat_labels = pd.concat([df_num, df_obj], axis=1).columns
# Indices sorted by importance, descending
indices = np.argsort(importances)[::-1] 
# Print the top-10 features ('-' left-aligns with trailing padding; '*' takes the field width from the argument list)
for f in range(10):   
    print("%2d) %-*s %f" % (f + 1, 40, feat_labels[indices[f]], importances[indices[f]]))
 1) EXT_SOURCE_2                             0.072038
 2) EXT_SOURCE_3                             0.063325
 3) AMT_CREDIT                               0.044284
 4) DAYS_BIRTH                               0.042454
 5) AMT_INCOME_TOTAL                         0.040097
 6) DAYS_REGISTRATION                        0.037490
 7) REGION_POPULATION_RELATIVE               0.037468
 8) DAYS_EMPLOYED                            0.036155
 9) DAYS_ID_PUBLISH                          0.032153
10) OBS_30_CNT_SOCIAL_CIRCLE                 0.029426

6. Class Imbalance: Oversampling

# Legitimate records far outnumber fraud records, so the dataset is imbalanced; oversample the training data
# Oversampling: SMOTE
smote_pipeline = make_pipeline(SMOTE(random_state=24), 
                               RandomForestClassifier(random_state = 12, criterion = 'gini',n_estimators=50))
smote_pipeline.fit(X_train,y_train)
y_predict = smote_pipeline.predict(X_test)
# Compute precision, recall, F1, and the confusion matrix
cm = confusion_matrix(y_test, y_predict)
rf_Precision = precision_score(y_test, y_predict)
rf_Recall = recall_score(y_test, y_predict)
rf_f1 = f1_score(y_test, y_predict)
rf_accuracy = accuracy_score(y_test, y_predict)
# Display evaluation scores
print("After SMOTE oversampling:")
print('Precision: {:.3f}'.format(rf_Precision))
print('Recall: {:.3f}'.format(rf_Recall))
print('F1: {:.3f}'.format(rf_f1))
print('Accuracy: {:.3f}'.format(rf_accuracy))
print('Confusion matrix:')
print(cm)
After SMOTE oversampling:
Precision: 0.875
Recall: 0.211
F1: 0.340
Accuracy: 0.901
Confusion matrix:
[[53766   224]
 [ 5876  1572]]
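Computing F1 from the two confusion matrices confirms the overall improvement:

```python
def f1_from_cm(tn, fp, fn, tp):
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

f1_base = f1_from_cm(53923, 67, 6411, 1037)    # random forest alone
f1_smote = f1_from_cm(53766, 224, 5876, 1572)  # with SMOTE oversampling
print(round(f1_base, 3), round(f1_smote, 3))   # 0.243 0.34
```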

7. Conclusions

  1. Separating qualitative from quantitative data during preprocessing clarifies what each field means, which improves the model's predictive performance;
  2. All fraud records fall within the outlier subset, which matches business logic, so a new dataset containing only outlier records can be used for modeling;
  3. The trained random forest classifier reaches a precision of 0.939 but a recall of only 0.139; closer communication with the business side is needed to keep adjusting and adding customer data labels;
  4. The trained model identified the top-10 features by importance: the external scores 'EXT_SOURCE_2' and 'EXT_SOURCE_3', credit amount, client age, client income, years since the client last updated registration information, relative population of the client's region, years employed, years since the client last updated identity documents, and observed defaults in the client's social surroundings (30 days past due). It is worth investigating the external data sources further to see whether related labels can be added to improve the model;
  5. Oversampling the training data with SMOTE before fitting raises recall to 0.211 while precision drops to 0.875, for an overall gain in F1.
  6. Whether to adopt SMOTE is a business decision: if it better prevents losses from loan fraud, adopt it; if the drop in precision misclassifies legitimate transactions as fraud and hurts the business, the trade is not worth it. In this model (average credit amount of 602,648 USD for legitimate clients and 557,778 USD for fraudulent clients), adopting SMOTE means the recall gain (0.072) recovers roughly 40,160 USD of principal per additional fraud case caught, while the precision drop (0.064) forfeits roughly 1,928 USD of profit per year for each legitimate application wrongly rejected (assuming an average 5% interest margin). Overall, if the dataset spans no more than 20 years, adopting SMOTE brings the company a net gain.
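The per-case amounts in point 6 follow from simple arithmetic on the averages stated above (the 5% margin is the text's own assumption):

```python
avg_credit_fraud = 557_778   # average credit amount of fraudulent clients (USD)
avg_credit_legit = 602_648   # average credit amount of legitimate clients (USD)

recall_gain = 0.211 - 0.139      # recall improvement from SMOTE
precision_drop = 0.939 - 0.875   # precision loss from SMOTE

# Principal recovered per additional fraud case caught
saved = recall_gain * avg_credit_fraud
# Annual profit forfeited per legitimate application wrongly rejected (5% margin)
lost = precision_drop * avg_credit_legit * 0.05
print(round(saved), round(lost))  # 40160 1928
```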