Kaggle Project: Credit Loan Fraud Detection
Credit Card Fraud Detection
Background: explore every possibility when approving a client's loan.
Data: the dataset consists of three CSV files. 'application_data.csv' is the main table, with 307,511 rows and 122 columns, including the target column; 'columns_description.csv' explains the meaning of each field in the other tables; 'previous_application.csv' contains clients' previous loan applications.
Source: https://www.kaggle.com/datasets/mishra5001/credit-card?datasetId=263888&sortBy=voteCount&select=application_data.csv
Below, we clean the credit data in Python, perform exploratory data analysis (EDA), and try to predict credit fraud:
1. Data Preparation and Overview
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

data = pd.read_csv('application_data.csv')
origin = data.copy()
data.head()
   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  ...  FLAG_DOCUMENT_18  FLAG_DOCUMENT_19  FLAG_DOCUMENT_20  FLAG_DOCUMENT_21  AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY  AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR
0      100002       1         Cash loans           M            N               Y             0          202500.0    406597.5      24700.5  ...                 0                 0                 0                 0                         0.0                        0.0                        0.0                        0.0                        0.0                         1.0
1      100003       0         Cash loans           F            N               N             0          270000.0   1293502.5      35698.5  ...                 0                 0                 0                 0                         0.0                        0.0                        0.0                        0.0                        0.0                         0.0
2      100004       0    Revolving loans           M            Y               Y             0           67500.0    135000.0       6750.0  ...                 0                 0                 0                 0                         0.0                        0.0                        0.0                        0.0                        0.0                         0.0
3      100006       0         Cash loans           F            N               Y             0          135000.0    312682.5      29686.5  ...                 0                 0                 0                 0                         NaN                        NaN                        NaN                        NaN                        NaN                         NaN
4      100007       0         Cash loans           M            N               Y             0          121500.0    513000.0      21865.5  ...                 0                 0                 0                 0                         0.0                        0.0                        0.0                        0.0                        0.0                         0.0
5 rows × 122 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
fraud = len(data[data.TARGET == 1])
legit = len(data[data.TARGET == 0])
print("Number of fraudulent records:", fraud)
print("Fraud ratio: {:.4f} %".format(fraud / (fraud + legit) * 100))
Number of fraudulent records: 24825
Fraud ratio: 8.0729 %
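Only about 8% of records are fraudulent, so the classes are heavily imbalanced. The same ratio can also be read directly from `value_counts(normalize=True)`; a minimal sketch on a toy label Series (illustrative, not the real TARGET column):

```python
import pandas as pd

# Toy target column: 92 legitimate (0) and 8 fraudulent (1) records,
# mimicking the roughly 8% positive rate seen in the real data.
target = pd.Series([0] * 92 + [1] * 8)

# normalize=True returns the share of each class instead of raw counts.
ratios = target.value_counts(normalize=True)
print("Fraud ratio: {:.4f} %".format(ratios[1] * 100))
```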
2. Data Cleaning
print('Number of duplicate records in the dataset:', data.duplicated().sum())
Number of duplicate records in the dataset: 0
data2 = pd.read_csv('previous_application.csv')
id_num = data2['SK_ID_CURR'][data2['AMT_APPLICATION'] != 0.0].unique()
data['HAS_APPLIED'] = data['SK_ID_CURR'].isin(id_num).astype(int)
data.drop('SK_ID_CURR', axis=1, inplace=True)
per_null = data.isnull().sum().sort_values(ascending=False) / len(data)
per_null[per_null > 0].head()
COMMONAREA_MODE 0.698723
COMMONAREA_AVG 0.698723
COMMONAREA_MEDI 0.698723
NONLIVINGAPARTMENTS_MEDI 0.694330
NONLIVINGAPARTMENTS_AVG 0.694330
dtype: float64
fig, ax = plt.subplots(figsize=(6, 12))
per_null[per_null > 0].plot(kind='barh', ax=ax, title='Missing value ratio')
<AxesSubplot:title={'center':'Missing value ratio'}>
null_col_name = per_null[per_null > 0.4].index
data.drop(null_col_name, axis=1, inplace=True)
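Dropping every column whose missing ratio exceeds a threshold can be expressed compactly with `isnull().mean()` and a boolean mask; a small sketch on a toy frame (column names hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
    'complete': [1, 2, 3, 4, 5],                           # 0% missing
})

# isnull().mean() gives the missing ratio per column;
# keep only columns with at most 40% missing values.
per_null = df.isnull().mean()
df = df.drop(columns=per_null[per_null > 0.4].index)
print(df.columns.tolist())  # ['complete']
```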
3. Exploratory Data Analysis (EDA)
(1) Object-type columns
print("Number of object-type columns:", np.count_nonzero(data.select_dtypes('object').columns))
print(data.select_dtypes('object').columns)
Number of object-type columns: 12
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'],
dtype='object')
def missing(df):
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total > 0]
    percent = df.isnull().sum().sort_values(ascending=False) / len(df) * 100
    percent = percent[percent > 0]
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percentage'])

missing(data.select_dtypes('object'))
                 Total  Percentage
OCCUPATION_TYPE  96391   31.345545
NAME_TYPE_SUITE   1292    0.420148
data['OCCUPATION_TYPE'].fillna('NaN', inplace=True)
data['NAME_TYPE_SUITE'].fillna('NaN', inplace=True)
missing(data.select_dtypes('object'))
for x in ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
          'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
          'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
          'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE']:
    c_table = np.array(pd.crosstab(data['TARGET'], data[x]))
    stat, p, dof, expected = stats.chi2_contingency(c_table)
    prob = 0.95
    critical = stats.chi2.ppf(prob, dof)
    print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
    if abs(stat) >= critical:
        print(x, ': reject H0 -- the variables are dependent')
    else:
        print(x, ': fail to reject H0 -- the variables are independent')
probability=0.950, critical=3.841, stat=293.151
NAME_CONTRACT_TYPE : reject H0 -- the variables are dependent
probability=0.950, critical=5.991, stat=920.791
CODE_GENDER : reject H0 -- the variables are dependent
probability=0.950, critical=3.841, stat=146.656
FLAG_OWN_CAR : reject H0 -- the variables are dependent
probability=0.950, critical=3.841, stat=11.576
FLAG_OWN_REALTY : reject H0 -- the variables are dependent
probability=0.950, critical=14.067, stat=45.190
NAME_TYPE_SUITE : reject H0 -- the variables are dependent
probability=0.950, critical=14.067, stat=1253.471
NAME_INCOME_TYPE : reject H0 -- the variables are dependent
probability=0.950, critical=9.488, stat=1019.213
NAME_EDUCATION_TYPE : reject H0 -- the variables are dependent
probability=0.950, critical=11.070, stat=504.694
NAME_FAMILY_STATUS : reject H0 -- the variables are dependent
probability=0.950, critical=11.070, stat=420.556
NAME_HOUSING_TYPE : reject H0 -- the variables are dependent
probability=0.950, critical=28.869, stat=1975.083
OCCUPATION_TYPE : reject H0 -- the variables are dependent
probability=0.950, critical=12.592, stat=15.388
WEEKDAY_APPR_PROCESS_START : reject H0 -- the variables are dependent
probability=0.950, critical=75.624, stat=1609.241
ORGANIZATION_TYPE : reject H0 -- the variables are dependent
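Each test above compares the chi-square statistic to a critical value; an equivalent decision uses the p-value returned by `chi2_contingency` directly. A minimal sketch on a hypothetical 2×2 contingency table with an obvious association:

```python
import numpy as np
from scipy import stats

# Hypothetical crosstab of a binary target vs. a binary feature
# with a strong association between the two.
c_table = np.array([[90, 10],
                    [10, 90]])
stat, p, dof, expected = stats.chi2_contingency(c_table)

# Deciding with p < 0.05 is equivalent to stat >= chi2.ppf(0.95, dof).
if p < 0.05:
    print('reject H0 -- the variables are dependent')
```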
data.describe(include=['O'])
        NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE            NAME_EDUCATION_TYPE NAME_FAMILY_STATUS  NAME_HOUSING_TYPE OCCUPATION_TYPE WEEKDAY_APPR_PROCESS_START       ORGANIZATION_TYPE
count               307511      307511       307511          307511          307511           307511                         307511             307511             307511          307511                     307511                  307511
unique                   2           3            2               2               8                8                              5                  6                  6              19                          7                      58
top             Cash loans           F            N               Y   Unaccompanied          Working  Secondary / secondary special            Married  House / apartment             NaN                    TUESDAY  Business Entity Type 3
freq                278232      202448       202924          213312          248526           158774                         218391             196432             272868           96391                      53901                   67992
(2) Float-type columns
print("Number of float-type columns:", np.count_nonzero(data.select_dtypes('float').columns))
print(data.select_dtypes('float').columns)
Number of float-type columns: 20
Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'CNT_FAM_MEMBERS',
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')
missing(data.select_dtypes('float'))
                            Total  Percentage
EXT_SOURCE_3                60965   19.825307
AMT_REQ_CREDIT_BUREAU_YEAR  41519   13.501631
AMT_REQ_CREDIT_BUREAU_QRT   41519   13.501631
AMT_REQ_CREDIT_BUREAU_MON   41519   13.501631
AMT_REQ_CREDIT_BUREAU_WEEK  41519   13.501631
AMT_REQ_CREDIT_BUREAU_DAY   41519   13.501631
AMT_REQ_CREDIT_BUREAU_HOUR  41519   13.501631
OBS_60_CNT_SOCIAL_CIRCLE     1021    0.332021
DEF_60_CNT_SOCIAL_CIRCLE     1021    0.332021
DEF_30_CNT_SOCIAL_CIRCLE     1021    0.332021
OBS_30_CNT_SOCIAL_CIRCLE     1021    0.332021
EXT_SOURCE_2                  660    0.214626
AMT_GOODS_PRICE               278    0.090403
AMT_ANNUITY                    12    0.003902
CNT_FAM_MEMBERS                 2    0.000650
DAYS_LAST_PHONE_CHANGE          1    0.000325
for col in ['EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_YEAR',
            'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_MON',
            'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_DAY',
            'AMT_REQ_CREDIT_BUREAU_HOUR', 'OBS_60_CNT_SOCIAL_CIRCLE',
            'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
            'OBS_30_CNT_SOCIAL_CIRCLE', 'EXT_SOURCE_2', 'CNT_FAM_MEMBERS',
            'DAYS_LAST_PHONE_CHANGE']:
    data[col].fillna(data[col].mode()[0], inplace=True)
missing(data.select_dtypes('float'))
                 Total  Percentage
AMT_GOODS_PRICE    278    0.090403
AMT_ANNUITY         12    0.003902
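The loop above performs mode imputation: each gap is filled with the column's most frequent value. A toy sketch of the same idea:

```python
import numpy as np
import pandas as pd

# A column with two missing values; the most frequent value (mode) is 2.0.
s = pd.Series([1.0, 2.0, 2.0, np.nan, np.nan])
s = s.fillna(s.mode()[0])  # mode() returns a Series, so take element 0
print(s.tolist())  # [1.0, 2.0, 2.0, 2.0, 2.0]
```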
ct_cols = ['DAYS_REGISTRATION', 'CNT_FAM_MEMBERS', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
           'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
           'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_MON',
           'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR']
for col in ct_cols:
    data[col] = data[col].astype(int)
data.describe(include=['float'])
       AMT_INCOME_TOTAL    AMT_CREDIT    AMT_ANNUITY  AMT_GOODS_PRICE  REGION_POPULATION_RELATIVE  EXT_SOURCE_2   EXT_SOURCE_3
count      3.075110e+05  3.075110e+05  307499.000000     3.072330e+05               307511.000000  3.075110e+05  307511.000000
mean       1.687979e+05  5.990260e+05   27108.573909     5.383962e+05                    0.020868  5.139023e-01       0.557531
std        2.371231e+05  4.024908e+05   14493.737315     3.694465e+05                    0.013831  1.911477e-01       0.198114
min        2.565000e+04  4.500000e+04    1615.500000     4.050000e+04                    0.000290  8.173617e-08       0.000527
25%        1.125000e+05  2.700000e+05   16524.000000     2.385000e+05                    0.010006  3.907824e-01       0.417100
50%        1.471500e+05  5.135310e+05   24903.000000     4.500000e+05                    0.018850  5.654672e-01       0.604113
75%        2.025000e+05  8.086500e+05   34596.000000     6.795000e+05                    0.028663  6.634218e-01       0.746300
max        1.170000e+08  4.050000e+06  258025.500000     4.050000e+06                    0.072508  8.549997e-01       0.896010
def boxplots_custom(dataset, columns_list, rows, cols, suptitle):
    fig, axs = plt.subplots(rows, cols, sharey=True, figsize=(12, 4))
    fig.suptitle(suptitle, y=1, size=25)
    axs = axs.flatten()
    for i, col in enumerate(columns_list):  # renamed from `data` to avoid shadowing the global DataFrame
        sns.boxplot(data=dataset[col], orient='h', ax=axs[i])
        axs[i].set_title(col + ', skewness is: ' + str(round(dataset[col].skew(axis=0, skipna=True), 2)))

boxplots_custom(dataset=data.select_dtypes('float'), columns_list=data.select_dtypes('float').columns,
                rows=2, cols=4, suptitle='Boxplots')
plt.tight_layout()
plt.figure(figsize=(4, 4))
corr = data.select_dtypes('float').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, center=0, square=True, cmap="viridis", linewidths=.6, vmin=-1, vmax=1)
plt.title('Correlation Table')
Text(0.5, 1.0, 'Correlation Table')
data.drop(['AMT_GOODS_PRICE', 'AMT_ANNUITY'], axis=1, inplace=True)
missing(data.select_dtypes('float'))
(3) Int-type columns
print("Number of int-type columns:", np.count_nonzero(data.select_dtypes('int').columns))
print(data.select_dtypes('int').columns)
Number of int-type columns: 54
Index(['TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'HAS_APPLIED'],
dtype='object')
missing(data.select_dtypes('int'))

plt.figure(figsize=(11, 11))
corr = data.select_dtypes('int').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, center=0, square=True, cmap="viridis", linewidths=.6, vmin=-1, vmax=1)
plt.title('Correlation Table')
Text(0.5, 1.0, 'Correlation Table')
data.drop(['CNT_FAM_MEMBERS', 'FLAG_EMP_PHONE', 'REGION_RATING_CLIENT_W_CITY', 'LIVE_REGION_NOT_WORK_REGION',
           'LIVE_CITY_NOT_WORK_CITY', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE'], axis=1, inplace=True)
for col in data.select_dtypes('int').columns:
    if col[:4] == "DAYS":
        data[col] = round(-data[col] / 365, 0).astype('int')
convert_cols = ['FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
                'FLAG_EMAIL', 'REGION_RATING_CLIENT', 'HOUR_APPR_PROCESS_START',
                'REG_REGION_NOT_LIVE_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
                'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
                'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
                'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
                'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
                'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
                'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
                'FLAG_DOCUMENT_21', 'HAS_APPLIED']
data[convert_cols] = data[convert_cols].astype('object')
data.describe(include=['int'])
              TARGET   CNT_CHILDREN     DAYS_BIRTH  DAYS_EMPLOYED  DAYS_REGISTRATION  DAYS_ID_PUBLISH  REG_REGION_NOT_WORK_REGION  OBS_30_CNT_SOCIAL_CIRCLE  DEF_30_CNT_SOCIAL_CIRCLE  DAYS_LAST_PHONE_CHANGE  AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY  AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR
count  307511.000000  307511.000000  307511.000000  307511.000000      307511.000000    307511.000000               307511.000000             307511.000000             307511.000000           307511.000000               307511.000000              307511.000000               307511.000000              307511.000000              307511.000000               307511.000000
mean        0.080729       0.417052      43.938646    -174.895337          13.659440         8.198575                    0.050769                  1.417523                  0.142944                2.632862                    0.005538                   0.006055                    0.029723                   0.231293                   0.229631                    1.643447
std         0.272419       0.722121      11.964047     387.185661           9.659369         4.151520                    0.219526                  2.398395                  0.446033                2.281346                    0.078014                   0.103037                    0.190728                   0.856810                   0.744059                    1.855821
min         0.000000       0.000000      21.000000   -1001.000000           0.000000         0.000000                    0.000000                  0.000000                  0.000000                0.000000                    0.000000                   0.000000                    0.000000                   0.000000                   0.000000                    0.000000
25%         0.000000       0.000000      34.000000       1.000000           6.000000         5.000000                    0.000000                  0.000000                  0.000000                1.000000                    0.000000                   0.000000                    0.000000                   0.000000                   0.000000                    0.000000
50%         0.000000       0.000000      43.000000       3.000000          12.000000         9.000000                    0.000000                  0.000000                  0.000000                2.000000                    0.000000                   0.000000                    0.000000                   0.000000                   0.000000                    1.000000
75%         0.000000       1.000000      54.000000       8.000000          20.000000        12.000000                    0.000000                  2.000000                  0.000000                4.000000                    0.000000                   0.000000                    0.000000                   0.000000                   0.000000                    3.000000
max         1.000000      19.000000      69.000000      49.000000          68.000000        20.000000                    1.000000                348.000000                 34.000000               12.000000                    4.000000                   9.000000                    8.000000                  27.000000                 261.000000                   25.000000
def boxplots_custom(dataset, columns_list, rows, cols, suptitle):
    fig, axs = plt.subplots(rows, cols, sharey=True, figsize=(12, 7))
    fig.suptitle(suptitle, y=1, size=25)
    axs = axs.flatten()
    for i, col in enumerate(columns_list):  # renamed from `data` to avoid shadowing the global DataFrame
        sns.boxplot(data=dataset[col], orient='h', ax=axs[i])
        axs[i].set_title(col + ', skewness is: ' + str(round(dataset[col].skew(axis=0, skipna=True), 2)))

boxplots_custom(dataset=data.select_dtypes('int'), columns_list=data.select_dtypes('int').columns,
                rows=4, cols=4, suptitle='Boxplots')
plt.tight_layout()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 64 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TARGET 307511 non-null int64
1 NAME_CONTRACT_TYPE 307511 non-null object
2 CODE_GENDER 307511 non-null object
3 FLAG_OWN_CAR 307511 non-null object
4 FLAG_OWN_REALTY 307511 non-null object
5 CNT_CHILDREN 307511 non-null int64
6 AMT_INCOME_TOTAL 307511 non-null float64
7 AMT_CREDIT 307511 non-null float64
8 NAME_TYPE_SUITE 307511 non-null object
9 NAME_INCOME_TYPE 307511 non-null object
10 NAME_EDUCATION_TYPE 307511 non-null object
11 NAME_FAMILY_STATUS 307511 non-null object
12 NAME_HOUSING_TYPE 307511 non-null object
13 REGION_POPULATION_RELATIVE 307511 non-null float64
14 DAYS_BIRTH 307511 non-null int32
15 DAYS_EMPLOYED 307511 non-null int32
16 DAYS_REGISTRATION 307511 non-null int32
17 DAYS_ID_PUBLISH 307511 non-null int32
18 FLAG_MOBIL 307511 non-null object
19 FLAG_WORK_PHONE 307511 non-null object
20 FLAG_CONT_MOBILE 307511 non-null object
21 FLAG_PHONE 307511 non-null object
22 FLAG_EMAIL 307511 non-null object
23 OCCUPATION_TYPE 307511 non-null object
24 REGION_RATING_CLIENT 307511 non-null object
25 WEEKDAY_APPR_PROCESS_START 307511 non-null object
26 HOUR_APPR_PROCESS_START 307511 non-null object
27 REG_REGION_NOT_LIVE_REGION 307511 non-null object
28 REG_REGION_NOT_WORK_REGION 307511 non-null int64
29 REG_CITY_NOT_LIVE_CITY 307511 non-null object
30 REG_CITY_NOT_WORK_CITY 307511 non-null object
31 ORGANIZATION_TYPE 307511 non-null object
32 EXT_SOURCE_2 307511 non-null float64
33 EXT_SOURCE_3 307511 non-null float64
34 OBS_30_CNT_SOCIAL_CIRCLE 307511 non-null int32
35 DEF_30_CNT_SOCIAL_CIRCLE 307511 non-null int32
36 DAYS_LAST_PHONE_CHANGE 307511 non-null int32
37 FLAG_DOCUMENT_2 307511 non-null object
38 FLAG_DOCUMENT_3 307511 non-null object
39 FLAG_DOCUMENT_4 307511 non-null object
40 FLAG_DOCUMENT_5 307511 non-null object
41 FLAG_DOCUMENT_6 307511 non-null object
42 FLAG_DOCUMENT_7 307511 non-null object
43 FLAG_DOCUMENT_8 307511 non-null object
44 FLAG_DOCUMENT_9 307511 non-null object
45 FLAG_DOCUMENT_10 307511 non-null object
46 FLAG_DOCUMENT_11 307511 non-null object
47 FLAG_DOCUMENT_12 307511 non-null object
48 FLAG_DOCUMENT_13 307511 non-null object
49 FLAG_DOCUMENT_14 307511 non-null object
50 FLAG_DOCUMENT_15 307511 non-null object
51 FLAG_DOCUMENT_16 307511 non-null object
52 FLAG_DOCUMENT_17 307511 non-null object
53 FLAG_DOCUMENT_18 307511 non-null object
54 FLAG_DOCUMENT_19 307511 non-null object
55 FLAG_DOCUMENT_20 307511 non-null object
56 FLAG_DOCUMENT_21 307511 non-null object
57 AMT_REQ_CREDIT_BUREAU_HOUR 307511 non-null int32
58 AMT_REQ_CREDIT_BUREAU_DAY 307511 non-null int32
59 AMT_REQ_CREDIT_BUREAU_WEEK 307511 non-null int32
60 AMT_REQ_CREDIT_BUREAU_MON 307511 non-null int32
61 AMT_REQ_CREDIT_BUREAU_QRT 307511 non-null int32
62 AMT_REQ_CREDIT_BUREAU_YEAR 307511 non-null int32
63 HAS_APPLIED 307511 non-null object
dtypes: float64(5), int32(13), int64(3), object(43)
memory usage: 134.9+ MB
(4) Outlier Handling
def IQR_method(df, n, features):
    outlier_list = []
    for column in features:
        Q1 = np.percentile(df[column], 25)
        Q3 = np.percentile(df[column], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        outlier_list_index = df[(df[column] < Q1 - outlier_step) | (df[column] > Q3 + outlier_step)].index
        outlier_list.extend(outlier_list_index)
    outlier_list = Counter(outlier_list)
    multiple_outliers = list(k for k, v in outlier_list.items() if v > n)
    return multiple_outliers

outliers_indexs = IQR_method(data, 0, data.select_dtypes(['float', 'int']).columns)
outliers_tar = data.iloc[outliers_indexs]['TARGET'].value_counts()
origin_tar = data['TARGET'].value_counts()
pd.concat([outliers_tar, origin_tar], axis=1, keys=['outlier', 'origin'])
   outlier  origin
0   179967  282686
1    24825   24825
data2 = data.iloc[outliers_indexs].reset_index(drop=True)
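The IQR rule above flags a row whenever any feature falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A self-contained sketch of the same logic on toy data (values hypothetical):

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 100],   # 100 is an extreme value
                   'b': [5, 6, 7, 8, 9]})    # no outliers here

outliers = []
for column in df.columns:
    Q1, Q3 = np.percentile(df[column], 25), np.percentile(df[column], 75)
    step = 1.5 * (Q3 - Q1)
    outliers.extend(df[(df[column] < Q1 - step) | (df[column] > Q3 + step)].index)

# With n=0, any row flagged by at least one feature counts as an outlier.
flagged = [k for k, v in Counter(outliers).items() if v > 0]
print(flagged)  # [4]
```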
4. Building a Model to Predict Credit Fraud
oh = OneHotEncoder(handle_unknown='ignore')
X_obj = oh.fit_transform(data2.select_dtypes('object')).toarray()
scaler = StandardScaler()
X_num = scaler.fit_transform(data2.select_dtypes(['float', 'int']).drop('TARGET', axis=1))
X = np.concatenate([X_num, X_obj], axis=1)
y = data2['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=24)

rf = RandomForestClassifier(random_state=12, criterion='gini', n_estimators=50)
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
rf_Precision = precision_score(y_test, y_predict)
rf_Recall = recall_score(y_test, y_predict)
rf_f1 = f1_score(y_test, y_predict)
rf_accuracy = accuracy_score(y_test, y_predict)
print("Random forest classifier:")
print('Precision: {:.3f}'.format(rf_Precision))
print('Recall: {:.3f}'.format(rf_Recall))
print('F1 score: {:.3f}'.format(rf_f1))
print('Accuracy: {:.3f}'.format(rf_accuracy))
print('Confusion matrix:')
print(cm)
Random forest classifier:
Precision: 0.939
Recall: 0.139
F1 score: 0.243
Accuracy: 0.895
Confusion matrix:
[[53923    67]
 [ 6411  1037]]
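As a sanity check, precision and recall can be recomputed by hand from the printed confusion matrix (rows = actual class, columns = predicted class):

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 53923, 67, 6411, 1037

precision = tp / (tp + fp)  # share of predicted frauds that are real
recall = tp / (tp + fn)     # share of real frauds that are caught
print(round(precision, 3), round(recall, 3))  # 0.939 0.139
```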
5. Identifying the Key Features for Predicting Fraud
importances = rf.feature_importances_
print('Number of features:', len(importances))
df_num = data2.select_dtypes(['float', 'int']).drop('TARGET', axis=1)
print('Number of numeric features:', df_num.shape[1])
df_obj = pd.get_dummies(data2.select_dtypes('object'))
print('Number of one-hot encoded categorical features:', df_obj.shape[1])
Number of features: 230
Number of numeric features: 20
Number of one-hot encoded categorical features: 210
feat_labels = pd.concat([df_num, df_obj], axis=1).columns
indices = np.argsort(importances)[::-1]
for f in range(10):
    print("%2d) %-*s %f" % (f + 1, 40, feat_labels[indices[f]], importances[indices[f]]))
1) EXT_SOURCE_2 0.072038
2) EXT_SOURCE_3 0.063325
3) AMT_CREDIT 0.044284
4) DAYS_BIRTH 0.042454
5) AMT_INCOME_TOTAL 0.040097
6) DAYS_REGISTRATION 0.037490
7) REGION_POPULATION_RELATIVE 0.037468
8) DAYS_EMPLOYED 0.036155
9) DAYS_ID_PUBLISH 0.032153
10) OBS_30_CNT_SOCIAL_CIRCLE 0.029426
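Note that one-hot encoding splits each categorical column into many indicator features, so a category's total influence is spread across several importances. One way to compare whole columns is to sum importances over the dummies of each source column; a sketch with hypothetical feature names and importance values:

```python
import pandas as pd

# Hypothetical importances: two dummies of 'NAME_CONTRACT_TYPE' plus one numeric feature.
imp = pd.Series({'EXT_SOURCE_2': 0.07,
                 'NAME_CONTRACT_TYPE_Cash loans': 0.02,
                 'NAME_CONTRACT_TYPE_Revolving loans': 0.01})

def source_column(name):
    # Map a dummy column back to its source column by stripping the '_<category>' suffix.
    for prefix in ('NAME_CONTRACT_TYPE',):
        if name.startswith(prefix + '_'):
            return prefix
    return name

# Summing per source column gives one importance per original feature.
grouped = imp.groupby(source_column).sum()
print(grouped)
```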
6. Class Imbalance: Oversampling
smote_pipeline = make_pipeline(SMOTE(random_state=24),
                               RandomForestClassifier(random_state=12, criterion='gini', n_estimators=50))
smote_pipeline.fit(X_train, y_train)
y_predict = smote_pipeline.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
rf_Precision = precision_score(y_test, y_predict)
rf_Recall = recall_score(y_test, y_predict)
rf_f1 = f1_score(y_test, y_predict)
rf_accuracy = accuracy_score(y_test, y_predict)
print("After oversampling with SMOTE:")
print('Precision: {:.3f}'.format(rf_Precision))
print('Recall: {:.3f}'.format(rf_Recall))
print('F1 score: {:.3f}'.format(rf_f1))
print('Accuracy: {:.3f}'.format(rf_accuracy))
print('Confusion matrix:')
print(cm)
After oversampling with SMOTE:
Precision: 0.875
Recall: 0.211
F1 score: 0.340
Accuracy: 0.901
Confusion matrix:
[[53766   224]
 [ 5876  1572]]
7. Conclusions
- During data processing, distinguishing categorical from numeric data clarifies the meaning of each field and improves model performance.
- All fraud records fall inside the outlier subset, which is consistent with business logic, so a new dataset containing only the outlier rows can be used for modeling.
- The trained random forest classifier reaches a precision of 0.939 but a recall of only 0.139; closer collaboration with the business side is needed to keep adjusting and enriching customer data labels.
- The trained model identifies the 10 most important features, in order: the external scores 'EXT_SOURCE_2' and 'EXT_SOURCE_3', credit amount, client age, client income, years since the client last changed registration information, population of the client's region, years employed, years since the client last changed identity documents, and the number of observations of the client's social circle with 30 days past due. It is worth investigating the external data sources further to see whether related labels can be added to improve the model.
- Oversampling the training set with SMOTE before fitting raises recall to 0.211, while precision drops to 0.875; the overall F1 score improves.
- Whether to adopt SMOTE is a business decision: it pays off if it better prevents losses from loan fraud, but it backfires if the lower precision misclassifies legitimate applications as fraud and hurts the business. In this model (average credit of 602,648 dollars for legitimate clients and 557,778 dollars for fraudulent ones), SMOTE's recall gain (0.072) recovers roughly 40,160 dollars of principal per loan, while its precision drop (0.064) forfeits roughly 1,928 dollars of profit per rejected legitimate loan per year (assuming an average interest margin of 5%). Overall, if the dataset spans less than 20 years, adopting SMOTE should yield a net gain for the company.
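The cost-benefit estimate in the last point can be verified with a few lines of arithmetic (all figures come from the text above; the 5% interest margin is the stated assumption):

```python
avg_credit_fraud = 557778    # average credit amount of fraudulent clients
avg_credit_legit = 602648    # average credit amount of legitimate clients

recall_gain = 0.211 - 0.139      # 0.072
precision_drop = 0.939 - 0.875   # 0.064
margin = 0.05                    # assumed average interest margin per year

# Expected principal recovered per additionally caught fraudulent loan,
# and expected yearly profit lost per wrongly rejected legitimate loan.
principal_saved = recall_gain * avg_credit_fraud          # ≈ 40,160 per loan
profit_lost = precision_drop * avg_credit_legit * margin  # ≈ 1,928 per loan per year

# The break-even horizon is their ratio: about 20 years.
print(round(principal_saved), round(profit_lost), round(principal_saved / profit_lost, 1))
```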