[Kaggle Competition] IEEE-CIS Fraud Detection

This post walks through my entry in the IEEE-CIS Fraud Detection Kaggle competition, covering data exploration, feature engineering, and model building. Using Python and LightGBM for the binary classification task, I finished with a bronze medal. The keys were thorough EDA and feature engineering, in particular the handling of the D, C, and V features. Code examples and processing strategies are included, but this is not an optimal solution and is offered for reference only.

0. Preface

Kaggle Competition: IEEE-CIS Fraud Detection

  • Competition description:
    In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.
    In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.
    The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.

  • LB (Public Leaderboard): the AUC score computed on the first 20% of the test set.

  • Private Leaderboard (final score): the AUC score computed on the remaining 80% of the test set.

  • Two final submissions could be selected for this competition.

    Having taken part in a few introductory Kaggle competitions before, this time I tried the binary classification competition hosted by IEEE and Vesta, building a LightGBM model in Python on Jupyter Notebook. The key to a good score in this competition was mining the data and choosing sound strategies for turning it into features, which calls for very careful EDA and FE.
    Final result: bronze medal, 373/6381 (Top 6%), Private Leaderboard score 0.928512.
    The ideas presented here are meant to help understand the problem and explain the attached Python code; they are not an optimal approach. Both the ideas and the code are for reference only; for the methods and detailed steps involved, please follow the reference links. Variable names, comments, and experiment notes in the code are somewhat messy and are likewise provided as-is.

1.EDA

Please refer to the following Kaggle kernels:
Nanashi: Fraud complete EDA_Nanashi

1.1 Examining the data

Official data description and related Q&A: Data Description (Details and Discussion)

  1. First, the Transaction table:
    TransactionDT: not a real timestamp, but a timedelta in seconds from some reference point in time.
    TransactionAmt: transaction payment amount in USD; the fractional (cents) part is worth attention (see the sketch after this list).
    ProductCD: product code, one of W/H/C/S/R. Not necessarily a physical product; it may also refer to some kind of service.
    card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
    addr1 - addr2: billing region and billing country.
    dist: distances between (not limited to) billing address, mailing address, zip code, IP address, phone area, etc.
    P_emaildomain and R_emaildomain: purchaser and recipient email domain; some transactions do not require a recipient, and their R_emaildomain is null.
    C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. Plus like device, ipaddr, billingaddr, etc. Also these are for both purchaser and recipient, which doubles the number.
    D1-D15: timedelta, such as days between previous transaction, etc.
    M1-M9: match, such as names on card and address, etc. All are binary variables.
    Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations. Different blocks of V features have different missing ratios; their true meaning and the best way to handle them remain unclear.

  2. Next, the Identity table:
    id_01 to id_11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C.
    DeviceType, DeviceInfo, and id_12 - id_38 are categorical features.
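
To make the note about the fractional part of TransactionAmt in item 1 concrete, here is a small hedged sketch (not code from the original post), assuming train_df and test_df hold the loaded transaction tables; the column name TransactionAmt_cents is my own:

import numpy as np

# Hedged sketch: isolate the cents part of TransactionAmt (column name is an assumption)
for df in [train_df, test_df]:
    df['TransactionAmt_cents'] = (df['TransactionAmt'] - np.floor(df['TransactionAmt'])).astype(np.float32)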

Many of the EDA kernels reveal interesting characteristics of the data, in particular how the feature distributions change over time and where the train and test distributions differ.

1.2 Handling missing values

  1. Missing ratios: see the EDA kernels for the proportion of missing values in each feature.
  2. Fill missing values by identifying strongly dependent feature pairs.
    Reference: Gunes Evitan——IEEE-CIS Fraud Detection Dependency Check
def check_dependency(independent_var, dependent_var):
    
    independent_uniques = []
    temp_df = pd.concat([train_df[[independent_var, dependent_var]], test_df[[independent_var, dependent_var]]])
    
    for value in temp_df[independent_var].unique():
        independent_uniques.append(temp_df[temp_df[independent_var] == value][dependent_var].value_counts().shape[0])

    values = pd.Series(data=independent_uniques, index=temp_df[independent_var].unique())
    
    N = len(values)
    N_dependent = len(values[values == 1])
    N_notdependent = len(values[values > 1])
    N_null = len(values[values == 0])
        
    print(f'In {independent_var}, there are {N} unique values')
    print(f'{N_dependent}/{N} have one unique {dependent_var} value')
    print(f'{N_notdependent}/{N} have more than one unique {dependent_var} values')
    print(f'{N_null}/{N} have only missing {dependent_var} values\n')

An example:

check_dependency('R_emaildomain', 'C5')
print(train_df['C5'].isnull().sum()/train_df.shape[0])
print(test_df['C5'].isnull().sum()/test_df.shape[0])
print(test_df[~test_df['R_emaildomain'].isnull()]['C5'].value_counts())
In R_emaildomain, there are 61 unique values
60/61 have one unique C5 value
0/61 have more than one unique C5 values
1/61 have only missing C5 values
0.0
5.920768278891869e-06
0.0    135867
Name: C5, dtype: int64

We can see that R_emaildomain and C5 are strongly dependent. C5 has a small number of missing values, and only in the test set; they occur where R_emaildomain is present, and wherever R_emaildomain is present the non-missing C5 values are all 0, so filling the missing C5 values with 0 is reasonable.
Following this idea, several groups of strongly dependent features were found and used to fill the missing values in the test set:

#1.1 find dependency and fillna
# ('dist1', 'C3'): C3 is missing only in test, and only where dist1 is present; where dist1 is present, C3 is always 0
test_df['C3'] = test_df['C3'].fillna(0)
# ('R_emaildomain', 'C5'): C5 is missing only in test, almost always where R_emaildomain is present; only 3 missing C5 values have R_emaildomain missing
test_df['C5'] = test_df['C5'].fillna(0)
# ('id_30', 'C7'): C7 is missing only in test, only where id_30 (device) is present; only 3 such rows, the non-missing values are all 0
test_df['C7'] = test_df['C7'].fillna(0)
# ('id_31', 'C9'): C9 is missing only in test, only where id_31 (browser) is present; only 3 such rows, the non-missing values are all 0
test_df['C9'] = test_df['C9'].fillna(0)
  3. Fill the missing values of card2 - card6 using the card feature values most commonly associated with the same card1:
#1. More interaction between card features + fill nans
i_cols = ['TransactionID','card1','card2','card3','card4','card5','card6']

full_df = pd.concat([train_df[i_cols], test_df[i_cols]])

## I've used frequency encoding before so we have ints here
## we will drop very rare cards
full_df['card6'] = np.where(full_df['card6']==30, np.nan, full_df['card6'])
full_df['card6'] = np.where(full_df['card6']==16, np.nan, full_df['card6'])

i_cols = ['card2','card3','card4','card5','card6']

## We will find best match for nan values and fill with it (filling in card2-card6 this way helps a lot)
for col in i_cols:
    temp_df = full_df.groupby(['card1',col])[col].agg(['count']).reset_index()
    temp_df = temp_df.sort_values(by=['card1','count'], ascending=False).reset_index(drop=True)
    del temp_df['count']
    temp_df = temp_df.drop_duplicates(keep='first').reset_index(drop=True)
    temp_df.index = temp_df['card1'].values
    temp_df = temp_df[col].to_dict()
    full_df[col] = np.where(full_df[col].isna(), full_df['card1'].map(temp_df), full_df[col])
    
    
i_cols = ['card1','card2','card3','card4','card5','card6']
for col in i_cols:
    train_df[col] = full_df[full_df['TransactionID'].isin(train_df['TransactionID'])][col].values
    test_df[col] = full_df[full_df['TransactionID'].isin(test_df['TransactionID'])][col].values
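
The comment near the top of this block notes that the card columns had already been frequency encoded in an earlier step that is not shown in this post (which is why card6 is compared to integer counts above). A minimal sketch of what such a frequency-encoding pass could look like; the original apparently encoded in place, whereas this sketch writes suffixed copies, and the '_fq_enc' suffix is my own naming:

## Hedged sketch: frequency-encode the card columns over train + test (suffix '_fq_enc' is an assumption)
for col in ['card1', 'card2', 'card3', 'card5', 'card6']:
    temp = pd.concat([train_df[[col]], test_df[[col]]])
    freq = temp[col].value_counts().to_dict()
    train_df[col + '_fq_enc'] = train_df[col].map(freq)
    test_df[col + '_fq_enc'] = test_df[col].map(freq)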

1.3 Mining the hidden information in the data for the model

To protect user privacy, the organizers transformed many of the features and withheld their real meaning. Careful observation and analysis of the data are needed to work out what each feature represents and what information it carries, so that sensible processing strategies can be chosen.

  1. Dates
    Kevin——TransactionDT startdate
    Choosing 2017-11-30 as the start date makes Black Friday and Cyber Monday line up well; adding the TransactionDT timedelta to this start date yields the calendar date of each transaction (a sketch follows this list).
  2. D features
    Akasyanama——EDA what's behind D features?
    A Humphrey——Understanding the D features (updated)
    tuttifrutti——Creating features from D columns (guessing userID)
    A few whose meaning is fairly clear:
    D1: timedelta (days, rounded down) since first transaction for one card.
    D2: this appears to be the same as D1, except D1 = 0 values have been replaced by NaN.
    D3: timedelta since the previous transaction for one card. As with D1 and D2, this feature appears to count different cards separately.
    D4: timedelta since first transaction for all cards on the account. Using the example of a husband and wife each using their own card on a joint credit card account, this feature would not distinguish between which card was used.
    D5: timedelta since the previous transaction for all cards on the account.
    D6 and D7: some combined transformation of D4 and D5; dropping either one lowers the AUC.
    D8: timedelta (float) since some event.
    D9: the fractional part of D8, i.e. the hour of day. Since the fraud rate (the mean of isFraud) varies very little from hour to hour, this feature offers little predictive value, so I planned to drop it.
    D10: some kind of timedelta for domestic transactions.
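
A minimal sketch of the start-date conversion from item 1, assuming train_df and test_df hold the transaction data; 2017-11-30 is the anchor suggested by the referenced kernel, and the derived column names (DT, DT_M, DT_D) are my own:

import datetime

START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')

for df in [train_df, test_df]:
    # TransactionDT is a timedelta in seconds from the unknown reference point
    df['DT'] = df['TransactionDT'].apply(lambda x: START_DATE + datetime.timedelta(seconds=x))
    df['DT_M'] = (df['DT'].dt.year - 2017) * 12 + df['DT'].dt.month       # month index
    df['DT_D'] = (df['DT'].dt.year - 2017) * 365 + df['DT'].dt.dayofyear  # day index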

Chosen processing strategy:

  • Since the D features are time dependent and drift with TransactionDT, a subset of them (e.g. D1, D4) can be combined with TransactionDT by taking their difference. The resulting "D minus DT" values expose things like the date the card was opened or the time of the previous transaction, whereas the raw D features only reflect an accumulated timedelta since some event and carry the time drift with them. These D-minus-DT features can then be used to build synthetic user (uid) and card (cardid) identifiers that pin down individual users more precisely. They may bring some risk of overfitting, but this model kept them (a sketch follows the code fragment below).
  • The D features can also be normalized with min-max scaling and standard scoring within different time windows, implemented in a custom value_normalization function:
dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
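
A minimal sketch of the D-minus-DT and uid construction from the first bullet above; the day index, the choice of D1/D4, and the card_id/uid definitions are illustrative assumptions rather than the exact columns used:

# Hedged sketch: "D minus DT" features plus synthetic card/user identifiers
for df in [train_df, test_df]:
    day = df['TransactionDT'] // (24 * 60 * 60)   # whole days since the reference point
    df['D1_minus_DT'] = df['D1'] - day            # roughly constant per card, so it helps identify the card
    df['D4_minus_DT'] = df['D4'] - day

    df['card_id'] = df['card1'].astype(str) + '_' + df['card2'].astype(str) + '_' + df['card3'].astype(str)
    df['uid'] = df['card_id'] + '_' + df['addr1'].astype(str) + '_' + df['D1_minus_DT'].astype(str)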
  3. C features
    See the EDA-related kernels for their distributions.
    As mentioned earlier, since the C features count entities tied to the payer and recipient of a transaction (such as billing addresses or email addresses), some C features are strongly dependent on other features; this is the idea used above to fill their missing values in the test set.
    The train and test distributions differ considerably, so removing outliers was considered to improve the distributions (a clipping sketch follows this list).
  4. V features
    Reference: Laevatein——Interesting finding about the V columns
    The V features can be grouped into blocks by their missing-value ratio; within each block the features were presumably generated from the same underlying data.
    V1 ~ V11
    V12 ~ V34
    V35 ~ V52
    V53 ~ V74
    V75 ~ V94
    V95 ~ V137 (highly correlated)
    V126-V138
    V138 ~ V166 (high null ratio)
    V167 ~ V216 (high null ratio)
    V217 ~ V278 (high null ratio, 2 different null ratios)
    V279 ~ V321 (2 different null ratios)
    V289-V318
    V319-V321 (highly correlated)
    V322 ~ V339 (high null ratio)
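
Going back to item 3, a hedged sketch of the outlier handling considered for the C features; the 99th-percentile clipping threshold is an arbitrary illustrative choice, not necessarily what was actually used:

# Hedged sketch: clip extreme C values so the train/test distributions match better
c_cols = ['C' + str(i) for i in range(1, 15)]
for col in c_cols:
    upper = train_df[col].quantile(0.99)
    train_df[col] = train_df[col].clip(upper=upper)
    test_df[col] = test_df[col].clip(upper=upper)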

The numerical V features among these are:

'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
 'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
 'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
 'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
 'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
 'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
 'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335'

Chosen processing approach:

  • Apply scaling and PCA to the numerical V features (see the sketch below).
  • Group PCA over the V blocks and a few other transformations were also tried, but dropped because they did not improve the LB.
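
A minimal sketch of the scaling + PCA step on the numerical V columns, assuming scikit-learn is available; the column subset shown, the component count, and the V_pca_ output names are illustrative assumptions:

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hedged sketch: min-max scale the numerical V columns listed above, then compress them with PCA
v_num_cols = ['V126', 'V127', 'V128', 'V130', 'V131', 'V132', 'V133', 'V134']  # extend with the full list above

full_v = pd.concat([train_df[v_num_cols], test_df[v_num_cols]]).fillna(-1)
scaled = MinMaxScaler().fit_transform(full_v)

pca = PCA(n_components=5, random_state=42)
components = pca.fit_transform(scaled)

for i in range(components.shape[1]):
    train_df['V_pca_' + str(i)] = components[:len(train_df), i]
    test_df['V_pca_' + str(i)] = components[len(train_df):, i]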

2.Deep Feature Engineering

For the initial feature-processing ideas (LB -> 0.9487), see:
Konstantin Yakovlev——IEEE - Internal Blend
David Cairuz——Feature Engineering & LightGBM
For the later feature-processing ideas (LB: 0.9487 -> 0.9526), see my other experiment records; below is the feature engineering code that was finally adopted:

import numpy as np
import pandas as pd
import gc
import os, sys, random, datetime

Shrink the datasets so that they use less memory and can be processed more efficiently; reference: Konstantin Yakovlev——IEEE Data minification

def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

## Memory Reducer
# :df pandas dataframe to reduce size             # type: pd.DataFrame()
# :verbose                                        # type: bool
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

Load the training and test sets and reduce their memory footprint.

print('Load Data')
train_df = pd.read_csv('../input/train_transaction.csv')
test_df = pd.read_csv('../input/test_transaction.csv')
test_df['isFraud'] = 0
train_identity = pd.read_csv('../input/train_identity.csv')
test_identity = pd.read_csv('../input/test_identity.csv')
print('Reduce Memory')
train_df = reduce_mem_usage(train_df)
test_df  = reduce_mem_usage(test_df)
train_identity = reduce_mem_usage(train_identity)
test_identity  = reduce_mem_usage(test_identity)
Load Data
Reduce Memory
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 473.07 Mb (68.9% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
Mem. usage decreased to 25.44 Mb (42.7% reduction)

Initial processing of the identity data: split string features such as DeviceInfo, id_30 (operating system), and id_31 (browser) to generate new features, derive device features from id_33 (screen resolution), convert the remaining categorical features from strings to numerical values, and bin part of the information:

def id_split(dataframe):
    
    dataframe['device_name'] = dataframe['DeviceInfo'].str.split('/', expand=True)[0]
    dataframe['device_version'] = dataframe['DeviceInfo'].str.split('/', expand=True)[1]

    dataframe['OS_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[0]
    dataframe['version_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[1]
 
    dataframe['browser_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[0]
    dataframe['version_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[1]

    dataframe['screen_width'] = dataframe['id_33'].str.split('x', expand=True)[0]
    dataframe['screen_height'] = dataframe['id_33'].str.split('x', expand=True)[1]
    dataframe['id_12'] = dataframe['id_12'].map({'Found':1, 'NotFound':0})
    dataframe['id_15'] = dataframe['id_15'].map({'New':2, 'Found':1, 'Unknown':0})
    dataframe['id_16'] = dataframe['id_16'].map({'Found':1, 'NotFound':0})

    dataframe['id_23'] = dataframe['id_23'].map({'TRANSPARENT':4, 'IP_PROXY':3, 'IP_PROXY:ANONYMOUS':2, 'IP_PROXY:HIDDEN':1})

    dataframe['id_27'] = dataframe['id_27'].map({'Found':1, 'NotFound':0})
    dataframe['id_28'] = dataframe['id_28'].map({'New':2, 'Found':1})

    dataframe['id_29'] = dataframe['id_29'].map({'Found':1, 'NotFound':0})

    dataframe['id_35'] = dataframe['id_35'].map({'T':1, 'F':0})
    dataframe['id_36'] = dataframe['id_36'].map({'T':1, 'F':0})
    dataframe['id_37'] = dataframe['id_37'].map({'T':1, 'F':0})
    dataframe['id_38'] = dataframe['id_38'].map({'T':1, 'F':0})

    dataframe['id_34'] = dataframe['id_34'].fillna(':0')
    dataframe['id_34'] = dataframe['id_34'].apply(lambda x: x.split(':')[1]).astype(np.int8)
    dataframe['id_34'] = np.where(dataframe['id_34']==0, np.nan, dataframe['id_34'])
    
    dataframe['id_33'] = dataframe['id_33'].fillna('0x0')
    dataframe['id_33_0'] = dataframe['id_33'].apply(lambda x: x.split('x')[0]).astype(int)
    dataframe['id_33_1'] = dataframe['id_33'].apply(lambda x: x.split('x')[1]).astype(int)
    dataframe['id_33'] = np.where(dataframe['id_33']=='0x0', np.nan, dataframe['id_33'])
    
    for feature in ['id_01', 'id_31', 'id_33', 'id_36']:
        dataframe[feature + '_count_dist'] = dataframe[feature].map(dataframe[feature].value_counts(dropna=False))
    
    # assign the mapped result back (the original line discarded it)
    dataframe['DeviceType'] = dataframe['DeviceType'].map({'desktop':1, 'mobile':0})
    
    dataframe.loc[dataframe['device_name'].str.contains('SM', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('SAMSUNG', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('GT-', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('Moto G', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('Moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('LG-', na=False), 'device_name'] = 'LG'
    dataframe.loc[dataframe['device_name'].str.contains('rv:', na=False), 'device_name'] = 'RV'
    dataframe.loc[dataframe['device_name'].str.contains('HUAWEI', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('ALE-', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('-L', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('Blade', na=False), 'device_name'] = 'ZTE'
    # The snippet is truncated here in the original post; presumably a few more device_name
    # groupings follow (e.g. bucketing rare device names into an 'Others' category) before
    # the processed dataframe is returned.
    return dataframe
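
The original snippet cuts off inside id_split; presumably the function is then applied to both identity tables and the results merged into the transaction frames on TransactionID, along the lines of this hedged sketch:

# Hedged usage sketch; the left merge on TransactionID follows the official data description
train_identity = id_split(train_identity)
test_identity = id_split(test_identity)

train_df = train_df.merge(train_identity, how='left', on='TransactionID')
test_df = test_df.merge(test_identity, how='left', on='TransactionID')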