Kaggle | 金融交易欺诈检测(Synthetic Financial Datasets For Fraud Detection)

1 篇文章 0 订阅
1 篇文章 0 订阅

【学习用】
原题目地址:Synthetic Financial Datasets For Fraud Detection
原代码地址:Predicting Fraud in Financial Payment Services

目录

1 准备工作

1.1 各种引入

1.1.1 Import Packages

1.1.2 Import Dataset

1.2 数据探索

目录

1 准备工作

1.1 各种引入

1.1.1 Import Packages

1.1.2 Import Dataset

1.2 数据探索(Exploratory Data Analysis, EDA)

1.2.1 交易类型

1.2.2 商人账户

1.3 数据清洗

1.3.1 选取有用数据

1.3.2 缺失数据处理

2 深入分析

2.1 特征工程

2.2 可视化

3 建模求解

4 结论


1.3 数据清洗

2 深入分析

2.1 特征工程

2.2 可视化

3 建模求解

4 结论


1 准备工作

1.1 各种引入

1.1.1 Import Packages

# 基本
import pandas as pd
import numpy as np
# 可视化
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
# 建模
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import average_precision_score
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance, to_graphviz
# 警告忽略
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

1.1.2 Import Dataset

# 数据导入
df = pd.read_csv('data/raw_data.csv')
# 重命名变量(hhh是强迫症的概念:-P)
df = df.rename(columns={'oldbalanceOrg':'oldBalanceOrig', 'newbalanceOrig':'newBalanceOrig', 'oldbalanceDest':'oldBalanceDest', 'newbalanceDest':'newBalanceDest'})
# 前5条数据
print(df.head())
# 缺失数据初判
df.isnull().values.any()

1.2 数据探索(Exploratory Data Analysis, EDA)

在这部分中对数据进行简单分析

1.2.1 交易类型

共有5种交易类型(type):CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER.

# 判断欺诈出现在哪种交易类型
print('\n The types of fraudulent transactions are {}'.format(df.loc[df.isFraud == 1].type.drop_duplicates().values)) 
dfFraudTransfer = df.loc[(df.isFraud == 1) & (df.type == 'TRANSFER')]
dfFraudCashout = df.loc[(df.isFraud == 1) & (df.type == 'CASH_OUT')]
print ('\n The number of fraudulent TRANSFERs = {}'.format(len(dfFraudTransfer)))
print ('\n The number of fraudulent CASH_OUTs = {}'.format(len(dfFraudCashout)))
 The types of fraudulent transactions are ['TRANSFER', 'CASH_OUT']
 The number of fraudulent TRANSFERs = 4097
 The number of fraudulent CASH_OUTs = 4116

【待补充】判定isFlaggedFraud insignificant 的方法

1.2.2 商人账户

判断商人账户在各笔交易中的位置,已知商人账户以M打头

# M是否出现在CASH_IN的Orig账户中
print('\nAre there any merchants among originator accounts for CASH_IN transactions? {}'.format((df.loc[df.type == 'CASH_IN'].nameOrig.str.contains('M')).any())) # False
# M是否出现在CASH_OUT的Dest账户中
print('\nAre there any merchants among originator accounts for CASH_OUT transactions? {}'.format((df.loc[df.type == 'CASH_OUT'].nameOrig.str.contains('M')).any())) # False
# M是否出现在任何Orig账户中
print('\nAre there merchants among any originator accounts? {}'.format(df.nameOrig.str.contains('M').any())) # False
# M是否出现在任何非PAYMENT的Dest账户中
print('\nAre there any transactions having merchants among destination accounts other than the PAYMENT type? {}'.format((df.loc[df.nameDest.str.contains('M')].type != 'PAYMENT').any())) # False

1.3 数据清洗

1.3.1 选取有用数据

# 只取部分交易(行)数据研究
X = df.loc[(df.type == 'TRANSFER') | (df.type == 'CASH_OUT')]

Y = X['isFraud']
del X['isFraud'] # del删除变量-非数据

# 只取部分列数据研究
X = X.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

# type数据转换0-1
X.loc[X.type == 'TRANSFER', 'type'] = 0
X.loc[X.type == 'CASH_OUT', 'type'] = 1
X.type = X.type.astype(int)

1.3.2 缺失数据处理

# 划分欺诈交易和非欺诈交易
Xfraud = X.loc[Y == 1]
XnonFraud = X.loc[Y == 0]

# 欺诈数据中缺失数据占比
print('\nThe fraction of fraudulent transactions with \'oldBalanceDest\' = \
\'newBalanceDest\' = 0 although the transacted \'amount\' is non-zero is: {}'.\
format(len(Xfraud.loc[(Xfraud.oldBalanceDest == 0) & \
(Xfraud.newBalanceDest == 0) & (Xfraud.amount)]) / (1.0 * len(Xfraud))))

# 非欺诈数据中缺失数据占比
print('\nThe fraction of genuine transactions with \'oldBalanceDest\' = \
newBalanceDest\' = 0 although the transacted \'amount\' is non-zero is: {}'.\
format(len(XnonFraud.loc[(XnonFraud.oldBalanceDest == 0) & \
(XnonFraud.newBalanceDest == 0) & (XnonFraud.amount)]) / (1.0 * len(XnonFraud))))

# 给缺失数据打标签
X.loc[(X.oldBalanceDest == 0) & (X.newBalanceDest == 0) & (X.amount != 0), ['oldBalanceDest', 'newBalanceDest']] = - 1
X.loc[(X.oldBalanceOrig == 0) & (X.newBalanceOrig == 0) & (X.amount != 0), ['oldBalanceOrig', 'newBalanceOrig']] = np.nan

2 深入分析

2.1 特征工程

X['errorBalanceOrig'] = X.newBalanceOrig + X.amount - X.oldBalanceOrig
X['errorBalanceDest'] = X.oldBalanceDest + X.amount - X.newBalanceDest

2.2 可视化

limit = len(X)

def plotStrip(x, y, hue, figsize = (14, 9)):  
    fig = plt.figure(figsize = figsize)
    colours = plt.cm.tab10(np.linspace(0, 1, 9))
    with sns.axes_style('ticks'):
        ax = sns.stripplot(x, y, hue = hue, jitter = 0.4, marker = '.', size = 4, palette = colours)
        ax.set_xlabel('')
        ax.set_xticklabels(['genuine', 'fraudulent'], size = 16)
        for axis in ['top','bottom','left','right']:
            ax.spines[axis].set_linewidth(2)

        handles, labels = ax.get_legend_handles_labels()
        plt.legend(handles, ['Transfer', 'Cash out'], bbox_to_anchor=(1, 1), loc=2, borderaxespad=0, fontsize = 16);
    return ax
ax = plotStrip(Y[:limit], X.step[:limit], X.type[:limit])
ax.set_ylabel('time [hour]', size = 16)
ax.set_title('Striped vs. homogenous fingerprints of genuine and fraudulent transactions over time', size = 20);
ax = plotStrip(Y[:limit], X.amount[:limit], X.type[:limit], figsize = (14, 9))
ax.set_ylabel('amount', size = 16)
ax.set_title('Same-signed fingerprints of genuine and fraudulent transactions over amount', size = 18);
ax = plotStrip(Y[:limit], - X.errorBalanceDest[:limit], X.type[:limit], figsize = (14, 9))
ax.set_ylabel('- errorBalanceDest', size = 16)
ax.set_title('Opposite polarity fingerprints over the error in
destination account balances', size = 18);

3 建模求解

4 结论

  • 7
    点赞
  • 35
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
IEEE-CIS Fraud Detection is a Kaggle competition that challenges participants to detect fraudulent transactions using machine learning techniques. KNN (k-Nearest Neighbors) is one of the machine learning algorithms that can be used to solve this problem. KNN is a non-parametric algorithm that classifies new data points based on the majority class of their k-nearest neighbors in the training data. In the context of fraud detection, KNN can be used to classify transactions as either fraudulent or not based on the similarity of their features to those in the training data. To implement KNN for fraud detection, one can follow the following steps: 1. Preprocess the data: This involves cleaning and transforming the data into a format that the algorithm can work with. 2. Split the data: Split the data into training and testing sets. The training data is used to train the KNN model, and the testing data is used to evaluate its performance. 3. Choose the value of k: This is the number of neighbors to consider when classifying a new data point. The optimal value of k can be determined using cross-validation. 4. Train the model: Train the KNN model on the training data. 5. Test the model: Test the performance of the model on the testing data. 6. Tune the model: Fine-tune the model by changing the hyperparameters such as the distance metric used or the weighting function. Overall, KNN can be a useful algorithm for fraud detection, but its performance depends heavily on the quality of the data and the choice of hyperparameters.
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值