[Work] Amazon Fraud Detection

AryaDP

Last modified 2022-08-12 15:32:46

Tags: graph search algorithms

First published 2022-08-11 11:18:17

Original link: https://aws.amazon.com/blogs/machine-learning/detecting-fraud-in-heterogeneous-networks-using-amazon-sagemaker-and-deep-graph-library/

Amazon real-time fraud detection:

AWS Lambda

AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without …

Amazon S3
Amazon S3 is cloud object storage with industry-leading scalability, data availability, security, and performance. S3 is ideal for data lakes, …

Amazon SageMaker
Machine learning platform

Amazon Neptune
Graph database

Data preprocessing and feature engineering

In this section, we show how to preprocess the sample dataset to determine the relationships between nodes in a heterogeneous graph.

Dataset

In this use case, we benchmark the modeling method using the IEEE-CIS fraud dataset. This is an anonymized dataset containing up to 500,000 transactions between users. The dataset contains two main tables:

Transactions table: contains information about transactions or interactions between users.

- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAMT: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
- addr1, addr2: both addresses are for the purchaser; addr1 is the billing region and addr2 is the billing country.
- dist: distances between (not limited to) billing address, mailing address, zip code, IP address, phone area, etc.
- P_emaildomain, R_emaildomain: purchaser and recipient email domains. Certain transactions don't need a recipient, so R_emaildomain is null.
- C1-C14: counts, such as how many addresses are found to be associated with the payment card. The actual meaning is masked. They could be counts of phone numbers, email addresses, names associated with the user, device, ipaddr, billingaddr, etc., for both purchaser and recipient.
- D1-D15: timedelta, such as days since the previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta-engineered rich features, including ranking, counting, and other entity relations. For example, how many times the payment card associated with an IP and email or address appeared in a 24-hour time range. All Vesta features were derived as numerical. Some of them are counts of orders within a cluster, a time period, or a condition, so the value is finite and has an ordering (or ranking).

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain, R_emaildomain
- M1 - M9
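Categorical columns such as ProductCD have to be turned into numbers before they can serve as node features. The source does not say which encoding is used, so the following is only a minimal sketch of one common option, one-hot encoding; the helper `one_hot` and the sample values are hypothetical.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): the sorted distinct categories and,
    for each input value, a 0/1 vector with a single 1 at its category.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical ProductCD values for four transactions (illustration only).
categories, rows = one_hot(["W", "C", "W", "H"])
print(categories)
for row in rows:
    print(row)
```

High-cardinality columns such as card1 would produce very wide vectors under this scheme, which is one reason a pipeline might treat them as graph entities instead of features.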

train_transaction.csv: 553 features (399 Vesta-engineered features)

Identity table: contains the access logs, device, and network information of the specific user executing the transaction.
Variables in this table are identity information: network connection information (IP, ISP, proxy, etc.) and digital signatures (UA/browser/OS/version, etc.) associated with transactions.
They are collected by Vesta's fraud protection system and digital security partners.

Categorical Features:

DeviceType
DeviceInfo
id_12 - id_38

“id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C. I hope you could get basic meaning of these features, and by mentioning them as numerical/categorical, you won’t deal with them inappropriately.”

Snapshot of test_identity.csv

We can use subsets of these transactions and their labels as the supervision signal during model training. For transactions in the test dataset, the labels are masked during training.

The task of the model is clear: predict which of the masked transactions are fraudulent and which are legitimate.
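The masking described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the notebook's code: the transaction IDs, the labels, and the helper `split_labels` are all hypothetical.

```python
# Hypothetical toy labels: TransactionID -> 1 (fraud) or 0 (legitimate).
labels = {1001: 0, 1002: 1, 1003: 0, 1004: 0, 1005: 1}
# Hypothetical held-out IDs that play the role of the test dataset.
test_ids = {1004, 1005}


def split_labels(labels, test_ids):
    """Return (train_labels, masked_ids).

    Labels of test transactions are withheld from training; only their
    IDs are kept so that predictions can be evaluated afterwards.
    """
    train_labels = {tid: y for tid, y in labels.items() if tid not in test_ids}
    masked_ids = sorted(tid for tid in labels if tid in test_ids)
    return train_labels, masked_ids


train_labels, masked_ids = split_labels(labels, test_ids)
print(train_labels)
print(masked_ids)
```

The masked transactions still remain in the graph as nodes, so their features and connections can inform the model; only their labels are hidden.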

Loading pre-processed data from S3

The dataset used in this solution is the IEEE-CIS Fraud Detection dataset, a typical example of the financial transaction datasets that many companies have. The dataset consists of two tables:

Transactions: Records transactions between two users and their metadata. Example columns include the product code for the transaction and features of the card used for the transaction.
Identity: Contains information about the identity of the users performing transactions. Example columns include the device type and device IDs used.

This notebook assumes that the two data tables have already been pre-processed, mimicking the first-time data preparation.

The current version uses the pre-processed data in nearly raw format, including all relation files, a feature file, a tag file, and a test index file.

from os import path
from sagemaker.s3 import S3Downloader

# train_data is the S3 URI prefix of the processed data, set earlier in the notebook
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

Output:
===== Processed Files =====
s3://graph-fraud-detection/dgl/processed-data/features.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceInfo_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceType_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_P_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_ProductCD_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_R_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_TransactionID_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card3_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card4_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card5_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card6_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_01_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_02_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_03_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_04_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_05_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_06_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_07_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_08_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_09_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_10_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_11_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_12_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_13_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_14_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_15_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_16_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_17_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_18_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_19_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_20_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_21_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_22_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_23_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_24_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_25_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_26_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_27_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_28_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_29_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_30_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_31_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_32_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_33_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_34_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_35_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_36_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_37_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_38_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/tags.csv
s3://graph-fraud-detection/dgl/processed-data/test.csv

Each relation edge list file represents a different type of edge used to construct the heterogeneous graph during training. features.csv contains the final transformed features of the transaction nodes, while tags.csv contains the node labels used as the training supervision signal. test.csv contains the TransactionID values that form the test set used to evaluate the model. The labels of these test nodes are masked during training so that they cannot influence the model's predictions.
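To make the role of the edge list files concrete, here is a minimal sketch of how they could be grouped into typed edges, one edge type per file. The file contents, the two-column header layout (TransactionID, then the entity column), and the helper `load_typed_edges` are assumptions for illustration; the real files may be laid out differently.

```python
import csv
import io
import re

# Hypothetical in-memory stand-ins for two edge list files.
files = {
    "relation_card1_edgelist.csv": "TransactionID,card1\n1001,4497\n1002,4497\n1003,2803\n",
    "relation_DeviceType_edgelist.csv": "TransactionID,DeviceType\n1001,mobile\n1003,desktop\n",
    "features.csv": "ignored",
}


def load_typed_edges(files):
    """Map each relation name to a list of (transaction_id, entity_value) pairs."""
    edges = {}
    for name, text in files.items():
        match = re.fullmatch(r"relation_(.+)_edgelist\.csv", name)
        if match is None:
            continue  # not an edge list file, e.g. features.csv
        relation = match.group(1)
        reader = csv.reader(io.StringIO(text))
        next(reader)  # skip the header row
        edges[relation] = [(int(row[0]), row[1]) for row in reader]
    return edges


edges = load_typed_edges(files)
for relation, pairs in edges.items():
    print(relation, pairs)
```

In this toy data, transactions 1001 and 1002 share the same card1 value, so they become two-hop neighbors through that card node. That kind of shared-entity link is what the graph model can exploit.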

Train a graph neural network with DGL

We can model the fraud detection problem as a node classification task. The goal of the graph neural network is to learn how to use information from the topology of the sub-graph around each transaction node to transform the node's features into a representation space where the node can easily be classified as fraudulent or not.

Specifically, we use a relational graph convolutional network (R-GCN) on a heterogeneous graph, since we have nodes and edges of different types.
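The core idea of an R-GCN layer is that each relation type has its own weight matrix, and a node's new representation combines its own features with the averaged, relation-specific messages from its neighbors. The following pure-Python sketch illustrates one such update on a toy graph. It is not the DGL implementation used in the solution, and the node names, features, and weights are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_layer(h, neighbors, W_rel, W_self):
    """One R-GCN-style update with per-relation mean aggregation and ReLU.

    h:         node -> feature vector
    neighbors: relation -> {node -> list of neighbor nodes}
    W_rel:     relation -> weight matrix for that relation
    W_self:    weight matrix for the self-connection
    """
    out = {}
    for node, feat in h.items():
        acc = matvec(W_self, feat)
        for rel, adj in neighbors.items():
            nbrs = adj.get(node, [])
            for nbr in nbrs:
                msg = matvec(W_rel[rel], h[nbr])
                # Normalize by the number of neighbors under this relation.
                acc = [a + m / len(nbrs) for a, m in zip(acc, msg)]
        out[node] = [max(0.0, a) for a in acc]  # ReLU
    return out


# Toy graph: two transactions (t1, t2) that share one card node (c1).
h = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "c1": [1.0, 1.0]}
neighbors = {
    "card1": {"t1": ["c1"], "t2": ["c1"]},   # messages from card to transaction
    "card1_rev": {"c1": ["t1", "t2"]},       # messages from transaction to card
}
identity = [[1.0, 0.0], [0.0, 1.0]]
W_rel = {"card1": [[0.5, 0.0], [0.0, 0.5]], "card1_rev": identity}

out = rgcn_layer(h, neighbors, W_rel, identity)
print(out)
```

Stacking several such layers lets information travel further: after two layers, t1 has received information from t2 through the shared card node.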

import argparse
import os


def parse_args():
    parser = argparse.ArgumentParser()

    # SageMaker sets the SM_* environment variables inside the training container
    parser.add_argument('--training-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--output-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--nodes', type=str, default='features.csv')
    parser.add_argument('--target-ntype', type=str, default='TransactionID')
    parser.add_argument('--edges', type=str, default='homogeneous_edgelist.csv')
    parser.add_argument('--labels', type=str, default='tags.csv')
    parser.add_argument('--new-accounts', type=str, default='test.csv')
    parser.add_argument('--compute-metrics', type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
                        default=True, help='compute evaluation metrics after training')
    parser.add_argument('--threshold', type=float, default=0, help='threshold for making predictions, default : argmax')
    parser.add_argument('--num-gpus', type=int, default=0)
    parser.add_argument('--optimizer', type=str, default='adam')
    parser.add_argument('--lr', type=float, default=1e-2)
    parser.add_argument('--n-epochs', type=int, default=100)
    parser.add_argument('--n-hidden', type=int, default=16, help='number of hidden units')
    parser.add_argument('--n-layers', type=int, default=3, help='number of hidden layers')
    parser.add_argument('--weight-decay', type=float, default=5e-4, help='Weight for L2 loss')
    parser.add_argument('--dropout', type=float, default=0.2, help='dropout probability, for gat only features')
    parser.add_argument('--embedding-size', type=int, default=360, help="embedding size for node embedding")

    # The excerpt ends without a return; parse_args() is assumed here
    return parser.parse_args()

nodes is the name of the file that contains the node_ids of the target nodes and the node features.
edges is a regular expression that, when expanded, lists all the filenames of the edge lists.
labels is the name of the file that contains the target node_ids and their labels.
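As an illustration of how an edges pattern could expand into a list of edge list files, here is a small sketch. It assumes a shell-style wildcard such as `relation*` instead of a full regular expression, and the helper `expand_edges` and the directory listing are hypothetical; the training script may expand the pattern differently.

```python
from fnmatch import fnmatch

# Hypothetical directory listing, using file names of the kind shown earlier.
available = [
    "features.csv",
    "relation_card1_edgelist.csv",
    "relation_addr1_edgelist.csv",
    "tags.csv",
    "test.csv",
]


def expand_edges(pattern, filenames):
    """Return the sorted file names that match a shell-style wildcard pattern."""
    return sorted(name for name in filenames if fnmatch(name, pattern))


edge_files = expand_edges("relation*", available)
print(edge_files)
```

A pattern like this keeps the hyperparameter short even when there are dozens of relation files, as in the S3 listing above.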