Real-Time Fraud Detection with Machine Learning

Unlike our parents and grandparents, we live and breathe in the digital world. At first it was discussions on online forums, then chats and emails, and now much of our lives and most of our financial transactions are carried out digitally.

As the stakes get higher, it is no longer enough to detect fraud after the event. Imagine someone who has obtained a few confidential details of your bank account or credit card and is able to execute a fraudulent transaction. Banks and insurance companies need tools and techniques to detect fraud in real time and take appropriate action.

We humans lose the ability to interpret and visualise data once we move beyond three-dimensional space.

Today a financial transaction involves hundreds of parameters: transaction amount, past transaction trends, GPS location, transaction time, merchant name and so on. We need to consider many of these parameters together to detect an anomaly and flag fraud in real time.

The isolation forest algorithm implemented in Scikit-Learn can help identify fraud in real time and avoid financial loss. In this article, I will walk step by step through detecting a fraudulent transaction with machine learning.

Step 1: We import the packages we are going to use. We will use make_blobs to generate our test data and measure the accuracy of the fitted model with accuracy_score.

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.ensemble import IsolationForest

Step 2: In real life, we build the model on millions or billions of past transactions and hundreds of parameters. In this article, we will use one hundred samples and four features to illustrate the core concept and the process.

X, y = make_blobs(n_samples=[4, 96], centers=[[5, 3, 3, 10], [9, 3, 6, 11]], n_features=4, random_state=0, shuffle=True)

The array X holds the values of the four parameters for the hundred records, and y stores whether each record is a fraudulent or a normal transaction.
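As a quick sanity check (my own addition, assuming numpy, which ships with scikit-learn, is available), we can confirm the shape of X and the class balance in y:

import numpy as np

print(X.shape)         # (100, 4): one hundred records with four features each
print(np.bincount(y))  # [ 4 96]: four fraud records (label 0) and ninety-six normal ones (label 1)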

Step 3: We will use 300 base estimators (trees) in the ensemble and train each base estimator on 10 samples drawn from the dataset.

clf = IsolationForest(n_estimators=300, max_samples=10, random_state=0,
                      max_features=4, contamination=0.1).fit(X)

We also use all four feature values for the model (the max_features parameter). In real projects, feature engineering determines the importance of each parameter and establishes the list of features on which the model should be based. I will not discuss the details of feature engineering here; that is a topic for a separate article. The IsolationForest model is then fitted to the sample dataset.

We set the value of the contamination parameter based on the proportion of anomalies in the historical data and on how costly a missed anomaly is compared with a false alarm. Let's say the proportion of fraudulent transactions in the historical dataset is 0.05 (5%) and the stakes per transaction are very high. In such a scenario, we may want to set the contamination value between 0.25 and 0.35. Setting contamination to 5 to 7 times the anomaly proportion observed in the historical records helps ensure that hardly any rogue transaction is misclassified. Of course, a contamination value well above the true anomaly proportion also produces more false alarms. If the stakes are lower, we may accept missing a few fraudulent transactions and reduce false alarms by choosing a lower contamination value.
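As a rough illustration of that rule of thumb (a sketch of my own, not code from the original article), the contamination value can be derived from the historical fraud rate and a chosen safety factor; note that scikit-learn caps contamination at 0.5:

historical_fraud_rate = 0.05   # assumed: 5% of past transactions were fraudulent
safety_factor = 6              # somewhere between the 5x and 7x suggested above
contamination = min(historical_fraud_rate * safety_factor, 0.5)
print(contamination)           # 0.3, within the 0.25 to 0.35 range discussed above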

Step 4: In the code below, the fitted IsolationForest model predicts whether each transaction is fraudulent or normal. IsolationForest predicts an anomaly as "-1" and a normal transaction as "1". In our sample dataset, fraudulent transactions are coded as "0" and normal transactions as "1".

y_pred = clf.predict(X)
y_pred[y_pred == -1] = 0

To compare the model's predictions with the actual classification in the sample dataset, we relabel the predicted fraudulent transactions from "-1" to "0".

Step 5: Now that fraudulent transactions are labelled "0" in both the sample and the predicted set, we can measure the model's prediction accuracy directly with the accuracy_score function.

fraud_accuracy_prediction = round(accuracy_score(y, y_pred), 2)
print("The accuracy to detect fraud is {accuracy} %".format(accuracy=fraud_accuracy_prediction * 100))

The model identified fraudulent transactions with 93% accuracy. At first glance that may not look good enough, but remember that the stakes are high, so we are fine with a few false alarms (false positives). These false alarms sacrifice some prediction accuracy, but it is better to be ultra-safe than to miss a few fraudulent transactions.

Step 6: We will use the confusion matrix to look deeper into the predictions.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))

Out of the 100 transactions in the sample dataset, the model identified all four true fraudulent transactions.

The model labelled seven genuine transactions as fraud (false alarms) because of the contamination (safety factor) parameter of 0.1. We set the contamination value higher than the actual proportion of fraudulent transactions in the historical data because, when the stakes are high, it is better to be safe than sorry.
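To quantify that trade-off (a small sketch of my own, not from the original article), we can unpack the confusion matrix into the fraud recall and the number of false alarms:

cm = confusion_matrix(y, y_pred)   # rows follow label order [0, 1]: fraud first, then normal
tp_fraud, missed_fraud = cm[0]
false_alarms, tn_normal = cm[1]
fraud_recall = tp_fraud / (tp_fraud + missed_fraud)
print("Fraud recall: {:.0%}".format(fraud_recall))  # all four frauds are caught, so 100%
print("False alarms: {}".format(false_alarms))      # the seven genuine transactions flagged as fraud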

Step 7: We write a small function to detect in real time whether a new transaction is fraudulent. It feeds the parameter values of the new transaction into the trained model to determine the authenticity of the transaction.

def frauddetection(trans):
    transaction_type = clf.predict([trans])
    if transaction_type[0] < 0:
        print("Suspect fraud")
    else:
        print("Normal transaction")

Step 8: Various transaction parameters are collected at the time of a new transaction.

frauddetection([7,4,3,8])   
frauddetection([10,4,5,11])

The authenticity of the transaction is ascertained by calling the function defined earlier with the new transaction's parameters.
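If a hard fraud/normal label is too coarse for a real-time pipeline, IsolationForest also exposes decision_function, which returns a score that is negative for anomalies. The variant below is my own sketch, not part of the original article; frauddetection_with_score is a hypothetical helper that returns the score alongside the label:

def frauddetection_with_score(trans):
    score = clf.decision_function([trans])[0]   # negative scores indicate anomalies
    label = "Suspect fraud" if score < 0 else "Normal transaction"
    return label, score

print(frauddetection_with_score([7, 4, 3, 8]))
print(frauddetection_with_score([10, 4, 5, 11]))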

To explain the core concept, I have simplified a few things, such as the number of features per transaction, the number of historical transactions used to fit the model, and the feature engineering. We have seen how the isolation forest algorithm can help detect fraudulent transactions in real time.

If you would like to know how we can perform feature engineering with exploratory data analysis, read the article on Advanced Visualisation for Exploratory Data Analysis (EDA).

Translated from: https://towardsdatascience.com/real-time-fraud-detection-with-machine-learning-485fa502087e
