Credit Card Fraud Detection

sysu63

Category: Machine Learning. Tags: undersampling, oversampling, imbalanced data, credit card fraud

Article link: https://blog.csdn.net/sysu63/article/details/80182106

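Project Background

Identify fraudulent credit card transactions in data provided by a bank.

Data Overview

The raw data are individual transaction records. To protect privacy, the original features have been put through a PCA-like transformation, so the features are already extracted and we do not need to do feature extraction ourselves. What remains is to build a model that detects fraud as well as possible.
Data size: 284,807 records in total, so computational complexity matters when choosing an algorithm later.
Features: V1~V28 are the PCA outputs and have already been normalized; Amount and Time need further processing.
Data quality: no missing values.
Rule of thumb: a time field is usually best converted into month, day, and hour; a raw count of seconds is often uninformative on its own.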
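
As a minimal sketch of that rule of thumb, assuming Time holds the seconds elapsed since the first transaction in the file (an assumption; the post does not say so), the raw seconds can be folded into an hour-of-day bucket. The function name seconds_to_hour and the sample values are illustrative only.

```python
import numpy as np

def seconds_to_hour(seconds):
    """Map elapsed seconds to an hour-of-day bucket in 0..23."""
    seconds = np.asarray(seconds)
    # Whole hours elapsed, wrapped around a 24-hour day
    return (seconds // 3600) % 24

# Illustrative values, not taken from the dataset
sample_seconds = np.array([0, 3599, 3600, 86399, 86400, 90000])
hours = seconds_to_hour(sample_seconds)
print(hours)
```

Because the clock time of the first transaction is unknown under this assumption, the buckets are relative to the start of the recording rather than to midnight.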

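Data Processing

Loading the Data

# Embed plots in the notebook (IPython magic; only works in Jupyter/IPython)
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('creditcard.csv')
# Class column: 0 = normal transaction, 1 = fraudulent transaction
data.head()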

[Figure: output of data.head()]
%matplotlib inline embeds the plots in the notebook.

Processing the Amount and Time Columns

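# StandardScaler from sklearn's preprocessing module standardizes a feature
from sklearn.preprocessing import StandardScaler
# fit_transform() fits the scaler and transforms the column in one step.
# StandardScaler expects a 2-D array, hence .values.reshape(-1, 1);
# ravel() flattens the result so it can be stored as a new column of data
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1)).ravel()
data = data.drop(['Time', 'Amount'], axis=1)
data.head()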

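The table above shows that the other columns take values on a small, comparable scale, while Amount varies over a much wider range, so Amount is standardized. The Time column is treated here as uninformative for fraud prediction, so it is dropped.
Note: data.drop(['Time', 'Amount'], axis=1) does not modify data in place, so the result must be assigned back to data.
In data['Amount'].values.reshape(-1, 1), the shape (-1, 1) reshapes the array to X rows by 1 column, where X is inferred from the number of elements in the original array.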
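
The standardization step can also be written out by hand. The sketch below uses made-up amounts and mirrors what StandardScaler computes with its default settings: subtract the mean, then divide by the population standard deviation. The variable names are illustrative.

```python
import numpy as np

# Illustrative amounts, not taken from the dataset
amounts = np.array([1.0, 10.0, 100.0, 1000.0])

# reshape(-1, 1): one column, row count inferred from the data
column = amounts.reshape(-1, 1)

# Standardize: subtract the mean, divide by the population standard deviation
norm_amount = (column - column.mean(axis=0)) / column.std(axis=0)

print(column.shape)
print(norm_amount.ravel())
```

After this transformation the column has mean 0 and standard deviation 1, which puts Amount on a scale comparable to the other features.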

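With the data prepared, some simple descriptive statistics could be computed; they are omitted here. What matters is that the two classes in this dataset are extremely imbalanced: there are 284,315 normal transactions (Class 0) but only 492 fraudulent ones (Class 1).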
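
A short calculation with the counts quoted above shows why this imbalance matters and why recall is a more useful measure than accuracy here. The "always predict normal" classifier is a hypothetical baseline introduced only for illustration.

```python
n_normal = 284315
n_fraud = 492
n_total = n_normal + n_fraud

fraud_share = n_fraud / n_total
# A classifier that always predicts "normal" gets every normal sample right
# and every fraud sample wrong
baseline_accuracy = n_normal / n_total
baseline_recall = 0 / n_fraud

print(f"fraud share: {fraud_share:.4%}")
print(f"always-normal accuracy: {baseline_accuracy:.4%}, recall: {baseline_recall}")
```

Such a baseline is almost always "right" yet never catches a single fraud, so accuracy alone says very little on this dataset.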

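Two resampling approaches are used here: undersampling and oversampling.

Undersampling

Undersampling randomly selects, from the majority class, as many samples as there are in the minority class. In this case, 492 samples are drawn at random from those labeled 0 and combined with the 492 samples labeled 1 to form a new dataset.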

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
# Number of fraud records (Class == 1)
number_records_fraud = len(data[data.Class == 1])
# Index labels of the fraud records
fraud_indices = np.array(data[data.Class == 1].index)
# Index labels of the normal records (Class == 0)
normal_indices = data[data.Class == 0].index
# Make the two classes equally small:
# np.random.choice(population, size, replace=False) samples without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
# Concatenate the fraud indices with the randomly chosen normal indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
# Undersampled dataset: the rows with those index labels
under_sample_data = data.loc[under_sample_indices, :]
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
# Show the class ratio
print('Proportion of normal transactions: ', len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Proportion of fraud transactions: ', len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Total number of transactions in resampled data: ', len(under_sample_data))

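np.random.choice draws number_records_fraud indices at random from the indices of the normal samples (normal_indices).
np.concatenate merges the fraud indices and the sampled normal indices into a single array (under_sample_indices).
These indices are then used to pull the features (X_undersample) and labels (y_undersample) out of the original data.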
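
The same selection can be sketched more compactly with pandas' own sampling. This is an alternative to the index-based code above rather than the author's method, and it runs on a toy DataFrame; toy, normal_sample and toy_under are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the real data: 95 normal rows, 5 fraud rows
toy = pd.DataFrame({
    'V1': range(100),
    'Class': [0] * 95 + [1] * 5,
})

fraud = toy[toy.Class == 1]
# Draw as many normal rows as there are fraud rows, without replacement
normal_sample = toy[toy.Class == 0].sample(n=len(fraud), random_state=0)

# Stack the two groups into one balanced dataset
toy_under = pd.concat([fraud, normal_sample])
print(toy_under.Class.value_counts())
```

Passing random_state makes the draw reproducible, which the np.random.choice version would need np.random.seed for.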

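# train_test_split splits the data (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# Whole dataset: 70% training set, 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Undersampled dataset, split the same way; the code below relies on these names
X_train_under_sample, X_test_under_sample, Y_train_under_sample, Y_test_under_sample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)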

Next, K-fold cross-validation is used to find the best parameter.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, recall_score, classification_report

def printing_Kfold_scores(X_train_data, Y_train_data):
    # 5 consecutive folds, no shuffling
    fold = KFold(n_splits=5, shuffle=False)
    print(fold)
    # Candidate values for the inverse regularization strength C
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_Parameter', 'Mean recall score'])
    results_table['C_Parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        print('c_param:', c_param)
        recall_accs = []
        # fold.split yields (training indices, validation indices) for each fold
        for iteration, indices in enumerate(fold.split(X_train_data), start=1):
            # The L1 penalty needs a solver that supports it, such as liblinear
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(X_train_data.iloc[indices[0], :], Y_train_data.iloc[indices[0], :].values.ravel())
            Y_pred_undersample = lr.predict(X_train_data.iloc[indices[1], :])

            recall_acc = recall_score(Y_train_data.iloc[indices[1], :].values, Y_pred_undersample)
            recall_accs.append(recall_acc)

            print('Iteration:', iteration, 'recall_acc:', recall_acc)

        print('Mean recall score', np.mean(recall_accs))
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1

    # The score column has object dtype, so cast to float before idxmax
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax(), 'C_Parameter']
    print('best_c is :', best_c)
    return best_c

best_c = printing_Kfold_scores(X_train_under_sample, Y_train_under_sample)

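Logistic regression is used here with L1 regularization to limit overfitting, and K-fold cross-validation selects the best value of the parameter C.
Once the best parameter is found, the model is trained and then evaluated on the test data.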
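
For comparison, the same kind of search over C can be sketched with scikit-learn's GridSearchCV, scored by recall. This is an alternative to the hand-written loop, not the author's code. It runs on synthetic data from make_classification, and with cv=5 a classifier gets stratified folds, which differs from the unshuffled KFold used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic balanced data standing in for the undersampled training set
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    # liblinear is one of the solvers that supports the L1 penalty
    LogisticRegression(penalty='l1', solver='liblinear'),
    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
    scoring='recall',   # rank the candidates by mean recall across folds
    cv=5,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ is the model refit on all of the data passed to fit with the selected C.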

# L1 penalty again, so use a solver that supports it
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_under_sample, Y_train_under_sample.values.ravel())
Y_pred_undersample = lr.predict(X_test_under_sample)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test_under_sample, Y_pred_undersample)

np.set_printoptions(precision=2)

# Recall = TP / (FN + TP), read from the row of true fraud samples
print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
# (plot_confusion_matrix is a plotting helper that this post does not define)
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
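
The plot_confusion_matrix function called above is not defined anywhere in this post. A minimal sketch of what such a helper might look like follows; the signature is inferred from the call, the body is an assumption rather than the author's original, and the counts are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """Draw a confusion matrix as a heat map with the count in each cell."""
    cm = np.asarray(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    # Write each count in its cell, in a color that contrasts with the background
    thresh = cm.max() / 2.0
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, str(cm[i, j]), ha='center',
                     color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Made-up counts for illustration only
toy_cm = np.array([[90, 10], [3, 47]])
plt.figure()
plot_confusion_matrix(toy_cm, classes=[0, 1])
```

Rows are true labels and columns are predictions, matching the layout returned by sklearn's confusion_matrix.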

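[Figure: confusion matrix on the undersampled test set]
On the undersampled test set the result looks good, but the model still has to be applied to the original dataset to see how it performs there.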

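At first glance the recall looks high, but the confusion matrix shows that 9,729 normal samples were misclassified as fraud, which is still a high cost.
We therefore changed the settings to use L2 regularization and again determined the best value of C, which gave the following result:
[Figure: confusion matrix with L2 regularization]

Recall drops somewhat, but the number of misclassified samples on the original dataset falls from 9,729 to 3,982, which is an improvement.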
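
The trade-off described here can be read straight off a confusion matrix. The sketch below uses made-up counts, not the results from this post: recall only looks at the row of true fraud samples, while the false alarms show up in precision. The function name recall_and_precision is illustrative.

```python
import numpy as np

def recall_and_precision(cm):
    """cm follows sklearn's layout: rows are true labels, columns are predictions."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    recall = tp / (tp + fn)        # share of real fraud that was caught
    precision = tp / (tp + fp)     # share of fraud alarms that were real
    return recall, precision

# Made-up counts for illustration only
toy_cm = np.array([[9000, 1000],
                   [10, 90]])
recall, precision = recall_and_precision(toy_cm)
print(f"recall={recall:.3f}, precision={precision:.3f}")
```

In this toy matrix 90 of 100 frauds are caught, yet fewer than one alarm in ten is a real fraud, which is the kind of cost the misclassified normal samples represent.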

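Oversampling (SMOTE)

Oversampling generates additional samples for the class that has fewer of them. In this dataset that means generating more samples labeled 1. The most commonly used method is SMOTE:
1. For each sample in the minority class, find its k nearest neighbors within that class (k is configurable).
2. For each sample, randomly pick as many of those neighbors as the required growth of the minority class demands.
For example, to grow the minority class by 200%, from 100 samples to 300, pick two of the k neighbors at random for each sample.
3. For a sample a and one of its chosen neighbors b:
3.1. Compute the difference between the two in each feature.
3.2. Multiply that difference by a random number in (0, 1) and add it to the feature values of the current sample; the result is a new synthetic sample.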
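
The steps above can be sketched in plain NumPy. This is a simplified illustration of the interpolation idea on made-up 2-D points, not the imblearn implementation; smote_sketch and its parameters are hypothetical names.

```python
import numpy as np

def smote_sketch(minority, n_new_per_sample=2, k=3, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for i, a in enumerate(minority):
        # Step 1: k nearest neighbors of a within the minority class
        dists = np.linalg.norm(minority - a, axis=1)
        dists[i] = np.inf                      # exclude the sample itself
        neighbors = np.argsort(dists)[:k]
        # Step 2: randomly pick the neighbors to interpolate toward
        chosen = rng.choice(neighbors, size=n_new_per_sample, replace=False)
        for j in chosen:
            b = minority[j]
            # Step 3: a + gap * (b - a), with gap drawn uniformly from [0, 1)
            gap = rng.random()
            synthetic.append(a + gap * (b - a))
    return np.array(synthetic)

# Made-up 2-D points standing in for minority-class samples
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_points = smote_sketch(points, n_new_per_sample=2, k=3)
print(new_points.shape)
```

Each synthetic point lies on the line segment between a real sample and one of its neighbors, so the new samples stay inside the region already occupied by the minority class.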

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Select the features by name: every column except the label.
# (Dropping the last column by position only works if Class is the last column.)
features = data.drop('Class', axis=1)

labels = data['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            test_size=0.3,
                                                                            random_state=0)

# Oversample the training set only; fit_resample replaces the older fit_sample
oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)

With the resampled data, train the model and make predictions:

best_c = printing_Kfold_scores(os_features, os_labels)
lr = LogisticRegression(C=best_c, penalty='l2')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)

# Recall = TP / (FN + TP), read from the row of true fraud samples
print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
# (plot_confusion_matrix is a plotting helper that this post does not define)
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

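The result:
[Figure: confusion matrix on the test set after SMOTE oversampling]
Oversampling gives a striking result: recall reaches 1 and only 13 normal samples are misclassified as fraud, which makes the model far more usable. A recall of exactly 1 is unusually high for this kind of problem, so it is worth confirming that the Class column is not among the features before relying on this figure.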
