qiuzitao机器学习（六）：信用卡欺诈检测项目

最新推荐文章于 2023-06-09 10:00:01 发布

qiuzitao

最新推荐文章于 2023-06-09 10:00:01 发布

阅读量2.6k

点赞数 7

分类专栏：机器学习系列文章标签：数据挖掘机器学习数据分析

本文链接：https://blog.csdn.net/qiuzitao/article/details/105052285

版权

机器学习系列专栏收录该内容

8 篇文章 4 订阅

订阅专栏

机器学习实战–信用卡欺诈检测项目

学校大三校企合作的课程设计项目

一、任务基础

拿到的信用卡数据集是由欧洲人于2013年9月使用信用卡进行交易的数据。此数据集显示两天内发生的交易，其中284807笔交易中有492笔被盗刷。特征’Class’是响应变量，如果发生被盗刷，则取值1，否则为0。

项目的目的是完成数据集中正常交易数据和异常交易数据的分类，并对测试数据进行预测。

数据集链接：https://pan.baidu.com/s/1Gt7F9pszGNX_pm_75YSO8w
提取码：9tp6

二、数据分析与挖掘处理

导入一些库后，先读取数据再查看分析数据

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False
data = pd.read_csv('creditcard.csv')
count_classes = pd.value_counts(data['Class'], sort=True).sort_index()
count_classes.plot(kind='bar',fc = 'r') 
plt.title("class 类别统计")
plt.xlabel("Class")
plt.ylabel("个数")
print(count_classes)
x = 0,1
for x, count_classes in zip(x, count_classes):
    plt.text(x, count_classes , '%.2f' % count_classes, ha='center', va='bottom')

柱状图看出1和0数据数量差别很大
因为Amount这列的数据浮动太大，在做机器学习的过程中，需要保证特征值差异不能过大，于是我对Amount进行标准化数据。

# 预处理  标准化数据
from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()

在这里插入图片描述
特征V1，V2，… V28是使用PCA降维获得的特征数据，没有用PCA转换的唯一特征是“Class”和“Amount”。特征’Time’包含数据集中每个刷卡时间和第一次刷卡时间之间经过的秒数，对于此次项目用处不大所以去掉。

对于数据不平衡有两种方法处理：下采样和过采样。

下采样：一组数据，比如我们这个项目的数据，它的y-label（要预测的）中，0的数据多，1的数据少，我们把多的数据通过随机取到和少的一样多。

过采样：比如我们这个项目的数据，对1的数据进行生成数列，让生成的数据与0的数据一样多。通常有SMOTE算法实现。

我们这里用下采样试试效果。

#导入使用标准化的包
from  sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('creditcard.csv')

#reshape(行，列)，指定列数，那么行数可以用-1替代，计算机会帮我们又多少行，反之亦然。
# -1  正整数通配符， 它代表任何整数。
reshape = data["Amount"].values.reshape(-1, 1)
# print(reshape)
# print(type(reshape))

#把数据按标准正太分布的方式转换均值为0，方差为1的数据
data['normAmount'] = StandardScaler().fit_transform(reshape)

#drop() 删除数据
#参数1，要删除的特征列
#参数2 ，axis，指定删除行还是列， 0 代表行， 1 代表列
data = data.drop(['Time', 'Amount'],  axis=1)

# print(data.head(8))

#x拿到非Class的特征数据
#y拿到Class的特征数据
X = data.loc[:, data.columns != "Class"]
Y = data.loc[:, data.columns == "Class"]

# 计算少数类别的样本有多少
number_records_fraud = len(data[data.Class == 1])
#求出少数类别样本对应的索引值。
fraud_indices = np.array(data[data.Class == 1].index)
#求出多数类别样本对应的索引值
normal_indices = np.array(data[data.Class == 0].index)
# print(number_records_fraud)
# print(fraud_indices)
# print(normal_indices)

#从多数类别的样本索引normal_indices中随机筛选number_records_fraud个
#参数1 a  样本集
#参数2  选择多少
#参数3 replace 是否替换原来的数据
random_normal_indices = np.random.choice(normal_indices,
                                         size=number_records_fraud,
                                         replace=False)
# print(len(random_normal_indices))   #492

#把random_normal_indices 和 fraud_indices混合起来作为后面undersampled的学习样本
#a_tuple 元组， 要整合的所有的数据
# 参数2 axis 整合的方向，
under_sample_indices = np.concatenate((random_normal_indices, fraud_indices),
                                      axis=0)
# print(len(under_sample_indices))    #984
#iloc根据索引得到数据。
under_sample_data = data.iloc[under_sample_indices, :]

# print(under_sample_data.head(8))
# print(len(under_sample_data))       #984

#用X_undersampled从undersample样本数据中拿到非Class特征
X_undersampled = under_sample_data.iloc[:,data.columns != "Class"]
# print(X_undersampled.head())
# 用Y_undersampled从undersample样本中拿到class特性
Y_undersampled = under_sample_data.iloc[:, data.columns == "Class"]


# 汇总
print("Undersampled中所有样本的个数", len(under_sample_data))
print("Undersampled中正常样本的个数", len(random_normal_indices))
print("Undersampled中正常样本占的比例", len(random_normal_indices)/len(under_sample_data))
print("Undersampled中欺诈样本的个数",len(fraud_indices))
print("Undersampled中欺诈样本占的比例",len(fraud_indices)/len(under_sample_data))

在这里插入图片描述
我们可以看到下采样后1和0的数据数量比例一样都是0.5。

下面对数据进行切分，用来训练和测试，这里用sklearn库模型的train_test_split，顺便把没有经过下采样的原始数据也切分。

#----------------------UnderSample数据进行划分------------------------------
from  sklearn.model_selection import train_test_split
#参数1   所要划分的样本的特征集
#参数2    所要划分的样本的结果集
#参数3    test_size 测试集占所有样本的比例
#参数4    random_state 划分结果是否随机
#           0   每次划分的结果都不同
#           1   每次划分的结果都相同

#返回值1  特征的训练集 +验证集
#返回值2  特征的测试集
#返回值3   结果的测试集 + 验证集
#返回值4   结果的测试集

X_train_undersample,X_test_undersamples,Y_train_undersample,Y_test_undersample=train_test_split(X_undersampled,Y_undersampled,test_size=0.3,random_state=0)
# print(X_train_undersample)
print("对UnderSample数据进行划分，训练集+验证集数据长度是 {}".format(len(Y_train_undersample))) #688
print("对UnderSample数据进行划分，测试集数据长度是 {}".format(len(Y_test_undersample))) #296
print("总共的数量是", len(Y_test_undersample) + len(Y_train_undersample))

#----------------对原始数据进行划分------------------
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

print('对原始数据进行划分，训练集+ 验证集数据长度',len(Y_train))
print('对原始数据进行划分，测试集数据长度',len(Y_test))
print('总共的数量是', len(Y_train) + len(Y_test))

在这里插入图片描述

三、选模型进行训练和预测

一般二分类项目都会先用逻辑斯特回归试试水，这里也不例外，不过我用了交叉验证找出LR的最佳参数，这里用recall召回率（查全率）评价。
LR的激活函数

from sklearn.linear_model import LogisticRegression  # 使用逻辑斯特回归模型
from sklearn.model_selection import KFold, cross_val_score  # fold:折叠 KFold 表示切分成几分数据进行交叉验证
from sklearn.metrics import confusion_matrix, recall_score, classification_report
def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
    c_param_range = [0.01,0.1,1,10,100]
    result_table = pd.DataFrame(index=range(len(c_param_range),2),columns=['C_parameter','Mean recall score'])
    result_table['C_parameter'] = c_param_range
     
    j=0  # 循环找到最好的惩罚力度
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:',c_param)
        print('-------------------------------------------')
        print('')
         
        recall_accs = []
        for iteration,indices in enumerate(fold.split(x_train_data)):
           
            # 使用特定的C参数调用逻辑回归模型
            # 参数 solver=’liblinear’ 消除警告
            # 出现警告：模型未能收敛 ，请增加收敛次数
            #  增加参数 max_iter 默认1000
            lr = LogisticRegression(C = c_param, penalty='l1', solver='liblinear',max_iter=10000)
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
             
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
      
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ',iteration,': recall score = ',recall_acc)
             
        result_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ',np.mean(recall_accs))
        print('')      
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter',best_c)
    print('*********************************************************************************')
     
    return best_c
best_c = printing_Kfold_scores(X_train_undersample,Y_train_undersample)

from sklearn.metrics import classification_report
lr = LogisticRegression(C = best_c, penalty='l1', solver='liblinear',max_iter=10000)
lr.fit(X_train_undersample, Y_train_undersample)
lr_y_pred = lr.predict(X_test_undersamples)
# 输出逻辑斯特在测试集上的分类准确性，以及更加详细的精确率、召回率、F1指标。
print('The accuracy of lr classifier is', lr.score(X_test_undersamples, Y_test_undersample))
print(classification_report(lr_y_pred, Y_test_undersample))

在这里插入图片描述
下面我用没有进行下采样的原始数据也用相同模型训练看看准确率。

#原始数据
lr.fit(X_train, Y_train)
lr_y_pred = lr.predict(X_test)
# 输出逻辑斯特在测试集上的分类准确性，以及更加详细的精确率、召回率、F1指标。
print('The accuracy of lr classifier is', lr.score(X_test, Y_test))
print(classification_report(lr_y_pred, Y_test))

在这里插入图片描述
结果发现原始数据准确率更高，简单分析发现应该是采用的下采样数据集，数据利用率太低，但是实战的话数据不平衡对刚开始的建模结果不是很好。

四、总结

这次项目对我帮助较多的是：① 数据不平衡处理 ② 交叉验证取最佳参数

对项目我也再参考了网上其他人的做法，发现挺多不错的文章。其中“|旧市拾荒|”的博客我学到挺多，也发现授课老师也是参考ta的文章教。

这是他的文章链接：https://www.cnblogs.com/xiaoyh/p/11194053.html

qiuzitao

关注

7
点赞
踩
50

收藏

觉得还不错? 一键收藏
打赏
1
评论
qiuzitao机器学习（六）：信用卡欺诈检测项目

机器学习实战–信用卡欺诈检测项目学校大三校企合作的课程设计项目一、任务基础拿到的信用卡数据集是由欧洲人于2013年9月使用信用卡进行交易的数据。此数据集显示两天内发生的交易，其中284807笔交易中有492笔被盗刷。特征’Class’是响应变量，如果发生被盗刷，则取值1，否则为0。项目的目的是完成数据集中正常交易数据和异常交易数据的分类，并对测试数据进行预测。数据集链接：https://...
复制链接

扫一扫