Credit Card Fraud Detection

Author: 三不达十

Last modified: 2023-09-17 23:25:57


First published: 2023-09-17 23:22:27

Article link: https://blog.csdn.net/qq_52481781/article/details/132941375

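Approach: in real transaction data, normal transactions far outnumber fraudulent ones. We build a logistic regression model, train it, and then evaluate it.

1. Load the data and take a quick look at it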

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

data = pd.read_csv('creditcard.csv')
data.head()

data.shape

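The shape is (284807, 31). To protect user privacy, the data we receive has already been preprocessed: V1-V28 are anonymized features that can be used directly, Amount is the transaction amount, and Class is the target column.

Goal: detect credit card fraud. Class is 0 for a normal transaction and 1 for a fraudulent one, and the task is to separate the 0s from the 1s, so this is a binary classification problem.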

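# Count how many rows fall into each value of the Class column
# (Series.value_counts replaces the deprecated top-level pd.value_counts)
count_classes = data['Class'].value_counts(sort=True).sort_index()
count_classes.plot(kind='bar')  # draw a bar chart directly with pandas
plt.title('Class')
plt.xlabel('class')
plt.ylabel('Frequency')

Output: a bar chart of the number of samples in each class.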

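2. Fraud cases are very rare, so the classes are imbalanced

Solutions:

Option 1: oversampling

Generate additional minority-class samples so that both classes have equally many samples.

Option 2: undersampling

Reduce the majority class so that both classes have equally few samples (code from Bilibili user buchiyu_脏果君).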

# Features (every column except Class) and the target column
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of data points in the minority (fraud) class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
# Index labels of the normal class
normal_indices = data[data.Class == 0].index
# Randomly pick as many normal records as there are fraud records, without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
# Combine the two sets of index labels
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
# Undersampled dataset; .loc is used because these are index labels, not positions
under_sample_data = data.loc[under_sample_indices, :]
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

print(len(under_sample_data))  # twice the number of fraud records
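Option 1 (oversampling) has no code above, so here is a minimal sketch of its simplest variant: duplicating fraud rows at random until the classes match. It runs on a small made-up DataFrame (`toy`, with a single feature `V1`) rather than on creditcard.csv, and the names `fraud_oversampled` and `over_sample_data` are illustrative. It does not synthesize new points the way SMOTE-style methods do.

```python
import numpy as np
import pandas as pd

# Toy stand-in for creditcard.csv: 95 normal rows (Class 0) and 5 fraud rows (Class 1)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'V1': rng.normal(size=100),
    'Class': [0] * 95 + [1] * 5,
})

normal = toy[toy.Class == 0]
fraud = toy[toy.Class == 1]

# Draw fraud rows with replacement until there are as many as normal rows
fraud_oversampled = fraud.sample(n=len(normal), replace=True, random_state=0)

# Combine the two classes and shuffle the rows
over_sample_data = pd.concat([normal, fraud_oversampled]).sample(frac=1, random_state=0)

print(over_sample_data.Class.value_counts())
```

In practice oversampling should be applied to the training split only; otherwise copies of the same fraud row can end up in both the training and the test set and inflate the test score.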

3. Standardize the Amount column

from sklearn.preprocessing import StandardScaler

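# reshape(-1, 1) produces a single column; -1 lets NumPy infer the number of rows
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)  # drop the raw Time and Amount columns
# X and y were created before this step, so rebuild them to use normAmount instead of the raw columns
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
data.head()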
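To make clear what StandardScaler does to Amount, the sketch below compares it with the formula (x - mean) / std computed by hand. The four amounts are made-up values, not rows from creditcard.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts, not taken from creditcard.csv
amount = np.array([1.0, 10.0, 100.0, 1000.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(amount)

# The same result by hand: subtract the mean, divide by the population standard deviation
manual = (amount - amount.mean()) / amount.std()

print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, which puts it on a scale comparable to the V1-V28 features.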

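4. Split the data into training and test sets

The training set is used to fit the model; the test set is used to evaluate it.

Split the dataset:

from sklearn.model_selection import train_test_split
# test_size is the fraction of rows held out for testing; random_state fixes the shuffle seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)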
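With so few fraud cases, a purely random split can leave the test set with a different fraud rate than the training set. One option, which the original post does not use, is the `stratify` argument of `train_test_split`. The sketch below shows it on made-up arrays (`X_toy`, `y_toy`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 990 normal samples and 10 fraud samples
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 990 + [1] * 10)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_te))          # sizes of the two splits
print(y_tr.mean(), y_te.mean())      # fraud rate in each split
```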

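5. Cross-validation (hyperparameter tuning)

Cross-validation works entirely inside the training set. Split it evenly into three folds and validate three times: first train on folds 1 and 2 and validate on fold 3; then train on folds 2 and 3 and validate on fold 1; finally train on folds 1 and 3 and validate on fold 2. The three validation scores are then averaged.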
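The three-fold procedure described above can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the post's own tuning code; the candidate values of the regularization parameter `C` and the choice of recall as the score are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Synthetic imbalanced data standing in for the training set (about 10% positives)
X_cv, y_cv = make_classification(
    n_samples=600, n_features=5, weights=[0.9, 0.1], random_state=0
)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
mean_recall = {}

for c in [0.01, 1.0, 100.0]:  # hypothetical candidate values for C
    fold_scores = []
    # Each iteration trains on two folds and validates on the remaining one
    for train_idx, val_idx in kf.split(X_cv):
        model = LogisticRegression(C=c, max_iter=1000)
        model.fit(X_cv[train_idx], y_cv[train_idx])
        val_pred = model.predict(X_cv[val_idx])
        fold_scores.append(recall_score(y_cv[val_idx], val_pred))
    # Average the three validation scores for this value of C
    mean_recall[c] = float(np.mean(fold_scores))
    print(c, mean_recall[c])

best_c = max(mean_recall, key=mean_recall.get)
print('best C:', best_c)
```

The test set is never touched during this loop; it is reserved for the final evaluation of the chosen setting.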

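6. Model evaluation

# Import the model and create a LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
# 1. penalty: str, the regularization term. The two main choices are 'l1' and 'l2'; the default is 'l2'.
# 2. solver='newton-cg': iteratively optimizes the loss using its second-derivative (Hessian) matrix.
# 3. multi_class='multinomial' selects multinomial logistic regression directly. It is omitted here
#    because this is a binary problem and the parameter is deprecated since scikit-learn 1.5.
lr = LogisticRegression(penalty='l2', solver='newton-cg')
# Train; ravel() turns the one-column DataFrame into the 1-D array that fit() expects
lr.fit(X_train, y_train.values.ravel())
# Evaluate the model
print('Logistic regression training accuracy: %.3f' % lr.score(X_train, y_train))
print('Logistic regression test accuracy: %.3f' % lr.score(X_test, y_test))
pred = lr.predict(X_test)  # predict labels for the test set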
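Accuracy alone can look excellent on imbalanced data even when no fraud is caught. The sketch below makes this concrete with made-up labels and a "model" that always predicts the normal class; none of the numbers come from creditcard.csv.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 998 normal transactions and 2 fraudulent ones
y_true = np.array([0] * 998 + [1] * 2)
# A trivial "model" that labels every transaction as normal
y_always_normal = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_normal)
rec = recall_score(y_true, y_always_normal)

print('accuracy:', acc)  # very high, because almost everything is normal
print('recall:', rec)    # zero, because no fraud case was found
```

This is why the evaluation should also look at recall, which is what the confusion matrix provides.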

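7. Confusion matrix

Recall (also called sensitivity) is the fraction of actual positives that the model finds: recall = TP / (TP + FN).

TP (true positives): actually positive, predicted positive

FP (false positives): actually negative, predicted positive

FN (false negatives): actually positive, predicted negative

TN (true negatives): actually negative, predicted negative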
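The four counts can be read straight out of scikit-learn's `confusion_matrix`. The sketch below uses ten made-up labels and predictions to show the order in which the counts come back and how recall and precision follow from them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = fraud, 0 = normal)
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

recall = tp / (tp + fn)      # share of actual fraud that was caught
precision = tp / (tp + fp)   # share of fraud alerts that were correct

print(tn, fp, fn, tp)
print(recall, precision)
```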

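from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # compute and plot the confusion matrix
import matplotlib.pyplot as plt  # plotting library

class_names = [0, 1]
# Compute the confusion matrix from the test labels and the predictions made in step 6
cnf_matrix = confusion_matrix(y_test, pred, labels=class_names)
# plot_confusion_matrix is a helper that is not defined in this post;
# scikit-learn's built-in ConfusionMatrixDisplay is used in its place
disp = ConfusionMatrixDisplay(confusion_matrix=cnf_matrix, display_labels=class_names)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.show()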

Reference: Bilibili user buchiyu_脏果君