Credit Card Fraud Detection with Logistic Regression

大犀牛冲鸭

Category: Review. Tags: machine learning, logistic regression, imbalanced data

Article link: https://blog.csdn.net/weixin_42598936/article/details/94304158

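This article shows how to use logistic regression for credit card fraud detection, focusing on the class imbalance problem and on undersampling and oversampling (in particular the SMOTE algorithm). In the model training experiments, oversampling performed better than undersampling and improved the model's precision.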

These are my notes from instructor Tang Yudi's course, compiled for my own review.

(1) Import the required packages

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score  # cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report  # confusion matrix, recall

%matplotlib inline

(2) Load the data

Data source: Kaggle

data = pd.read_csv("E:\\AAAAAAAAA\\逻辑回归信用卡欺诈检测\\creditcard.csv",engine='python')
data.head()

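Because the data involves private information, the original column names are not given. The dataset has 31 columns and 284,807 rows; the last column, Class, is the label, where 0 means normal and 1 means fraud.
(Ways to collect data:

Official sites: Kaggle datasets, Amazon datasets, the UCI Machine Learning Repository, Google datasets, and so on
Web scraping)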

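(3) Data preprocessing

Standardize the Amount column (zero mean, unit variance). StandardScaler expects a 2-D input, so reshape(-1, 1) turns Amount into a single column, with -1 telling NumPy to infer the number of rows. Then drop the 'Time' and 'Amount' columns.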

from sklearn.preprocessing import StandardScaler

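# A pandas Series has no reshape method, so convert to a NumPy array first
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))   # one column, row count inferred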
data = data.drop(['Time','Amount'],axis=1)
data.head()
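To check what the standardization step does, here is a minimal sketch on a made-up column. The toy values and the names toy, column, scaled and manual are my own assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy amounts, not taken from the real dataset
toy = pd.DataFrame({"Amount": [1.0, 2.0, 3.0, 4.0, 5.0]})

# .values gives a 1-D NumPy array; reshape(-1, 1) turns it into a single column
column = toy["Amount"].values.reshape(-1, 1)
print(column.shape)  # (5, 1)

scaled = StandardScaler().fit_transform(column)

# StandardScaler computes (x - mean) / std with the population std (ddof=0)
manual = (column - column.mean()) / column.std()
print(np.allclose(scaled, manual))  # True
```

After scaling, the column has mean 0 and standard deviation 1, so Amount ends up on a scale comparable to the other feature columns.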

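(4) Handling class imbalance

Credit card fraud is rare, so the samples are likely to be imbalanced across classes. The bar chart below shows the class distribution.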

# pd.value_counts is deprecated in recent pandas; call the Series method instead
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")

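The bar chart confirms that the classes are imbalanced. Class imbalance is usually handled with undersampling or oversampling.
Why does class imbalance affect model output?
Many models decide the output class with a threshold; in logistic regression, for example, a predicted probability below 0.5 is a negative and above 0.5 is a positive. When the classes are imbalanced, the default threshold biases the model's output toward the majority class.
Ways to address class imbalance:
1) Adjust the threshold so the model leans toward the minority class (often does not work well);
2) Choose a suitable evaluation metric, such as the ROC curve or the F1 score, rather than accuracy;
3) Undersampling: in a binary problem where positives far outnumber negatives, drop some positives until the two classes are balanced (this discards data, so the model can overfit the small remaining sample and generalize poorly);
4) Oversampling: in a binary problem where positives far outnumber negatives, add negatives (by duplicating existing negative samples) until the two classes are balanced (prone to overfitting, since the same samples are repeated).
An improvement on oversampling: the SMOTE algorithm (a data generation strategy)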
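A small sketch of why this matters. The labels and probabilities below are made up (a 990 to 10 split that I chose for illustration, not the real class ratio), and the variable names are my own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 normal (0) and 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(accuracy)  # 0.99
print(recall)    # 0.0

# Hypothetical predicted fraud probabilities for four transactions
probs = np.array([0.1, 0.4, 0.6, 0.9])
default_labels = (probs >= 0.5).astype(int)  # default threshold
lowered_labels = (probs >= 0.3).astype(int)  # lower threshold flags more fraud
print(default_labels)  # [0 0 1 1]
print(lowered_labels)  # [0 1 1 1]
```

The always-normal predictor scores 99% accuracy while catching no fraud at all, which is why accuracy alone is a poor guide here. Lowering the threshold flags more transactions as fraud, at the cost of more false alarms.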
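The core idea of SMOTE, as I understand it, is to create new minority samples by interpolating between a minority sample and one of its nearest minority neighbors, instead of duplicating rows. Below is a rough sketch of that idea only. The function smote_sketch, its parameters, and the random toy points are all hypothetical names and data of my own, and this is not the implementation used in the course.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point's nearest neighbor is itself
    # (assuming the minority set has no duplicate points)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))            # pick a minority sample
        neighbor = rng.choice(neighbor_idx[base, 1:])   # pick one of its k neighbors
        gap = rng.random()                              # interpolation factor in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic

# Hypothetical minority-class points in two dimensions
toy_rng = np.random.default_rng(42)
X_minority = toy_rng.normal(size=(20, 2))
X_new = smote_sketch(X_minority, n_new=30)
print(X_new.shape)  # (30, 2)
```

Each synthetic point lies on the line segment between two real minority points, so the new samples stay inside the region the minority class already occupies. For real work, the imbalanced-learn package provides a SMOTE class (imblearn.over_sampling.SMOTE) that would be the more sensible choice than this sketch.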

Undersampling

# .ix was removed in pandas 1.0; .loc does the same label-based selection here
X = data.loc[:, data.columns != 'Class']  # feature matrix
y = data.loc[:, data.columns == 'Class']  # labels

# Number of data points in the minority class (fraud)
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal class
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)  # sample as many normal rows as there are fraud rows, without replacement
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices (fraud rows plus the sampled normal rows)
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Under sample dataset; the indices are index labels, so select with .loc
under_sample_data = data.loc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']  # undersampled features
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']  # undersampled labels
# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print(
