Credit Card Fraud Detection with Logistic Regression

妄念驱动

Article link: https://blog.csdn.net/hx2017/article/details/78389376

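This post walks through credit card fraud detection with logistic regression on an anonymized dataset. Because the class distribution is extremely imbalanced, it applies standardization, random undersampling, and oversampling (SMOTE). After training and evaluating the models, it finds that oversampling improves recall and lowers the false-positive rate, which makes it an effective way to handle imbalanced data.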

This post uses logistic regression to detect credit card fraud. The data is an anonymized competition dataset, and all the work is done in Python in a Jupyter Notebook.

Exploring the Data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Load the data and look at the first five rows:

data = pd.read_csv("creditcard.csv")
data.head()

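The data has 31 columns: Time, V1 through V28, Amount, and Class. The last column, Class, is our label: 0 marks a normal transaction and 1 marks a fraudulent one. As a first step, plot the class counts to see how the fraud rows are distributed.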

count_classes = data['Class'].value_counts().sort_index()  # rows per class, ordered by label (pd.value_counts is deprecated in recent pandas)
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")

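The plot shows roughly 280,000 rows with Class = 0 and only a tiny number of fraud rows with Class = 1, so the distribution is extremely imbalanced.
There are two common ways to deal with this:
1. Oversampling (generate more rows of class 1 until there are as many as class 0);
2. Undersampling (take a subset of class 0 that is the same size as class 1).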
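To make the two options concrete, here is a minimal sketch on a made-up label array. The helper `resample_indices` is hypothetical and is not part of the original notebook, and its "over" branch simply repeats minority rows at random, which is a simpler stand-in for SMOTE rather than SMOTE itself. It assumes label 1 is the minority class.

```python
import numpy as np

def resample_indices(labels, strategy, seed=0):
    """Return row positions that balance a binary label array.

    strategy="under": keep every minority row, sample the majority down.
    strategy="over":  keep every majority row, draw minority rows with
                      replacement until the two classes match in size.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    minority = np.flatnonzero(labels == 1)  # assumed to be the rare class
    majority = np.flatnonzero(labels == 0)
    if strategy == "under":
        majority = rng.choice(majority, size=len(minority), replace=False)
    elif strategy == "over":
        minority = rng.choice(minority, size=len(majority), replace=True)
    else:
        raise ValueError("strategy must be 'under' or 'over'")
    return np.concatenate([minority, majority])

# Toy labels: 95 normal rows and 5 fraud rows (made-up numbers).
toy_labels = np.array([0] * 95 + [1] * 5)
under = resample_indices(toy_labels, "under")
over = resample_indices(toy_labels, "over")
print(len(under), toy_labels[under].mean())
print(len(over), toy_labels[over].mean())
```

Undersampling throws away most of the normal rows, while oversampling keeps all of them and grows the fraud class instead, so the two balanced sets differ greatly in size.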

Standardization

Among the features, Amount has a far larger range than the others, which suggests it has not been standardized yet. So we standardize this column first:

from sklearn.preprocessing import StandardScaler
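# Standardize the Amount column; StandardScaler expects a 2-D array, and a pandas Series has no reshape method, so go through .values
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))  # reshape(-1, 1): -1 lets NumPy infer the number of rows, 1 gives a single column
data = data.drop(['Time', 'Amount'], axis=1)  # drop the two columns we no longer need, giving a new dataset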
data.head()

At this point all the feature columns are on a comparable, standardized scale.
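As a rough illustration of what standardization does to a column, the sketch below computes the z-score by hand with NumPy. StandardScaler with its default settings subtracts the column mean and divides by the population standard deviation, which is what this does too. The function name `standardize` and the sample amounts are made up for the example.

```python
import numpy as np

def standardize(values):
    """Z-score a 1-D array: subtract the mean, divide by the population std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # np.std defaults to ddof=0

toy_amounts = np.array([1.0, 20.0, 150.0, 2500.0])  # made-up transaction amounts
scaled = standardize(toy_amounts)
print(scaled.round(3))
print(scaled.mean(), scaled.std())  # close to 0 and 1 respectively
```

After the transform the column has mean 0 and standard deviation 1 (up to floating-point error), so one very large amount no longer dwarfs the other features.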

Random Undersampling

Undersampling is the simpler of the two, so we start with it. First, separate the features from the label:

X = data.loc[:, data.columns != 'Class']  # features: every column whose name is not 'Class'
y = data.loc[:, data.columns == 'Class']  # label: the 'Class' column, kept as a one-column DataFrame

To keep the sample representative of the original distribution of normal transactions, we pick the rows at random:

# Random undersampling
# Count the rows with Class == 1 and collect their index values
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Collect the index values of the rows with Class == 0
normal_indices = data[data.Class == 0].index

# Draw as many normal rows as there are fraud rows; replace=False samples without replacement, so no row is picked twice
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)  # convert to a NumPy array

# Concatenate the two index arrays into one new index array
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Build the undersampled dataset
under_sample_data = data.loc[under_sample_indices, :]  # these are index labels, so select with .loc rather than .iloc

# Split the undersampled data into features and label
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Show the class proportions
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
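The index juggling above can also be written more compactly with `DataFrame.sample`. The following is a sketch under stated assumptions, not the notebook's own code: `undersample` is a hypothetical helper, the toy frame is made up, and a fixed `seed` is added so that reruns pick the same rows.

```python
import pandas as pd

def undersample(df, label_col="Class", seed=0):
    """Keep all rows labeled 1 plus an equal-sized random sample of rows labeled 0."""
    fraud = df[df[label_col] == 1]
    # sample() draws without replacement by default
    normal = df[df[label_col] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so the two classes are not stored as two contiguous blocks
    return pd.concat([fraud, normal]).sample(frac=1, random_state=seed)

# Toy frame: 3 fraud rows and 17 normal rows (made-up data)
toy = pd.DataFrame({"V1": range(20), "Class": [1, 1, 1] + [0] * 17})
balanced = undersample(toy)
print(len(balanced), balanced["Class"].mean())
```

Fixing the seed matters once hyperparameters are being compared: without it, every rerun trains on a different set of normal transactions, and score differences may come from the sample rather than the model.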

Splitting the Data

Split the dataset into a training set and a test set:

from sklearn.model_selection import train_test_split

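# Split the full dataset; note that the same random strategy (random_state) is used throughout
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # hold out 30% as the test set; random_state=0 gives the same split on every run, which helps when tuning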

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ",