Machine Learning Case Study: Credit Card Fraud Detection (Logistic Regression)

听挽风讲大数据

Article link: https://blog.csdn.net/huahuaxiaoshao/article/details/85232089

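1. Case background: the data set is a record of individual credit card transactions. Because the records contain private information, the features have been put through a PCA-like transformation, so the data we receive already consists of extracted features. Next, we use logistic regression to detect fraud.
2. When you get a data set, never rush into building a model. Always inspect the data first.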

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('creditcard.csv')
data.head()

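# Count the records in each class (pd.value_counts is deprecated in recent pandas; use the Series method)
count_classes = data['Class'].value_counts(sort=True)
count_classes.plot(kind = 'bar')
plt.title('Fraud class histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')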

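Class 0 is normal and class 1 is abnormal (fraud). Our goal is a classification task. The vast majority of samples are normal and only a few are fraudulent, so the classes are imbalanced. There are two common remedies:
① Undersampling: take the size of the smaller class as the standard and draw the same number of samples from the larger class (classes 0 and 1 become equally small).
② Oversampling: take the size of the larger class as the standard and generate samples for the smaller class until it matches (generate class 1 samples so that classes 0 and 1 become equally large).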
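As a rough illustration of the second option, the sketch below oversamples the minority class by randomly duplicating its rows until both classes have the same size. This is plain random oversampling, a simpler stand-in for methods that generate new samples (such as SMOTE). The toy DataFrame and the helper name `random_oversample` are assumptions made for this example and are not part of the original article.

```python
import pandas as pd

def random_oversample(df, label_col="Class", seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, count in counts.items():
        rows = df[df[label_col] == label]
        if count < target:
            # Sample with replacement to make up the shortfall.
            extra = rows.sample(n=target - count, replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the classes are interleaved, then rebuild a clean index.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical toy data: 8 normal rows (Class 0) and 2 fraud rows (Class 1).
toy = pd.DataFrame({"V1": range(10), "Class": [0] * 8 + [1] * 2})
balanced = random_oversample(toy)
print(balanced["Class"].value_counts())
```

Because the duplicated rows are exact copies, this approach can encourage overfitting to the few fraud examples, which is one reason sample-generation methods are often preferred.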
3. The Amount column has not been standardized, so we standardize it next

from sklearn.preprocessing import StandardScaler
# StandardScaler removes the mean and scales to unit variance. It works per feature (column), not per sample.
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
data.head()
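To make the effect of StandardScaler concrete, here is a minimal sketch that compares it with the manual formula (x - mean) / std. The amounts below are made-up values standing in for the Amount column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts, standing in for the Amount column.
amounts = np.array([1.0, 20.0, 300.0, 4000.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(amounts)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 (up to floating-point error), which puts it on a scale comparable to the PCA-like features.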

4. Undersample so that both classes are equally small

X = data.loc[:,data.columns != 'Class']
y = data.loc[:,data.columns =='Class']
# Number of records with label 1 (fraud)
number_records_fraud = len(data[data.Class == 1])
# Index labels of the records with label 1
fraud_indices = np.array(data[data.Class == 1].index)
# Index labels of the records with label 0
normal_indices = np.array(data[data.Class == 0].index)
# Randomly pick number_records_fraud index labels from the label-0 records, without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)
# Merge the indices of the two classes
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
# Undersample: these are index labels, so select with .loc rather than .iloc
under_sample_data = data.loc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
# Show the share of each class
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

Output:
Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
Total number of transactions in resampled data: 984
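The undersampling code above draws the normal records with an unseeded `np.random.choice`, so each run picks a different subset. As a hedged variation, the steps can be wrapped in a small seeded helper. The function name `undersample`, its parameters, and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def undersample(df, label_col="Class", minority=1, seed=0):
    """Keep all minority rows plus an equally sized random subset of the other rows."""
    rng = np.random.default_rng(seed)
    minority_idx = df.index[df[label_col] == minority].to_numpy()
    majority_idx = df.index[df[label_col] != minority].to_numpy()
    # Draw as many majority rows as there are minority rows, without replacement.
    chosen = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    # The arrays hold index labels, so select with .loc.
    return df.loc[np.concatenate([minority_idx, chosen])]

# Hypothetical toy data: 5 fraud rows and 95 normal rows.
toy = pd.DataFrame({"V1": np.arange(100), "Class": [1] * 5 + [0] * 95})
sample_a = undersample(toy, seed=42)
sample_b = undersample(toy, seed=42)
print(sample_a["Class"].value_counts())
print(sample_a.index.equals(sample_b.index))
```

With a fixed seed the same rows are selected every time, which makes later comparisons between models easier to reproduce.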
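5. Training, validation, and test sets
       A vivid analogy:
       Training set: the student's textbook. The student learns from its content.
       Validation set: homework. It shows how well different students are learning and how fast they improve.
       Test set: the exam. The questions have not been seen before, so it tests the student's ability to generalize.
       Traditionally the three sets are split in a ratio of 6:2:2. The validation set is not strictly required.
       a) The training set takes part directly in fitting the model's parameters, so it clearly cannot reflect the model's true ability (we do not want the student who memorized the textbook to get the best grade, that is, we want to guard against overfitting).
       b) The validation set takes part in manual tuning of the hyperparameters, so it cannot be used for the final judgment of a model either (a student who drilled the question bank is not necessarily a good student).
       c) So the true ability of a student (model) has to be assessed with a final exam (the test set).
       Here the data is split only into a training set and a test set. Cross-validation, introduced later, further splits the training set into training and validation parts.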
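For reference, the 6:2:2 split mentioned above can be sketched by calling `train_test_split` twice. This is only an illustration on made-up data; the variable names and the use of `stratify` (to keep the class ratio in every part) are choices made for this example, not something the article itself does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 10 of them in the minority class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

# First hold out 20% of the data as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Then take 25% of the remaining 80% as the validation set (0.25 * 0.8 = 0.2).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_tr), len(X_val), len(X_te))
```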

from sklearn.model_selection import train_test_split
# The full data set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))
# The undersampled data set
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample=train_test_split(X_undersample,y_undersample ,test_size = 0.3,random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

Output:
Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807

Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984
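Note: why split the original data as well, when we have already undersampled? The model is trained on the undersampled training set but tested on the test set of the original data. The undersampled test set may lack some characteristics of the original data, such as its real class distribution, so it cannot test the model adequately.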
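The idea of training on balanced data and testing on data with the original distribution can be sketched as follows. Note the assumptions: the data is synthetic (generated with `make_classification`, standing in for creditcard.csv), and the sketch undersamples only the training split, which is a slight variation on the procedure above that keeps test rows out of training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 2% in class 1), standing in for the real file.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

# Undersample the training split only.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
keep = np.concatenate([pos, neg])

model = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])

# Evaluate on the untouched, still imbalanced test split.
y_pred = model.predict(X_te)
recall = recall_score(y_te, y_pred)
flagged = float(y_pred.mean())
print("recall on class 1:", recall)
print("share of test rows flagged as class 1:", flagged)
```

Comparing the share of flagged rows with the true share of class 1 in the test split gives a feel for how many false alarms a model trained on balanced data raises on realistic data.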
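6. Cross-validation
       Cross-validation is a common technique for building models and validating model parameters in machine learning. As the name suggests, it reuses the data: the samples are partitioned and recombined into different training and test sets, the training set is used to fit the model, and the test set is used to evaluate its predictions. This yields several different training and test sets, and a sample that is in the training set in one round may be in the test set in the next. That is the "cross" in the name.
       For ordinary, moderately sized problems, if there are fewer than ten thousand samples we use cross-validation to train, tune, and select the model. If there are more than ten thousand, we usually split the data randomly into three parts: a training set, a validation set, and a test set. The training set is used to fit the model, and the validation set is used to evaluate its predictions and to select the model and its parameters. The resulting model is then applied to the test set to make the final decision on which model and parameters to use.
       Simple cross-validation: first split the data randomly into two parts, one as the training set and the other as the test set (for example 70% for training and 30% for testing). Fit the model on the training set and validate the model and its parameters on the test set. Then shuffle the samples, choose a new training set and test set, and train and validate again. Finally, select the model and parameters that score best under the loss function.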
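       K-fold cross-validation (written here with S folds): the samples are split randomly into S parts. In each round, S-1 parts are used as the training set and the remaining part as the test set. When a round is finished, a different group of S-1 parts is chosen for training. After several rounds (in the standard procedure, S rounds, so that each part serves as the test set exactly once), select the model and parameters that score best under the loss function.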
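       Leave-one-out cross-validation: a special case of the second method in which S equals the number of samples N. Each round trains on N-1 samples and keeps one sample to validate the prediction. This method is mainly used when very few samples are available. For example, for ordinary, moderately sized problems, leave-one-out is generally used when N is below 50.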
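The three schemes above roughly correspond to three splitter classes in scikit-learn. The sketch below only counts the splits and prints the size of the first training and test part for each. The mapping of "simple cross-validation" to `ShuffleSplit` is my reading of the description, and the 10-sample array is made up.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples

splitters = {
    "simple (repeated 70/30 hold-out)": ShuffleSplit(n_splits=3, test_size=0.3, random_state=0),
    "5-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
}

n_rounds = {}
for name, splitter in splitters.items():
    # Each split yields (train indices, test indices).
    sizes = [(len(tr), len(te)) for tr, te in splitter.split(X_demo)]
    n_rounds[name] = len(sizes)
    print(name, "rounds:", n_rounds[name], "first (train, test) sizes:", sizes[0])
```

Leave-one-out produces as many rounds as there are samples, which is why it is only practical for small data sets.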
       This article uses K-fold cross-validation. The algorithm flow is as follows:
[Figure: K-fold cross-validation algorithm flow]
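A minimal sketch of such a K-fold loop is given below. It is not the article's own implementation: the data is synthetic, the helper name `mean_recall` and the candidate values of `C` are assumptions, and recall is used as the selection score because it is the metric that matters most for finding the rare class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

# Synthetic balanced data, standing in for the undersampled training set.
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)

def mean_recall(C, X, y, n_splits=5):
    """Average recall of logistic regression with parameter C over K folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores.append(recall_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))

# Hypothetical candidate values for the regularization parameter C.
candidates = [0.01, 0.1, 1, 10, 100]
results = {C: mean_recall(C, X_syn, y_syn) for C in candidates}
best_C = max(results, key=results.get)
print(results)
print("best C:", best_C)
```

Each fold serves as the validation part exactly once, and the parameter with the best average score across folds is kept.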
7. Model evaluation methods
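       Suppose a hospital has 1000 patients, of whom 990 are normal and 10 have cancer, and our goal is to find the 10 cancer cases. If our model predicts all 1000 patients as normal, its accuracy is 99%, yet it has not found a single one of the 10 cases we are looking for. Such a model is useless, because it cannot identify even one cancer patient. So when we build a model, we should think carefully about how to evaluate it. Commonly used metrics include precision, recall, and the F-measure.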
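The hospital example can be checked numerically. With the cancer cases treated as the class of interest (label 1, an assumption of this sketch), precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two, where TP, FP, and FN count true positives, false positives, and false negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 990 normal patients (label 0) and 10 cancer patients (label 1).
y_true = np.array([0] * 990 + [1] * 10)
# A useless model that predicts "normal" for everyone.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no sample is predicted as cancer.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "F1:", f1)
```

Accuracy comes out at 99% while recall for the cancer class is zero, which is exactly why accuracy alone is misleading on imbalanced data.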

8. Regularization
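       Regularization is the typical method for model selection. It implements the structural risk minimization strategy by adding a regularization term (a penalty term) to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity: the more complex the model, the larger the regularization value. For example, the regularization term can be a norm of the model's parameter vector.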
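The effect of the penalty can be seen on logistic regression, sketched here on synthetic data. In scikit-learn's `LogisticRegression` the parameter `C` is the inverse of the regularization strength and the default penalty is the L2 norm of the coefficients, so a smaller `C` should give a smaller coefficient norm. The values of `C` below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

# Smaller C means a stronger (default L2) penalty on the coefficient vector.
norms = {}
for C in [0.001, 1.0, 1000.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_syn, y_syn)
    norms[C] = float(np.linalg.norm(model.coef_))
    print("C =", C, "L2 norm of coefficients =", norms[C])
```

A strongly penalized model is pulled toward small coefficients, that is, toward a simpler model, at the cost of a worse fit to the training data.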