Introduction to Machine Learning: A Credit Card Fraud Case Study

2034丶

Category: Machine learning case studies

Article link: https://blog.csdn.net/qq_45315982/article/details/105497793

Credit card fraud detection (a binary classification problem)

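Because of privacy concerns, many of the original fields were reduced in dimensionality before release, so feature extraction is already done.
The data falls into two classes:
Class 0: normal
Class 1: fraudulent (abnormal)

The dataset contains far more normal transactions than fraudulent ones.

In the Class column, 0 marks a normal transaction and 1 marks a fraudulent one. The recall computed later in this post is the recall of class 1, so fraud is the positive class in the usual metric sense.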
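
To see why recall matters more than accuracy on data like this, here is a minimal sketch. The counts are hypothetical and are not taken from the dataset; the "model" simply predicts normal for every transaction.

```python
# Hypothetical counts, not taken from the dataset
n_normal, n_fraud = 995, 5
y_true = [0] * n_normal + [1] * n_fraud
y_pred = [0] * len(y_true)  # a "model" that always predicts normal

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall of class 1: share of fraud cases that were actually caught
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraud

print(accuracy, recall)  # 0.995 0.0
```

The accuracy looks excellent even though not a single fraud case is found, which is why the rest of the post evaluates models by recall.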

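Methods for handling imbalanced classes

1. Oversampling
Generate more class-1 samples until there are as many as class-0 samples.
2. Undersampling (downsampling)
Keep only as many class-0 samples as there are class-1 samples.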
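
The two strategies can be sketched with the standard library alone. This is an illustration on toy data, not the code used later in the post; the helper names `random_undersample` and `random_oversample` are made up for this example, and the oversampling shown here is plain random duplication rather than SMOTE.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep all minority samples plus an equally sized random subset of the majority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def random_oversample(majority, minority, seed=0):
    """Keep everything and duplicate random minority samples until the classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy data: (feature, label) pairs with 10 normal (0) and 2 fraud (1) samples
normal = [(i, 0) for i in range(10)]
fraud = [(100, 1), (101, 1)]

under = random_undersample(normal, fraud)
over = random_oversample(normal, fraud)
print(len(under), sum(label for _, label in under))  # 4 samples, 2 of them fraud
print(len(over), sum(label for _, label in over))    # 20 samples, 10 of them fraud
```

Undersampling throws away most of the normal data, while oversampling keeps it all at the cost of repeated (or, with SMOTE, synthetic) fraud samples.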

The Amount values span too wide a range (standardize or normalize them)

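# Amount varies over a much wider range than the other features, so standardize it
from sklearn.preprocessing import StandardScaler
# reshape(-1, 1): infer the number of rows, force a single column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the columns that are no longer needed
data = data.drop(['Time', 'Amount'], axis=1)
# print(data)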
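
What standardization computes can be shown by hand: subtract the mean and divide by the population standard deviation. The sketch below uses only the standard library and made-up amounts; `standardize` is a hypothetical helper, not part of scikit-learn.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

amounts = [1.0, 2.5, 10.0, 250.0, 3000.0]  # hypothetical transaction amounts
z = standardize(amounts)
print([round(v, 3) for v in z])
```

After the transformation the values have mean 0 and standard deviation 1, so the amount column no longer dominates features that live on a much smaller scale.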

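Cross-validation (training set and test set)

First split the data: a large share, say 80%, is the training set used to build the model,
and the remaining 20% is the test set used to evaluate it. (The code in this post holds out 30%.)

Then split the training set again, for example into three equal parts A, B and C:
train on A+B and validate on C
train on A+C and validate on B
train on B+C and validate on A
Averaging over these rounds makes the evaluation of the model more reliable.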
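
The A/B/C rotation above can be sketched as a small index generator. This is a simplified illustration of unshuffled K-fold splitting, and `k_fold_indices` is a hypothetical helper written for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds, without shuffling."""
    # Spread any remainder over the first folds so the fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(9, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample is used for validation exactly once and for training in the other rounds.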

# Train/test split

from sklearn.model_selection import train_test_split

# Split the full dataset: test_size is the fraction held out for testing (30% here);
# random_state=0 fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))


# Split the undersampled dataset the same way
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)
print("Undersampled split:")
print("Number transactions train dataset:", len(X_train_undersample))
print("Number transactions test dataset:", len(X_test_undersample))
print("Total number of transactions:", len(X_train_undersample) + len(X_test_undersample))

Fitting a logistic regression model

# Cross-validate over the regularization parameter C and report the mean recall
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    # Candidate values for C, the inverse regularization strength:
    # a smaller C means a stronger penalty and a less flexible model
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    j = 0
    # Outer loop: one pass per candidate C
    for c_param in c_param_range:
        print('---------------------------')
        print('C_parameter:', c_param)
        print('---------------------------')
        print('')
        recall_accs = []

        # Inner loop: indices[0] holds the training rows of the fold, indices[1] the validation rows
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # Logistic regression with L1 regularization at the current C
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            # Fit on the training part of the fold
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            # Predict on the held-out part of the fold
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])
            # Recall on the held-out part
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration', iteration, ': recall score =', recall_acc)

        # The mean recall over the folds is the score for this C
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score', np.mean(recall_accs))
        print('')

    print(results_table)
    # Pick the C with the highest mean recall (the column is stored as object, so cast it first)
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax(), 'C_parameter']
    print('Best model to choose from cross validation is with parameter =', best_c)
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)

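Oversampling: make both classes the same size
① SMOTE
SMOTE stands for Synthetic Minority Oversampling Technique.
It improves on random oversampling. Random oversampling adds minority samples by simply copying existing ones, which tends to cause overfitting: the model learns patterns that are too specific and do not generalize.
The basic idea of SMOTE is to analyze the minority-class samples and synthesize new samples from them, which are then added to the dataset.

SMOTE steps:

1. For each sample x in the minority class, compute its Euclidean distance to every other sample in the minority class and take its k nearest neighbors.
(Figure a in the original post shows the distances from x_i to the other samples.)
2. Set a sampling rate N from the class imbalance ratio. For each minority sample x, randomly pick several samples from its k nearest neighbors; call a chosen neighbor x_n.
3. For each randomly chosen neighbor x_n, build a new sample from x_n and the original sample x with the formula
x_new = x + rand(0, 1) × (x_n - x)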

Euclidean distance measures how different two samples are: d(x, y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2).

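The algorithm in detail:
Let T be the number of samples in a minority class of the training set; SMOTE synthesizes N·T new samples for this class. N must be a positive integer. If the given N is less than 1, the algorithm treats the minority class as having only N·T samples (T = N·T) and forces N = 1.
Consider a sample i of this minority class with feature vector x_i, i ∈ {1, …, T}.
Step 1: among all T samples of the class, find the k nearest neighbors of x_i (for example by Euclidean distance), written x_i(near), near ∈ {1, …, k}.
Step 2: randomly pick one of these k neighbors, x_i(nn), draw a random number `random` between 0 and 1, and synthesize a new sample x_i1 = x_i + random * (x_i(nn) - x_i).

Step 3: repeat step 2 N times to synthesize N new samples x_i(new), new ∈ {1, …, N}.
Doing this for all T minority samples yields N·T new samples for the class.
If the samples have two features, each one is a point in the plane. The new sample x_i1 is then a point on the line segment joining x_i and x_i(nn), which is why the algorithm is said to synthesize samples by interpolation.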
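
The three steps can be sketched in a few lines of plain Python. This is a simplified illustration under the assumptions stated in the comments, not the implementation in imbalanced-learn; the function name `smote` and its parameters are chosen for this example only.

```python
import math
import random

def smote(minority, n_per_sample, k, seed=0):
    """Synthesize n_per_sample new points for each minority sample by interpolation.

    Assumes samples are tuples of floats and that k is smaller than len(minority).
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest minority neighbors of x by Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: repeat step 2 N times for this sample
        for _ in range(n_per_sample):
            # Step 2: pick one neighbor and a random number in [0, 1)
            nn = rng.choice(neighbors)
            gap = rng.random()
            # The new point lies on the segment between x and nn
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
new_points = smote(minority, n_per_sample=2, k=2)
print(len(new_points))  # 8 = N * T with N = 2 and T = 4
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already covered by the minority class instead of being exact copies.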

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Credit card fraud detection
data = pd.read_csv('./creditcard.csv')
# print(data.head())  # inspect the data

# Count the rows of each class to see how imbalanced the data is
count_classes = data['Class'].value_counts().sort_index()
# pandas can draw simple plots directly
count_classes.plot(kind='bar')  # kind='bar' draws a bar chart
plt.title('Fraud class histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')
# plt.show()

# Amount varies over a much wider range than the other features, so standardize it
from sklearn.preprocessing import StandardScaler
# reshape(-1, 1): infer the number of rows, force a single column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the columns that are no longer needed
data = data.drop(['Time', 'Amount'], axis=1)
print(data.head())

# Start with undersampling
# Separate the Class column from the other columns
y = data.loc[:, data.columns == 'Class']  # the label column

X = data.loc[:, data.columns != 'Class']  # every column except Class

# Number of rows with Class == 1 (fraud samples)
number_records_fraud = len(data[data.Class == 1])

# Row index labels of all rows with Class == 1
fraud_indices = np.array(data[data.Class == 1].index)

# Row index labels of all rows with Class == 0
normal_indices = np.array(data[data.Class == 0].index)

# Randomly pick as many normal rows as there are fraud rows, without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the fraud and the sampled normal indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Undersampled dataset (the indices are index labels, so select with .loc)
under_sample_data = data.loc[under_sample_indices]
print(under_sample_data)
# Features of the undersampled dataset (every column except Class)
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
print(X_undersample)

# Labels of the undersampled dataset (only the Class column)
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Report the class balance
# Proportion of normal transactions
print("Percentage of normal transactions:",
      len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
# Proportion of fraud transactions
print("Percentage of fraud transactions:",
      len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
# Total number of rows
print("Total number of transaction in resampled data:", len(under_sample_data))

# Train/test split

from sklearn.model_selection import train_test_split

# Split the full dataset: test_size is the fraction held out for testing (30% here);
# random_state=0 fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))


# Split the undersampled dataset the same way
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)
print("Undersampled split:")
print(y_train_undersample)
print("Number transactions train dataset:", len(X_train_undersample))
print("Number transactions test dataset:", len(X_test_undersample))
print("Total number of transactions:", len(X_train_undersample) + len(X_test_undersample))

# Build the model with logistic regression
from sklearn.linear_model import LogisticRegression

# K-fold cross-validation splitter
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

# Cross-validate over the regularization parameter C and report the mean recall
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    # Candidate values for C, the inverse regularization strength:
    # a smaller C means a stronger penalty and a less flexible model
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    j = 0
    # Outer loop: one pass per candidate C
    for c_param in c_param_range:
        print('---------------------------')
        print('C_parameter:', c_param)
        print('---------------------------')
        print('')
        recall_accs = []

        # Inner loop: indices[0] holds the training rows of the fold, indices[1] the validation rows
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # Logistic regression with L1 regularization at the current C
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            # Fit on the training part of the fold
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            # Predict on the held-out part of the fold
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])
            # Recall on the held-out part
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration', iteration, ': recall score =', recall_acc)

        # The mean recall over the folds is the score for this C
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score', np.mean(recall_accs))
        print('')

    print(results_table)
    # Pick the C with the highest mean recall (the column is stored as object, so cast it first)
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax(), 'C_parameter']
    print('Best model to choose from cross validation is with parameter =', best_c)
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)


# Confusion matrix
import itertools

# Draw a confusion matrix as a heat map with the counts written in each cell
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    # cm is the matrix; interpolation='nearest' draws each cell as a solid block; cmap is the colormap
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    # Put the class names on the tick positions
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Write each count in its cell, in white on dark cells and black on light ones
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else 'black')
    # Tighten the layout automatically
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Train and test on the undersampled dataset
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
# Print numpy values with two decimals
np.set_printoptions(precision=2)
# Recall of class 1 = TP / (FN + TP)
print("Recall metric in the testing dataset:", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# Train on the undersampled data, test on the original (imbalanced) test set
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
# Print numpy values with two decimals
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset :", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# K-fold cross-validation on the original training data
best_c = printing_Kfold_scores(X_train, y_train)

# Train and test on the original data
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
# Print numpy values with two decimals
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset :", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# Train and test on the undersampled data to see how the decision threshold affects the result
# Note: best_c here is the value selected on the original training data just above
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample)
# A threshold that is too low flags many normal transactions as fraud; one that is too high
# misses fraud. The threshold has to balance the two.
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
plt.figure(figsize=(10, 10))
j = 1
for i in thresholds:
    # Predict fraud when the predicted probability of class 1 exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, j)
    j += 1
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    # Print numpy values with two decimals
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset:", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
    # Plot the non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)
plt.show()


# Oversampling: build the model on SMOTE-resampled data

from imblearn.over_sampling import SMOTE

credits_cards = pd.read_csv('creditcard.csv')
columns = credits_cards.columns
print(columns)
# Drop the last column (the Class label) to get the feature columns
features_columns = columns.delete(len(columns) - 1)
features = credits_cards[features_columns]
labels = credits_cards['Class']
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2,
                                                                            random_state=0)
# Oversample the training set only, so the test set stays untouched
oversampler = SMOTE(random_state=0)
# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
print(len(os_labels[os_labels == 1]))

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)
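
The threshold sweep in the listing above trades recall against false alarms. The sketch below shows that trade-off on a handful of made-up probabilities; the labels, the probabilities and the helper names are hypothetical and only illustrate the arithmetic behind recall = TP / (TP + FN).

```python
def recall_at_threshold(y_true, fraud_proba, threshold):
    """Recall of class 1 when fraud is predicted whenever probability > threshold."""
    tp = sum(1 for y, p in zip(y_true, fraud_proba) if y == 1 and p > threshold)
    fn = sum(1 for y, p in zip(y_true, fraud_proba) if y == 1 and p <= threshold)
    return tp / (tp + fn)

def false_positives_at_threshold(y_true, fraud_proba, threshold):
    """Number of normal transactions wrongly flagged as fraud."""
    return sum(1 for y, p in zip(y_true, fraud_proba) if y == 0 and p > threshold)

# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
fraud_proba = [0.05, 0.2, 0.45, 0.7, 0.35, 0.55, 0.8, 0.95]

for t in [0.3, 0.5, 0.9]:
    print(t, recall_at_threshold(y_true, fraud_proba, t),
          false_positives_at_threshold(y_true, fraud_proba, t))
```

On this toy data a threshold of 0.3 catches every fraud case but flags two normal transactions, while 0.9 raises no false alarms and catches only one fraud case in four.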

Reference blog: https://blog.csdn.net/huahuaxiaoshao/article/details/85232089
