Notes on Web安全之深度学习实战 (Web Security: Deep Learning in Practice), Chapter 15: Countering Credit Card Fraud

mooyuan天天

Published 2022-03-13 11:22:41


Article link: https://blog.csdn.net/mooyuan/article/details/123441914


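This chapter uses the Credit Card Fraud Detection dataset to illustrate techniques for detecting credit card fraud. The feature processing is standardization, followed by undersampling and oversampling applied on top of the standardized data. The classifiers covered are naive Bayes, XGBoost, and a multilayer perceptron (MLP). Compared with other chapters, the main point here is how to apply oversampling and undersampling, which are essential techniques for handling imbalanced data in machine learning.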

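1. Credit Card Fraud

        Credit card fraud means deliberately using a forged or invalidated credit card, using someone else's card to obtain money or goods, or maliciously overdrawing one's own card. The common forms are as follows.
        · Use of a lost card. A card is usually lost in one of three ways: it goes missing while the issuing bank is mailing it to the cardholder (a card that never arrives), the cardholder loses it through carelessness, or it is stolen.
        · Fraudulent application. The applicant typically uses another person's details or deliberately supplies false information. The most common cases involve a forged ID card or a fake employer or home address.
        · Counterfeit cards. More than 60% of credit card fraud cases worldwide involve counterfeit cards. They are typically committed by organized groups that cover the whole chain: stealing card data, manufacturing fake cards, selling them, and using them. Counterfeiters often rely on recent technology to steal genuine card data, for example with miniature recording devices or by tampering with authorization terminals. Once they have the data, they produce fake cards in bulk and sell them, making large profits.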

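2. Dataset

The test data is the Credit Card Fraud Detection dataset from Kaggle. It records two days of European credit card transactions from September 2013. Of the 284,807 transactions, 492 are fraudulent, so the dataset is extremely imbalanced: fraud accounts for about 0.17% of all transactions (492 / 284,807). To protect user privacy the raw data was anonymized, and each transaction is described by a 28-dimensional vector, V1 to V28. Time is the time of the transaction, Amount is the amount involved, and Class marks whether the transaction is fraudulent: 1 means fraud, 0 means a normal transaction.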
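
The imbalance matters because a trivial model already looks excellent on accuracy. The short sketch below uses only the two counts quoted above; the function name `fraud_rate` is made up for this illustration.

```python
def fraud_rate(n_fraud, n_total):
    """Fraction of transactions that are labelled as fraud."""
    return n_fraud / n_total

rate = fraud_rate(492, 284807)
# A model that labels every transaction as normal is wrong only on the fraud rows.
baseline_accuracy = 1 - rate

print(f"fraud rate: {rate:.4%}")
print(f"accuracy of always predicting 'normal': {baseline_accuracy:.4%}")
```

This is why the results later in the article report precision, recall and F1 alongside accuracy.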

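3. Feature Extraction

(1) Standardization

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def get_feature():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    # Standardize Amount (zero mean, unit variance), then drop the raw Time and Amount columns
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)

    y = df['Class']
    x = df.drop(['Class'], axis=1)

    # Hold out 40% of the rows as the test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test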

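(2) Standardization & undersampling

np.random.choice randomly picks elements from an integer range or a one-dimensional array and returns the picks as an ndarray.

Its signature is:

numpy.random.choice(a, size=None, replace=True, p=None)

Here a is usually the array to sample from; if a is an integer, the samples are drawn from the consecutive integers 0 to a-1. size is the number of elements to pick. replace controls whether the same element may be picked more than once, and p optionally gives the probability of each element. Typical usage: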

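# Pick 3 values at random from 0-4
>>> np.random.choice(5, 3)
array([0, 3, 4])
# Pick 3 values from 0-4 according to the probability table p
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0])
# Pick 3 distinct values from 0-4 (sampling without replacement)
>>> np.random.choice(5, 3, replace=False)
array([3, 1, 0])
# Without replacement and with the probability table p
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0])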

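Compared with the standardization-only version, the main change is to randomly pick number_fraud samples from the much larger set of normal ("white") samples, that is, as many as there are fraud ("black") samples, and then merge them with the fraud samples.

The complete undersampling code is as follows: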

import numpy as np

def get_feature_undersampling():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)

    # Row labels of all fraud samples
    number_fraud = len(df[df.Class == 1])
    fraud_index = np.array(df[df.Class == 1].index)

    # Draw the same number of normal samples, without replacement
    normal_index = df[df.Class == 0].index
    random_choice_index = np.random.choice(normal_index, size=number_fraud, replace=False)

    x_index = np.concatenate([fraud_index, random_choice_index])
    df = df.drop(['Class'], axis=1)
    # The indices are row labels, so select with .loc
    x = df.loc[x_index, :]
    # Labels follow the concatenation order: fraud rows first, then normal rows
    y = [1] * number_fraud + [0] * number_fraud

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
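
To try the idea without downloading the Kaggle file, the following sketch applies the same undersampling step to a small synthetic frame. The helper name `undersample`, the toy data and the fixed seed are assumptions made for this illustration; only the Class column name comes from the dataset.

```python
import numpy as np
import pandas as pd

def undersample(df, label_col="Class", seed=0):
    """Keep every positive row plus an equal number of randomly chosen negative rows."""
    rng = np.random.default_rng(seed)
    fraud_index = df.index[df[label_col] == 1].to_numpy()
    normal_index = df.index[df[label_col] == 0].to_numpy()
    # Draw as many normal rows as there are fraud rows, without repeats.
    chosen_normal = rng.choice(normal_index, size=len(fraud_index), replace=False)
    keep = np.concatenate([fraud_index, chosen_normal])
    return df.loc[keep]

# Toy data: 1000 rows, 10 of them labelled as fraud.
toy = pd.DataFrame({
    "V1": np.arange(1000, dtype=float),
    "Class": [1] * 10 + [0] * 990,
})
balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Selecting with `.loc` keeps the features and labels together, so the labels do not have to be rebuilt by hand after sampling.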

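Another approach is to split the data into training and test sets first and then undersample only the training set, so that the test set keeps the original class distribution.

The complete function for this variant is: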

def get_feature_undersampling_2():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)

    y = df['Class']
    x = df.drop(['Class'], axis=1)
    # Split first; only the training set is undersampled below
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)

    print("raw data")
    print(y_train.value_counts())

    # Row labels of the fraud samples in the training set
    number_fraud = len(y_train[y_train == 1])
    print(number_fraud)
    fraud_index = np.array(y_train[y_train == 1].index)
    print(fraud_index)

    # Draw the same number of normal training rows, without replacement
    normal_index = y_train[y_train == 0].index
    random_choice_index = np.random.choice(normal_index, size=number_fraud, replace=False)

    x_index = np.concatenate([fraud_index, random_choice_index])
    print(x_index)
    # The indices are row labels, so select with .loc
    x_train_1 = x_train.loc[x_index, :]
    y_train_1 = [1] * number_fraud + [0] * number_fraud

    return x_train_1, x_test, y_train_1, y_test

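(3) Standardization & oversampling

Another way to deal with the imbalance between fraud and normal samples is oversampling. Whereas undersampling "robs the rich to help the poor", oversampling keeps all samples of the majority class and uses an algorithm to generate new samples from the minority class. In this example, the normal samples are kept and new fraud samples are generated from the existing ones, so the final training set is again balanced. The most common generation algorithm is SMOTE.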
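
As a rough intuition for how SMOTE generates samples, the sketch below interpolates between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy illustration, not the imbalanced-learn implementation; the function name `smote_like_samples` and its parameters are invented for this example.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority samples
    and their k nearest minority neighbours (SMOTE-style interpolation)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # For each new point: pick a base sample and one of its k neighbours.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])

fraud_points = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [2.0, 2.0], [1.5, 0.2]])
synthetic = smote_like_samples(fraud_points, n_new=4)
print(synthetic)
```

Each synthetic point lies on a line segment between two real minority samples, so the new fraud samples stay inside the region already occupied by real ones.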

from imblearn.over_sampling import SMOTE

def get_feature_upsampling():
    df = pd.read_csv("../data/fraud/creditcard.csv")
    df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
    df = df.drop(['Time', 'Amount'], axis=1)

    y = df['Class']
    x = df.drop(['Class'], axis=1)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)

    print("raw data")
    print(y_train.value_counts())

    # Oversample the training set only; the test set is left untouched
    smote = SMOTE(random_state=0)
    x_train_1, y_train_1 = smote.fit_resample(x_train, y_train)
    print("Smote data")
    print(y_train_1.value_counts())

    # Return the resampled training data, not the original x_train / y_train
    return x_train_1, x_test, y_train_1, y_test


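4. Model Building

(1) Naive Bayes (NB)

from sklearn.naive_bayes import GaussianNB

def do_nb(x_train, x_test, y_train, y_test):
    # Gaussian naive Bayes with default parameters
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    do_metrics(y_test, y_pred)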

(2) XGBoost

import xgboost as xgb

def do_xgboost(x_train, x_test, y_train, y_test):
    # XGBoost classifier with default parameters
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    do_metrics(y_test, y_pred)

(3) MLP

from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    # Multilayer perceptron with two hidden layers of 5 and 2 units
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    do_metrics(y_test, y_pred)
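
The three functions above call do_metrics, which the article does not show. The sketch below is one plausible version, inferred only from the labels printed in the results logs (metrics.accuracy_score and so on); the body, including the returned dict, is an assumption and not the author's code.

```python
from sklearn import metrics

def do_metrics(y_test, y_pred):
    """Print the scores in the order used by the result logs and return them as a dict."""
    scores = {
        "accuracy_score": metrics.accuracy_score(y_test, y_pred),
        "confusion_matrix": metrics.confusion_matrix(y_test, y_pred),
        "precision_score": metrics.precision_score(y_test, y_pred),
        "recall_score": metrics.recall_score(y_test, y_pred),
        "f1_score": metrics.f1_score(y_test, y_pred),
    }
    for name, value in scores.items():
        print(f"metrics.{name}:")
        print(value)
    return scores

# Tiny hand-made example: one normal transaction is wrongly flagged as fraud.
example = do_metrics([0, 0, 1, 1], [0, 1, 1, 1])
```

In scikit-learn's confusion matrix, rows are the true classes and columns the predicted classes, so the bottom-right cell counts the fraud cases that were caught.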

5. Results

(1) Standardization

XGBoost
metrics.accuracy_score:
0.9995084399111681
metrics.confusion_matrix:
[[113700     14]
 [    42    167]]
metrics.precision_score:
0.9226519337016574
metrics.recall_score:
0.7990430622009569
metrics.f1_score:
0.8564102564102563

mlp
metrics.accuracy_score:
0.9994118834651475
metrics.confusion_matrix:
[[113701     13]
 [    54    155]]
metrics.precision_score:
0.9226190476190477
metrics.recall_score:
0.7416267942583732
metrics.f1_score:
0.8222811671087533

nb
metrics.accuracy_score:
0.9787926933103939
metrics.confusion_matrix:
[[111334   2380]
 [    36    173]]
metrics.precision_score:
0.06776341558950255
metrics.recall_score:
0.8277511961722488
metrics.f1_score:
0.12527154236060825
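
The scores in these logs follow directly from the confusion matrices. As a check, the sketch below recomputes them from the XGBoost and NB matrices reported for the standardized data; the helper name `scores_from_confusion` is made up for this illustration.

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Matrices are laid out as [[TN, FP], [FN, TP]].
xgb_scores = scores_from_confusion(tn=113700, fp=14, fn=42, tp=167)
nb_scores = scores_from_confusion(tn=111334, fp=2380, fn=36, tp=173)

for name, scores in [("XGBoost", xgb_scores), ("nb", nb_scores)]:
    print(name, ["%.4f" % s for s in scores])
```

The NB row shows the effect of the imbalance: its accuracy is still about 0.98, yet 2380 false alarms against 173 true detections push precision below 0.07.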


(2) Undersampling

XGBoost
metrics.accuracy_score:
0.9137055837563451
metrics.confusion_matrix:
[[190  10]
 [ 24 170]]
metrics.precision_score:
0.9444444444444444
metrics.recall_score:
0.8762886597938144
metrics.f1_score:
0.9090909090909091

mlp
metrics.accuracy_score:
0.9187817258883249
metrics.confusion_matrix:
[[187  13]
 [ 19 175]]
metrics.precision_score:
0.9308510638297872
metrics.recall_score:
0.9020618556701031
metrics.f1_score:
0.9162303664921466

nb
metrics.accuracy_score:
0.9010152284263959
metrics.confusion_matrix:
[[193   7]
 [ 32 162]]
metrics.precision_score:
0.9585798816568047
metrics.recall_score:
0.8350515463917526
metrics.f1_score:
0.8925619834710744

(3) Oversampling


raw data
0    170576
1       308
Name: Class, dtype: int64
Smote data
1    170576
0    170576
Name: Class, dtype: int64
XGBoost
metrics.accuracy_score:
0.9996401077921052
metrics.confusion_matrix:
[[113731      8]
 [    33    151]]
metrics.precision_score:
0.949685534591195
metrics.recall_score:
0.8206521739130435
metrics.f1_score:
0.880466472303207

mlp
metrics.accuracy_score:
0.9993943277476892
metrics.confusion_matrix:
[[113698     41]
 [    28    156]]
metrics.precision_score:
0.7918781725888325
metrics.recall_score:
0.8478260869565217
metrics.f1_score:
0.8188976377952757

nb
metrics.accuracy_score:
0.9790911405071847
metrics.confusion_matrix:
[[111382   2357]
 [    25    159]]
metrics.precision_score:
0.06319554848966613
metrics.recall_score:
0.8641304347826086
metrics.f1_score:
0.11777777777777776

