4. [Logistic Regression] Credit Card Fraud Detection (undersampling, SMOTE, adjusting the sigmoid threshold)


For a binary classification problem, logistic regression is still the first thing that comes to mind! (So far, I'd call logistic regression the most impressive binary classifier out there.)

$$\mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}, \qquad \text{range } [0,1]$$

$$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$$

  • The sigmoid function actually converts a raw score into a probability; z is usually taken to be θᵀX:
$$(\theta_0,\ \theta_1,\ \theta_2)\cdot(1,\ X_1,\ X_2)^T = \theta_0 + \theta_1 X_1 + \theta_2 X_2, \qquad h_\theta(X) = g(\theta^T X)$$
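To make the notation concrete, here is a minimal NumPy sketch of the sigmoid and the hypothesis h_θ(X) = g(θᵀX); the θ and x values are made up for illustration:

import numpy as np

def sigmoid(z):
    # maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x); x carries a leading 1 for the intercept
    return sigmoid(x @ theta)

theta = np.array([0.5, -1.0, 2.0])   # (theta0, theta1, theta2)
x = np.array([1.0, 0.3, 0.8])        # (1, X1, X2)
print(hypothesis(theta, x))          # a probability in (0, 1)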

Credit Card Fraud Detection

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('./creditcard.csv')
data.head()

[Figure: first five rows of creditcard.csv]

Standardize the unscaled Amount feature (and drop Time)

from sklearn.preprocessing import StandardScaler
std = StandardScaler()
# scale Amount to zero mean / unit variance and insert it as the first column
data.insert(0,'Norm_Amount',std.fit_transform(np.array(data['Amount']).reshape(-1,1)))
data = data.drop(['Amount','Time'],axis=1)  # drop the raw Amount and the Time column

[Figure: dataset with Norm_Amount inserted and Amount/Time dropped]
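A small caveat, not part of the original workflow: fitting the scaler on the whole dataset leaks test-set statistics into training. A stricter (hypothetical) variant splits first and fits the scaler on the training rows only:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

raw = pd.read_csv('./creditcard.csv')
train_df, test_df = train_test_split(raw, test_size=0.3, random_state=0)

std = StandardScaler()
train_amt = std.fit_transform(train_df[['Amount']])  # fit + transform on train only
test_amt = std.transform(test_df[['Amount']])        # transform-only on test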

Check the class balance of the dataset

class_counts = data['Class'].value_counts().sort_index()
class_counts.plot(kind='bar')
plt.xlabel('0 or 1')
plt.ylabel('numbers')

[Figure: bar chart of class counts, heavily skewed toward class 0]

  • The classes here are extremely imbalanced, so we consider oversampling or undersampling:
    • Undersampling: randomly draw as many class-0 samples as there are class-1 samples.
    • Oversampling: generate additional class-1 samples until the counts match (or nearly match) the class-0 count.

Balance the classes with undersampling

X = data.loc[:,data.columns != 'Class']
y = data.loc[:,data.columns == 'Class']
one_index = data[data.Class == 1].index
zero_index = data[data.Class == 0].index
# draw as many class-0 rows as there are class-1 rows (492 in this dataset)
random_zero_index = np.random.choice(zero_index,len(one_index),replace=False)
all_data_index_1 = np.concatenate([one_index,random_zero_index])
data_under = data.loc[all_data_index_1,:]  # .loc: these are index labels, not positions
data_under

[Figure: the 984-row undersampled dataset]
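For reference, the same undersampling can be done with imbalanced-learn's RandomUnderSampler, which handles the index bookkeeping automatically; a sketch:

from imblearn.under_sampling import RandomUnderSampler

# sampling_strategy=1.0 draws class-0 rows down to the class-1 count
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=0)
X_under_alt, y_under_alt = rus.fit_resample(X, y.values.ravel())
print(pd.Series(y_under_alt).value_counts())  # should be 492 vs 492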

X_under = data_under.loc[:,data_under.columns != 'Class']
Y_under = data_under.loc[:,data_under.columns == 'Class']
display(X_under,Y_under)

[Figure: X_under and Y_under]

Y_under['Class'].value_counts().sort_index().plot(kind='bar')
Y_under['Class'].value_counts()

[Figure: balanced 492/492 class distribution]

Cross-validate a logistic regression to pick the best C

  • Regularization penalty: guards against overfitting.
  • Penalty weight: e.g. loss = loss + λ/2 · ‖w‖² (L2 regularization, where λ sets the penalty strength; note that sklearn's C is the inverse, C = 1/λ). See the sketch after this list.
  • Find the value that maximizes the model's recall.
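A tiny worked sketch of the penalty term (all numbers made up; remember sklearn's C is 1/λ, so a smaller C means a stronger penalty):

import numpy as np

w = np.array([0.5, -2.0, 1.0])  # hypothetical weight vector
data_loss = 0.42                # hypothetical unregularized loss

lam = 10.0                                          # lambda = penalty strength
l2_loss = data_loss + lam * 0.5 * np.sum(w ** 2)    # L2: lambda/2 * ||w||^2
l1_loss = data_loss + lam * np.sum(np.abs(w))       # L1: lambda * ||w||_1
print(l2_loss, l1_loss)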
from sklearn.model_selection import train_test_split,cross_val_score,KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,recall_score,classification_report
# split the original dataset (its test split will be used for final evaluation)
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.3,random_state=0)
# split the undersampled dataset
X_under_train,X_under_test,Y_under_train,Y_under_test = train_test_split(X_under,Y_under,test_size=0.3,random_state=0)
def print_KFold_scores(x_train,y_train):
    fold = KFold(5,shuffle=False)
    C_punish = [0.01,0.1,1,10,100]  # candidate C values (C = 1/lambda)
    recall_tb = pd.DataFrame(columns=['C_punish','Mean_recall'])
    recall_lst = []
    for c in C_punish:
        # the default lbfgs solver does not support l1; liblinear does
        lr = LogisticRegression(penalty='l1',C=c,solver='liblinear')
        lst = []
        print("--------- C =",c)
        for index,(train_idx,test_idx) in enumerate(fold.split(x_train)):
            lr.fit(x_train.iloc[train_idx,:].values,y_train.iloc[train_idx,:].values.ravel())
            pred = lr.predict(x_train.iloc[test_idx,:].values)
            rs = recall_score(y_train.iloc[test_idx,:].values,pred)
            lst.append(rs)
            print("fold",index+1,"recall:",rs)
        recall_lst.append(sum(lst)/len(lst))
    recall_tb['Mean_recall'] = recall_lst
    recall_tb['C_punish'] = C_punish
    return recall_tb

bst_recall = print_KFold_scores(X_under_train,Y_under_train)
bst_recall

[Figure: mean recall for each C on the undersampled data]
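The same C search could also be written with sklearn's GridSearchCV scoring on recall; a sketch under the same 5-fold setup:

from sklearn.model_selection import GridSearchCV

# grid-search C with recall as the selection metric
grid = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'),
                    {'C': [0.01, 0.1, 1, 10, 100]},
                    scoring='recall', cv=5)
grid.fit(X_under_train, Y_under_train.values.ravel())
print(grid.best_params_, grid.best_score_)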

Model evaluation: the confusion matrix

import itertools
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
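For reference, scikit-learn (>= 1.0) ships an equivalent built-in plotting helper; a minimal sketch with toy labels:

from sklearn.metrics import ConfusionMatrixDisplay

# toy labels/predictions, just to show the call
y_true = [0, 0, 1, 1]
y_hat = [0, 1, 1, 1]
ConfusionMatrixDisplay.from_predictions(y_true, y_hat, display_labels=[0, 1])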
  • Predict on the undersampled test set:
lr = LogisticRegression(C=0.01,penalty='l1',solver='liblinear')
lr.fit(X_under_train,Y_under_train.values.ravel())
y_pred = lr.predict(X_under_test)
conf = confusion_matrix(Y_under_test,y_pred)
plot_confusion_matrix(conf,[0,1])

[Figure: confusion matrix on the undersampled test set]

  • We can see that recall is good here, and the false-positive rate (legitimate transactions flagged as fraud) is also acceptable.

Apply the model to the test split of the full dataset

lr = LogisticRegression(C=0.01,penalty='l1',solver='liblinear')
lr.fit(X_under_train,Y_under_train.values.ravel())
y_pred = lr.predict(X_test)
conf = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf,[0,1])

print("Recall",conf[1,1]/(conf[1,1]+conf[1,0]))
print("Accuracy",(conf[1,1]+conf[0,0])/(conf[1,1]+conf[0,1]+conf[1,0]+conf[0,0]))

[Figure: confusion matrix on the full test set]

  • Recall 0.9183673469387755
  • Accuracy 0.8816637992579849
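These two numbers can be cross-checked with sklearn's built-in metric functions:

from sklearn.metrics import recall_score, accuracy_score

print("Recall:", recall_score(Y_test, y_pred))
print("Accuracy:", accuracy_score(Y_test, y_pred))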

But what if we had modeled on the full dataset from the start? Let's take a look.

lr = LogisticRegression(max_iter=1000)  # raise max_iter so lbfgs converges on the full data
lr.fit(X_train,Y_train.values.ravel())
y_pred = lr.predict(X_test)
conf = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf,[0,1])
print("Recall",conf[1,1]/(conf[1,1]+conf[1,0]))
print("Accuracy",(conf[1,1]+conf[0,0])/(conf[1,1]+conf[0,1]+conf[1,0]+conf[0,0]))

[Figure: confusion matrix for the model trained on the full, imbalanced data]

As the results show, modeling directly on the raw data gives very poor recall, even though accuracy is high. But consider the goal of fraud detection: is it more important to catch as many potentially fraudulent users as possible, or to maximize accuracy? As with cancer screening, recall is the metric we favor here.

That said, the undersampled model misclassifies an unreasonable number of honest customers as high-risk, and the follow-up cost of those false positives in real life would be too high. So we explore two refinements: adjusting the sigmoid threshold (moving it above or below 0.5), and oversampling with SMOTE.

Tune the sigmoid threshold to find the best recall/precision tradeoff

lr = LogisticRegression(C=0.01,penalty='l1',solver='liblinear')
lr.fit(X_under_train,Y_under_train.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_under_test.values)
# use predict_proba instead of predict: it returns the sigmoid probabilities,
# so we can apply our own decision threshold

# candidate thresholds between 0 and 1
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i

    plt.subplot(3,3,j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y_under_test,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold >= %s'%i)
Output:
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.9863945578231292
Recall metric in the testing dataset:  0.9251700680272109
Recall metric in the testing dataset:  0.8639455782312925
Recall metric in the testing dataset:  0.8299319727891157
Recall metric in the testing dataset:  0.7482993197278912
Recall metric in the testing dataset:  0.5850340136054422

[Figure: confusion matrices for thresholds 0.1 through 0.9]
From these results, 0.4 may well be the best threshold (these experiments are still part of model tuning, so they run on the undersampled test set).
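Instead of scanning a fixed grid of thresholds, the full tradeoff can be traced with sklearn's precision_recall_curve; a short sketch on the same undersampled probabilities:

from sklearn.metrics import precision_recall_curve

# precision/recall at every candidate threshold implied by the scores
precision, recall, thresh = precision_recall_curve(
    Y_under_test.values.ravel(), y_pred_undersample_proba[:,1])
plt.plot(thresh, precision[:-1], label='precision')
plt.plot(thresh, recall[:-1], label='recall')
plt.xlabel('threshold')
plt.legend()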

Testing with SMOTE oversampling

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
credit_cards=pd.read_csv('creditcard.csv')

columns=credit_cards.columns

# all columns except the last one ('Class') are features
features_columns=columns.delete(len(columns)-1)

features=credit_cards[features_columns]
labels=credit_cards['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.3, 
                                                                            random_state=0)
# fit SMOTE on the training split only, so no synthetic samples leak into the test set
oversampler=SMOTE(random_state=0)

os_features,os_labels=oversampler.fit_resample(features_train,labels_train)
print(len(os_labels[os_labels==1]),len(os_labels[os_labels==0]))
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = print_KFold_scores(os_features,os_labels)
best_c

[Figure: mean recall for each C on the SMOTE-resampled training data]

import numpy as np
# l2 works with the default lbfgs solver; max_iter raised so it converges on the large resampled set
lr = LogisticRegression(C = 0.10, penalty = 'l2', max_iter = 1000)
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

[Figure: confusion matrix for the SMOTE-trained model on the untouched test set]

As we can see, predicting with the model trained on the SMOTE-resampled data gives not only respectable recall but also much higher accuracy, with far fewer legitimate customers misflagged. In summary: prefer SMOTE oversampling.
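One caution worth adding: when cross-validating with SMOTE, the resampling should happen inside each training fold so synthetic points never leak into the validation fold. imblearn's Pipeline does exactly this; a minimal sketch reusing the split from above:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE runs only on the training folds during each CV iteration
pipe = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('lr', LogisticRegression(C=0.1, penalty='l2', max_iter=1000)),
])
scores = cross_val_score(pipe, features_train, labels_train, scoring='recall', cv=5)
print(scores.mean())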
