Logistic Regression in Python: A Credit Card Fraud Detection Example

爆炒小青蛙

Original article: https://blog.csdn.net/ismedal/article/details/79703200

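Credit card fraud detection is a classic example; this post records a Python implementation.

Import the three basic packages, read the data, and observe how imbalanced the classes are: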

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

path=r"D:\learning\data_for_py\creditcard.csv"
data=pd.read_csv(path)
data.head()

# Count the samples in each class
count_classes=data["Class"].value_counts().sort_index()
count_classes
# The classes turn out to be heavily imbalanced

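The features are on very different scales. A variable with a particularly large variance (here Amount) is standardized so that all variables carry comparable weight:

from sklearn.preprocessing import StandardScaler
# A pandas Series has no reshape method, so convert to a NumPy array first
data["normAmount"]=StandardScaler().fit_transform(data["Amount"].values.reshape(-1,1))
data=data.drop(["Time","Amount"],axis=1)

Note: in reshape(-1,1), the second argument fixes the result at one column, and the -1 tells NumPy to infer the number of rows.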
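As a small illustration of this note, the sketch below uses NumPy only and a made-up array of amounts (hypothetical values, not taken from creditcard.csv). It shows the shape that reshape(-1,1) produces and then standardizes the column by hand with (x - mean) / std, using the population standard deviation, which is the formula StandardScaler applies.

```python
import numpy as np

# Hypothetical transaction amounts, for illustration only
amounts = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# -1 lets NumPy infer the row count once the column count is fixed at 1
column = amounts.reshape(-1, 1)
print(column.shape)  # (5, 1)

# Standardize by hand: subtract the mean, divide by the population std (ddof=0)
standardized = (column - column.mean(axis=0)) / column.std(axis=0)
print(standardized.mean(), standardized.std())
```

After this step the column has mean close to 0 and standard deviation close to 1, up to floating-point error.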

To deal with the class imbalance, an undersampling strategy is used here:

X=data.iloc[:,data.columns!="Class"]
y=data.iloc[:,data.columns=="Class"]
number_records_fraud=len(y[y.iloc[:,0]==1])
# The line above can also be written as: len(data[data.Class==1])
# Index labels of the fraud samples
fraud_indices=np.array(data[data.Class==1].index)
# Index labels of the normal samples
normal_indices=data[data.Class==0].index
# Randomly draw number_records_fraud normal samples, without replacement
random_normal_indices=np.random.choice(normal_indices,number_records_fraud,replace=False)
random_normal_indices=np.array(random_normal_indices)
# Merge the two groups: 984 samples in total
under_sample_indices=np.concatenate([fraud_indices,random_normal_indices])
print(len(under_sample_indices))
# Select the rows of data with the merged index labels (.loc, because these are labels rather than positions)
under_sample_data=data.loc[under_sample_indices,:]
X_undersample=under_sample_data.iloc[:,data.columns!="Class"]
y_undersample=under_sample_data.iloc[:,data.columns=="Class"]
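The same undersampling idea can be checked on a toy label vector. This is a minimal sketch with hypothetical class counts (1000 normal, 20 fraud) and NumPy's default_rng generator, which the original code does not use; it only shows that the result is balanced and contains no duplicate rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 1000 normal (0) and 20 fraud (1) samples
labels = np.array([0] * 1000 + [1] * 20)

fraud_idx = np.flatnonzero(labels == 1)
normal_idx = np.flatnonzero(labels == 0)

# Draw as many normal samples as there are fraud samples, without replacement
sampled_normal_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)

under_idx = np.concatenate([fraud_idx, sampled_normal_idx])
balanced = labels[under_idx]
print(len(under_idx), balanced.mean())  # 40 0.5
```

The price of this balance is that most of the normal samples are discarded, which is worth keeping in mind when the results are evaluated on the full test set later.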

Split the data into training and test sets in preparation for cross-validation:

# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
# Split the original data set (used for testing and prediction)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)
print(len(X_train),len(X_test))
print(len(y_train),len(y_test))
# Split the undersampled data set
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample=train_test_split(X_undersample,y_undersample,test_size=0.3,random_state=0)
print(len(X_train_undersample),len(X_test_undersample))
print(len(y_train_undersample),len(y_test_undersample))

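Now build the model. This relies on sklearn, the key machine learning package here. For how its functions are used, see the official documentation at http://scikit-learn.org/0.15/user_guide.html , which is an excellent resource.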

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score

def printing_Kfold_scores(x_train_data,y_train_data):
    fold=KFold(n_splits=5,shuffle=False)
    # Split the training set into 5 folds. fold.split() yields (training indices, validation indices) pairs,
    # all drawn from x_train_data (the original training set); enumerate adds the counter `iteration`.
    c_param_range=[0.01,0.1,1,10,100] # candidate values of C, the inverse of the regularization strength: a smaller C penalizes large coefficients more heavily
    # Results table that stores the mean recall for each value of C
    results_table=pd.DataFrame(index=range(len(c_param_range)),columns=["C_parameter","Mean recall score"])
    results_table["C_parameter"]=c_param_range
    j=0
    for c_param in c_param_range:
        print("---------------------")
        print("C parameter:",c_param)
        print("---------------------")
        print('')
        
        recall_accs=[]
        for iteration,indices in enumerate(fold.split(x_train_data),start=1):
            # LogisticRegression arguments: C and the penalty type; the liblinear solver supports the L1 penalty
            lr=LogisticRegression(C=c_param,penalty="l1",solver="liblinear")
            
            # Train on the rows of x_train_data listed in indices[0]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            
            # Predict on the validation rows listed in indices[1]
            y_pred_undersample=lr.predict(x_train_data.iloc[indices[1],:])
            
            # Recall on the validation fold; recall_score takes the true labels first, then the predicted labels
            recall_acc=recall_score(y_train_data.iloc[indices[1],:].values.ravel(),y_pred_undersample)
            recall_accs.append(recall_acc)
            
            print("Iteration",iteration,":recall score=",recall_acc)
            
        results_table.loc[j,"Mean recall score"]=np.mean(recall_accs) # mean recall over the 5 folds for this C
        j += 1
        print("")
        print("Mean Recall Score:",np.mean(recall_accs))
        print("")
        
    # Pick the C with the highest mean recall (the column has object dtype, so cast to float first)
    best_c=results_table.loc[results_table["Mean recall score"].astype(float).idxmax(),"C_parameter"]
    print("********The best C parameter is ",best_c)
    print(results_table)
    return best_c

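[Note]: While reading the code above, I was confused for a long time about what KFold actually produces, so the "for iteration,indices in enumerate(...)" loop over the folds did not make much sense to me. The official documentation cleared up how KFold works, and a few examples make it obvious: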

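import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
samples = np.arange(10)
for train, test in kf.split(samples):
    print("%s %s" % (train, test))

Output:

[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]

This runs 5-fold cross-validation on 10 samples. For each fold, KFold gives the sample indices of the training set and the sample indices of the test set.

for index, indices in enumerate(kf.split(samples)):
    print("%s %s" % (index, indices))

Output:

0 (array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1]))
1 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3]))
2 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5]))
3 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7]))
4 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))

In the dictionary sense, enumerate means to list items one by one.
Given an iterable object (such as a list or a string), enumerate pairs each element with an index, so the index and the value are available at the same time.

It is mostly used to get a counter inside a for loop.

for index, indices in enumerate(kf.split(samples)):
    print("%s" % indices[0])
# This prints the training-set sample indices of each fold

for index, indices in enumerate(kf.split(samples)):
    print("%s" % indices[1])
# This prints the test-set sample indices of each fold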
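To make the mechanics concrete without sklearn, here is a minimal pure-Python sketch of an unshuffled k-fold split. The helper name contiguous_kfold is hypothetical, and the sketch assumes the sample count is divisible by the number of folds, which sklearn's KFold does not require.

```python
def contiguous_kfold(n_samples, n_folds):
    """Yield (train, test) index lists for an unshuffled k-fold split.

    Assumes n_samples is divisible by n_folds, to keep the sketch short.
    """
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test

# enumerate(..., start=1) pairs each split with a 1-based counter
folds = list(enumerate(contiguous_kfold(10, 5), start=1))
for iteration, (train, test) in folds:
    print(iteration, train, test)
```

Each element of folds is a (counter, (train, test)) pair, which is why indices[0] and indices[1] select the training and validation indices in the cross-validation loop.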

Back to the main topic. With printing_Kfold_scores written, running the statement below gives the recall for each value of C:

best_c=printing_Kfold_scores(X_train_undersample,y_train_undersample)

The resulting results_table is:

   C_parameter Mean recall score
0         0.01          0.958143
1         0.10          0.897571
2         1.00          0.912173
3        10.00          0.920683
4       100.00          0.921333
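Picking the best C from these numbers is a one-line argmax. The sketch below copies the mean recall values from the table above into a plain dict (the dict and variable names are only for illustration) and selects the C with the highest mean recall.

```python
# Mean recall per C, copied from the undersampled results table
mean_recall = {0.01: 0.958143, 0.1: 0.897571, 1: 0.912173, 10: 0.920683, 100: 0.921333}

# The key whose value is largest
best_c = max(mean_recall, key=mean_recall.get)
print(best_c)  # 0.01
```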

# Modeling on the original data set gives a markedly lower recall, so undersampling does better here than no resampling
best_c=printing_Kfold_scores(X_train,y_train)

   C_parameter Mean recall score
0         0.01          0.559568
1         0.10           0.59531
2         1.00          0.612646
3        10.00          0.618479
4       100.00          0.618479

Next, plot the confusion matrix:

# Plot the confusion matrix
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
   
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

best_c=0.01
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear') # the best C on the undersampled data is 0.01; liblinear supports the L1 penalty
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample) # predict returns class labels, not probabilities

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample) # arguments: true labels of the test set, then predicted labels
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1])) # recall: TP/(TP+FN)
# Here: 137/(10+137)=0.9319727891156463
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
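The recall arithmetic in the comment above can be reproduced directly. This is a small sketch with a hypothetical helper named recall; the counts 137 and 10 are the true positives and false negatives quoted in that comment.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of actual fraud cases that were caught."""
    return true_positives / (true_positives + false_negatives)

# Counts quoted above: 137 fraud cases caught, 10 missed
r = recall(137, 10)
print(round(r, 4))  # 0.932
```

Recall says nothing about how many normal transactions were wrongly flagged, which is why the confusion matrix is inspected as well.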

# When modeling on the original data set, the best C is 10
#best_c=10
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel()) # the model is fitted on the undersampled training set
y_pred = lr.predict(X_test) # predict on X_test (the full test set), not X_test_undersample

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
# There are a great many false positives: more than ten thousand normal transactions are flagged as fraud

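With the default threshold of 0.5, recall is high but far too many normal transactions are flagged, so the precision of the model is poor. Below are the confusion matrices for a range of thresholds: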

lr = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample) # column 1 holds the predicted probability of Class=1

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i # boolean array: True where the fraud probability exceeds the threshold
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold > %s'%i)
plt.show()
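The trade-off behind the threshold sweep can be seen on a handful of numbers. The probabilities and labels below are hypothetical, chosen only for illustration, and recall_precision is a hypothetical helper; the point is that raising the threshold lowers recall while raising precision.

```python
# Hypothetical fraud probabilities and true labels, for illustration only
probs  = [0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95]
labels = [0,    0,    0,    1,    0,    1,    1,    1]

def recall_precision(threshold):
    """Return (recall, precision) when predicting fraud for probs > threshold."""
    preds = [p > threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return recall, precision

results = {t: recall_precision(t) for t in (0.3, 0.5, 0.7)}
for t, (rec, prec) in results.items():
    print(t, rec, prec)
```

On this toy data, the threshold 0.3 catches every fraud case but flags two normal ones, while 0.7 flags no normal transactions but misses half of the fraud cases.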

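Next, oversampling is used to deal with the class imbalance. Oversampling here relies on the SMOTE algorithm, so the corresponding packages need to be imported: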

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

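[Note]: SMOTE oversampling generates new minority-class samples (here, samples with Class=1) from the existing ones. For each minority sample it computes the distances to the other minority samples and keeps the nearest neighbours. To create a new sample, it picks one of those neighbours, multiplies the difference between the neighbour and the original sample by a random number in (0,1), and adds the result to the original sample.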
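The interpolation step described in this note can be sketched in a few lines of pure Python. The helper smote_sketch is hypothetical and much simpler than the imblearn implementation; it only shows how a synthetic point is placed on the segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random

def smote_sketch(minority, k, n_new, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Hypothetical helper for illustration, not the imblearn implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        # Move from x towards the neighbour by a random fraction of the gap
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Four hypothetical minority points at the corners of the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, k=2, n_new=5)
print(new_points)
```

With k=2, the two nearest neighbours of each corner are the adjacent corners, so every synthetic point falls on an edge of the square.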

The remaining steps are straightforward:

# Assumption: credit_cards is the raw data set, loaded from the same CSV file as before
credit_cards=pd.read_csv(path)

columns=credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain features columns
features_columns=columns.delete(len(columns)-1)

features=credit_cards[features_columns]
labels=credit_cards['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.2, 
                                                                            random_state=0)

oversampler=SMOTE(random_state=0)
# fit_resample replaces fit_sample, which is no longer available in recent imbalanced-learn releases
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features,os_labels)

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

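With oversampling, recall drops slightly, but the precision of the model improves considerably and far fewer normal transactions are misclassified as fraud. For imbalanced problems like this one, oversampling is the preferred choice whenever the computational cost is acceptable.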
