Credit Fraud || Dealing with Imbalanced Datasets

1. Load the data

import pandas as pd
df=pd.read_csv(r'C:\Users\Administrator\Desktop\project\creditcard.csv')
df

[Figure: preview of the creditcard dataframe]
2. Data processing

  • Check for missing values
df.isnull().sum().max()
  • Look at the label distribution
import seaborn as sns
import matplotlib.pyplot as plt
print('No Frauds',round((df['Class'].value_counts()[0]/df.shape[0])*100,2),'% of the dataset')
print('Frauds',round((df['Class'].value_counts()[1]/df.shape[0])*100,2),'% of the dataset')

colors= ["#0101DF", "#DF0101"]
plt.title("Class Distributions \n(0:Fraud    ||     1:Fraud)",fontsize=14)
sns.countplot(x='Class',data=df,palette=colors)

Reference: seaborn's countplot documentation
[Figure: class distribution countplot]

  • Look at the distributions of the Amount and Time features
fig,ax=plt.subplots(1,2,figsize=(18,4))
amount_value=df['Amount'].values
time_value=df['Time'].values
ax[0].set_title('Distribution of Transaction Amount')
ax[0].set_xlim(min(amount_value),max(amount_value))
sns.distplot(amount_value,color='g',ax=ax[0])

ax[1].set_title('Distribution of Transaction Time')
ax[1].set_xlim(min(time_value),max(time_value))
sns.distplot(time_value,color='red',ax=ax[1])

[Figure: Transaction Amount and Transaction Time distributions]

  • Scale the feature data with RobustScaler (the data contains outliers)
from sklearn.preprocessing import StandardScaler,RobustScaler

std_scaler=StandardScaler()
rob_scaler=RobustScaler()

df['amount_scaler']=rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['time_scaler']=rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

drop=['Time','Amount']
df.drop(labels=drop,axis=1,inplace=True)
df.head()

[Figure: df.head() with the scaled columns]
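RobustScaler centers each feature on its median and scales by the interquartile range, so a single extreme transaction barely moves the result, whereas StandardScaler's mean and standard deviation get dragged toward outliers. A minimal sketch on made-up values (not the credit card data) to illustrate the difference:

# Illustrative only: compare the two scalers on toy data with one outlier
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier
print(StandardScaler().fit_transform(toy).ravel())  # inliers squashed into a tiny range near -0.5
print(RobustScaler().fit_transform(toy).ravel())    # inliers keep a sensible spread (-1 to 0.5)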

  • Move the new columns to specific positions
scaled_amount = df['amount_scaler']
scaled_time = df['time_scaler']

df.drop(['amount_scaler', 'time_scaler'], axis=1, inplace=True)
# insert the new columns at the front of the dataframe
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)

# Amount and Time are Scaled!
df.head()

[Figure: df.head() with scaled_amount and scaled_time in front]

  • Split the data (stratified)
from sklearn.model_selection import StratifiedKFold

X=df.drop(['Class'],axis=1)
Y=df['Class']

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
for train_index, test_index in sss.split(X, Y):
    original_Xtrain, original_ytrain = X.iloc[train_index], Y.iloc[train_index]
    original_Xtest, original_ytest = X.iloc[test_index], Y.iloc[test_index]
  • Check the label distribution of the splits (np.unique)
# Turn into arrays
original_Xtrain = original_Xtrain.values
original_ytrain = original_ytrain.values
original_Xtest = original_Xtest.values
original_ytest = original_ytest.values

import numpy as np
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)

print('Label Distributions: \n')
print(train_counts_label / len(original_ytrain))
print(test_counts_label / len(original_ytest))

[Figure: train and test label distributions]

Handling the imbalance with undersampling

Why? Remember: even though we split the data when applying random under- or over-sampling, we want to test the final model on the original test set, not on a test set produced by either technique. (We pick the model using the constructed 1:1 data, but we still test on the original data.)

The classes are severely imbalanced, so we build a subsample in which fraud and non-fraud transactions are in a 50/50 ratio, i.e. the subsample contains the same number of fraud and non-fraud transactions. We cannot use the original dataframe directly, because that would cause the following problems:

Overfitting: our classifier would assume that in most cases there is no fraud, whereas we want a model that is confident when fraud actually occurs.

Wrong correlations: although we don't know what the "V" features stand for, it is useful to know how each of them influences the outcome (fraud or no fraud). With an imbalanced dataframe we cannot see the true correlations between the classes and the features.

scaled_amount and scaled_time are the columns with scaled values (the previously unscaled features, scaled with RobustScaler).

  • Create the subsample
df = df.sample(frac=1)  # shuffle the rows (frac = fraction of rows to sample; frac=1 keeps them all)
fraud_df = df[df['Class'] == 1]
print(fraud_df.shape)
non_fraud_df = df[df['Class'] == 0][:492]
print(non_fraud_df.shape)
normal_distributed_df = pd.concat([fraud_df, non_fraud_df])
new_df = normal_distributed_df.sample(frac=1)
print('Distribution of the Classes in the subsample dataset')
print(new_df['Class'].value_counts() / len(new_df))

sns.countplot(x='Class', data=new_df, palette=colors)
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()

[Figure: balanced class countplot]
Correlation matrices

Correlation matrices are key to understanding the data. We want to know whether some features heavily influence whether a given transaction is fraud. It is important to use the right dataframe (the subsample), so that we can see which features have a strong positive or negative correlation with fraud transactions.

Summary and interpretation:

Negative correlations: V17, V14, V12 and V10 are negatively correlated. The lower these values are, the more likely the transaction is fraud.

Positive correlations: V2, V4, V11 and V19 are positively correlated. The higher these values are, the more likely the transaction is fraud.

Boxplots: we will use boxplots to get a better feel for how these features are distributed in fraud versus non-fraud transactions.

Note: we must compute the correlation matrix on the subsample; otherwise it is distorted by the severe class imbalance of the original dataframe.

fig, ax = plt.subplots(2, 1, figsize=(24, 20))

corr = df.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size': 20}, ax=ax[0])
ax[0].set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=21)

sub_sample_corr = new_df.corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', annot_kws={'size': 20}, ax=ax[1])
ax[1].set_title("SubSample Correlation Matrix \n (use for reference)", fontsize=21)
plt.show()

[Figure: correlation heatmaps of the full data and of the subsample]

  • Feature selection based on the Pearson correlation coefficient
from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_keep=[], threshold=None):
        # store the response variable
        self.response = response
        # holds the names of the feature columns to keep
        self.cols_keep = cols_keep
        # store the correlation threshold
        self.threshold = threshold

    def transform(self, X):
        # keep only the selected columns
        return X[self.cols_keep]

    def fit(self, X, *_):
        # new DataFrame holding the features plus the response
        df = pd.concat([X, self.response], axis=1)
        # names of the columns whose |correlation with the response| exceeds the threshold
        self.cols_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # keep only feature columns; drop the response itself
        self.cols_keep = [c for c in self.cols_keep if c in X.columns]
        return self

ccc = CustomCorrelationChooser(response=df['Class'], threshold=.2)
ccc.fit(X)
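To see what the chooser kept, inspect cols_keep and call transform; a quick check, assuming X is the full feature frame defined in the splitting step above (X_selected is just an illustrative name):

# columns whose absolute Pearson correlation with Class exceeds the 0.2 threshold
print(ccc.cols_keep)
# transform() keeps only the selected columns
X_selected = ccc.transform(X)
print(X_selected.shape)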

[Figure: output of ccc.fit(X)]

  • Outlier detection

Negatively correlated with our class (the lower the feature value, the more likely the transaction is fraud):

# Anomaly detection: boxplots of the negatively correlated features
fig,ax=plt.subplots(1,4,figsize=(20,4))

sns.boxplot(x='Class',y='V17',data=new_df,palette=colors,ax=ax[0])
ax[0].set_title('V17 Class Negative Correlation')

sns.boxplot(x='Class',y='V14',data=new_df,palette=colors,ax=ax[1])
ax[1].set_title('V14 Class Negative Correlation')

sns.boxplot(x='Class',y='V12',data=new_df,palette=colors,ax=ax[2])
ax[2].set_title('V12 Class Negative Correlation')

sns.boxplot(x='Class',y='V10',data=new_df,palette=colors,ax=ax[3])
ax[3].set_title('V10 Class Negative Correlation')

plt.show()

# Positively correlated features
f, axes = plt.subplots(ncols=4, figsize=(20,4))

# Positive correlations (The higher the feature the probability increases that it will be a fraud transaction)
sns.boxplot(x="Class", y="V11", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V11 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V4", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V4 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V2", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V19", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()

[Figure: boxplots of the negatively and positively correlated features]

  • The t-SNE algorithm:
    t-SNE is a machine-learning algorithm for dimensionality reduction, proposed by Laurens van der Maaten and Geoffrey Hinton in 2008. It is a non-linear technique that is well suited to reducing high-dimensional data to 2 or 3 dimensions for visualization.
    t-SNE can cluster the fraud and non-fraud cases in the dataset quite accurately.
    Even though the subsample is very small, t-SNE detects the clusters reliably in every scenario (I shuffled the dataset before running t-SNE).

  • What is the t-distribution?
    1. It is symmetric about the y-axis.
    2. It is more spread out than the normal distribution, so the curve looks slightly "shorter and fatter".
    3. It has a single parameter, the degrees of freedom.
  • Visualize how close these features are to a normal distribution

from scipy.stats import norm  # fit a standard normal (mean 0, variance 1); x-axis: deviation from the mean, y-axis: probability density


fig, ax = plt.subplots(1, 3, figsize=(20, 3))

v14_fraud_dist = new_df['V14'][new_df['Class'] == 1].values
sns.distplot(v14_fraud_dist, fit=norm, ax=ax[0], color='g')
ax[0].set_title('V14 Distribution \n (Fraud Transactions)')

v12_fraud_dist = new_df['V12'][new_df['Class'] == 1].values
sns.distplot(v12_fraud_dist, fit=norm, ax=ax[1], color='r')
ax[1].set_title('V12 Distribution \n (Fraud Transactions)')

v10_fraud_dist = new_df['V10'][new_df['Class'] == 1].values
sns.distplot(v10_fraud_dist, fit=norm, ax=ax[2], color='b')
ax[2].set_title('V10 Distribution \n (Fraud Transactions)')

[Figure: V14, V12 and V10 fraud distributions with fitted normal curves]
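Building on the boxplots and distributions above, outliers can be trimmed with the classic 1.5 × IQR rule. A minimal sketch for V14 (the 1.5 multiplier is the conventional choice, and trimmed_df is just an illustrative name; the rest of the article keeps using new_df unchanged):

# 1.5 * IQR cut-offs for V14 within the fraud class
v14_fraud = new_df['V14'][new_df['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
print('V14 outlier cut-offs: {:.2f} / {:.2f}'.format(lower, upper))

# drop the rows outside the cut-offs (on a copy, to leave new_df intact)
trimmed_df = new_df.drop(new_df[(new_df['V14'] > upper) | (new_df['V14'] < lower)].index)
print('Rows after V14 outlier removal:', len(trimmed_df))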

  • Time taken by the different dimensionality reductions (t-SNE, PCA, TruncatedSVD)
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA,TruncatedSVD
import time
X=new_df.drop('Class',axis=1)
y=new_df['Class']

t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print('TSNE took {:.2f} s'.format(t1 - t0))

t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print('PCA took {:.2f} s'.format(t1 - t0))

t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print('TruncatedSVD took {:.2f} s'.format(t1 - t0))

[Figure: dimensionality-reduction timing output]

  • Visualize the reduced dimensions
import matplotlib.patches as mpatches

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24,6))
f.suptitle('Clusters using Dimensionality Reduction', fontsize=14)

blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')


# t-SNE scatter plot
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('t-SNE', fontsize=14)

ax1.grid(True)

ax1.legend(handles=[blue_patch, red_patch])


# PCA scatter plot
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('PCA', fontsize=14)

ax2.grid(True)

ax2.legend(handles=[blue_patch, red_patch])

# TruncatedSVD scatter plot
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax3.set_title('Truncated SVD', fontsize=14)

ax3.grid(True)

ax3.legend(handles=[blue_patch, red_patch])

plt.show()

[Figure: t-SNE, PCA and TruncatedSVD cluster scatter plots]

  • Hyper-parameter tuning with GridSearchCV
    The larger the gap between the training score and the cross-validation score, the more likely the model is overfitting (high variance).
    If both the training and the cross-validation scores are low, the model is underfitting (high bias).
    Using the original data would cause overfitting, so we train on the subsample (undersampling).
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report

# split the balanced subsample 80/20; these X_train/y_train are used by everything below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.values, X_test.values
y_train, y_test = y_train.values, y_test.values

# classifiers
classifiers = {
    'LogisticRegression': LogisticRegression(),
    'KNearest': KNeighborsClassifier(),
    'Support Vector Classifier': SVC(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
}

for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    training_score = cross_val_score(classifier, X_train, y_train, cv=5)
    print("Classifiers: ", classifier.__class__.__name__, "Has a training score of", round(training_score.mean(), 2) * 100, "% accuracy score")

[Figure: baseline cross-validation scores]
Using the original data would cause overfitting, so we tune on the undersampled subsample.

# Use GridSearchCV to find the best parameters

# Logistic Regression
log_reg_params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# liblinear supports both the l1 and l2 penalties
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=log_reg_params)
grid_log_reg.fit(X_train, y_train)
# We automatically get the logistic regression with the best parameters.
log_reg = grid_log_reg.best_estimator_

# Support Vector Classifier
svc_params = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), param_grid=svc_params)
grid_svc.fit(X_train, y_train)
svc = grid_svc.best_estimator_

# KNearest
knn_params = {'n_neighbors': list(range(2, 5, 1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid=knn_params)
grid_knn.fit(X_train, y_train)
knn = grid_knn.best_estimator_

# Decision Tree Classifier
tree_params = {'criterion': ['gini', 'entropy'], 'max_depth': list(range(2, 4, 1)),
               'min_samples_leaf': list(range(5, 7, 1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), param_grid=tree_params)
grid_tree.fit(X_train, y_train)
tree = grid_tree.best_estimator_

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print('Logistic Regression Cross Validation Score: {:.2f}%'.format(log_reg_score.mean() * 100))

knn_score = cross_val_score(knn, X_train, y_train, cv=5)
print('Knears Neighbors Cross Validation Score: {:.2f}%'.format(knn_score.mean() * 100))

svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('Support Vector Classifier Cross Validation Score: {:.2f}%'.format(svc_score.mean() * 100))

tree_score = cross_val_score(tree, X_train, y_train, cv=5)
print('DecisionTree Classifier Cross Validation Score: {:.2f}%'.format(tree_score.mean() * 100))

[Figure: tuned cross-validation scores]

  • Evaluate the model (accuracy, precision, recall, F1, AUC)
undersample_X=df.drop(labels='Class',axis=1)
undersample_y=df['Class']

sss=StratifiedKFold(n_splits=5,shuffle=True,random_state=None)
for train_index,test_index in sss.split(undersample_X,undersample_y):
    undersample_Xtrain,undersample_ytrain=undersample_X.iloc[train_index],undersample_y.iloc[train_index]
    undersample_Xtest,undersample_ytest=undersample_X.iloc[test_index],undersample_y.iloc[test_index]
    
undersample_Xtrain=undersample_Xtrain.values
undersample_ytrain=undersample_ytrain.values
undersample_Xtest=undersample_Xtest.values
undersample_ytest=undersample_ytest.values

undersample_accuracy=[]
undersample_precision=[]
undersample_recall=[]
undersample_f1=[]
undersample_auc=[]

from imblearn.under_sampling import NearMiss
# apply the NearMiss undersampling technique

# NearMiss label distribution (just to see how it distributes the labels; we won't use these variables)

# imblearn's NearMiss selects majority-class samples according to a rule chosen via `version`:
# NearMiss-1: keep the majority samples whose average distance to their N closest minority samples is smallest;
# NearMiss-2: keep the majority samples whose average distance to the N farthest minority samples is smallest;
# NearMiss-3: a two-step algorithm: first keep the M nearest majority neighbours of each minority sample,
#             then keep the majority samples whose average distance to their N nearest minority samples is largest.

from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
# sklearn's classification_report prints the main classification metrics (precision, recall, F1, ...) per class.
# Main parameters:
#   y_true: 1-d array of true target values
#   y_pred: 1-d array of values predicted by the classifier
#   labels: optional list of label indices to include in the report
#   target_names: optional list of display names matching the labels (same order)
#   sample_weight: optional array of per-sample weights
#   digits: number of digits for the floating-point output
from collections import Counter
# Counter counts how often each value occurs in an array
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
# A Pipeline glues several processing steps into a single scikit-learn estimator with its own
# fit, predict and score methods; the most common use is to chain a preprocessing step
# (here: resampling) with a supervised model (here: a classifier).

x_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)
# print(type(x_nearmiss))  # <class 'numpy.ndarray'>
print('Nearmiss Label Distributions: {}'.format(Counter(y_nearmiss)))
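# Illustration only (not used below): the three NearMiss variants described
# above differ only in the `version` argument; each yields a balanced label
# distribution but keeps a different subset of majority-class samples.
for v in (1, 2, 3):
    x_v, y_v = NearMiss(version=v).fit_resample(undersample_X.values, undersample_y.values)
    print('NearMiss-{} label distribution: {}'.format(v, Counter(y_v)))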

for train, test in sss.split(undersample_Xtrain, undersample_ytrain):
    # imbalanced_make_pipeline: the pipeline has fit/predict/score; the majority class is
    # resampled inside cross-validation, not before
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), log_reg)
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])

    # score the predictions against the labels of the same held-out fold
    undersample_accuracy.append(undersample_pipeline.score(undersample_Xtrain[test], undersample_ytrain[test]))
    undersample_precision.append(precision_score(undersample_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(undersample_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(undersample_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(undersample_ytrain[test], undersample_prediction))
  • Check for overfitting (learning curves)
# Let's Plot LogisticRegression Learning Curve
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator1, estimator2, estimator3, estimator4, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize=(20,14), sharey=True)
    if ylim is not None:
        plt.ylim(*ylim)
    # First Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator1, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax1.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax1.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax1.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax1.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax1.set_title("Logistic Regression Learning Curve", fontsize=14)
    ax1.set_xlabel('Training size (m)')
    ax1.set_ylabel('Score')
    ax1.grid(True)
    ax1.legend(loc="best")
    
    # Second Estimator 
    train_sizes, train_scores, test_scores = learning_curve(
        estimator2, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax2.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax2.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax2.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax2.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax2.set_title("Knears Neighbors Learning Curve", fontsize=14)
    ax2.set_xlabel('Training size (m)')
    ax2.set_ylabel('Score')
    ax2.grid(True)
    ax2.legend(loc="best")
    
    # Third Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator3, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax3.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax3.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax3.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax3.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax3.set_title("Support Vector Classifier \n Learning Curve", fontsize=14)
    ax3.set_xlabel('Training size (m)')
    ax3.set_ylabel('Score')
    ax3.grid(True)
    ax3.legend(loc="best")
    
    # Fourth Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator4, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax4.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#ff9124")
    ax4.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
    ax4.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124",
             label="Training score")
    ax4.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff",
             label="Cross-validation score")
    ax4.set_title("Decision Tree Classifier \n Learning Curve", fontsize=14)
    ax4.set_xlabel('Training size (m)')
    ax4.set_ylabel('Score')
    ax4.grid(True)
    ax4.legend(loc="best")
    return plt

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
plot_learning_curve(log_reg, knn, svc, tree, X_train, y_train, (0.87, 1.01), cv=cv, n_jobs=4)

[Figure: learning curves of the four classifiers]

  • Plot the ROC curves
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
# Create a DataFrame with all the scores and the classifiers names.

log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5,
                             method="decision_function")

knears_pred = cross_val_predict(knn, X_train, y_train, cv=5)

svc_pred = cross_val_predict(svc, X_train, y_train, cv=5,
                             method="decision_function")

tree_pred = cross_val_predict(tree, X_train, y_train, cv=5)
from sklearn.metrics import roc_auc_score
# roc_auc_score computes the area under the ROC curve from the prediction scores; internally it calls auc

print('Logistic Regression: ', roc_auc_score(y_train, log_reg_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knears_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, tree_pred))
log_fpr, log_tpr, log_threshold = roc_curve(y_train, log_reg_pred)
knn_fpr, knn_tpr, knn_threshold = roc_curve(y_train, knears_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, tree_pred)

def graph_roc_curve_multiple(log_fpr, log_tpr, knn_fpr, knn_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr):
    plt.figure(figsize=(16, 8))
    plt.title('ROC Curve \n Top 4 Classifiers', fontsize=18)
    plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))
    plt.plot(knn_fpr, knn_tpr, label='KNears Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knears_pred)))
    plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
    plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))

    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([-0.01, 1, 0, 1])

    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
                 arrowprops=dict(facecolor='#6E726D', shrink=0.05))
    plt.legend()

graph_roc_curve_multiple(log_fpr, log_tpr, knn_fpr, knn_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr)
plt.show()

[Figure: ROC curves of the four classifiers]
A deeper look at logistic regression:

In this section we take a closer look at the logistic regression classifier.

Terminology:

True positives: correctly classified fraud transactions

False positives: transactions incorrectly classified as fraud

True negatives: correctly classified non-fraud transactions

False negatives: fraud transactions incorrectly classified as non-fraud

Precision: true positives / (true positives + false positives)

Recall: true positives / (true positives + false negatives)

As the names suggest, precision says how precise (how certain) our model is when it flags a fraud transaction, while recall says how many of the fraud cases our model is able to detect.

Precision/recall trade-off: the more precise (selective) our model is, the fewer cases it detects. Example: suppose the model is 95% precise and there are only 5 fraud cases where it is 95% or more certain, all of which really are fraud. If there are 5 further cases where the model is only 90% certain of fraud, lowering the precision requirement lets the model catch more of them.

Summary:

Precision starts to fall between 0.90 and 0.92; nevertheless, our precision score remains quite high while the recall score keeps dropping.
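One way to see this trade-off directly is to sweep the decision threshold of the logistic regression and plot precision and recall side by side. A sketch, assuming the log_reg_pred decision scores computed earlier (the helper name is ours):

from sklearn.metrics import precision_recall_curve

def plot_precision_recall_vs_threshold(y_true, scores):
    # precision and recall for every candidate decision threshold
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    plt.figure(figsize=(12, 6))
    plt.plot(thresholds, precision[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recall[:-1], 'g-', label='Recall')
    plt.xlabel('Decision threshold')
    plt.legend(loc='best')
    plt.title('Precision and recall as the threshold moves')
    plt.show()

plot_precision_recall_vs_threshold(y_train, log_reg_pred)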

def logistic_roc_curve(log_fpr, log_tpr):
    plt.figure(figsize=(12,8))
    plt.title('Logistic Regression ROC Curve', fontsize=16)
    plt.plot(log_fpr, log_tpr, 'b-', linewidth=2)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.axis([-0.01,1,0,1])
    
    
logistic_roc_curve(log_fpr, log_tpr)
plt.show()

[Figure: logistic regression ROC curve]

from sklearn.metrics import precision_recall_curve

precision, recall, threshold = precision_recall_curve(y_train, log_reg_pred)

from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
y_pred = log_reg.predict(X_train)

# Overfitting Case
print('---' * 45)
print('Overfitting: \n')
print('Recall Score: {:.2f}'.format(recall_score(y_train, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(y_train, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(y_train, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_train, y_pred)))
print('---' * 45)

# How it should look like
print('---' * 45)
print('How it should be:\n')
print("Accuracy Score: {:.2f}".format(np.mean(undersample_accuracy)))
print("Precision Score: {:.2f}".format(np.mean(undersample_precision)))
print("Recall Score: {:.2f}".format(np.mean(undersample_recall)))
print("F1 Score: {:.2f}".format(np.mean(undersample_f1)))
print('---' * 45)

[Figure: overfitting metrics vs. expected metrics]

undersample_y_score = log_reg.decision_function(original_Xtest)
from sklearn.metrics import average_precision_score

undersample_average_precision = average_precision_score(original_ytest, undersample_y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      undersample_average_precision))
      

[Figure: average precision-recall score output]

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12,6))

precision, recall, _ = precision_recall_curve(original_ytest, undersample_y_score)

plt.step(recall, precision, color='#004a93', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#48a6ff')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('UnderSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          undersample_average_precision), fontsize=16)

[Figure: undersampling precision-recall curve]
