Classification Practice with Imbalanced Samples --- Credit-Card-Fraud-Detection

"Bright moon like frost, good wind like water, a clear scene without end."

I happened upon a Credit-Card-Fraud-Detection dataset, so I put together a quick classification exercise with it.

Dataset link: https://pan.baidu.com/s/12rjSBqUWvCkXhINbh06M8w (extraction code: 0ero)

The full code is below; the focus is on the extra handling that imbalanced samples require during classification:

First, the raw data is used directly (its features have already been dimensionality-reduced):

import pandas as pd
import numpy as np

# Load the data
path=r'C:\Users\文远\Desktop\git_test\数据处理类\Kaggle-Data-Credit-Card-Fraud-Detection\creditcard.csv'
df=pd.read_csv(path)
print(df)
# Build the train/test sets
#### Start with the raw data as-is
y_data=np.array(df['Class'])
# Separate the label, then drop the uninformative Time column
df.drop('Class',axis=1,inplace=True)
print('df.shape',df.shape)
df.drop('Time',axis=1,inplace=True)
print('df.shape',df.shape)
x_data=np.array(df)
print(x_data.shape)
## The most basic pipeline first, to see how it does
from sklearn.model_selection import train_test_split
## random_state: makes the split reproducible
X_train,X_test,y_train,y_test=train_test_split(x_data,y_data,test_size=0.25,random_state=0)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn import metrics  # model evaluation metrics

# The simplest baseline: LogisticRegression
# logreg=LogisticRegression()  ## max_iter must be raised, otherwise it fails to converge
logre=LogisticRegression(max_iter=5000)
logre.fit(X_train,y_train)  ## fit the LogisticRegression model
y_pre=logre.predict(X_test)
y_pre_prob=logre.predict_proba(X_test)[:,1]  ### column 1 = predicted probability of the positive class (1)
print(y_pre)
print(y_pre_prob)
print(y_pre.shape)
print(y_pre_prob.shape)
# all_prob=logre.predict_proba(X_test)
# print(all_prob.shape)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
cnf_matrix = confusion_matrix(y_test,y_pre)
print(cnf_matrix)
print("原始数据的LogisticRegression:")
plot_confusion_matrix(logre, X_test, y_test)  
from sklearn import metrics
# from sklearn.metrics._classification import confusion_matrix
# plot_confusion_matrix(logre, X_test, y_test)  
#Performance metrics evaluation
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test,y_pre))
print("Accuracy:\n",metrics.accuracy_score(y_test,y_pre))
print("Precision:\n",metrics.precision_score(y_test,y_pre))
print("Recall:\n",metrics.recall_score(y_test,y_pre))
print("AUC:\n",metrics.roc_auc_score(y_test,y_pre_prob))
auc=metrics.roc_auc_score(y_test,y_pre_prob)
fpr,tpr,thresholds=metrics.roc_curve(y_test,y_pre_prob)
plt.plot(fpr,tpr,'b', label='AUC = %0.2f'% auc)
plt.plot([0,1],[0,1],'r-.')
plt.xlim([-0.2,1.2])
plt.ylim([-0.2,1.2])
plt.title('original data\nLogistic Regression')
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The results are as follows (0 = normal, 1 = fraud):

FN turns out to be large, so the model isn't great; neither recall nor precision is high.
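Before resampling anything, it is worth quantifying just how skewed the labels are. A minimal check using y_data from the loading step above (the counts in the comment reflect what this dataset is documented to contain):

import numpy as np
# Count each class; this dataset is documented as roughly 284315 normal vs 492 fraud (~0.17%)
unique, counts = np.unique(y_data, return_counts=True)
print(dict(zip(unique, counts)))
print("fraud ratio: %.4f%%" % (100 * counts[1] / counts.sum()))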

Let's try oversampling (this only expands the training data; everything else stays the same). Remember to install the imblearn library first (pip install imbalanced-learn).

print(X_train.shape,y_train.shape)
# Applying SMOTE to oversample the minority class  ### handling the imbalance via oversampling
from imblearn.over_sampling import SMOTE
sm=SMOTE(random_state=2)
X_sm,y_sm=sm.fit_resample(X_train,y_train)  # fit_sample was renamed fit_resample in recent imblearn versions
print("Sample counts after oversampling:")
print(X_sm.shape,y_sm.shape)
print("Positive : negative sample counts:")
print(len(y_sm[y_sm==1]),":",len(y_sm[y_sm==0]))

from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn import metrics  # model evaluation metrics

# The simplest baseline: LogisticRegression
# logreg=LogisticRegression()  ## max_iter must be raised, otherwise it fails to converge
logre2=LogisticRegression(max_iter=5000)
logre2.fit(X_sm,y_sm)  ## fit the LogisticRegression model on the oversampled data
y_pre2=logre2.predict(X_test)
y_pre_prob2=logre2.predict_proba(X_test)[:,1]  ### column 1 = predicted probability of the positive class (1)

from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
cnf_matrix2 = confusion_matrix(y_test,y_pre2)
print(cnf_matrix2)
print("过采样后的混淆矩阵:")
plot_confusion_matrix(logre2, X_test, y_test)  

from sklearn import metrics
# from sklearn.metrics._classification import confusion_matrix
# plot_confusion_matrix(logre, X_test, y_test)  
#Performance metrics evaluation
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test,y_pre2))
print("Accuracy:\n",metrics.accuracy_score(y_test,y_pre2))
print("Precision:\n",metrics.precision_score(y_test,y_pre2))
print("Recall:\n",metrics.recall_score(y_test,y_pre2))
print("AUC:\n",metrics.roc_auc_score(y_test,y_pre_prob2))
auc=metrics.roc_auc_score(y_test,y_pre_prob2)
fpr,tpr,thresholds=metrics.roc_curve(y_test,y_pre_prob2)
plt.plot(fpr,tpr,'b', label='AUC = %0.2f'% auc)
plt.plot([0,1],[0,1],'r-.')
plt.xlim([-0.2,1.2])
plt.ylim([-0.2,1.2])
plt.title('Over_Sample\nLogistic Regression')
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The results are still not great: FP is too large, meaning the false-alarm rate is high; catching the fraud samples came at the cost of misclassifying too many normal ones.

But recall did increase a great deal.
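As an aside, the same precision/recall tradeoff can be explored without retraining, simply by moving the decision threshold applied to the predicted probabilities (y_pre_prob2 from the block above). A minimal sketch:

from sklearn import metrics
# Raising the threshold trades recall for precision (fewer false alarms)
for thr in (0.5, 0.7, 0.9):
    y_thr = (y_pre_prob2 >= thr).astype(int)
    print("thr=%.1f  precision=%.3f  recall=%.3f" % (
        thr,
        metrics.precision_score(y_test, y_thr),
        metrics.recall_score(y_test, y_thr)))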

Next, let's try undersampling.

## Standardization has been missing so far
# 1. preprocessing.StandardScaler.fit
#    computes the mean and variance of the training data, which are then used to transform it
# 2. preprocessing.StandardScaler.fit_transform
#    computes the training data's mean and variance AND uses them to transform the data into a standard normal distribution
# 3. preprocessing.StandardScaler.transform
#    transform only: it maps the data to a standard normal distribution using previously computed statistics
#    (a leakage-aware usage sketch follows after this code block)
x_data=np.array(df)
print(x_data.shape)
print(x_data)
from sklearn.preprocessing import StandardScaler
standard_data=StandardScaler().fit_transform(x_data)

### Undersampling plus cross-validation
## Select as many normal samples (0) as there are fraud samples (1)
## Get the indexes where y_data == 1
all_data=np.hstack((standard_data,y_data.reshape(-1,1)))  ### the arrays must be passed to np.hstack as a tuple
print(all_data.shape)
# print(all_data)
data_fraud=all_data[all_data[:,-1] == 1]
num_records_fraud = len(all_data[all_data[:,-1] == 1])
print(num_records_fraud)
# print(data_fraud)
data_nomal=all_data[all_data[:,-1] == 0]
num_records_nomal = len(data_nomal)

random_normal_indexes = np.random.choice(np.arange(num_records_nomal), num_records_fraud, replace=False)  ### randomly draw num_records_fraud normal samples; replace=False means sampling without replacement
# print(random_normal_indexes)

down_data=np.vstack((data_fraud,data_nomal[random_normal_indexes,:]))
print(down_data.shape)

# import numpy as np
y_pre_list=np.array([])
y_pre_prob_list=np.array([])
logre3=LogisticRegression(max_iter=5000)
random_index=np.random.choice(np.arange(len(down_data)),len(down_data),replace=False)
# print(random_index)
down_random_data=down_data[random_index,:]
print(down_random_data)
# logre2.fit(X_sm,y_sm)  ## (leftover from the oversampling block)

from sklearn.model_selection import train_test_split
## random_state: makes the split reproducible
X_train1,X_test1,y_train1,y_test1=train_test_split(down_random_data[:,:-1],down_random_data[:,-1],test_size=0.25,random_state=0)
print(X_train1.shape,X_test1.shape,y_train1.shape,y_test1.shape)

from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn import metrics  # model evaluation metrics

# The simplest baseline: LogisticRegression
# logreg=LogisticRegression()  ## max_iter must be raised, otherwise it fails to converge
logre3=LogisticRegression(max_iter=5000)
logre3.fit(X_train1,y_train1)  ## fit the LogisticRegression model on the undersampled data
y_pre3=logre3.predict(X_test1)
y_pre_prob3=logre3.predict_proba(X_test1)[:,1]  ### column 1 = predicted probability of the positive class (1)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
cnf_matrix3 = confusion_matrix(y_test1,y_pre3)
print(cnf_matrix3)
print("下采样的混淆矩阵")
plot_confusion_matrix(logre3, X_test1, y_test1)  
from sklearn import metrics
# from sklearn.metrics._classification import confusion_matrix
# plot_confusion_matrix(logre, X_test, y_test)  
#Performance metrics evaluation
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test1,y_pre3))
print("Accuracy:\n",metrics.accuracy_score(y_test1,y_pre3))
print("Precision:\n",metrics.precision_score(y_test1,y_pre3))
print("Recall:\n",metrics.recall_score(y_test1,y_pre3))
print("AUC:\n",metrics.roc_auc_score(y_test1,y_pre_prob3))
auc=metrics.roc_auc_score(y_test1,y_pre_prob3)
fpr,tpr,thresholds=metrics.roc_curve(y_test1,y_pre_prob3)
plt.plot(fpr,tpr,'b', label='AUC = %0.2f'% auc)
plt.plot([0,1],[0,1],'r-.')
plt.xlim([-0.2,1.2])
plt.ylim([-0.2,1.2])
plt.title('Down_Sample\nLogistic Regression')
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
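One caveat about the block above (promised after the StandardScaler notes): the scaler was fit on the full dataset before splitting, which leaks test-set statistics into training. A leakage-free sketch of the fit/transform pattern, reusing the X_train1/X_test1 names purely for illustration:

from sklearn.preprocessing import StandardScaler
# Leakage-free pattern: learn mean/variance on the training fold only,
# then apply those same statistics to the test fold
scaler = StandardScaler()
X_train1_std = scaler.fit_transform(X_train1)  # fit + transform on train
X_test1_std = scaler.transform(X_test1)        # transform only on test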

The results are quite good, though no cross-validation was added here; you can try it yourself (a minimal sketch follows below). I did try it and found the effect mediocre, but it is worth trying when the sample size is small; for background, see:

https://ashleyzh666.github.io/2020/07/15/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E9%A2%84%E6%B5%8B%E5%AE%9E%E8%B7%B5%E5%90%8E%E7%9A%84%E6%80%BB%E7%BB%93%EF%BC%8C%E4%BB%A5%E6%95%B0%E6%8D%AE%E9%87%8F%E8%BE%83%E5%B0%8F%E4%B8%BA%E5%85%B8%E5%9E%8B%E7%90%86%E8%A7%A3%E4%BA%A4%E5%8F%89%E9%AA%8C%E8%AF%81/

This recall should be sufficient, FP is small, and precision is decent too.
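For reference, here is a minimal sketch of k-fold cross-validated recall on the undersampled data (down_random_data from above); treat it as illustrative rather than part of the original pipeline:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# 5-fold cross-validated recall on the undersampled (balanced) data
cv_recall = cross_val_score(LogisticRegression(max_iter=5000),
                            down_random_data[:, :-1], down_random_data[:, -1],
                            cv=5, scoring='recall')
print(cv_recall, "mean recall:", cv_recall.mean())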

Finally, I tried oversampling combined with the built-in cross-validation, and added standardization along the way, since the 'Amount' feature takes very large values.

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV  # LogisticRegression with built-in CV over the regularization parameter C
from sklearn.model_selection import cross_val_score  # cross-validation
from sklearn.preprocessing import StandardScaler
# X_sm,y_sm=sm.fit_resample(X_train,y_train)
standard_X_sm=StandardScaler().fit_transform(X_sm)
print(standard_X_sm.shape)
print(y_sm.shape)
y_sm=y_sm.reshape(-1,1)
print(y_sm.shape)
y_sm=y_sm.reshape(-1)
print(y_sm.shape)
# over_data=np.vstack((standard_X_sm,y_sm))
# print(over_data)
# print(over_data.shape)

from sklearn.model_selection import train_test_split
## random_state: makes the split reproducible
X_train2,X_test2,y_train2,y_test2=train_test_split(standard_X_sm,y_sm,test_size=0.25,random_state=0)
print(X_train2.shape,X_test2.shape,y_train2.shape,y_test2.shape)

from sklearn.linear_model import LogisticRegressionCV  # built-in CV over the regularization parameter C
# Large sample size (300K+), 29 features, L1 penalty --> the saga solver (new in scikit-learn 0.19) can be used (saga also supports l2, which is what is used below)
# LogisticRegressionCV is faster than GridSearchCV here.  solver: the optimization method for the logistic-regression loss.
lgrecv_L2 = LogisticRegressionCV(Cs=1, cv = 10, penalty='l2', solver='saga')
####  About the Cs parameter: Cs is the regularization parameter, corresponding to C in LogisticRegression.
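# Note (added): when Cs is an integer, LogisticRegressionCV searches that many C values
# on a log scale between 1e-4 and 1e4, so Cs=1 tries only a single value; something like
# Cs=10 (or an explicit list such as Cs=[0.01, 0.1, 1, 10]) gives the CV a real grid to search.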
lgrecv_L2.fit(X_train2, y_train2) 
y_pre5=lgrecv_L2.predict(X_test2)
y_pre_prob5=lgrecv_L2.predict_proba(X_test2)[:,1]  ### column 1 = predicted probability of the positive class (1)

from sklearn.metrics import confusion_matrix
cnf_matrix5 = confusion_matrix(y_test2,y_pre5)
print(cnf_matrix5)
print("Confusion matrix for oversampling + cross-validation:")
plot_confusion_matrix(lgrecv_L2, X_test2, y_test2) 

from sklearn import metrics
# from sklearn.metrics._classification import confusion_matrix
# plot_confusion_matrix(logre, X_test, y_test)  
#Performance metrics evaluation
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test2,y_pre5))
print("Accuracy:\n",metrics.accuracy_score(y_test2,y_pre5))
print("Precision:\n",metrics.precision_score(y_test2,y_pre5))
print("Recall:\n",metrics.recall_score(y_test2,y_pre5))
print("AUC:\n",metrics.roc_auc_score(y_test2,y_pre_prob5))
auc=metrics.roc_auc_score(y_test2,y_pre_prob5)
fpr,tpr,thresholds=metrics.roc_curve(y_test2,y_pre_prob5)
plt.plot(fpr,tpr,'b', label='AUC = %0.2f'% auc)
plt.plot([0,1],[0,1],'r-.')
plt.xlim([-0.2,1.2])
plt.ylim([-0.2,1.2])
plt.title('Over_Sample+CrossOver_10_fold\nLogistic Regression')
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Relatively speaking, the results improved, since FP dropped sharply.

But recall is still on the low side, mainly because FN grew. Compared with the earlier setups, each configuration sits at one extreme or the other, never in a balanced middle ground.
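As a further experiment (not tried in the original post), LogisticRegression's class_weight='balanced' option reweights the loss instead of resampling the data, and often lands somewhere between the two extremes described above. A minimal sketch on the original X_train/y_train split:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# class_weight='balanced' upweights the rare class in the loss; no resampling needed
logre_w = LogisticRegression(max_iter=5000, class_weight='balanced')
logre_w.fit(X_train, y_train)
y_pre_w = logre_w.predict(X_test)
print("Precision:", metrics.precision_score(y_test, y_pre_w))
print("Recall:", metrics.recall_score(y_test, y_pre_w))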

That's all for now; writing it up helps it stick. For the source code, click "Read the original"; if this post helped you, remember to star the public account.

END

Author: 不爱跑马的影迷不是好程序猿



One line: "Moon cold, sun warm, they come to boil away our lives."
