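Preface
This is a follow-up on credit card anti-fraud analysis. An earlier article covers the first part (see the link below); it is best to read that one first for the overall picture before reading this one.
This article addresses the data imbalance mentioned in that earlier article by applying undersampling and oversampling, and tries to compare the results of the two.
Using logistic regression as the example, it also runs a simple tuning experiment on the two parameters that affect the model most, C and Threshold, to see which values work better. See the code for details; hopefully it gives you some ideas for parameter tuning.
Before reading, it helps to know some basic model metrics and concepts such as recall, TP and FN. At a minimum you should be able to read a confusion matrix, otherwise the later sections will be hard to follow.
Reading this article takes about 20 minutes. If you spot an error, please leave a comment. Thanks.
I. Data preparation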
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
data=pd.read_csv('./creditcard.csv')
from sklearn.preprocessing import StandardScaler
# Standardize the Amount column
data['normAmount']=StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
data=data.drop(['Amount','Time'],axis=1)
data.shape,data.info()
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Class 284807 non-null int64
normAmount 284807 non-null float64
dtypes: float64(29), int64(1)
memory usage: 65.2 MB
((284807, 30), None)
# Look at the distribution of the Class column
count_classes = data['Class'].value_counts().sort_index()
print (count_classes)
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
0 284315
1 492
Name: Class, dtype: int64
Text(0,0.5,'Frequency')
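As shown above, the data is severely imbalanced: fraud samples (Class = 1) make up only 492 of 284807 rows, about 0.17%. A model trained directly on this data can score high accuracy simply by predicting the majority class almost every time, while missing most of the fraud.
So the sample data has to be processed first. There are two main approaches:
undersampling
oversampling
Each is covered below.
II. Undersampling the data
2.1 Undersampling:
When the two classes in a dataset differ greatly in size, randomly select from the larger class as many samples as the smaller class contains, so that the final training set has the same number of samples from each class.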
# Get the original feature and label datasets
X = data.loc[:,data.columns != 'Class']
Y = data.loc[:,data.columns == 'Class']
X.shape,Y.shape
((284807, 29), (284807, 1))
# Number of fraud samples (Class == 1)
number_record_fraud = len(Y[Y.Class==1])
# Indices of the fraud samples and of the normal samples
fraud_indices = np.array(data[data.Class == 1].index)
normal_indices = np.array(data[data.Class == 0].index)
# Randomly pick number_record_fraud indices from normal_indices, without replacement
random_normal_indices = np.array(np.random.choice(normal_indices,number_record_fraud,replace=False))
# Combine the fraud and normal indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
# Extract the undersampled dataset using the combined indices
under_sample_data = data.iloc[under_sample_indices,:]
# Split the undersampled dataset into features and labels
X_under_sample = under_sample_data.loc[:,under_sample_data.columns != 'Class']
Y_under_sample = under_sample_data.loc[:,under_sample_data.columns == 'Class']
# Check the shapes of the sampled features and labels
X_under_sample.shape,Y_under_sample.shape
((984, 29), (984, 1))
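# Split the datasets
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed from scikit-learn
from sklearn.model_selection import train_test_split
# Split the undersampled feature and label datasets
X_train_under_sample, X_test_under_sample, Y_train_under_sample, Y_test_under_sample = train_test_split(
    X_under_sample, Y_under_sample, test_size=0.3, random_state=0)
# Split the original, unprocessed feature and label datasets for later use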
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.3,
random_state=0)
# Check the shapes after splitting the sampled data; check often to catch anomalies early
print(X_train_under_sample.shape,
Y_train_under_sample.shape,
'\n',X_test_under_sample.shape,
Y_test_under_sample.shape)
(688, 29) (688, 1)
(296, 29) (296, 1)
# Check the shapes after splitting the original, unprocessed data
print(X_train.shape,
Y_train.shape,
'\n',X_test.shape,
Y_test.shape)
(199364, 29) (199364, 1)
(85443, 29) (85443, 1)
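III. Cross-validation and parameter tuning
Once a model is trained, validating it is essential: validation tells us how well the model performs and whether it is fit for use. Parameter tuning, in turn, is one of the main factors that decide how good the model is.
In machine learning, once the algorithm is chosen, training a model largely comes down to settling a set of parameters (tuning). Tuning is essentially trial and error, but it follows a routine:
1. First, train a model on some data with a given parameter value,
2. then feed other data into the trained model,
3. compare the output with the labels and compute an evaluation metric,
4. and use that metric to judge whether the parameter value was a good choice.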
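The four steps above are usually repeated over several train/validation splits. As a rough sketch only (not the code used in this article), the hypothetical helper `kfold_indices` below shows how an unshuffled 5-fold split hands out row indices so that every row is used for validation exactly once. The 688 is the size of the undersampled training set from the previous section; everything else is made up for illustration.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one more row so that every row is covered
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


fold_sizes = []
for train_idx, val_idx in kfold_indices(688, 5):
    # Steps 1-2: fit on the rows in train_idx, predict on the rows in val_idx
    # Steps 3-4: score the predictions and record the score for this parameter
    fold_sizes.append(len(val_idx))

print(fold_sizes)
```

Averaging the metric over the folds gives one score per candidate parameter value, which is what the tuning function in this section compares.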
So we rely on evaluation metrics to measure the result. Two important ones are introduced here:
accuracy
recall
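To see why accuracy alone is not enough on this dataset, consider a hypothetical model that labels every transaction as normal. The sketch below uses only the class counts printed earlier (284315 normal, 492 fraud); the "always predict 0" model is an assumption made purely for illustration.

```python
# Class counts from the Class distribution printed earlier
n_normal = 284315
n_fraud = 492

# Hypothetical baseline: predict Class 0 for every transaction
tp = 0            # fraud predicted as fraud
fn = n_fraud      # fraud predicted as normal
tn = n_normal     # normal predicted as normal
fp = 0            # normal predicted as fraud

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

Accuracy comes out above 99.8% even though not a single fraud is caught, which is why recall is the metric tracked in the rest of this article.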
import sklearn
from sklearn.linear_model import LogisticRegression
# sklearn.model_selection replaces the removed sklearn.cross_validation module
from sklearn.model_selection import KFold,cross_val_score
from sklearn.metrics import (confusion_matrix,recall_score,
                             classification_report)
# Run 5-fold cross-validation for each candidate C and return the C with the best mean recall
def printing_Kfold_scores(X_train_data,Y_train_data):
    fold = KFold(n_splits=5,shuffle=False)
    print (fold)
    c_param_range = [0.01,0.1,1,10,100]
    # results_table is a DataFrame that stores the mean cross-validated recall for each C
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=['C_Parameter','Mean recall score'])
    results_table['C_Parameter'] = c_param_range
    j=0
    for c_param in c_param_range:
        print ('c_param:',c_param)
        recall_accs = []
        # fold.split yields (train_indices, validation_indices) pairs;
        # enumerate with start=1 numbers the folds from 1 instead of 0
        for iteration,indices in enumerate(fold.split(X_train_data), start=1):
            # The l1 penalty needs a solver that supports it, such as liblinear
            lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')
            lr.fit(X_train_data.iloc[indices[0],:],Y_train_data.iloc[indices[0],:].values.ravel())
            Y_pred_undersample = lr.predict(X_train_data.iloc[indices[1],:].values)
            recall_acc = recall_score(Y_train_data.iloc[indices[1],:].values,Y_pred_undersample)
            recall_accs.append(recall_acc)
            print ('Iteration:',iteration,'recall_acc:',recall_acc)
        # Mean recall for this value of C
        print ('Mean recall score',np.mean(recall_accs))
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j+=1
    # Best C
    # Note: results_table['Mean recall score'] has dtype object and must be cast to float64 before idxmax
    results_table['Mean recall score']=results_table['Mean recall score'].astype('float64')
    best_c = results_table['C_Parameter'].iloc[results_table['Mean recall score'].idxmax()]
    print ('best_c is :',best_c)
    return best_c
# Run it on the undersampled data
best_c = printing_Kfold_scores(X_train_under_sample,
Y_train_under_sample)
sklearn.cross_validation.KFold(n=688,
n_folds=5,
shuffle=False,
random_state=None)
c_param: 0.01
Iteration: 1 recall_acc: 0.931506849315
Iteration: 2 recall_acc: 0.917808219178
Iteration: 3 recall_acc: 1.0
Iteration: 4 recall_acc: 0.959459459459
Iteration: 5 recall_acc: 0.954545454545
Mean recall score 0.9526639965
c_param: 0.1
Iteration: 1 recall_acc: 0.835616438356
Iteration: 2 recall_acc: 0.86301369863
Iteration: 3 recall_acc: 0.915254237288
Iteration: 4 recall_acc: 0.918918918919
Iteration: 5 recall_acc: 0.893939393939
Mean recall score 0.885348537427
c_param: 1
Iteration: 1 recall_acc: 0.849315068493
Iteration: 2 recall_acc: 0.890410958904
Iteration: 3 recall_acc: 0.966101694915
Iteration: 4 recall_acc: 0.945945945946
Iteration: 5 recall_acc: 0.893939393939
Mean recall score 0.90914261244
c_param: 10
Iteration: 1 recall_acc: 0.86301369863
Iteration: 2 recall_acc: 0.904109589041
Iteration: 3 recall_acc: 0.966101694915
Iteration: 4 recall_acc: 0.932432432432
Iteration: 5 recall_acc: 0.909090909091
Mean recall score 0.914949664822
c_param: 100
Iteration: 1 recall_acc: 0.890410958904
Iteration: 2 recall_acc: 0.904109589041
Iteration: 3 recall_acc: 0.983050847458
Iteration: 4 recall_acc: 0.959459459459
Iteration: 5 recall_acc: 0.909090909091
Mean recall score 0.929224352791
best_c is : 0.01
IV. Confusion matrix
Define the function plot_confusion_matrix for drawing a confusion matrix:
import itertools
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title,fontsize=22)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 fontsize=15,
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label',fontsize=15)
    plt.xlabel('Predicted label',fontsize=15)
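Feed the undersampled test data into the model, then plot the confusion matrix from the predictions and the true labels.
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')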
lr.fit(X_train_under_sample,Y_train_under_sample.values.ravel())
Y_pred_undersample = lr.predict(X_test_under_sample.values)
cnf_matrix = confusion_matrix(Y_test_under_sample,Y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ",
float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
f,ax=plt.subplots(figsize=(8,6))
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
Recall metric in the testing dataset: 0.925170068027
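From the confusion matrix in the figure above, recall is:
recall=TP/(TP+FN)=136/(136+11)
Recall depends only on TP and FN, so it says nothing about FP (transactions that are actually 0, with no fraud, but are predicted as 1, flagged as fraud). A model can therefore have high recall and still raise a large number of false alarms, so when tuning, look not only at recall but also at FP and the other counts in the confusion matrix.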
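For reference, the standard definitions behind this point can be written side by side. Only the recall line uses numbers from this article; precision and the false positive rate are the usual companion measures that bring FP into the picture, added here as a reminder and not computed by the original code.

```latex
\begin{align*}
\text{recall} &= \frac{TP}{TP + FN} = \frac{136}{136 + 11} \approx 0.925 \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{false positive rate} &= \frac{FP}{FP + TN}
\end{align*}
```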
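The recall and confusion matrix above were computed on the undersampled test data. That set is tiny compared with the original data, and its roughly balanced class ratio does not match the real distribution, so the result is not very convincing. We therefore test on the original data (without undersampling) as well.
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')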
lr.fit(X_train_under_sample,Y_train_under_sample.values.ravel())
Y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,Y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot non-normalized confusion matrix
class_names = [0,1]
f,ax=plt.subplots(figsize=(8,6))
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
Recall metric in the testing dataset: 0.918367346939
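As the figure above shows, the logistic regression model trained on undersampled data has a fairly high recall, but its FP count is also very high, 8404, meaning a large number of normal transactions are wrongly flagged as fraud. This is one drawback of undersampling; oversampling, covered later, is an alternative that may reduce it.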
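To put the 8404 false positives in perspective, here is a small worked calculation. The test-set size and the FP count come from the outputs above. The TP and FN counts are an assumption: they are inferred from the printed recall of 0.918367346939, which equals 135/147, and are not read from the figure.

```python
# Figures taken from the outputs above
n_test = 85443   # rows in X_test
fp = 8404        # normal transactions flagged as fraud

# Assumption: 147 fraud rows in the test set, 135 of them caught.
# Inferred from the printed recall (135 / 147 = 0.918367...), not from the figure.
tp = 135
fn = 12

tn = n_test - tp - fn - fp

recall = tp / (tp + fn)
precision = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)

print(f"recall              = {recall:.4f}")
print(f"precision           = {precision:.4f}")
print(f"false positive rate = {false_positive_rate:.4f}")
```

Under these assumptions precision is roughly 1.6%, so the large majority of transactions flagged as fraud are actually normal, even though recall looks good.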
Now try training on the original data X_train, Y_train to see how it performs:
best_c = printing_Kfold_scores(X_train,Y_train)
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train,Y_train.values.ravel())
Y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,Y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
sklearn.cross_validation.KFold(n=199364, n_folds=5, shuffle=False, random_state=None)
c_param: 0.01
Iteration: 1 recall_acc: 0.492537313433
Iteration: 2 recall_acc: 0.602739726027
Iteration: 3 recall_acc: 0.683333333333
Iteration: 4 recall_acc: 0.569230769231
Iteration: 5 recall_acc: 0.45
Mean recall score 0.559568228405
c_param: 0.1
Iteration: 1 recall_acc: 0.567164179104
Iteration: 2 recall_acc: 0.616438356164
Iteration: 3 recall_acc: 0.683333333333
Iteration: 4 recall_acc: 0.584615384615
Iteration: 5 recall_acc: 0.525
Mean recall score 0.595310250644
c_param: 1
Iteration: 1 recall_acc: 0.55223880597
Iteration: 2 recall_acc: 0.616438356164
Iteration: 3 recall_acc: 0.716666666667
Iteration: 4 recall_acc: 0.615384615385
Iteration: 5 recall_acc: 0.5625
Mean recall score 0.612645688837
c_param: 10
Iteration: 1 recall_acc: 0.55223880597
Iteration: 2 recall_acc: 0.616438356164
Iteration: 3 recall_acc: 0.733333333333
Iteration: 4 recall_acc: 0.615384615385
Iteration: 5 recall_acc: 0.575
Mean recall score 0.61847902217
c_param: 100
Iteration: 1 recall_acc: 0.55223880597
Iteration: 2 recall_acc: 0.616438356164
Iteration: 3 recall_acc: 0.733333333333
Iteration: 4 recall_acc: 0.615384615385
Iteration: 5 recall_acc: 0.575
Mean recall score 0.61847902217
best_c is : 10.0
Recall metric in the testing dataset: 0.619047619048
V. Adjusting the Threshold parameter
lr = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_under_sample,Y_train_under_sample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_under_sample.values)
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(15,15))
recall_accs = []
j = 1
for i in thresholds:
    # Predict fraud when the predicted probability of Class 1 exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    plt.subplot(3,3,j)
    j += 1
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y_test_under_sample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    recall_acc = float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1])
    print('Threshold>=%s Recall: '%i, recall_acc)
    recall_accs.append(recall_acc)
    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold>=%s'%i)
Threshold>=0.1 Recall: 1.0
Threshold>=0.2 Recall: 1.0
Threshold>=0.3 Recall: 1.0
Threshold>=0.4 Recall: 0.986394557823
Threshold>=0.5 Recall: 0.925170068027
Threshold>=0.6 Recall: 0.863945578231
Threshold>=0.7 Recall: 0.823129251701
Threshold>=0.8 Recall: 0.734693877551
Threshold>=0.9 Recall: 0.571428571429
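Looking at the figure above, recall is highest at the lowest thresholds (1.0 for 0.1 to 0.3), but low thresholds also flag more normal transactions as fraud. Weighing recall against the false positives in the confusion matrices, Threshold=0.5 works best here.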
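The trade-off can be seen on a tiny toy example. The probabilities and labels below are made up for illustration and have nothing to do with the credit card data; `recall_and_false_positives` is a hypothetical helper, not part of scikit-learn.

```python
# Hypothetical fraud probabilities and true labels, made up for illustration
probabilities = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.30, 0.60, 0.80, 0.95]
labels        = [0,    0,    0,    0,    0,    0,    1,    1,    1,    1]


def recall_and_false_positives(threshold):
    """Classify as fraud when probability > threshold, then count the outcomes."""
    predicted = [p > threshold for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    return tp / (tp + fn), fp


results = {}
for threshold in (0.1, 0.5, 0.9):
    results[threshold] = recall_and_false_positives(threshold)
    print(threshold, results[threshold])
```

On this toy data, raising the threshold from 0.1 to 0.9 cuts the false positives from 5 to 0 while recall falls from 1.0 to 0.25, which is the same pattern the nine confusion matrices show.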
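VI. Oversampling
Unlike undersampling, which shrinks the data, oversampling takes the opposite approach. Oversampling: generate additional samples for the minority class until it matches the size of the majority class.
So how are these samples generated?
The most commonly used method is the SMOTE algorithm. For a detailed introduction to SMOTE, see this paper: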
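The core idea of SMOTE is to place each synthetic sample somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea only; the function name `smote_like_samples`, the toy points and the choice of k=2 are assumptions made here, and the imbalanced-learn implementation differs in its details.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between the point and its neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-feature minority class, made up for illustration
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_samples(minority_points, n_new=6)
print(len(new_points))
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies instead of being exact copies.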
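The steps are as follows.
Generating the data
Separate the features and labels in the data.
Split the data into training and test sets at a ratio of 7:3.
Apply SMOTE to the training samples to obtain a balanced training set.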
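columns=data.columns
# Drop the label column by name. Class is not the last column (normAmount is), so
# columns.delete(len(columns)-1) would keep Class among the features, as in the
# features_columns printout and the results recorded below.
features_columns=columns.drop('Class')
features=data[features_columns]
labels=data['Class']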
features_train, features_test, labels_train, labels_test = train_test_split(features,
labels,
test_size=0.3,
random_state=0)
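oversampler=SMOTE(random_state=0)
# fit_resample replaces fit_sample, which newer imbalanced-learn releases removed
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
# Total rows after resampling; SMOTE balances the two classes, so half of them are Class 1
print(len(os_labels))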
398038
features_columns
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Class'],
dtype='object')
# Check the dataset produced by oversampling
os_features.shape,os_labels.shape
((398038, 29), (398038, 1))
best_c = printing_Kfold_scores(os_features,os_labels)
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)
# Pass the predictions and the true labels to confusion_matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ",
float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
sklearn.cross_validation.KFold(n=398038, n_folds=5,
shuffle=False, random_state=None)
c_param: 0.01
Iteration: 1 recall_acc: 1.0
Iteration: 2 recall_acc: 1.0
Iteration: 3 recall_acc: 1.0
Iteration: 4 recall_acc: 1.0
Iteration: 5 recall_acc: 1.0
Mean recall score 1.0
c_param: 0.1
Iteration: 1 recall_acc: 1.0
Iteration: 2 recall_acc: 1.0
Iteration: 3 recall_acc: 1.0
Iteration: 4 recall_acc: 1.0
Iteration: 5 recall_acc: 1.0
Mean recall score 1.0
c_param: 1
Iteration: 1 recall_acc: 1.0
Iteration: 2 recall_acc: 1.0
Iteration: 3 recall_acc: 1.0
Iteration: 4 recall_acc: 1.0
Iteration: 5 recall_acc: 1.0
Mean recall score 1.0
c_param: 10
Iteration: 1 recall_acc: 1.0
Iteration: 2 recall_acc: 1.0
Iteration: 3 recall_acc: 1.0
Iteration: 4 recall_acc: 1.0
Iteration: 5 recall_acc: 1.0
Mean recall score 1.0
c_param: 100
Iteration: 1 recall_acc: 1.0
Iteration: 2 recall_acc: 1.0
Iteration: 3 recall_acc: 1.0
Iteration: 4 recall_acc: 1.0
Iteration: 5 recall_acc: 1.0
Mean recall score 1.0
best_c is : 0.01
Recall metric in the testing dataset: 1.0
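In this recorded run, oversampling raises recall further, to 1.0, and the false positives drop from 8404 to 0. These figures need a caveat, though: the features_columns printout above includes 'Class', so the label itself was among the features in this run, and a perfect score on every fold for every value of C is what such leakage produces. The improvement therefore cannot be credited to oversampling alone; the comparison should be rerun with Class excluded from the features before drawing conclusions about SMOTE on this dataset.
VII. Summary
When you get a dataset, first look at its structure and check whether it is imbalanced;
If it is imbalanced, use undersampling or oversampling to build a new, balanced training set before choosing a model or algorithm;
Parameter tuning is a painful process, and only repeated experiments reveal which values work best;
When judging predictions, weigh accuracy, recall, the confusion matrix and other measures together instead of fixating on a single one;
That is all for this article. If time permits, I will try a decision tree model on this dataset later.
Thank you for reading.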