Analyzing credit card fraud with Python (part 2): two sampling methods for imbalanced data, a comparison of their effects, and a model-tuning example...

Latest recommended article published on 2022-07-06 12:44:04

weixin_39937447


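Preface

This is the second part of an analysis of credit card fraud detection. Part 1 was published earlier (see the link below); it is best to read that article first to get the overall picture before reading this one.

This article deals with the class imbalance noted in part 1 by applying undersampling and oversampling, and compares the results of the two approaches.

Using logistic regression as the example, it also runs a simple tuning exercise on the two parameters that most affect the model, C and Threshold, to see which values work better. The details are in the code, and I hope they give you some ideas for your own tuning.

It helps to know some basic model-evaluation concepts beforehand, such as recall, TP and FN. At the very least you should be able to read a confusion matrix, otherwise the later sections will be hard to follow.

Reading this article takes roughly 20 minutes. If you find an error, please leave a comment. Thank you.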

1. Data preparation

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

plt.style.use('ggplot')

from imblearn.over_sampling import SMOTE

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

/Applications/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

return f(*args, **kwds)

data=pd.read_csv('./creditcard.csv')

from sklearn.preprocessing import StandardScaler

# Standardize the Amount column

data['normAmount']=StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))

data=data.drop(['Amount','Time'],axis=1)

data.shape,data.info()

RangeIndex: 284807 entries, 0 to 284806

Data columns (total 30 columns):

V1 284807 non-null float64

V2 284807 non-null float64

V3 284807 non-null float64

V4 284807 non-null float64

V5 284807 non-null float64

V6 284807 non-null float64

V7 284807 non-null float64

V8 284807 non-null float64

V9 284807 non-null float64

V10 284807 non-null float64

V11 284807 non-null float64

V12 284807 non-null float64

V13 284807 non-null float64

V14 284807 non-null float64

V15 284807 non-null float64

V16 284807 non-null float64

V17 284807 non-null float64

V18 284807 non-null float64

V19 284807 non-null float64

V20 284807 non-null float64

V21 284807 non-null float64

V22 284807 non-null float64

V23 284807 non-null float64

V24 284807 non-null float64

V25 284807 non-null float64

V26 284807 non-null float64

V27 284807 non-null float64

V28 284807 non-null float64

Class 284807 non-null int64

normAmount 284807 non-null float64

dtypes: float64(29), int64(1)

memory usage: 65.2 MB

((284807, 30), None)
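The normAmount column created above is a z-score of Amount. As a minimal sketch of what that standardization computes (subtract the mean, divide by the population standard deviation), the hypothetical helper `standardize` below reproduces the arithmetic with NumPy. The amounts are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical transaction amounts, for illustration only.
amounts = np.array([1.0, 5.0, 20.0, 100.0, 2500.0])
norm_amounts = standardize(amounts)
print(norm_amounts.round(3))
```

After this step the column has mean 0 and standard deviation 1, which puts it on a scale comparable to the V1 to V28 features.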

# Look at the distribution of the Class column

count_classes=data['Class'].value_counts(sort=True).sort_index()

print (count_classes)

count_classes.plot(kind = 'bar')

plt.title("Fraud class histogram")

plt.xlabel("Class")

plt.ylabel("Frequency")

0 284315

1 492

Name: Class, dtype: int64

Text(0,0.5,'Frequency')

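As shown above, the data is severely imbalanced: there are only 492 fraud samples (Class = 1) against 284315 normal ones. If we train a model directly on this data without any treatment, the model tends to favor the majority class and the results will be very poor.

So the sample data has to be rebalanced first. There are two main approaches: undersampling and oversampling.

Each is covered in turn below.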
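To make the problem concrete, here is a small arithmetic sketch using the class counts printed above. It assumes a trivial "model" that labels every transaction as normal; that model is hypothetical and only serves to show why accuracy alone is misleading on this dataset.

```python
# Class counts printed above in the article.
n_normal = 284315
n_fraud = 492
total = n_normal + n_fraud

# A trivial "model" that labels every transaction as normal (Class = 0).
true_negatives = n_normal   # every normal transaction is labelled correctly
false_negatives = n_fraud   # every fraud is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"fraud share: {n_fraud / total:.4%}")
print(f"accuracy:    {accuracy:.4%}")
print(f"recall:      {recall:.4%}")
```

The accuracy is very high even though not a single fraud is caught, which is why the rest of the article tracks recall and the confusion matrix.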

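2. Handling the data with undersampling

2.1 Undersampling:

When the two classes in a dataset differ greatly in size, randomly pick from the larger class as many samples as the smaller class contains, and train the model on the resulting set, which holds the same number of samples from each class.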

# Get the original feature and label datasets

X = data.loc[:,data.columns != 'Class']

Y = data.loc[:,data.columns == 'Class']

X.shape,Y.shape

((284807, 29), (284807, 1))

# Count the fraud samples (Class == 1, the minority class)
number_record_fraud = len(Y[Y.Class==1])

# Get the indices of the fraud samples and of the normal samples
fraud_indices = np.array(data[data.Class == 1].index)
normal_indices = np.array(data[data.Class == 0].index)

# Randomly pick number_record_fraud indices from the normal samples, without replacement
# (no seed is set here, so the selected rows differ from run to run)
random_normal_indices = np.array(np.random.choice(normal_indices,number_record_fraud,replace=False))

# Combine the fraud indices with the sampled normal indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Extract the undersampled dataset using the combined indices
under_sample_data = data.iloc[under_sample_indices,:]

# Split the undersampled dataset into features and labels
X_under_sample = under_sample_data.loc[:,under_sample_data.columns != 'Class']
Y_under_sample = under_sample_data.loc[:,under_sample_data.columns == 'Class']

# Check the shapes of the sampled features and labels
X_under_sample.shape,Y_under_sample.shape

((984, 29), (984, 1))
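The undersampling code above draws its random subset without a seed, so each run selects different normal rows. As a rough sketch of the same idea made reproducible, the hypothetical helper `undersample_indices` below uses a seeded NumPy generator. The toy labels are invented for illustration; the function name and parameters are assumptions, not part of the original code.

```python
import numpy as np

def undersample_indices(labels, minority_label=1, seed=0):
    """Return row positions of a balanced subset: all minority rows plus an
    equally sized random draw (without replacement) from the remaining rows."""
    labels = np.asarray(labels)
    minority_idx = np.flatnonzero(labels == minority_label)
    majority_idx = np.flatnonzero(labels != minority_label)
    rng = np.random.default_rng(seed)  # fixed seed -> same subset on every run
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    return np.concatenate([minority_idx, sampled_majority])

# Toy labels for illustration: 95 normal rows and 5 fraud rows.
toy_labels = np.array([0] * 95 + [1] * 5)
balanced = undersample_indices(toy_labels, seed=42)
print(len(balanced), toy_labels[balanced].sum())
```

With a fixed seed the balanced subset, and therefore every score computed from it, can be reproduced exactly.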

# Split the datasets

from sklearn.model_selection import train_test_split

# Split the undersampled features and labels

X_train_under_sample,X_test_under_sample,Y_train_under_sample,Y_test_under_sample = train_test_split(X_under_sample,

Y_under_sample,

test_size=0.3,

random_state=0)

# Split the original, unprocessed features and labels for later use

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,

test_size=0.3,

random_state=0)

/Applications/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

return f(*args, **kwds)

# Check the shapes after splitting the sampled data; check often so anomalies are caught early

print(X_train_under_sample.shape,

Y_train_under_sample.shape,

'\n',X_test_under_sample.shape,

Y_test_under_sample.shape)

(688, 29) (688, 1)

(296, 29) (296, 1)

# Check the shapes after splitting the original, unprocessed data

print(X_train.shape,

Y_train.shape,

'\n',X_test.shape,

Y_test.shape)

(199364, 29) (199364, 1)

(85443, 29) (85443, 1)

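3. Cross-validation and parameter tuning

Once a model has been trained, validating it is essential: validation tells us how well the model performs and whether it is fit for use. Parameter tuning, in turn, is one of the most important factors in how good the model ends up being.

In machine learning, once the algorithm has been chosen, much of the remaining training work is settling on a set of parameter values (tuning). Tuning is largely trial and error, but it follows a routine:

1. Train a model on part of the data with a given parameter value.

2. Feed a different part of the data into the trained model.

3. Compare the outputs with the labels and compute an evaluation metric.

4. Use that metric to judge whether the parameter value was a good choice.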
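The four steps above can be sketched compactly with scikit-learn's `cross_val_score`. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the folds are stratified and shuffled (a choice made here, not something the article does), and the names `X_demo`, `y_demo` and `best_c_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced data standing in for the real training set.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_recall = {}
for c_param in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
    # Steps 1-3: fit on four folds, predict the fifth, score recall; repeated 5 times.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='recall')
    mean_recall[c_param] = scores.mean()

# Step 4: keep the parameter value with the best mean score.
best_c_demo = max(mean_recall, key=mean_recall.get)
print(mean_recall, best_c_demo)
```

The article's own function does the same thing with explicit loops, which makes the per-fold recall values visible in the log.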

So we need evaluation metrics to measure the effect. Two important ones are used here: accuracy and recall.

import sklearn
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import KFold,cross_val_score
from sklearn.metrics import (confusion_matrix,recall_score,
                             classification_report)

# Cross-validate each candidate C with 5-fold KFold and return the C with the highest mean recall
def printing_Kfold_scores(X_train_data,Y_train_data):
    # The logs in this article were produced with the legacy KFold(n, n_folds) API,
    # so the repr printed here looks different on current scikit-learn
    fold = KFold(n_splits=5,shuffle=False)
    print (fold)

    c_param_range = [0.01,0.1,1,10,100]

    # results_table is a DataFrame that stores the cross-validated recall for each C
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=['C_Parameter','Mean recall score'])
    results_table['C_Parameter'] = c_param_range

    j=0
    for c_param in c_param_range:
        print ('c_param:',c_param)
        recall_accs = []

        # fold.split yields (train_indices, validation_indices) pairs;
        # enumerate(..., start=1) numbers the folds from 1 instead of the default 0
        for iteration,indices in enumerate(fold.split(X_train_data), start=1):
            # the l1 penalty needs a solver that supports it, such as liblinear
            lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')
            lr.fit(X_train_data.iloc[indices[0],:].values,Y_train_data.iloc[indices[0],:].values.ravel())
            Y_pred_undersample = lr.predict(X_train_data.iloc[indices[1],:].values)
            recall_acc = recall_score(Y_train_data.iloc[indices[1],:].values,Y_pred_undersample)
            recall_accs.append(recall_acc)
            print ('Iteration:',iteration,'recall_acc:',recall_acc)

        # Mean recall for this value of C
        print ('Mean recall score',np.mean(recall_accs))
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j+=1

    # Best C
    # Note: results_table['Mean recall score'] has dtype object and must be cast to float64 before idxmax
    results_table['Mean recall score']=results_table['Mean recall score'].astype('float64')
    best_c = results_table['C_Parameter'].iloc[results_table['Mean recall score'].idxmax()]
    print ('best_c is :',best_c)
    return best_c

# Run it on the undersampled data

best_c = printing_Kfold_scores(X_train_under_sample,

Y_train_under_sample)

sklearn.cross_validation.KFold(n=688,

n_folds=5,

shuffle=False,

random_state=None)

c_param: 0.01

Iteration: 1 recall_acc: 0.931506849315

Iteration: 2 recall_acc: 0.917808219178

Iteration: 3 recall_acc: 1.0

Iteration: 4 recall_acc: 0.959459459459

Iteration: 5 recall_acc: 0.954545454545

Mean recall score 0.9526639965

c_param: 0.1

Iteration: 1 recall_acc: 0.835616438356

Iteration: 2 recall_acc: 0.86301369863

Iteration: 3 recall_acc: 0.915254237288

Iteration: 4 recall_acc: 0.918918918919

Iteration: 5 recall_acc: 0.893939393939

Mean recall score 0.885348537427

c_param: 1

Iteration: 1 recall_acc: 0.849315068493

Iteration: 2 recall_acc: 0.890410958904

Iteration: 3 recall_acc: 0.966101694915

Iteration: 4 recall_acc: 0.945945945946

Iteration: 5 recall_acc: 0.893939393939

Mean recall score 0.90914261244

c_param: 10

Iteration: 1 recall_acc: 0.86301369863

Iteration: 2 recall_acc: 0.904109589041

Iteration: 3 recall_acc: 0.966101694915

Iteration: 4 recall_acc: 0.932432432432

Iteration: 5 recall_acc: 0.909090909091

Mean recall score 0.914949664822

c_param: 100

Iteration: 1 recall_acc: 0.890410958904

Iteration: 2 recall_acc: 0.904109589041

Iteration: 3 recall_acc: 0.983050847458

Iteration: 4 recall_acc: 0.959459459459

Iteration: 5 recall_acc: 0.909090909091

Mean recall score 0.929224352791

best_c is : 0.01

4. Confusion matrix

Define a function, plot_confusion_matrix, that draws the confusion matrix:

import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Draw the matrix as a colored grid
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title,fontsize=22)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each count in its cell, white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 fontsize=15,
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label',fontsize=15)
    plt.xlabel('Predicted label',fontsize=15)

Feed the test data from the undersampled set into the model, then plot the confusion matrix from the predictions and the true labels.

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')

lr.fit(X_train_under_sample,Y_train_under_sample.values.ravel())

Y_pred_undersample = lr.predict(X_test_under_sample.values)

cnf_matrix = confusion_matrix(Y_test_under_sample,Y_pred_undersample)

np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",

float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0,1]

f,ax=plt.subplots(figsize=(8,6))

plot_confusion_matrix(cnf_matrix

, classes=class_names

, title='Confusion matrix')

plt.show()

Recall metric in the testing dataset: 0.925170068027

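From the confusion matrix shown above, recall is:

recall = TP/(TP+FN) = 136/(136+11) ≈ 0.925

Recall depends only on TP and FN, so it says nothing about FP (transactions that are actually 0, with no fraud risk, but are predicted as 1, that is, flagged as risky). A model can have a high recall and still produce a very large FP. When tuning, therefore, look not only at recall but also at the confusion matrix, to check FP and the other counts.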
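As a small sketch of that point, the two hypothetical helpers below compute recall and precision from confusion-matrix counts. TP = 136 and FN = 11 are the values quoted in the text; the FP values in the loop are invented purely to show that recall does not move while precision collapses.

```python
def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    return tp / (tp + fn)

def precision_from_counts(tp, fp):
    """Precision = TP / (TP + FP): the share of predicted positives that are real."""
    return tp / (tp + fp)

# TP and FN as quoted in the text for the undersampled test set.
print(recall_from_counts(tp=136, fn=11))

# The FP values below are hypothetical, chosen only to show that recall
# stays the same while precision collapses as false positives grow.
for fp in (10, 1000, 10000):
    print(fp, recall_from_counts(136, 11), round(precision_from_counts(136, fp), 4))
```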

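The recall and confusion matrix above were computed on test data drawn from the undersampled set. That set is tiny compared with the original data, so the result is not very convincing. We should therefore test on the original data (the data that was not undersampled).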

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')

lr.fit(X_train_under_sample,Y_train_under_sample.values.ravel())

Y_pred = lr.predict(X_test.values)

# Compute confusion matrix

cnf_matrix = confusion_matrix(Y_test,Y_pred)

np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix

class_names = [0,1]

f,ax=plt.subplots(figsize=(8,6))

plot_confusion_matrix(cnf_matrix

, classes=class_names

, title='Confusion matrix')

plt.show()

Recall metric in the testing dataset: 0.918367346939

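As the figure above shows, the logistic regression model trained on undersampled data has a fairly high recall, but its FP count is very high at 8404, which means a very high false-alarm rate. This is one drawback of undersampling. Oversampling, tried later in this article, is expected to do much better.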
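To put the 8404 false positives in proportion, here is a rough calculation. FP = 8404 and the test-set size of 85443 come from the article; TP = 135 and FN = 12 are not quoted anywhere and are inferred here from the printed recall (0.918367346939 equals 135/147), so treat them as an assumption.

```python
# FP = 8404 and the test-set size 85443 are quoted in the article.
# TP = 135 and FN = 12 are NOT quoted: they are inferred from the printed
# recall 0.918367346939, which equals 135/147, and are an assumption.
fp = 8404
tp, fn = 135, 12
n_test = 85443
tn = n_test - tp - fn - fp

recall = tp / (tp + fn)
precision = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)

print(f"recall:              {recall:.4f}")
print(f"precision:           {precision:.4f}")
print(f"false positive rate: {false_positive_rate:.4f}")
```

Under these assumptions only a small fraction of the transactions flagged as fraud are actually fraudulent, which is the cost hidden behind the high recall.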

Now try training on the original data X_train, Y_train and see how it performs:

best_c = printing_Kfold_scores(X_train,Y_train)

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')

lr.fit(X_train,Y_train.values.ravel())

Y_pred = lr.predict(X_test.values)

# Compute confusion matrix

cnf_matrix = confusion_matrix(Y_test,Y_pred)

np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix

class_names = [0,1]

plt.figure()

plot_confusion_matrix(cnf_matrix

, classes=class_names

, title='Confusion matrix')

plt.show()

sklearn.cross_validation.KFold(n=199364, n_folds=5, shuffle=False, random_state=None)

c_param: 0.01

Iteration: 1 recall_acc: 0.492537313433

Iteration: 2 recall_acc: 0.602739726027

Iteration: 3 recall_acc: 0.683333333333

Iteration: 4 recall_acc: 0.569230769231

Iteration: 5 recall_acc: 0.45

Mean recall score 0.559568228405

c_param: 0.1

Iteration: 1 recall_acc: 0.567164179104

Iteration: 2 recall_acc: 0.616438356164

Iteration: 3 recall_acc: 0.683333333333

Iteration: 4 recall_acc: 0.584615384615

Iteration: 5 recall_acc: 0.525

Mean recall score 0.595310250644

c_param: 1

Iteration: 1 recall_acc: 0.55223880597

Iteration: 2 recall_acc: 0.616438356164

Iteration: 3 recall_acc: 0.716666666667

Iteration: 4 recall_acc: 0.615384615385

Iteration: 5 recall_acc: 0.5625

Mean recall score 0.612645688837

c_param: 10

Iteration: 1 recall_acc: 0.55223880597

Iteration: 2 recall_acc: 0.616438356164

Iteration: 3 recall_acc: 0.733333333333

Iteration: 4 recall_acc: 0.615384615385

Iteration: 5 recall_acc: 0.575

Mean recall score 0.61847902217

c_param: 100

Iteration: 1 recall_acc: 0.55223880597

Iteration: 2 recall_acc: 0.616438356164

Iteration: 3 recall_acc: 0.733333333333

Iteration: 4 recall_acc: 0.615384615385

Iteration: 5 recall_acc: 0.575

Mean recall score 0.61847902217

best_c is : 10.0

Recall metric in the testing dataset: 0.619047619048

5. Tuning the Threshold parameter

lr = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_under_sample,Y_train_under_sample.values.ravel())

# Predicted probabilities; column 1 is the probability of Class == 1 (fraud)
y_pred_undersample_proba = lr.predict_proba(X_test_under_sample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(15,15))
recall_accs = []
j = 1
for i in thresholds:
    # Flag a transaction as fraud when its fraud probability exceeds the threshold
    y_test_predictions_high_recall = (y_pred_undersample_proba[:,1] > i).astype(int)

    plt.subplot(3,3,j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y_test_under_sample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    recall_acc = float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1])
    print('Threshold>=%s Recall: '%i, recall_acc)
    recall_accs.append(recall_acc)

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold>=%s'%i)

Threshold>=0.1 Recall: 1.0

Threshold>=0.2 Recall: 1.0

Threshold>=0.3 Recall: 1.0

Threshold>=0.4 Recall: 0.986394557823

Threshold>=0.5 Recall: 0.925170068027

Threshold>=0.6 Recall: 0.863945578231

Threshold>=0.7 Recall: 0.823129251701

Threshold>=0.8 Recall: 0.734693877551

Threshold>=0.9 Recall: 0.571428571429

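As the figure above shows, Threshold = 0.5 gives the best balance. Lower thresholds push recall up to 1.0, but a lower threshold flags more transactions as fraud and so can only increase the number of false positives, while thresholds above 0.5 lose recall quickly.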
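The trade-off can be seen in a few lines of NumPy. The sketch below uses a hypothetical helper, `sweep_thresholds`, and made-up labels and probabilities; it is not the article's model, only an illustration of how recall and the false-positive count move in opposite directions as the threshold changes.

```python
import numpy as np

def sweep_thresholds(y_true, fraud_proba, thresholds):
    """For each threshold, return (threshold, recall, false_positives)."""
    y_true = np.asarray(y_true)
    fraud_proba = np.asarray(fraud_proba)
    rows = []
    for t in thresholds:
        y_pred = fraud_proba > t
        tp = int(np.sum(y_pred & (y_true == 1)))
        fn = int(np.sum(~y_pred & (y_true == 1)))
        fp = int(np.sum(y_pred & (y_true == 0)))
        rows.append((t, tp / (tp + fn), fp))
    return rows

# Made-up labels and probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
fraud_proba = [0.95, 0.80, 0.55, 0.35, 0.60, 0.45, 0.30, 0.15, 0.10, 0.05]

for t, recall, fp in sweep_thresholds(y_true, fraud_proba, [0.2, 0.5, 0.7]):
    print(f"threshold={t}: recall={recall:.2f}, false positives={fp}")
```

Reporting the false-positive count next to recall for each threshold makes the choice of threshold easier to justify than recall alone.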

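6. Oversampling

Unlike undersampling, which discards data, oversampling takes the opposite approach: it generates additional samples for the minority class until that class reaches a size comparable to the majority class.

How, then, should the new data be generated so that the minority class grows to a matching size?

One of the most commonly used methods is the SMOTE algorithm. For a detailed description of SMOTE, see this paper:

The steps are worked through below.

Generating the data

Separate the features and labels in the data.

Split the data into training and test sets at a ratio of 7:3.

Apply SMOTE to the training samples to obtain a balanced training set.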
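The core idea behind SMOTE is interpolation: a synthetic sample is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea in NumPy, not the implementation in imbalanced-learn; the function `smote_like`, its parameters and the toy data are all hypothetical.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (Euclidean distance)."""
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority rows; a row is not its own neighbour.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        base = rng.integers(len(minority))       # pick a minority row
        other = rng.choice(neighbours[base])     # pick one of its neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[row] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic

# Toy 2-D minority class, for illustration only.
minority_rows = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like(minority_rows, n_new=6, k=2, seed=1)
print(new_rows.shape)
```

Because every synthetic row lies between two real minority rows, the new samples stay inside the region the minority class already occupies.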

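columns=data.columns

# Use every column except the label as a feature.
# The original line was columns.delete(len(columns)-1), which removes the last column
# (normAmount) and leaves Class among the features; the outputs logged in this section
# come from that original run.
features_columns=columns.drop('Class')

features=data[features_columns]
labels=data['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            test_size=0.3,
                                                                            random_state=0)

oversampler=SMOTE(random_state=0)
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)

# Count the rows labelled 1 after oversampling. The original expression,
# len(os_labels[os_labels==1]), returns the total number of rows instead.
print(int((os_labels.values==1).sum()))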

398038

features_columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',

'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',

'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Class'],

dtype='object')
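The Index printed above contains Class and lacks normAmount, which means the label was among the features in this run. The sketch below reproduces the effect with a plain Python list in the column order shown by data.info() earlier, and adds a hypothetical guard, `check_no_label_leak`, that could be called before training. The helper and its name are assumptions for illustration.

```python
# Column order as left by the data-preparation step: V1..V28, then Class,
# then normAmount (appended last when it was created).
columns = [f"V{i}" for i in range(1, 29)] + ["Class", "normAmount"]

# Positional removal of the last column drops normAmount and keeps the label.
by_position = columns[:-1]

# Removal by name drops the label regardless of where it sits.
by_name = [c for c in columns if c != "Class"]

def check_no_label_leak(feature_names, label_name="Class"):
    """Raise if the label column is among the features."""
    if label_name in feature_names:
        raise ValueError(f"label column {label_name!r} is among the features")

check_no_label_leak(by_name)          # passes silently
try:
    check_no_label_leak(by_position)
except ValueError as err:
    print("leak detected:", err)
```

Selecting feature columns by name rather than by position avoids this kind of mistake when columns are added or reordered.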

# Check the dataset generated by oversampling

os_features.shape,os_labels.shape

((398038, 29), (398038, 1))

best_c = printing_Kfold_scores(os_features,os_labels)

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')

lr.fit(os_features,os_labels.values.ravel())

y_pred = lr.predict(features_test.values)

# Pass the predictions to the confusion-matrix function

cnf_matrix = confusion_matrix(labels_test,y_pred)

np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",

float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0,1]

plt.figure()

plot_confusion_matrix(cnf_matrix

, classes=class_names

, title='Confusion matrix')

plt.show()

sklearn.cross_validation.KFold(n=398038, n_folds=5,

shuffle=False, random_state=None)

c_param: 0.01

Iteration: 1 recall_acc: 1.0

Iteration: 2 recall_acc: 1.0

Iteration: 3 recall_acc: 1.0

Iteration: 4 recall_acc: 1.0

Iteration: 5 recall_acc: 1.0

Mean recall score 1.0

c_param: 0.1

Iteration: 1 recall_acc: 1.0

Iteration: 2 recall_acc: 1.0

Iteration: 3 recall_acc: 1.0

Iteration: 4 recall_acc: 1.0

Iteration: 5 recall_acc: 1.0

Mean recall score 1.0

c_param: 1

Iteration: 1 recall_acc: 1.0

Iteration: 2 recall_acc: 1.0

Iteration: 3 recall_acc: 1.0

Iteration: 4 recall_acc: 1.0

Iteration: 5 recall_acc: 1.0

Mean recall score 1.0

c_param: 10

Iteration: 1 recall_acc: 1.0

Iteration: 2 recall_acc: 1.0

Iteration: 3 recall_acc: 1.0

Iteration: 4 recall_acc: 1.0

Iteration: 5 recall_acc: 1.0

Mean recall score 1.0

c_param: 100

Iteration: 1 recall_acc: 1.0

Iteration: 2 recall_acc: 1.0

Iteration: 3 recall_acc: 1.0

Iteration: 4 recall_acc: 1.0

Iteration: 5 recall_acc: 1.0

Mean recall score 1.0

best_c is : 0.01

Recall metric in the testing dataset: 1.0

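Taken at face value, the log above says that oversampling raises recall further (the intended explanation being that more training data helps) and, above all, cuts the false positives sharply: from 8404 under undersampling to 0 here. These numbers should be read with caution, though. The features_columns output above shows that Class was among the features in this run and normAmount was not, so the model could read the label directly from its input. That alone explains a recall of 1.0 on every fold and zero false positives. A fair comparison of the two sampling methods requires rerunning this section with Class excluded from the features.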

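7. Summary

When you get a dataset, first look at its structure and check whether it is imbalanced.

If the data is imbalanced, rebalance it with undersampling or oversampling before choosing a model and algorithm.

Tuning a model is a painful process; only repeated experiments reveal which parameter values work best.

When evaluating predictions, weigh accuracy, recall, the confusion matrix and other measures together rather than fixating on a single number.

That is all for this article. If time permits, I will later try a decision tree model on the same dataset.

Thank you for reading.

(Readers are scarce and encouragement is badly needed. If you made it this far, please like and bookmark the article. Thanks, friends!)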
