Python Data Analysis and Machine Learning in Practice (7): A Logistic Regression Case Study

Contents

Case study: credit card fraud detection

Dealing with class imbalance

Undersampling

Cross-validation

Model evaluation methods

Confusion matrix

Confusion matrix after undersampling

Confusion matrix on the original dataset

Results on the original data without undersampling

Effect of the logistic regression threshold on the results

Oversampling

SMOTE algorithm

Cross-validation

Confusion matrix

Summary


Case study: credit card fraud detection

The creditcard dataset contains 284,807 rows and 31 columns: one bank's credit card transaction records. The data were anonymized before release to protect customer privacy, which is why the feature columns carry the generic names V1~V28.

<This dataset cannot be found on GitHub; I found it through a blog post.>

(Data download: https://pan.baidu.com/s/1fzqeieHOrBmV5TJ1RfhBbw  extraction code: buiu (reposted))

Normal transactions are labeled 0 and fraudulent ones are labeled 1. Goal: use the 30 known features to predict the Class label, i.e., decide whether a transaction is normal or fraudulent (separate class 0 from class 1).

This is a classic binary classification problem.

For data like this, the class-0 (normal) samples will almost certainly vastly outnumber the class-1 (fraud) samples.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
# Read in the data: 284,807 rows x 31 columns (fairly large)
data = pd.read_csv("E:/cluster/data/creditcard.csv")
data.head()  # peek at the first 5 rows

Looking at the value range of each column, most features lie roughly between -1 and 1, but Amount (the 29th feature) ranges from tens to hundreds. Such a large gap can mislead the learning algorithm into treating the wide-range feature as more important. To put every feature on an equal footing, the wide-range column should be normalized or standardized (into roughly the 0~1 or -1~1 range). Below, sklearn is used to do this.

# Import the standardization module
from sklearn.preprocessing import StandardScaler

# Note: without .values this raises: 'Series' object has no attribute 'reshape'
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
# -1: let NumPy infer the number of rows from the single column
data = data.drop(['Time','Amount'],axis = 1)
# the Time and Amount columns are no longer needed, so drop them
data.head()

count_classes = data['Class'].value_counts(sort=True).sort_index()  # count of each distinct value in the Class column
count_classes.plot(kind='bar')  # bar chart
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
# the fraud bar is barely visible (there are very few fraud samples)

The plot shows that the class distribution is extremely imbalanced.

Dealing with class imbalance

Undersampling

Make the two classes equally small: randomly draw from the majority class only as many samples as the minority class has.

X = data.iloc[:,data.columns != "Class"]
# all columns except Class; ':' means take every row
y = data.iloc[:,data.columns == "Class"]
# only the Class column (the labels)

number_records_fraud = len(data[data.Class==1])
# number of samples with Class == 1
fraud_indices = np.array(data[data.Class==1].index)
# indices of all Class == 1 samples

normal_indices = data[data.Class==0].index  # indices of all Class == 0 samples
# randomly pick number_records_fraud of them from normal_indices
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace = False)
# convert the chosen indices to a numpy array
random_normal_indices = np.array(random_normal_indices)
# these are the few hundred randomly chosen Class == 0 samples

# merge the two groups of indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

under_sample_data = data.iloc[under_sample_indices,:]  # the undersampled dataset

X_undersimply = under_sample_data.iloc[:,under_sample_data.columns != "Class"]
# .ix no longer exists on DataFrames; use .iloc instead
y_undersimply = under_sample_data.iloc[:,under_sample_data.columns == "Class"]

print("Percentage of normal transactions: ",len(under_sample_data[under_sample_data.Class==0])/len(under_sample_data))
print("Percentage of fraud transactions: ",len(under_sample_data[under_sample_data.Class==1])/len(under_sample_data))
print("Total number of transactions in resample data: ",len(under_sample_data))
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resample data:  984

Now class 0 and class 1 each make up 50% of the data. But because only a fraction of the class-0 samples was kept, this approach brings its own problems.

Cross-validation

In machine learning, the data are routinely split.

Step 1: 80% --> training set (used to build a model); the remaining 20% --> test set (used to evaluate that model).

When building the model, parameters have to be chosen, and choosing them is hard.

Step 2: split the training set into equal parts (say three folds: ①, ②, ③).

To pick a suitable parameter value, use cross-validation.

For the same parameter setting, the training folds and the validation fold change every round. Why do this?

For stability: it makes the model evaluation trustworthy (neither optimistically high nor pessimistically low).

Cross-validation: round 1, train a model on ① and ②, validate the parameter on ③;

                  round 2, train a model on ① and ③, validate the parameter on ②;

                  round 3, train a model on ② and ③, validate the parameter on ① (see the KFold sketch below).
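
The three rounds above map directly onto sklearn's KFold. A minimal sketch, with a toy array invented purely for illustration:

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)   # 6 toy samples, 2 features each
kf = KFold(n_splits=3, shuffle=False)

for round_no, (train_idx, val_idx) in enumerate(kf.split(X_demo), start=1):
    # each round: two folds train the model, the remaining fold validates it
    print(f"Round {round_no}: train folds = {train_idx}, validation fold = {val_idx}")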

# from sklearn.cross_validation import train_test_split  # the cross_validation module was removed; it now lives in model_selection
from sklearn.model_selection import train_test_split
# data-splitting utility
# original (full) dataset
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
# test_size: split ratio; random_state fixes the shuffle so the split is reproducible

print("Number transactions train dataset: ",len(X_train))
print("Number transactions test dataset: ",len(X_test))
print("Total number of transactions: ",len(X_train)+len(X_test))
# undersampled dataset
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample=train_test_split(X_undersimply,y_undersimply,test_size=0.3,random_state=0)
print("")
print("Number transactions train dataset: ",len(X_train_undersample))
print("Number transactions test dataset: ",len(X_test_undersample))
print("Total number of transactions: ",len(X_train_undersample)+len(X_test_undersample))

# in the end, the model built on the undersampled data will be validated on the test split of the original data
Number transactions train dataset:  199364
Number transactions test dataset:  85443
Total number of transactions:  284807

Number transactions train dataset:  688
Number transactions test dataset:  296
Total number of transactions:  984

Model evaluation methods

1. Accuracy: unreliable when the classes are imbalanced. For example, suppose we want to detect cancer patients among 1,000 samples, of which 990 are healthy and 10 have cancer. A model that labels all 1,000 as healthy reaches 99% accuracy, yet it is useless.

2. Recall: in the same example, recall = (cancer patients the model detects) / 10 (all cancer patients in the sample).

Recall = TP / (TP + FN)

Here TP is the number of samples the model flags that really are positive; FP is the number it flags that are actually negative (flagged samples minus TP);

FN is the number of positive samples the model fails to flag; TN is the number of unflagged samples that indeed should not have been flagged.
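
A quick numeric check of the formula, using made-up counts for the cancer example above (say the model flags 8 of the 10 patients and also wrongly flags 20 healthy people):

# Hypothetical counts for the 1,000-sample cancer example (invented for illustration).
TP = 8     # cancer patients the model flags
FN = 2     # cancer patients the model misses
FP = 20    # healthy people wrongly flagged
TN = 970   # healthy people correctly left alone

recall = TP / (TP + FN)                        # 8 / 10 = 0.8
accuracy = (TP + TN) / (TP + TN + FP + FN)     # 978 / 1000 = 0.978
print(recall, accuracy)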

If two parameter settings give the same recall, prefer the one whose scores are more stable (large fluctuations across folds suggest overfitting).

Overfitting: the model fits the known training samples very well, but its predictions on new data may be far off.

L2 regularization: loss (the loss function) + ½·w² (penalty term)

L1 regularization: loss + |w| (penalty term)
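
In sklearn's LogisticRegression the penalty weight is controlled by C, the inverse of the regularization strength: smaller C means a heavier penalty. A minimal sketch (the toy data below are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.randn(100, 5)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(100) > 0).astype(int)

for C in [0.01, 1, 100]:
    l1 = LogisticRegression(C=C, penalty='l1', solver='liblinear').fit(X_toy, y_toy)
    l2 = LogisticRegression(C=C, penalty='l2', solver='liblinear').fit(X_toy, y_toy)
    # L1 tends to push some coefficients exactly to zero; L2 only shrinks them
    print(C, np.round(l1.coef_, 2), np.round(l2.coef_, 2))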

# Recall
from sklearn.linear_model import LogisticRegression
# from sklearn.cross_validation import KFold, cross_val_score  # the cross_validation module now lives in model_selection
from sklearn.model_selection import KFold, cross_val_score
# KFold: splits the data into any number of folds for cross-validation; cross_val_score: evaluates a model with cross-validation
from sklearn.metrics import confusion_matrix,recall_score,classification_report
import itertools

def printing_KFlod_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
    # in newer sklearn, KFold is built as KFold(n_splits, shuffle, random_state)
    # candidate C values (inverse regularization strength)
    c_param_range = [0.01,0.1,1,10,100]  # penalty strengths to try
    
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=['c_parameter','Mean recall score'])
    results_table['c_parameter'] = c_param_range
    
    j = 0
    # try every C and keep the one whose mean recall is best
    for c_param in c_param_range:
        print("-----------------------------------")
        print('c_parameter: ',c_param)
        print("-----------------------------------")
        print('')
        
        recall_accs = []
        it = 0
        # cross-validation loop
        for train_idx, test_idx in fold.split(x_train_data):
            # instantiate a logistic regression model with L1 regularization
            lr = LogisticRegression(C = c_param, penalty = 'l1',solver='liblinear')
            # fit the model on the training folds
            lr.fit(x_train_data.iloc[train_idx, :], y_train_data.iloc[train_idx, :].values.ravel())
            # predict labels for the validation fold
            y_pred_undersample = lr.predict(x_train_data.iloc[test_idx, :].values)
            # compute recall for this round of cross-validation (5 rounds in total)
            recall_acc = recall_score(y_train_data.iloc[test_idx, :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            # print the recall for the current round
            print(f"Iteration: {it}, recall = {recall_acc}")
            it += 1
        # compute and print the mean recall for this penalty strength
        results_table.loc[j, "Mean recall score"] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    # pandas' argmax was deprecated; use idxmax instead
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['c_parameter']
    print("********************************************************************************************")
    print("Best model to choose from cross validation is with C parameter = ", best_c)
    print("********************************************************************************************")
    
    return best_c

best_c = printing_KFlod_scores(X_train_undersample,y_train_undersample)
-----------------------------------
c_parameter:  0.01
-----------------------------------

Iteration: 0, recall = 0.9726027397260274
Iteration: 1, recall = 0.9452054794520548
Iteration: 2, recall = 1.0
Iteration: 3, recall = 0.972972972972973
Iteration: 4, recall = 1.0

Mean recall score  0.9781562384302112

-----------------------------------
c_parameter:  0.1
-----------------------------------

Iteration: 0, recall = 0.8356164383561644
Iteration: 1, recall = 0.863013698630137
Iteration: 2, recall = 0.9152542372881356
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.8907539428319554

-----------------------------------
c_parameter:  1
-----------------------------------

Iteration: 0, recall = 0.8767123287671232
Iteration: 1, recall = 0.863013698630137
Iteration: 2, recall = 0.9491525423728814
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.9057527819310962

-----------------------------------
c_parameter:  10
-----------------------------------

Iteration: 0, recall = 0.8767123287671232
Iteration: 1, recall = 0.863013698630137
Iteration: 2, recall = 0.9491525423728814
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.9057527819310962

-----------------------------------
c_parameter:  100
-----------------------------------

Iteration: 0, recall = 0.8767123287671232
Iteration: 1, recall = 0.8904109589041096
Iteration: 2, recall = 0.9661016949152542
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.9146220644943653
********************************************************************************************
Best model to choose from cross validation is with C parameter =  0.01
********************************************************************************************

Confusion matrix

Confusion matrix after undersampling

import itertools
def plot_confusion_matrix(cm, classes,
                         title="Confusion matrix",
                         cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm,interpolation='nearest',cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks=np.arange(len(classes))
    plt.xticks(tick_marks,classes,rotation=0)
    plt.yticks(tick_marks,classes)
    
    thresh = cm.max()/2
    for i,j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
        plt.text(j,i,cm[i,j],
                horizontalalignment="center",
                color="white" if cm[i,j] > thresh else "black")
    # layout and axis labels only need to be set once, outside the loop
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# two parameters interact with the penalty choice: dual and solver; for the L1 penalty, dual must be False and solver must be 'liblinear'
lr = LogisticRegression(C=best_c,penalty='l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample=lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.9455782312925171

The x-axis is the predicted label and the y-axis is the true label.

Cell (predicted=1, true=1) is TP; cell (predicted=0, true=1) is FN.

Accuracy = (138 + 139) / (138 + 139 + 11 + 8)
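
The same quantities can be read straight off the cnf_matrix computed above; a small sketch:

# Accuracy and recall read directly from the confusion matrix (rows = true label, columns = predicted label).
TN, FP = cnf_matrix[0, 0], cnf_matrix[0, 1]
FN, TP = cnf_matrix[1, 0], cnf_matrix[1, 1]

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)
print("accuracy =", accuracy, " recall =", recall)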

Confusion matrix on the original dataset

lr = LogisticRegression(C=best_c,penalty='l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred=lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.9319727891156463

The count in cell (predicted=1, true=0), i.e., normal transactions misclassified as fraud, is fairly large, which drags the precision down. This "error" problem is the price of training only on the undersampled data.

Results on the original data without undersampling

best_c = printing_KFlod_scores(X_train,y_train)
-----------------------------------
c_parameter:  0.01
-----------------------------------

Iteration: 0, recall = 0.4925373134328358
Iteration: 1, recall = 0.6027397260273972
Iteration: 2, recall = 0.6833333333333333
Iteration: 3, recall = 0.5692307692307692
Iteration: 4, recall = 0.45

Mean recall score  0.5595682284048672

-----------------------------------
c_parameter:  0.1
-----------------------------------

Iteration: 0, recall = 0.5671641791044776
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.6833333333333333
Iteration: 3, recall = 0.5846153846153846
Iteration: 4, recall = 0.525

Mean recall score  0.5953102506435158

-----------------------------------
c_parameter:  1
-----------------------------------

Iteration: 0, recall = 0.5522388059701493
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.7166666666666667
Iteration: 3, recall = 0.6153846153846154
Iteration: 4, recall = 0.5625

Mean recall score  0.612645688837163

-----------------------------------
c_parameter:  10
-----------------------------------

Iteration: 0, recall = 0.5522388059701493
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.7333333333333333
Iteration: 3, recall = 0.6153846153846154
Iteration: 4, recall = 0.575

Mean recall score  0.6184790221704963

-----------------------------------
c_parameter:  100
-----------------------------------

Iteration: 0, recall = 0.5522388059701493
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.7333333333333333
Iteration: 3, recall = 0.6153846153846154
Iteration: 4, recall = 0.575

Mean recall score  0.6184790221704963

********************************************************************************************
Best model to choose from cross validation is with C parameter =  10.0
********************************************************************************************
lr = LogisticRegression(C=best_c,penalty='l1',solver='liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred=lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.6190476190476191

Effect of the logistic regression threshold on the results

The classification threshold can be set by hand. The sigmoid output is a probability; for example, with the threshold set to 0.1, any sample whose predicted probability exceeds 0.1 is classified as positive.

lr = LogisticRegression(C = 0.01, penalty = 'l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)  # earlier we predicted class labels; here we predict probabilities, so different thresholds can be compared
# candidate thresholds
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(10,10))
 
j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    plt.subplot(3,3,j)
    j += 1
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
 
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
 
    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s'%i)
Recall metric in the testing dataset:  1.0   (recall is very high at this threshold, but precision is very low)
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.9727891156462585
Recall metric in the testing dataset:  0.9387755102040817
Recall metric in the testing dataset:  0.891156462585034
Recall metric in the testing dataset:  0.8367346938775511
Recall metric in the testing dataset:  0.7619047619047619
Recall metric in the testing dataset:  0.6054421768707483
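
To see the trade-off noted above, precision can be computed alongside recall for each threshold. A small sketch reusing the probabilities predicted in the loop above:

# Precision vs. recall for each threshold, reusing y_pred_undersample_proba from above.
from sklearn.metrics import precision_score, recall_score

for t in thresholds:
    preds = (y_pred_undersample_proba[:, 1] > t).astype(int)
    p = precision_score(y_test_undersample, preds)
    r = recall_score(y_test_undersample, preds)
    print(f"threshold {t}: precision = {p:.3f}, recall = {r:.3f}")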

Oversampling

Make the two classes equally large by generating new samples for the minority class.

SMOTE algorithm

Note: the test set is left untouched; synthetic samples are generated only for the training set. (A rough sketch of the SMOTE idea follows.)
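
SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. A minimal hand-rolled sketch of that idea (the toy points are invented purely for illustration; this is not the imblearn implementation):

import numpy as np

rng = np.random.RandomState(0)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])   # toy minority-class points

x = minority[0]
# nearest minority neighbour of x (excluding x itself)
dists = np.linalg.norm(minority - x, axis=1)
neighbour = minority[np.argsort(dists)[1]]

synthetic = x + rng.rand() * (neighbour - x)   # lies on the segment between the two points
print(synthetic)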

import pandas as pd
# oversampling
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package (pip install imbalanced-learn)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split 

credit_cards=pd.read_csv('E:/cluster/data/creditcard.csv')

columns=credit_cards.columns
features_columns=columns.delete(len(columns)-1)

features=credit_cards[features_columns]
labels=credit_cards['Class']

features_train,features_test,labels_train,labels_test=train_test_split(features,labels,test_size=0.2,random_state=0)

oversampler=SMOTE(random_state=0)
# fit on the training data only and resample it
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)
len(os_labels[os_labels==1])
227454

Cross-validation

os_features=pd.DataFrame(os_features)
os_labels=pd.DataFrame(os_labels)
best_c=printing_KFlod_scores(os_features,os_labels)
-----------------------------------
c_parameter:  0.01
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9688834790306518
Iteration: 3, recall = 0.9576944636792297
Iteration: 4, recall = 0.958408898561238

Mean recall score  0.9340092528043087

-----------------------------------
c_parameter:  0.1
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9702113533252186
Iteration: 3, recall = 0.9600356118310417
Iteration: 4, recall = 0.9605741858190172

Mean recall score  0.9351761147451404

-----------------------------------
c_parameter:  1
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9703884032311608
Iteration: 3, recall = 0.9598487596311318
Iteration: 4, recall = 0.9581451072201889

Mean recall score  0.9346883385665812

-----------------------------------
c_parameter:  10
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9703884032311608
Iteration: 3, recall = 0.9602444466427056
Iteration: 4, recall = 0.960739055407173

Mean recall score  0.9352862656062928

-----------------------------------
c_parameter:  100
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9703220095164324
Iteration: 3, recall = 0.9603323770897221
Iteration: 4, recall = 0.9608379771600664

Mean recall score  0.9353103573033291

********************************************************************************************
Best model to choose from cross validation is with C parameter =  100.0
********************************************************************************************

Confusion matrix

Compared with the undersampling model, recall drops a little, but precision improves, so the overall result is better.

lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)
 
# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.900990099009901
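
The precision claim above can be checked directly from this confusion matrix; a small sketch:

# Precision of the SMOTE-trained model, read from the confusion matrix above.
precision = cnf_matrix[1, 1] / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
print("Precision on the test set:", precision)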

Summary

Feature extraction (the data in this example had already been cleaned and anonymized).

First inspect the data: is the class distribution balanced?

For imbalanced data, with comparable computing resources, oversampling usually gives the better overall result.

Standardize the data (so all features vary over similar ranges).

[Undersampling (randomly drop majority-class samples)]

Use cross-validation to find the final parameter value.

Check the recall and the confusion matrix.

The logistic regression threshold can be set manually; compare experiments and pick a suitable one.

[Oversampling with SMOTE]
