Python Data Analysis and Machine Learning in Practice (7): A Logistic Regression Case Study

Contents

Case study: credit card fraud detection

Dealing with class imbalance

Undersampling

Cross-validation

Model evaluation methods

Confusion matrix

Confusion matrix after undersampling

Confusion matrix on the original dataset

Results on the original data without undersampling

Effect of the logistic regression threshold on the results

Oversampling

SMOTE algorithm

Cross-validation

Confusion matrix

Summary


Case study: credit card fraud detection

The creditcard dataset contains 284,807 rows and 31 columns: one bank's credit card transaction records. The data were anonymized before release to protect customer privacy, which is why the feature columns carry the generic names V1~V28.

<This dataset cannot be found on GitHub; I found it through a blog post.>

(Data download: https://pan.baidu.com/s/1fzqeieHOrBmV5TJ1RfhBbw  extraction code: buiu (reposted))

Normal transactions are labeled 0 and fraudulent ones are labeled 1. Goal: use the 30 known features to predict the Class label, i.e., decide whether a transaction is normal or fraudulent (separate class 0 from class 1).

This is a classic binary classification problem.

For data like this, the class-0 (normal) samples will almost certainly vastly outnumber the class-1 (fraud) samples.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
# Read in the data: 284,807 rows x 31 columns (fairly large)
data = pd.read_csv("E:/cluster/data/creditcard.csv")
data.head()  # peek at the first 5 rows

Looking at the value range of each column, most features lie roughly between -1 and 1, but Amount (the 29th feature) ranges from tens to hundreds. Such a large gap can mislead the learning algorithm into treating the wide-range feature as more important. To put every feature on an equal footing, the wide-range column should be normalized or standardized (into roughly the 0~1 or -1~1 range). Below, sklearn is used to do this.

# Import the standardization module
from sklearn.preprocessing import StandardScaler

# Note: without .values this raises: 'Series' object has no attribute 'reshape'
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
# -1: let NumPy infer the number of rows from the single column
data = data.drop(['Time','Amount'],axis = 1)
# the Time and Amount columns are no longer needed, so drop them
data.head()

count_classes = data['Class'].value_counts(sort=True).sort_index()  # count of each distinct value in the Class column
count_classes.plot(kind='bar')  # bar chart
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
# the fraud bar is barely visible (there are very few fraud samples)

The plot shows that the class distribution is extremely imbalanced.

Dealing with class imbalance

Undersampling

Make the two classes equally small: randomly draw from the majority class only as many samples as the minority class has.

X = data.iloc[:,data.columns != "Class"]
# all columns except Class; ':' means take every row
y = data.iloc[:,data.columns == "Class"]
# only the Class column (the labels)

number_records_fraud = len(data[data.Class==1])
# number of samples with Class == 1
fraud_indices = np.array(data[data.Class==1].index)
# indices of all Class == 1 samples

normal_indices = data[data.Class==0].index  # indices of all Class == 0 samples
# randomly pick number_records_fraud of them from normal_indices
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace = False)
# convert the chosen indices to a numpy array
random_normal_indices = np.array(random_normal_indices)
# these are the few hundred randomly chosen Class == 0 samples

# merge the two groups of indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

under_sample_data = data.iloc[under_sample_indices,:]  # the undersampled dataset

X_undersimply = under_sample_data.iloc[:,under_sample_data.columns != "Class"]
# .ix no longer exists on DataFrames; use .iloc instead
y_undersimply = under_sample_data.iloc[:,under_sample_data.columns == "Class"]

print("Percentage of normal transactions: ",len(under_sample_data[under_sample_data.Class==0])/len(under_sample_data))
print("Percentage of fraud transactions: ",len(under_sample_data[under_sample_data.Class==1])/len(under_sample_data))
print("Total number of transactions in resample data: ",len(under_sample_data))
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resample data:  984

Now class 0 and class 1 each make up 50% of the data. But because only a fraction of the class-0 samples was kept, this approach brings its own problems.

Cross-validation

In machine learning, the data are routinely split.

Step 1: 80% --> training set (used to build a model); the remaining 20% --> test set (used to evaluate that model).

When building the model, parameters have to be chosen, and choosing them is hard.

Step 2: split the training set into equal parts (say three folds: ①, ②, ③).

To pick a suitable parameter value, use cross-validation.

For the same parameter setting, the training folds and the validation fold change every round. Why do this?

For stability: it makes the model evaluation trustworthy (neither optimistically high nor pessimistically low).

Cross-validation: round 1, train a model on ① and ②, validate the parameter on ③;

                  round 2, train a model on ① and ③, validate the parameter on ②;

                  round 3, train a model on ② and ③, validate the parameter on ① (see the KFold sketch below).
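
The three rounds above map directly onto sklearn's KFold. A minimal sketch, with a toy array invented purely for illustration:

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)   # 6 toy samples, 2 features each
kf = KFold(n_splits=3, shuffle=False)

for round_no, (train_idx, val_idx) in enumerate(kf.split(X_demo), start=1):
    # each round: two folds train the model, the remaining fold validates it
    print(f"Round {round_no}: train folds = {train_idx}, validation fold = {val_idx}")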

# from sklearn.cross_validation import train_test_split  # the cross_validation module was removed; it now lives in model_selection
from sklearn.model_selection import train_test_split
# data-splitting utility
# original (full) dataset
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
# test_size: split ratio; random_state fixes the shuffle so the split is reproducible

print("Number transactions train dataset: ",len(X_train))
print("Number transactions test dataset: ",len(X_test))
print("Total number of transactions: ",len(X_train)+len(X_test))
# undersampled dataset
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample=train_test_split(X_undersimply,y_undersimply,test_size=0.3,random_state=0)
print("")
print("Number transactions train dataset: ",len(X_train_undersample))
print("Number transactions test dataset: ",len(X_test_undersample))
print("Total number of transactions: ",len(X_train_undersample)+len(X_test_undersample))

# in the end, the model built on the undersampled data will be validated on the test split of the original data
Number transactions train dataset:  199364
Number transactions test dataset:  85443
Total number of transactions:  284807

Number transactions train dataset:  688
Number transactions test dataset:  296
Total number of transactions:  984

Model evaluation methods

1. Accuracy: unreliable when the classes are imbalanced. For example, suppose we want to detect cancer patients among 1,000 samples, of which 990 are healthy and 10 have cancer. A model that labels all 1,000 as healthy reaches 99% accuracy, yet it is useless.

2. Recall: in the same example, recall = (cancer patients the model detects) / 10 (all cancer patients in the sample).

Recall = TP / (TP + FN)

Here TP is the number of samples the model flags that really are positive; FP is the number it flags that are actually negative (flagged samples minus TP);

FN is the number of positive samples the model fails to flag; TN is the number of unflagged samples that indeed should not have been flagged.
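
A quick numeric check of the formula, using made-up counts for the cancer example above (say the model flags 8 of the 10 patients and also wrongly flags 20 healthy people):

# Hypothetical counts for the 1,000-sample cancer example (invented for illustration).
TP = 8     # cancer patients the model flags
FN = 2     # cancer patients the model misses
FP = 20    # healthy people wrongly flagged
TN = 970   # healthy people correctly left alone

recall = TP / (TP + FN)                        # 8 / 10 = 0.8
accuracy = (TP + TN) / (TP + TN + FP + FN)     # 978 / 1000 = 0.978
print(recall, accuracy)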

If two parameter settings give the same recall, prefer the one whose scores are more stable (large fluctuations across folds suggest overfitting).

Overfitting: the model fits the known training samples very well, but its predictions on new data may be far off.

L2 regularization: loss (the loss function) + ½·w² (penalty term)

L1 regularization: loss + |w| (penalty term)
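
In sklearn's LogisticRegression the penalty weight is controlled by C, the inverse of the regularization strength: smaller C means a heavier penalty. A minimal sketch (the toy data below are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.randn(100, 5)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(100) > 0).astype(int)

for C in [0.01, 1, 100]:
    l1 = LogisticRegression(C=C, penalty='l1', solver='liblinear').fit(X_toy, y_toy)
    l2 = LogisticRegression(C=C, penalty='l2', solver='liblinear').fit(X_toy, y_toy)
    # L1 tends to push some coefficients exactly to zero; L2 only shrinks them
    print(C, np.round(l1.coef_, 2), np.round(l2.coef_, 2))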

# Recall
from sklearn.linear_model import LogisticRegression
# from sklearn.cross_validation import KFold, cross_val_score  # the cross_validation module now lives in model_selection
from sklearn.model_selection import KFold, cross_val_score
# KFold: splits the data into any number of folds for cross-validation; cross_val_score: evaluates a model with cross-validation
from sklearn.metrics import confusion_matrix,recall_score,classification_report
import itertools

def printing_KFlod_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
    # in newer sklearn, KFold is built as KFold(n_splits, shuffle, random_state)
    # candidate C values (inverse regularization strength)
    c_param_range = [0.01,0.1,1,10,100]  # penalty strengths to try
    
    results_table = pd.DataFrame(index=range(len(c_param_range)),columns=['c_parameter','Mean recall score'])
    results_table['c_parameter'] = c_param_range
    
    j = 0
    # try every C and keep the one whose mean recall is best
    for c_param in c_param_range:
        print("-----------------------------------")
        print('c_parameter: ',c_param)
        print("-----------------------------------")
        print('')
        
        recall_accs = []
        it = 0
        # cross-validation loop
        for train_idx, test_idx in fold.split(x_train_data):
            # instantiate a logistic regression model with L1 regularization
            lr = LogisticRegression(C = c_param, penalty = 'l1',solver='liblinear')
            # fit the model on the training folds
            lr.fit(x_train_data.iloc[train_idx, :], y_train_data.iloc[train_idx, :].values.ravel())
            # predict labels for the validation fold
            y_pred_undersample = lr.predict(x_train_data.iloc[test_idx, :].values)
            # compute recall for this round of cross-validation (5 rounds in total)
            recall_acc = recall_score(y_train_data.iloc[test_idx, :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            # print the recall for the current round
            print(f"Iteration: {it}, recall = {recall_acc}")
            it += 1
        # compute and print the mean recall for this penalty strength
        results_table.loc[j, "Mean recall score"] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    # pandas' argmax was deprecated; use idxmax instead
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['c_parameter']
    print("********************************************************************************************")
    print("Best model to choose from cross validation is with C parameter = ", best_c)
    print("********************************************************************************************")
    
    return best_c

best_c = printing_KFlod_scores(X_train_undersample,y_train_undersample)
-----------------------------------
c_parameter:  0.01
-----------------------------------

Iteration: 0, recall = 0.9726027397260274
Iteration: 1, recall = 0.9452054794520548
Iteration: 2, recall = 1.0
Iteration: 3, recall = 0.972972972972973
Iteration: 4, recall = 1.0

Mean recall score  0.9781562384302112

-----------------------------------
c_parameter:  0.1
-----------------------------------

Iteration: 0, recall = 0.8356164383561644
Iteration: 1, recall = 0.863013698630137
Iteration: 2, recall = 0.9152542372881356
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.8907539428319554

-----------------------------------
c_parameter:  1
-----------------------------------

Iteration: 0, recall = 0.8767123287671232
Iteration: 1, recall = 0.863013698630137
Iteration: 2, recall = 0.9491525423728814
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.9057527819310962

-----------------------------------
c_parameter:  10
-----------------------------------

Iteration: 0, recall = 0.8767123287671232
Iteration: 1, recall = 0.863013698630137
Iteration: 2, recall = 0.9491525423728814
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.9057527819310962

-----------------------------------
c_parameter:  100
-----------------------------------

Iteration: 0, recall = 0.8767123287671232
Iteration: 1, recall = 0.8904109589041096
Iteration: 2, recall = 0.9661016949152542
Iteration: 3, recall = 0.9459459459459459
Iteration: 4, recall = 0.8939393939393939

Mean recall score  0.9146220644943653
********************************************************************************************
Best model to choose from cross validation is with C parameter =  0.01
********************************************************************************************

Confusion matrix

Confusion matrix after undersampling

import itertools
def plot_confusion_matrix(cm, classes,
                         title="Confusion matrix",
                         cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm,interpolation='nearest',cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks=np.arange(len(classes))
    plt.xticks(tick_marks,classes,rotation=0)
    plt.yticks(tick_marks,classes)
    
    thresh = cm.max()/2
    for i,j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
        plt.text(j,i,cm[i,j],
                horizontalalignment="center",
                color="white" if cm[i,j] > thresh else "black")
    # layout and axis labels only need to be set once, outside the loop
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# two parameters interact with the penalty choice: dual and solver; for the L1 penalty, dual must be False and solver must be 'liblinear'
lr = LogisticRegression(C=best_c,penalty='l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample=lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.9455782312925171

The x-axis is the predicted label and the y-axis is the true label.

Cell (predicted=1, true=1) is TP; cell (predicted=0, true=1) is FN.

Accuracy = (138 + 139) / (138 + 139 + 11 + 8)
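
The same quantities can be read straight off the cnf_matrix computed above; a small sketch:

# Accuracy and recall read directly from the confusion matrix (rows = true label, columns = predicted label).
TN, FP = cnf_matrix[0, 0], cnf_matrix[0, 1]
FN, TP = cnf_matrix[1, 0], cnf_matrix[1, 1]

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)
print("accuracy =", accuracy, " recall =", recall)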

Confusion matrix on the original dataset

lr = LogisticRegression(C=best_c,penalty='l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred=lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.9319727891156463

The count in cell (predicted=1, true=0), i.e., normal transactions misclassified as fraud, is fairly large, which drags the precision down. This "error" problem is the price of training only on the undersampled data.

Results on the original data without undersampling

best_c = printing_KFlod_scores(X_train,y_train)
-----------------------------------
c_parameter:  0.01
-----------------------------------

Iteration: 0, recall = 0.4925373134328358
Iteration: 1, recall = 0.6027397260273972
Iteration: 2, recall = 0.6833333333333333
Iteration: 3, recall = 0.5692307692307692
Iteration: 4, recall = 0.45

Mean recall score  0.5595682284048672

-----------------------------------
c_parameter:  0.1
-----------------------------------

Iteration: 0, recall = 0.5671641791044776
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.6833333333333333
Iteration: 3, recall = 0.5846153846153846
Iteration: 4, recall = 0.525

Mean recall score  0.5953102506435158

-----------------------------------
c_parameter:  1
-----------------------------------

Iteration: 0, recall = 0.5522388059701493
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.7166666666666667
Iteration: 3, recall = 0.6153846153846154
Iteration: 4, recall = 0.5625

Mean recall score  0.612645688837163

-----------------------------------
c_parameter:  10
-----------------------------------

Iteration: 0, recall = 0.5522388059701493
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.7333333333333333
Iteration: 3, recall = 0.6153846153846154
Iteration: 4, recall = 0.575

Mean recall score  0.6184790221704963

-----------------------------------
c_parameter:  100
-----------------------------------

Iteration: 0, recall = 0.5522388059701493
Iteration: 1, recall = 0.6164383561643836
Iteration: 2, recall = 0.7333333333333333
Iteration: 3, recall = 0.6153846153846154
Iteration: 4, recall = 0.575

Mean recall score  0.6184790221704963

********************************************************************************************
Best model to choose from cross validation is with C parameter =  10.0
********************************************************************************************
lr = LogisticRegression(C=best_c,penalty='l1',solver='liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred=lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.6190476190476191

Effect of the logistic regression threshold on the results

The classification threshold can be set by hand. The sigmoid output is a probability; for example, with the threshold set to 0.1, any sample whose predicted probability exceeds 0.1 is classified as positive.

lr = LogisticRegression(C = 0.01, penalty = 'l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)  # earlier we predicted class labels; here we predict probabilities, so different thresholds can be compared
# candidate thresholds
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(10,10))
 
j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    plt.subplot(3,3,j)
    j += 1
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
 
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
 
    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s'%i)
Recall metric in the testing dataset:  1.0   (recall is very high at this threshold, but precision is very low)
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.9727891156462585
Recall metric in the testing dataset:  0.9387755102040817
Recall metric in the testing dataset:  0.891156462585034
Recall metric in the testing dataset:  0.8367346938775511
Recall metric in the testing dataset:  0.7619047619047619
Recall metric in the testing dataset:  0.6054421768707483
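
To see the trade-off noted above, precision can be computed alongside recall for each threshold. A small sketch reusing the probabilities predicted in the loop above:

# Precision vs. recall for each threshold, reusing y_pred_undersample_proba from above.
from sklearn.metrics import precision_score, recall_score

for t in thresholds:
    preds = (y_pred_undersample_proba[:, 1] > t).astype(int)
    p = precision_score(y_test_undersample, preds)
    r = recall_score(y_test_undersample, preds)
    print(f"threshold {t}: precision = {p:.3f}, recall = {r:.3f}")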

Oversampling

Make the two classes equally large by generating new samples for the minority class.

SMOTE algorithm

Note: the test set is left untouched; synthetic samples are generated only for the training set. (A rough sketch of the SMOTE idea follows.)
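
SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. A minimal hand-rolled sketch of that idea (the toy points are invented purely for illustration; this is not the imblearn implementation):

import numpy as np

rng = np.random.RandomState(0)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])   # toy minority-class points

x = minority[0]
# nearest minority neighbour of x (excluding x itself)
dists = np.linalg.norm(minority - x, axis=1)
neighbour = minority[np.argsort(dists)[1]]

synthetic = x + rng.rand() * (neighbour - x)   # lies on the segment between the two points
print(synthetic)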

import pandas as pd
# oversampling
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package (pip install imbalanced-learn)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split 

credit_cards=pd.read_csv('E:/cluster/data/creditcard.csv')

columns=credit_cards.columns
features_columns=columns.delete(len(columns)-1)

features=credit_cards[features_columns]
labels=credit_cards['Class']

features_train,features_test,labels_train,labels_test=train_test_split(features,labels,test_size=0.2,random_state=0)

oversampler=SMOTE(random_state=0)
# fit on the training data only and resample it
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)
len(os_labels[os_labels==1])
227454

Cross-validation

os_features=pd.DataFrame(os_features)
os_labels=pd.DataFrame(os_labels)
best_c=printing_KFlod_scores(os_features,os_labels)
-----------------------------------
c_parameter:  0.01
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9688834790306518
Iteration: 3, recall = 0.9576944636792297
Iteration: 4, recall = 0.958408898561238

Mean recall score  0.9340092528043087

-----------------------------------
c_parameter:  0.1
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9702113533252186
Iteration: 3, recall = 0.9600356118310417
Iteration: 4, recall = 0.9605741858190172

Mean recall score  0.9351761147451404

-----------------------------------
c_parameter:  1
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9703884032311608
Iteration: 3, recall = 0.9598487596311318
Iteration: 4, recall = 0.9581451072201889

Mean recall score  0.9346883385665812

-----------------------------------
c_parameter:  10
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9703884032311608
Iteration: 3, recall = 0.9602444466427056
Iteration: 4, recall = 0.960739055407173

Mean recall score  0.9352862656062928

-----------------------------------
c_parameter:  100
-----------------------------------

Iteration: 0, recall = 0.8903225806451613
Iteration: 1, recall = 0.8947368421052632
Iteration: 2, recall = 0.9703220095164324
Iteration: 3, recall = 0.9603323770897221
Iteration: 4, recall = 0.9608379771600664

Mean recall score  0.9353103573033291

********************************************************************************************
Best model to choose from cross validation is with C parameter =  100.0
********************************************************************************************

Confusion matrix

Compared with the undersampling model, recall drops a little, but precision improves, so the overall result is better.

lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)
 
# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names=[0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                     classes=class_names,
                     title="Confusion matrix")
plt.show()
Recall metric in the testing dataset:  0.900990099009901
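
The precision claim above can be checked directly from this confusion matrix; a small sketch:

# Precision of the SMOTE-trained model, read from the confusion matrix above.
precision = cnf_matrix[1, 1] / (cnf_matrix[0, 1] + cnf_matrix[1, 1])
print("Precision on the test set:", precision)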

Summary

Feature extraction (the data in this example had already been cleaned and anonymized).

First inspect the data: is the class distribution balanced?

For imbalanced data, with comparable computing resources, oversampling usually gives the better overall result.

Standardize the data (so all features vary over similar ranges).

[Undersampling (randomly drop majority-class samples)]

Use cross-validation to find the final parameter value.

Check the recall and the confusion matrix.

The logistic regression threshold can be set manually; compare experiments and pick a suitable one.

[Oversampling with SMOTE]
