Dataset Introduction
Each record in the dataset has two features plus a label column; the label is 0 or 1, making this a binary classification dataset. There are 100 records in total: 40 with label 0 and 60 with label 1. We take the first 10 rows as the testing set and the remaining 90 rows as the training set.
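For reference, each row of reg_data.txt is comma-separated in the order feature1, feature2, label (this follows from the read_table call below; the sample values here are invented):

# Hypothetical rows of reg_data.txt (values made up for illustration):
# 2.13,1.47,1
# 0.85,3.02,0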
Python Implementation
Loading and Preparing the Dataset
Imports
#!coding:utf-8
import math
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
start = time.time()
Defining the Functions
def load_data():
    df = pd.read_table('reg_data.txt',
                       names=['feature1', 'feature2', 'label'],
                       sep=',',
                       encoding='utf-8')
    df1 = df[df['label'] == 0]
    df2 = df[df['label'] == 1]
    print("This is an imbalanced data set which has\n" +
          str(len(df1)) + " negative samples and " + str(len(df2)) + " positive samples\n")
    return df, df1, df2
# Divide the dataset into a training set and a testing set
def prepare_dataset(dt):
    train = dt[10:100]
    test = dt[0:10]
    tr = np.array(train)
    te = np.array(test)
    return tr, te
The load_data() function loads the .txt dataset into a pandas DataFrame.
The prepare_dataset() function splits the loaded DataFrame into a training set and a testing set, then converts both to NumPy arrays before returning them.
Let's take a look at the data:
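A minimal way to preview the loaded data in code (my addition, in place of the original screenshot):

data, data_neg, data_pos = load_data()
print(data.head())                    # first five rows: feature1, feature2, label
print(data['label'].value_counts())  # 60 positive vs. 40 negative samples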
The Sigmoid Function
The formula for the sigmoid function:
y=\frac{1}{1+e^{-z}}
Code:
# Sigmoid function
def sigmoid(a):
    b = 1/(1 + np.exp(-a))
    return b

arr = np.linspace(-10, 10, 50)
vis = sigmoid(arr)
plt.title('Sigmoid Function')
plt.plot(arr, vis)  # plot against the inputs so the x-axis runs from -10 to 10
plt.show()
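One caveat (my addition, not part of the original code): np.exp(-a) overflows for large negative inputs and NumPy emits a RuntimeWarning, even though the final value is still correct. A minimal sketch of a safer variant, assuming float64 inputs:

# Numerically safer sigmoid: clip the argument so np.exp stays in range
def sigmoid_stable(a):
    a = np.clip(a, -500, 500)  # exp(500) is still representable in float64
    return 1/(1 + np.exp(-a))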
Output:
Optimizing Weights and Bias with Gradient Descent
Take the partial derivative of the loss function:
\frac{\partial L(\omega)}{\partial \omega_j}=\sum_{i=1}^{m}\left(\frac{\exp(\omega^T x_i)}{1+\exp(\omega^T x_i)}\,x_{ij}-y_i x_{ij}\right)
w_j=w_j-\eta\,\Delta w
To reduce computational cost and speed up the program, the update is written in matrix form, with the sigmoid applied elementwise (this is what the code below actually computes):

\Delta w=X^T(\sigma(Xw)-y)
Code:
def gradient_des_revisedbyrain(tr):
    # Learning rate
    lr = 0.1
    # Maximum number of iterations
    iteration_max = 700000
    m, n = tr.shape
    tr_label = tr[:, 2].reshape(m, 1)
    # Initialize the weight vector (weights plus bias) with ones
    result = np.ones((1, n))
    one = np.ones((m, 1))
    beta = np.c_[np.mat(tr[:, 0:2]), one]
    w_all = []
    # Iterations
    # Matrix form -- Δw = X^T * (sigmoid(X * w) - y)
    for i in range(iteration_max):
        h = sigmoid(np.dot(beta, result.T))
        err = h - tr_label
        result = result - lr * np.dot(err.T, beta)/len(tr_label)
        w_all.append(result)
    return result, w_all
Appending a column on the right of the beta matrix augments each sample, x => (x; 1), which simplifies obtaining the bias value b: the appended column is all ones, and since the weight vector is initialized with np.ones, the bias value also starts at 1.
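A tiny illustration of the augmentation, with made-up values (np.c_ is the same call used in the code above):

# Hypothetical two-sample feature matrix, augmented with a ones column
X = np.array([[0.2, 0.7],
              [0.5, 0.1]])
X_aug = np.c_[X, np.ones((X.shape[0], 1))]
# X_aug:
# [[0.2 0.7 1. ]
#  [0.5 0.1 1. ]]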
Translating the body of the for loop above into mathematical form:
h=sigmoid(\beta \cdot result^T)
err=h-tr\_label
result=result-\frac{lr \times err^T \cdot \beta}{len(tr\_label)}
The result of every iteration is appended to the list w_all, which will come in handy later when plotting the decision boundary.
Computing Accuracy
The function takes a feature matrix, the corresponding labels, the weight vector, and a threshold. The threshold is part of the input so that the P-R and ROC curves can be drawn later.
def accuracy(te, te_label, w, threshold):
    one = np.ones((len(te), 1))
    mat = np.c_[te, one]
    y = sigmoid(np.dot(mat, w.T))
    m, n = np.shape(y)
    count = 0
    flag = []
    for i in range(m):
        if y[i, 0] > threshold:
            flag.append(1)
        else:
            flag.append(0)
        if te_label[i] == flag[i]:
            count += 1
    acc = count/len(flag)
    return acc, flag
The input te contains only the two feature columns, not the label column. As in the gradient descent function, don't forget to augment it with a column of ones, mat = (te, 1). The predicted values are then y = sigmoid(mat · w^T).
If y > threshold, the sample is classified as positive with label 1; if y < threshold, it is classified as negative with label 0.
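The loop above can also be vectorized; this is a sketch of mine with the same logic, not the original implementation:

# Vectorized equivalent of accuracy(): threshold all predictions at once
def accuracy_vec(te, te_label, w, threshold):
    mat = np.c_[te, np.ones((len(te), 1))]
    y = np.asarray(sigmoid(np.dot(mat, w.T))).ravel()
    flag = (y > threshold).astype(int)  # 1 = positive, 0 = negative
    acc = float(np.mean(flag == np.asarray(te_label).ravel()))
    return acc, flag.tolist()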
Printing the Results
def print_results(w, b, lb, acc):
    print('Weight vector is\n', w[0, 0:2])
    print('Bias is\n', b)
    print('Accuracy on testing set: %f' % acc)
The output includes the weight vector, the bias, and the accuracy on the testing set.
Main Function
if __name__ == '__main__':
    data, data_label0, data_label1 = load_data()
    # Normalize
    for i in ['feature1', 'feature2']:
        data[i] = (data[i] - data[i].min()) / (data[i].max() - data[i].min())
    training_set, testing_set = prepare_dataset(data)
    w_b, w_all = gradient_des_revisedbyrain(training_set)
    # w_b is the final weight vector and bias value
    # w_all holds the weights and bias of every iteration from 0 to the maximum
    weight = w_b[0:2]
    bias = w_b[0, 2]
    label_all = []
    acc_train = []
    label_alltrain = []
    label_alltrain_measure = []
    # Accuracy on the training set
    for i in range(0, 2000):
        # Fix the threshold at 0.5 and vary the weight vector
        acc_rate_train, label_train = accuracy(training_set[:, 0:2], training_set[:, 2], w_all[4*i], 0.5)
        acc_train.append(acc_rate_train)
        label_alltrain.append(label_train)
    # Different thresholds
    ran = np.arange(0, 1, 0.01)
    for i in ran:
        # Set threshold
        threshold = i
        # Accuracy and labels on the testing set
        acc_rate, label = accuracy(testing_set[:, 0:2], testing_set[:, 2], w_b, threshold)
        label_all.append(label)
        # Accuracy and labels on the training set
        acc_rate_train_measure, label_train_measure = accuracy(training_set[:, 0:2], training_set[:, 2], w_b, threshold)
        label_alltrain_measure.append(label_train_measure)
    # Accuracy and labels on the testing set when the threshold is 0.5
    final_acc_rate, final_label = accuracy(testing_set[:, 0:2], testing_set[:, 2], w_b, 0.5)
    print_results(weight, bias, final_label, final_acc_rate)
    plt.plot(acc_train, c='r')
    plt.xlabel('Iterations')
    plt.ylabel('Accuracy')
    plt.title('Accuracy changes with Iterations')
    plt.show()
First for loop in the main function: min-max normalization of the two features.
Second for loop: accuracy on the training set over the iterations. The weights and biases from the early iterations (w_all[4*i] for i in 0..1999, i.e. every fourth snapshot out of the first 8,000 of the 700,000 total iterations configured in the gradient descent function; this count must stay well below 700,000) classify the entire training set, and the resulting accuracies are recorded.
label_alltrain holds the training-set classification labels produced by all 2,000 of these snapshots.
Third for loop: sweep the threshold over (0, 1) in steps of 0.01 while keeping the input weight vector fixed at the final optimized one, recording the accuracy at each threshold. This prepares the data for the P-R and ROC curves used in the performance measurement.
label_alltrain_measure holds all the training-set classification labels obtained as the threshold varies from 0 to 1.
After 300 iterations:
After 1,000 iterations:
Classification accuracy on the training set rises as the iterations accumulate; with a learning rate of 0.1, about 300 iterations are enough to reach roughly 90% training accuracy, after which the accuracy barely changes with further iterations.
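Given that plateau, an early-stopping check could replace most of the 700,000 iterations. A minimal sketch of my own for the loop inside gradient_des_revisedbyrain (the tolerance value is an assumption):

# Stop iterating once the update step becomes negligible
tol = 1e-6
for i in range(iteration_max):
    h = sigmoid(np.dot(beta, result.T))
    step = lr * np.dot((h - tr_label).T, beta)/len(tr_label)
    result = result - step
    w_all.append(result)
    if np.max(np.abs(step)) < tol:
        break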
Plotting
df1_norm = data[data['label'] == 0]
df2_norm = data[data['label'] == 1]
x = np.arange(0, 1, 0.1)
for i in range(0, 1000, 10):
    accu = acc_train[i]
    plt.scatter(df1_norm.feature1[:], df1_norm.feature2[:],
                c='c', marker='o', label='label 0')
    plt.scatter(df2_norm.feature1[:], df2_norm.feature2[:],
                c='k', marker='o', label='label 1')
    y = (-float(w_all[i][0, 0]) * x - float(w_all[i][0, 2]))/float(w_all[i][0, 1])
    plt.plot(x, y, c='r')
    plt.legend()
    plt.title('Accuracy = ' + str(accu))
    plt.savefig(r'./pic2/' + str(i) + '.png', dpi=300)
    plt.show()
Since this figure draws the classification boundary on the dataset, it uses the acc_train values obtained earlier as the accuracy shown in each title, and the snapshots in w_all as the weights of the boundary being drawn.
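For reference, the boundary line in the code comes from setting the sigmoid argument to zero, i.e. the decision point where sigmoid(z) = 0.5:

w_1 x_1+w_2 x_2+b=0 \;\Rightarrow\; x_2=\frac{-w_1 x_1-b}{w_2}

which is exactly the expression y = (-w_all[i][0,0]*x - w_all[i][0,2])/w_all[i][0,1] in the loop above.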
The collected images look like this:
Performance Measurement
First compute the counts according to the following table:
| Ground truth | Predicted positive (label=1) | Predicted negative (label=0) |
| --- | --- | --- |
| Positive (label=1) | True Positive (TP) | False Negative (FN) |
| Negative (label=0) | False Positive (FP) | True Negative (TN) |
def performance_measure(lb, truelabel):
    tp = np.zeros((len(lb), 1), dtype=int)
    fp = np.zeros((len(lb), 1), dtype=int)
    tn = np.zeros((len(lb), 1), dtype=int)
    fn = np.zeros((len(lb), 1), dtype=int)
    for i in range(len(lb)):
        # iterate over every sample rather than a hard-coded 10, so the
        # 90-sample training set is counted fully as well
        for j in range(len(truelabel)):
            if truelabel[j] == 1 and lb[i][j] == 1:
                tp[i] += 1  # True positive
            elif truelabel[j] == 1 and lb[i][j] == 0:
                fn[i] += 1  # False negative
            elif truelabel[j] == 0 and lb[i][j] == 1:
                fp[i] += 1  # False positive
            elif truelabel[j] == 0 and lb[i][j] == 0:
                tn[i] += 1  # True negative
    return tp, fn, fp, tn
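The same four counts can be obtained with boolean masks; this is my own vectorized sketch of the identical logic for a single prediction vector:

# Confusion-matrix counts for one vector of predicted labels
def confusion_counts(pred, truelabel):
    pred = np.asarray(pred)
    true = np.asarray(truelabel)
    tp = int(np.sum((true == 1) & (pred == 1)))  # true positives
    fn = int(np.sum((true == 1) & (pred == 0)))  # false negatives
    fp = int(np.sum((true == 0) & (pred == 1)))  # false positives
    tn = int(np.sum((true == 0) & (pred == 0)))  # true negatives
    return tp, fn, fp, tn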
P-R Curve
Formulas:
P=\frac{TP}{TP+FP}
R=\frac{TP}{TP+FN}
Code:
# 1. P-R curve (Precision and Recall)
# true_pos, false_neg, false_pos, true_neg = performance_measure(label_all, final_label)
true_pos, false_neg, false_pos, true_neg = performance_measure(label_alltrain_measure, training_set[:, 2])
# Suppress divide-by-zero warnings; the resulting NaNs are set to 0 below
np.seterr(divide='ignore', invalid='ignore')
precision = true_pos/(true_pos + false_pos)
recall = true_pos/(true_pos + false_neg)
precision[np.isnan(precision)] = 0
# Where precision is 0 (no positive predictions), plot it as 1
precs_draw = np.zeros(len(precision))
for i in range(1, len(precision)):
    if precision[i] != 0:
        precs_draw[i] = precision[i]
    else:
        precs_draw[i] = 1
# Add points to complete the plot
recall[len(recall)-2] = 0.5
recall[len(recall)-1] = 0
# Draw the curve
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    # recall, precision, "-o",
    recall, precs_draw, "-o",
    markersize=8,
    linewidth=4,
    markerfacecolor="b",
    markeredgecolor="b",
    markeredgewidth=1
)
ax.set(xlim=[0, 1.05], ylim=[0, 1.05], title='P-R(Precision-Recall) Curve',
       ylabel='Precision', xlabel='Recall')
plt.show()
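As a sanity check (my addition; it assumes scikit-learn is installed, which the original code does not use), sklearn.metrics.precision_recall_curve can produce the same kind of curve directly from the sigmoid scores:

from sklearn.metrics import precision_recall_curve
# Sigmoid scores on the training set with the final weights
mat_tr = np.c_[training_set[:, 0:2], np.ones((len(training_set), 1))]
scores = np.asarray(sigmoid(np.dot(mat_tr, w_b.T))).ravel()
prec_sk, rec_sk, thr_sk = precision_recall_curve(training_set[:, 2], scores)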
ROC Curve and AUC
Formulas:
TPR=\frac{TP}{TP+FN}
FPR=\frac{FP}{FP+TN}
Code:
# 2. AUC curve(Area Under Curve)
# The area between the ROC curve and the coordinate axis
TPR = true_pos/(true_pos + false_neg)
FPR = false_pos/(true_neg + false_pos)
# Add point to draw plot
TPR[len(TPR)-2] = 0.5
TPR[len(TPR)-1] = 0
# Draw scatter
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    FPR, TPR, "-o",
    markersize=8,
    linewidth=4,
    markerfacecolor="b",
    markeredgecolor="b",
    markeredgewidth=1
)
ax.set(xlim=[-0.05, 1.05], ylim=[0, 1.05], title='ROC Curve & AUC(Area Under Curve)',
       ylabel='TPR(True Positive Rate)', xlabel='FPR(False Positive Rate)')
plt.show()
# Calculate AUC with the trapezoidal rule
temp = 0
for i in range(len(FPR)-1):
    temp = temp + (FPR[i] - FPR[i+1]) * (TPR[i] + TPR[i+1])
AUC = float(0.5*temp)
print('AUC is %.2f' % AUC)
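Equivalently (my addition), the loop can be replaced with np.trapz; the sign flips because FPR decreases as the threshold grows:

# One-line AUC via NumPy's trapezoidal integration (same result as the loop)
auc_trapz = float(-np.trapz(TPR.ravel(), FPR.ravel()))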
Output of the AUC value:
AUC is 0.92
Timing:
end = time.time()
t = end - start
print('Run time: %.2f s' %t)
Output: Run time: 38.25 s
The code has been uploaded to GitHub; dataset and repository: https://github.com/RU0804/Binary_Logisticregression_imbalanceddata