Python实现非平衡数据集的Logistic Regression二分类

数据集介绍

该数据集的每条数据有两个属性,加一个标签列,标签分为0,1,是二分类数据集。数据集共有100条数据,其中40条标签为0,60条标签为1。我们设定前10行为测试集,后90行为训练集

在这里插入图片描述
在这里插入图片描述

Python实现

数据集加载和准备

导入包

#!coding:utf-8

import math
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
start = time.time()

定义函数

def load_data():
    df=pd.read_table('reg_data.txt',
                     names=['feature1','feature2','label'],
                     sep=',',
                     encoding='utf-8')
    df1 = df[df['label'] == 0]
    df2 = df[df['label'] == 1]
    print("This is an imbalanced data set which has\n" +str(len(df1))+" negative samples and "+str(len(df2))+" positive samples\n")
    return df, df1, df2
    
# Divide the dataset into a training set and a testing set
def prepare_dataset(dt):
    train = dt[10:100]
    test = dt[0:10]
    
    tr = np.array(train)
    te = np.array(test)
    return tr, te

load_dataset()函数用来加载.txt格式的数据集,加载为Pandas的DataFrame格式。
prepare_dataset()函数将载入的DataFrame形式数据分割为训练集、测试集,再转换为numpy的数组array形式作为函数输出。
看看数据集长啥样:
在这里插入图片描述

Sigmoid函数

Sigmoid函数的公式:
y = 1 ( 1 + e − z ) y=\frac{1}{(1 + e^{-z})} y=(1+ez)1
代码:

# Sigmoid function
def sigmoid(a):
    b = 1/(1+np.exp(-a))
    return b

arr = np.linspace(-10,10,50)
vis = sigmoid(arr)
plt.title('Sigmoid Function')
plt.plot(vis)
plt.show()

输出:
在这里插入图片描述

Gradient Descent优化权重和偏置值

对Loss函数求偏导:
∂ L ( ω ) ∂ ω j = ∑ ( i = 1 ) m ( 1 ( 1 + e x p ⁡ ( ω T x i ) ) e x p ⁡ ( ω T x i ) x i j − y i x i j ) \frac{∂L(ω)}{∂ω_j}=\sum_{(i=1)}^{m}(\frac{1}{(1+exp⁡(ω^T x_i))} exp⁡(ω^T x_i ) x_{ij}-y_i x_{ij} ) ωjL(ω)=(i=1)m((1+exp(ωTxi))1exp(ωTxi)xijyixij)
w j = w j − η ∗ Δ w w_j=w_j-η*Δw wj=wjηΔw
为减少算法复杂度,加快程序运行,写为矩阵形式:
Δ w = X T ( X w − y ) Δw=X^T(Xw-y) Δw=XT(Xwy)

代码:

def gradient_des_revisedbyrain(tr):
    
    # Learning rate
    lr = 0.1
    
    # Maximum number of iterations
    iteration_max = 700000

    m, n = tr.shape
    tr_label = tr[:,2].reshape(m,1)
    
    # Fill the weight vector with ones
    result = np.ones((1, n))
    one = np.ones((m, 1))
    
    beta = np.c_[np.mat(tr[:, 0:2]),one]
    
    w_all = []
    # Iterations
    # Matrix form -- Δ = X^T * (X * w - y)
    
    for i in range(iteration_max):
        h = sigmoid(np.dot(beta, result.T))
    #     print(h.shape)
        err = h - tr_label
    #     print(err.shape)
        result = result - lr * np.dot(err.T, beta)/len(tr_label)
    #     print(result.shape)
        w_all.append(result)
    
    return result, w_all

beta矩阵要在右侧增加一列实质上是将其增广: x = > ( x ; 1 ) x=>(x;1) x=>(x;1)以便简化计算得到偏置值 b b b,增广的这列值全为1即为将偏置值初始化为1。
将上面代码for循环内部翻译为数学公式为:
h = s i g m o i d ( β ∗ r e s u l t T ) h=sigmoid(β*result^T) h=sigmoid(βresultT) e r r = h − t r _ l a b e l err=h-tr\_label err=htr_label r e s u l t = r e s u l t − l r × e r r T ⋅ β l e n ( t r _ l a b l e ) result = result - \frac{lr×err^T·β}{len(tr\_lable)} result=resultlen(tr_lable)lr×errTβ
每一次迭代得到的result都append到一个多维列表w_all里,到时候画图(画决策边界)的时候有用。

计算准确率

函数输入测试集、训练集标签、权重向量weight vector和阈值。输入值包含了阈值用于后续绘制P-R、ROC曲线。

def accuracy(te, te_label, w, threshold):
    
    one = np.ones((len(te), 1))
    mat = np.c_[te, one]
    y = sigmoid(np.dot(mat, w.T))
    m, n = np.shape(y)

    count = 0
    flag = []
    for i in range(m):
        if y[i, 0] > threshold:
            flag.append(1)
        else:
            flag.append(0)
            
        if te_label[i] == flag[i]:
            count += 1
            
        acc = count/len(flag)
    return acc, flag

输入测试集 t e te te,这个测试集实际上是只包含测试集的两个属性列,不包含标签列的。同梯度下降的函数,不要忘记给测试集增广一个全1列 m a t = ( t e , 1 ) mat=(te,1) mat=(te,1)预测值 y = s i g m o i d ( m a t ⋅ w T ) y=sigmoid(mat·w^T) y=sigmoid(matwT)
如果 y > t h r e s h o l d y>threshold y>threshold,则归为正类,标签为1;如果 y < t h r e s h o l d y<threshold y<threshold,则归为负类,标签为0。

输出结果

def print_results(w, b, lb, acc):
    print('Weight vector is\n' ,w[0,0:2])
    print('Bias is\n' ,b)
    print('Accuracy on testing set: %f' %acc)

输出值包括:权重向量Weight vector,偏置值Bias和在测试集上的准确率Accuracy。

主函数

if __name__ == '__main__':

    data, data_label0, data_label1 = load_data()
    
    # Normalize
    for i in ['feature1', 'feature2']:
        data[i] = (data[i] - data[i].min()) / (data[i].max() - data[i].min()) 
    
    training_set, testing_set = prepare_dataset(data)
    
    w_b,w_all = gradient_des_revisedbyrain(training_set)
    # w_b is the result of weight vector and bias value
    # w_all is all the weight and bias of iterations from 0 to max

    weight = w_b[0:2]
    bias = w_b[0,2]
    
    label_all = []
    acc_train =[]
    label_alltrain = []
    label_alltrain_measure = []
    
    # Accuracy on training set
    for i in range(0,2000):
        
        # Fix the threshold to be 0.5 and change weight vectors
        acc_rate_train, label_train = accuracy(training_set[:, 0:2], training_set[:,2], w_all[4*i], 0.5)
        acc_train.append(acc_rate_train)
        label_alltrain.append(label_train)
    
    
    # Different thresholds
    ran = np.arange(0, 1, 0.01)
    for i in ran:
        # Set threshold
        threshold = i
        
        # Accuracy and labels on testing set
        acc_rate, label = accuracy(testing_set[:, 0:2], testing_set[:,2], w_b, threshold)
        label_all.append(label)
        
        # Accuracy and labels on training set
        acc_rate_train_measure, label_train_measure = accuracy(training_set[:, 0:2], training_set[:,2], w_b, threshold)
        label_alltrain_measure.append(label_train_measure)
        
        # Accuracy and labels on testing set when threshold is 0.5
        final_acc_rate, final_label = accuracy(testing_set[:, 0:2], testing_set[:,2], w_b, 0.5)
        
    print_results(weight, bias, final_label, final_acc_rate)
    plt.plot(acc_train,c = 'r')
    plt.xlabel('Iterations')
    plt.ylabel('Accuracy')
    plt.title('Accuracy changes with Iterations')
    plt.show()

主函数第一个for循环:归一化
主函数第二个for循环:输出测试集上的准确率
让前2,000次迭代(在gradient descent函数那里我设置了总共700,000次迭代,这里的次数应当比700,000小)得到的权值和偏置去对整个训练集分类,输出所得到的准确率。

label_alltrain即为在训练集上的这2,000次迭代,所有的分类标签结果

主函数第三个for循环:
让阈值threshold在(0,1)间以0.01的步长变化,保持输入的权值向量为经过优化后最终的权值向量,更改不同的阈值得到准确率,为画P-R曲线、ROC曲线进行性能度量做准备。

label_alltrain_measure即为阈值从01变化得到的所有训练集分类标签

300次迭代:
300次迭代
1000次迭代:
在这里插入图片描述

训练集分类准确率随迭代次数的增加逐步增加,在0.1的学习率下,大约需要300次迭代可以达到90%的训练集分类准确率,且准确率基本不再随着迭代次数增加而变化。

画图

df1_norm = data[data['label'] == 0]
df2_norm = data[data['label'] == 1]
x = np.arange(0,1,0.1)

for i in range(0,1000,10):
    accu = acc_train[i]
    plt.scatter(df1_norm.feature1[:],df1_norm.feature2[:],
                c='c',marker='o',label='label 0')
    plt.scatter(df2_norm.feature1[:],df2_norm.feature2[:],
                c='k',marker='o',label='label 1')
    y = (-float(w_all[i][0,0]) * x - float(w_all[i][0,2]))/float(w_all[i][0,1])

    plt.plot(x, y, c='r')
    plt.legend()
    plt.title('Accuracy = '+str(accu))
    plt.savefig(r'./pic2/'+str(i)+'.png', dpi=300)

    plt.show()

因为此图绘制数据集上的分类的边界,因此采用前面得到的acc_train作为标题显示的分类准确率,采用w_all作为要画的边界的权重值。
汇总得到的图像如下:
在这里插入图片描述

性能度量

先按照以下表格计算各值

真实情况预测结果
正例(label=1)反例(label=0)
正例(label=1)True Positive(TP)False Negetive(FN)
反例(label=0)False Positive(FP)True Negative(TN)
代码:
def performance_measure(lb, truelabel):
    
    tp = np.zeros((len(lb),1),dtype = int)
    fp = np.zeros((len(lb),1),dtype = int)
    tn = np.zeros((len(lb),1),dtype = int)
    fn = np.zeros((len(lb),1),dtype = int)
    
    for i in range(len(lb)):
        for j in range(10):

            if truelabel[j] == 1 and lb[i][j] == 1:
                tp[i] += 1        # True positive
            elif truelabel[j] == 1 and lb[i][j] == 0:
                fn[i] += 1        # False negative
            elif truelabel[j] == 0 and lb[i][j] == 1:
                fp[i] += 1        # False positive
            elif truelabel[j] == 0 and lb[i][j] == 0:
                tn[i] += 1        # True negative
                
    return tp, fn, fp, tn

P-R曲线

公式:
P = T P T P + F P P=\frac{TP}{TP+FP} P=TP+FPTP R = T P T P + F N R=\frac{TP}{TP+FN} R=TP+FNTP
代码:

# 1. P-R curve(Precision and Recall)

# true_pos, false_neg, false_pos, true_neg = performance_measure(label_all, final_label)
true_pos, false_neg, false_pos, true_neg = performance_measure(label_alltrain_measure, training_set[:,2])

# Let NaN in the array be 0
np.seterr(divide='ignore',invalid='ignore')

precision = true_pos/(true_pos + false_pos)
recall = true_pos/(true_pos + false_neg)

precision[np.isnan(precision)]=0

precs_draw = np.zeros(len(precision))

for i in range(1,len(precision)):
    if precision[i] != 0:
        precs_draw[i] = precision[i]
    else:
        precs_draw[i] = 1

# Add point to draw plot
recall[len(recall)-2] = 0.5
recall[len(recall)-1] = 0


# Draw scatter
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
#     recall, precision, "-o",
    recall, precs_draw, "-o",
    markersize=8,
    linewidth=4,
    markerfacecolor="b",
    markeredgecolor="b",
    markeredgewidth=1
)
ax.set(xlim=[0, 1.05], ylim=[0, 1.05], title='P-R(Precision-Recall) Curve',
       ylabel='Precision', xlabel='Recall')
plt.show()

在这里插入图片描述

ROC曲线和AUC面积

公式:
T P R = T P T P + F N TPR=\frac{TP}{TP+FN} TPR=TP+FNTP F P R = F P F P + T N FPR=\frac{FP}{FP+TN} FPR=FP+TNFP
代码:

# 2. AUC curve(Area Under Curve)
# The area between the ROC curve and the coordinate axis

TPR = true_pos/(true_pos + false_neg)
FPR = false_pos/(true_neg + false_pos)

# Add point to draw plot
TPR[len(TPR)-2] = 0.5
TPR[len(TPR)-1] = 0

# Draw scatter
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    FPR, TPR, "-o",
    markersize=8,
    linewidth=4,
    markerfacecolor="b",
    markeredgecolor="b",
    markeredgewidth=1
)
ax.set(xlim=[-0.05, 1.05], ylim=[0, 1.05], title='ROC Curve & AUC(Area Under Curve)',
       ylabel='TPR(True Positive Rate)', xlabel='FPR(False Positive Rate)')

plt.show()

# Calculate AUC
temp = 0
for i in range(len(FPR)-1):
    temp = temp + (FPR[i] - FPR[i+1]) * (TPR[i] + TPR[i+1])
AUC = float(0.5*temp)
print('AUC is %.2f' %AUC)

在这里插入图片描述
输出AUC值:

AUC is 0.92

计时:

end = time.time()
t = end - start
print('Run time: %.2f s' %t)

输出:Run time: 38.25 s

代码已传GitHub,数据集和网址:https://github.com/RU0804/Binary_Logisticregression_imbalanceddata

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 3
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值