Machine Learning: Implementing the Logistic Regression Algorithm

Latest recommended article published on 2024-05-28 16:41:57

泰伦斯-Ternence


Tags: machine learning, regression, artificial intelligence

Article link: https://blog.csdn.net/qq_56883085/article/details/138421536


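What is logistic regression

Regression normally produces continuous outputs, as in linear or nonlinear regression. In practice, however, if we want to use the idea of regression to solve a classification problem, the output becomes discrete, and each value stands for a different class. Consider the simplest case, binary classification. As in linear regression, we compute the linear score $\boldsymbol{\theta }^T\boldsymbol{x}$ and compare it with a threshold $z$: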
$f\left( x \right) =\left\{ \begin{array}{l} 0,\boldsymbol{\theta }^T\boldsymbol{x}\le z\\ 1,\boldsymbol{\theta }^T\boldsymbol{x}>z\\ \end{array} \right.$
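As a minimal sketch of this thresholding rule, the snippet below computes the linear score and compares it with the threshold. The function name `classify` and the weight values are made up for illustration and are not part of the original post.

```python
def classify(theta, x, z=0.0):
    """Return 1 if the linear score theta^T x is above the threshold z, else 0."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1 if score > z else 0

# Hypothetical weights, chosen only for this example
theta = [2.0, -1.0]

print(classify(theta, [1.0, 0.5]))  # score = 1.5, above the threshold
print(classify(theta, [0.2, 1.0]))  # score = -0.6, not above the threshold
```

A score exactly equal to the threshold falls into class 0, matching the $\le$ branch of the formula.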

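The sigmoid function

$\sigma \left( x \right) =\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}$
What makes this function so convenient is that its derivative with respect to x has a very compact form, and that its values always lie in the open interval (0, 1).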

Derivative of the sigmoid function

$\frac{d\sigma \left( x \right)}{dx}=\sigma \left( x \right) \left( 1-\sigma \left( x \right) \right)$
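The identity can be checked numerically. The sketch below compares the closed form $\sigma(x)(1-\sigma(x))$ with a central finite difference at a few points; the helper names `sigmoid_grad` and `central_difference` are illustrative choices, not names from the original post.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Closed form: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    # Numerical approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_grad(x), central_difference(sigmoid, x))
```

The two columns should agree to many decimal places, and the derivative reaches its maximum of 0.25 at x = 0.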

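Deriving the logistic regression formulas

To be honest, the derivation is a little involved, so here is the overall idea as I understand it. The main tool is maximum likelihood estimation: the linear regression output, passed through the sigmoid, is treated as the probability that a sample belongs to class 1. Taking the logarithm of the likelihood function gives a new optimization objective, from which the loss function is defined. We then run gradient descent on that loss to update the parameters. The key question is what the parameter update in the gradient descent formula looks like.

Loss function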

$L\left( \theta \right) =\prod_{i=1}^N{f}_{\theta}\left( x_i \right) ^{y_i}\left( 1-f_{\theta}\left( x_i \right) \right) ^{1-y_i}$

$\text{Loss function: }-l\left( \theta \right) =-\log L\left( \theta \right) =-\sum_{i=1}^N{\left[ y_i\log f_{\theta}\left( x_i \right) +\left( 1-y_i \right) \log \left( 1-f_{\theta}\left( x_i \right) \right) \right]}$
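To get a feel for this loss, the following sketch evaluates the negative log-likelihood on three made-up samples, once with predictions that agree with the labels and once with predictions that contradict them. The data and the function name `negative_log_likelihood` are assumptions for illustration only.

```python
import math

def negative_log_likelihood(probs, labels):
    """Sum of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

labels = [1, 0, 1]
good_probs = [0.9, 0.1, 0.8]  # predicted P(y=1), consistent with the labels
bad_probs = [0.1, 0.9, 0.2]   # predicted P(y=1), contradicting the labels

loss_good = negative_log_likelihood(good_probs, labels)
loss_bad = negative_log_likelihood(bad_probs, labels)
print(loss_good, loss_bad)
```

Confident predictions on the wrong side are penalized heavily, so the second loss is much larger than the first.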

My understanding of the loss function:

Matrix form of the gradient

$\nabla J\left( \theta \right) =-\nabla l\left( \theta \right) =-\boldsymbol{X}^T\left( \boldsymbol{y}-\sigma \left( \boldsymbol{X}\theta \right) \right)$
$\boldsymbol{X}\text{ is the sample matrix and }\boldsymbol{y}\text{ is the label vector}$
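A finite-difference check is a quick way to confirm the sign and shape of this gradient. The sketch below compares the closed form `-X.T @ (y - sigmoid(X @ theta))` for the gradient of the negative log-likelihood with a numerical approximation on random data. The data, the seed, and the helper names `nll` and `nll_grad` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    # Negative log-likelihood J(theta) = -l(theta)
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_grad(theta, X, y):
    # Closed-form gradient of J(theta)
    return -X.T @ (y - sigmoid(X @ theta))

# Random toy problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (nll(theta + step, X, y) - nll(theta - step, X, y)) / (2 * eps)

max_gap = np.max(np.abs(numeric - nll_grad(theta, X, y)))
print(max_gap)
```

A very small `max_gap` indicates that the closed form and the numerical gradient agree.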

Gradient descent parameter update

$\boldsymbol{\theta }\gets \boldsymbol{\theta }+\eta \boldsymbol{X}^T\left( \boldsymbol{y}-\sigma \left( \boldsymbol{X}\theta \right) \right)$
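The update rule can be tried on a tiny problem. This is a minimal sketch on four hand-made, linearly separable points, with a column of ones as the bias; the data, the step size `eta`, and the iteration count are assumptions chosen for illustration, not the settings used later in the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Four 1-D points centered on zero; the second column is the bias term
X = np.array([[-1.5, 1.0],
              [-0.5, 1.0],
              [0.5, 1.0],
              [1.5, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
eta = 0.1
losses = []
for _ in range(200):
    p = sigmoid(X @ theta)
    # Negative log-likelihood before this update
    losses.append(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    # theta <- theta + eta * X^T (y - sigma(X theta))
    theta = theta + eta * X.T @ (y - p)

print(losses[0], losses[-1])
print(theta)
```

With a small enough step size the recorded loss goes down at every iteration, and the learned weight on the feature is positive, so the two classes end up on opposite sides of the decision boundary.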

With an L2 regularization constraint

$\begin{gathered} \min_{\theta}J(\theta)=\min_{\theta}\left(-l\left(\theta\right)+\frac{\lambda}{2}\left\|\theta\right\|_{2}^{2}\right) \\ \nabla J\left(\theta\right)=-X^{T}\left(y-\sigma\left(X\theta\right)\right)+\lambda\theta \\ \theta\leftarrow\left(1-\lambda\eta\right)\theta+\eta X^{T}\left(y-\sigma\left(X\theta\right)\right) \end{gathered}$
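The last two lines say the same thing in two ways: stepping against the regularized gradient is the same as first shrinking $\theta$ by the factor $(1-\lambda\eta)$ and then taking the unregularized step. The sketch below checks that numerically on random data; the data and seed are assumptions, while `lam = 1.0` and `eta = 0.002` mirror the regularization strength and learning rate used in the training code of this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random toy problem (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = (rng.random(10) < 0.5).astype(float)
theta = rng.normal(size=3)

lam, eta = 1.0, 0.002
residual = y - sigmoid(X @ theta)

# Form 1: theta - eta * grad J(theta), with the lambda * theta term in the gradient
grad = -X.T @ residual + lam * theta
step_a = theta - eta * grad

# Form 2: shrink theta first, then add the unregularized step
step_b = (1 - lam * eta) * theta + eta * X.T @ residual

print(np.allclose(step_a, step_b))
```

The shrinking factor is why L2 regularization is often described as weight decay.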

Loss function with the regularization term

$-l\left( \boldsymbol{\theta } \right) +\frac{\lambda}{2}\lVert \boldsymbol{\theta } \rVert ^2$

Dataset link

Link: https://pan.baidu.com/s/1_4k2uy5nFfKOt9AfeT0sZg?pwd=xhuo
Extraction code: xhuo

# First, read the data and visualize it
import numpy as np

# Read the CSV file with genfromtxt
data = np.genfromtxt('lr_dataset.csv', delimiter=',')

# Print the NumPy array
print(data)

[[ 0.4304  0.2055  1.    ]
 [ 0.0898 -0.1527  1.    ]
 [ 0.2918 -0.1248  1.    ]
 ...
 [ 0.5826  0.4424  1.    ]
 [-0.0398  0.2877  1.    ]
 [ 0.0035  0.623   1.    ]]

# Visualize the data
import matplotlib.pyplot as plt

# Scatter plot: the first two columns are the features, the third is the label
x=data[:,0]
y=data[:,1]
labels=data[:,2]

# Pick a color for each label
colors = []
for label in labels:
    if label == 0:
        colors.append('#d62828')
    elif label == 1:
        colors.append('#006d77')
plt.scatter(x, y,c=colors)

# Add the title, grid, and axis labels
plt.title('Scatter Plot')
plt.grid()
plt.xlabel('X')
plt.ylabel('Y')

[Figure: scatter plot of the two features, colored by label]

# Sigmoid (logistic) function
def logostic(x):
    return 1 / (1 + np.exp(-x))

# Accuracy: fraction of predictions that match the true labels
def acc(y_true, y_pred):
    return np.mean(y_true == y_pred)

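def auc(y_true, y_pred):
    # Flatten to 1-D so that column vectors of shape (n, 1) are sorted correctly
    y_true = np.ravel(y_true)
    y_pred = np.ravel(y_pred)
    # Sort by predicted value in descending order: samples nearer the front
    # have a higher predicted probability of being positive
    idx = np.argsort(y_pred)[::-1]
    y_true = y_true[idx]
    y_pred = y_pred[idx]
    # Use the distinct values in y_pred as thresholds and count the FP and TP samples
    # Both arrays are sorted and aligned, so a running sum from the front is enough
    tp = np.cumsum(y_true)
    fp = np.cumsum(1 - y_true)
    tpr = tp / tp[-1]
    fpr = fp / fp[-1]
    # Walk along the FPR axis and accumulate the area under the curve
    # For convenience, prepend (0, 0) to FPR and TPR
    s = 0.0
    tpr = np.concatenate([[0], tpr])
    fpr = np.concatenate([[0], fpr])
    for i in range(1, len(fpr)):
        s += (fpr[i] - fpr[i - 1]) * tpr[i]
    return s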
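One way to sanity-check an AUC routine is to compare it with the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, counting ties as half. The sketch below implements that definition; `pairwise_auc` and the four toy samples are assumptions for illustration and are not part of the original code.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of correctly ordered (positive, negative) pairs."""
    y_true = np.ravel(y_true)
    y_score = np.ravel(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # diff[i, j] = score of positive i minus score of negative j
    diff = pos[:, None] - neg[None, :]
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))

y_true = np.array([1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1])
auc_value = pairwise_auc(y_true, y_score)
print(auc_value)
```

Here three of the four positive-negative pairs are ordered correctly, so the value is 0.75. When no scores are tied, a cumulative-sum implementation should return the same number on the same inputs.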

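# Next, split the data into a training set and a test set
from matplotlib.ticker import MaxNLocator

# Number of rows in the data
N=len(data[:,0])

# Fraction of the data used for training
alpha=0.7
train_N=int(N*alpha)
test_N=N-train_N

# The first train_N rows form the training set, the rest the test set (no shuffling)
train_x=data[0:train_N,0:2]
train_y=data[0:train_N,2:3]
test_x=data[train_N:N,0:2]
test_y=data[train_N:N,2:3]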
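The split above takes rows in file order. If the rows of a dataset happen to be grouped by label, a shuffled split may be preferable. The following is only a sketch of that alternative, not what the original code does; the function `shuffled_split`, its parameters, and the demo array are hypothetical.

```python
import numpy as np

def shuffled_split(data, train_fraction=0.7, seed=0):
    """Shuffle the rows, then split them into training and test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(data))
    n_train = int(len(data) * train_fraction)
    return data[perm[:n_train]], data[perm[n_train:]]

# Demo array with 8 rows and 3 columns (illustrative only)
demo = np.arange(24, dtype=float).reshape(8, 3)
train_part, test_part = shuffled_split(demo, train_fraction=0.75)
print(train_part.shape, test_part.shape)
```

Fixing the seed keeps the split reproducible between runs.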

# Now start training

# Append a column of ones to the feature matrix to act as the bias term
# train_x.shape[0] is the size of the first dimension of train_x, i.e. the number of rows
# axis=1 concatenates along columns (adds a column); axis=0 would concatenate along rows
print(train_x.shape)
print(train_y.shape)
X=np.concatenate([train_x,np.ones((train_x.shape[0],1))],axis=1)

print(test_x.shape)
X_test=np.concatenate([test_x,np.ones((test_x.shape[0],1))],axis=1)
print(X)
print(X.shape)

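# Initialize the logistic regression parameters as a column vector of shape (3, 1)
theta = np.random.normal(size=(X.shape[1],))
theta=theta.reshape((theta.shape[0], 1))
# print(theta)
# print(theta.shape)
# Number of gradient descent iterations
num_step=250

# Strength of the L2 regularization constraint
lameda_=1.0

# Learning rate
learning_rate=0.002
# Shapes: X is n×3 and theta is 3×1, so X@theta is n×1
train_loss=[]
test_loss=[]
train_acc=[]
test_acc=[]
train_auc=[]
test_auc=[]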
for i in range (num_step):
    # Compute the predicted probabilities, shape n×1
    predict=logostic(X@theta)
    # Gradient with the L2 regularization term included
    grad=-X.T@(train_y-predict)+lameda_*theta

    # Parameter update
    theta-=learning_rate*grad

    # Record the loss; with regularization, the penalty term is added as well
    # train_y: n×1
    # predict: n×1
    # theta: 3×1
    # np.linalg.norm(theta) returns the 2-norm of theta:
    # the square root of the sum of squares of all its elements
    Loss=-train_y.T@np.log(predict)-(1-train_y.T)@np.log(1-predict)+lameda_/2*np.linalg.norm(theta)*np.linalg.norm(theta)

    train_loss.append(Loss/train_x.shape[0])

    # Predictions on the test set
    test_predict=logostic(X_test@theta)

    # The test loss does not include the regularization term
    test_Loss=-(test_y.T@np.log(test_predict)+(1-test_y.T)@np.log(1-test_predict))

    # Average over the number of test samples
    test_loss.append(test_Loss/test_x.shape[0])
    # print("Iteration "+str(i)+"\n")

    train_acc.append(acc(train_y,predict>0.5))
    test_acc.append(acc(test_y,test_predict>0.5))

    train_auc.append(auc(train_y,predict))
    test_auc.append(auc(test_y,test_predict))

# Compute the prediction accuracy on the test set
y_pred = np.where(logostic(X_test @ theta) >= 0.5, 1, 0)
final_acc = acc(test_y, y_pred)
print('Prediction accuracy:', final_acc)
print('Regression coefficients:', theta)


plt.figure(figsize=(13, 9))
xticks = np.arange(num_step) + 1
# Plot the training curves
# plt.subplot(221)
train_loss=np.array(train_loss)
train_loss= train_loss.flatten()
plt.plot(xticks, train_loss, color='blue', label='train loss')

test_loss=np.array(test_loss)
test_loss= test_loss.flatten()

plt.plot(xticks, test_loss, color='red', ls='--', label='test loss')
plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

(700, 2)
(700, 1)
(300, 2)
[[ 0.4304  0.2055  1.    ]
 [ 0.0898 -0.1527  1.    ]
 [ 0.2918 -0.1248  1.    ]
 ...
 [ 0.2403  0.235   1.    ]
 [ 0.9708  0.7746  1.    ]
 [ 0.5301 -0.3728  1.    ]]
(700, 3)
Prediction accuracy: 0.91
Regression coefficients: [[2.90282086]
 [2.82757117]
 [0.54846955]]

[Figure: training loss and test loss versus epochs]