Logistic Regression via Newton's Method

This article walks through solving a Logistic Regression model with Newton's method, covering the model formulation, the loss function, and the derivation of its first and second derivatives. A Matlab implementation shows the computation from the loss function to the Hessian matrix, together with the iterative Newton solver. It also examines how the learning rate affects model performance and discusses regularization as a potential improvement to generalization.

Solving Logistic Regression with Newton's Method

1、The Logistic Regression model and the Newton's method derivation

  1. The logistic model is ($\mathbf{x}_i$ is the $i$-th sample, a row vector; $\mathbf{w}$ is the weight vector, a column vector):
    $$p(y_{i}=1\mid\mathbf{x}_{i},\mathbf{w})=\sigma_{i}=\mathrm{sigmoid}(\mathbf{x}_{i}\mathbf{w})=\frac{1}{1+e^{-\mathbf{x}_{i}\mathbf{w}}},\qquad p(y_{i}=0\mid\mathbf{x}_{i},\mathbf{w})=1-\sigma_{i}.$$

  2. Its loss function (the average negative log-likelihood) is
    $$l=\mathrm{Loss}(\mathbf{w})=-\frac{1}{m}\ln p(\mathbf{y}\mid X,\mathbf{w})=-\frac{1}{m}\left[\mathbf{y}^{T}\ln(\sigma)+(\mathbf{1}-\mathbf{y})^{T}\ln(\mathbf{1}-\sigma)\right].$$

  3. The first derivative of the loss is
    $$\frac{\partial l}{\partial w_{j}}=\frac{1}{m}\sum_{i=1}^{m}(\sigma_{i}-y_{i})x_{ij}.$$
    In vector form: starting from
    $$\frac{\partial l}{\partial\mathbf{w}}=-\frac{1}{m}\frac{\partial\left[\mathbf{y}^{T}\ln(\mathrm{sigmoid}(X\mathbf{w}))+(\mathbf{1}-\mathbf{y})^{T}\ln(\mathbf{1}-\mathrm{sigmoid}(X\mathbf{w}))\right]}{\partial\mathbf{w}},$$
    take the differential and work under the trace:
    $$\mathrm{d}l=-\frac{1}{m}\,\mathrm{tr}\left(\left(\mathbf{y}\odot\frac{\sigma(1-\sigma)}{\sigma}\right)^{T}\mathrm{d}(X\mathbf{w})-\left((\mathbf{1}-\mathbf{y})\odot\frac{\sigma(1-\sigma)}{1-\sigma}\right)^{T}\mathrm{d}(X\mathbf{w})\right)$$

    $$=-\frac{1}{m}\,\mathrm{tr}\left(\left(\mathbf{y}\odot\frac{\sigma(1-\sigma)}{\sigma}\right)^{T}X\,\mathrm{d}\mathbf{w}-\left((\mathbf{1}-\mathbf{y})\odot\frac{\sigma(1-\sigma)}{1-\sigma}\right)^{T}X\,\mathrm{d}\mathbf{w}\right)$$

    $$=-\frac{1}{m}\,\mathrm{tr}\left(\left(X^{T}\left(\mathbf{y}\odot\frac{\sigma(1-\sigma)}{\sigma}\right)\right)^{T}\mathrm{d}\mathbf{w}-\left(X^{T}\left((\mathbf{1}-\mathbf{y})\odot\frac{\sigma(1-\sigma)}{1-\sigma}\right)\right)^{T}\mathrm{d}\mathbf{w}\right)=-\frac{1}{m}\,\mathrm{tr}\left(\left(X^{T}(\mathbf{y}-\sigma)\right)^{T}\mathrm{d}\mathbf{w}\right).$$

    Since $\mathrm{d}l=\mathrm{tr}\left(\frac{\partial l}{\partial\mathbf{w}}^{T}\mathrm{d}\mathbf{w}\right)$, it follows that

    $$\frac{\partial l}{\partial\mathbf{w}}=\frac{1}{m}X^{T}(\sigma-\mathbf{y}).$$

  4. Next, the second derivative of the loss (the Hessian matrix):
    $$H_{ij}=\frac{\partial^{2}l}{\partial w_{i}\partial w_{j}}=\frac{1}{m}\frac{\partial}{\partial w_{j}}\sum_{t=1}^{m}(\sigma_{t}-y_{t})x_{ti}=\frac{1}{m}\sum_{t=1}^{m}\sigma_{t}(1-\sigma_{t})x_{ti}x_{tj}.$$
    In matrix form,
    $$\mathrm{d}\nabla_{\mathbf{w}}l=\frac{1}{m}X^{T}\left(\mathrm{sigmoid}'(X\mathbf{w})\odot\mathrm{d}(X\mathbf{w})\right)=\frac{1}{m}X^{T}\left(\mathrm{sigmoid}'(X\mathbf{w})\odot X\,\mathrm{d}\mathbf{w}\right),$$

    $$\mathrm{vec}(\mathrm{d}\nabla_{\mathbf{w}}l)=\frac{1}{m}X^{T}\mathrm{diag}\left(\sigma(1-\sigma)\right)X\,\mathrm{vec}(\mathrm{d}\mathbf{w}),$$

    $$\mathrm{vec}(\mathrm{d}\nabla_{\mathbf{w}}l)=\left(\nabla^{2}_{\mathbf{w}}l\right)^{T}\mathrm{vec}(\mathrm{d}\mathbf{w}),$$

    so

    $$H=\nabla^{2}_{\mathbf{w}}l=H(\mathbf{w})=\frac{1}{m}X^{T}AX,\qquad A=\mathrm{diag}\left(\sigma_{i}(1-\sigma_{i})\right).$$

  5. Since $A$ is diagonal with entries $\sigma_{i}(1-\sigma_{i})>0$, the quadratic form $H=\frac{1}{m}X^{T}AX$ is positive semidefinite, and positive definite when $X$ has full column rank. The loss $l$ is therefore convex and attains a minimum, which can be found by solving $\frac{\partial l}{\partial\mathbf{w}}=0$. Newton's method converges quadratically on such problems, so it solves this one efficiently.

  6. Following the steps of Newton's method, the iteration is
    $$\mathbf{w}:=\mathbf{w}-\alpha\left(H^{-1}\frac{\partial l}{\partial\mathbf{w}}\right).$$

    The direction $H^{-1}\frac{\partial l}{\partial\mathbf{w}}$ is called the Newton direction. Because Newton's method is only a second-order approximation, each iteration multiplies this direction by a learning rate $\alpha$ as a stand-in for the neglected higher-order infinitesimals (higher-order derivative terms); $\alpha\left(H^{-1}\frac{\partial l}{\partial\mathbf{w}}\right)$ is thus the step taken at each Newton iteration. (A numerical sanity check of the derivatives above is sketched below.)
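
Before relying on these formulas, the gradient and Hessian can be verified numerically. The following is a minimal sketch (assuming the Matlab routines sigmoid.m, loss_function.m, Loss_d1.m, and Loss_d2.m from Section 2 below are on the path) that compares the analytic derivatives against central finite differences on a small random problem:

    % Numerical sanity check of the gradient and Hessian on a toy problem
    rng(0);                                   % reproducible random data
    m = 20; n = 3;
    X = [ones(m, 1) randn(m, n - 1)];         % features with a constant column
    y = double(rand(m, 1) > 0.5);             % random 0/1 labels
    w = randn(n, 1);
    h = 1e-6;                                 % finite-difference step

    % Central differences of the loss vs. the analytic gradient
    g_num = zeros(n, 1);
    for j = 1:n
        e = zeros(n, 1); e(j) = h;
        g_num(j) = (loss_function(w + e, X, y) - loss_function(w - e, X, y)) / (2 * h);
    end
    disp(norm(g_num - Loss_d1(w, X, y)));     % should be close to 0

    % Central differences of the gradient vs. the analytic Hessian
    H_num = zeros(n, n);
    for j = 1:n
        e = zeros(n, 1); e(j) = h;
        H_num(:, j) = (Loss_d1(w + e, X, y) - Loss_d1(w - e, X, y)) / (2 * h);
    end
    disp(norm(H_num - Loss_d2(w, X, y)));     % should be close to 0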

2、Implementing the Logistic Regression model in code (Matlab)

  • sigmoid.m (computes the sigmoid function value)

    function g = sigmoid(z)
    %SIGMOID Compute sigmoid function
    %   g = SIGMOID(z) computes the sigmoid of z.
    g=1./(1+exp(-z));
    
    end
    
  • loss_function.m (computes the model's loss over all samples)

    function J = loss_function(w, X, y)
    %LOSS_FUNCTION Compute the loss of the logistic regression model
    %       J = LOSS_FUNCTION(w, X, y) computes the loss of the logistic 
    %regression model of all samples with weight w.
    m = size(X, 1);
    J = 0;
    for i = 1:m
        if y(i) == 1
            J = J - log(sigmoid(X(i,:) * w));
        else
            J = J - log(1 - sigmoid(X(i,:) * w));
        end
    end
    J = J / m;
    
    end
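
    The per-sample loop above matches the matrix form of the loss from Section 1. As a sketch, a vectorized variant (numerically equivalent as long as the sigmoid output stays strictly inside (0, 1)) could be:

    function J = loss_function(w, X, y)
    %LOSS_FUNCTION Vectorized loss of the logistic regression model.
    m = size(X, 1);
    s = sigmoid(X * w);                       % predicted probabilities
    J = -(y' * log(s) + (1 - y)' * log(1 - s)) / m;

    end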
    
  • Loss_d1.m (computes the first derivative of the loss function, a vector)

    function dl = Loss_d1(w, X, y)
    %LOSS_D1 Compute the derivative
    %   dl = LOSS_D1(w, X, y) computes the derivative of the loss function 
    %of the logistic regression.
    m = size(y, 1);
    dl = X' * (sigmoid(X * w) - y) / m;
    
    end
    
  • Loss_d2.m (computes the second derivative of the loss function, the Hessian matrix)

    function H = Loss_d2(w, X, y)
    %LOSS_D2 Compute the Hessian Matrix
    %   H = LOSS_D2(w, X, y) computes the Hessian Matrix of the loss function 
    %of the logistic regression.
    m = size(y, 1);
    n = zeros(1, m);
    for i = 1:m
        n(i) = sigmoid(X(i,:) * w) * (1 - sigmoid(X(i,:) * w));
    end
    A = diag(n);
    H = X' * A * X / m;
    
    end
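
    Building the full m-by-m matrix diag(n) is wasteful when m is large. As a sketch, the same Hessian can be formed by weighting the rows of X elementwise, using implicit expansion (available in Matlab R2016b and later):

    function H = Loss_d2(w, X, y)
    %LOSS_D2 Vectorized Hessian without forming the m-by-m diagonal matrix.
    m = size(y, 1);
    s = sigmoid(X * w);                       % sigma_i for every sample
    H = X' * (X .* (s .* (1 - s))) / m;       % rows of X scaled by sigma_i(1-sigma_i)

    end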
    
  • newton_method.m (the Newton's method solver)

    function [out, iteration, weight] = newton_method(alpha, w, X, y)
    %NEWTON_METHOD Solve the logistic regression by Newton's method
    %   [out, iteration, weight] = NEWTON_METHOD(alpha, w, X, y) returns the
    %predicted results, the number of iterations used, and the weight vector
    %of the logistic regression.
    %limit is the maximum number of iterations
    %alpha is the learning rate
    
    limit = 1000000;
    
    %Iterative Solution (Newton Method)
    for i = 1:limit
        dl = Loss_d1(w, X, y);
        H = Loss_d2(w, X, y);
        w_next = w - alpha * (pinv(H) * dl); %Use pinv in case the matrix is approaching a singular matrix
        if loss_function(w_next, X, y) < 1e-6 %The condition to judge the convergence
            w = w_next;
            break
        end
        w = w_next;
    end
    
    %Compute the result with the weight got above and predict all the samples
    out = sigmoid(X * w);
    iteration = i;
    weight = w;
    end
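
    Note that the stopping test above compares the loss value itself against 1e-6, which only triggers when the data are nearly separable and the loss can be driven toward zero; on non-separable data the loop always runs to limit. A more robust criterion, as a sketch, watches the gradient norm or the step size instead, replacing the if test inside the loop:

    if norm(dl) < 1e-6 || norm(w_next - w) < 1e-10 %Stop when the gradient or the step is tiny
        w = w_next;
        break
    end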
    
  • evaluation.m (plots the ROC curve and computes the AUC)

    function auc = evaluation(out, y, alpha)
    %EVALUATION Evaluate the performance of the model
    %   auc = EVALUATION(out, y, alpha) evaluates the performance by plotting
    %the ROC curve and computing the AUC.
    
    %(xx, yy) is the initial point
    %x_step is the step when we meet FP
    %y_step is the step when we meet TP
    xx = 1.0;
    yy = 1.0;
    pos_num = sum(y == 1);
    neg_num = sum(y == 0);
    x_step = 1.0 / neg_num;
    y_step = 1.0 / pos_num;
    
    %Sort the predicted data and use the index to sort the real
    %value y to judge whether the value is TP or FP
    [out, index] = sort(out);
    y = y(index);
    X(1) = 1;
    Y(1) = 1;
    for i = 1:length(y)
        if y(i) == 1 % TP
            yy = yy - y_step;
        else %FP
            xx = xx - x_step;
        end
        X(i + 1) = xx;
        Y(i + 1) = yy;
    end
    
    %Plot the ROC curve
    name = "alpha" + num2str(alpha) + '.jpg';
    figure('name', name);
    plot(X, Y, '-ro', 'LineWidth', 2, 'MarkerSize', 3);  
    xlabel('FPR');  
    ylabel('TPR');  
    title('ROC');
    f = gca;
    f.XAxisLocation = 'origin';
    f.YAxisLocation = 'origin';
    f.XLim = [0 1.1];
    f.YLim = [0 1.1];
    saveas(gcf, name);
    
    %Calculate the area below the ROC curve
    auc = -trapz(X,Y);            
    end
    
  • Logistic_Reg.m (solves the model)

    %'data.xlsx' is the dataset
    %X is the feature matrix
    %y is the classification vector
    %m is the number of features (including the constant term)
    %n is the number of samples
    %w_initial is the initial weight vector, an all-ones vector
    %w is the weight vector obtained by Newton's method
    %alpha is the learning rate
    %out is the predicted results
    %auc is the evaluation metric
    X = xlsread('data.xlsx','A:B');
    y = xlsread('data.xlsx','C:C');
    X = [ones(size(X,1), 1) X];%Add one column of 1s as the constant term
    y = (y == 1);%Change the label '-1' to '0' to make the computation easier
    y = double(y);
    m = size(X, 2);
    n = size(X, 1);
    w_initial = ones(m, 1);
    alpha = 0.05;
    w = zeros(80, m);
    out = zeros(n, 80);
    iteration = zeros(80, 1);
    auc = zeros(80, 1);
    
    
    %Solve the problem by using newton method and evaluate the performance at
    %the same time
    for i = 1:80
        [out(:,i), iteration(i), w(i,:)] = newton_method(alpha, w_initial, X, y);
        auc(i) = evaluation(out(:,i), y, alpha);
        alpha = alpha + 0.05;
    end
    
    save('out.mat', 'out');
    save('iteration.mat', 'iteration');
    save('auc.mat', 'auc');
    save('w.mat', 'w');
    

3、Based on each sample's label and prediction, outcomes split into true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Set different thresholds p, write code to plot the ROC curve with FPR on the x-axis and TPR on the y-axis, and compute the AUC

The evaluation.m function from Section 2 plots the ROC curve from the predictions out and computes the AUC.

The y-axis of the ROC curve is $TPR=\frac{TP}{TP+FN}$ and the x-axis is $FPR=\frac{FP}{FP+TN}$; the curve's shape shows how TPR and FPR change as the threshold p varies.

The curve can be drawn as follows (a sketch of the per-threshold bookkeeping appears after this list):

  1. Count the positive and negative examples among the labels; use $\frac{1.0}{\#\text{positives}}$ as the step in the y direction and $\frac{1.0}{\#\text{negatives}}$ as the step in the x direction.
  2. Sort the predictions in ascending order and reorder the labels accordingly.
  3. Start from the point (1.0, 1.0) and iterate over each value in y:
    • When the label is 1, taking that point's prediction as the threshold turns one positive example into a predicted negative, i.e. TP decreases by 1.
    • When the label is 0, taking that point's prediction as the threshold turns one negative example into a predicted negative, i.e. FP decreases by 1.
  4. Finally, the area between the ROC curve and the x-axis (the AUC) serves as the model's evaluation metric.
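
To make the bookkeeping concrete, each threshold p yields one point of the ROC curve. The following minimal sketch computes the (FPR, TPR) point for a single hypothetical threshold, with out a vector of predictions and y the labels as in Section 2:

    p = 0.5;                          % one hypothetical threshold
    pred = (out >= p);                % predicted labels at this threshold
    TP = sum(pred == 1 & y == 1);
    FP = sum(pred == 1 & y == 0);
    FN = sum(pred == 0 & y == 1);
    TN = sum(pred == 0 & y == 0);
    TPR = TP / (TP + FN);             % y-coordinate of the ROC point
    FPR = FP / (FP + TN);             % x-coordinate of the ROC point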

4、Comparing the effect of different learning rates on performance

Because Newton's method is only a second-order approximation to the problem, a learning rate $\alpha$ can be introduced, as in gradient descent, to stand in for the higher-order infinitesimals; learning rates can then be compared on performance to pick one that converges quickly.

Typically, $\alpha$ is swept over an interval with a fixed step and each setting is evaluated separately; this is what Logistic_Reg.m above does, trying 80 values of $\alpha$ from 0.05 to 4.0 in steps of 0.05.

5、Discussion and improvements

  1. The current model only optimizes fit to the given dataset and imposes no constraints, so the learned weight vector may overfit the data and generalize poorly. A regularization term can be introduced to constrain the weight magnitudes and mitigate problems caused by overfitting, e.g. a quadratic (L2) penalty, changing the loss to

$$l=\mathrm{Loss}(\mathbf{w})=-\frac{1}{m}\ln p(\mathbf{y}\mid X,\mathbf{w})+\lambda\mathbf{w}^{T}\mathbf{w}=-\frac{1}{m}\left[\mathbf{y}^{T}\ln(\sigma)+(\mathbf{1}-\mathbf{y})^{T}\ln(\mathbf{1}-\sigma)\right]+\lambda\mathbf{w}^{T}\mathbf{w}.$$
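
Under this penalty only the derivatives change: the gradient gains a $2\lambda\mathbf{w}$ term and the Hessian a $2\lambda I$ term, so only two lines of code need updating. A sketch, with a hypothetical $\lambda$ and with A and m as in Loss_d1.m/Loss_d2.m above:

    lambda = 0.01;                                        % hypothetical regularization strength
    dl = X' * (sigmoid(X * w) - y) / m + 2 * lambda * w;  % regularized gradient
    H  = X' * A * X / m + 2 * lambda * eye(size(X, 2));   % regularized Hessian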

  2. The evaluation so far uses only the ROC curve and its AUC. Precision $=\frac{TP}{TP+FP}$ and recall $=\frac{TP}{TP+FN}$ could also be introduced to evaluate the model more comprehensively (a minimal sketch follows this list).

  3. Model selection is currently based on the training set alone. To improve further, the dataset could be split in fixed proportions into training, validation, and test sets, each used to select different parameters; this can effectively improve generalization.
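
As a minimal sketch, precision and recall can be computed from the same counts as in the Section 3 sketch, for one prediction vector out and a hypothetical threshold:

    pred = (out >= 0.5);              % hypothetical threshold
    TP = sum(pred == 1 & y == 1);
    FP = sum(pred == 1 & y == 0);
    FN = sum(pred == 0 & y == 1);
    precision = TP / (TP + FP);       % fraction of predicted positives that are true
    recall    = TP / (TP + FN);       % fraction of true positives that are recovered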
