Recently I noticed some gaps in my understanding of Logistic Regression, so I reviewed it and worked through the mathematical derivations by hand.
Generally Speaking
In general, suppose we have $m$ examples, each with $n$ features. The cost function of Logistic Regression is
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_{\theta}(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]$$
in which $x^{(i)}$ represents the features of the $i^{th}$ example, $y^{(i)}$ denotes the label of the $i^{th}$ example,

$$\log\big(h_\theta(x^{(i)})\big) =\log\Big(\frac{1}{1+e^{-\theta^{T}x^{(i)}}}\Big) =-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\tag{1}$$
and

$$\log\big(1-h_\theta(x^{(i)})\big) =\log\Big(\frac{e^{-\theta^{T}x^{(i)}}}{1+e^{-\theta^{T}x^{(i)}}}\Big) =-\theta^Tx^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\tag{2}$$
In formulas $(1)$ and $(2)$ we use the sigmoid function as the activation, so we can substitute the expressions from $(1)$ and $(2)$ for $\log\big(h_\theta(x^{(i)})\big)$ and $\log\big(1-h_\theta(x^{(i)})\big)$ in the cost function:
$$\begin{aligned} J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log(1+e^{-\theta^{T}x^{(i)}})+(1-y^{(i)})\Big(-\theta^Tx^{(i)}-\log(1+e^{-\theta^Tx^{(i)}})\Big)\Big] \end{aligned}\tag{3}$$
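As a quick sanity check before simplifying further, the cost in its original cross-entropy form can be evaluated directly. The following is only a minimal sketch: it assumes X is the $m\times(n+1)$ design matrix whose rows are the examples (bias column included), y an $m\times 1$ vector of 0/1 labels, and theta an $(n+1)\times 1$ parameter vector, the same shapes used in the MATLAB code at the end of this post.

m = length(y);                          % number of training examples
h = 1 ./ (1 + exp(-X * theta));         % h_theta(x^(i)) for every example, an m-by-1 vector
J = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h));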
Mathematical Derivation of $J(\theta)$
We can expand and simplify formula $(3)$ as follows:
$$\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log(1+e^{-\theta^{T}x^{(i)}})-\theta^Tx^{(i)}+y^{(i)}\theta^Tx^{(i)} -\log(1+e^{-\theta^Tx^{(i)}}) + y^{(i)}\log(1+e^{-\theta^Tx^{(i)}})\Big] \\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ -\theta^Tx^{(i)} + y^{(i)}\theta^Tx^{(i)} - \log(1 + e^{-\theta^Tx^{(i)}})\Big] \end{aligned}\tag{4}$$
Noting that $\theta^Tx^{(i)}=\log\big(e^{\theta^Tx^{(i)}}\big)$, we can continue from formula $(4)$:
$$\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ -\log\big(e^{\theta^Tx^{(i)}}\big) + y^{(i)}\theta^Tx^{(i)} - \log(1 + e^{-\theta^Tx^{(i)}})\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\theta^Tx^{(i)} - \log\Big(\big(1 + e^{-\theta^Tx^{(i)}}\big)e^{\theta^Tx^{(i)}}\Big)\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\theta^Tx^{(i)} - \log\big(1 + e^{\theta^Tx^{(i)}}\big)\Big] \end{aligned}$$
So the Cost Function of Logistic Regression can be written as:
$$\begin{aligned} J(\theta) &=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\theta^Tx^{(i)} - \log\big(1 + e^{\theta^Tx^{(i)}}\big)\Big] \end{aligned}\tag{5}$$
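Formula $(5)$ translates into MATLAB almost verbatim. This is a sketch under the same assumptions about X, y, and theta as before (and with no safeguard against exp overflow for large $\theta^Tx$):

z = X * theta;                          % z(i) = theta' * x^(i), an m-by-1 vector
m = length(y);
J = -(1/m) * sum(y .* z - log(1 + exp(z)));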
Mathematical Derivation of $\frac{\partial J}{\partial \theta_j}$
From the above, we can compute $\frac{\partial J}{\partial \theta_j}$ more easily using formula $(5)$:
$$\begin{aligned} \frac{\partial J}{\partial \theta_j}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}x_j^{(i)} - \frac{x^{(i)}_j e^{\theta^Tx^{(i)}}}{1 + e^{\theta^Tx^{(i)}}}\Big] \\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}x_j^{(i)} - \frac{x^{(i)}_j}{1 + e^{-\theta^Tx^{(i)}}}\Big] \end{aligned}\tag{6}$$
Notice that:
$$h_\theta(x^{(i)}) =\frac{1}{1+e^{-\theta^{T}x^{(i)}}}$$
so formula $(6)$ can be written as:
$$\begin{aligned} \frac{\partial J}{\partial \theta_j}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}x_j^{(i)} - h_\theta(x^{(i)})\,x^{(i)}_j\Big] \\ &=\frac{1}{m}\sum_{i=1}^{m}\Big[ x_j^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big] \end{aligned}\tag{7}$$
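Formula $(7)$ already gives a recipe for each gradient component. A straightforward, not-yet-vectorized sketch (with the same assumed X, y, and theta as above) would simply loop over $j$:

m = length(y);
h = 1 ./ (1 + exp(-X * theta));         % h_theta(x^(i)) for every example
grad = zeros(size(theta));
for j = 1:length(theta)
    grad(j) = (1/m) * sum(X(:, j) .* (h - y));   % formula (7) for theta_j
end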
We now have $\frac{\partial J}{\partial \theta_j}$. If we want to implement this in MATLAB or Python, we can use some linear algebra to take the derivation a step further:
$$\begin{aligned} \left[ \begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n}\\ \end{matrix} \right] &=\frac{1}{m}\left[ \begin{matrix} \sum_{i=1}^{m}\Big( x_0^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \sum_{i=1}^{m}\Big( x_1^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \sum_{i=1}^{m}\Big( x_2^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \vdots\\ \sum_{i=1}^{m}\Big( x_n^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \end{matrix} \right] \end{aligned}\tag{8}$$
In formula $(8)$, note that $\big( h_\theta(x^{(i)})-y^{(i)}\big)$ is a scalar (a single number), and $x_j^{(i)}$, the $j^{th}$ feature of the $i^{th}$ example, is also a scalar, so we can conclude that:
$$\begin{aligned} \left[ \begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n}\\ \end{matrix} \right] &=\frac{1}{m}\sum^m_{i=1}\Big(\big(h_\theta(x^{(i)})-y^{(i)}\big)x^{(i)}\Big) \end{aligned}\tag{9}$$
in which

$$x^{(i)}= \left[ \begin{matrix} x^{(i)}_0\\ x^{(i)}_1\\ x^{(i)}_2\\ \vdots\\ x^{(i)}_n\\ \end{matrix} \right]$$
Finally, using the above formulas, we have:
$$\begin{aligned} \left[ \begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n}\\ \end{matrix} \right] =\frac{1}{m}X^T\big(h_\theta(x)-y\big) \end{aligned}\tag{10}$$
where

$$\begin{aligned} &X= \left[ \begin{matrix} x^{(1)}_0 & x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \cdots & x^{(1)}_n\\ x^{(2)}_0 & x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \cdots & x^{(2)}_n\\ x^{(3)}_0 & x^{(3)}_1 & x^{(3)}_2 & x^{(3)}_3 & \cdots & x^{(3)}_n\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ x^{(m)}_0 & x^{(m)}_1 & x^{(m)}_2 & x^{(m)}_3 & \cdots & x^{(m)}_n\\ \end{matrix} \right] \\[2ex] &h_\theta(x)-y = \left[ \begin{matrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\ h_\theta(x^{(3)})-y^{(3)}\\ \vdots\\ h_\theta(x^{(m)})-y^{(m)}\\ \end{matrix} \right] \end{aligned}$$
Formula $(10)$ makes it easy to program the Logistic Regression derivatives, as in the sketch below.
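As a sketch, formula $(10)$ reduces the whole gradient to a single line (using a sigmoid helper like the one defined in the MATLAB section below, and assuming X, y, theta, and m as before):

grad = (1/m) * (X' * (sigmoid(X * theta) - y));   % formula (10), an (n+1)-by-1 vector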
An important detail
Note that $x^{(i)}_0$ denotes the bias feature in every example, which is always equal to $1$. This detail hardly matters until we add regularization to avoid overfitting, because the bias parameter $\theta_0$ should not be regularized. A sketch of building the design matrix with the bias column follows.
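In code this usually just means prepending a column of ones to the feature matrix. A minimal sketch, where X_raw is a hypothetical m-by-n matrix holding the raw features without the bias column:

X = [ones(size(X_raw, 1), 1), X_raw];   % x_0^(i) = 1 for every example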
Examples in MATLAB
In this part, I use MATLAB to program the Logistic Regression cost function. To avoid overfitting, I add L2 regularization to improve the quality of the parameters.
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with
%regularization
% J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% =============================================================
h_x = sigmoid(X*theta);
grad = 1/m .*(X' * (h_x - y));
temp = theta;
temp(1) = 0; % because we don't add anything for j = 0
J = 1/m*(sum(-y.*log(h_x) - (1-y) .* log(1-h_x))) + lambda/(2*m) .* sum(temp.^2);
grad = grad + lambda ./ m .* temp;
% =============================================================
grad = grad(:);
end
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% J = SIGMOID(z) computes the sigmoid of z.
g = 1.0 ./ (1.0 + exp(-z));
end
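One possible way to use lrCostFunction is to hand it to an optimizer such as fminunc (from the Optimization Toolbox). This is only a usage sketch: it assumes X (with the bias column) and y are already loaded, and the lambda value and iteration count are arbitrary choices.

initial_theta = zeros(size(X, 2), 1);            % start from all-zero parameters
lambda = 1;                                      % arbitrary regularization strength
options = optimset('GradObj', 'on', 'MaxIter', 400);
costFunc = @(t) lrCostFunction(t, X, y, lambda); % returns both J and grad
[theta, cost] = fminunc(costFunc, initial_theta, options);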
References
Andrew Ng, Machine Learning Course in Coursera: https://www.coursera.org/learn/machine-learning
Neural Networks and Deep Learning: https://nndl.github.io/