Derivative of the Cost Function of Logistic Regression: Some Mathematical Derivations

Recently I had some misconceptions about logistic regression, so I reviewed it and worked through some of the mathematical derivations by hand.

Generally Speaking

In general, suppose we have $m$ training examples, each with $n$ features. The cost function of logistic regression is
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_{\theta}(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]$$
in which $x^{(i)}$ represents the features of the $i^{th}$ example and $y^{(i)}$ denotes the label of the $i^{th}$ example,
$$\log\big(h_\theta(x^{(i)})\big)=\log\Big(\frac{1}{1+e^{-\theta^{T}x^{(i)}}}\Big)=-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\tag{1}$$
and
$$\log\big(1-h_\theta(x^{(i)})\big)=\log\Big(\frac{e^{-\theta^{T}x^{(i)}}}{1+e^{-\theta^{T}x^{(i)}}}\Big)=-\theta^{T}x^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\tag{2}$$

In formulas $(1)$ and $(2)$ we use the sigmoid function as the activation function. Substituting $(1)$ and $(2)$ for $\log\big(h_\theta(x^{(i)})\big)$ and $\log\big(1-h_\theta(x^{(i)})\big)$ in the cost function gives:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log\big(1+e^{-\theta^{T}x^{(i)}}\big)+(1-y^{(i)})\Big(-\theta^{T}x^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\Big)\Big]\tag{3}$$

Mathematical Derivation of $J(\theta)$

We can expand and simplify formula $(3)$ as follows:

$$\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log\big(1+e^{-\theta^{T}x^{(i)}}\big)-\theta^{T}x^{(i)}+y^{(i)}\theta^{T}x^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)+y^{(i)}\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[-\theta^{T}x^{(i)}+y^{(i)}\theta^{T}x^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\Big] \end{aligned}\tag{4}$$

Noting that $\theta^{T}x^{(i)}=\log\big(e^{\theta^{T}x^{(i)}}\big)$, we continue from formula $(4)$:
$$\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[-\log\big(e^{\theta^{T}x^{(i)}}\big)+y^{(i)}\theta^{T}x^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\theta^{T}x^{(i)}-\log\Big(\big(1+e^{-\theta^{T}x^{(i)}}\big)e^{\theta^{T}x^{(i)}}\Big)\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\theta^{T}x^{(i)}-\log\big(1+e^{\theta^{T}x^{(i)}}\big)\Big] \end{aligned}$$

So the cost function of logistic regression can be written as:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\theta^{T}x^{(i)}-\log\big(1+e^{\theta^{T}x^{(i)}}\big)\Big]\tag{5}$$
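As a quick numerical sanity check (a minimal sketch; the values of X, y, and theta below are made up for illustration), formula $(5)$ should give exactly the same value as the original cross-entropy cost:

% Sanity check: formula (5) agrees with the original cross-entropy cost.
% X, y, and theta are arbitrary made-up values; rows of X are examples
% and the first column is the bias feature x_0 = 1.
X = [1  0.5 -1.2;
     1  2.0  0.3;
     1 -0.7  1.8];
y = [1; 0; 1];
theta = [0.1; -0.4; 0.8];
m = length(y);

z = X * theta;                    % z(i) = theta' * x^(i)
h = 1 ./ (1 + exp(-z));           % h_theta(x^(i)), the sigmoid

J_cross_entropy = -1/m * sum(y .* log(h) + (1 - y) .* log(1 - h));
J_formula5      = -1/m * sum(y .* z - log(1 + exp(z)));
disp(abs(J_cross_entropy - J_formula5))   % ~0 up to floating-point error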

Mathematical Derivation of $\frac{\partial J}{\partial \theta_j}$

Using formula $(5)$, we can now calculate $\frac{\partial J}{\partial \theta_j}$ more easily:
$$\begin{aligned} \frac{\partial J}{\partial \theta_j}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}x_j^{(i)}-\frac{x^{(i)}_j e^{\theta^{T}x^{(i)}}}{1+e^{\theta^{T}x^{(i)}}}\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}x_j^{(i)}-\frac{x^{(i)}_j}{1+e^{-\theta^{T}x^{(i)}}}\Big] \end{aligned}\tag{6}$$
Notice that:
$$h_\theta(x^{(i)})=\frac{1}{1+e^{-\theta^{T}x^{(i)}}}$$
so formula $(6)$ can be written as:

$$\begin{aligned} \frac{\partial J}{\partial \theta_j}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}x_j^{(i)}-h_\theta(x^{(i)})x^{(i)}_j\Big] \\ &=\frac{1}{m}\sum_{i=1}^{m}\Big[x_j^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)\Big] \end{aligned}\tag{7}$$
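Formula $(7)$ translates directly into a loop over the parameters $\theta_j$. A minimal sketch, reusing the made-up X, y, theta, m, z, and h from the check above, with an optional finite-difference comparison against $J(\theta)$ from formula $(5)$:

% Gradient per formula (7): a loop over the parameters theta_j.
grad_loop = zeros(size(theta));
for j = 1:length(theta)
    grad_loop(j) = 1/m * sum(X(:, j) .* (h - y));
end

% Optional finite-difference check of one component (here j = 2)
eps_fd = 1e-6;
theta_p = theta;  theta_p(2) = theta_p(2) + eps_fd;
theta_m = theta;  theta_m(2) = theta_m(2) - eps_fd;
Jp = -1/m * sum(y .* (X*theta_p) - log(1 + exp(X*theta_p)));
Jm = -1/m * sum(y .* (X*theta_m) - log(1 + exp(X*theta_m)));
disp(abs((Jp - Jm) / (2*eps_fd) - grad_loop(2)))   % ~0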
We now have $\frac{\partial J}{\partial \theta_j}$. To implement this in MATLAB or Python, we can use a little linear algebra to vectorize it:
$$\left[\begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n} \end{matrix}\right]=\frac{1}{m}\left[\begin{matrix} \sum_{i=1}^{m}\Big(x_0^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \sum_{i=1}^{m}\Big(x_1^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \sum_{i=1}^{m}\Big(x_2^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \vdots\\ \sum_{i=1}^{m}\Big(x_n^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)\Big) \end{matrix}\right]\tag{8}$$
In formula $(8)$, note that $\big(h_\theta(x^{(i)})-y^{(i)}\big)$ is a scalar (a single number), and $x_j^{(i)}$, the $j^{th}$ feature of the $i^{th}$ example, is also a scalar, so we can conclude that:
$$\left[\begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n} \end{matrix}\right]=\frac{1}{m}\sum^m_{i=1}\Big(\big(h_\theta(x^{(i)})-y^{(i)}\big)x^{(i)}\Big)\tag{9}$$
in which
$$x^{(i)}=\left[\begin{matrix} x^{(i)}_0\\ x^{(i)}_1\\ x^{(i)}_2\\ \vdots\\ x^{(i)}_n \end{matrix}\right]$$
Finally, stacking the examples row by row into a matrix, we have:
$$\left[\begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n} \end{matrix}\right]=\frac{1}{m}X^{T}\big(h_\theta(x)-y\big)\tag{10}$$
where
$$X=\left[\begin{matrix} x^{(1)}_0 & x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \cdots & x^{(1)}_n\\ x^{(2)}_0 & x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \cdots & x^{(2)}_n\\ x^{(3)}_0 & x^{(3)}_1 & x^{(3)}_2 & x^{(3)}_3 & \cdots & x^{(3)}_n\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\ x^{(m)}_0 & x^{(m)}_1 & x^{(m)}_2 & x^{(m)}_3 & \cdots & x^{(m)}_n \end{matrix}\right],\qquad h_\theta(x)-y=\left[\begin{matrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\ h_\theta(x^{(3)})-y^{(3)}\\ \vdots\\ h_\theta(x^{(m)})-y^{(m)} \end{matrix}\right]$$
We can easily implement the gradient of logistic regression using formula $(10)$.
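For example, with the same made-up data as in the checks above, the per-example sum of formula $(9)$ and the fully vectorized formula $(10)$ both reproduce the loop-based gradient:

% Gradient per formula (9): a sum of scalar-times-vector terms over examples.
grad_sum = zeros(size(theta));
for i = 1:m
    grad_sum = grad_sum + (h(i) - y(i)) * X(i, :)';
end
grad_sum = grad_sum / m;

% Gradient per formula (10): one vectorized line.
grad_vec = 1/m * (X' * (h - y));

disp(max(abs([grad_sum - grad_loop; grad_vec - grad_loop])))   % ~0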

An important detail

Note that $x^{(i)}_0$ denotes the bias feature in every example, which is always equal to $1$. This detail is not very important until we add regularization to avoid overfitting, in which case the bias parameter $\theta_0$ is normally excluded from the penalty.

Examples in MATLAB

In this part, I use MATLAB to implement the logistic regression cost function. In order to avoid overfitting, I add L2 regularization on the parameters.
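For reference, the L2-regularized cost and gradient that the code below implements are the standard ones (the bias parameter $\theta_0$ is not penalized); they are not derived above, but they follow from formula $(7)$ plus the derivative of the penalty term:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_\theta(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}$$

$$\frac{\partial J}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)+\frac{\lambda}{m}\theta_j\ \ (j\ge 1),\qquad \frac{\partial J}{\partial \theta_0}=\frac{1}{m}\sum_{i=1}^{m}x_0^{(i)}\big(h_\theta(x^{(i)})-y^{(i)}\big)$$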

function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with 
%regularization
%   J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters. 

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;
grad = zeros(size(theta));

% =============================================================
h_x = sigmoid(X*theta);

grad = 1/m .*(X' * (h_x - y));
temp = theta; 
temp(1) = 0;   % because we don't add anything for j = 0  
J = 1/m*(sum(-y.*log(h_x) - (1-y) .* log(1-h_x))) + lambda/(2*m) .* sum(temp.^2);
grad = grad + lambda ./ m .* temp;

% =============================================================

grad = grad(:);

end
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   J = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));
end
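
To use the function, hand it to any gradient-based optimizer. Below is a minimal usage sketch with made-up toy data, using fminunc from the Optimization Toolbox (the Coursera assignments use fmincg or fminunc in the same way):

% Toy usage sketch with made-up data.
X = [ones(5, 1), [0.5; 1.2; -0.3; 2.1; -1.5]];  % first column is the bias x_0 = 1
y = [1; 1; 0; 1; 0];
lambda = 0.1;
initial_theta = zeros(size(X, 2), 1);

% Cost and gradient at the starting point
[J, grad] = lrCostFunction(initial_theta, X, y, lambda);

% Minimize J(theta), supplying the analytic gradient from formula (10)
options = optimset('GradObj', 'on', 'MaxIter', 400);
theta = fminunc(@(t) lrCostFunction(t, X, y, lambda), initial_theta, options);

% Predicted probabilities (and 0/1 predictions) on the training examples
p = sigmoid(X * theta);
predictions = p >= 0.5;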

References

Andrew Ng, Machine Learning Course in Coursera: https://www.coursera.org/learn/machine-learning
Neural Networks and Deep Learning: https://nndl.github.io/
