Recently I noticed some gaps in my understanding of Logistic Regression, so I reviewed it and worked through the mathematical derivations by hand.
Generally Speaking
In general, suppose we have $m$ examples, each with $n$ features. The cost function of Logistic Regression is
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_{\theta}(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]$$
in which $x^{(i)}$ represents the features of the $i^{th}$ example, $y^{(i)}$ denotes the label of the $i^{th}$ example,

$$\log\big(h_\theta(x^{(i)})\big) =\log\Big(\frac{1}{1+e^{-\theta^{T}x^{(i)}}}\Big) =-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\tag{1}$$
and

$$\log\big(1-h_\theta(x^{(i)})\big) =\log\Big(\frac{e^{-\theta^{T}x^{(i)}}}{1+e^{-\theta^{T}x^{(i)}}}\Big) =-\theta^Tx^{(i)}-\log\big(1+e^{-\theta^{T}x^{(i)}}\big)\tag{2}$$
In formulas $(1)$ and $(2)$ we use the sigmoid function as the activation, so we can substitute the expressions from $(1)$ and $(2)$ for $\log\big(h_\theta(x^{(i)})\big)$ and $\log\big(1-h_\theta(x^{(i)})\big)$ in the cost function:
$$\begin{aligned} J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log(1+e^{-\theta^{T}x^{(i)}})+(1-y^{(i)})\Big(-\theta^Tx^{(i)}-\log(1+e^{-\theta^Tx^{(i)}})\Big)\Big] \end{aligned}\tag{3}$$
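As a quick sanity check before simplifying further, the cost in its original cross-entropy form can be evaluated directly. The following is only a minimal sketch: it assumes X is the $m\times(n+1)$ design matrix whose rows are the examples (bias column included), y an $m\times 1$ vector of 0/1 labels, and theta an $(n+1)\times 1$ parameter vector, the same shapes used in the MATLAB code at the end of this post.

m = length(y);                          % number of training examples
h = 1 ./ (1 + exp(-X * theta));         % h_theta(x^(i)) for every example, an m-by-1 vector
J = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h));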
Mathematical Derivation of $J(\theta)$
We can expand and simplify formula $(3)$ as follows:
$$\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log(1+e^{-\theta^{T}x^{(i)}})-\theta^Tx^{(i)}+y^{(i)}\theta^Tx^{(i)} -\log(1+e^{-\theta^Tx^{(i)}}) + y^{(i)}\log(1+e^{-\theta^Tx^{(i)}})\Big] \\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ -\theta^Tx^{(i)} + y^{(i)}\theta^Tx^{(i)} - \log(1 + e^{-\theta^Tx^{(i)}})\Big] \end{aligned}\tag{4}$$
Noting that $\theta^Tx^{(i)}=\log\big(e^{\theta^Tx^{(i)}}\big)$, we can continue from formula $(4)$:
$$\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ -\log\big(e^{\theta^Tx^{(i)}}\big) + y^{(i)}\theta^Tx^{(i)} - \log(1 + e^{-\theta^Tx^{(i)}})\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\theta^Tx^{(i)} - \log\Big(\big(1 + e^{-\theta^Tx^{(i)}}\big)e^{\theta^Tx^{(i)}}\Big)\Big] \\ &=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\theta^Tx^{(i)} - \log\big(1 + e^{\theta^Tx^{(i)}}\big)\Big] \end{aligned}$$
So the Cost Function of Logistic Regression can be written as:
$$\begin{aligned} J(\theta) &=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\theta^Tx^{(i)} - \log\big(1 + e^{\theta^Tx^{(i)}}\big)\Big] \end{aligned}\tag{5}$$
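Formula $(5)$ translates into MATLAB almost verbatim. This is a sketch under the same assumptions about X, y, and theta as before (and with no safeguard against exp overflow for large $\theta^Tx$):

z = X * theta;                          % z(i) = theta' * x^(i), an m-by-1 vector
m = length(y);
J = -(1/m) * sum(y .* z - log(1 + exp(z)));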
Mathematical Derivation of $\frac{\partial J}{\partial \theta_j}$
From the above, we can compute $\frac{\partial J}{\partial \theta_j}$ more easily using formula $(5)$:
$$\begin{aligned} \frac{\partial J}{\partial \theta_j}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}x_j^{(i)} - \frac{x^{(i)}_j e^{\theta^Tx^{(i)}}}{1 + e^{\theta^Tx^{(i)}}}\Big] \\&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}x_j^{(i)} - \frac{x^{(i)}_j}{1 + e^{-\theta^Tx^{(i)}}}\Big] \end{aligned}\tag{6}$$
Notice that:
$$h_\theta(x^{(i)}) =\frac{1}{1+e^{-\theta^{T}x^{(i)}}}$$
so formula $(6)$ can be written as:
$$\begin{aligned} \frac{\partial J}{\partial \theta_j}&=-\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}x_j^{(i)} - h_\theta(x^{(i)})\,x^{(i)}_j\Big] \\ &=\frac{1}{m}\sum_{i=1}^{m}\Big[ x_j^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big] \end{aligned}\tag{7}$$
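Formula $(7)$ already gives a recipe for each gradient component. A straightforward, not-yet-vectorized sketch (with the same assumed X, y, and theta as above) would simply loop over $j$:

m = length(y);
h = 1 ./ (1 + exp(-X * theta));         % h_theta(x^(i)) for every example
grad = zeros(size(theta));
for j = 1:length(theta)
    grad(j) = (1/m) * sum(X(:, j) .* (h - y));   % formula (7) for theta_j
end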
We now have $\frac{\partial J}{\partial \theta_j}$. If we want to implement this in MATLAB or Python, we can use some linear algebra to take the derivation a step further:
$$\begin{aligned} \left[ \begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n}\\ \end{matrix} \right] &=\frac{1}{m}\left[ \begin{matrix} \sum_{i=1}^{m}\Big( x_0^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \sum_{i=1}^{m}\Big( x_1^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \sum_{i=1}^{m}\Big( x_2^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \vdots\\ \sum_{i=1}^{m}\Big( x_n^{(i)}\big( h_\theta(x^{(i)})-y^{(i)}\big)\Big)\\ \end{matrix} \right] \end{aligned}\tag{8}$$
In formula $(8)$, note that $\big( h_\theta(x^{(i)})-y^{(i)}\big)$ is a scalar (a single number), and $x_j^{(i)}$, the $j^{th}$ feature of the $i^{th}$ example, is also a scalar, so we can conclude that:
$$\begin{aligned} \left[ \begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n}\\ \end{matrix} \right] &=\frac{1}{m}\sum^m_{i=1}\Big(\big(h_\theta(x^{(i)})-y^{(i)}\big)x^{(i)}\Big) \end{aligned}\tag{9}$$
in which

$$x^{(i)}= \left[ \begin{matrix} x^{(i)}_0\\ x^{(i)}_1\\ x^{(i)}_2\\ \vdots\\ x^{(i)}_n\\ \end{matrix} \right]$$
Finally, using the above formulas, we have:
$$\begin{aligned} \left[ \begin{matrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ \frac{\partial J}{\partial \theta_2}\\ \vdots\\ \frac{\partial J}{\partial \theta_n}\\ \end{matrix} \right] =\frac{1}{m}X^T\big(h_\theta(x)-y\big) \end{aligned}\tag{10}$$
where

$$\begin{aligned} &X= \left[ \begin{matrix} x^{(1)}_0 & x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \cdots & x^{(1)}_n\\ x^{(2)}_0 & x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \cdots & x^{(2)}_n\\ x^{(3)}_0 & x^{(3)}_1 & x^{(3)}_2 & x^{(3)}_3 & \cdots & x^{(3)}_n\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ x^{(m)}_0 & x^{(m)}_1 & x^{(m)}_2 & x^{(m)}_3 & \cdots & x^{(m)}_n\\ \end{matrix} \right] \\[2ex] &h_\theta(x)-y = \left[ \begin{matrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\ h_\theta(x^{(3)})-y^{(3)}\\ \vdots\\ h_\theta(x^{(m)})-y^{(m)}\\ \end{matrix} \right] \end{aligned}$$
Formula $(10)$ makes it easy to program the Logistic Regression derivatives, as in the sketch below.
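As a sketch, formula $(10)$ reduces the whole gradient to a single line (using a sigmoid helper like the one defined in the MATLAB section below, and assuming X, y, theta, and m as before):

grad = (1/m) * (X' * (sigmoid(X * theta) - y));   % formula (10), an (n+1)-by-1 vector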
An important detail
Note that $x^{(i)}_0$ denotes the bias feature in every example, which is always equal to $1$. This detail hardly matters until we add regularization to avoid overfitting, because the bias parameter $\theta_0$ should not be regularized. A sketch of building the design matrix with the bias column follows.
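In code this usually just means prepending a column of ones to the feature matrix. A minimal sketch, where X_raw is a hypothetical m-by-n matrix holding the raw features without the bias column:

X = [ones(size(X_raw, 1), 1), X_raw];   % x_0^(i) = 1 for every example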
Examples in MATLAB
In this part, I use MATLAB to program the Logistic Regression cost function. To avoid overfitting, I add L2 regularization to improve the quality of the parameters.
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with
%regularization
% J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% =============================================================
h_x = sigmoid(X*theta);
grad = 1/m .*(X' * (h_x - y));
temp = theta;
temp(1) = 0; % because we don't add anything for j = 0
J = 1/m*(sum(-y.*log(h_x) - (1-y) .* log(1-h_x))) + lambda/(2*m) .* sum(temp.^2);
grad = grad + lambda ./ m .* temp;
% =============================================================
grad = grad(:);
end
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% J = SIGMOID(z) computes the sigmoid of z.
g = 1.0 ./ (1.0 + exp(-z));
end
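One possible way to use lrCostFunction is to hand it to an optimizer such as fminunc (from the Optimization Toolbox). This is only a usage sketch: it assumes X (with the bias column) and y are already loaded, and the lambda value and iteration count are arbitrary choices.

initial_theta = zeros(size(X, 2), 1);            % start from all-zero parameters
lambda = 1;                                      % arbitrary regularization strength
options = optimset('GradObj', 'on', 'MaxIter', 400);
costFunc = @(t) lrCostFunction(t, X, y, lambda); % returns both J and grad
[theta, cost] = fminunc(costFunc, initial_theta, options);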
References
Andrew Ng, Machine Learning Course in Coursera: https://www.coursera.org/learn/machine-learning
Neural Networks and Deep Learning: https://nndl.github.io/