1. The basic theory of the logistic regression
Hypothesis:
$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$
Cost Function:
$$J(\theta) = -\frac1m\sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right]$$
2. What is the logistic model?
In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one. [Source: Logistic regression]
So, logistic regression models the relationship between the predictors (our independent variables) and a binary dependent variable (the label, e.g. 0 or 1) by predicting the probability of the label.
Hypothesis of the logistic regression
We define the logistic regression model under two assumptions:
(1) the label $y$ obeys the Bernoulli distribution;
(2) $p(y=1|x,\theta)=\frac{1}{1+e^{-\theta^T x}}=p$, and therefore $p(y=0|x,\theta)=1-p$.
The hypothesis of logistic regression can therefore be defined as:
$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$
The curve of the sigmoid function is S-shaped: it maps any real-valued $\theta^T x$ to a value between 0 and 1.
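The sigmoid curve can be reproduced with a few lines of Matlab. This is only a minimal sketch: the helper name `sigmoid` and the plotting range of [-10, 10] are choices made here for illustration.

```matlab
% Sigmoid function g(z) = 1 / (1 + e^(-z)); the name "sigmoid" is our own choice.
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = linspace(-10, 10, 201);          % plotting range chosen for illustration
g = sigmoid(z);

figure
plot(z, g); hold on
plot([-10, 10], [0.5, 0.5], '--');   % reference line: g(0) = 0.5
xlabel('z = \theta^T x'); ylabel('g(z)');
title('Sigmoid function');
```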
Here, for the given samples, we can define the probability distribution as
$$p(y^{(i)}|x^{(i)},\theta)=h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})}$$
With this equation, once we have estimated the parameters $\theta$, we can predict the probability of a class for a given sample. Naturally, we need some training samples to estimate the parameters.
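To see why a single expression covers both classes, we can substitute the two possible labels. This is just a worked check using the definitions above:

```latex
\begin{aligned}
y^{(i)}=1:&\quad p(y^{(i)}|x^{(i)},\theta)=h_\theta(x^{(i)})^{1}\,(1-h_\theta(x^{(i)}))^{0}=h_\theta(x^{(i)}) \\
y^{(i)}=0:&\quad p(y^{(i)}|x^{(i)},\theta)=h_\theta(x^{(i)})^{0}\,(1-h_\theta(x^{(i)}))^{1}=1-h_\theta(x^{(i)})
\end{aligned}
```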
3. Maximum likelihood estimate (MLE)
Now, I am curious about how to solve for $\theta$.
Maximum likelihood estimation (MLE) can help us here. In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function (using the given samples), so that under the assumed statistical model the observed data is most probable. [Source: Maximum likelihood estimation]
Here, we define the likelihood function $L(\theta)$, which is formed from the joint probability distribution of the samples (the events are assumed to be independent).
$$\begin{aligned}
L(\theta) &=\prod\limits_{i=1}^m p(y^{(i)}|x^{(i)},\theta) \\
&= \prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})}
\end{aligned}$$
Our objective is to find the $\theta$ that maximizes the likelihood function $L(\theta)$. The maximum is attained where $\frac{\partial}{\partial \theta_j}L(\theta)=0$ for every $j$. However, the partial derivatives of a product are difficult to calculate directly, so we take the log of the likelihood, which turns the product into a sum. Because the log is monotonically increasing, the $\theta$ that maximizes the log-likelihood also maximizes $L(\theta)$.
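Besides simplifying the derivatives, working with the log also tends to behave better numerically, because a product of many probabilities can underflow. A small sketch with made-up numbers (2000 samples that each receive probability 0.5):

```matlab
% Hypothetical example: 2000 samples, each assigned probability 0.5 by the model.
p = 0.5 * ones(2000, 1);

L = prod(p);        % likelihood: 0.5^2000 underflows to 0 in double precision
l = sum(log(p));    % log-likelihood: 2000 * log(0.5), a finite number

fprintf('L = %g, l = %g\n', L, l);
```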
$$\begin{aligned}
\log L(\theta) = l(\theta)&=\log \prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})} \\
&= \sum\limits_{i=1}^m \left[ \log h_\theta(x^{(i)})^{y^{(i)}}+\log (1-h_\theta(x^{(i)}))^{(1-y^{(i)})} \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log \left( \frac{1}{1+e^{-\theta^T x^{(i)}}} \right) + (1-y^{(i)})\log \left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right) \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log \left( \frac{1}{1+e^{-\theta^T x^{(i)}}} \right) +(1-y^{(i)})\log \left(1-\frac{e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}}\right) \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log \left( \frac{1}{1+e^{-\theta^T x^{(i)}}} \right) +(1-y^{(i)})\log \left(\frac{1}{1+e^{\theta^T x^{(i)}}}\right) \right] \\
&= \sum\limits_{i=1}^m \left[ -y^{(i)} \log \left(1+e^{-\theta^T x^{(i)}} \right)-(1-y^{(i)})\log \left(1+e^{\theta^T x^{(i)}}\right) \right]
\end{aligned}$$
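As a sanity check on the algebra, the form of $l(\theta)$ written in terms of $h_\theta$ and the final form can be compared numerically. The data and the value of `theta` below are made up purely for illustration:

```matlab
% Toy data (made up for illustration): 4 samples, intercept column plus one feature.
X = [1 -2; 1 -1; 1 1; 1 2];
y = [0; 0; 1; 1];
theta = [0.3; 0.8];       % arbitrary parameter values

z = X * theta;
h = 1 ./ (1 + exp(-z));   % h_theta(x)

% l(theta) written in terms of h_theta (third line of the derivation)
l_first = sum(y .* log(h) + (1 - y) .* log(1 - h));
% l(theta) in its final form (last line of the derivation)
l_last  = sum(-y .* log(1 + exp(-z)) - (1 - y) .* log(1 + exp(z)));

fprintf('l (in terms of h) = %.6f, l (final form) = %.6f\n', l_first, l_last);
```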
Now, we construct the cost function of logistic regression as:
$$J(\theta) = -\frac{1}{m}l(\theta)$$
So, minimizing $J(\theta)$ gives us the $\theta$ we are looking for. How do we do this? The gradient descent method may work well. First, we take the partial derivative of $J(\theta)$ with respect to $\theta_j$.
$$\begin{aligned}
\frac{\partial }{\partial \theta_j} J(\theta) &= -\frac{1}{m}\frac{\partial }{\partial \theta_j} \left[\sum\limits_{i=1}^m -y^{(i)} \log \left(1+e^{-\theta^T x^{(i)}} \right)-(1-y^{(i)})\log \left(1+e^{\theta^T x^{(i)}}\right) \right]\\
&=-\frac{1}{m} \left[ \sum\limits_{i=1}^m -y^{(i)} \frac{e^{-\theta^T x^{(i)}}(-x_j^{(i)})}{1+e^{-\theta^T x^{(i)}}} -(1-y^{(i)})\frac{e^{\theta^T x^{(i)}}x_j^{(i)}}{1+e^{\theta^T x^{(i)}}}\right]\\
&= -\frac{1}{m}\left[ \sum\limits_{i=1}^m -y^{(i)} \frac{e^{-\theta^T x^{(i)}}(-x_j^{(i)})}{1+e^{-\theta^T x^{(i)}}}-(1-y^{(i)})\frac{x_j^{(i)}}{1+e^{-\theta^T x^{(i)}}}\right]\\
&=-\frac{1}{m} \sum\limits_{i=1}^m \frac{y^{(i)}e^{-\theta^T x^{(i)}}-1+y^{(i)}}{1+e^{-\theta^T x^{(i)}}}x_j^{(i)} \\
&= -\frac{1}{m} \sum\limits_{i=1}^m \frac{y^{(i)}(1+e^{-\theta^T x^{(i)}})-1}{1+e^{-\theta^T x^{(i)}}}x_j^{(i)} \\
&= -\frac{1}{m} \sum\limits_{i=1}^m \left( y^{(i)}-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)x_j^{(i)}
\end{aligned}$$
We have defined the hypothesis of logistic regression as $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$, so
$$\begin{aligned}
\frac{\partial }{\partial \theta_j} J(\theta) &=-\frac{1}{m} \sum\limits_{i=1}^m \left(y^{(i)}-h_\theta(x^{(i)})\right) x_j^{(i)} \\
&= \frac{1}{m} \sum\limits_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right) x_j^{(i)}
\end{aligned}$$
Hint: this has the same form as the partial derivative in linear regression, but the two are based on different hypotheses $h_\theta(x)$. In linear regression, we defined $h_\theta(x) = \theta^T x$.
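A common way to gain confidence in a derived gradient is to compare it with a finite-difference approximation of the cost. The sketch below does this on made-up data; the variable names, the step size `step` and the test point `theta` are assumptions chosen only for illustration.

```matlab
% Toy data (made up for illustration): intercept column plus one feature.
X = [1 -2; 1 -1; 1 1; 1 2; 1 0.5];
y = [0; 0; 1; 1; 0];
m = size(X, 1);
theta = [0.1; -0.4];   % arbitrary point at which to compare the gradients

sigmoid = @(z) 1 ./ (1 + exp(-z));
cost = @(t) (-1/m) * sum(y .* log(sigmoid(X*t)) + (1 - y) .* log(1 - sigmoid(X*t)));

% Analytic gradient from the derivation: (1/m) * X' * (h - y)
grad_analytic = (1/m) * (X' * (sigmoid(X*theta) - y));

% Numerical gradient by central differences
step = 1e-6;
grad_numeric = zeros(size(theta));
for j = 1:numel(theta)
    delta = zeros(size(theta));
    delta(j) = step;
    grad_numeric(j) = (cost(theta + delta) - cost(theta - delta)) / (2 * step);
end

disp([grad_analytic, grad_numeric]);   % the two columns should agree closely
```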
Let us update all the $\theta_j$ simultaneously:
$$\theta_j := \theta_j - \alpha \frac{\partial }{\partial \theta_j} J(\theta)$$
We can vectorize the above equation as
$$\mathop{\theta}\limits_{n\times 1} := \mathop{\theta}\limits_{n\times 1} - \frac{\alpha}{m}\mathop{X^T}\limits_{n\times m} \underbrace{(h_\theta(X)-y)}_{m\times 1}$$
where, in this equation, $n$ counts all the parameters in $\theta$ and $m$ is the number of samples.
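The element-wise rule and the vectorized rule should produce the same step. A small sketch on made-up data (all values below are illustrative assumptions):

```matlab
% Toy data (made up for illustration): m = 4 samples, 3 parameters.
X = [1 0.5 1.2; 1 -1.0 0.3; 1 2.0 -0.7; 1 0.1 0.9];
y = [1; 0; 1; 0];
m = size(X, 1);
alpha = 0.1;
theta = [0.2; -0.1; 0.05];

h = 1 ./ (1 + exp(-(X * theta)));   % h_theta(x) for all samples, computed once

% Element-wise update: every theta_j uses the same old theta (simultaneous update)
theta_loop = zeros(size(theta));
for j = 1:numel(theta)
    theta_loop(j) = theta(j) - (alpha/m) * sum((h - y) .* X(:, j));
end

% Vectorized update
theta_vec = theta - (alpha/m) * (X' * (h - y));

disp(max(abs(theta_loop - theta_vec)));   % difference is at floating-point rounding level
```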
Hint: there is a new concept, the "decision boundary". If we predict $y=1$ whenever $h_\theta(x) \ge 0.5$, it can be represented as $\theta^T x=0$, because $h_\theta(x)=0.5$ exactly when $\theta^T x=0$. The decision boundary is a property of the hypothesis $h_\theta$ and is defined by $\theta$.
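The link between the 0.5 threshold and the sign of $\theta^T x$ can be illustrated as follows. The parameters and samples are hypothetical, picked so that the boundary sits at $x_1 = 0.5$:

```matlab
% Hypothetical parameters and samples, chosen only for illustration.
theta = [-1; 2];                 % decision boundary: -1 + 2*x1 = 0, i.e. x1 = 0.5
X = [1 0; 1 0.5; 1 1; 1 3];      % intercept column plus one feature x1

z = X * theta;                   % theta^T x for each sample
h = 1 ./ (1 + exp(-z));          % predicted probability of y = 1
pred = h >= 0.5;                 % thresholding h at 0.5 ...

disp([z, h, pred]);              % ... gives the same labels as checking z >= 0
```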
4. Matlab code for logistic regression with gradient descent
%% ================== Data Generation ===================
clc;clear
x = [55.5,69.5;41,81.5;53.5,86;46,84;41,73.5;51.5,69;51,62.5;42,75;53.5,83;57.5,71;42.5,72.5;41,80;46,82;46,60.5;49.5,76;41,76;48.5,72.5;51.5,82.5;44.5,70.5;44,66;33,76.5;33.5,78.5;31.5,72;33,81.5;42,59.5;30,64;61,45;49,79;26.5,64.5;34,71.5;42,83.5;29.5,74.5;39.5,70;51.5,66;41.5,71.5;42.5,79.5;35,59.5;38.5,73.5;32,81.5;46,60.5;36.5,53;36.5,53.5;24,60.5;19,57.5;34.5,60;37.5,64.5;35.5,51;37,50.5;21.5,42;35.5,58.5;26.5,68.5;26.5,55.5;18.5,67;40,67;32.5,71.5;39,71.5;43,55.5;22,54;36,62.5;31,55.5;38.5,76;40,75;37.5,63;24.5,58;30,67;33,56;56.5,61;41,57;49.5,63;34.5,72.5;32.5,69;36,73;27,53.5;41,63.5;29.5,52.5;20,65.5;38,65;18.5,74.5;16,72.5;33.5,68];
y1 =ones(40,1); y2 = zeros(40,1);
y = [y1; y2];clear y1 y2;
figure
pos = find(y == 1); neg = find(y == 0);
plot(x(pos, 1), x(pos,2), '+'); hold on
plot(x(neg, 1), x(neg, 2), 'o')
m = size(x,1);
X = [ones(m,1), x]; %let the X = (X_0, X_1, X_2), here the X_0 = 1.
%% ================== Feature scaling ==================
for i = 2:size(X,2) % X_0 is all ones (the intercept), so we do not scale it.
    X(:,i) = (X(:,i) - mean(X(:,i))) ./ std(X(:,i));
end
%% ================== Gradient descent ==================
itera = 300000; % Number of iterations
theta = zeros(size(X,2),1); % Initialize theta
alpha = 0.0001; % Set the learning rate
J = zeros(itera,1); % Record the cost at every iteration
itera_theta = zeros(itera,size(X,2)); % Record all the theta values during the iterations.
for i = 1:itera
    hypothesis = 1 ./ (1+exp(-(X * theta))); % h_\theta(x)
    theta = theta - (alpha/m) * (X' * (hypothesis - y));
    % Cost function J = -(1/m)*l(theta); the minus sign cancels the minus signs inside l(theta)
    J(i,1) = (1/m) * sum(y .* log(1+exp(-(X * theta))) + (1-y) .* log(1+exp(X * theta)));
    itera_theta(i,:) = theta';
end
%% ================== Plotting the decision boundary ==================
figure
plot(X(pos, 2), X(pos,3), '+'); hold on
plot(X(neg, 2), X(neg, 3), 'o');
range_size = 4;
range = [-range_size,range_size,-range_size,range_size];
fh = @(x1,x2) theta(1) + theta(2)* x1 + theta(3) * x2; % theta^T x; the decision boundary is fh = 0
ezplot(fh,range); % ezplot draws the implicit curve fh(x1,x2) = 0
5. Advanced optimization
Compared with plain gradient descent, more advanced optimization algorithms can:
- run much more quickly
- scale much better to very large machine learning problems
Some optimization algorithms:
- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS
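function [jVal, gradient] = costfunction(X,y,theta)
jVal = [codehere]          % code to compute J(theta)
gradient(1) = [codehere]   % code to compute the partial derivative of J with respect to theta(1)
gradient(2) = [codehere]   % code to compute the partial derivative of J with respect to theta(2)
...
options = optimset('GradObj','on','MaxIter', 100); % 'GradObj','on': costfunction also returns the gradient
initialTheta = zeros(n+1,1);
[optTheta, functionVal, exitFlag] = fminunc(@ (t) (costfunction(X,y,t)), initialTheta, options);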
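One possible way to fill in the template for unregularized logistic regression is sketched below. It is written as a single function file; the file name `fit_logistic_fminunc_demo` and the toy data are assumptions made for illustration, and in Matlab `fminunc` requires the Optimization Toolbox.

```matlab
function optTheta = fit_logistic_fminunc_demo()
% Save as fit_logistic_fminunc_demo.m (hypothetical file name).
    % Toy data (made up, deliberately not linearly separable).
    x = [-2; -1; 0; 1; 2; 3];
    y = [0; 0; 1; 0; 1; 1];
    X = [ones(numel(x), 1), x];          % add the intercept column

    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(size(X, 2), 1);
    [optTheta, functionVal, exitFlag] = ...
        fminunc(@(t) costfunction(X, y, t), initialTheta, options);
    fprintf('cost = %.4f, exit flag = %d\n', functionVal, exitFlag);
end

function [jVal, gradient] = costfunction(X, y, theta)
% Cost J(theta) and its gradient, as derived in section 3.
    m = size(X, 1);
    h = 1 ./ (1 + exp(-(X * theta)));
    jVal = (-1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h));
    gradient = (1/m) * (X' * (h - y));
end
```

Calling `optTheta = fit_logistic_fminunc_demo();` returns the fitted parameters. No learning rate has to be chosen by hand here, which is one practical advantage of these optimizers.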
6. Multi-class classification
We only need to train one binary classifier per class (one-vs-all) and, for a new sample, choose the label whose classifier gives the maximum probability.
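The idea can be sketched with plain gradient descent on a tiny made-up data set with three classes. The data, the added $x^2$ feature, the learning rate and the iteration count are all assumptions chosen for illustration, not tuned values.

```matlab
% Toy data (made up for illustration): 3 classes along one feature.
x = [-3; -2.5; -2; 0; 0.3; -0.2; 2.5; 3; 2];
y = [1; 1; 1; 2; 2; 2; 3; 3; 3];      % class labels 1..K
K = 3;
m = numel(y);
X = [ones(m, 1), x, x.^2];            % intercept, x and x^2 (so the middle class can be separated)

sigmoid = @(z) 1 ./ (1 + exp(-z));
alpha = 0.1; iters = 5000;            % hypothetical settings
Theta = zeros(size(X, 2), K);         % column k holds the parameters of classifier k

for k = 1:K
    yk = double(y == k);              % relabel: class k versus all the others
    theta = zeros(size(X, 2), 1);
    for it = 1:iters
        theta = theta - (alpha/m) * (X' * (sigmoid(X*theta) - yk));
    end
    Theta(:, k) = theta;
end

P = sigmoid(X * Theta);               % m x K matrix: P(i,k) is classifier k's probability for sample i
[~, pred] = max(P, [], 2);            % choose the class with the maximum probability
disp([y, pred]);                      % true labels next to predicted labels
```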
7. The problem of overfitting
- overfitting - high variance
- underfitting - high bias
An overfitted model fails to generalize to new samples.
How to avoid overfitting?
(1) Reduce the number of features (manually, or with a model selection algorithm).
(2) Regularization: this method keeps all the features but reduces the magnitude of the parameters $\theta_j$.
8. Regularizing the cost function
$$J(\theta) = \frac{1}{2m} \left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2 \right)$$
As seen from the above formula, we define $\sum\limits_{j=1}^n\theta_j^2$ as the regularization term and $\lambda$ as the regularization parameter, which controls the trade-off between fitting the training data and keeping the parameters small.
We want to obtain small values for the parameters $\theta_0, \theta_1,\dots,\theta_n$, which simplifies the hypothesis and makes it less prone to overfitting.
- Note: we do not penalize $\theta_0$ in the regularization.
8.1 Regularized linear regression
Cost function:
$$J(\theta)=\frac{1}{2m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2\right)$$
Its partial derivative with respect to $\theta_j$ ($j=1,2,\dots,n$) is
$$\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta)&= \frac{1}{m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\lambda\theta_j\right) \\
&= \frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+ \frac{\lambda}{m}\theta_j
\end{aligned}$$
For gradient descent,
$$\begin{aligned}
\theta_0 &:= \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\
\theta_j &:= \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}-\alpha\frac{\lambda}{m}\theta_j \\
&= \theta_j\left(1-\frac{\alpha\lambda}{m}\right)-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}, \quad (j=1,2,\dots,n)
\end{aligned}$$
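* Note that $1-\frac{\alpha\lambda}{m} \lt 1$ (for $\alpha, \lambda > 0$), so every iteration shrinks $\theta_j$ a little before applying the usual gradient step.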
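The two ways of writing the $\theta_j$ update are algebraically identical, which can be checked on one gradient step. The data, `alpha` and `lambda` below are made-up values for illustration only:

```matlab
% Toy linear-regression data (made up for illustration).
X = [1 1; 1 2; 1 3; 1 4];       % intercept column plus one feature
y = [1.1; 1.9; 3.2; 3.9];
m = size(X, 1);
alpha = 0.05; lambda = 1;       % hypothetical settings
theta = [0.5; 0.5];

err = X * theta - y;            % h_theta(x) - y for linear regression
grad = (1/m) * (X' * err);      % unregularized gradient

% Form 1: subtract the penalty gradient explicitly (theta_0 is not penalized)
t1 = theta - alpha * grad;
t1(2:end) = t1(2:end) - alpha * (lambda/m) * theta(2:end);

% Form 2: shrink theta_j by (1 - alpha*lambda/m), then take the usual step
shrink = [1; (1 - alpha*lambda/m) * ones(numel(theta) - 1, 1)];
t2 = theta .* shrink - alpha * grad;

disp(max(abs(t1 - t2)));        % difference is at floating-point rounding level
```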
For the normal equation, we can rewrite the cost function
$$J(\theta)=\frac{1}{2m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2\right)$$
in matrix form. Writing the penalty as $\lambda\theta^T\theta$ for the moment (the exclusion of $\theta_0$ is restored in the final formula), we get
$$\begin{aligned}
J(\theta) &=\frac{1}{2m}\left[\underbrace{(X\theta-y)^T}_{1\times m}\underbrace{(X\theta-y)}_{m\times 1}+\lambda\underbrace{\theta^T}_{1\times (n+1)}\underbrace{\theta}_{(n+1)\times 1}\right]\\
&=\frac{1}{2m}\left(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty+\lambda\theta^T\theta\right)
\end{aligned}$$
Setting the derivative with respect to $\theta$ to zero:
$$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2m}\left(2X^TX\theta-X^Ty-(y^TX)^T+0+2\lambda\theta\right)\\
&= \frac{1}{2m}\left(2X^TX\theta-X^Ty-X^Ty+0+2\lambda\theta\right)\\
&= \frac{1}{m}\left(X^TX\theta-X^Ty+\lambda\theta\right)=0
\end{aligned}$$
$$(X^TX+\lambda I)\theta=X^Ty$$
Since $\theta_0$ is not penalized, the first diagonal entry of the identity matrix is replaced by 0, which gives
$$\theta=\left(X^TX+\lambda \begin{bmatrix} 0&&&\\&1&&\\ &&\ddots&\\&&&1 \end{bmatrix}\right)^{-1}X^Ty$$
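A minimal sketch of the regularized normal equation on made-up data follows. The matrix name `Lmat` and the value of `lambda` are assumptions for illustration, and the backslash operator is used here instead of an explicit inverse.

```matlab
% Toy linear-regression data (made up for illustration).
X = [1 1; 1 2; 1 3; 1 4];       % intercept column plus one feature
y = [1.1; 1.9; 3.2; 3.9];
lambda = 1;                     % hypothetical regularization parameter

Lmat = eye(size(X, 2));
Lmat(1, 1) = 0;                 % do not penalize theta_0

% Solve (X'X + lambda*Lmat) * theta = X'y with backslash instead of forming an inverse
theta = (X' * X + lambda * Lmat) \ (X' * y);

% At the solution, X'X*theta - X'y + lambda*Lmat*theta should be numerically zero
residual = X' * X * theta - X' * y + lambda * Lmat * theta;
disp(theta');
disp(norm(residual));
```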
8.2 Regularized logistic regression
Cost function:
$$J(\theta) = -\frac1m \sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right]+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$$
where $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$.
For gradient descent,
$$\begin{aligned}
\theta_0 &:= \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\
\theta_j &:= \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}-\alpha\frac{\lambda}{m}\theta_j, \quad (j=1,2,\dots,n)
\end{aligned}$$
8.3 Matlab code for computing the cost and gradient of logistic regression with regularization
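sigmoid = @(z) 1 ./ (1 + exp(-z)); % helper assumed by the original code; defined here so the snippet runs
h = sigmoid(X*theta); % predicted probabilities, m x 1
Jtheta = ((-1) / m) * sum(y .* log(h) + (1-y) .* log(1 - h)) + (lambda/(2*m)) * (theta(2:end)' * theta(2:end)); % theta(1), i.e. theta_0, is not penalized
gradient = (1 / m) * (X' * (h - y)) + (lambda/m) .* theta;
gradient(1) = (1 / m) * (X(:,1)' * (h - y)); % overwrite: no regularization term for theta_0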
* Feature mapping function to polynomial features
degree = 6;
out = ones(size(X1(:,1))); % first column: the intercept term
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j); % monomial X1^(i-j) * X2^j of total degree i
    end
end
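To get a feel for how many features the mapping produces, it can be run on two small, hypothetical feature columns. For `degree = 6` there are $1 + (2 + 3 + \dots + 7) = 28$ columns.

```matlab
% Two hypothetical feature columns, chosen only for illustration.
X1 = [0.5; -1.0; 2.0];
X2 = [1.5; 0.3; -0.7];

degree = 6;
out = ones(size(X1(:,1)));      % intercept column
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)) .* (X2.^j);
    end
end

disp(size(out));                % 3 samples, 28 columns
```

With this many polynomial features the model can fit very flexible boundaries, which is why the mapping is usually combined with the regularization described above.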