ML Notes: Week 3 - Logistic regression

1. The basic theory of logistic regression

Hypothesis: $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$
Cost Function: $J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right]$

2. What is the logistic model?

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one. [Source: Logistic regression]

So logistic regression models the relationship between the predictors (our independent variables) and a binary dependent variable (the label, e.g. 0 or 1) by predicting the probability of the positive class.


Hypothesis of logistic regression

We define the logistic regression model under two assumptions:
(1) the label $y$ follows a Bernoulli distribution;
(2) $p(y=1|x,\theta)=\frac{1}{1+e^{-\theta^T x}}=p$, and therefore $p(y=0|x,\theta)=1-p$.
The hypothesis of logistic regression can thus be defined as: $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$

The curve of the sigmoid function:
[Figure: plot of the sigmoid function]
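The curve can be reproduced with a few lines of MATLAB. This is a minimal sketch: the name `sigmoid`, the plotting range and the axis labels are my own choices, not part of the original notes.

```matlab
% Sigmoid (logistic) function as an anonymous function; it works elementwise.
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = linspace(-8, 8, 200);   % assumed plotting range for z = theta' * x
g = sigmoid(z);

figure
plot(z, g); grid on
xlabel('z'); ylabel('g(z)');
title('Sigmoid function');

% A few reference values: g(0) is exactly 0.5, and g stays inside (0, 1)
disp(sigmoid([-2 0 2]));
```

The output is always strictly between 0 and 1, which is why it can be read as a probability.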


Here, for a given sample, we can write the probability distribution compactly as

$$p(y^{(i)}|x^{(i)},\theta)=h_\theta(x^{(i)})^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}}$$

With this equation, once we know the parameters $\theta$, we can predict the probability of a certain class for any given sample. Of course, we need some training samples to estimate the parameters.
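To see how the compact form selects the right probability, here is a small sketch. The parameter vector `theta` and the two samples in `X` are hypothetical values chosen only for illustration.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

theta = [0; 1; -1];      % hypothetical parameters (theta_0, theta_1, theta_2)
X = [1 2 1;              % two hypothetical samples, first column is x_0 = 1
     1 0 3];
y = [1; 0];              % their labels

h = sigmoid(X * theta);                    % P(y = 1 | x, theta) for each sample
p = (h .^ y) .* ((1 - h) .^ (1 - y));      % P(y = observed label | x, theta)

disp([h, p]);
```

For the sample with label 1 the expression reduces to $h_\theta(x)$, and for the sample with label 0 it reduces to $1-h_\theta(x)$.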


3. Maximum likelihood estimate (MLE)

Now, how do we solve for $\theta$?

Maybe Maximum likelihood estimate (MLE) could help us. In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function (using the given samples), so that under the assumed statistical model the observed data is most probable. [Source: Maximum likelihood estimation]

Here, we define the likelihood function $L(\theta)$, which is the joint probability of the samples (assuming the events are independent).
$$\begin{aligned}
L(\theta) &=\prod\limits_{i=1}^m p(y^{(i)}|x^{(i)},\theta) \\
&= \prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}}
\end{aligned}$$

Our objective is to find the $\theta$ that maximizes the likelihood function $L(\theta)$. At a maximum, $\frac{\partial}{\partial \theta_j}L(\theta)=0$ for every $j$. However, the partial derivatives of a product are difficult to calculate directly, so we take the log of the likelihood, which turns the product into a sum. Because the log is monotonically increasing, it does not change the maximizer.
$$\begin{aligned}
\log L(\theta) = l(\theta) &= \log \prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}} \\
&= \sum\limits_{i=1}^m \left[ \log h_\theta(x^{(i)})^{y^{(i)}}+\log \left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}} \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log \left( \frac{1}{1+e^{-\theta^T x^{(i)}}} \right) + (1-y^{(i)})\log \left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right) \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log \left( \frac{1}{1+e^{-\theta^T x^{(i)}}} \right) + (1-y^{(i)})\log \left(1-\frac{e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}}\right) \right] \\
&= \sum\limits_{i=1}^m \left[ y^{(i)} \log \left( \frac{1}{1+e^{-\theta^T x^{(i)}}} \right) + (1-y^{(i)})\log \left(\frac{1}{1+e^{\theta^T x^{(i)}}}\right) \right] \\
&= \sum\limits_{i=1}^m \left[ -y^{(i)} \log \left(1+e^{-\theta^T x^{(i)}} \right)-(1-y^{(i)})\log \left(1+e^{\theta^T x^{(i)}}\right) \right]
\end{aligned}$$
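The steps of this derivation can be checked numerically. The sketch below uses a tiny hypothetical data set (`X`, `y`) and an arbitrary `theta`; it only confirms that the log of the product, the sum of logs, and the final $\log(1+e^{\cdot})$ form agree.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 -2; 1 -1; 1 0; 1 1; 1 2];   % hypothetical samples, first column is x_0 = 1
y = [0; 0; 1; 1; 1];               % hypothetical labels
theta = [0.5; 2];                  % arbitrary parameters

z = X * theta;
h = sigmoid(z);

L  = prod((h .^ y) .* ((1 - h) .^ (1 - y)));                    % likelihood (product)
l1 = sum(y .* log(h) + (1 - y) .* log(1 - h));                  % log-likelihood (sum of logs)
l2 = sum(-y .* log(1 + exp(-z)) - (1 - y) .* log(1 + exp(z)));  % final form of the derivation

disp([log(L), l1, l2]);
```

With many samples the product itself can become too small to represent in floating point, which is another practical reason to work with the sum.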

Now, we construct the cost function of logistic regression as:
$$J(\theta) = -\frac{1}{m}l(\theta)$$
Minimizing $J(\theta)$ therefore yields the $\theta$ we want. How do we do this? Gradient descent may work well. First, we take the partial derivative of $J(\theta)$ with respect to $\theta_j$.
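As a quick sanity check of the cost function, the sketch below evaluates it on a small hypothetical data set. The handle name `costJ` and the data are assumptions made for illustration. At $\theta=0$ every prediction is 0.5, so the cost must equal $\log 2$.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 -2; 1 -1; 1 0; 1 1; 1 2];   % hypothetical samples, first column is x_0 = 1
y = [0; 0; 1; 0; 1];               % hypothetical labels
m = size(X, 1);

% J(theta) = -(1/m) * l(theta)
costJ = @(t) (-1/m) * sum(y .* log(sigmoid(X*t)) + (1 - y) .* log(1 - sigmoid(X*t)));

J0 = costJ([0; 0]);   % all h = 0.5, so J0 = log(2)
J1 = costJ([0; 1]);   % a theta that agrees better with the labels gives a lower cost

disp([J0, J1]);
```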

$$\begin{aligned}
\frac{\partial }{\partial \theta_j} J(\theta) &= -\frac{1}{m}\frac{\partial }{\partial \theta_j} \left[\sum\limits_{i=1}^m -y^{(i)} \log \left(1+e^{-\theta^T x^{(i)}} \right)-(1-y^{(i)})\log \left(1+e^{\theta^T x^{(i)}}\right) \right]\\
&= -\frac{1}{m} \left[ \sum\limits_{i=1}^m -y^{(i)} \frac{e^{-\theta^T x^{(i)}}(-x_j^{(i)})}{1+e^{-\theta^T x^{(i)}}} -(1-y^{(i)})\frac{e^{\theta^T x^{(i)}}x_j^{(i)}}{1+e^{\theta^T x^{(i)}}}\right]\\
&= -\frac{1}{m}\left[ \sum\limits_{i=1}^m -y^{(i)} \frac{e^{-\theta^T x^{(i)}}(-x_j^{(i)})}{1+e^{-\theta^T x^{(i)}}}-(1-y^{(i)})\frac{x_j^{(i)}}{1+e^{-\theta^T x^{(i)}}}\right]\\
&= -\frac{1}{m} \sum\limits_{i=1}^m \frac{y^{(i)}e^{-\theta^T x^{(i)}}-1+y^{(i)}}{1+e^{-\theta^T x^{(i)}}}x_j^{(i)} \\
&= -\frac{1}{m} \sum\limits_{i=1}^m \frac{y^{(i)}(1+e^{-\theta^T x^{(i)}})-1}{1+e^{-\theta^T x^{(i)}}}x_j^{(i)} \\
&= -\frac{1}{m} \sum\limits_{i=1}^m \left( y^{(i)}-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)x_j^{(i)}
\end{aligned}$$
We have defined the hypothesis of logistic regression as $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$, so
$$\begin{aligned}
\frac{\partial }{\partial \theta_j} J(\theta) &= -\frac{1}{m} \sum\limits_{i=1}^m \left(y^{(i)}-h_\theta(x^{(i)})\right) x_j^{(i)} \\
&= \frac{1}{m} \sum\limits_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right) x_j^{(i)}
\end{aligned}$$
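The derived gradient can be verified against a finite-difference approximation. This is a minimal sketch under assumptions: the data, the test point `theta`, and the step `epsilon` are hypothetical, and central differences are my choice of check, not something the notes prescribe.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 -2; 1 -1; 1 0; 1 1; 1 2];   % hypothetical samples, first column is x_0 = 1
y = [0; 0; 1; 0; 1];               % hypothetical labels
m = size(X, 1);
theta = [0.1; -0.3];               % arbitrary point at which to test the gradient

costJ = @(t) (-1/m) * sum(y .* log(sigmoid(X*t)) + (1 - y) .* log(1 - sigmoid(X*t)));

% Analytic gradient: (1/m) * sum((h - y) .* x_j), written for all j at once
grad = (1/m) * (X' * (sigmoid(X*theta) - y));

% Numerical gradient by central differences
epsilon = 1e-6;
numgrad = zeros(size(theta));
for j = 1:numel(theta)
    e = zeros(size(theta));
    e(j) = epsilon;
    numgrad(j) = (costJ(theta + e) - costJ(theta - e)) / (2 * epsilon);
end

max_diff = max(abs(grad - numgrad));
disp(max_diff);
```

If the two gradients disagree by more than a small tolerance, the derivation or its implementation contains a sign or indexing error.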

Hints: This form is the same as the partial derivative for linear regression, but the two are based on different hypotheses $h_\theta(x)$. In linear regression, we defined $h_\theta(x) = \theta^T x$.

We update all $\theta_j$ simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial }{\partial \theta_j} J(\theta)$$

We can vectorize the above equation as
$$\mathop{\theta}\limits_{(n+1)\times 1} := \mathop{\theta}\limits_{(n+1)\times 1} - \frac{\alpha}{m}\mathop{X^T}\limits_{(n+1)\times m} \underbrace{(h_\theta(X)-y)}_{m\times 1}$$
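The per-parameter update and the vectorized update should give the same result. The following sketch compares one step of each on hypothetical data; `alpha` and the starting `theta` are arbitrary.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 -2; 1 -1; 1 0; 1 1; 1 2];   % hypothetical samples, first column is x_0 = 1
y = [0; 0; 1; 0; 1];               % hypothetical labels
m = size(X, 1);
alpha = 0.1;                       % arbitrary learning rate
theta = [0.1; -0.3];               % arbitrary starting point

h = sigmoid(X * theta);            % computed once, so the update is simultaneous

% Element-by-element update of every theta_j
theta_loop = zeros(size(theta));
for j = 1:numel(theta)
    grad_j = (1/m) * sum((h - y) .* X(:, j));
    theta_loop(j) = theta(j) - alpha * grad_j;
end

% Vectorized update
theta_vec = theta - (alpha/m) * (X' * (h - y));

disp([theta_loop, theta_vec]);
```

Computing `h` before the loop is what makes the update simultaneous: no $\theta_j$ is changed until all gradients have been evaluated at the old $\theta$.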

Hints: There is a new concept, the "decision boundary". It can be represented as $\theta^T x=0$, which is where $h_\theta(x)=0.5$. The decision boundary is a property of the hypothesis $h_\theta$ and is determined by $\theta$.
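In code, classifying with the decision boundary means checking the sign of $\theta^T x$. The sketch below uses a hypothetical `theta` whose boundary is the line $x_1 + x_2 = 3$, and it assumes the usual convention of predicting 1 when $h_\theta(x) \ge 0.5$.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

theta = [-3; 1; 1];      % hypothetical parameters: boundary is x1 + x2 = 3
X = [1 1 1;              % hypothetical samples, first column is x_0 = 1
     1 2 2;
     1 0 3;
     1 4 0];

z = X * theta;                        % theta' * x for every sample
pred_from_z = (z >= 0);               % which side of the boundary
pred_from_h = (sigmoid(z) >= 0.5);    % thresholding the probability

disp([z, pred_from_z, pred_from_h]);
```

Both rules give the same labels, because $h_\theta(x) \ge 0.5$ exactly when $\theta^T x \ge 0$.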


4. Matlab code for logistic regression with gradient descent

%% ================== Data Generation ===================
clc;clear

x = [55.5,69.5;41,81.5;53.5,86;46,84;41,73.5;51.5,69;51,62.5;42,75;53.5,83;57.5,71;42.5,72.5;41,80;46,82;46,60.5;49.5,76;41,76;48.5,72.5;51.5,82.5;44.5,70.5;44,66;33,76.5;33.5,78.5;31.5,72;33,81.5;42,59.5;30,64;61,45;49,79;26.5,64.5;34,71.5;42,83.5;29.5,74.5;39.5,70;51.5,66;41.5,71.5;42.5,79.5;35,59.5;38.5,73.5;32,81.5;46,60.5;36.5,53;36.5,53.5;24,60.5;19,57.5;34.5,60;37.5,64.5;35.5,51;37,50.5;21.5,42;35.5,58.5;26.5,68.5;26.5,55.5;18.5,67;40,67;32.5,71.5;39,71.5;43,55.5;22,54;36,62.5;31,55.5;38.5,76;40,75;37.5,63;24.5,58;30,67;33,56;56.5,61;41,57;49.5,63;34.5,72.5;32.5,69;36,73;27,53.5;41,63.5;29.5,52.5;20,65.5;38,65;18.5,74.5;16,72.5;33.5,68];
y1 =ones(40,1); y2 = zeros(40,1);
y = [y1; y2];clear y1 y2;

figure
pos = find(y == 1); neg = find(y == 0);
plot(x(pos, 1), x(pos,2), '+'); hold on
plot(x(neg, 1), x(neg, 2), 'o')

m = size(x,1);
X = [ones(m,1), x];  %let the X = (X_0, X_1, X_2), here the X_0 = 1.

%% ================== Feature scaling ================== 
for i = 2:size(X,2)  % X_0 always equals 1, so we do not scale it.
    X(:,i) = (X(:,i)- mean(X(:,i)))./ std(X(:,i));
end

%% ================== Gradient descent ================== 
itera = 300000;
theta = zeros(size(X,2),1); % Initialize theta
alpha = 0.0001;             % Set the learning rate
itera_theta = zeros(itera,size(X,2));    % Record theta at every iteration
J = zeros(itera,1);                      % Record the cost at every iteration

for i = 1:itera
    hypothesis = 1 ./ (1+exp(-(X *  theta))); % h_\theta(x)
    theta = theta - (alpha/m) * (X' * (hypothesis - y)); 
    % Cost function J = -(1/m)*l(theta), in the log(1+exp(.)) form derived above
    J(i,1) = (1/m) * sum(( y .* log(1+exp(-(X *  theta))) + (1-y) .* log(1+exp(X *  theta))));
    itera_theta(i,:) = theta';   
end

%% ================== Plotting the decision boundary ================== 
figure
plot(X(pos, 2), X(pos,3), '+'); hold on
plot(X(neg, 2), X(neg, 3), 'o');hold on

range_size = 4;
range = [-range_size,range_size,-range_size,range_size];
fh = @(x1,x2) theta(1) + theta(2)* x1 + theta(3) * x2; % theta' * x; ezplot draws the curve where it equals 0
ezplot(fh,range);

5. Advanced optimization

Advanced optimization algorithms can:

  • run much more quickly
  • let the algorithm scale much better to very large machine learning problems

Some optimization algorithms:

  • Gradient Descent
  • Conjugate Gradient
  • BFGS
  • L-BFGS
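
% --- costfunction.m: cost and gradient of (unregularized) logistic regression ---
function [jVal, gradient] = costfunction(X,y,theta)
m = size(X,1);
h = 1 ./ (1 + exp(-(X * theta)));                          % h_theta(x)
jVal = (-1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h));   % J(theta)
gradient = (1/m) * (X' * (h - y));                          % all partial derivatives at once
end

% --- main script: let fminunc use the gradient we supply ---
options = optimset('GradObj','on','MaxIter', 100);
initialTheta = zeros(size(X,2),1);   % one entry per column of X, i.e. n+1 entries
[optTheta, functionVal, exitFlag] = fminunc(@ (t) (costfunction(X,y,t)), initialTheta, options);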

6. Multi-class classification

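We train one binary classifier per class (one-vs-all), each treating its own class as positive and all the other classes as negative. For a new sample, we predict the label whose classifier outputs the maximum probability.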
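The prediction step can be sketched as follows. The matrix `Theta` holds one hypothetical, already-trained parameter vector per row; its values are invented for illustration and are not the result of any training run in these notes.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters of K = 3 binary classifiers, one row per class
Theta = [0  0  1;
         0 -1 -1;
         0  1 -1];

% Hypothetical samples to classify, first column is x_0 = 1
X = [1  0  3;
     1 -3 -2;
     1  3 -2];

prob = sigmoid(X * Theta');        % prob(i, k) = output of classifier k on sample i
[~, pred] = max(prob, [], 2);      % label = index of the most confident classifier

disp(pred');
```

Each row of `prob` holds the outputs of the three classifiers for one sample, and `max` along the second dimension picks the winning class.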

7. The problem of overfitting

  • overfitting - high variance
  • underfitting - high bias

An overfitted model fails to generalize to new samples.

How to avoid overfitting?
(1) Reduce the number of features (manually or with a model selection algorithm).
(2) Regularization: keep all the features but reduce the magnitude of the parameters $\theta_j$.

8. Regularizing the cost function

$$J(\theta) = \frac{1}{2m} \left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2 \right)$$

In the above formula, $\sum\limits_{j=1}^n\theta_j^2$ is the regularization term, and $\lambda$ is the regularization parameter that controls the trade-off between fitting the training data and keeping the parameters small.

We want to obtain small values for the parameters $\theta_1,\dots,\theta_n$, which simplifies the hypothesis and makes it less prone to overfitting.

  • Note: we do not penalize $\theta_0$ in the regularization.
8.1 Regularized linear regression

cost function: $J(\theta)=\frac{1}{2m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2\right)$

For $j = 1, 2, \dots, n$:
$$\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta) &= \frac{1}{m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\lambda\theta_j\right) \\
&= \frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+ \frac{\lambda}{m}\theta_j
\end{aligned}$$


For gradient descent,

$$\begin{aligned}
\theta_0 &:= \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\
\theta_j &:= \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}-\alpha\frac{\lambda}{m}\theta_j \\
&= \theta_j\left(1-\frac{\alpha\lambda}{m}\right)-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}, \quad (j=1,2,\dots,n)
\end{aligned}$$
Note: $1-\frac{\alpha\lambda}{m} \lt 1$ (for $\alpha, \lambda > 0$), so every update first shrinks $\theta_j$ slightly and then applies the usual gradient step.
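The two ways of writing the regularized update are algebraically identical, which a single step on hypothetical data can confirm. All values below (`X`, `y`, `theta`, `alpha`, `lambda`) are made up for illustration.

```matlab
X = [1 1; 1 2; 1 3; 1 4];   % hypothetical samples, first column is x_0 = 1
y = [2; 3; 5; 4];           % hypothetical targets
m = size(X, 1);
theta = [0.5; 1];           % arbitrary starting point
alpha = 0.1;                % arbitrary learning rate
lambda = 2;                 % arbitrary regularization parameter

err = X * theta - y;                 % h_theta(x) - y for linear regression
grad = (1/m) * (X' * err);           % unregularized gradient

% Form 1: gradient step, then subtract alpha*(lambda/m)*theta_j for j >= 1
theta_a = theta - alpha * grad;
theta_a(2:end) = theta_a(2:end) - alpha * (lambda/m) * theta(2:end);

% Form 2: shrink theta_j by (1 - alpha*lambda/m), then take the gradient step
shrink = 1 - alpha * lambda / m;
theta_b = theta;
theta_b(1) = theta(1) - alpha * grad(1);                      % theta_0 is not shrunk
theta_b(2:end) = theta(2:end) * shrink - alpha * grad(2:end);

disp([theta_a, theta_b]);
```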


For normal equation,
we can rewrite the cost function $J(\theta)=\frac{1}{2m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2\right)$ in matrix form (for the moment penalizing every component of $\theta$) as
$$\begin{aligned}
J(\theta) &=\frac{1}{2m}\left[\underbrace{(X\theta-y)^T}_{1\times m}\underbrace{(X\theta-y)}_{m\times 1}+\lambda\underbrace{\theta^T}_{1\times (n+1)}\underbrace{\theta}_{(n+1)\times 1}\right]\\
&=\frac{1}{2m}\left(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty+\lambda\theta^T\theta\right)
\end{aligned}$$
Setting the gradient to zero:
$$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2m}\left(2X^TX\theta-X^Ty-(y^TX)^T+0+2\lambda\theta\right)\\
&= \frac{1}{2m}\left(2X^TX\theta-X^Ty-X^Ty+0+2\lambda\theta\right)\\
&= \frac{1}{m}\left(X^TX\theta-X^Ty+\lambda\theta\right)=0
\end{aligned}$$
$$(X^TX+\lambda I)\theta=X^Ty$$
Since $\theta_0$ is not penalized, the first diagonal entry of the identity matrix is replaced by 0:
$$\theta=\left(X^TX+\lambda \begin{bmatrix} 0&&&\\&1&&\\ &&\ddots&\\&&&1 \end{bmatrix}\right)^{-1}X^Ty$$
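A minimal sketch of the regularized normal equation follows. The data and `lambda` are hypothetical, the matrix name `D` is my own, and the backslash operator is used instead of an explicit inverse, which is a common numerical choice rather than something the formula requires.

```matlab
X = [1 1; 1 2; 1 3; 1 4];   % hypothetical samples, first column is x_0 = 1
y = [2; 3; 5; 4];           % hypothetical targets
lambda = 1;                 % arbitrary regularization parameter

D = eye(size(X, 2));        % identity matrix of size (n+1) x (n+1) ...
D(1, 1) = 0;                % ... with a 0 in the first entry so theta_0 is not penalized

theta = (X' * X + lambda * D) \ (X' * y);

% At the solution the regularized gradient must vanish
residual = X' * (X * theta - y) + lambda * D * theta;

disp(theta');
disp(norm(residual));
```

For this data the system is small enough to solve by hand, giving $\theta = (11/6,\ 2/3)$.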

8.2 Regularized logistic regression

cost function: $J(\theta) = -\frac{1}{m} \left(\sum\limits_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right)+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$

where $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$
For gradient descent,

$$\begin{aligned}
\theta_0 &:= \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\
\theta_j &:= \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}-\alpha\frac{\lambda}{m}\theta_j, \quad (j=1,2,\dots,n)
\end{aligned}$$

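8.3 Matlab code for computing the cost and gradient of logistic regression with regularization
sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic function, applied elementwise
% Cost: cross-entropy plus the penalty on theta(2:end); theta_0 is not penalized
Jtheta =  ((-1) / m) * sum(y .* log(sigmoid(X*theta)) + (1-y) .* log(1- sigmoid(X*theta))) + (lambda/(2*m))*theta(2:end)'*theta(2:end);
gradient = (1 / m) * (X' * (sigmoid(X*theta) - y)) + (lambda/m) .* theta;
% Overwrite the first entry, because theta_0 is not regularized
gradient(1) = (1 / m) * (X(:,1)' * (sigmoid(X*theta) - y));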
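These expressions can be dropped into a plain gradient descent loop. The sketch below is only an illustration under assumptions: the data set, `lambda`, `alpha`, and `num_iters` are hypothetical, and `alpha` was chosen small enough for this particular data that the cost cannot increase between iterations.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 -2; 1 -1; 1 0; 1 1; 1 2];   % hypothetical samples, first column is x_0 = 1
y = [0; 0; 1; 0; 1];               % hypothetical labels
m = size(X, 1);
lambda = 1;                        % arbitrary regularization parameter
alpha = 0.5;                       % learning rate (assumed small enough for this data)
num_iters = 200;

theta = zeros(size(X, 2), 1);
J_history = zeros(num_iters, 1);

for k = 1:num_iters
    h = sigmoid(X * theta);
    % Regularized cost at the current theta (theta_0 is not penalized)
    J_history(k) = (-1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) ...
                   + (lambda/(2*m)) * (theta(2:end)' * theta(2:end));
    % Regularized gradient, with the first entry left unregularized
    gradient = (1/m) * (X' * (h - y)) + (lambda/m) * theta;
    gradient(1) = (1/m) * (X(:, 1)' * (h - y));
    theta = theta - alpha * gradient;
end

disp(J_history([1, end])');
```

Recording `J_history` makes it easy to plot the cost against the iteration number and confirm that the chosen learning rate behaves as expected.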


* Feature mapping function to polynomial features

degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);
    end
end
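To see what the mapping produces, the sketch below runs the same loop on three hypothetical samples. `X1` and `X2` are assumed to be column vectors holding the first and second feature. With `degree = 6` the output has $(6+1)(6+2)/2 = 28$ columns: the constant term, then $x_1$, $x_2$, $x_1^2$, $x_1x_2$, $x_2^2$, and so on.

```matlab
X1 = [0.5; -1; 2];      % hypothetical values of the first feature
X2 = [1; 0.25; -0.5];   % hypothetical values of the second feature

degree = 6;
out = ones(size(X1(:,1)));          % first column: the constant term
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);   % term x1^(i-j) * x2^j
    end
end

disp(size(out));
```

Because the mapped matrix already contains the column of ones, it can be used directly as the design matrix for regularized logistic regression.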