ML Notes: Week 3 - Logistic regression

CCrazyGuy

Published 2020-05-26 22:44:52

Category: ML Study Notes. Tags: machine learning

Article link: https://blog.csdn.net/jty573894890/article/details/106360682

1. The basic theory of logistic regression

Hypothesis: $h_\theta(x)=\frac{1}{1+e^{-\theta ^Tx}}$
Cost Function: $J(\theta) = -\frac1m\sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right]$

2. What is the logistic model?

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one. [Source: Logistic regression]

In other words, logistic regression models the relationship between the predictors (our independent variables) and a binary dependent variable (the label, e.g. 0 or 1). Its output is the probability that the label equals 1.

Hypothesis of the logistic regression

We define the logistic regression model under two assumptions:
(1) given $x$ , the label $y$ follows a Bernoulli distribution;
(2) $p(y=1|x,\theta)=\frac{1}{1+e^{-\theta^T x}}=p$ , and therefore $p(y=0|x,\theta)=1-p$ .
The hypothesis of logistic regression can then be defined as: $h_\theta(x)=\frac{1}{1+e^{-\theta ^Tx}}$

This is the sigmoid function $g(z)=\frac{1}{1+e^{-z}}$ evaluated at $z=\theta^Tx$ . Its curve is S-shaped: it maps any real $z$ into $(0,1)$ , with $g(0)=0.5$ .
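The curve of the sigmoid function can be reproduced with a few lines of Matlab. This is only a minimal sketch: the anonymous function name `sigmoid` and the plotting range [-10, 10] are my own choices for illustration.

```matlab
% Sigmoid (logistic) function as an anonymous function; it works elementwise.
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = linspace(-10, 10, 201);   % illustrative range, chosen for this sketch
g = sigmoid(z);

figure
plot(z, g, 'LineWidth', 1.5); hold on
plot([-10 10], [0.5 0.5], '--');   % g(0) = 0.5 is the usual decision threshold
xlabel('z = \theta^T x'); ylabel('g(z)');
title('Sigmoid function');
```

The output always stays strictly between 0 and 1, which is what allows us to read $h_\theta(x)$ as a probability.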

For a given sample, the two cases $y=1$ and $y=0$ can be combined into a single expression for the probability distribution:

$p(y^{(i)}|x^{(i)},\theta)=h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})}$

Once the parameters $\theta$ are known, this equation gives the predicted probability of each class for a given sample. To estimate $\theta$ , we need a set of training samples.

3. Maximum likelihood estimate (MLE)

Now, how do we solve for $\theta$ ?

Maximum likelihood estimation (MLE) can help. In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function (using the given samples), so that under the assumed statistical model the observed data is most probable. [Source: Maximum likelihood estimation]

Here, we define the likelihood function $L(\theta)$ , which is the joint probability of the samples (assuming the samples are independent of each other).
$\begin{aligned} L(\theta) &=\prod\limits_{i=1}^mp(y^{(i)}|x^{(i)},\theta) \\ &= \prod\limits_{i=1}^mh_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})} \end{aligned}$

Our objective is to find the $\theta$ that maximizes the likelihood function $L(\theta)$ . At the maximum, $\frac{\partial}{\partial \theta_j}L(\theta)=0$ for every $j$ . However, the partial derivatives of a product are difficult to calculate directly, so we take the log of the likelihood, which turns the product into a sum. Because the log is monotonically increasing, this does not change the maximizing $\theta$ .
$\begin{aligned} \log L(\theta) = l(\theta)&=\log \prod\limits_{i=1}^mh_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{(1-y^{(i)})} \\ &= \sum\limits_{i=1}^m\log h_\theta(x^{(i)})^{y^{(i)}}+\log (1-h_\theta(x^{(i)}))^{(1-y^{(i)})} \\ &= \sum\limits_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\\ &= \sum\limits_{i=1}^m y^{(i)} \log \left( \frac{1}{1+e^{-\theta ^Tx^{(i)}}} \right) + (1-y^{(i)})\log \left(1-\frac{1}{1+e^{-\theta ^Tx^{(i)}}}\right)\\ &= \sum\limits_{i=1}^m y^{(i)} \log \left( \frac{1}{1+e^{-\theta ^Tx^{(i)}}} \right) +(1-y^{(i)})\log \left(1-\frac{e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}}\right) \\ & = \sum\limits_{i=1}^m y^{(i)} \log \left( \frac{1}{1+e^{-\theta ^Tx^{(i)}}} \right) +(1-y^{(i)})\log \left(\frac{1}{1+e^{\theta^T x^{(i)}}}\right) \\ &= \sum\limits_{i=1}^m -y^{(i)} \log \left(1+e^{-\theta ^Tx^{(i)}} \right)-(1-y^{(i)})\log \left(1+e^{\theta^T x^{(i)}}\right) \end{aligned}$
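As a quick sanity check, the first and the last line of this derivation can be compared numerically. The sketch below uses a tiny toy data set; all the values of `X`, `y` and `theta` are made up for illustration only.

```matlab
% Toy data (hypothetical values, for illustration only).
X     = [1 0.5; 1 -1.2; 1 2.0; 1 0.3];   % first column is x_0 = 1
y     = [1; 0; 1; 0];
theta = [0.1; 0.8];

z = X * theta;
h = 1 ./ (1 + exp(-z));

% Log of the product form of the likelihood L(theta).
logL_product = log(prod(h .^ y .* (1 - h) .^ (1 - y)));

% Final simplified form of l(theta) from the derivation.
logL_sum = sum(-y .* log(1 + exp(-z)) - (1 - y) .* log(1 + exp(z)));

fprintf('log of product: %.6f, simplified sum: %.6f\n', logL_product, logL_sum);
```

For a large number of samples the product itself underflows to 0 in floating point, which is another practical reason to work with the log form.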

Now, we construct the cost function of logistic regression as:
$J(\theta) = -\frac{1}{m}l(\theta)$
Maximizing $l(\theta)$ is therefore the same as minimizing $J(\theta)$ . How do we do this? Gradient descent works well here. First, we take the partial derivative of $J(\theta)$ with respect to $\theta_j$ .

$\begin{aligned} \frac{\partial }{\partial \theta_j} J(\theta) &= -\frac{1}{m}\frac{\partial }{\partial \theta_j} \left[\sum\limits_{i=1}^m -y^{(i)} \log \left(1+e^{-\theta ^Tx^{(i)}} \right)-(1-y^{(i)})\log \left(1+e^{\theta^T x^{(i)}}\right) \right]\\ &=-\frac{1}{m} \left[ \sum\limits_{i=1}^m -y^{(i)} \frac{e^{-\theta ^Tx^{(i)}}(-x_j^{(i)})}{1+e^{-\theta ^Tx^{(i)}}} -(1-y^{(i)})\frac{e^{\theta^T x^{(i)}}x_j^{(i)}}{1+e^{\theta^T x^{(i)}}}\right]\\ &= -\frac{1}{m}\left[ \sum\limits_{i=1}^m -y^{(i)} \frac{e^{-\theta ^Tx^{(i)}}(-x_j^{(i)})}{1+e^{-\theta ^Tx^{(i)}}}-(1-y^{(i)})\frac{x_j^{(i)}}{1+e^{-\theta^T x^{(i)}}}\right]\\ & =-\frac{1}{m} \sum\limits_{i=1}^m \frac{y^{(i)}e^{-\theta ^Tx^{(i)}}-1+y^{(i)}}{1+e^{-\theta^T x^{(i)}}}x_j^{(i)} \\ & = -\frac{1}{m} \sum\limits_{i=1}^m \frac{y^{(i)}(1+e^{-\theta^T x^{(i)}})-1}{1+e^{-\theta^T x^{(i)}}}x_j^{(i)} \\ &= -\frac{1}{m} \sum\limits_{i=1}^m \left( y^{(i)}-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)x_j^{(i)} \end{aligned}$
We have defined the hypothesis in the logistic regression as $h_\theta(x)=\frac{1}{1+e^{-\theta ^Tx}}$ , so
$\begin{aligned} \frac{\partial }{\partial \theta_j} J(\theta) &=-\frac{1}{m} \sum\limits_{i=1}^m \left(y^{(i)}-h_\theta(x^{(i)})\right) x_j^{(i)} \\ &= \frac{1}{m} \sum\limits_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right) x_j^{(i)} \end{aligned}$
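The derived gradient can be verified against a finite-difference approximation of $J(\theta)$ . This is a sketch under stated assumptions: the toy data, the step `epsilon` and the variable names are mine and only serve the check.

```matlab
% Toy data (hypothetical values, for illustration only).
X     = [1 0.5 1.5; 1 -1.2 0.3; 1 2.0 -0.7; 1 0.3 0.9; 1 -0.4 -1.1];
y     = [1; 0; 1; 0; 1];
theta = [0.2; -0.5; 0.3];
m     = size(X, 1);

sigmoid = @(z) 1 ./ (1 + exp(-z));
cost    = @(t) (-1/m) * sum(y .* log(sigmoid(X*t)) + (1 - y) .* log(1 - sigmoid(X*t)));

% Analytic gradient from the derivation: (1/m) * X' * (h - y).
grad_analytic = (1/m) * (X' * (sigmoid(X*theta) - y));

% Central finite differences, one coordinate at a time.
epsilon      = 1e-6;
grad_numeric = zeros(size(theta));
for j = 1:numel(theta)
    e = zeros(size(theta)); e(j) = epsilon;
    grad_numeric(j) = (cost(theta + e) - cost(theta - e)) / (2 * epsilon);
end

disp(max(abs(grad_analytic - grad_numeric)));   % should be very close to zero
```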

Hint: This form is the same as the partial derivative for linear regression. The difference lies in the hypothesis $h_\theta(x)$ : in linear regression, we defined $h_\theta(x) = \theta^Tx$ .

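Update all the $\theta_j$ simultaneously:

$\theta_j := \theta_j - \alpha \frac{\partial }{\partial \theta_j} J(\theta)$

We can vectorize the above equation as follows, where $X$ holds one sample per row, including the column $x_0=1$ :
$\mathop \theta\limits_{(n+1)\times 1} := \mathop\theta \limits_{(n+1)\times 1} - \frac{\alpha}{m}\mathop {X^T} \limits_{(n+1)\times m} \underbrace{(h_\theta(X)-y)}_{m\times 1}$

Hint: There is a new concept, the "decision boundary". It can be represented as $\theta^Tx=0$ : we predict $y=1$ when $\theta^Tx \ge 0$ (that is, $h_\theta(x)\ge 0.5$ ) and $y=0$ otherwise. The decision boundary is a property of the hypothesis $h_\theta$ and is determined by $\theta$ , not by the training set.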
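Because $g(z)\ge 0.5$ exactly when $z\ge 0$ , thresholding the predicted probability at 0.5 gives the same labels as checking on which side of $\theta^Tx=0$ a sample lies. A minimal sketch, with a made-up `theta` and made-up samples:

```matlab
% Hypothetical parameters and samples, for illustration only.
theta = [-3; 1; 1];                      % boundary: -3 + x1 + x2 = 0
X     = [1 1 1; 1 2 2; 1 0 3; 1 4 0.5];  % first column is x_0 = 1

sigmoid = @(z) 1 ./ (1 + exp(-z));

z = X * theta;                       % theta' * x for every sample
pred_from_prob = sigmoid(z) >= 0.5;  % threshold the probability
pred_from_sign = z >= 0;             % check the side of the boundary

disp([z, sigmoid(z), pred_from_prob, pred_from_sign]);
```

The third sample lies exactly on the boundary ( $\theta^Tx=0$ ), so its probability is 0.5 and both rules assign it to class 1.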

4. Matlab code for logistic regression with gradient descent

%% ================== Data Generation ===================
clc;clear

x = [55.5,69.5;41,81.5;53.5,86;46,84;41,73.5;51.5,69;51,62.5;42,75;53.5,83;57.5,71;42.5,72.5;41,80;46,82;46,60.5;49.5,76;41,76;48.5,72.5;51.5,82.5;44.5,70.5;44,66;33,76.5;33.5,78.5;31.5,72;33,81.5;42,59.5;30,64;61,45;49,79;26.5,64.5;34,71.5;42,83.5;29.5,74.5;39.5,70;51.5,66;41.5,71.5;42.5,79.5;35,59.5;38.5,73.5;32,81.5;46,60.5;36.5,53;36.5,53.5;24,60.5;19,57.5;34.5,60;37.5,64.5;35.5,51;37,50.5;21.5,42;35.5,58.5;26.5,68.5;26.5,55.5;18.5,67;40,67;32.5,71.5;39,71.5;43,55.5;22,54;36,62.5;31,55.5;38.5,76;40,75;37.5,63;24.5,58;30,67;33,56;56.5,61;41,57;49.5,63;34.5,72.5;32.5,69;36,73;27,53.5;41,63.5;29.5,52.5;20,65.5;38,65;18.5,74.5;16,72.5;33.5,68];
y1 =ones(40,1); y2 = zeros(40,1);
y = [y1; y2];clear y1 y2;

figure
pos = find(y == 1); neg = find(y == 0);
plot(x(pos, 1), x(pos,2), '+'); hold on
plot(x(neg, 1), x(neg, 2), 'o')

m = size(x,1);
X = [ones(m,1), x];  %let the X = (X_0, X_1, X_2), here the X_0 = 1.

%% ================== Feature scaling ================== 
for i = 2:size(X,2)  % X_0 is the constant 1, so we do not scale it.
    X(:,i) = (X(:,i)- mean(X(:,i)))./ std(X(:,i));
end

%% ================== Gradient descent ================== 
itera = 300000;
theta = zeros(size(X,2),1); % Initialize theta
alpha = 0.0001;             % Set the learning rate
J = zeros(itera,1);                      % Record the cost at every iteration
itera_theta = zeros(itera,size(X,2));    % Record all the theta values during the iterations

for i = 1:itera
    hypothesis = 1 ./ (1+exp(-(X *  theta))); % h_\theta(x)
    theta = theta - (alpha/m) * (X' * (hypothesis - y)); 
    % Cost function J = -(1/m)*l(theta); the factor is +1/m here because the minus signs of l(theta) are already absorbed
    J(i,1) = (1/m) * sum(( y .* log(1+exp(-(X *  theta))) + (1-y) .* log(1+exp(X *  theta))));
    itera_theta(i,:) = theta';   
end

%% ================== Plotting the decision boundary ================== 
figure
plot(X(pos, 2), X(pos,3), '+'); hold on
plot(X(neg, 2), X(neg, 3), 'o');

range_size = 4;
range = [-range_size,range_size,-range_size,range_size];
fh = @(x1,x2) theta(1) + theta(2)* x1 + theta(3) * x2; % theta' * x; the decision boundary is where this equals 0
ezplot(fh,range);   % ezplot draws the implicit curve fh(x1,x2) = 0

5. Advanced optimization

Compared with plain gradient descent, more advanced optimization algorithms usually:
run much more quickly
let the algorithm scale much better to very large machine learning problems

Some optimization algorithms:

Gradient Descent
Conjugate Gradient
BFGS
L-BFGS

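To use one of them in Matlab, we write a function that returns both the cost and its gradient, and pass it to fminunc:

% costfunction.m
function [jVal, gradient] = costfunction(X,y,theta)
m = size(X,1);
h = 1 ./ (1 + exp(-(X*theta)));                          % h_\theta(x)
jVal = (-1/m) * sum(y .* log(h) + (1-y) .* log(1-h));    % J(theta)
gradient = (1/m) * (X' * (h - y));                       % all partial derivatives at once
end

% main script; X already contains the column x_0 = 1
options = optimset('GradObj','on','MaxIter', 100);
initialTheta = zeros(size(X,2),1);
[optTheta, functionVal, exitFlag] = fminunc(@ (t) (costfunction(X,y,t)), initialTheta, options);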

6. Multi-class classification

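With the one-vs-all approach, we train one binary logistic regression classifier $h_\theta^{(k)}(x)$ for each class $k$ (class $k$ against all the others). For a new sample, we predict the class whose classifier outputs the maximum probability.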
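A minimal sketch of this idea on a toy three-class problem is shown below. The data, the learning rate `alpha`, the iteration count `iters` and the matrix name `Theta` are all assumptions made for illustration, and plain gradient descent is used only because it was derived earlier in these notes.

```matlab
% Toy 3-class data in 2-D (hypothetical values, for illustration only).
x = [-2 -1; -1.5 -1.5; -2.5 -0.5;   % class 1
      2 -1;  2.5 -0.5;  1.5 -1.5;   % class 2
      0  2;  0.5  2.5; -0.5  1.5];  % class 3
y = [1;1;1; 2;2;2; 3;3;3];

m = size(x, 1);
K = 3;
X = [ones(m, 1), x];
sigmoid = @(z) 1 ./ (1 + exp(-z));

alpha = 0.1; iters = 5000;          % illustrative settings
Theta = zeros(size(X, 2), K);       % column k holds the classifier for class k

for k = 1:K
    yk = double(y == k);            % relabel: class k vs. all the others
    theta = zeros(size(X, 2), 1);
    for it = 1:iters
        theta = theta - (alpha/m) * (X' * (sigmoid(X*theta) - yk));
    end
    Theta(:, k) = theta;
end

prob = sigmoid(X * Theta);          % m-by-K matrix of class probabilities
[~, pred] = max(prob, [], 2);       % pick the class with the largest probability
fprintf('Training accuracy: %.2f\n', mean(pred == y));
```

Note that the K probabilities in one row of `prob` come from independent binary classifiers, so they need not sum to one; only their ranking is used.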

7. The problem of overfitting

overfitting - high variance
underfitting - high bias

An overfitted model fits the training set very well but fails to generalize to new samples.

How to avoid overfitting?
(1) Reduce the number of features (manually, or with a model selection algorithm).
(2) Regularization: this method keeps all the features but reduces the magnitude of the parameters $\theta_j$ .

8. Regularizing the cost function

$J(\theta) = \frac{1}{2m} \left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2 \right)$

In the above formula, $\sum\limits_{j=1}^n\theta_j^2$ is the regularization term, and $\lambda$ is the regularization parameter, which controls the trade-off between fitting the training set well and keeping the parameters small.

We want to obtain small values for the parameters $\theta_1,\dots,\theta_n$ , which simplifies the hypothesis and makes it less prone to overfitting.

** We don't penalize $\theta_0$ in the regularization, which is why the sum starts at $j=1$ .

8.1 Regularized linear regression

cost function: $J(\theta)=\frac{1}{2m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2\right)$

Its partial derivative with respect to $\theta_j$ , for $j=1,2,\dots,n$ , is

$\begin{aligned} \frac{\partial}{\partial\theta_j}J(\theta)&= \frac{1}{m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\lambda\theta_j\right) \\ &= \frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+ \frac{\lambda}{m}\theta_j \end{aligned}$

For gradient descent,

$\begin{aligned} \theta_0 &:= \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\ \theta_j &:= \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}-\alpha\frac{\lambda}{m}\theta_j \\ \theta_j &:= \theta_j\left(1-\frac{\alpha\lambda}{m}\right)-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}, (j=1,2,\dots,n) \end{aligned}$
* The last two lines are the same update written in two ways. Since $1-\frac{\alpha\lambda}{m} \lt 1$ , every iteration first shrinks $\theta_j$ slightly and then applies the usual gradient descent step.
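The equivalence of the two ways of writing the $\theta_j$ update can be checked on one gradient descent step. The sketch below assumes a tiny made-up data set and illustrative values for `alpha` and `lambda`.

```matlab
% Toy data (hypothetical values, for illustration only).
X      = [1 1; 1 2; 1 3; 1 4];      % first column is x_0 = 1
y      = [2; 2.9; 4.1; 5];
theta  = [0.5; 0.5];
m      = size(X, 1);
alpha  = 0.1;  lambda = 1;          % illustrative settings

grad = (1/m) * (X' * (X*theta - y));     % unregularized gradient (linear hypothesis)

% Form 1: subtract the regularized gradient; theta_0 is not penalized.
reg        = (lambda/m) * theta;  reg(1) = 0;
theta_new1 = theta - alpha * (grad + reg);

% Form 2: shrink theta_j by (1 - alpha*lambda/m) first, then take the usual step.
shrink     = [1; (1 - alpha*lambda/m) * ones(numel(theta) - 1, 1)];
theta_new2 = shrink .* theta - alpha * grad;

disp([theta_new1, theta_new2]);     % the two columns should match
```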

For the normal equation,
we can rewrite the cost function $J(\theta)=\frac{1}{2m}\left(\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{j=1}^n\theta_j^2\right)$ in matrix form. The constant factor $\frac{1}{m}$ does not change the minimizer, so we drop it, and for the moment we penalize every component of $\theta$ :
$\begin{aligned}J(\theta) &=\frac{1}{2}\left[\underbrace{(X\theta-y)^T}_{1\times m}\underbrace{(X\theta-y)}_{m\times 1}+\lambda\underbrace{\theta^T}_{1\times (n+1)}\underbrace{\theta}_{(n+1)\times 1}\right]\\ &=\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty+\lambda\theta^T\theta) \end{aligned}$
Setting the derivative to zero:
$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2}(2X^TX\theta-X^Ty-(y^TX)^T+0+2\lambda\theta)\\ &= \frac{1}{2}(2X^TX\theta-X^Ty-X^Ty+0+2\lambda\theta)\\ &= X^TX\theta-X^Ty+\lambda\theta=0 \end{aligned}$
$(X^TX+\lambda I)\theta=X^Ty$
Because $\theta_0$ should not be penalized, the first diagonal entry of the $(n+1)\times(n+1)$ identity matrix is replaced by 0:
$\theta=\left(X^TX+\lambda \begin{bmatrix} 0&&&\\&1&&\\ &&\ddots&\\&&&1 \end{bmatrix}\right)^{-1}X^Ty$
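The closed-form solution translates almost directly into Matlab. The following is a sketch on made-up data with an illustrative `lambda`; the names `L`, `theta_reg` and `theta_plain` are mine. It solves the linear system with the backslash operator rather than forming the inverse explicitly, which is my choice and not part of the derivation.

```matlab
% Toy data (hypothetical values, for illustration only).
x = [1; 2; 3; 4; 5];
y = [1.1; 1.9; 3.2; 3.9; 5.1];
m = numel(x);
X = [ones(m, 1), x];
lambda = 1;                          % illustrative value

L = eye(size(X, 2));  L(1, 1) = 0;   % identity with a 0 in the theta_0 position

% Solve (X'X + lambda*L) * theta = X'y.
theta_reg   = (X' * X + lambda * L) \ (X' * y);
theta_plain = (X' * X) \ (X' * y);   % lambda = 0, for comparison

% The gradient of the regularized cost should vanish at theta_reg.
grad = X' * (X * theta_reg - y) + lambda * (L * theta_reg);
disp([theta_plain, theta_reg]);
```

Comparing the two columns shows the effect of the penalty: the slope $\theta_1$ is pulled toward zero, while the unpenalized $\theta_0$ adjusts to compensate.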

8.2 Regularized logistic regression

cost function: $J(\theta) = -\frac1m \left(\sum\limits_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)})) \right)+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$

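where $h_\theta(x)=\frac{1}{1+e^{-\theta ^Tx}}$ .
For gradient descent, the update has the same form as in regularized linear regression, only with the sigmoid hypothesis:

$\begin{aligned} \theta_0 &:= \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\ \theta_j &:= \theta_j-\alpha\left[\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right], (j=1,2,\dots,n) \end{aligned}$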

8.3 Matlab code to compute the cost and gradient of logistic regression with regularization

sigmoid = @(z) 1 ./ (1 + exp(-z));   % sigmoid helper used below
Jtheta =  ((-1) / m) * sum(y .* log(sigmoid(X*theta)) + (1-y) .* log(1- sigmoid(X*theta))) + (lambda/(2*m))*theta(2:end)'*theta(2:end);   % theta(1) is theta_0 and is not penalized
gradient = (1 / m) * (X' * (sigmoid(X*theta) - y)) + (lambda/m) .* theta;
gradient(1) = (1 / m) * (X(:,1)' * (sigmoid(X*theta) - y));   % overwrite: no penalty term for theta_0

* Feature mapping function: expand two feature columns X1 and X2 into all polynomial terms up to the given degree

degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);
    end
end
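To see what the mapping produces, the same loop can be run on a small example. The two feature columns below are made up for illustration, and the degree is lowered to 2 only so that the output is easy to read.

```matlab
% Two hypothetical feature columns, for illustration only.
X1 = [0.5; -1.0; 2.0];
X2 = [1.0;  0.3; -0.5];

degree = 2;                               % small degree so the result is easy to inspect
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)) .* (X2.^j);
    end
end

% Columns are: 1, x1, x2, x1^2, x1*x2, x2^2
disp(out);
n_terms = (degree + 1) * (degree + 2) / 2;    % number of monomials up to 'degree'
```

With degree = 6 the same formula gives 28 columns, so two raw features become a 28-dimensional feature vector, which is why regularization matters for this kind of model.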
