1. Classification
2. Hypothesis Representation
- Logistic Regression Model
  $h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$
  the function $g(z)$ above is called the sigmoid function or logistic function
- Interpretation of Hypothesis output
  $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$, i.e.
  $h_\theta(x) = P(y = 1 \mid x; \theta)$, so $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$
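A minimal NumPy sketch of the model, assuming $x$ already includes the intercept feature $x_0 = 1$ (all names here are illustrative, not from the original notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1 on input x."""
    return sigmoid(x @ theta)

# example: theta = np.array([-3.0, 1.0, 1.0]); x = np.array([1.0, 2.0, 2.0])
# hypothesis(theta, x) ≈ 0.73, i.e. P(y = 1 | x; theta) ≈ 0.73
```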
3. Decision Boundary
the decision boundary is the set of points where $\theta^T x = 0$ (equivalently $h_\theta(x) = 0.5$); linear features give linear decision boundaries, while polynomial features give non-linear decision boundaries
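For example, with $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ and $\theta = [-3, 1, 1]^T$, the model predicts $y = 1$ whenever $\theta^T x \ge 0$, i.e. $x_1 + x_2 \ge 3$, so the straight line $x_1 + x_2 = 3$ is the decision boundary; adding polynomial features such as $x_1^2$ and $x_2^2$ with $\theta = [-1, 0, 0, 1, 1]^T$ gives the circular boundary $x_1^2 + x_2^2 = 1$.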
4. Cost function
the squared-error cost from linear regression is non-convex when $h_\theta(x)$ is the sigmoid, so logistic regression uses
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
5. Simplified cost function and gradient descent
logistic regression cost function
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$
To fit parameters $\theta$: $\min_\theta J(\theta)$
To make a prediction given a new $x$: output $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
Gradient descent
Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ } (simultaneously update all $\theta_j$)
the update rule looks identical to linear regression. Why? The partial derivative $\frac{\partial J}{\partial \theta_j}$ happens to take the same form in both cases, but the algorithms still differ because the hypothesis differs: $h_\theta(x) = \theta^T x$ for linear regression versus $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ here
Q: Suppose you are running gradient descent to fit a logistic regression model with parameter $\theta \in \mathbb{R}^{n+1}$. Which of the following is a reasonable way to make sure the learning rate $\alpha$ is set properly and that gradient descent is running correctly?
A: Plot $J(\theta)$ as a function of the number of iterations and make sure $J(\theta)$ is decreasing on every iteration.
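A minimal NumPy sketch of batch gradient descent for logistic regression that also records $J(\theta)$ at every iteration, as the answer above suggests; the learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Simultaneous updates theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta = theta - alpha * (X.T @ (h - y)) / m
        J_history.append(cost(theta, X, y))  # should decrease on every iteration
    return theta, J_history
```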
6. Advanced optimization
optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
the last three algorithms have the following advantages:
- no need to manually pick the learning rate $\alpha$
- often faster than gradient descent
disadvantages:
- more complex
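One practical way to use such an algorithm is SciPy's `minimize` with the L-BFGS method; this is a sketch under the data conventions above (SciPy is my choice here, not something the notes prescribe), and no $\alpha$ has to be picked:

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, the pair the optimizer expects."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]), args=(X, y),
#                jac=True, method='L-BFGS-B')
# theta = res.x
```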
7. Multi-class classification: one-vs-all
- train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
- on a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ (see the sketch below)
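A minimal sketch of one-vs-all, assuming some binary trainer `fit_logistic(X, y)` that returns $\theta$, such as the gradient descent above (the name is hypothetical):

```python
import numpy as np

def one_vs_all(X, y, classes, fit_logistic):
    """Train one binary classifier per class i on the relabeled targets (y == i)."""
    return {i: fit_logistic(X, (y == i).astype(float)) for i in classes}

def predict_class(thetas, x):
    """Pick the class i whose classifier h^(i)(x) gives the highest probability."""
    probs = {i: 1.0 / (1.0 + np.exp(-(x @ th))) for i, th in thetas.items()}
    return max(probs, key=probs.get)
```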
8. The problem of overfitting
- if we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples
- the solutions for overfitting
    - reduce the number of features
        - manually select which features to keep
        - model selection algorithm
    - regularization
        - keep all the features, but reduce the magnitude/values of the parameters $\theta_j$
        - works well when we have a lot of features, each of which contributes a bit to predicting $y$
9. Cost function
- Regularization
    - small values for the parameters $\theta_1, \theta_2, \dots, \theta_n$
    - "simpler" hypothesis
    - less prone to overfitting
10. Regularized linear regression
- regularized cost function:
  $J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
- in the above function,
  $\lambda\sum_{j=1}^{n}\theta_j^2$ is the regularization term
  $\lambda$ is the regularization parameter; it controls the following trade-off
    - the first term fits the training set well
    - the regularization term keeps the values of the parameters small
- Gradient descent
  Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
  $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$ for $j = 1, \dots, n$
  }
  note that $\theta_0$ is not regularized, and the $\theta_j$ update is equivalent to $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
- Normal equation
  $\theta = \left(X^T X + \lambda M\right)^{-1} X^T y$, where $M$ is the $(n+1)$-by-$(n+1)$ matrix $\mathrm{diag}(0, 1, 1, \dots, 1)$
  if $m \le n$, $X^T X$ may be non-invertible/singular, but for $\lambda > 0$, $X^T X + \lambda M$ is always invertible (see the sketch below)
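A NumPy sketch of the regularized normal equation, assuming $X$ already contains the intercept column; `lam` stands for $\lambda$:

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """theta = (X^T X + lambda * M)^(-1) X^T y, with M = diag(0, 1, ..., 1)."""
    n = X.shape[1]
    M = np.eye(n)
    M[0, 0] = 0.0  # theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)
```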
11. Regularized logistic regression
- cost function
  $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
- gradient descent
  the updates take the same form as for regularized linear regression above, except that $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ (see the sketch below)
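A sketch of the regularized cost and gradient for logistic regression ($\theta_0$ excluded from the penalty); it can be handed to an optimizer exactly as in section 6:

```python
import numpy as np

def cost_and_grad_reg(theta, X, y, lam):
    """Regularized J(theta) and its gradient; theta_0 is not penalized."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    reg = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m + reg
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return J, grad
```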
From: http://blog.csdn.net/abcjennifer/article/details/7716281