2-1 Model representation
Linear regression with one variable
2-2 Cost function
Also called the squared error function.
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
- Parameters: $\theta_0, \theta_1$
- Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
- Goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\; J(\theta_0, \theta_1)$
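As a quick sanity check, here is a minimal Python (NumPy) sketch of this cost function; the function name `compute_cost` and the toy data are illustrative assumptions, not part of the course material.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for linear regression."""
    m = len(y)                           # number of training examples
    predictions = theta0 + theta1 * x    # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data: three points on the line y = 2x, so J(0, 2) should be exactly 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))  # 0.0
print(compute_cost(0.0, 0.0, x, y))  # (4 + 16 + 36) / 6 ≈ 9.33
```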
2-3 Gradient descent
Have some function: $J(\theta_0, \theta_1)$
Want: $\min J(\theta_0, \theta_1)$
Outline:
- Start with some $\theta_0, \theta_1$
- Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum, or maybe just a local minimum
Gradient descent algorithm
repeat until convergence
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$
- “:=” denotes assignment; it is different from the truth assertion “=”.
- “α” is the learning rate: it controls how big a step we take downhill with gradient descent. A very large α corresponds to a very aggressive gradient descent procedure.
Correct: Simultaneous update
- $temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
- $temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
- $\theta_0 := temp0$
- $\theta_1 := temp1$
Incorrect:
- $temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
- $\theta_0 := temp0$
- $temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
- $\theta_1 := temp1$
This version is wrong because $temp1$ is computed after $\theta_0$ has already been overwritten, so the two partial derivatives are not evaluated at the same point.
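To make the difference concrete, here is a minimal Python sketch of one gradient descent step done both ways. The helper `grad(theta0, theta1)`, assumed to return both partial derivatives as a pair, is a hypothetical stand-in for the derivative computation.

```python
def step_simultaneous(theta0, theta1, alpha, grad):
    """Correct: both partial derivatives are evaluated at the same point."""
    g0, g1 = grad(theta0, theta1)    # compute both gradients first
    temp0 = theta0 - alpha * g0
    temp1 = theta1 - alpha * g1
    return temp0, temp1              # assign only after both are computed

def step_sequential(theta0, theta1, alpha, grad):
    """Incorrect: theta0 is overwritten before theta1's partial derivative
    is computed, so the second gradient is taken at the wrong point."""
    theta0 = theta0 - alpha * grad(theta0, theta1)[0]
    theta1 = theta1 - alpha * grad(theta0, theta1)[1]   # sees updated theta0
    return theta0, theta1
```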
- If α is too small, gradient descent can be slow.
- If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Gradient descent can converge to a local minimum even with the learning rate α fixed.
As we approach a local minimum, the derivative term becomes smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.
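A one-dimensional toy example illustrates why: for $J(\theta) = \theta^2$ the derivative is $2\theta$, which shrinks as $\theta$ approaches the minimum at 0, so the steps shrink even with a fixed α. A minimal sketch (the function and starting values are assumptions chosen for illustration):

```python
# Gradient descent on J(theta) = theta**2, whose derivative is 2*theta.
theta, alpha = 4.0, 0.1
for i in range(5):
    step = alpha * 2 * theta   # step size is proportional to the gradient
    theta = theta - step
    print(f"iteration {i}: step = {step:.4f}, theta = {theta:.4f}")
# Steps shrink automatically (0.8000, 0.6400, 0.5120, ...) with alpha fixed.
```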
Gradient descent for linear regression
Gradient descent algorithm
The partial derivatives of the cost function work out to:
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$
Plugging these into the update rule, repeat until convergence:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$
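Putting the two update rules together, a batch gradient descent loop for single-variable linear regression might look like the following NumPy sketch; the learning rate, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=2000):
    """Batch gradient descent for single-variable linear regression."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y    # h_theta(x^(i)) - y^(i), all i
        grad0 = np.sum(error) / m            # dJ/dtheta0
        grad1 = np.sum(error * x) / m        # dJ/dtheta1
        theta0 = theta0 - alpha * grad0      # simultaneous update: both
        theta1 = theta1 - alpha * grad1      # gradients use the old thetas
    return theta0, theta1

# Toy data on the line y = 2x + 1; the fit should approach theta0=1, theta1=2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(x, y))
```

Note that both gradients are computed from `error` before either parameter changes, which is exactly the simultaneous update from section 2-3.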
When gradient descent is run on this type of cost function, which you get whenever you use linear regression, it will always converge to the global optimum, because this cost function is convex and has no local optima other than the global one.
“Batch” Gradient Descent
“Batch”: each step of gradient descent uses all the training examples. When computing the derivatives, we compute sums over all m training examples.