# Linear Regression with One Variable (a.k.a. univariate linear regression)

Task: Given a training set of (x, y) = (housing size, housing price), predict y for a new x.

# 1. Model

### 1.1 Notation

m = number of training examples
x = "input" variable / feature
y = "output" variable / "target" variable
$(x^{(i)}, y^{(i)}) = i^{th}$ training example

### 1.2 Hypothesis

$h_\theta(x) = \theta_0 + \theta_1 x$

### 1.3 Cost function

To fit the model above, the idea is to choose parameters $\theta_0$, $\theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$. In practice, we minimize the cost function:

$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

$h_\theta(x^{(i)}) - y^{(i)}$ is the residual, and $\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ is the sum of squared residuals (SSR). The constant factor $\frac{1}{2m}$ does not affect the parameter estimates; by convention we divide the SSR by the sample size $m$ to take an average, and then by 2 to simplify the later differentiation (the 1/2 cancels when we differentiate with respect to the parameters). The full objective $J(\theta_0, \theta_1)$ is called the cost function (also the loss function). There are many kinds of cost functions; this one is the most common for regression problems.
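The cost function above can be sketched in a few lines of NumPy. This is a minimal illustration on a made-up toy training set; the function and variable names are my own.

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) with the 1/(2m) convention."""
    m = len(x)
    predictions = theta0 + theta1 * x        # h_theta(x^(i)) for every example
    residuals = predictions - y              # h_theta(x^(i)) - y^(i)
    return np.sum(residuals ** 2) / (2 * m)  # (1/2m) * SSR

# Toy training set: (housing size, housing price)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

print(compute_cost(x, y, 0.0, 1.0))  # a perfect fit gives J = 0
print(compute_cost(x, y, 0.0, 0.0))  # residuals 1, 2, 3 -> (1+4+9)/6
```

A perfect fit drives every residual to zero, so $J = 0$ is the smallest value the cost can take.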

### 1.4 Goal

$\min\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Contour plot: a graph that contains many contour lines, each connecting points where $J(\theta_0, \theta_1)$ takes the same value.

# 2. Parameter Learning

Have some function $J(\theta_0, \theta_1)$
Want $\min\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

• Start with some $\theta_0, \theta_1$
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum

Step 1: Randomly initialize the parameters.
Step 2: Take a small step in the direction of steepest descent.
Step 3: Keep repeating Step 2.
Step 4: Arrive at a local optimum.

repeat until convergence {
$\quad\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)\qquad$ (for j = 0 and j = 1)
}
where := denotes the assignment operator and α denotes the learning rate, a positive number that controls how big a step we take when updating the parameters.

Correct: simultaneous update
$temp0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)$
$temp1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)$
$\theta_0 := temp0$
$\theta_1 := temp1$

Incorrect:
$temp0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)$
$\theta_0 := temp0$
$temp1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)$
$\theta_1 := temp1$

In the incorrect version, $\theta_0$ is overwritten before $\theta_1$'s derivative is computed, so $\theta_1$'s update uses the new $\theta_0$ instead of the current one.
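The simultaneous update can be sketched as follows. This is a toy NumPy illustration, not the course's reference code; it plugs in the standard partial derivatives of the squared-error cost, $\frac{\partial J}{\partial\theta_0} = \frac{1}{m}\sum(h_\theta(x^{(i)}) - y^{(i)})$ and $\frac{\partial J}{\partial\theta_1} = \frac{1}{m}\sum(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}$.

```python
import numpy as np

def gradient_descent_step(x, y, theta0, theta1, alpha):
    """One iteration of gradient descent with a simultaneous update."""
    m = len(x)
    residuals = (theta0 + theta1 * x) - y
    # Compute both updates from the *current* parameters first...
    temp0 = theta0 - alpha * np.sum(residuals) / m
    temp1 = theta1 - alpha * np.sum(residuals * x) / m
    # ...then assign, so theta1's update never sees an already-updated theta0.
    return temp0, temp1

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # true relation: y = 2x

theta0, theta1 = 0.0, 0.0
for _ in range(2000):
    theta0, theta1 = gradient_descent_step(x, y, theta0, theta1, alpha=0.1)

print(theta0, theta1)  # converges near theta0 = 0, theta1 = 2
```

Note that the tuple assignment `theta0, theta1 = gradient_descent_step(...)` mirrors the temp0/temp1 pattern: both new values are computed before either parameter changes.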

### 2.3 Gradient Descent for Linear Regression

$\theta_1 := \theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)\qquad \alpha > 0$

1. If α is too small, gradient descent can be very slow.
2. If α is too large, gradient descent may overshoot the optimum (e.g., jump straight from point A to point B); it may fail to converge, or even diverge.
3. Even with α held fixed, gradient descent can still converge to a local optimum, because the step size automatically shrinks as the gradient approaches zero near the optimum.

Typical values to try for α: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
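The effect of sweeping α over such a grid can be seen on a toy data set. The sketch below (my own illustration, not course code) runs a fixed number of gradient descent iterations per learning rate and prints the final cost: tiny α barely makes progress, moderate α converges, and for this data α = 1.0 is large enough to diverge.

```python
import numpy as np

def run_gd(x, y, alpha, iters=100):
    """Run gradient descent for `iters` steps and return the final cost J."""
    m = len(x)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        residuals = (theta0 + theta1 * x) - y
        grad0 = np.sum(residuals) / m
        grad1 = np.sum(residuals * x) / m
        # Simultaneous update via tuple assignment
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    residuals = (theta0 + theta1 * x) - y
    return np.sum(residuals ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    print(f"alpha={alpha}: J={run_gd(x, y, alpha):.6f}")
```

On this data, the final cost shrinks as α grows toward 0.3, then explodes at 1.0; which α diverges depends on the data's scale, which is why a roughly 3× grid like the one above is a practical way to search.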
