# Linear Regression

### Hypothesis Function

The hypothesis is the function being fitted, written as $h_{\theta}(x)=\theta^Tx$, where $\theta=[\theta_0,\theta_1,\cdots,\theta_n]$ and $x=[x_0,x_1,\cdots,x_n]$ with $x_0=1$.

$$h_\theta(x) =\begin{bmatrix}\theta_0 & \theta_1 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix}= \theta^T x$$
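As a minimal sketch (the numbers are illustrative), the hypothesis is a single dot product once the bias term $x_0=1$ is prepended to the feature vector:

```python
import numpy as np

# Hypothetical parameters and one sample, with the bias term x_0 = 1 prepended.
theta = np.array([1.0, 2.0, 3.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 0.5, -1.0])      # [x_0, x_1, x_2], x_0 = 1

# h_theta(x) = theta^T x
h = theta @ x
```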

### Cost Function

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
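A vectorized sketch of the cost, with $X$ holding one sample per row (the toy data here is illustrative):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h(x^(i)) - y^(i))^2), computed as a squared norm."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

# Toy data: two samples, a bias column of ones plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0])
```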

### Optimization Methods

#### Gradient Descent

repeat until convergence:

$$\theta_j=\theta_j-\alpha \dfrac{\partial}{\partial \theta_j}J(\theta)$$

$$\theta_j = \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}\right],\quad (j=0,1,2,\cdots,n)$$
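The update above can be sketched as a batch gradient descent loop (learning rate, iteration count, and data are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression; X has a leading column of ones."""
    m, n1 = X.shape
    theta = np.zeros(n1)
    for _ in range(iters):
        # Vectorized form of (1/m) * sum((h(x^(i)) - y^(i)) * x_j^(i)) for all j,
        # so every theta_j is updated simultaneously.
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

# Fit y = 1 + 2x on noise-free data.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
```

Computing the gradient as one matrix product is what makes the "simultaneous update" of all $\theta_j$ automatic.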

#### Normal Equation

$x^{(i)}=\begin{bmatrix}x_0^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_n^{(i)}\end{bmatrix}$, $\theta=\begin{bmatrix}\theta_0 \\ \theta_1 \\ \vdots \\ \theta_n\end{bmatrix}$, $y=\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_m\end{bmatrix}$

$X=\begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$, $X \in \mathbb{R}^{m \times (n+1)}$

$$\begin{aligned}J(\theta) &=\frac{1}{2m}(X\theta-y)^T(X\theta-y)\\ &=\frac{1}{2m}\left(\theta^T X^T X\theta - 2y^T X\theta + y^T y\right)\end{aligned}$$

$$\frac{\partial}{\partial\theta}J(\theta)=\frac{1}{2m}\left(2X^T X\theta - 2X^T y\right)$$

Setting the gradient to zero gives the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$
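Setting that gradient to zero and solving $X^TX\theta = X^Ty$ for $\theta$ can be sketched as follows (using a linear solve rather than an explicit inverse, which is numerically preferable; the data is illustrative):

```python
import numpy as np

# Normal equation: theta = (X^T X)^{-1} X^T y, via solve() instead of inv().
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])          # y = 1 + 2x, noise-free
theta = np.linalg.solve(X.T @ X, X.T @ y)
```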

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate the inverse of $X^TX$ |
| Works well when $n$ is large | Slow if $n$ is very large |

##### Non-invertibility

When $X^TX$ is non-invertible, it is called singular or degenerate. This usually has one of two causes:

- Linearly dependent features
- Too many features, so that $m \le n$
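In either case a pseudo-inverse (e.g. `np.linalg.pinv`) still yields a usable minimum-norm solution; a sketch with two linearly dependent features (data is illustrative):

```python
import numpy as np

# The third column is exactly 2x the second, so X^T X is singular,
# but the pseudo-inverse still produces a solution that fits the data.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([3.0, 5.0, 7.0])
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)
```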

### Overfitting and Regularization

- Reduce the number of features. Two approaches:

  - Manually select which features to keep
  - Use a model selection algorithm

  Drawback: some information is discarded.

- Regularization: keep all the features but shrink the parameter values $\theta$. This works well for problems with many features, each of which contributes to the prediction.

  - Simplifies the hypothesis $h_{\theta}(x)$
  - Reduces the tendency to overfit

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda \sum_{j=1}^{n}\theta_j^2\right]$$

#### Gradient Descent

$$\theta_0 = \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$

$$\theta_j = \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right],\quad (j=1,2,\cdots,n)$$

$$\theta_j = \left(1-\alpha\frac{\lambda}{m}\right)\theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}\right],\quad (j=1,2,\cdots,n)$$
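The regularized update, leaving $\theta_0$ unpenalized, can be sketched as (data and hyperparameters are illustrative):

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=0.1, lam=1.0, iters=2000):
    """Gradient descent with L2 regularization; the bias theta_0 is not penalized."""
    m, n1 = X.shape
    theta = np.zeros(n1)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]   # add (lambda/m) * theta_j for j >= 1 only
        theta -= alpha * grad
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = ridge_gradient_descent(X, y)
```

Compared with the unregularized fit $[1, 2]$, the penalty pulls $\theta_1$ toward zero (the bias shifts to compensate).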

#### Normal Equation

$$\begin{aligned}J(\theta) &=\frac{1}{2m}(X\theta-y)^T(X\theta-y)+\frac{\lambda}{2m}\theta^T L\theta\\ &=\frac{1}{2m}\left(\theta^T X^T X\theta - 2y^T X\theta + y^T y + \lambda\theta^T L\theta\right)\end{aligned}$$

$$\frac{\partial}{\partial\theta}J(\theta)=\frac{1}{2m}\left(2X^TX\theta-2X^Ty+2\lambda L\theta\right)$$

Setting the gradient to zero gives

$$\theta = (X^TX+\lambda L)^{-1}X^Ty$$

$$L=\begin{bmatrix}0 & 0 & 0 & \cdots & 0\\ 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 1\end{bmatrix}$$
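With $L$ the identity except for a zero in the $(0,0)$ entry, the regularized closed form $\theta=(X^TX+\lambda L)^{-1}X^Ty$ can be sketched as (same illustrative data as above):

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
lam = 1.0

# L: identity with the (0, 0) entry zeroed, so theta_0 is not penalized.
L = np.eye(X.shape[1])
L[0, 0] = 0.0

theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

A side benefit: for $\lambda > 0$, $X^TX + \lambda L$ is invertible even when $X^TX$ alone is singular.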

### Tips

#### Tip 1: Feature Scaling

- Divide each feature by its maximum value.

- Mean normalization:

  Replace $x_i$ with $x_i-\mu_i$ or $\frac{x_i-\mu_i}{s_i}$ so that each feature has a mean of approximately 0 (do not apply this to $x_0$), where $\mu_i$ is the mean and $s_i = \max - \min$ (the range) or the standard deviation.
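A sketch of mean normalization with standard-deviation scaling (the raw feature matrix here deliberately excludes the bias column $x_0$):

```python
import numpy as np

def mean_normalize(X):
    """Replace each feature column with (x - mu) / s, s = standard deviation.
    The bias column x_0 is assumed NOT to be included in X."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s

# Two features on very different scales (illustrative values).
X_raw = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
X_scaled, mu, s = mean_normalize(X_raw)
```

Keep `mu` and `s` around: any new sample must be scaled with the same statistics before prediction.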

### Feature Choice and Polynomial Regression

- Feature choice

  You will often face many candidate features. Rather than fitting the raw features directly, it can pay to process or combine them based on the problem, for example multiplying two features to create a new one.

- Polynomial regression

  The hypothesis may not be linear in the raw input. In that case, simply construct new features and fit on those instead. For example, to fit $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$, construct $x_2=(x_1)^2$ and $x_3=(x_1)^3$. Note that feature scaling becomes important here.
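The cubic example above reduces to ordinary linear regression on constructed columns; a quadratic sketch with illustrative, noise-free data:

```python
import numpy as np

# Fit h(x) = theta_0 + theta_1*x + theta_2*x^2 by treating [1, x, x^2] as features.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 1.0 * x + 3.0 * x**2          # quadratic target

X = np.column_stack([np.ones_like(x), x, x**2])
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

With higher degrees the constructed columns differ widely in scale, which is exactly why feature scaling matters in this setup.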