Coursera Machine Learning Week 2 Study Notes

JinbaoSite0144

Article link: https://blog.csdn.net/codeforcer/article/details/60763034

Note: this article has moved to: http://blog.csdn.net/JinbaoSite/article/details/66530379

4. Linear Regression with Multiple Variables

4.1 Multivariate linear regression model

(1) Hypothesis:

$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$

Setting $x_0 = 1$, this becomes

$\begin{aligned} h_\theta (x) & = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n \\ & = \left[ \begin{array}{cccc} \theta_0 & \theta_1 & \cdots & \theta_n \end{array} \right] \left[ \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_n \end{array} \right] \\ & = \theta^T x \end{aligned}$

(2) Parameters:

$\theta_0 , \theta_1 , \theta_2 , \theta_3 , \cdots , \theta_n$

(3) Cost function:

$J(\theta_0, \theta_1 , \cdots , \theta_n) = \dfrac{1}{2m} \displaystyle \sum_{i=1}^m \left( h_\theta ( x^{(i)} ) - y^{(i)} \right)^2$
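The hypothesis and cost function above can be sketched in vectorized form. This is only an illustration under stated assumptions: it uses NumPy, the function names `hypothesis` and `cost` are made up for this note, and the toy data is hypothetical.

```python
import numpy as np

def hypothesis(X, theta):
    """Return h_theta(x) = theta^T x for every row of X.

    X is assumed to already contain the x_0 = 1 column.
    """
    return X @ theta

def cost(X, y, theta):
    """Return J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    residuals = hypothesis(X, theta) - y
    return float(residuals @ residuals) / (2 * m)

# Hypothetical toy data: m = 3 examples, n = 2 features.
features = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
X = np.column_stack([np.ones(len(features)), features])  # prepend x_0 = 1
theta = np.array([1.0, 2.0, 3.0])
y = X @ theta  # targets generated by theta itself, so J(theta) is 0

print(cost(X, y, theta))
print(cost(X, y, np.zeros(3)))
```

Stacking the examples as rows of `X` turns the sum over $i$ into a single matrix-vector product, which is why the $x_0 = 1$ convention is convenient.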

4.2 Gradient Descent for Multiple Variables

This is analogous to the single-variable case. All $\theta_j$ are updated simultaneously:

$\begin{aligned} & \text{repeat until convergence: } \lbrace \\ & \quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) \cdot x_0^{(i)} \\ & \quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) \cdot x_1^{(i)} \\ & \quad \cdots \\ & \quad \theta_n := \theta_n - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) \cdot x_n^{(i)} \\ & \rbrace \end{aligned}$

More compactly:

$\begin{aligned} & \text{repeat until convergence: } \lbrace \\ & \quad \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) \cdot x_j^{(i)} \qquad \text{for } j := 0 \ldots n \\ & \rbrace \end{aligned}$
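A minimal sketch of the update rule, assuming NumPy and a fixed number of iterations in place of a real convergence test. The function name `gradient_descent`, the toy data and the chosen `alpha` are all hypothetical.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent for linear regression.

    X is assumed to include the x_0 = 1 column. A fixed iteration count
    stands in for "repeat until convergence" to keep the sketch short.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for all i
        gradient = (X.T @ errors) / m     # entry j: (1/m) * sum_i errors_i * x_j^(i)
        theta = theta - alpha * gradient  # updates every theta_j simultaneously
    return theta

# Hypothetical toy data generated from y = 1 + 2 * x_1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = gradient_descent(X, y, alpha=0.1, num_iters=2000)
print(theta)
```

Computing the whole gradient vector before touching `theta` is what makes the update simultaneous: no $\theta_j$ is computed from a partially updated parameter vector.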

4.3 Feature Scaling

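1. Feature scaling: when the features take values on very different ranges, rescaling them to similar, small ranges makes gradient descent converge faster.
2. Method: divide each feature by its largest value over the $m$ training examples:

$\begin{aligned} & s_i := \max \left( x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(m)} \right) \\ & x_i := \dfrac{x_i}{s_i} \end{aligned}$

3. Example (figure not reproduced here): gradient descent in the left plot needs more steps to reach the minimum than in the right plot.
4. In practice, we try to bring every feature into roughly the range -1 to 1. Falling somewhat outside this range is fine, since there is no strict requirement, but the range should not be much larger or much smaller.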
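A small sketch of scaling each feature by its largest value, assuming NumPy. As a labeled deviation, it divides by the largest absolute value so that negative inputs also land in [-1, 1]. The function name `scale_by_max` and the sample values are hypothetical.

```python
import numpy as np

def scale_by_max(features):
    """Divide every column (feature) by its largest absolute value.

    Assumption: no feature is identically zero, otherwise s would contain 0.
    """
    s = np.abs(features).max(axis=0)  # one scale factor s_i per feature
    return features / s, s

# Hypothetical data: column 0 is in the thousands, column 1 is single digits.
features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [852.0, 2.0]])

scaled, s = scale_by_max(features)
print(s)
print(scaled)
```

After scaling, both columns lie within [-1, 1] even though the raw ranges differ by about three orders of magnitude.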

4.4 Mean Normalization

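Mean normalization is another form of feature normalization. It rests on the same principle and serves the same purpose as feature scaling.
$\mu_i$: the mean of feature $i$
$s_i$: the range of feature $i$ (maximum minus minimum)

$x_i := \dfrac{x_i - \mu_i}{s_i}$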
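The formula can be sketched as follows, assuming NumPy. The function name `mean_normalize` and the data are hypothetical.

```python
import numpy as np

def mean_normalize(features):
    """Apply x_i := (x_i - mu_i) / s_i column by column.

    mu_i is the column mean, s_i is the column range (max - min).
    Assumption: every feature takes at least two distinct values, so s_i != 0.
    """
    mu = features.mean(axis=0)
    s = features.max(axis=0) - features.min(axis=0)
    return (features - mu) / s, mu, s

# Hypothetical data with two features on different scales.
features = np.array([[100.0, 1.0],
                     [200.0, 3.0],
                     [300.0, 5.0]])

normalized, mu, s = mean_normalize(features)
print(mu, s)
print(normalized)
```

The function returns `mu` and `s` so that the same transformation can later be applied to new inputs before making a prediction.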

4.5 Learning rate $\alpha$

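If the learning rate $\alpha$ is chosen well, plotting the cost $J(\theta)$ against the number of gradient descent iterations gives a curve that decreases on every iteration and then flattens out (figure not reproduced here).

If the learning rate $\alpha$ is too large, the cost may instead increase or oscillate from iteration to iteration (the two cases shown in the original figures, not reproduced here). In that case we need to choose a smaller $\alpha$.

In short, every iteration of gradient descent is affected by the learning rate. If $\alpha$ is too small, a very large number of iterations is needed to converge. If $\alpha$ is too large, an iteration may fail to decrease the cost function, and the algorithm may overshoot the minimum and fail to converge.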
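The effect of the learning rate can be seen by recording the cost at every iteration. The sketch below assumes NumPy. The function name `cost_history`, the toy data and the three values of `alpha` are hypothetical and were picked only to show a too-small, a reasonable and a too-large setting for this particular data set.

```python
import numpy as np

def cost_history(X, y, alpha, num_iters):
    """Run gradient descent from theta = 0 and record J(theta) before each update."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        history.append(float(errors @ errors) / (2 * m))
        theta = theta - alpha * (X.T @ errors) / m
    return history

# Hypothetical toy data generated from y = 1 + 2 * x_1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

for alpha in (0.001, 0.1, 1.0):
    history = cost_history(X, y, alpha, num_iters=50)
    trend = "decreased" if history[-1] < history[0] else "increased"
    print(f"alpha={alpha}: cost {trend} from {history[0]:.3g} to {history[-1]:.3g}")
```

Plotting `history` against the iteration index gives the kind of curve described above. For this data the smallest `alpha` lowers the cost only slowly, and the largest one makes it grow.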

4.6 Polynomial Regression

A straight line does not fit every data set. Sometimes we need a curve to fit the data.
Suppose we have the cubic model $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$. To solve it with linear regression, we apply the following substitution:

$\begin{aligned} & h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 \\ & y_1 = x_1 \\ & y_2 = x_1^2 \\ & y_3 = x_1^3 \\ & h_\theta(y) = \theta_0 + \theta_1 y_1 + \theta_2 y_2 + \theta_3 y_3 \end{aligned}$

Feature scaling becomes essential here, because $x_1$, $x_1^2$ and $x_1^3$ take values on very different ranges.
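A minimal sketch of the substitution, assuming NumPy. The function name `cubic_features` and the input values are hypothetical.

```python
import numpy as np

def cubic_features(x1):
    """Map a single feature x_1 to the three features (x_1, x_1^2, x_1^3)."""
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([x1, x1 ** 2, x1 ** 3])

# Hypothetical inputs.
x1 = np.array([1.0, 2.0, 3.0])
Y = cubic_features(x1)

print(Y)
print(Y.max(axis=0))  # the per-column maxima already differ noticeably
```

The resulting matrix can be passed to ordinary linear regression once the $x_0 = 1$ column is prepended. With larger inputs the gap between the columns grows quickly, which is why scaling matters for polynomial features.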

5. Normal Equation

5.1 Normal equation

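Consider the cost function $J(\theta_0, \theta_1 , \cdots , \theta_n) = \dfrac{1}{2m} \displaystyle \sum_{i=1}^m \left( h_\theta ( x^{(i)} ) - y^{(i)} \right)^2$. To find its minimum analytically, we take the partial derivative with respect to every parameter, set each one to zero, $\frac{\partial J(\theta)}{\partial \theta_j} = 0$, and solve for the $(\theta_0, \theta_1 , \cdots , \theta_n)$ that minimizes $J(\theta_0, \theta_1 , \cdots , \theta_n)$.
The solution is

$\theta = (X^T X)^{-1} X^T y$

where $X$ is the $m \times (n+1)$ matrix whose rows are the training examples (including $x_0 = 1$) and $y$ is the vector of target values.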
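The closed-form solution can be sketched with NumPy. Note the swapped-in technique: rather than forming $(X^T X)^{-1}$ explicitly, this sketch solves the linear system $X^T X \theta = X^T y$ with `np.linalg.solve`, which gives the same $\theta$ when $X^T X$ is invertible. The function name `normal_equation` and the toy data are hypothetical.

```python
import numpy as np

def normal_equation(X, y):
    """Return theta = (X^T X)^{-1} X^T y.

    Solves X^T X theta = X^T y instead of computing the inverse explicitly.
    Assumption: X^T X is invertible (np.linalg.solve raises LinAlgError otherwise).
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2 * x_1; X includes x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = normal_equation(X, y)
print(theta)
```

No learning rate and no iteration are involved: the parameters come out of a single linear solve.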

5.2 Gradient descent vs. the normal equation

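Gradient descent	Normal equation
Requires choosing a learning rate $\alpha$	No learning rate needed
Requires many iterations	Computed in a single step
Works well even when the number of features n is large	Requires computing $(X^T X)^{-1}$, which is expensive when n is large; as a rule of thumb it is still acceptable for n below 10000
Applies to many kinds of models	Applies only to linear models, not to logistic regression or other models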
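To make the comparison concrete, the sketch below fits the same small data set both ways and checks that the two answers agree. It assumes NumPy. The data, the function names and the settings `alpha=0.1` and `num_iters=5000` are hypothetical choices that happen to work for this data.

```python
import numpy as np

def fit_gradient_descent(X, y, alpha, num_iters):
    """Iterative fit: needs a learning rate and many passes over the data."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    return theta

def fit_normal_equation(X, y):
    """Direct fit: one linear solve of X^T X theta = X^T y, no learning rate."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noisy data, roughly y = 2 + 3 * x_1; X includes x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.1, 4.9, 8.2, 10.8])

theta_gd = fit_gradient_descent(X, y, alpha=0.1, num_iters=5000)
theta_ne = fit_normal_equation(X, y)
print(theta_gd)
print(theta_ne)
```

On a problem this small the direct solve is the simpler option. The iterative version pays off when n is large enough that solving the $(n+1) \times (n+1)$ system becomes the bottleneck.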