I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, for later review and consolidation.
My knowledge is limited, so please bear with any errors or omissions and do point them out. Fellow learners are very welcome to join the discussion!
Week 02
2.1 Multivariate Linear Regression
2.1.1 Multiple Features
- The multivariable form of the hypothesis function:
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$
- Remark: For convenience, assume $x_0^{(i)} = 1$ for $i \in \{1, \cdots, m\}$.
- The cost function $J(\theta)$ has the same form:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
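A minimal NumPy sketch of the vectorized cost (the function name and toy data are my own, not from the course):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2, with h_theta(x) = theta^T x.

    X is the m x (n+1) design matrix whose first column is x0 = 1,
    y the m targets, theta the (n+1) parameters.
    """
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i at once
    return residuals @ residuals / (2 * m)

# Toy data (made up): x0 = 1 column plus two features; y = 1 + 2*x1 + x2 exactly.
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])
print(compute_cost(X, y, np.array([1.0, 2.0, 1.0])))   # 0.0 at the true theta
```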
2.1.2 Gradient Descent
- Gradient descent for multivariate linear regression (Algorithm 1):
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
(simultaneously update $\theta_j$ for $j = 0, \cdots, n$)
}
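The update vectorizes to a single matrix expression; a sketch (function name, defaults, and toy data are my own assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent for multivariate linear regression.

    The vectorized step theta -= (alpha/m) * X^T (X theta - y) performs
    theta_j := theta_j - alpha*(1/m)*sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    for all j = 0..n at once, i.e. the required simultaneous update.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        theta -= alpha / m * (X.T @ (X @ theta - y))
    return theta

# Same toy data as above; y = 1 + 2*x1 + x2, so theta should approach [1, 2, 1].
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])
print(gradient_descent(X, y))
```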
2.1.3 Practical Tricks in GD
- Feature Scaling ($s_i$)
- Idea: make sure features are on a similar scale. This is because $\theta$ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the features are very uneven.
- Get every feature into approximately a $-1 \le x_i \le 1$ range (the number 1 is not a strict requirement; any range of a similar order works).
- Remark: The quizzes in this course use the range; the programming exercises use the standard deviation.
- Mean Normalization ($\mu_i$)
- Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (do not apply to $x_0 = 1$).
- In general, combining both tricks (see the sketch below):
$$x_i := \frac{x_i - \mu_i}{s_i}$$
where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values ($\max - \min$), or $s_i$ is the standard deviation.
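A sketch of feature scaling with mean normalization (my own helper; it uses the standard deviation for $s_i$, as the programming exercises do):

```python
import numpy as np

def normalize_features(X):
    """x_i := (x_i - mu_i) / s_i for every feature column.

    Apply this to the raw feature columns *before* adding the x0 = 1
    column, since x0 must not be normalized. mu and s are returned as
    well: later inputs must be scaled with the same statistics.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)   # standard deviation as s_i; use X.max(0) - X.min(0) for the range
    return (X - mu) / s, mu, s

# Toy data (made up): e.g. house size in square feet and number of bedrooms.
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0],
                  [1416.0, 2.0]])
X_norm, mu, s = normalize_features(X_raw)
print(X_norm.mean(axis=0))   # ~ [0, 0]
print(X_norm.std(axis=0))    # ~ [1, 1]
```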
- Learning Rate Check
- Debug gradient descent: make a plot with the number of iterations on the x-axis and $J(\theta)$ on the y-axis, and check whether $J(\theta)$ converges (its per-iteration decrease approaches zero):
- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration, and may not converge at all.
- Try values such as $1 \times 10^{k}$ or $3 \times 10^{k}$ (i.e. $\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, \ldots$) and judge from the plot, as sketched below.
- It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
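A sketch of this debugging plot (the toy data and the grid of candidate rates are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def cost_history(X, y, alpha, num_iters=400):
    """Run gradient descent, recording J(theta) after each iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta -= alpha / m * (X.T @ (X @ theta - y))
        r = X @ theta - y
        history.append(r @ r / (2 * m))
    return history

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])

# J should fall on every iteration: a rising or oscillating curve means
# alpha is too large; a nearly flat one means alpha is too small.
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1):
    plt.plot(cost_history(X, y, alpha), label=f"alpha = {alpha}")
plt.xlabel("iterations")
plt.ylabel(r"$J(\theta)$")
plt.legend()
plt.show()
```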
2.1.4 Improvement of Linear Regression
- Feature Combination
- Combine several features into one using a variety of methods; e.g. merge a lot's frontage $x_1$ and depth $x_2$ into a single area feature $x_3 = x_1 \cdot x_2$.
- Polynomial Regression
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1^{a_1} + \theta_2 x_2^{a_2} + \cdots + \theta_n x_n^{a_n}$$
- Remark: One important thing to keep in mind is that if you choose your features this way, then feature scaling becomes very important.
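For example, a cubic fit to a single feature is just linear regression over the columns $x, x^2, x^3$; a sketch under that assumption (helper name and data are mine), including the scaling step the remark calls for:

```python
import numpy as np

def polynomial_features(x, degree):
    """Build and scale the columns x, x^2, ..., x^degree from one raw feature.

    Without scaling, x^3 spans a vastly larger range than x (sizes of
    1-1000 give cubes up to 10^9), which is why scaling matters here.
    """
    P = np.column_stack([x ** d for d in range(1, degree + 1)])
    P = (P - P.mean(axis=0)) / P.std(axis=0)   # mean-normalize each power column
    return np.c_[np.ones(len(x)), P]           # add x0 = 1 after scaling

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = polynomial_features(x, degree=3)           # columns: 1, x, x^2, x^3 (scaled)
t = x ** 3 - 2 * x                             # cubic toy targets
theta = np.linalg.pinv(X.T @ X) @ X.T @ t      # closed-form fit
print(X @ theta)                               # reproduces t closely
```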
2.2 Another Method: the Normal Equation
2.2.1 Normal Equation
$$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \;(1 \le i \le m), \quad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
and
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$
Then the normal equation formula is given below:
$$\theta = (X^T X)^{-1} X^T y$$
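A one-line NumPy sketch (my own wrapper); using the pseudoinverse instead of a plain inverse keeps it working even when $X^T X$ is singular, e.g. with redundant features or $m \le n$:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, solved in closed form (no alpha, no loop).

    np.linalg.pinv is used instead of np.linalg.inv so the call still
    succeeds when X^T X is non-invertible (redundant features, or m <= n).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Same toy data as in the gradient descent sketches; note that no
# feature scaling is needed for the normal equation.
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])
print(normal_equation(X, y))   # [1, 2, 1] exactly
```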
2.2.2 Comparison of GD and NE
- Gradient Descent
- Need to choose the learning rate $\alpha$
- Needs many iterations
- $O(kn^2)$ for $k$ iterations
- Works well even when $n$ is large
- Normal Equation
- No need to choose $\alpha$ or to iterate
- $O(n^3)$: need to compute $(X^T X)^{-1}$
- Slow if $n$ is large (see the sanity check below)
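As a quick sanity check (toy data repeated from the sketches above), the two methods should agree on $\theta$ when gradient descent is run long enough with a sensible $\alpha$:

```python
import numpy as np

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])

theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y    # one O(n^3) solve, no alpha

theta_gd = np.zeros(3)                          # k iterations of O(kn^2) work
for _ in range(5000):
    theta_gd -= 0.1 / len(y) * (X.T @ (X @ theta_gd - y))

print(theta_ne, theta_gd)   # both approach [1, 2, 1]
```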