Stanford ML - Lecture 2 - Linear regression with multiple variables

Multiple features


for convenience of notation, define x_0 = 1; the new hypothesis is

h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n
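As a quick illustration (not part of the lecture notes), a minimal NumPy sketch of the vectorized hypothesis, with the x_0 = 1 entry prepended by hand:

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return theta @ x

# example with n = 2 features
theta = np.array([1.0, 0.5, -0.2])   # theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 7.0])        # x_0 = 1, x_1, x_2
print(hypothesis(theta, x))          # 1 + 0.5*3 - 0.2*7 = 1.1
```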

Gradient descent for multiple variables
the new update rule is (repeat until convergence, updating \theta_j simultaneously for all j = 0, ..., n):

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
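A minimal NumPy sketch of this batch update (an illustration; alpha and num_iters are hypothetical defaults, not values from the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.
    X is the m-by-(n+1) design matrix with a leading column of ones;
    y has length m."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # vectorized form of the per-j update; all theta_j move simultaneously
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    return theta
```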


Gradient descent in practice I: Feature Scaling
make sure features are on a similar scale (e.g. get every feature into approximately a -1 <= x_i <= 1 range)
  • feature scaling: divide each feature by its range

  • mean normalization: replace x_i with (x_i - \mu_i) / s_i, where \mu_i is the mean of feature i and s_i is its range or standard deviation (see the sketch below)
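A minimal NumPy sketch of mean normalization (here taking s_i to be the standard deviation; the example values are hypothetical house sizes and bedroom counts):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to zero mean and roughly unit range."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
X_norm, mu, sigma = mean_normalize(X)   # keep mu, sigma to scale new inputs later
```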

Gradient descent in practice II: Learning rate
  • make sure gradient descent is working correctly: plot J(\theta) against the number of iterations; J(\theta) should decrease on every iteration
  • if gradient descent is not working (J(\theta) increasing or bouncing around), use a smaller \alpha
  • to choose \alpha, try values spaced roughly 3x apart, e.g. ..., 0.001, 0.003, 0.01, 0.03, ..., 0.1, ..., 1, ... (see the sketch below)
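As a rough, illustrative sketch (synthetic data; the helper names and candidate \alpha list are assumptions), one can run a few iterations per candidate \alpha and check whether J(\theta) decreases:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum of squared errors."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def run(X, y, alpha, num_iters=50):
    """Run a few gradient-descent steps and record J(theta)."""
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        J_history.append(cost(X, y, theta))
    return J_history

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    J = run(X, y, alpha)
    print(f"alpha = {alpha}: J went {J[0]:.4f} -> {J[-1]:.4f}")
```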
Features and polynomial regression

choice of features: e.g. for predicting house prices, instead of using frontage and depth as two separate features, define a single feature area = frontage × depth; polynomial regression fits models such as

h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3

by treating x_1 = x, x_2 = x^2, x_3 = x^3 as features; feature scaling then matters, since x, x^2, x^3 take very different ranges (see the sketch below)
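A minimal NumPy sketch of building and scaling polynomial features for the cubic model above (hypothetical raw values):

```python
import numpy as np

size = np.array([1.0, 2.0, 3.0, 4.0])                 # single raw feature x
X_poly = np.c_[size, size**2, size**3]                # x_1 = x, x_2 = x^2, x_3 = x^3
X_poly = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)  # scale each column
X = np.c_[np.ones(len(size)), X_poly]                 # prepend x_0 = 1
```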



Normal equations

a method to solve for \theta analytically

if there are m examples and each example has n input features, then \theta is an (n+1)-by-1 vector, X is the m-by-(n+1) design matrix (one example per row, with a leading x_0 = 1 column), and y is an m-dimensional vector; the normal equation is

\theta = (X^T X)^{-1} X^T y
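A minimal NumPy sketch of the normal equation (using np.linalg.solve instead of an explicit matrix inverse, a standard numerical practice; the example data are hypothetical house sizes and prices):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.c_[np.ones(4), np.array([2104.0, 1416.0, 1534.0, 852.0])]
y = np.array([460.0, 232.0, 315.0, 178.0])
theta = normal_equation(X, y)   # exact least-squares fit: no alpha, no iterations
```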






Advantages and disadvantages of gradient descent and normal equation

m training examples, n features

Gradient descent

  • need to choose \alpha
  • needs many iterations
  • works well even when n is large
Normal equation
  • no need to choose \alpha
  • no need to iterate
  • needs to compute (X^T X)^{-1}, roughly an O(n^3) operation (see the timing sketch after this list)
  • slow if n is very large
  • if n is larger than about 10,000 to 100,000, consider choosing gradient descent instead
  • only suitable for linear regression
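To make the O(n^3) cost concrete, a small, illustrative timing sketch (the m and n values are arbitrary choices, not from the lecture):

```python
import time
import numpy as np

m = 3000
for n in (250, 500, 1000, 2000):
    X = np.random.randn(m, n + 1)
    y = np.random.randn(m)
    t0 = time.perf_counter()
    # forming X^T X and solving the (n+1)-by-(n+1) system is the costly step
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(f"n = {n:4d}: {time.perf_counter() - t0:.3f} s")
```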
Normal equation and non-invertibility (optional)
  • redundant features (linearly dependent)
  • too many features (e.g. m ≤ n)
    • delete some features, or use regularization (a pseudoinverse also works; see the sketch below)
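If X^T X is non-invertible, NumPy's np.linalg.pinv (pseudoinverse) still returns a usable \theta; a minimal sketch with a deliberately redundant feature (an illustration, not from the lecture):

```python
import numpy as np

# duplicate (linearly dependent) feature columns make X^T X singular
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.c_[np.ones(4), x1, 2 * x1]             # third column = 2 * second column
y = np.array([2.0, 4.0, 6.0, 8.0])

theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)   # pseudoinverse instead of inverse
print(theta, X @ theta)                       # fitted values still match y
```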