Multivariate Linear Regression
Notation
$x^{(i)}_{j}$ = value of feature $j$ in the $i^{th}$ training example
$x^{(i)}$ = the input (features) of the $i^{th}$ training example
Hypothesis Function
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n}$
Rewrite in matrix notation:
$h_{\theta}(x) = \begin{bmatrix} \theta_{0} & \theta_{1} & \theta_{2} & \dots & \theta_{n} \end{bmatrix} \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} = \theta^{T}x$, where $x_{0} = 1$.
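The vectorized form $\theta^{T}x$ is a single dot product. A minimal sketch in NumPy with made-up parameter and feature values (note the bias term $x_0 = 1$ prepended to the feature vector):

```python
import numpy as np

# Hypothetical values: n = 2 features, so n + 1 = 3 parameters.
theta = np.array([1.0, 2.0, 3.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 5.0, 7.0])       # x_0 = 1 is the bias term

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 1*1 + 2*5 + 3*7 = 32.0
```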
Gradient Descent
For each feature $j = 0, 1, \dots, n$, repeat until convergence (updating all $\theta_{j}$ simultaneously):
$\theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_{j}^{(i)}$
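The per-feature update can be vectorized over the whole parameter vector. A minimal sketch, assuming a design matrix `X` whose first column is all ones (the bias column) and a tiny made-up dataset:

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One batch gradient descent step.
    X: (m, n+1) design matrix with X[:, 0] == 1; y: (m,) targets."""
    m = X.shape[0]
    predictions = X @ theta                      # h_theta(x^(i)) for all i
    gradient = (X.T @ (predictions - y)) / m     # partial derivatives of J
    return theta - alpha * gradient

# Made-up dataset where y = 2 * x1, so theta should approach [0, 2].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)  # close to [0.0, 2.0]
```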
Feature Scaling
Scale feature values within range [-1, 1]
e.g. if $x_{i} \in [0, 2000]$, scale it by dividing by 2000; then $x_{i} \in [0, 1]$
The denominator is known as the scaling factor
Scaling factor: $S = \max(x) - \min(x)$,
i.e. $S$ is generally taken to be the range of feature $x$
Mean Normalization
Replace $x_{i}$ with $\frac{x_{i}-\mu}{\sigma}$ (divide by the standard deviation),
OR with $\frac{x_{i}-\mu}{S}$ (divide by the scaling factor, i.e. the range)
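Both variants can be sketched in a few lines of NumPy, using a hypothetical feature column in $[0, 2000]$ (e.g. house sizes):

```python
import numpy as np

# Hypothetical feature column in [0, 2000]
x = np.array([0.0, 500.0, 1000.0, 2000.0])

# Feature scaling: divide by the range S = max - min
S = x.max() - x.min()
x_scaled = x / S                  # now in [0, 1]

# Mean normalization combined with scaling: (x - mu) / S
mu = x.mean()
x_norm = (x - mu) / S             # centered around 0, roughly in [-0.5, 0.5]

# Alternative: divide by the standard deviation instead of the range
x_std = (x - mu) / x.std()
```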
Feature scaling and mean normalization help gradient descent converge faster
Choosing the Right Learning Rate ($\alpha$)
Good $\alpha$ values to try: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
- Method 1: Plot the cost function after each iteration
If the plot is not a decreasing function, reduce $\alpha$, or gradient descent will not converge.
Plot A: $\alpha$ is good
Plot B: $\alpha$ is too small, so convergence is slow
Plot C: $\alpha$ is too large, so the cost diverges
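Method 1 can be sketched by recording $J(\theta)$ after each iteration for a few candidate $\alpha$ values and inspecting the resulting curves (illustrative code on a made-up dataset; plotting is left out):

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

# Made-up dataset where y = 2 * x1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

for alpha in (0.001, 0.01, 0.1):
    theta = np.zeros(2)
    history = []
    for _ in range(100):
        m = len(y)
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        history.append(cost(theta, X, y))
    # A good alpha shows a steadily decreasing history (Plot A);
    # too small decreases slowly (Plot B); too large blows up (Plot C).
    print(alpha, history[-1])
```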
- Method 2: Use a convergence detection algorithm
If the decrease in the cost function per iteration falls below some threshold, declare convergence.
The threshold is generally a small value such as $10^{-3}$
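Method 2 can be sketched as a stopping rule inside the training loop (a minimal illustration; the dataset and the threshold of $10^{-3}$ are as in the notes, everything else is made up):

```python
import numpy as np

def run_until_converged(X, y, alpha=0.1, threshold=1e-3, max_iters=10000):
    """Run batch gradient descent until the per-iteration decrease
    in the cost falls below `threshold`."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev_cost = float("inf")
    for i in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        residual = X @ theta - y
        c = residual @ residual / (2 * m)
        if prev_cost - c < threshold:
            return theta, i + 1       # declare convergence
        prev_cost = c
    return theta, max_iters           # gave up without converging

# Made-up dataset where y = 2 * x1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta, iters = run_until_converged(X, y)
print(iters)  # stops long before max_iters
```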