I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, for later review and consolidation.
My knowledge is limited, so please bear with any errors or omissions and do point them out. Fellow learners are very welcome to join the discussion!
Week 02
2.1 Multivariate Linear Regression
2.1.1 Multiple Features
- The multivariable form of the hypothesis function:
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$
- Remark: For convenience, assume $x_0^{(i)} = 1$ for $i \in \{1, \cdots, m\}$.
- The cost function $J(\theta)$ has the same form:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
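A minimal NumPy sketch of the vectorized cost (the function name and toy data are my own, not from the course):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2, with h_theta(x) = theta^T x.

    X is the m x (n+1) design matrix whose first column is x0 = 1,
    y the m targets, theta the (n+1) parameters.
    """
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i at once
    return residuals @ residuals / (2 * m)

# Toy data (made up): x0 = 1 column plus two features; y = 1 + 2*x1 + x2 exactly.
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])
print(compute_cost(X, y, np.array([1.0, 2.0, 1.0])))   # 0.0 at the true theta
```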
2.1.2 Gradient Descent
- Gradient descent for multivariate linear regression (Algorithm 1):
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
(simultaneously update $\theta_j$ for $j = 0, \cdots, n$)
}
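The update vectorizes to a single matrix expression; a sketch (function name, defaults, and toy data are my own assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent for multivariate linear regression.

    The vectorized step theta -= (alpha/m) * X^T (X theta - y) performs
    theta_j := theta_j - alpha*(1/m)*sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    for all j = 0..n at once, i.e. the required simultaneous update.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        theta -= alpha / m * (X.T @ (X @ theta - y))
    return theta

# Same toy data as above; y = 1 + 2*x1 + x2, so theta should approach [1, 2, 1].
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])
print(gradient_descent(X, y))
```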
2.1.3 Practical Tricks in GD
- Feature Scaling ($s_i$)
- Idea: make sure features are on a similar scale. This is because $\theta$ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the features are very uneven.
- Get every feature into approximately a $-1 \le x_i \le 1$ range (the number 1 is not a strict requirement; any range of a similar order works).
- Remark: The quizzes in this course use the range; the programming exercises use the standard deviation.
- Mean Normalization ($\mu_i$)
- Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (do not apply to $x_0 = 1$).
- In general, combining both tricks (see the sketch below):
$$x_i := \frac{x_i - \mu_i}{s_i}$$
where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values ($\max - \min$), or $s_i$ is the standard deviation.
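A sketch of feature scaling with mean normalization (my own helper; it uses the standard deviation for $s_i$, as the programming exercises do):

```python
import numpy as np

def normalize_features(X):
    """x_i := (x_i - mu_i) / s_i for every feature column.

    Apply this to the raw feature columns *before* adding the x0 = 1
    column, since x0 must not be normalized. mu and s are returned as
    well: later inputs must be scaled with the same statistics.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)   # standard deviation as s_i; use X.max(0) - X.min(0) for the range
    return (X - mu) / s, mu, s

# Toy data (made up): e.g. house size in square feet and number of bedrooms.
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0],
                  [1416.0, 2.0]])
X_norm, mu, s = normalize_features(X_raw)
print(X_norm.mean(axis=0))   # ~ [0, 0]
print(X_norm.std(axis=0))    # ~ [1, 1]
```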
- Learning Rate Check
- Debug gradient descent: make a plot with the number of iterations on the x-axis and $J(\theta)$ on the y-axis, and check whether $J(\theta)$ converges (its per-iteration decrease approaches zero):
- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration, and may not converge at all.
- Try values such as $1 \times 10^{k}$ or $3 \times 10^{k}$ (i.e. $\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, \ldots$) and judge from the plot, as sketched below.
- It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
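A sketch of this debugging plot (the toy data and the grid of candidate rates are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def cost_history(X, y, alpha, num_iters=400):
    """Run gradient descent, recording J(theta) after each iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta -= alpha / m * (X.T @ (X @ theta - y))
        r = X @ theta - y
        history.append(r @ r / (2 * m))
    return history

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])

# J should fall on every iteration: a rising or oscillating curve means
# alpha is too large; a nearly flat one means alpha is too small.
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1):
    plt.plot(cost_history(X, y, alpha), label=f"alpha = {alpha}")
plt.xlabel("iterations")
plt.ylabel(r"$J(\theta)$")
plt.legend()
plt.show()
```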
2.1.4 Improvement of Linear Regression
- Feature Combination
- Combine several features into one using a variety of methods; e.g. merge a lot's frontage $x_1$ and depth $x_2$ into a single area feature $x_3 = x_1 \cdot x_2$.
- Polynomial Regression
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1^{a_1} + \theta_2 x_2^{a_2} + \cdots + \theta_n x_n^{a_n}$$
- Remark: One important thing to keep in mind is that if you choose your features this way, then feature scaling becomes very important.
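For example, a cubic fit to a single feature is just linear regression over the columns $x, x^2, x^3$; a sketch under that assumption (helper name and data are mine), including the scaling step the remark calls for:

```python
import numpy as np

def polynomial_features(x, degree):
    """Build and scale the columns x, x^2, ..., x^degree from one raw feature.

    Without scaling, x^3 spans a vastly larger range than x (sizes of
    1-1000 give cubes up to 10^9), which is why scaling matters here.
    """
    P = np.column_stack([x ** d for d in range(1, degree + 1)])
    P = (P - P.mean(axis=0)) / P.std(axis=0)   # mean-normalize each power column
    return np.c_[np.ones(len(x)), P]           # add x0 = 1 after scaling

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = polynomial_features(x, degree=3)           # columns: 1, x, x^2, x^3 (scaled)
t = x ** 3 - 2 * x                             # cubic toy targets
theta = np.linalg.pinv(X.T @ X) @ X.T @ t      # closed-form fit
print(X @ theta)                               # reproduces t closely
```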
2.2 Another Method: the Normal Equation
2.2.1 Normal Equation
$$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \;(1 \le i \le m), \quad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
and
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$
Then the normal equation formula is given below:
$$\theta = (X^T X)^{-1} X^T y$$
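A one-line NumPy sketch (my own wrapper); using the pseudoinverse instead of a plain inverse keeps it working even when $X^T X$ is singular, e.g. with redundant features or $m \le n$:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, solved in closed form (no alpha, no loop).

    np.linalg.pinv is used instead of np.linalg.inv so the call still
    succeeds when X^T X is non-invertible (redundant features, or m <= n).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Same toy data as in the gradient descent sketches; note that no
# feature scaling is needed for the normal equation.
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])
print(normal_equation(X, y))   # [1, 2, 1] exactly
```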
2.2.2 Comparison of GD and NE
- Gradient Descent
- Need to choose the learning rate $\alpha$
- Needs many iterations
- $O(kn^2)$ for $k$ iterations
- Works well even when $n$ is large
- Normal Equation
- No need to choose $\alpha$ or to iterate
- $O(n^3)$: need to compute $(X^T X)^{-1}$
- Slow if $n$ is large (see the sanity check below)
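As a quick sanity check (toy data repeated from the sketches above), the two methods should agree on $\theta$ when gradient descent is run long enough with a sensible $\alpha$:

```python
import numpy as np

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
y = np.array([5.0, 6.0, 11.0, 12.0])

theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y    # one O(n^3) solve, no alpha

theta_gd = np.zeros(3)                          # k iterations of O(kn^2) work
for _ in range(5000):
    theta_gd -= 0.1 / len(y) * (X.T @ (X @ theta_gd - y))

print(theta_ne, theta_gd)   # both approach [1, 2, 1]
```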