Computing Parameters Analytically
Normal Equation
Find the optimum $\theta$ without iteration
- Minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero.
Formula:
$$\theta = (X^TX)^{-1}X^Ty$$
Octave: pinv(X'*X)*X'*y
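As a sketch of where the formula comes from, the derivation below assumes the usual least-squares cost $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(\theta^Tx^{(i)}-y^{(i)})^2$ (the cost is not restated in this section, so treat it as an assumption), written in matrix form with the design matrix $X$ and the vector $y$ of the $m$ target values:

$$
\begin{aligned}
J(\theta) &= \frac{1}{2m}(X\theta - y)^T(X\theta - y)\\
\nabla_\theta J(\theta) &= \frac{1}{m}X^T(X\theta - y) = 0\\
X^TX\theta &= X^Ty\\
\theta &= (X^TX)^{-1}X^Ty
\end{aligned}
$$

The last step only holds when $X^TX$ is invertible.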
Design matrix (X)
m examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$; n features
$$x^{(i)}=\begin{bmatrix} x_0^{(i)}\\ x_1^{(i)}\\ \vdots\\ x_n^{(i)} \end{bmatrix}\in\mathbb{R}^{n+1}$$
$$X=\begin{bmatrix} -(x^{(1)})^T-\\ -(x^{(2)})^T-\\ \vdots\\ -(x^{(m)})^T- \end{bmatrix} \quad (m\times(n+1)\text{-dimensional})$$
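To make the construction concrete, here is a minimal sketch in plain Python. It assumes the usual convention $x_0^{(i)} = 1$ for the intercept term, and the data set is hypothetical (generated from $y = 1 + 2x$). The helper name `normal_equation` is made up for this example. Rather than forming the inverse explicitly, it solves the equivalent linear system $X^TX\theta = X^Ty$ by Gaussian elimination, which gives the same $\theta$ whenever $X^TX$ is invertible.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with partial pivoting."""
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (noninvertible)")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution on the upper-triangular system.
    theta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (A[i][n] - s) / A[i][i]
    return theta


# Hypothetical data: m = 4 examples, n = 1 feature, generated from y = 1 + 2x.
features = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

# Design matrix: one row per example, with x_0 = 1 prepended.
X = [[1.0] + row for row in features]

theta = normal_equation(X, y)
print(theta)  # close to [1.0, 2.0]
```

With a numerical library, the same result would normally come from a built-in pseudo-inverse or least-squares routine, as in the Octave one-liner above.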
With the normal equation, there is no need to do feature scaling.
Comparison of gradient descent and normal equation:
$$
\begin{array}{|l|l|}
\hline
\text{Gradient Descent} & \text{Normal Equation}\\
\hline
\text{Need to choose } \alpha & \text{No need to choose } \alpha\\
\hline
\text{Needs many iterations} & \text{No need to iterate}\\
\hline
\mathcal{O}(kn^2) & \mathcal{O}(n^3)\text{, need to calculate inverse of } X^TX\\
\hline
\text{Works well when } n \text{ is large} & \text{Slow if } n \text{ is very large}\\
\hline
\end{array}
$$
With the normal equation, computing the inverse has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow.
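For contrast, the gradient descent column of the comparison can be sketched as follows. This is an illustrative batch gradient descent, assuming the cost $J(\theta)=\frac{1}{2m}\sum_i(\theta^Tx^{(i)}-y^{(i)})^2$. The data set, the learning rate `alpha=0.1`, and the iteration count are hypothetical choices that happen to work for this tiny problem. Unlike the normal equation, both values have to be picked by hand.

```python
def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression (squared-error cost)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Prediction error of every example under the current theta.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - target
                  for row, target in zip(X, y)]
        # Simultaneous update of every theta_j.
        theta = [theta[j] - alpha / m * sum(e * row[j] for e, row in zip(errors, X))
                 for j in range(n)]
    return theta


# Hypothetical design matrix (x_0 = 1 in the first column) for y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = gradient_descent(X, y, alpha=0.1, iterations=2000)
print(theta)  # approaches [1.0, 2.0]
```

Each iteration only needs products with $X$ and never inverts a matrix, which is why this approach remains usable when $n$ is very large.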
Normal Equation Noninvertibility
If $X^TX$ is noninvertible, the common causes are:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization” (to be explained in a later lesson).
Solutions
- Redundant features: delete a feature that is linearly dependent on another.
- Too many features (e.g. $m \leq n$): delete one or more features, or use regularization.
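Both causes can be checked numerically. The sketch below is one possible way to do so: it computes the rank of $X^TX$ by Gaussian elimination, and $X^TX$ is invertible exactly when that rank equals the number of columns of $X$. The helper names `rank` and `gram`, the tolerance `eps`, and all three data sets are hypothetical. A fixed absolute tolerance is a simplifying assumption that suits small, well-scaled numbers like these.

```python
def rank(A, eps=1e-9):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]  # work on a copy
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(A[i][c]))
        if abs(A[pivot][c]) < eps:
            continue  # no usable pivot in this column
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, rows):
            factor = A[i][c] / A[r][c]
            for j in range(c, cols):
                A[i][j] -= factor * A[r][j]
        r += 1
    return r


def gram(X):
    """Return X^T X for a design matrix X given as a list of rows."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]


# Redundant features: the third column is exactly twice the second.
X_redundant = [[1.0, 1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0, 6.0], [1.0, 4.0, 8.0]]
# The same data with the redundant column deleted.
X_reduced = [row[:2] for row in X_redundant]
# Too many features: m = 2 examples, n = 2 features (3 columns with x_0).
X_few = [[1.0, 1.0, 5.0], [1.0, 2.0, 3.0]]

print(rank(gram(X_redundant)))  # 2 < 3 columns: X^T X is noninvertible
print(rank(gram(X_reduced)))    # 2 == 2 columns: X^T X is invertible
print(rank(gram(X_few)))        # 2 < 3 columns: X^T X is noninvertible
```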