Notation
$x_j^{(i)}$ denotes the value of feature $j$ in the $i$th training example
$m$ denotes the number of training examples
$n$ denotes the number of features
The multivariable form of the hypothesis function is

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

or

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j$$
where we assume $x_0^{(i)} = 1$
we also write
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
and
$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$$
so we can rewrite h(x) as
$$h_\theta(x) = \theta^T x$$
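As a quick illustration, here is that inner product in NumPy (a minimal sketch; the concrete numbers are made up):

```python
import numpy as np

# Feature vector with the bias term x0 = 1 prepended (two real features here).
x = np.array([1.0, 2104.0, 3.0])       # [x0, x1, x2]
theta = np.array([340.0, 0.1, -20.0])  # [theta0, theta1, theta2]

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 340 + 0.1*2104 + (-20)*3 = 490.4
```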
Algorithm
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

(repeat until convergence, updating $\theta_j$ simultaneously for $j = 0, \dots, n$)

where $x_0^{(i)} = 1$
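A minimal vectorized sketch of this update in NumPy, assuming `X` is the $m \times (n+1)$ design matrix whose first column is all ones and `y` is the target vector (function name and defaults are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix, first column all ones (x0 = 1).
    y: (m,) vector of targets.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        # Simultaneous update of every theta_j:
        # theta_j -= alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```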
Algorithm in practice
Feature scaling
We can speed up gradient descent by having each of our input values in roughly the same range.
$$x_i := \frac{x_i - \mu_i}{s_i}$$
where $\mu_i$ is the average of all values of feature $i$,
and $s_i$ is either the range of feature $i$ (max minus min) or its standard deviation
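A small sketch of mean normalization, using the standard deviation for $s_i$ (names are illustrative):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each feature column: (x - mu) / s, with s = std.

    Apply this before prepending the x0 = 1 column: a constant column
    has zero standard deviation and would cause a division by zero.
    """
    mu = X.mean(axis=0)  # per-feature average
    s = X.std(axis=0)    # per-feature std; use X.max(0) - X.min(0) for the range
    return (X - mu) / s, mu, s
```

The same `mu` and `s` must be reused to scale any new example at prediction time.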
Learning rate
Debugging gradient descent.
Make a plot with the number of iterations on the x-axis and the cost function $J(\theta)$ on the y-axis. If $J(\theta)$ ever increases, $\alpha$ is too large.
Automatic convergence test.
Declare convergence if $J(\theta)$ decreases by less than $E$ in one iteration, where $E$ is some small value such as $10^{-3}$. However, in practice it is difficult to choose this threshold value.
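A sketch combining both checks: record $J(\theta)$ each iteration so it can be plotted, and stop early once the per-iteration decrease falls below the threshold (`epsilon = 1e-3` mirrors the example value above; all names are illustrative):

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost J(theta) = (1/(2m)) * sum((h - y)^2)."""
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))

def descend_until_converged(X, y, alpha=0.01, epsilon=1e-3, max_iters=10000):
    theta = np.zeros(X.shape[1])
    history = [cost(X, y, theta)]  # plot this against iteration count
    for _ in range(max_iters):
        theta -= alpha * X.T @ (X @ theta - y) / len(y)
        history.append(cost(X, y, theta))
        if history[-2] - history[-1] < epsilon:  # decreased by less than epsilon
            break
    return theta, history
```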
Features and Polynomial Regression
We can combine multiple features into one, such as $x_3 := x_1 \cdot x_2$.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, or square-root function (or any other form), for example:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$
or
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$$
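Polynomial terms like these are just extra columns in the design matrix. A sketch (note that feature scaling becomes very important here: if $x_1$ ranges over 1–1000, then $x_1^3$ ranges over 1–$10^9$):

```python
import numpy as np

def cubic_features(x1):
    """Map a single feature x1 to rows [1, x1, x1^2, x1^3]."""
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x1**2, x1**3])

X_poly = cubic_features([1.0, 2.0, 3.0])
# Scale the non-constant columns before running gradient descent.
```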
Normal Equation
$$\theta = (X^T X)^{-1} X^T y$$
where
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$$
and
$$y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
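A direct NumPy sketch of this formula. Using the pseudo-inverse (or a linear solver) rather than an explicit inverse is more numerically robust, and the pseudo-inverse also works when $X^T X$ is non-invertible:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, via the pseudo-inverse for stability."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Equivalent and usually preferred: solve the linear system directly.
# theta = np.linalg.solve(X.T @ X, X.T @ y)
```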
Differences
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate $(X^T X)^{-1}$ |
| Works well when $n$ is large | Slow if $n$ is very large |