Vectorization of the Gradient Descent Algorithm
-
Problem Setup
-
Given a dataset with $m$ samples $(x^{(1)}, x^{(2)}, \dots, x^{(m)})$, each sample having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}, \dots, y^{(m)})$ with $y^{(1)}$ being the "output" for the first sample, we want to calculate parameters $\theta_0, \dots, \theta_n$ such that the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ gives the least-squares linear fit for our dataset (i.e. over all pairs $(x^{(i)}_j, y^{(i)})$ for all $i$ and $j$).
-
For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so that $\vec{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$ has the same dimension as $\vec{\theta} = (\theta_0, \dots, \theta_n)$, and we can rewrite the hypothesis as $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.
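In NumPy, adding the constant feature $x_0 = 1$ amounts to prepending a column of ones to the design matrix. A minimal sketch, using a hypothetical 4-sample, 2-feature dataset:

```python
import numpy as np

# Hypothetical dataset: m = 4 samples, n = 2 features.
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 2.0],
              [3.0, 3.0]])

# Prepend a column of ones so x_0^(i) = 1 for every sample i;
# each row of X_b now has the same dimension as theta = (theta_0, ..., theta_n).
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_b.shape)  # (4, 3)
```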
-
The Gradient Descent algorithm:
-
The cost function is $J(\theta_0, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2$; we aim to minimize this function to find the $\theta$ values.
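The cost function can itself be computed without an explicit loop over samples, since the inner sum $\theta_0 x_0^{(i)} + \dots + \theta_n x_n^{(i)}$ is just a matrix–vector product. A sketch on a hypothetical dataset (with the bias column already prepended, and `y` chosen to lie exactly on the plane $y = 1 + 2x_1 + 2x_2$):

```python
import numpy as np

# Hypothetical data: m = 4 samples, bias column x_0 = 1 already prepended.
X_b = np.array([[1.0, 2.0, 3.0],
                [1.0, 1.0, 5.0],
                [1.0, 4.0, 2.0],
                [1.0, 3.0, 3.0]])
y = np.array([11.0, 13.0, 13.0, 13.0])  # exactly 1 + 2*x1 + 2*x2

def cost(theta, X_b, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    residuals = X_b @ theta - y  # h_theta(x^(i)) - y^(i) for all i at once
    return residuals @ residuals / (2 * m)

# theta = (1, 2, 2) fits this data exactly, so the cost is 0.
print(cost(np.array([1.0, 2.0, 2.0]), X_b, y))  # 0.0
```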
-
Assume we have only two features ($n = 2$) and a suitable learning rate $\alpha > 0$. We normally start with some initial values for the $\theta$'s; since our model is linear, the cost function is convex, so Gradient Descent will always end up at the global optimum. The algorithm updates the $\theta$'s simultaneously until convergence, and upon convergence the $\theta$ values will be our parameters:
-
repeat until convergence
$\{$ $\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$;
&nbsp;&nbsp;$\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$;
&nbsp;&nbsp;$\theta_2 := \theta_2 - \alpha \frac{\partial J}{\partial \theta_2}$ $\}$
After taking the partials, we find that each line becomes
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$.
-
As we want to update all $\theta$'s simultaneously, we have to set several ($3$, in this case) ancillary variables to store the RHS's of the assignment expressions, and perform the assignments together at the end of each iteration. This process can be further simplified by vectorization.
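The non-vectorized version described above can be sketched as follows, with a temporary array playing the role of the ancillary variables so that all $\theta$'s are updated simultaneously. The dataset and hyperparameters here are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical data (bias column prepended); y lies exactly on y = 1 + 2*x1 + 2*x2.
X_b = np.array([[1.0, 2.0, 3.0],
                [1.0, 1.0, 5.0],
                [1.0, 4.0, 2.0],
                [1.0, 3.0, 3.0]])
y = np.array([11.0, 13.0, 13.0, 13.0])
theta = np.zeros(3)
alpha, m = 0.01, len(y)

for _ in range(1000):
    temp = np.empty_like(theta)  # ancillary variables holding the RHS's
    for j in range(len(theta)):
        # partial of J w.r.t. theta_j: (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        grad_j = sum((X_b[i] @ theta - y[i]) * X_b[i, j] for i in range(m)) / m
        temp[j] = theta[j] - alpha * grad_j
    theta = temp  # assign all thetas together at the end of the iteration
```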
-
Vectorization:
-
Algorithm:
$\{$ $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$;
&nbsp;&nbsp;$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_1^{(i)}$;
&nbsp;&nbsp;$\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_2^{(i)}$ $\}$
Observe that we can collect all the $\theta$'s in a vector $\vec{\theta}$, which is simply $\left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\end{matrix}\right]$.
-
We can also collect the sum parts in a vector $\vec{\delta}$:
$\vec{\delta} = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \left[\begin{matrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\end{matrix}\right]$; this expression is a sum of $3 \times 1$ vectors that evaluates to a $3 \times 1$ vector whose entries each contain the change (divided by $\alpha$) of the corresponding $\theta$ in the current iteration.
Therefore, the algorithm becomes:
$\left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\end{matrix}\right] := \left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\end{matrix}\right] - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \left[\begin{matrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\end{matrix}\right]$, or $\vec{\theta} := \vec{\theta} - \alpha \vec{\delta}$, which is a faster implementation.
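In NumPy, the whole vector $\vec{\delta}$ collapses to a single matrix product, since $\sum_i \left(h_\theta(x^{(i)}) - y^{(i)}\right) \vec{x}^{(i)}$ is exactly $X^\top (X\vec{\theta} - \vec{y})$ for the design matrix $X$ whose rows are the $\vec{x}^{(i)}$. A sketch on the same kind of hypothetical dataset as before:

```python
import numpy as np

# Hypothetical data (bias column prepended); y lies exactly on y = 1 + 2*x1 + 2*x2.
X_b = np.array([[1.0, 2.0, 3.0],
                [1.0, 1.0, 5.0],
                [1.0, 4.0, 2.0],
                [1.0, 3.0, 3.0]])
y = np.array([11.0, 13.0, 13.0, 13.0])
theta = np.zeros(3)
alpha, m = 0.01, len(y)

for _ in range(1000):
    # delta = (1/m) * sum_i (h(x^(i)) - y^(i)) * x^(i), computed as one product
    delta = X_b.T @ (X_b @ theta - y) / m
    theta = theta - alpha * delta  # simultaneous update, no ancillary variables
```

Both inner loops of the earlier version disappear: the residuals for all $m$ samples come from one product `X_b @ theta - y`, and multiplying by `X_b.T` accumulates the per-feature sums for all $n+1$ parameters at once.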
-