Vectorization of the Gradient Descent Algorithm

  1. Problem Setup

    1. Given a dataset with $m$ samples $(x^{(1)}, x^{(2)}, \dots, x^{(m)})$, each sample having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}, \dots, y^{(m)})$ with $y^{(1)}$ being the "output" for the first sample, we want to calculate parameters $\theta_0, \dots, \theta_n$ such that the hypothesis $h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ gives the least-squares linear fit for our dataset, i.e. for the pairs $(x^{(i)}, y^{(i)})$ for all $i$.

    2. For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so that $\vec{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$ has the same dimension as $\vec{\theta} = (\theta_0, \dots, \theta_n)$, and we can rewrite the hypothesis as $h_{\theta}(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.

    3. The Gradient Descent algorithm:

      1. The cost function is $J(\theta_0, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2$; we aim to minimize this function to find the $\theta$ values.

      2. Assume we have only two features and a suitable learning rate $\alpha > 0$. We normally start with some initial values for the $\theta$'s; because the model is linear (so the cost function is convex), gradient descent will always reach the global optimum. The algorithm updates the $\theta$'s simultaneously until convergence, and upon convergence the $\theta$ values are our parameters:

        1. repeat until convergence {
               $\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$;
               $\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$;
               $\theta_2 := \theta_2 - \alpha \frac{\partial J}{\partial \theta_2}$
           }

        2. After taking the partial derivatives (applying the chain rule to each squared term in $J$), each line becomes

           $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$.

        3. Since we want to update all $\theta$'s simultaneously, we have to set several ($3$, in this case) ancillary variables to store the right-hand sides of the assignment expressions and perform the assignments together at the end of each iteration (see the loop-based sketch below). This process can be further simplified by vectorization.
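  To make the unvectorized procedure concrete, here is a minimal Python/NumPy sketch, assuming the samples are stored row-wise in an array `X` of shape `(m, n+1)` whose first column is all ones (the $x_0$ convention above) and the responses in a vector `y` of length `m`; the function name `gradient_descent_loop` and the fixed iteration count used in place of a convergence test are illustrative choices, not part of the original text.

```python
import numpy as np

def gradient_descent_loop(X, y, alpha=0.01, num_iters=1000):
    """Unvectorized gradient descent for linear regression.

    X : (m, n+1) array whose first column is all ones (x_0 = 1).
    y : (m,) array of responses.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)              # some initial values for the thetas

    for _ in range(num_iters):
        # Ancillary variables: compute every right-hand side first ...
        new_theta = np.empty_like(theta)
        for j in range(n_plus_1):
            total = 0.0
            for i in range(m):
                h = sum(theta[k] * X[i, k] for k in range(n_plus_1))  # h_theta(x^(i))
                total += (h - y[i]) * X[i, j]
            new_theta[j] = theta[j] - alpha * total / m
        # ... then assign them together so the update is simultaneous.
        theta = new_theta

    return theta
```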

  2. Vectorization:

    1. Algorithm:

      {
        $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$;
        $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_1^{(i)}$;
        $\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_2^{(i)}$
      }

    2. Observe that we can collect all the $\theta$'s in a vector $\vec{\theta} = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\end{bmatrix}$.

    3. We can likewise collect the sums in a vector $\vec{\delta}$:
       $\vec{\delta} = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\end{bmatrix}$.
       This expression is a sum of $3 \times 1$ vectors that evaluates to a $3 \times 1$ vector; each entry contains the change (divided by $\alpha$) of its corresponding $\theta$ in the current iteration.

    4. Therefore, the algorithm becomes:

      $\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\end{bmatrix} := \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\end{bmatrix} - \alpha \cdot \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \begin{bmatrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\end{bmatrix}$, or simply $\vec{\theta} := \vec{\theta} - \alpha \vec{\delta}$, which is a faster implementation.
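  As an illustration of the speedup, here is a minimal vectorized sketch under the same assumptions as the loop-based version above (`X` of shape `(m, n+1)` with a leading column of ones, `y` of length `m`); the helper name `gradient_descent_vectorized` and the synthetic data in the usage example are again illustrative, not from the original text. The whole of $\vec{\delta}$ collapses into matrix products: `X @ theta` evaluates $h_\theta(x^{(i)})$ for every sample at once, and `X.T @ residual` forms the sum over $i$.

```python
import numpy as np

def gradient_descent_vectorized(X, y, alpha=0.01, num_iters=1000):
    """Vectorized gradient descent: theta := theta - alpha * delta.

    X : (m, n+1) array whose first column is all ones (x_0 = 1).
    y : (m,) array of responses.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)

    for _ in range(num_iters):
        residual = X @ theta - y        # h_theta(x^(i)) - y^(i) for all i at once
        delta = (X.T @ residual) / m    # the vector delta from the text
        theta = theta - alpha * delta   # simultaneous update of all thetas
    return theta


# Usage example on a small synthetic dataset (illustrative values):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = 100
    X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])  # prepend x_0 = 1
    true_theta = np.array([1.0, 2.0, -3.0])
    y = X @ true_theta + 0.01 * rng.normal(size=m)
    print(gradient_descent_vectorized(X, y, alpha=0.1, num_iters=2000))  # ~ true_theta
```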
