Vectorization of the Gradient Descent Algorithm
-
Problem Setup
-
Given a dataset with $m$ samples $(x^{(1)}, x^{(2)}, \dots, x^{(m)})$, each sample having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}, \dots, y^{(m)})$ with $y^{(1)}$ being the "output" for the first sample, we want to calculate parameters $\theta_0, \dots, \theta_n$ such that the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ gives the least-squares linear fit for our dataset (i.e. over all pairs $(x^{(i)}_j, y^{(i)})$ for all $i$ and $j$).
-
For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so that $\vec{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$ has the same dimension as $\vec{\theta} = (\theta_0, \dots, \theta_n)$, and we can rewrite the hypothesis as $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.
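In NumPy, adding the constant feature $x_0 = 1$ amounts to prepending a column of ones to the design matrix. A minimal sketch, using a hypothetical 4-sample, 2-feature dataset:

```python
import numpy as np

# Hypothetical dataset: m = 4 samples, n = 2 features.
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 2.0],
              [3.0, 3.0]])

# Prepend a column of ones so x_0^(i) = 1 for every sample i;
# each row of X_b now has the same dimension as theta = (theta_0, ..., theta_n).
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_b.shape)  # (4, 3)
```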
-
The Gradient Descent algorithm:
-
The cost function is $J(\theta_0, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2$; we aim to minimize this function to find the $\theta$ values.
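The cost function can itself be computed without an explicit loop over samples, since the inner sum $\theta_0 x_0^{(i)} + \dots + \theta_n x_n^{(i)}$ is just a matrix–vector product. A sketch on a hypothetical dataset (with the bias column already prepended, and `y` chosen to lie exactly on the plane $y = 1 + 2x_1 + 2x_2$):

```python
import numpy as np

# Hypothetical data: m = 4 samples, bias column x_0 = 1 already prepended.
X_b = np.array([[1.0, 2.0, 3.0],
                [1.0, 1.0, 5.0],
                [1.0, 4.0, 2.0],
                [1.0, 3.0, 3.0]])
y = np.array([11.0, 13.0, 13.0, 13.0])  # exactly 1 + 2*x1 + 2*x2

def cost(theta, X_b, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    residuals = X_b @ theta - y  # h_theta(x^(i)) - y^(i) for all i at once
    return residuals @ residuals / (2 * m)

# theta = (1, 2, 2) fits this data exactly, so the cost is 0.
print(cost(np.array([1.0, 2.0, 2.0]), X_b, y))  # 0.0
```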
-
Assume we have only two features ($n = 2$) and a suitable learning rate $\alpha > 0$. We normally start with some initial values for the $\theta$'s; since our model is linear, the cost function is convex, so Gradient Descent will always end up at the global optimum. The algorithm updates the $\theta$'s simultaneously until convergence, and upon convergence the $\theta$ values will be our parameters:
-
repeat until convergence
$\{$ $\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$;
&nbsp;&nbsp;$\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$;
&nbsp;&nbsp;$\theta_2 := \theta_2 - \alpha \frac{\partial J}{\partial \theta_2}$ $\}$
After taking the partials, we find that each line becomes
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$.
-
As we want to update all $\theta$'s simultaneously, we have to set several ($3$, in this case) ancillary variables to store the RHS's of the assignment expressions, and perform the assignments together at the end of each iteration. This process can be further simplified by vectorization.
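The non-vectorized version described above can be sketched as follows, with a temporary array playing the role of the ancillary variables so that all $\theta$'s are updated simultaneously. The dataset and hyperparameters here are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical data (bias column prepended); y lies exactly on y = 1 + 2*x1 + 2*x2.
X_b = np.array([[1.0, 2.0, 3.0],
                [1.0, 1.0, 5.0],
                [1.0, 4.0, 2.0],
                [1.0, 3.0, 3.0]])
y = np.array([11.0, 13.0, 13.0, 13.0])
theta = np.zeros(3)
alpha, m = 0.01, len(y)

for _ in range(1000):
    temp = np.empty_like(theta)  # ancillary variables holding the RHS's
    for j in range(len(theta)):
        # partial of J w.r.t. theta_j: (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        grad_j = sum((X_b[i] @ theta - y[i]) * X_b[i, j] for i in range(m)) / m
        temp[j] = theta[j] - alpha * grad_j
    theta = temp  # assign all thetas together at the end of the iteration
```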
-
Vectorization:
-
Algorithm:
$\{$ $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$;
&nbsp;&nbsp;$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_1^{(i)}$;
&nbsp;&nbsp;$\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_2^{(i)}$ $\}$
Observe that we can collect all the $\theta$'s in a vector $\vec{\theta}$, which is simply $\left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\end{matrix}\right]$.
-
We can also collect the sum parts in a vector $\vec{\delta}$:
$\vec{\delta} = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \left[\begin{matrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\end{matrix}\right]$; this expression is a sum of $3 \times 1$ vectors that evaluates to a $3 \times 1$ vector whose entries each contain the change (divided by $\alpha$) of the corresponding $\theta$ in the current iteration.
Therefore, the algorithm becomes:
$\left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\end{matrix}\right] := \left[\begin{matrix}\theta_0\\\theta_1\\\theta_2\end{matrix}\right] - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \left[\begin{matrix}x_0^{(i)}\\x_1^{(i)}\\x_2^{(i)}\end{matrix}\right]$, or $\vec{\theta} := \vec{\theta} - \alpha \vec{\delta}$, which is a faster implementation.
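In NumPy, the whole vector $\vec{\delta}$ collapses to a single matrix product, since $\sum_i \left(h_\theta(x^{(i)}) - y^{(i)}\right) \vec{x}^{(i)}$ is exactly $X^\top (X\vec{\theta} - \vec{y})$ for the design matrix $X$ whose rows are the $\vec{x}^{(i)}$. A sketch on the same kind of hypothetical dataset as before:

```python
import numpy as np

# Hypothetical data (bias column prepended); y lies exactly on y = 1 + 2*x1 + 2*x2.
X_b = np.array([[1.0, 2.0, 3.0],
                [1.0, 1.0, 5.0],
                [1.0, 4.0, 2.0],
                [1.0, 3.0, 3.0]])
y = np.array([11.0, 13.0, 13.0, 13.0])
theta = np.zeros(3)
alpha, m = 0.01, len(y)

for _ in range(1000):
    # delta = (1/m) * sum_i (h(x^(i)) - y^(i)) * x^(i), computed as one product
    delta = X_b.T @ (X_b @ theta - y) / m
    theta = theta - alpha * delta  # simultaneous update, no ancillary variables
```

Both inner loops of the earlier version disappear: the residuals for all $m$ samples come from one product `X_b @ theta - y`, and multiplying by `X_b.T` accumulates the per-feature sums for all $n+1$ parameters at once.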
-