Deriving the Normal Equation
An alternative to Gradient Descent for Multiple Linear Regression
-
Problem Setup:
- Given a dataset with $m$ samples $(x^{(1)}, x^{(2)}, \dots, x^{(m)})$, each having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}, \dots, y^{(m)})$ with $y^{(1)}$ being the "output" for the first sample, we want to calculate parameters $\theta_0, \dots, \theta_n$ such that the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ gives the least-squares linear fit for our dataset (i.e. for every pair $(x^{(i)}, y^{(i)})$).
- For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so that $\vec x^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$ has the same dimension as $\vec\theta = (\theta_0, \dots, \theta_n)$, and we can rewrite the hypothesis as $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.
-
Derivation
-
Cost Function: Since "least squares" means "minimizing the sum of squared errors", the cost function is
$$J(\theta_0, \dots, \theta_n) = \frac{1}{2m}\sum_{i=1}^m\left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2,$$
where the constant factor $\frac{1}{2m}$ does not change the minimizer. Our task is to minimize this function.
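As a quick numeric sketch (in NumPy, on a made-up toy dataset; the names `X`, `y`, `theta` are purely illustrative), the cost can be computed directly from this sum:

```python
import numpy as np

# Hypothetical toy dataset: m = 3 samples, n = 2 features,
# with the convention x_0 = 1 already baked into the first column.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
y = np.array([10.0, 20.0, 30.0])
theta = np.array([1.0, 2.0, 3.0])  # (theta_0, theta_1, theta_2)

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum_i (theta_0 x_0^(i) + ... + theta_n x_n^(i) - y^(i))^2."""
    m = len(y)
    return sum((theta @ X[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

print(cost(theta, X, y))  # → 8.0  (each residual is 4, so 3*16 / 6 = 8)
```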
-
Viewing $J$ as an inner product: In minimizing $J$, we are looking for $\theta$ values that minimize a sum of squared residuals.
Notice that $\sum_{i=1}^m\left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2$ is the same as the following inner product, which returns a scalar:
$$\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)} - y^{(1)}\\ \theta_0 x_0^{(2)} + \dots + \theta_n x_n^{(2)} - y^{(2)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)} - y^{(m)}\end{bmatrix}^{\mathbf T}\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)} - y^{(1)}\\ \theta_0 x_0^{(2)} + \dots + \theta_n x_n^{(2)} - y^{(2)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)} - y^{(m)}\end{bmatrix}$$
(You may expand this product to check that it equals the algebraic expression above.)
-
Viewing the vector used above as a matrix-vector product minus another vector:
- The matrix $\begin{bmatrix}x_0^{(1)} & \cdots & x_n^{(1)}\\ x_0^{(2)} & \cdots & x_n^{(2)}\\ \vdots & & \vdots\\ x_0^{(m)} & \cdots & x_n^{(m)}\end{bmatrix}$ times the vector $\begin{bmatrix}\theta_0\\ \theta_1\\ \vdots\\ \theta_n\end{bmatrix}$ gives the vector $\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)}\\ \theta_0 x_0^{(2)} + \dots + \theta_n x_n^{(2)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)}\end{bmatrix}$.
- We denote the matrix $\begin{bmatrix}x_0^{(1)} & \cdots & x_n^{(1)}\\ \vdots & & \vdots\\ x_0^{(m)} & \cdots & x_n^{(m)}\end{bmatrix}$ as $X$, and the vector $\begin{bmatrix}\theta_0\\ \theta_1\\ \vdots\\ \theta_n\end{bmatrix}$ as $\vec\theta$.
- Therefore, $\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)} - y^{(1)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)} - y^{(m)}\end{bmatrix}$ is $X\vec\theta - \vec y$, and the inner product is $(X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$.
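To see the equivalence concretely, here is a small NumPy check (random data; all names are illustrative) that the per-sample sum of squares equals the inner product $(X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
# Design matrix X: first column is x_0 = 1, remaining columns are the features.
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Loop form: sum over samples of (theta_0 x_0^(i) + ... + theta_n x_n^(i) - y^(i))^2
loop_form = sum((theta @ X[i] - y[i]) ** 2 for i in range(m))

# Vectorized form: (X theta - y)^T (X theta - y)
r = X @ theta - y
inner_product_form = r @ r

print(np.isclose(loop_form, inner_product_form))  # → True
```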
-
We minimize $(X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$. Let $E(\vec\theta) = (X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$, a function of the unknown vector $\vec\theta$. If the system $X\vec\theta - \vec y = \vec 0$ has a solution, that solution is enough: $E \geq 0$ always, and the solution attains $E = 0$. However, when this system fails to be solvable, we have to minimize $E$ by other techniques. The following sections show two ways of minimization.
-
Two ways to solve for $\vec\theta$:
-
Taking the partial derivative of $E$ with respect to the unknown vector $\vec\theta$:
- With $u = X\vec\theta - \vec y$, we have $E = u^{\mathbf T}u$, and the chain rule gives $\frac{\partial E}{\partial \vec\theta} = 2X^{\mathbf T}u = 2X^{\mathbf T}(X\vec\theta - \vec y)$.
- To get the local extrema, we set $X^{\mathbf T}(X\vec\theta - \vec y) = \vec 0$, i.e. $X^{\mathbf T}X\vec\theta - X^{\mathbf T}\vec y = \vec 0$, so $\vec\theta = (X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$ after simplifying. However, this solution only applies when $X^{\mathbf T}X$ is invertible.
- Therefore, $\min J(\vec\theta) = \frac{1}{2m}\sum_{i=1}^m\left(\vec{\theta^*}^{\mathbf T}\vec x^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^m\left(\left[(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y\right]^{\mathbf T}\vec x^{(i)} - y^{(i)}\right)^2$ (notice that we evaluate $J$ at the solved vector $\vec\theta^*$), and our hypothesis is $h_\theta(\vec x) = \vec\theta^{\mathbf T}\vec x$ (or $\vec x^{\mathbf T}\vec\theta$), where $\vec x$ is an arbitrary feature vector with its first entry equal to $1$.
- When $X$ has more columns than rows (i.e. the dataset has more features than samples; a row of $X$ represents one sample, and a column represents one unknown), the system $X\vec\theta = \vec y$ typically has infinitely many solutions, and any one of them can be used for $\vec\theta$ depending on the preference of the model.
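A minimal NumPy sketch of this closed-form solution (random data; in practice one solves the linear system $X^{\mathbf T}X\vec\theta = X^{\mathbf T}\vec y$ rather than explicitly forming the inverse):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # x_0 = 1 column
y = rng.normal(size=m)

# Normal equation: theta = (X^T X)^{-1} X^T y, computed by solving
# X^T X theta = X^T y (numerically safer than inverting X^T X).
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_lstsq))  # → True

# When X^T X is singular (e.g. more features than samples), the pseudoinverse
# picks the minimum-norm minimizer among the infinitely many solutions:
theta_pinv = np.linalg.pinv(X) @ y
```

With full column rank, `np.linalg.pinv(X)` coincides with $(X^{\mathbf T}X)^{-1}X^{\mathbf T}$, so all three routes agree here.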
-
Orthogonally projecting $\vec y$ onto $X$'s column space:
- Minimizing $E$ means minimizing the length of the vector $X\vec\theta - \vec y$. Intuitively, this leads to the idea of choosing $\vec\theta$ so that $X\vec\theta$ is the result of orthogonally projecting $\vec y$ onto the image of $X$ (the column space, i.e. the span of the columns of $X$).
- The matrix for orthogonal projection onto the image of $X$ is $X(X^{\mathbf T}X)^{-1}X^{\mathbf T}$, so if $X\vec\theta = X(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$, then $\vec\theta = (X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$ (canceling $X$ on the left is valid when $X$ has full column rank).
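This geometric picture can be checked numerically (a NumPy sketch with random data): $X\vec\theta$ coincides with the projection $X(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$, and the residual is orthogonal to every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = rng.normal(size=m)

# theta from the normal equation, and the projection matrix P = X (X^T X)^{-1} X^T.
theta = np.linalg.solve(X.T @ X, X.T @ y)
P = X @ np.linalg.solve(X.T @ X, X.T)

# X theta is exactly the orthogonal projection of y onto the column space of X.
print(np.allclose(X @ theta, P @ y))          # → True

# The residual y - X theta is orthogonal to every column of X.
print(np.allclose(X.T @ (y - X @ theta), 0))  # → True
```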
-