Normal Equation

Deriving the Normal Equation

An alternative to Gradient Descent in Multiple Linear Regression

  1. Problem Setup:

    1. Given a dataset with $m$ samples $(x^{(1)}; x^{(2)}; ...; x^{(m)})$, each sample having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}; ...; y^{(m)})$, with $y^{(1)}$ being the “output” for the first sample, we want to calculate parameters $\theta_0, ..., \theta_n$ such that the hypothesis $h_{\theta}(x) = \theta_0 + \theta_1 x_1 + ... + \theta_n x_n$ gives the least-squares linear fit for our dataset (i.e. for every pair $(x^{(i)}_j, y^{(i)})$ over all $i$ and $j$).
    2. For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so that $\vec x^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, ..., x^{(i)}_n)$ has the same dimension as $\vec\theta = (\theta_0, ..., \theta_n)$, and we can rewrite the hypothesis as $h_{\theta}(x) = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n$.
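
Before the derivation, here is a minimal NumPy sketch of this setup. The names (`X`, `theta`, `hypothesis`) and the random data are purely illustrative, not part of the derivation:

```python
import numpy as np

# Hypothetical sizes: m samples, n features (plus x_0 = 1 for the intercept).
m, n = 5, 2
rng = np.random.default_rng(0)
features = rng.normal(size=(m, n))           # one row per sample
y = rng.normal(size=m)                       # responses (y^(1), ..., y^(m))

# Prepend the constant column x_0 = 1 so that theta_0 acts as the intercept.
X = np.hstack([np.ones((m, 1)), features])   # shape (m, n + 1)
theta = np.zeros(n + 1)                      # (theta_0, ..., theta_n)

def hypothesis(x_row, theta):
    """h_theta(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n for one augmented sample."""
    return x_row @ theta
```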
  2. Derivation

    1. Cost Function: Since “least squares” means minimizing the sum of squared errors, the cost function is $J(\theta_0, ..., \theta_n) = \frac{1}{2m}\sum_{i=1}^m(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + ... + \theta_n x_n^{(i)} - y^{(i)})^2$, where the constant factor $\frac{1}{2m}$ does not affect the minimizer. Our task is to minimize this function.
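
As a quick sketch (reusing the hypothetical `X`, `y`, `theta` from the setup snippet above), the cost can be computed exactly as the summation reads:

```python
def cost(X, y, theta):
    """J(theta) = (1 / (2m)) * sum_i (theta . x^(i) - y^(i))^2, written as an explicit loop."""
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        residual = X[i] @ theta - y[i]   # theta_0*x_0^(i) + ... + theta_n*x_n^(i) - y^(i)
        total += residual ** 2
    return total / (2 * m)
```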

    2. Viewing $J$ as an inner product: In minimizing $J$, we are looking for $\theta$ values that minimize a sum of squared residuals, i.e. the squared Euclidean length of a residual vector.

      Notice that $\sum_{i=1}^m(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + ... + \theta_n x_n^{(i)} - y^{(i)})^2$ is the same as the following inner product, which returns a scalar:

      $\left[\begin{matrix}\theta_0x_0^{(1)}+...+\theta_nx_n^{(1)}-y^{(1)}\\\theta_0x_0^{(2)}+...+\theta_nx_n^{(2)}-y^{(2)}\\\vdots\\\theta_0x_0^{(m)}+...+\theta_nx_n^{(m)}-y^{(m)}\end{matrix}\right]^{\mathbf T}\left[\begin{matrix}\theta_0x_0^{(1)}+...+\theta_nx_n^{(1)}-y^{(1)}\\\theta_0x_0^{(2)}+...+\theta_nx_n^{(2)}-y^{(2)}\\\vdots\\\theta_0x_0^{(m)}+...+\theta_nx_n^{(m)}-y^{(m)}\end{matrix}\right]$

      (You may expand this product to check that it equals the algebraic expression above.)

    3. Viewing the vector above as a matrix-vector product minus another vector:

      ∙ The matrix $\left[\begin{matrix}x_0^{(1)}&\cdots&x_n^{(1)}\\x_0^{(2)}&\cdots&x_n^{(2)}\\\vdots&&\vdots\\x_0^{(m)}&\cdots&x_n^{(m)}\end{matrix}\right]$ times the vector $\left[\begin{matrix}\theta_0\\\theta_1\\\vdots\\\theta_n\end{matrix}\right]$ gives the vector $\left[\begin{matrix}\theta_0x_0^{(1)}+...+\theta_nx_n^{(1)}\\\theta_0x_0^{(2)}+...+\theta_nx_n^{(2)}\\\vdots\\\theta_0x_0^{(m)}+...+\theta_nx_n^{(m)}\end{matrix}\right]$.

      ∙ We denote the matrix $\left[\begin{matrix}x_0^{(1)}&\cdots&x_n^{(1)}\\x_0^{(2)}&\cdots&x_n^{(2)}\\\vdots&&\vdots\\x_0^{(m)}&\cdots&x_n^{(m)}\end{matrix}\right]$ as $X$, and the vector $\left[\begin{matrix}\theta_0\\\theta_1\\\vdots\\\theta_n\end{matrix}\right]$ as $\vec\theta$.

      ∙ Therefore, $\left[\begin{matrix}\theta_0x_0^{(1)}+...+\theta_nx_n^{(1)}-y^{(1)}\\\theta_0x_0^{(2)}+...+\theta_nx_n^{(2)}-y^{(2)}\\\vdots\\\theta_0x_0^{(m)}+...+\theta_nx_n^{(m)}-y^{(m)}\end{matrix}\right]$ is $X\vec\theta-\vec y$, and the inner product is $(X\vec\theta-\vec y)^{\mathbf T}(X\vec\theta-\vec y)$.
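
A small numerical check of this identity (again with the hypothetical `X`, `y`, `theta`, and `cost` from the snippets above):

```python
def E(X, y, theta):
    """E(theta) = (X theta - y)^T (X theta - y), the squared length of the residual vector."""
    r = X @ theta - y    # the residual vector X*theta - y
    return r @ r         # inner product r^T r

# The inner-product form agrees with the summation form: J(theta) = E(theta) / (2m).
assert np.isclose(cost(X, y, theta), E(X, y, theta) / (2 * X.shape[0]))
```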

    4. We now minimize $(X\vec\theta-\vec y)^{\mathbf T}(X\vec\theta-\vec y)$. Let $E$ denote this expression, viewed as a function of the unknown vector $\vec\theta$. If the system $X\vec\theta-\vec y=\vec 0$ has a solution, that solution is already optimal, since $E\ge 0$ and the solution attains $E=0$. However, when this system is not solvable, we have to minimize $E$ by other means. The following shows two ways to do so.

    5. Two ways to solve for $\vec\theta$:

      1. Taking the partial derivative of $E$ with respect to the unknown vector $\vec\theta$:

        ∙ $\frac{\partial E}{\partial\vec\theta}=2X^{\mathbf T}(X\vec\theta-\vec y)$ by the product rule (write $E=\vec u^{\mathbf T}\vec u$ with $\vec u=X\vec\theta-\vec y$; each factor contributes $X^{\mathbf T}\vec u$).

        ∙ To find the minimizer, we set $X^{\mathbf T}(X\vec\theta-\vec y)=\vec 0$. Then $X^{\mathbf T}X\vec\theta-X^{\mathbf T}\vec y=\vec 0$, so $\vec\theta=(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$ after simplifying. However, this solution only applies when $X^{\mathbf T}X$ is invertible.

        ∙ Therefore, $\min J(\vec\theta)=\frac{1}{2m}\sum_{i=1}^m(\vec{\theta^*}^{\mathbf T}\vec x^{(i)}-y^{(i)})^2=\frac{1}{2m}\sum_{i=1}^m([(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y]^{\mathbf T}\vec x^{(i)}-y^{(i)})^2$ (notice that we write $J$ in terms of the solved vector $\vec\theta^*$), and our hypothesis is $h_\theta(\vec x)=\vec\theta^{\mathbf T}\vec x$ (or $\vec x^{\mathbf T}\vec\theta$), where $\vec x$ is an arbitrary feature vector whose first entry is $1$.

        ∙ When $X$ has more columns than rows (i.e. the dataset has more features than samples; a row of $X$ represents one sample, and a column represents one unknown), $X^{\mathbf T}X$ is not invertible and the system $X\vec\theta=\vec y$ typically has infinitely many solutions, any one of which can be used for $\vec\theta$ depending on the preference of the model (e.g. the minimum-norm solution).
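
A minimal sketch of this first method on the hypothetical data above. In practice one solves the linear system $X^{\mathbf T}X\vec\theta=X^{\mathbf T}\vec y$ rather than forming the inverse explicitly, and `np.linalg.pinv` gives the minimum-norm least-squares solution when $X^{\mathbf T}X$ is singular (e.g. more features than samples):

```python
def normal_equation(X, y):
    """Solve X^T X theta = X^T y; assumes X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def normal_equation_pinv(X, y):
    """theta = pinv(X) @ y, the minimum-norm least-squares solution; also
    covers the singular case where X has more columns than rows."""
    return np.linalg.pinv(X) @ y

theta_star = normal_equation(X, y)
print("theta* =", theta_star)
print("min J  =", cost(X, y, theta_star))
```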

      2. Orthogonally projecting $\vec y$ onto $X$'s column space:

        ∙ Minimizing $E$ means minimizing the length of the vector $X\vec\theta-\vec y$. Intuitively, this leads to the idea of choosing $\vec\theta$ so that $X\vec\theta$ is the orthogonal projection of $\vec y$ onto the image of $X$ (the column space, i.e. the span of the columns of $X$).

        ∙ The matrix for orthogonal projection onto the image of $X$ is $X(X^{\mathbf T}X)^{-1}X^{\mathbf T}$, so if $X\vec\theta=X(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$, then $\vec\theta=(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$ (canceling $X$ on the left is valid because an invertible $X^{\mathbf T}X$ means $X$ has full column rank).
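
Finally, a sketch of this projection view (reusing the hypothetical `X`, `y`, and `theta_star` from the snippets above): build $P=X(X^{\mathbf T}X)^{-1}X^{\mathbf T}$ and check that $X\vec\theta^*$ is exactly the projection of $\vec y$ onto the column space of $X$.

```python
# Orthogonal projection matrix onto the column space (image) of X.
P = X @ np.linalg.inv(X.T @ X) @ X.T

# The fitted values X @ theta_star coincide with the projection of y onto im(X).
assert np.allclose(X @ theta_star, P @ y)

# The residual y - P @ y is orthogonal to every column of X.
assert np.allclose(X.T @ (y - P @ y), 0)
```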
