Deriving the Normal Equation
An alternative to Gradient Descent for Multiple Linear Regression
-
Problem Setup:
- Given a dataset with $m$ samples $(x^{(1)}, x^{(2)}, \dots, x^{(m)})$, each having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}, \dots, y^{(m)})$ with $y^{(1)}$ being the "output" for the first sample, we want to calculate parameters $\theta_0, \dots, \theta_n$ such that the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ gives the least-squares linear fit for our dataset (i.e. for every pair $(x^{(i)}, y^{(i)})$).
- For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so that $\vec x^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$ has the same dimension as $\vec\theta = (\theta_0, \dots, \theta_n)$, and we can rewrite the hypothesis as $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.
-
Derivation
-
Cost Function: Since "least squares" means "minimizing the sum of squared errors", the cost function is
$$J(\theta_0, \dots, \theta_n) = \frac{1}{2m}\sum_{i=1}^m\left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2,$$
where the constant factor $\frac{1}{2m}$ does not change the minimizer. Our task is to minimize this function.
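As a quick numeric sketch (in NumPy, on a made-up toy dataset; the names `X`, `y`, `theta` are purely illustrative), the cost can be computed directly from this sum:

```python
import numpy as np

# Hypothetical toy dataset: m = 3 samples, n = 2 features,
# with the convention x_0 = 1 already baked into the first column.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
y = np.array([10.0, 20.0, 30.0])
theta = np.array([1.0, 2.0, 3.0])  # (theta_0, theta_1, theta_2)

def cost(theta, X, y):
    """J(theta) = (1/2m) * sum_i (theta_0 x_0^(i) + ... + theta_n x_n^(i) - y^(i))^2."""
    m = len(y)
    return sum((theta @ X[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

print(cost(theta, X, y))  # → 8.0  (each residual is 4, so 3*16 / 6 = 8)
```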
-
Viewing $J$ as an inner product: In minimizing $J$, we are looking for $\theta$ values that minimize a sum of squared residuals.
Notice that $\sum_{i=1}^m\left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)} - y^{(i)}\right)^2$ is the same as the following inner product, which returns a scalar:
$$\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)} - y^{(1)}\\ \theta_0 x_0^{(2)} + \dots + \theta_n x_n^{(2)} - y^{(2)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)} - y^{(m)}\end{bmatrix}^{\mathbf T}\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)} - y^{(1)}\\ \theta_0 x_0^{(2)} + \dots + \theta_n x_n^{(2)} - y^{(2)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)} - y^{(m)}\end{bmatrix}$$
(You may expand this product to check that it equals the algebraic expression above.)
-
Viewing the vector used above as a matrix-vector product minus another vector:
- The matrix $\begin{bmatrix}x_0^{(1)} & \cdots & x_n^{(1)}\\ x_0^{(2)} & \cdots & x_n^{(2)}\\ \vdots & & \vdots\\ x_0^{(m)} & \cdots & x_n^{(m)}\end{bmatrix}$ times the vector $\begin{bmatrix}\theta_0\\ \theta_1\\ \vdots\\ \theta_n\end{bmatrix}$ gives the vector $\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)}\\ \theta_0 x_0^{(2)} + \dots + \theta_n x_n^{(2)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)}\end{bmatrix}$.
- We denote the matrix $\begin{bmatrix}x_0^{(1)} & \cdots & x_n^{(1)}\\ \vdots & & \vdots\\ x_0^{(m)} & \cdots & x_n^{(m)}\end{bmatrix}$ as $X$, and the vector $\begin{bmatrix}\theta_0\\ \theta_1\\ \vdots\\ \theta_n\end{bmatrix}$ as $\vec\theta$.
- Therefore, $\begin{bmatrix}\theta_0 x_0^{(1)} + \dots + \theta_n x_n^{(1)} - y^{(1)}\\ \vdots\\ \theta_0 x_0^{(m)} + \dots + \theta_n x_n^{(m)} - y^{(m)}\end{bmatrix}$ is $X\vec\theta - \vec y$, and the inner product is $(X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$.
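To see the equivalence concretely, here is a small NumPy check (random data; all names are illustrative) that the per-sample sum of squares equals the inner product $(X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
# Design matrix X: first column is x_0 = 1, remaining columns are the features.
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Loop form: sum over samples of (theta_0 x_0^(i) + ... + theta_n x_n^(i) - y^(i))^2
loop_form = sum((theta @ X[i] - y[i]) ** 2 for i in range(m))

# Vectorized form: (X theta - y)^T (X theta - y)
r = X @ theta - y
inner_product_form = r @ r

print(np.isclose(loop_form, inner_product_form))  # → True
```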
-
We minimize $(X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$. Let $E(\vec\theta) = (X\vec\theta - \vec y)^{\mathbf T}(X\vec\theta - \vec y)$, a function of the unknown vector $\vec\theta$. If the system $X\vec\theta - \vec y = \vec 0$ has a solution, that solution is enough: $E \geq 0$ always, and the solution attains $E = 0$. However, when this system fails to be solvable, we have to minimize $E$ by other techniques. The following sections show two ways of minimization.
-
Two ways to solve for $\vec\theta$:
-
Taking the partial derivative of $E$ with respect to the unknown vector $\vec\theta$:
- With $u = X\vec\theta - \vec y$, we have $E = u^{\mathbf T}u$, and the chain rule gives $\frac{\partial E}{\partial \vec\theta} = 2X^{\mathbf T}u = 2X^{\mathbf T}(X\vec\theta - \vec y)$.
- To get the local extrema, we set $X^{\mathbf T}(X\vec\theta - \vec y) = \vec 0$, i.e. $X^{\mathbf T}X\vec\theta - X^{\mathbf T}\vec y = \vec 0$, so $\vec\theta = (X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$ after simplifying. However, this solution only applies when $X^{\mathbf T}X$ is invertible.
- Therefore, $\min J(\vec\theta) = \frac{1}{2m}\sum_{i=1}^m\left(\vec{\theta^*}^{\mathbf T}\vec x^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^m\left(\left[(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y\right]^{\mathbf T}\vec x^{(i)} - y^{(i)}\right)^2$ (notice that we evaluate $J$ at the solved vector $\vec\theta^*$), and our hypothesis is $h_\theta(\vec x) = \vec\theta^{\mathbf T}\vec x$ (or $\vec x^{\mathbf T}\vec\theta$), where $\vec x$ is an arbitrary feature vector with its first entry equal to $1$.
- When $X$ has more columns than rows (i.e. the dataset has more features than samples; a row of $X$ represents one sample, and a column represents one unknown), the system $X\vec\theta = \vec y$ typically has infinitely many solutions, and any one of them can be used for $\vec\theta$ depending on the preference of the model.
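A minimal NumPy sketch of this closed-form solution (random data; in practice one solves the linear system $X^{\mathbf T}X\vec\theta = X^{\mathbf T}\vec y$ rather than explicitly forming the inverse):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # x_0 = 1 column
y = rng.normal(size=m)

# Normal equation: theta = (X^T X)^{-1} X^T y, computed by solving
# X^T X theta = X^T y (numerically safer than inverting X^T X).
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_lstsq))  # → True

# When X^T X is singular (e.g. more features than samples), the pseudoinverse
# picks the minimum-norm minimizer among the infinitely many solutions:
theta_pinv = np.linalg.pinv(X) @ y
```

With full column rank, `np.linalg.pinv(X)` coincides with $(X^{\mathbf T}X)^{-1}X^{\mathbf T}$, so all three routes agree here.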
-
Orthogonally projecting $\vec y$ onto $X$'s column space:
- Minimizing $E$ means minimizing the length of the vector $X\vec\theta - \vec y$. Intuitively, this leads to the idea of choosing $\vec\theta$ so that $X\vec\theta$ is the result of orthogonally projecting $\vec y$ onto the image of $X$ (the column space, i.e. the span of the columns of $X$).
- The matrix for orthogonal projection onto the image of $X$ is $X(X^{\mathbf T}X)^{-1}X^{\mathbf T}$, so if $X\vec\theta = X(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$, then $\vec\theta = (X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$ (canceling $X$ on the left is valid when $X$ has full column rank).
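This geometric picture can be checked numerically (a NumPy sketch with random data): $X\vec\theta$ coincides with the projection $X(X^{\mathbf T}X)^{-1}X^{\mathbf T}\vec y$, and the residual is orthogonal to every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = rng.normal(size=m)

# theta from the normal equation, and the projection matrix P = X (X^T X)^{-1} X^T.
theta = np.linalg.solve(X.T @ X, X.T @ y)
P = X @ np.linalg.solve(X.T @ X, X.T)

# X theta is exactly the orthogonal projection of y onto the column space of X.
print(np.allclose(X @ theta, P @ y))          # → True

# The residual y - X theta is orthogonal to every column of X.
print(np.allclose(X.T @ (y - X @ theta), 0))  # → True
```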
-