Deriving the Normal Equation
An alternative to Gradient Descent for Multiple Linear Regression
Problem Setup:
- Given a dataset with $m$ samples $(x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, each sample having $n$ features (so $x^{(1)}_1$ denotes the value of the first feature of the first sample), and the response variables $(y^{(1)}, \ldots, y^{(m)})$, with $y^{(1)}$ being the "output" for the first sample, we want to calculate parameters $\theta_0, \ldots, \theta_n$ such that the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n$ gives the least-squares linear fit for our dataset (i.e. for each pair $(x^{(i)}_j, y^{(i)})$ over all $i$ and $j$).
- For convenience of calculation, let $x_0^{(i)} = 1$ for all $i$, so $\vec x^{(i)} = (x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_n)$ has the same dimension as $\vec\theta = (\theta_0, \ldots, \theta_n)$, and we can rewrite the hypothesis as $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_n x_n = \vec\theta \cdot \vec x^{(i)}$.
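The setup above can be sketched in a few lines of NumPy: prepend the constant feature $x_0 = 1$ to every sample so each row of the design matrix has the same dimension as $\theta$. The dataset here is hypothetical toy data, not from the original post.

```python
import numpy as np

# Hypothetical toy dataset: m = 4 samples, n = 2 features each.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])          # shape (m, n)
m, n = X_raw.shape

# Prepend the constant feature x_0 = 1 to every sample, so each row
# x^{(i)} has the same dimension (n + 1) as theta = (theta_0, ..., theta_n).
X = np.hstack([np.ones((m, 1)), X_raw])  # shape (m, n + 1)

# With that convention, the hypothesis is a single dot product per sample.
theta = np.zeros(n + 1)
predictions = X @ theta                  # h_theta(x^{(i)}) for every i at once
```

With the intercept folded into $\theta_0 x_0$, every prediction is one inner product, which is what makes the matrix form of the derivation possible.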
Derivation
Cost Function: Since "least squares" means minimizing the sum of squared errors, the cost function is $J(\theta_0, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^m \left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \ldots + \theta_n x_n^{(i)} - y^{(i)}\right)^2$, where the constant factor $\frac{1}{2m}$ has no impact on the minimizing $\theta$. Our task is to minimize this function.
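A minimal sketch of this cost function, assuming a design matrix `X` with the $x_0 = 1$ column already prepended (the data and function name are illustrative, not from the post):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (theta . x^{(i)} - y^{(i)})^2."""
    m = len(y)
    residuals = X @ theta - y           # h_theta(x^{(i)}) - y^{(i)} for all i
    return residuals @ residuals / (2 * m)

# Hypothetical toy data where y = 1 + 2*x_1 exactly, so the true theta
# achieves zero cost while any other theta gives a positive cost.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(cost(np.array([1.0, 2.0]), X, y))   # 0.0
```

Note that because $J$ is a sum of squares, it is bounded below by zero, and the $\frac{1}{2m}$ factor rescales $J$ without moving its minimizer.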
Viewing $J$ as an inner product: In minimizing $J$, we are looking for $\theta$ values that minimize a sum of squared Euclidean distances.
Notice that $\sum_{i=1}^m \left(\theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \ldots + \theta_n x_n^{(i)} - y^{(i)}\right)^2 = (X\vec\theta - \vec y)^\top (X\vec\theta - \vec y)$, where $X$ is the $m \times (n+1)$ design matrix whose $i$-th row is $\vec x^{(i)}$ and $\vec y = (y^{(1)}, \ldots, y^{(m)})^\top$ — that is, the squared loss is the inner product of the residual vector $X\vec\theta - \vec y$ with itself.
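A quick numerical check of this identity, plus a sketch of the closed form the derivation is heading toward (per the title): the normal equation $\theta^* = (X^\top X)^{-1} X^\top y$. The random data here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# The per-sample sum of squared residuals equals the inner product
# of the residual vector (X theta - y) with itself.
loop_sum = sum((X[i] @ theta - y[i]) ** 2 for i in range(m))
r = X @ theta - y
assert np.isclose(loop_sum, r @ r)

# The normal equation: theta* = (X^T X)^{-1} X^T y, solved without
# forming the inverse explicitly for numerical stability.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# theta* minimizes the cost: any other theta has at least as large a residual norm.
r_star = X @ theta_star - y
assert r @ r >= r_star @ r_star
```

Using `np.linalg.solve` (or `np.linalg.lstsq`) rather than inverting $X^\top X$ directly is the standard numerically stable way to evaluate this closed form.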