Linear Regression

This article records my notes on linear regression.

Linear regression can be viewed as a single-layer neural network with one output. Through this model, we can build intuition for the theory of gradient descent.

General model

$$f(x)=w^Tx+b=w_1x_1+w_2x_2+\dots+w_nx_n+b$$
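As a minimal sketch of the model above, the prediction for one sample is just a dot product plus a bias. The function name `predict` and the numbers below are made up for illustration; plain Python lists are used so the snippet needs no external library.

```python
def predict(w, x, b):
    """Return f(x) = w^T x + b for a single sample x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hypothetical parameters and one sample with n = 3 features.
w = [2.0, -1.0, 0.5]
b = 1.0
x = [1.0, 3.0, 4.0]

# 2*1 + (-1)*3 + 0.5*4 + 1
y_hat = predict(w, x, b)
print(y_hat)
```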

Numerical solution

Loss function

The factor $\frac{1}{2}$ is there to simplify the derivative. $m$ is the number of samples in a batch.
$$
\begin{aligned}
l &= \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(f(x^i)-y^i\right)^2 \\
  &= \frac{1}{2m}\sum_{i=1}^{m}\left(w^Tx^i+b-y^i\right)^2
\end{aligned}
$$
Once a batch is given, $l$ is a quadratic function of $w$ and $b$. If we fix $b$, we can compute the gradient with respect to $w$. With a large learning rate, the iterates on a quadratic function may keep jumping back and forth around the lowest point instead of settling there, so a good learning rate has to be found by testing.
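To make the quadratic shape concrete, here is a small sketch that evaluates the loss on a toy batch. The batch (generated by $y=2x+1$) and the function names are assumptions for illustration only. With $b$ fixed at 1, the loss is a parabola in $w$ with its lowest point at $w=2$, so values at equal distance on either side of 2 coincide.

```python
def predict(w, x, b):
    """f(x) = w^T x + b for a single sample."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def loss(w, b, X, y):
    """Half mean squared error: 1/(2m) * sum_i (w^T x^i + b - y^i)^2."""
    m = len(X)
    return sum((predict(w, x_i, b) - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * m)


# Toy batch with m = 4 samples and one feature, generated by y = 2x + 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

# Fix b = 1 and vary w: the loss is symmetric around the minimum at w = 2.
for w_value in (1.0, 2.0, 3.0):
    print(w_value, loss([w_value], 1.0, X, y))
```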


First, compute the derivative.
$$
\frac{\partial l}{\partial w}=\frac{1}{2m}\sum_{i=1}^{m}2\left(w^Tx^i+b-y^i\right)x^i,
$$
where $w$ and $x$ are both vectors. For a single component $w_j$:
$$
\begin{aligned}
\frac{\partial l}{\partial w_j} &= \frac{1}{2m}\sum_{i=1}^{m}2\left(w^Tx^i+b-y^i\right)x^i_j \\
  &= \frac{1}{m}\sum_{i=1}^{m}\left(w^Tx^i+b\right)x^i_j-\frac{1}{m}\sum_{i=1}^{m}x^i_jy^i
\end{aligned}
$$
Once a batch is given, we use the constants $p$, $q$ and $z$ to replace $\frac{1}{m}\sum_i x^i_jx^i$, $\frac{1}{m}\sum_i x^i_jy^i$ and $\frac{1}{m}\sum_i x^i_j$ respectively (note that $p$ is a vector, while $q$ and $z$ are scalars):
$$
\frac{\partial l}{\partial w_j}=w^Tp+zb-q,
$$
where $zb-q$ is constant once $b$ is fixed.
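As a sanity check on this rewriting, the sketch below computes the partial derivative with respect to $w_j$ in two ways: directly from the sum over residuals, and through the batch constants $p$, $z$, $q$. The data, parameter values and function names are invented for illustration; the two results should agree up to floating-point rounding.

```python
def grad_wj_direct(w, b, X, y, j):
    """(1/m) * sum_i (w^T x^i + b - y^i) * x^i_j"""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        residual = sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b - y_i
        total += residual * x_i[j]
    return total / m


def grad_wj_constants(w, b, X, y, j):
    """The same derivative written as w^T p + z*b - q."""
    m = len(X)
    n = len(w)
    # p is a vector: p_k = (1/m) * sum_i x^i_j * x^i_k
    p = [sum(x_i[j] * x_i[k] for x_i in X) / m for k in range(n)]
    # z and q are scalars.
    z = sum(x_i[j] for x_i in X) / m
    q = sum(x_i[j] * y_i for x_i, y_i in zip(X, y)) / m
    return sum(w_k * p_k for w_k, p_k in zip(w, p)) + z * b - q


# Hypothetical batch: m = 3 samples, n = 2 features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [1.0, 2.0, 3.0]
w = [0.5, -1.0]
b = 0.25

for j in range(len(w)):
    print(j, grad_wj_direct(w, b, X, y, j), grad_wj_constants(w, b, X, y, j))
```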
Update the parameters, where $\eta$ is the learning rate:
$$
w_j \leftarrow w_j-\eta\,\frac{\partial l}{\partial w_j}
$$
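The update rule can be sketched as a short loop. This is only an illustration under simplifying assumptions: the toy batch is generated by $y=2x+1$, $b$ is held fixed at 1 so that only $w$ is updated, and the learning rates 0.1 and 0.6 are chosen by hand. On this batch the small rate converges to $w=2$, while the large rate overshoots the lowest point on every step and moves further away from it.

```python
def grad_w(w, b, X, y):
    """Gradient of the half-MSE loss with respect to every w_j, with b held fixed."""
    m = len(X)
    residuals = [
        sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b - y_i
        for x_i, y_i in zip(X, y)
    ]
    return [
        sum(r * x_i[j] for r, x_i in zip(residuals, X)) / m
        for j in range(len(w))
    ]


def gradient_descent(X, y, b, lr, steps):
    """Repeat w_j <- w_j - lr * dl/dw_j, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = grad_w(w, b, X, y)
        w = [w_j - lr * g_j for w_j, g_j in zip(w, g)]
    return w


# Toy batch generated by y = 2x + 1; b is fixed at its true value.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
b = 1.0

w_small = gradient_descent(X, y, b, lr=0.1, steps=50)  # settles near w = 2
w_large = gradient_descent(X, y, b, lr=0.6, steps=50)  # overshoots and diverges
print(w_small, w_large)
```

Printing the intermediate values of `w` inside the loop makes the back-and-forth behaviour of the large learning rate easy to see.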

Analytical solution

We can also solve this problem directly with the least squares method.
Refer to the least squares method for details.
The full derivation can be found in Zhou Zhihua's Watermelon Book.
(PS: for an $n\times d$ matrix $w$ with $d>n$, $r(w^Tw)\le r(w^T)\le n$, so the $d\times d$ matrix $w^Tw$ is not full rank and cannot be inverted.)
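For intuition, here is a minimal sketch of the least squares solution in the special case of a single feature, where the closed form reduces to $w=\frac{\sum_i (x^i-\bar{x})(y^i-\bar{y})}{\sum_i (x^i-\bar{x})^2}$ and $b=\bar{y}-w\bar{x}$. The function name and the toy data are assumptions for illustration; the general multi-feature case needs a matrix solve and is not covered here.

```python
def fit_least_squares_1d(xs, ys):
    """Closed-form least squares fit of y = w*x + b for one feature."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        # Degenerate data: every x is identical, so the slope is not unique.
        raise ValueError("all x values are identical; no unique solution")
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b


# Toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = fit_least_squares_1d(xs, ys)
print(w, b)
```

The `s_xx == 0` branch is the one-feature counterpart of the rank remark above: when the data do not carry enough information, the solution is no longer unique.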





