This article records my notes on linear regression. Linear regression is a single-layer neural network with one output. Through this model, we can build intuition for the theory of gradient descent.
General model
$$f(x)=w^Tx+b=w_1x_1+w_2x_2+\dots+w_nx_n+b$$
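To make the model concrete, here is a minimal NumPy sketch; the weights and input are arbitrary example values, not from any dataset:

```python
import numpy as np

def f(x, w, b):
    """The linear model f(x) = w^T x + b."""
    return np.dot(w, x) + b

w = np.array([2.0, -1.0, 0.5])  # example weights w_1..w_3
b = 0.3                         # example bias
x = np.array([1.0, 2.0, 3.0])   # one input sample
print(f(x, w, b))               # 2*1 - 1*2 + 0.5*3 + 0.3 = 1.8
```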
Numerical solution
Loss function
The $\frac{1}{2}$ is for simplicity (it cancels against the exponent when differentiating). $m$ is the number of samples in a batch.
$$l=\frac{1}{m}\sum_i^m\frac{1}{2}(f(x^i)-y^i)^2=\frac{1}{2m}\sum_i^m(w^Tx^i+b-y^i)^2$$
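A small sketch of this batch loss, assuming the samples $x^i$ are stacked as rows of a matrix `X` and `y` holds the targets:

```python
import numpy as np

def loss(X, y, w, b):
    """l = 1/(2m) * sum_i (w^T x^i + b - y^i)^2"""
    m = len(y)
    residual = X @ w + b - y        # vector of (f(x^i) - y^i) over the batch
    return (residual ** 2).sum() / (2 * m)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(loss(X, y, np.array([0.1, 0.2]), 0.0))  # 0.265
```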
When a batch is given, $l$ is a quadratic function of $w$ and $b$. When we fix $b$, we can take the gradient with respect to $w$. You can imagine that with a learning rate that is too large, descent on a quadratic may jitter around the lowest point instead of settling into it. Thus, we need to find a good learning rate by testing; the toy run below shows the effect.
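Here is that intuition on the 1-D quadratic $l(w)=w^2$ (gradient $2w$); the learning rates are arbitrary picks chosen to show the three regimes:

```python
def descend(lr, w=1.0, steps=5):
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * w      # w <- w - lr * dl/dw
        path.append(w)
    return path

print(descend(lr=0.1))  # shrinks smoothly toward the minimum at 0
print(descend(lr=0.9))  # jumps across 0 every step (jitter), still shrinks
print(descend(lr=1.1))  # overshoots further each step and diverges
```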
SGD
First, compute the derivative.
$$\frac{\partial l}{\partial w}=\frac{1}{2m}\sum_i^m 2(w^Tx^i+b-y^i)x^i,$$
where $w$ and $x^i$ are both vectors. For a single component $w_j$:
$$\frac{\partial l}{\partial w_j}=\frac{1}{2m}\sum_i^m 2(w^Tx^i+b-y^i)x^i_j=\frac{1}{m}\sum_i^m(w^Tx^i+b)x^i_j-\frac{1}{m}\sum_i^m x^i_jy^i$$
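Translating this into code, a sketch of the batch gradient, where `X` stacks the samples as rows and `grad_w` collects $\frac{\partial l}{\partial w_j}$ for all $j$ (the bias gradient is included for completeness):

```python
import numpy as np

def gradients(X, y, w, b):
    m = len(y)
    residual = X @ w + b - y      # (w^T x^i + b - y^i) for each sample i
    grad_w = X.T @ residual / m   # component j holds dl/dw_j
    grad_b = residual.mean()      # the analogous derivative w.r.t. b
    return grad_w, grad_b
```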
When a batch is given, we use constants $p$, $q$, and $z$ to replace $\frac{1}{m}\sum_i x^i_jx^i$, $\frac{1}{m}\sum_i x^i_jy^i$, and $\frac{1}{m}\sum_i x^i_j$ respectively:
$$\frac{\partial l}{\partial w_j}=w^Tp+zb-q,$$
where $zb-q$ is constant.
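A quick sketch to check this substitution on synthetic data: precompute $p$, $q$, $z$ for one coordinate $j$ and compare $w^Tp+zb-q$ against the directly computed partial derivative (all values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, j = 8, 3, 1                   # batch size, feature count, coordinate j
X = rng.normal(size=(m, n))         # rows are the samples x^i
y = rng.normal(size=m)
w = rng.normal(size=n)
b = 0.5

p = (X[:, j:j+1] * X).mean(axis=0)  # 1/m * sum_i x^i_j * x^i  (vector)
q = (X[:, j] * y).mean()            # 1/m * sum_i x^i_j * y^i  (scalar)
z = X[:, j].mean()                  # 1/m * sum_i x^i_j        (scalar)

via_constants = w @ p + z * b - q
direct = ((X @ w + b - y) * X[:, j]).mean()
print(np.isclose(via_constants, direct))  # True: both forms agree
```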
Optimizing parameters:
$$w_j \leftarrow w_j - \eta\,\frac{\partial l}{\partial w_j}$$
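Putting the pieces together, a minimal batch gradient descent loop on synthetic data; the learning rate and step count are arbitrary choices, found "by testing" as noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # synthetic samples
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    residual = X @ w + b - y
    w -= lr * (X.T @ residual) / len(y)            # w_j -= lr * dl/dw_j
    b -= lr * residual.mean()                      # same update for the bias
print(w, b)                                        # close to true_w, true_b
```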
Analytical solution
We can directly use the least squares method to solve this problem (refer to the least squares method). The specific derivation can be found in Zhou Zhihua's Watermelon Book (Machine Learning).
(PS: for an $n \times d$ matrix $w$ with $d > n$, $r(w^Tw) \le r(w^T) \le n < d$, so $w^Tw$ is rank-deficient and not invertible.)
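As a sketch of the closed-form solution, assuming the data matrix stacks samples as rows: the bias is folded in with a column of ones, and `np.linalg.pinv` (the pseudoinverse) is used since, per the PS, $X^TX$ can be singular when there are more features than samples:

```python
import numpy as np

def least_squares(X, y):
    Xb = np.hstack([X, np.ones((len(y), 1))])      # fold b in as a ones column
    theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y   # (X^T X)^+ X^T y
    return theta[:-1], theta[-1]                   # recovered w and b
```

In practice `np.linalg.lstsq(Xb, y, rcond=None)` computes the same solution more stably; the normal-equations form is shown only to match the derivation.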