Regression
Examples:
- stock market forecast: output = the Dow Jones Industrial Average tomorrow
- self-driving car: output = angle
- recommendation: output = purchase possibility
Step 1: Model
define a set of functions
e.g. linear model
$$y=b+\sum w_i x_i$$
- $b$: bias
- $w_i$: weight
- $x_i$: an attribute of the input $x$ (a feature)
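As a minimal sketch (my illustration, not from the notes), the linear model $y=b+\sum w_i x_i$ can be written as a one-line function; the weights and inputs below are arbitrary toy values.

```python
import numpy as np

def linear_model(x, w, b):
    # y = b + sum_i w_i * x_i for one input vector x
    return b + np.dot(w, x)

# toy usage with three illustrative features
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 1.0
print(linear_model(x, w, b))  # 1.0 + 0.5 - 0.4 + 0.3 = 1.4
```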
Step 2: Goodness of Function
Loss function $L$:
- input: a function
- output: how bad it is
$$L(f)=\sum_{i=1}^{n} (\hat y^i-f(x^i))^2$$
- $f(x^i)$: the $y$ estimated by the function for input $x^i$
- $(\hat y^i-f(x^i))^2$: estimation error
- $\sum\limits_{i=1}^{n} (\hat y^i-f(x^i))^2$: sum over examples
$$\because f=f(w,b) \quad \therefore L(f)=L(f(w,b))=L(w,b)$$
$$\therefore L(f)=\sum_{m=1}^{n} \left(\hat y^m-\left(b+\sum w \cdot x^m\right)\right)^2$$
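A minimal sketch of evaluating this loss on toy data (the values and the single scalar feature are my simplification, not from the notes):

```python
import numpy as np

def loss(w, b, xs, ys):
    # L(w, b) = sum_i (y_hat^i - (b + w * x^i))^2, scalar-feature case
    preds = b + w * xs
    return np.sum((ys - preds) ** 2)

xs = np.array([1.0, 2.0, 3.0])   # toy inputs x^i
ys = np.array([2.0, 4.1, 6.2])   # toy targets y_hat^i (roughly y = 2x)
print(loss(2.0, 0.0, xs, ys))    # small, since w=2, b=0 fits the data well
```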
Step 3: Best Function
pick the “best” function
$$f^*=\arg \min_{f}L(f)$$
$$w^*,b^*=\arg \min_{w,b}L(w,b)=\arg \min_{w,b}\sum_{i=1}^{n} (\hat y^i-(b+w \cdot x^i))^2$$
Method: Gradient Descent
e.g. 1: consider a loss function $L$ with one parameter $w$:
- (randomly) pick an initial value $w^0$.
- compute $\frac{\text{d}L}{\text{d}w}\rvert_{w=w^0}$.
- if $\frac{\text{d}L}{\text{d}w}\rvert_{w=w^0}<0$ then increase $w$, otherwise decrease it.
- $w^1\gets w^0-\eta\frac{\text{d}L}{\text{d}w}\rvert_{w=w^0}$, where $\eta$ is called the "learning rate"
- compute $\frac{\text{d}L}{\text{d}w}\rvert_{w=w^1}$.
- $w^2\gets w^1-\eta\frac{\text{d}L}{\text{d}w}\rvert_{w=w^1}$
- $\dots$ many iterations
- this method may not find the global minimum; it may get stuck in a local minimum (see the sketch below)
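A minimal sketch of these one-parameter updates, using a hypothetical loss $L(w)=(w-3)^2$ so the derivative is easy to check; the learning rate and iteration count are arbitrary choices:

```python
def dL_dw(w):
    # derivative of the toy loss L(w) = (w - 3)^2
    return 2 * (w - 3)

w = 0.0       # w^0: initial value
eta = 0.1     # learning rate
for _ in range(100):
    w = w - eta * dL_dw(w)   # w^{t+1} <- w^t - eta * dL/dw
print(w)      # approaches the minimum at w = 3
```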
e.g. 2: consider a loss function $L$ with two parameters $w,b$:
- (randomly) pick initial values $w^0,b^0$.
- compute $\frac{\partial L}{\partial w}\rvert_{w=w^0,b=b^0}$ and $\frac{\partial L}{\partial b}\rvert_{w=w^0,b=b^0}$.
- for each parameter, if its partial derivative is negative, increase that parameter; otherwise decrease it.
- $w^1\gets w^0-\eta\frac{\partial L}{\partial w}\rvert_{w=w^0,b=b^0}$
- $b^1\gets b^0-\eta\frac{\partial L}{\partial b}\rvert_{w=w^0,b=b^0}$
- $\cdots$ many iterations
- matrix form:
$$\nabla L=\begin{bmatrix}\frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b}\end{bmatrix}_{\text{gradient}}$$
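A minimal sketch of the two-parameter updates for the model $y=b+wx$, with the partial derivatives of the squared-error loss written out; the toy data, learning rate, and iteration count are illustrative:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])   # toy inputs
ys = np.array([3.1, 5.0, 7.2, 8.9])   # toy targets (roughly y = 2x + 1)

w, b = 0.0, 0.0                       # w^0, b^0
eta = 0.01                            # learning rate
for _ in range(5000):
    err = ys - (b + w * xs)           # y_hat^i - (b + w x^i)
    grad_w = -2 * np.sum(err * xs)    # dL/dw
    grad_b = -2 * np.sum(err)         # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b
print(w, b)                           # close to 2 and 1
```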
P.S.
when solving
$$\theta^*=\arg \min_{\theta}L(\theta)$$
by gradient descent, each time we update the parameters we expect to obtain a $\theta$ that makes $L(\theta)$ smaller:
$$L(\theta^0)>L(\theta^1)>L(\theta^2)>\cdots$$
Is this statement correct?
NOT exactly.
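One reason: if the learning rate is too large, an update can overshoot and make the loss larger. A minimal sketch with the hypothetical loss $L(w)=w^2$ (my choice of example, not from the notes):

```python
def L(w):
    return w ** 2        # toy loss

def dL(w):
    return 2 * w

w, eta = 1.0, 1.5        # eta deliberately too large for this loss
for step in range(3):
    w = w - eta * dL(w)  # gradient descent update
    print(step, w, L(w)) # the loss increases: 4.0, 16.0, 64.0
```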
Improve
Suitable model
$$y=b+w_1 x+w_2 x^2$$
$$y=b+w_1 x+w_2 x^2+w_3 x^3$$
$$y=b+w_1 x+w_2 x^2+w_3 x^3+w_4 x^4$$
$$y=b+w_1 x+w_2 x^2+w_3 x^3+w_4 x^4+w_5 x^5$$
$$\cdots$$
- they are still linear models, because the model is linear in the parameters $w_1,w_2,\dots,b$.
- when the model is more complex, the training error is lower.
- a more complex model does NOT always lead to better performance on testing data (overfitting)
- conclusion: select a suitable model
To address overfitting:
- get more data
- find the hidden factor
- redesign the model
Regularization
$$y=b+\sum w_i x_i$$
$$L(f)=\sum_{i=1}^{n} (\hat y^i-(b+w \cdot x^i))^2+\lambda\sum (w_i)^2$$
- $\lambda\sum (w_i)^2$ is called the regularization term
- functions with smaller $w_i$ are better
- smaller $w_i$ means a smoother function
- we believe a smoother function is more likely to be correct
- $\lambda \uparrow \to$ smoother
- regularizing the bias $b$ is unnecessary
- we prefer a smooth function, but not one that is too smooth
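A minimal sketch of the regularized loss and its gradient for the one-feature model; `lam` and the toy data are my illustrative choices, and the bias $b$ is left out of the penalty as noted above:

```python
import numpy as np

def reg_loss(w, b, xs, ys, lam):
    # squared error plus lambda * w^2 (bias b is not regularized)
    err = ys - (b + w * xs)
    return np.sum(err ** 2) + lam * w ** 2

def reg_grads(w, b, xs, ys, lam):
    err = ys - (b + w * xs)
    grad_w = -2 * np.sum(err * xs) + 2 * lam * w   # penalty only affects w
    grad_b = -2 * np.sum(err)                      # bias gradient is unchanged
    return grad_w, grad_b

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])
print(reg_loss(2.0, 0.0, xs, ys, lam=0.1))   # 0 error + 0.1 * 2^2 = 0.4
```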
Sources of Error
- estimate the mean of a variable $x$:
  - assume the mean of $x$ is $\mu$
  - assume the variance of $x$ is $\sigma^2$
- estimator of the mean $\mu$:
  - sample N points: $\{x^1,x^2,\dots,x^N\}$
  - $m=\frac{1}{N}\sum\limits_{n}x^{n}\ne\mu$
  - $E(m)=E(\frac{1}{N}\sum\limits_{n}x^{n})=\frac{1}{N}\sum\limits_{n}E(x^{n})=\mu$ (unbiased estimator)
  - $Var(m)=\frac{\sigma^2}{N}$
- estimator of the variance $\sigma^2$:
  - sample N points: $\{x^1,x^2,\dots,x^N\}$
  - $m=\frac{1}{N}\sum\limits_{n}x^{n}$
  - $s^2=\frac{1}{N}\sum\limits_{n}(x^{n}-m)^2$
  - $E(s^2)=\frac{N-1}{N}\sigma^2\ne\sigma^2$ (biased estimator)
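For completeness, here is the standard derivation of why $E(s^2)=\frac{N-1}{N}\sigma^2$ (not spelled out in the notes), using the identity $\frac{1}{N}\sum_{n}(x^{n}-m)^2=\frac{1}{N}\sum_{n}(x^{n}-\mu)^2-(m-\mu)^2$:

$$E(s^2)=\frac{1}{N}\sum_{n}E\left[(x^{n}-\mu)^2\right]-E\left[(m-\mu)^2\right]=\sigma^2-Var(m)=\sigma^2-\frac{\sigma^2}{N}=\frac{N-1}{N}\sigma^2$$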
- a simpler model is less influenced by the sampled data, so it has smaller variance
What to do with large bias and variance?
- Diagnosis:
  - If your model cannot even fit the training examples, then you have large bias (underfitting)
  - If you can fit the training data but have large error on testing data, then you probably have large variance (overfitting)
- For large bias, redesign your model:
  - Add more features as input
  - Use a more complex model
- For large variance:
  - More data: very effective, but not always practical
  - Regularization: make the function smoother
Model Selection
- There is usually a trade-off between bias and variance
- Select a model that balances two kinds of error to minimize total error
- What you should NOT do: pick the model based on your own testing set
  - The error on the real testing set may be larger than on your own testing set (your testing set may be biased)
Cross Validation
divide the training set into two parts: a training set and a validation set
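A minimal sketch of this split-based model selection, choosing a polynomial degree by validation error; the toy data, the split, and the candidate degrees are all my illustrative choices (`np.polyfit` is used only as a convenient least-squares fitter):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.1, 60)   # toy data

# split the available training data into a training part and a validation part
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

best_degree, best_err = None, float("inf")
for degree in [1, 2, 3, 5, 9]:                       # candidate models
    coeffs = np.polyfit(x_train, y_train, degree)    # fit on the training part
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if val_err < best_err:
        best_degree, best_err = degree, val_err
print(best_degree)   # typically picks degree 2 for this toy data
```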