Hung-yi Lee's Deep Learning Course
Predicting a Pokémon's Combat Power (CP)
Regression
- Market Forecast: predict tomorrow's stock prices
- Self-driving car: predict the steering-wheel angle
- Recommendation: predict purchase probability (recommender systems)
$f(x_{\text{Pokémon}}) = y$, the 'CP after evolution'
$x_{cp}$: CP before evolution; $x_s$: species; $x_{hp}$: hit points (HP); $x_w$: weight; $x_h$: height
Step 1. Model
A set of functions ————→ Model ( f1, f2, f3 ... )
Linear model:
$$y = b + w \cdot x_{cp}$$
$w$ and $b$ are parameters (they can take any value).
$$y = b + \sum_i w_i x_i$$
$x_i$: an attribute of the input $X$ (a feature), i.e. one of the various attributes of $X$
$w_i$: weight
$b$: bias
Step 2. Goodness of function
function input: $x^1, x^2, x^3, \dots$
function output (scalar): $\hat{y}^1, \hat{y}^2, \hat{y}^3, \dots$
Loss function L :
- input : a function
- output : how bad it is
$$L(f) = L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$$
The estimation error is the difference between the true value $\hat{y}^n$ and the $y$ estimated by the input function. Summing these squared errors measures how good a particular pair of $w$, $b$ is.
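As a minimal sketch, the loss for a given $(w, b)$ can be computed directly from the definition above. The `(x_cp, y_hat)` pairs below are made-up placeholders, not the lecture's ten real training examples:

```python
# Squared-error loss for the linear model y = b + w * x_cp.
def loss(w, b, data):
    # data: list of (x_cp^n, y_hat^n) pairs
    return sum((y_hat - (b + w * x_cp)) ** 2 for x_cp, y_hat in data)

# Hypothetical (CP before evolution, CP after evolution) pairs.
data = [(10, 30), (20, 55), (30, 80)]
print(loss(2.0, 5.0, data))  # evaluates how good w=2.0, b=5.0 is
```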
Step 3. Gradient Descent
Best function (pick the "best" function):
$$f^* = \arg\min_f L(f)$$
$$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$$
Consider a loss function $L(w)$ with a single parameter $w$:
$$w^* = \arg\min_w L(w)$$
Because the loss is differentiable, its gradient can be computed and used for gradient descent.
- (Randomly) pick an initial value $w^0$
- Compute the derivative and take a step against it:
$$\frac{dL}{dw}\Big|_{w=w^0}, \qquad w^1 = w^0 - \eta \frac{dL}{dw}\Big|_{w=w^0}$$
- $\eta$ is called the "learning rate"
- Repeat for many iterations
Gradient descent may stop at a local optimum, which is not necessarily the global optimum.
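A minimal sketch of the one-parameter update rule, on a toy convex loss $L(w) = (w-3)^2$ (a made-up example, not the lecture's CP loss):

```python
# Gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3.
def dL_dw(w):
    return 2 * (w - 3)  # analytic derivative of (w - 3)^2

w = 0.0    # initial value w^0
eta = 0.1  # learning rate
for _ in range(100):          # many iterations
    w = w - eta * dL_dw(w)    # w^{t+1} = w^t - eta * dL/dw
print(round(w, 4))
```

With too large an `eta` the updates diverge; with too small an `eta` convergence is slow, which is why the learning rate matters.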
How about two parameters?
$$w^*, b^* = \arg\min_{w,b} L(w, b)$$
- (Randomly) pick initial values $w^0, b^0$
- Compute the partial derivatives and update:
$$\frac{\partial L}{\partial w}\Big|_{w=w^0, b=b^0}, \qquad \frac{\partial L}{\partial b}\Big|_{w=w^0, b=b^0}$$
$$w^1 = w^0 - \eta \frac{\partial L}{\partial w}\Big|_{w=w^0, b=b^0}, \qquad b^1 = b^0 - \eta \frac{\partial L}{\partial b}\Big|_{w=w^0, b=b^0}$$
$$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b} \end{bmatrix} \quad \text{(gradient)}$$
For linear regression, the loss function $L$ is convex, so there is no local-optimum problem.
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:
$$L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)^2$$
$$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)\left(-x_{cp}^n\right)$$
$$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(\hat{y}^n - (b + w \cdot x_{cp}^n)\right)(-1)$$
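These update rules can be sketched in code. The training pairs below are made-up placeholders standing in for the lecture's ten Pokémon, chosen so they fit $y = 1 + 2x$ exactly:

```python
# Two-parameter gradient descent for y = b + w * x_cp, using the
# partial derivatives derived above.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # hypothetical (x_cp^n, y_hat^n)

w, b = 0.0, 0.0   # initial values w^0, b^0
eta = 0.01        # learning rate
for _ in range(20000):
    grad_w = sum(2 * (y - (b + w * x)) * (-x) for x, y in data)
    grad_b = sum(2 * (y - (b + w * x)) * (-1) for x, y in data)
    w -= eta * grad_w   # w^{t+1} = w^t - eta * dL/dw
    b -= eta * grad_b   # b^{t+1} = b^t - eta * dL/db
print(round(w, 3), round(b, 3))
```

Because this loss is convex, the iterates approach the unique minimizer $(w, b) = (2, 1)$ regardless of the initial values.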
Model Selection
$$\text{Model 1}: y = b + w_1 x_{cp}$$
$$\text{Model 2}: y = b + w_1 x_{cp} + w_2 (x_{cp})^2$$
$$\text{Model 3}: y = b + w_1 x_{cp} + w_2 (x_{cp})^2 + w_3 (x_{cp})^3$$
$$\dots$$
A more complex model does not always lead to better performance on testing data. This is overfitting.
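A quick illustration with made-up data (the lecture uses ten real Pokémon): as the model gets more complex, the training error can only go down, because each model contains the simpler ones as special cases; it is the testing error that can get worse.

```python
import numpy as np

# Fit polynomial models of increasing degree to noisy linear data and
# watch the training error shrink monotonically.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=x.shape)  # toy data

errs = []
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)              # least-squares fit
    err = np.sum((np.polyval(coeffs, x) - y) ** 2) # training error
    errs.append(err)
    print(degree, round(err, 4))
```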
Let's collect more data. There are more hidden factors influencing the previous model: the species of the Pokémon.
Back to Step 1: Redesign the Model
$x_s$ = species of $x$
X ——→
$$\text{if } x_s = \text{Pidgey}:\; y = b_1 + w_1 \cdot x_{cp}$$
$$\text{if } x_s = \text{Weedle}:\; y = b_2 + w_2 \cdot x_{cp}$$
$$\text{if } x_s = \text{Caterpie}:\; y = b_3 + w_3 \cdot x_{cp}$$
$$\text{if } x_s = \text{Eevee}:\; y = b_4 + w_4 \cdot x_{cp}$$
——→
$$y = b_1\,\delta(x_s = \text{Pidgey}) + w_1\,\delta(x_s = \text{Pidgey})\,x_{cp} + \dots + b_4\,\delta(x_s = \text{Eevee}) + w_4\,\delta(x_s = \text{Eevee})\,x_{cp}$$
$$\delta(x_s = \text{Pidgey}) = \begin{cases} 1, & \text{if } x_s = \text{Pidgey} \\ 0, & \text{otherwise} \end{cases}$$
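A sketch of the indicator trick, with made-up per-species parameters (the real $b_i, w_i$ would come from training):

```python
# Piecewise-linear model rewritten with indicator (delta) functions.
# Per-species (b, w) values are hypothetical, for illustration only.
PARAMS = {"Pidgey": (2.0, 1.5), "Weedle": (1.0, 1.2),
          "Caterpie": (0.5, 1.1), "Eevee": (3.0, 2.0)}

def delta(condition):
    return 1.0 if condition else 0.0

def predict(x_s, x_cp):
    # Sum over all species: exactly one indicator is 1, the rest are 0,
    # so only one species' (b, w) contributes.
    return sum(b * delta(x_s == s) + w * delta(x_s == s) * x_cp
               for s, (b, w) in PARAMS.items())

print(predict("Eevee", 10.0))
```

This rewrites the if/else model as a single linear function of the features, so it is still linear regression and the same gradient descent applies.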
Are there any other hidden factors?
Back to Step 2: Regularization
$$y = b + \sum_i w_i x_i$$
$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i\right)\right)^2 + \lambda \sum_i (w_i)^2$$
training error + regularization term
- $b$ has no effect on how smooth the function is, so it is not regularized.
- Functions with smaller $w_i$ are better: the smaller the $w_i$, the smoother the function.
- The larger $\lambda$ is, the less weight the training error gets.
- A larger $\lambda$ gives a smoother function, but it should not be too smooth.
Why are smooth functions preferred?
A smooth function is less sensitive to noise in the input: if some noise corrupts the input $x_i$ at test time, a smooth function is influenced less.
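A minimal sketch of how $\lambda$ shrinks the weights, using the closed-form solution for a one-feature L2-regularized fit (bias omitted for brevity; the data are made up):

```python
import numpy as np

# L2-regularized least squares: larger lambda shrinks w toward 0,
# giving a smoother (flatter) function.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.2, 3.9, 6.1])   # roughly y = 2x, toy data

def fit_w(lam):
    # Minimizes sum_n (y^n - w x^n)^2 + lam * w^2; setting the
    # derivative to zero gives w = (x . y) / (x . x + lam).
    return float(x @ y / (x @ x + lam))

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(fit_w(lam), 4))
```

As `lam` grows the fitted `w` decreases toward 0, trading training error for smoothness.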
Where do the errors come from?
- bias
- variance
A simpler model is less influenced by the sampled data.
- simple model → small variance, large bias ( underfitting )
- complex model → large variance, small bias ( overfitting )
A more complex model class contains the simpler ones as special cases.
For bias, redesign your model:
- add more features as input
- a more complex model
What can be done about large variance?
- more data (collect real data, or generate synthetic data): very effective, but not always practical
- regularization