1 Regression
Regression: the output is a scalar.
What is regression? Any task whose output is a numerical value is regression.
Step 1: Model (Function Set)
A set of functions:
$f_1: y = 10.0 + 9.0 \cdot x_{cp}$
$f_2: y = 9.8 + 9.2 \cdot x_{cp}$
$f_3: y = -0.8 - 1.2 \cdot x_{cp}$
Linear Model
$x_i$: an attribute of the input $x$ (a feature).
$w_i$: weight; $b$: bias.
$y = b + \sum_i w_i x_i$
Step 2: Goodness of Function
$y = b + w \cdot x_{cp}$
$\hat{y}$ denotes the true (target) value.
A superscript indexes a complete example in the data;
a subscript indexes one attribute of that example.
To measure how good a function is, we need a loss function.
Loss function
$L(f) = \sum_{n=1}^{10} (\hat{y}^n - f(x^n_{cp}))^2$
The loss function is a function of functions.
$L(f) \rightarrow L(w, b)$
$L(f) = L(w, b) = \sum_{n=1}^{10} (\hat{y}^n - (b + w \cdot x^n_{cp}))^2$
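A minimal sketch of this loss in Python (the ten data pairs below are fabricated for illustration, not the lecture's actual Pokémon data):

```python
# Sum-of-squared-errors loss for the model y = b + w * x_cp.
x_cp  = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]            # made-up CP values
y_hat = [105, 190, 310, 400, 495, 610, 695, 790, 905, 1000]  # made-up targets

def loss(w, b):
    """L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2"""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(x_cp, y_hat))

print(loss(9.0, 10.0))  # loss of f_1: y = 10.0 + 9.0 * x_cp
print(loss(9.2, 9.8))   # loss of f_2: y = 9.8 + 9.2 * x_cp
```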
Step 3: Best Function
Pick the best function:
$f^* = \arg\min_f L(f)$
$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10} (\hat{y}^n - (b + w \cdot x^n_{cp}))^2$
Step 3: Gradient Descent
One parameter
$w^* = \arg\min_w L(w)$
- Pick an initial value $w_0$.
- Compute $\frac{dL}{dw}\Big|_{w=w_0}$, then update
  $w_1 \leftarrow w_0 - \eta \frac{dL}{dw}\Big|_{w=w_0}$
- Compute $\frac{dL}{dw}\Big|_{w=w_1}$, then update
  $w_2 \leftarrow w_1 - \eta \frac{dL}{dw}\Big|_{w=w_1}$
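As a sketch, the one-parameter update loop looks like this in Python (data, learning rate, and step count are all invented for illustration):

```python
# Gradient descent on a single parameter w for L(w) = sum_n (y^n - w * x^n)^2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, fabricated

eta = 0.01   # learning rate
w = 0.0      # initial value w_0

for step in range(100):
    grad = sum(2 * (w * xn - yn) * xn for xn, yn in zip(x, y))  # dL/dw
    w = w - eta * grad   # w_{t+1} <- w_t - eta * dL/dw
print(w)  # converges near 2.0
```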
Two parameters
$w^*, b^* = \arg\min_{w,b} L(w, b)$
- Pick initial values $w_0$ and $b_0$.
- Compute the partial derivatives (review how to take partial derivatives from calculus)
  $\frac{\partial L}{\partial w}\Big|_{w=w_0, b=b_0}$, $\frac{\partial L}{\partial b}\Big|_{w=w_0, b=b_0}$
  $w_1 \leftarrow w_0 - \eta \frac{\partial L}{\partial w}\Big|_{w=w_0, b=b_0}$
  $b_1 \leftarrow b_0 - \eta \frac{\partial L}{\partial b}\Big|_{w=w_0, b=b_0}$
- Compute
  $\frac{\partial L}{\partial w}\Big|_{w=w_1, b=b_1}$, $\frac{\partial L}{\partial b}\Big|_{w=w_1, b=b_1}$
  $w_2 \leftarrow w_1 - \eta \frac{\partial L}{\partial w}\Big|_{w=w_1, b=b_1}$
  $b_2 \leftarrow b_1 - \eta \frac{\partial L}{\partial b}\Big|_{w=w_1, b=b_1}$
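The same loop with both parameters updated simultaneously, again a toy sketch with invented data:

```python
# Gradient descent on (w, b) for L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2.
x_cp  = [0.1, 0.3, 0.5, 0.7, 0.9]   # fabricated inputs
y_hat = [1.2, 1.7, 2.1, 2.4, 3.0]   # fabricated targets

eta = 0.1
w, b = 0.0, 0.0   # initial values w_0, b_0

for step in range(1000):
    err = [(b + w * x) - y for x, y in zip(x_cp, y_hat)]   # prediction - target
    grad_w = sum(2 * e * x for e, x in zip(err, x_cp))     # dL/dw
    grad_b = sum(2 * e for e in err)                       # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b              # simultaneous update
print(w, b)
```

Note that both gradients are computed from the old $(w, b)$ before either parameter is updated.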
Problems
Gradient descent is not guaranteed to reach the global minimum; it can:
- get stuck at a local minimum
- get stuck at a saddle point
- be very slow on a plateau
The loss function of linear regression is convex, so there is no need to worry about local minima.
Learning Rate $\eta$
The learning rate $\eta$ controls the step size and thus how fast learning proceeds.
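A toy illustration (all numbers invented) on the quadratic loss $L(w) = (w - 2)^2$: a small $\eta$ converges slowly, while an overly large $\eta$ overshoots and diverges.

```python
# Effect of the learning rate eta on gradient descent for L(w) = (w - 2)^2.
def descend(eta, steps=20, w=0.0):
    for _ in range(steps):
        w = w - eta * 2 * (w - 2)   # dL/dw = 2 * (w - 2)
    return w

print(descend(0.01))  # too small: still far from the minimum at w = 2
print(descend(0.5))   # well chosen: lands on the minimum immediately
print(descend(1.1))   # too large: each step overshoots; w diverges
```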
Another Linear Model
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3$
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4$
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4 + w_5 \cdot (x_{cp})^5$
Whether a model is "linear" refers to whether its output is linear in its parameters, not in its input.
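A minimal numpy sketch of this point (toy data, degree 2 chosen arbitrarily): fitting $y = b + w_1 x + w_2 x^2$ is still ordinary linear least squares over the feature vector $(1, x, x^2)$.

```python
import numpy as np

# Toy data from a noisy quadratic: nonlinear in x, linear in the parameters.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.05, size=x.shape)

# Feature matrix [1, x, x^2]; the parameters (b, w1, w2) enter linearly.
X = np.stack([np.ones_like(x), x, x**2], axis=1)
params, *_ = np.linalg.lstsq(X, y, rcond=None)
print(params)  # approximately [1.0, 2.0, 3.0]
```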
- A more complex model yields lower error on training data, provided we can truly find the best function within it.
Model Selection
model (degree) | Training error | Testing error |
---|---|---|
1 | 31.9 | 35.0 |
2 | 15.4 | 18.4 |
3 | 15.3 | 18.1 |
4 | 14.9 | 28.2 |
5 | 12.8 | 232.1 |
- A more complex model does not always lead to better performance on testing data.
- This is overfitting.
A complex model's model space contains that of a simpler model, so it achieves a lower error on the training data; but this does not imply a lower error on the testing data. When the model is too complex, overfitting occurs.
What are the hidden factors?
Consider the effect of the Pokémon's species on its CP value.
Back to step 1: Redesign the Model
$\text{if } x_s = \text{Pidgey}: y = b_1 + w_1 \cdot x_{cp}$
$\text{if } x_s = \text{Weedle}: y = b_2 + w_2 \cdot x_{cp}$
$\text{if } x_s = \text{Caterpie}: y = b_3 + w_3 \cdot x_{cp}$
$\text{if } x_s = \text{Eevee}: y = b_4 + w_4 \cdot x_{cp}$
$\downarrow$
$y = b_1 \cdot \delta(x_s = \text{Pidgey}) + w_1 \cdot \delta(x_s = \text{Pidgey}) \cdot x_{cp}$
$\quad + b_2 \cdot \delta(x_s = \text{Weedle}) + w_2 \cdot \delta(x_s = \text{Weedle}) \cdot x_{cp}$
$\quad + b_3 \cdot \delta(x_s = \text{Caterpie}) + w_3 \cdot \delta(x_s = \text{Caterpie}) \cdot x_{cp}$
$\quad + b_4 \cdot \delta(x_s = \text{Eevee}) + w_4 \cdot \delta(x_s = \text{Eevee}) \cdot x_{cp}$
Here $\delta(\cdot)$ equals 1 when its condition holds and 0 otherwise, so the combined expression is still a linear model.
Training error = 3.8, testing error = 14.3.
This model performs better on the testing data.
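A sketch of this species-conditional model using indicator (one-hot) features; the species list matches the equations above, but the data points are fabricated:

```python
import numpy as np

SPECIES = ["Pidgey", "Weedle", "Caterpie", "Eevee"]

def features(x_cp, x_s):
    """delta(x_s = s) and delta(x_s = s) * x_cp for each species s."""
    d = [1.0 if x_s == s else 0.0 for s in SPECIES]
    return np.array(d + [di * x_cp for di in d])   # [b_1..b_4, w_1..w_4] slots

# Fabricated training examples: (x_cp, species, y_hat).
data = [(10, "Pidgey", 105), (23, "Weedle", 240), (7, "Caterpie", 63),
        (31, "Eevee", 350), (14, "Pidgey", 145), (19, "Weedle", 200),
        (12, "Caterpie", 110), (25, "Eevee", 280)]

X = np.stack([features(x, s) for x, s, _ in data])
y = np.array([t for _, _, t in data])
params, *_ = np.linalg.lstsq(X, y, rcond=None)   # recovers b_1..b_4, w_1..w_4
print(params)
```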
Are there any other hidden factors?
The effect of HP, weight, and height on the CP value.
Back to step 1: Redesign the Model Again
$\text{if } x_s = \text{Pidgey}: y' = b_1 + w_1 \cdot x_{cp} + w_5 \cdot (x_{cp})^2$
$\text{if } x_s = \text{Weedle}: y' = b_2 + w_2 \cdot x_{cp} + w_6 \cdot (x_{cp})^2$
$\text{if } x_s = \text{Caterpie}: y' = b_3 + w_3 \cdot x_{cp} + w_7 \cdot (x_{cp})^2$
$\text{if } x_s = \text{Eevee}: y' = b_4 + w_4 \cdot x_{cp} + w_8 \cdot (x_{cp})^2$
$\downarrow$
$y = y' + w_9 \cdot x_{hp} + w_{10} \cdot (x_{hp})^2 + w_{11} \cdot x_h + w_{12} \cdot (x_h)^2 + w_{13} \cdot x_w + w_{14} \cdot (x_w)^2$
Training error = 1.9, testing error = 102.3. Overfitting!
If we take all these other attributes into account at once and choose a very complex model, the result is overfitting.
Back to step 2: Regularization
A method that is generally useful across many different testing scenarios: regularization.
$L(f) = L(w, b) = \sum_{n} (\hat{y}^n - (b + \sum_i w_i \cdot x_i))^2 + \lambda \sum_i (w_i)^2$
Minimizing this also pushes the weights $w_i$ to be small, which means the function is smoother.
$y = b + \sum_i w_i x_i$
$y + \sum_i w_i \Delta x_i = b + \sum_i w_i (x_i + \Delta x_i)$
If the $w_i$ are small, a change $\Delta x_i$ in the input causes only a small change in the output, i.e., the function is smooth and less sensitive to input noise.
$\lambda$ | Training error | Testing error |
---|---|---|
0 | 1.9 | 102.3 |
1 | 2.3 | 68.7 |
10 | 3.5 | 25.7 |
100 | 4.1 | 11.1 |
1000 | 5.6 | 12.8 |
10000 | 6.3 | 18.7 |
100000 | 8.5 | 26.8 |
As $\lambda$ increases, we obtain a smoother function. The larger $\lambda$ is, the less weight is placed on the training error. Tune $\lambda$ and choose the value that minimizes the testing error, as in the sketch below.
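A minimal sketch of this tuning loop (fabricated data and hyperparameters; the bias $b$ is not penalized, consistent with the smoothness argument above):

```python
import numpy as np

# Fabricated 1-feature data split into training and testing sets.
rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 20); y_tr = 1.0 + 3.0 * x_tr + rng.normal(0, 0.3, 20)
x_te = rng.uniform(0, 1, 20); y_te = 1.0 + 3.0 * x_te + rng.normal(0, 0.3, 20)

def train(lam, eta=0.005, steps=5000):
    """Gradient descent on L(w, b) = sum (y - (b + w*x))^2 + lam * w^2."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = (b + w * x_tr) - y_tr
        grad_w = 2 * np.sum(err * x_tr) + 2 * lam * w   # penalty on w only
        grad_b = 2 * np.sum(err)                        # b is not regularized
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

for lam in [0, 1, 10, 100]:
    w, b = train(lam)
    test_err = np.sum((y_te - (b + w * x_te)) ** 2)
    print(f"lambda={lam}: w={w:.3f}, testing error={test_err:.2f}")
```

Larger $\lambda$ shrinks $w$ toward zero, which smooths the function but eventually hurts the fit, mirroring the table above.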