Hung-yi Lee Machine Learning 02: Regression
ML Lecture 1 Regression - Case Study
I. Definition of Regression
1. Regression: Output a scalar
- The output of regression is a numeric value
Regression means finding a function (the model) that takes the input features $x^{1}, x^{2}, \dots, x^{n}$ and outputs a scalar.
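2. Examples of Regression
- 1. Stock market forecast
- Input: stock movements over the past 10 years, news, company merger and acquisition news, etc.
- Output: the predicted stock market average for tomorrow
- 2. Self-driving car
- Input: data from the car's sensors, such as road conditions and measured distances to other vehicles
- Output: the steering wheel angle
- 3. Recommendation
- Input: the features of product A and the features of product B
- Output: the likelihood of purchasing product B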
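II. Implementing Regression (the Steps of Machine Learning)
Take predicting a Pokémon's CP (combat power) value as the example:
- Input: the Pokémon's current data
- Output: the Pokémon's CP value after evolution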
Step 1: define a set of functions - Linear Model
Choose a model. Start with a linear model that uses only the Pokémon's current CP value, $x_{cp}$.
Model:
- $y = b + w \cdot x_{cp}$
Linear model (general form):
$y = b + \sum_i w_i x_i$
$x_i$: the $i$-th input feature
$w_i$: weight
$b$: bias
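The two model forms above can be sketched in a few lines of Python. This is only an illustration: the function names `predict_single` and `predict_linear` and the parameter values are assumptions chosen for the example, not values from the lecture.

```python
from typing import Sequence


def predict_single(x_cp: float, w: float, b: float) -> float:
    """Single-feature model: y = b + w * x_cp."""
    return b + w * x_cp


def predict_linear(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """General linear model: y = b + sum_i w_i * x_i."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical parameters, chosen only to show the arithmetic.
y_single = predict_single(x_cp=100.0, w=2.0, b=10.0)        # 10 + 2 * 100
y_general = predict_linear([1.0, 2.0], [3.0, 4.0], b=0.5)   # 0.5 + 3 + 8
print(y_single, y_general)
```

Each choice of `w` and `b` gives one function, so the "set of functions" in Step 1 is the set of all such choices.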
Step 2: goodness of function - Loss Function
Define an evaluation function. The difference between the actual CP value after evolution and the CP value predicted by the model measures how good the model is.
$$L(f)=\sum_{i=1}^{n}\big(\hat{y}^i-f(x_{cp}^i)\big)^2$$
- $\hat{y}^i$: the actual CP value after evolution for the $i$-th example
- $f(x_{cp}^i)$: the estimated $y$ for the input $x_{cp}^i$
- $\hat{y}^i-f(x_{cp}^i)$: the estimation error
- $\sum_{i=1}^{n}$: sum over all $n$ examples
Substituting the parameters $w, b$ into the evaluation function:
- $L(w,b)=\sum_{i=1}^{n}\big(\hat{y}^i-(b+w\cdot x_{cp}^i)\big)^2$
Loss function:
A loss function measures how far the model's predictions $f(x)$ deviate from the true values $\hat{y}$. It is a non-negative real-valued function.
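As a small sketch of the loss $L(w,b)$ defined above, the following Python function sums the squared errors over all examples. The data points are made up for illustration (generated from $y = 5 + 2x$), and the name `loss` is an assumption of this example.

```python
from typing import Sequence


def loss(w: float, b: float, x_cp: Sequence[float], y_hat: Sequence[float]) -> float:
    """Sum of squared errors: sum_i (y_hat_i - (b + w * x_cp_i))^2."""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(x_cp, y_hat))


# Made-up data that lies exactly on the line y = 5 + 2x.
x_cp = [10.0, 20.0, 30.0]
y_hat = [25.0, 45.0, 65.0]

perfect = loss(2.0, 5.0, x_cp, y_hat)  # every error is zero
worse = loss(1.0, 5.0, x_cp, y_hat)    # errors 10, 20, 30 -> 100 + 400 + 900
print(perfect, worse)
```

A smaller loss means a better function, and the loss can never be negative because every term is a square.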
Step 3: pick the best function - Gradient Descent
Find the best model, that is, the parameter values that minimize the loss function.
1. Consider a loss function $L(w)$ with one parameter $w$:
Method: gradient descent.
Gradient: for a function of one variable, the gradient is simply the derivative, which is the slope of the tangent line at a given point.
- (Randomly) pick an initial value $w^0$
- Compute the derivative of the loss at $w^0$: $\left.\frac{dL}{dw}\right|_{w=w^0}$
- Negative -> increase $w$; positive -> decrease $w$:
$$w \text{ should } \begin{cases} \text{decrease} & \text{if } \left.\frac{dL}{dw}\right|_{w=w^0} > 0 \\ \text{increase} & \text{if } \left.\frac{dL}{dw}\right|_{w=w^0} < 0 \end{cases}$$
- Update: $w^1 \gets w^0-\eta\cdot\left.\frac{dL}{dw}\right|_{w=w^0}$
Learning rate (step size):
In the term $-\eta\cdot\left.\frac{dL}{dw}\right|_{w=w^0}$, $\eta$ is called the "learning rate".
- Many iterations:
- Compute $\left.\frac{dL}{dw}\right|_{w=w^1}$
- $w^2 \gets w^1-\eta\cdot\left.\frac{dL}{dw}\right|_{w=w^1}$
- Compute $\left.\frac{dL}{dw}\right|_{w=w^2}$
- $w^3 \gets w^2-\eta\cdot\left.\frac{dL}{dw}\right|_{w=w^2}$
- … …
- The updates stop changing $w$ once the derivative is zero. The value being sought is
$$w^{*}=\arg\min_{w}L(w)$$
Local minima and global minima: in general, gradient descent can stop at a local minimum instead of the global minimum.
Note: linear regression has no local minima, because its loss function is convex.
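The update rule above can be sketched as a short loop. This is a minimal illustration under assumptions: the toy loss $L(w) = (w-3)^2$, the starting point, the learning rate, and the step count are all chosen for the example, and a fixed number of steps replaces any real stopping test.

```python
from typing import Callable


def gradient_descent_1d(
    dL_dw: Callable[[float], float], w0: float, eta: float, steps: int
) -> float:
    """Repeat w <- w - eta * dL/dw(w), starting from w0."""
    w = w0
    for _ in range(steps):
        # A negative derivative increases w, a positive one decreases it.
        w = w - eta * dL_dw(w)
    return w


# Toy loss L(w) = (w - 3)^2 with derivative 2 * (w - 3); its minimum is at w = 3.
def toy_derivative(w: float) -> float:
    return 2.0 * (w - 3.0)


w_final = gradient_descent_1d(toy_derivative, w0=0.0, eta=0.1, steps=100)
print(w_final)
```

With this convex toy loss the iterates move steadily toward $w = 3$. A learning rate that is too large would make them overshoot, so $\eta$ usually needs some tuning.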
2. Consider a loss function $L(w,b)$ with two parameters $(w,b)$:
Method: gradient descent.
Gradient: for a function of several variables, the gradient is a vector. Its direction is the direction in which the function increases fastest at the given point.
$$\nabla L=\begin{bmatrix} \frac{\partial L}{\partial w} \\ \frac{\partial L}{\partial b} \end{bmatrix}_{\text{gradient}}$$
- (Randomly) pick initial values $w^0, b^0$
- Compute the partial derivatives of the loss at $(w^0, b^0)$: $\left.\frac{\partial L}{\partial w}\right|_{w=w^0,b=b^0}$, $\left.\frac{\partial L}{\partial b}\right|_{w=w^0,b=b^0}$
$w^1\gets w^0-\eta\cdot\left.\frac{\partial L}{\partial w}\right|_{w=w^0,b=b^0}$
$b^1\gets b^0-\eta\cdot\left.\frac{\partial L}{\partial b}\right|_{w=w^0,b=b^0}$
- Many iterations:
- Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^1,b=b^1}$, $\left.\frac{\partial L}{\partial b}\right|_{w=w^1,b=b^1}$
$w^2\gets w^1-\eta\cdot\left.\frac{\partial L}{\partial w}\right|_{w=w^1,b=b^1}$
$b^2\gets b^1-\eta\cdot\left.\frac{\partial L}{\partial b}\right|_{w=w^1,b=b^1}$
- Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^2,b=b^2}$, $\left.\frac{\partial L}{\partial b}\right|_{w=w^2,b=b^2}$
$w^3\gets w^2-\eta\cdot\left.\frac{\partial L}{\partial w}\right|_{w=w^2,b=b^2}$
$b^3\gets b^2-\eta\cdot\left.\frac{\partial L}{\partial b}\right|_{w=w^2,b=b^2}$
- … …
- The values being sought are
$$w^{*},b^{*}=\arg\min_{w,b}L(w,b)$$
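The two-parameter procedure can be sketched in the same way as the one-parameter case. In this illustration the gradient is passed in as a function, and the toy loss $L(w,b) = (w-3)^2 + (b+1)^2$, the learning rate, and the step count are assumptions chosen for the example.

```python
from typing import Callable, Tuple


def gradient_descent_2d(
    grad: Callable[[float, float], Tuple[float, float]],
    w0: float,
    b0: float,
    eta: float,
    steps: int,
) -> Tuple[float, float]:
    """Repeat (w, b) <- (w, b) - eta * gradient, starting from (w0, b0)."""
    w, b = w0, b0
    for _ in range(steps):
        # Both partial derivatives are evaluated at the current (w, b) ...
        dL_dw, dL_db = grad(w, b)
        # ... and only then are w and b updated together.
        w, b = w - eta * dL_dw, b - eta * dL_db
    return w, b


# Toy loss L(w, b) = (w - 3)^2 + (b + 1)^2, minimized at (3, -1).
def toy_grad(w: float, b: float) -> Tuple[float, float]:
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


w_fit, b_fit = gradient_descent_2d(toy_grad, w0=0.0, b0=0.0, eta=0.1, steps=200)
print(w_fit, b_fit)
```

Updating `w` and `b` in one tuple assignment keeps the step faithful to the formulas: the new `b` is computed from the old `w`, not from the freshly updated one.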
3. Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
Partial derivatives for the Pokémon CP example:
Model: $y=b+w\cdot x_{cp}$
Loss function: $L(w,b)=\sum_{i=1}^{n}\big(\hat{y}^i-(b+w\cdot x_{cp}^i)\big)^2$
By the chain rule, differentiating the inner term $\hat{y}^i-(b+w\cdot x_{cp}^i)$ contributes a factor of $-x_{cp}^i$ for $w$ and $-1$ for $b$:
$$\frac{\partial L}{\partial w}=\sum_{i=1}^{n}2\big(\hat{y}^i-(b+w\cdot x_{cp}^i)\big)\cdot(-x_{cp}^i)$$
$$\frac{\partial L}{\partial b}=\sum_{i=1}^{n}2\big(\hat{y}^i-(b+w\cdot x_{cp}^i)\big)\cdot(-1)$$
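One way to sanity-check these partial derivatives, including the chain-rule factors $-x_{cp}^i$ and $-1$, is to compare them with finite differences and then use them in gradient descent. The sketch below does both on made-up data generated from $y = 1 + 2x$. The function names, the data, the step size `h`, the learning rate, and the iteration count are assumptions of this example.

```python
from typing import Sequence, Tuple


def loss(w: float, b: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """L(w, b) = sum_i (y_i - (b + w * x_i))^2."""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(xs, ys))


def gradients(
    w: float, b: float, xs: Sequence[float], ys: Sequence[float]
) -> Tuple[float, float]:
    """Analytic partial derivatives of the loss with respect to w and b."""
    dL_dw = sum(2.0 * (y - (b + w * x)) * (-x) for x, y in zip(xs, ys))
    dL_db = sum(2.0 * (y - (b + w * x)) * (-1.0) for x, y in zip(xs, ys))
    return dL_dw, dL_db


def numeric_gradients(
    w: float, b: float, xs: Sequence[float], ys: Sequence[float], h: float = 1e-6
) -> Tuple[float, float]:
    """Central finite differences, used only to check the analytic formulas."""
    dL_dw = (loss(w + h, b, xs, ys) - loss(w - h, b, xs, ys)) / (2.0 * h)
    dL_db = (loss(w, b + h, xs, ys) - loss(w, b - h, xs, ys)) / (2.0 * h)
    return dL_dw, dL_db


# Made-up data on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

analytic = gradients(0.5, 0.0, xs, ys)
numeric = numeric_gradients(0.5, 0.0, xs, ys)
print(analytic, numeric)

# Gradient descent with the analytic gradient.
w, b = 0.0, 0.0
eta = 0.01
for _ in range(3000):
    dL_dw, dL_db = gradients(w, b, xs, ys)
    w, b = w - eta * dL_dw, b - eta * dL_db
print(w, b)
```

If the minus signs were dropped from the gradient, the update would climb the loss surface instead of descending it, which is why the finite-difference comparison is a useful check.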
III. Improving the Regression Model
1. Select another model
(1) Linear model:
$$y=b+w\cdot x_{cp}$$
(2) Non-linear models (polynomials in $x_{cp}$, which are non-linear in the input but still linear in the parameters $b, w_1, w_2, \dots$):
- $y=b+w_1\cdot x_{cp}+w_2\cdot x_{cp}^2$
- $y=b+w_1\cdot x_{cp}+w_2\cdot x_{cp}^2+w_3\cdot x_{cp}^3$
- $y=b+w_1\cdot x_{cp}+w_2\cdot x_{cp}^2+w_3\cdot x_{cp}^3+w_4\cdot x_{cp}^4$
- $y=b+w_1\cdot x_{cp}+w_2\cdot x_{cp}^2+w_3\cdot x_{cp}^3+w_4\cdot x_{cp}^4+w_5\cdot x_{cp}^5$
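(3) Overfitting
A more complex model contains a larger set of functions, so it is more likely to contain the ideal function. However, if such a model is fitted too closely to the training data, overfitting occurs.
Overfitting means that the model performs very well on the training set but only moderately on the cross-validation and test sets. In other words, it predicts unseen samples poorly, and its generalization ability is weak.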
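The effect can be explored with a small experiment: fit polynomials of degree 1 to 5 on a few training points and compare the average error on the training set with the error on held-out points. The sketch below uses synthetic data (a line plus Gaussian noise) and solves the least-squares problem through the normal equations rather than gradient descent. The data, the seed, and all function names are assumptions of this example, not results from the lecture.

```python
import random
from typing import List, Sequence, Tuple


def solve(A: List[List[float]], y: List[float]) -> List[float]:
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fit_poly(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares coefficients [b, w_1, ..., w_degree] via the normal equations."""
    feats = [[x ** k for k in range(degree + 1)] for x in xs]
    size = degree + 1
    A = [[sum(f[i] * f[j] for f in feats) for j in range(size)] for i in range(size)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(size)]
    return solve(A, rhs)


def avg_error(coef: Sequence[float], xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the polynomial with coefficients coef."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(c * x ** k for k, c in enumerate(coef))
        total += (y - pred) ** 2
    return total / len(xs)


rng = random.Random(0)  # fixed seed so the experiment is repeatable


def sample(n: int) -> Tuple[List[float], List[float]]:
    """Synthetic data: y = 1 + 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


train_x, train_y = sample(10)
test_x, test_y = sample(10)

results = []
for degree in range(1, 6):
    coef = fit_poly(train_x, train_y, degree)
    train_err = avg_error(coef, train_x, train_y)
    test_err = avg_error(coef, test_x, test_y)
    results.append((degree, train_err, test_err))
    print(f"degree {degree}: train {train_err:.4f}, test {test_err:.4f}")
```

Because each higher-degree model contains the lower-degree ones, the training error cannot increase with the degree. The test error carries no such guarantee, and a gap between the two columns is the symptom of overfitting described above.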
2. Consider the hidden factors
A Pokémon's CP value after evolution also depends on its species, so different species correspond to different models.
An indicator function can be used to combine the per-species models into one:
$$\delta(x_s=\text{model}_i)=\begin{cases} 1 & \text{if } x_s=\text{model}_i \\ 0 & \text{if } x_s\neq\text{model}_i \end{cases}$$
For example, combining the models linearly:
$$y=\sum_{i=1}^{n}\big((b_i+w_i\cdot x_{cp})\cdot\delta(x_s=\text{model}_i)\big)$$
Here $x_s$ is the Pokémon's species, $n$ is the number of species, and $b_i, w_i$ are the parameters of the model for species $i$.
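The combined model can be sketched with a dictionary of per-species parameters and an indicator function. Only the term whose species matches the input contributes to the sum. The species names and parameter values below are hypothetical and serve only to show the mechanics.

```python
from typing import Dict, Tuple


def delta(x_s: str, species: str) -> float:
    """Indicator: 1 if the Pokémon's species equals `species`, otherwise 0."""
    return 1.0 if x_s == species else 0.0


def predict(x_cp: float, x_s: str, params: Dict[str, Tuple[float, float]]) -> float:
    """y = sum_i (b_i + w_i * x_cp) * delta(x_s = species_i)."""
    return sum(
        (b_i + w_i * x_cp) * delta(x_s, species)
        for species, (b_i, w_i) in params.items()
    )


# Hypothetical (b_i, w_i) pairs for two species.
params = {"Pidgey": (10.0, 1.5), "Weedle": (5.0, 2.0)}

y_pidgey = predict(100.0, "Pidgey", params)  # 10 + 1.5 * 100
y_weedle = predict(100.0, "Weedle", params)  # 5 + 2.0 * 100
print(y_pidgey, y_weedle)
```

Written this way, the combined model is still linear in the parameters $b_i, w_i$, so the same loss and gradient descent procedure apply. A species missing from `params` gets a prediction of 0, which a real implementation would want to handle explicitly.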
3. Regularization
Regularization adds a term that depends on the weights $w$ to the loss function. Minimizing the new loss then also favors small weights, and smaller weights make the function smoother, that is, less sensitive to changes in the input.
Original loss:
$$L(w,b)=\sum_{i=1}^{n}\big(\hat{y}^i-(b+w\cdot x_{cp}^i)\big)^2$$
Regularized loss, written here as $L_{\text{reg}}$:
$$L_{\text{reg}}(w,b)=L(w,b)+\lambda\sum_{j}(w_j)^2$$
The functions with smaller $w_j$ are better
$$L_{\text{reg}}(w,b)=\sum_{i=1}^{n}\big(\hat{y}^i-(b+w\cdot x_{cp}^i)\big)^2+\lambda\sum_{j}(w_j)^2$$
The sum over $j$ runs over all weights of the model (for $y=b+w\cdot x_{cp}$ this is just $w$). The bias $b$ is not included.
The value of $\lambda$ must be chosen with care: a larger $\lambda$ puts more emphasis on smoothness and less on fitting the training data.
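To see how $\lambda$ affects the fitted weight, the sketch below minimizes the regularized loss for the single-feature model $y = b + w\cdot x_{cp}$. Instead of gradient descent it uses the closed-form minimizer for this special case, $w = S_{xy} / (S_{xx} + \lambda)$ and $b = \bar{y} - w\bar{x}$, which follows from setting both partial derivatives to zero. The data and function names are assumptions of this example, and the bias is left unpenalized.

```python
from typing import Sequence, Tuple


def regularized_loss(
    w: float, b: float, xs: Sequence[float], ys: Sequence[float], lam: float
) -> float:
    """Squared-error loss plus lam * w^2 (the bias b is not penalized)."""
    data_term = sum((y - (b + w * x)) ** 2 for x, y in zip(xs, ys))
    return data_term + lam * w ** 2


def fit_regularized(
    xs: Sequence[float], ys: Sequence[float], lam: float
) -> Tuple[float, float]:
    """Closed-form minimizer of regularized_loss for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / (s_xx + lam)
    b = y_mean - w * x_mean
    return w, b


# Made-up data on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

fits = {lam: fit_regularized(xs, ys, lam) for lam in (0.0, 5.0, 100.0)}
for lam, (w, b) in fits.items():
    print(f"lambda={lam}: w={w:.4f}, b={b:.4f}")
```

With $\lambda = 0$ the fit recovers the unregularized line, and as $\lambda$ grows the weight shrinks toward zero, giving a flatter function that fits the training data less closely.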
ML Lecture 1 Regression - Demo