1 Regression
Regression: the output is a scalar.
What is regression? Any task whose output is a numerical value is regression.
Step 1: Model (Function Set)
A set of functions:
$f_1: y = 10.0 + 9.0 \cdot x_{cp}$
$f_2: y = 9.8 + 9.2 \cdot x_{cp}$
$f_3: y = -0.8 - 1.2 \cdot x_{cp}$
Linear Model
$x_i$: an attribute of the input $x$ (a feature).
$w_i$: weight; $b$: bias.
$y = b + \sum_i w_i x_i$
Step 2: Goodness of Function
$y = b + w \cdot x_{cp}$
$\hat{y}$ denotes the true (target) value.
A superscript indexes a complete example in the data;
a subscript indexes one attribute of that example.
To measure how good a function is, we need a loss function.
Loss function
$L(f) = \sum_{n=1}^{10} (\hat{y}^n - f(x^n_{cp}))^2$
The loss function is a function of functions.
$L(f) \rightarrow L(w, b)$
$L(f) = L(w, b) = \sum_{n=1}^{10} (\hat{y}^n - (b + w \cdot x^n_{cp}))^2$
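A minimal sketch of this loss in Python (the ten data pairs below are fabricated for illustration, not the lecture's actual Pokémon data):

```python
# Sum-of-squared-errors loss for the model y = b + w * x_cp.
x_cp  = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]            # made-up CP values
y_hat = [105, 190, 310, 400, 495, 610, 695, 790, 905, 1000]  # made-up targets

def loss(w, b):
    """L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2"""
    return sum((y - (b + w * x)) ** 2 for x, y in zip(x_cp, y_hat))

print(loss(9.0, 10.0))  # loss of f_1: y = 10.0 + 9.0 * x_cp
print(loss(9.2, 9.8))   # loss of f_2: y = 9.8 + 9.2 * x_cp
```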
Step 3: Best Function
Pick the best function:
$f^* = \arg\min_f L(f)$
$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10} (\hat{y}^n - (b + w \cdot x^n_{cp}))^2$
Step 3: Gradient Descent
One parameter
$w^* = \arg\min_w L(w)$
- Pick an initial value $w_0$.
- Compute $\frac{dL}{dw}\Big|_{w=w_0}$, then update
  $w_1 \leftarrow w_0 - \eta \frac{dL}{dw}\Big|_{w=w_0}$
- Compute $\frac{dL}{dw}\Big|_{w=w_1}$, then update
  $w_2 \leftarrow w_1 - \eta \frac{dL}{dw}\Big|_{w=w_1}$
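As a sketch, the one-parameter update loop looks like this in Python (data, learning rate, and step count are all invented for illustration):

```python
# Gradient descent on a single parameter w for L(w) = sum_n (y^n - w * x^n)^2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, fabricated

eta = 0.01   # learning rate
w = 0.0      # initial value w_0

for step in range(100):
    grad = sum(2 * (w * xn - yn) * xn for xn, yn in zip(x, y))  # dL/dw
    w = w - eta * grad   # w_{t+1} <- w_t - eta * dL/dw
print(w)  # converges near 2.0
```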
Two parameters
$w^*, b^* = \arg\min_{w,b} L(w, b)$
- Pick initial values $w_0$ and $b_0$.
- Compute the partial derivatives (review how to take partial derivatives from calculus)
  $\frac{\partial L}{\partial w}\Big|_{w=w_0, b=b_0}$, $\frac{\partial L}{\partial b}\Big|_{w=w_0, b=b_0}$
  $w_1 \leftarrow w_0 - \eta \frac{\partial L}{\partial w}\Big|_{w=w_0, b=b_0}$
  $b_1 \leftarrow b_0 - \eta \frac{\partial L}{\partial b}\Big|_{w=w_0, b=b_0}$
- Compute
  $\frac{\partial L}{\partial w}\Big|_{w=w_1, b=b_1}$, $\frac{\partial L}{\partial b}\Big|_{w=w_1, b=b_1}$
  $w_2 \leftarrow w_1 - \eta \frac{\partial L}{\partial w}\Big|_{w=w_1, b=b_1}$
  $b_2 \leftarrow b_1 - \eta \frac{\partial L}{\partial b}\Big|_{w=w_1, b=b_1}$
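The same loop with both parameters updated simultaneously, again a toy sketch with invented data:

```python
# Gradient descent on (w, b) for L(w, b) = sum_n (y_hat^n - (b + w * x_cp^n))^2.
x_cp  = [0.1, 0.3, 0.5, 0.7, 0.9]   # fabricated inputs
y_hat = [1.2, 1.7, 2.1, 2.4, 3.0]   # fabricated targets

eta = 0.1
w, b = 0.0, 0.0   # initial values w_0, b_0

for step in range(1000):
    err = [(b + w * x) - y for x, y in zip(x_cp, y_hat)]   # prediction - target
    grad_w = sum(2 * e * x for e, x in zip(err, x_cp))     # dL/dw
    grad_b = sum(2 * e for e in err)                       # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b              # simultaneous update
print(w, b)
```

Note that both gradients are computed from the old $(w, b)$ before either parameter is updated.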
Problems
Gradient descent is not guaranteed to reach the global minimum; it can:
- get stuck at a local minimum
- get stuck at a saddle point
- be very slow on a plateau
The loss function of linear regression is convex, so there is no need to worry about local minima.
Learning Rate $\eta$
The learning rate $\eta$ controls the step size and thus how fast learning proceeds.
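A toy illustration (all numbers invented) on the quadratic loss $L(w) = (w - 2)^2$: a small $\eta$ converges slowly, while an overly large $\eta$ overshoots and diverges.

```python
# Effect of the learning rate eta on gradient descent for L(w) = (w - 2)^2.
def descend(eta, steps=20, w=0.0):
    for _ in range(steps):
        w = w - eta * 2 * (w - 2)   # dL/dw = 2 * (w - 2)
    return w

print(descend(0.01))  # too small: still far from the minimum at w = 2
print(descend(0.5))   # well chosen: lands on the minimum immediately
print(descend(1.1))   # too large: each step overshoots; w diverges
```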
Another Linear Model
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3$
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4$
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4 + w_5 \cdot (x_{cp})^5$
Whether a model is "linear" refers to whether its output is linear in its parameters, not in its input.
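A minimal numpy sketch of this point (toy data, degree 2 chosen arbitrarily): fitting $y = b + w_1 x + w_2 x^2$ is still ordinary linear least squares over the feature vector $(1, x, x^2)$.

```python
import numpy as np

# Toy data from a noisy quadratic: nonlinear in x, linear in the parameters.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.05, size=x.shape)

# Feature matrix [1, x, x^2]; the parameters (b, w1, w2) enter linearly.
X = np.stack([np.ones_like(x), x, x**2], axis=1)
params, *_ = np.linalg.lstsq(X, y, rcond=None)
print(params)  # approximately [1.0, 2.0, 3.0]
```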
- A more complex model yields lower error on training data, provided we can truly find the best function within it.
Model Selection
model (degree) | Training error | Testing error |
---|---|---|
1 | 31.9 | 35.0 |
2 | 15.4 | 18.4 |
3 | 15.3 | 18.1 |
4 | 14.9 | 28.2 |
5 | 12.8 | 232.1 |
- A more complex model does not always lead to better performance on testing data.
- This is overfitting.
A complex model's model space contains that of a simpler model, so it achieves a lower error on the training data; but this does not imply a lower error on the testing data. When the model is too complex, overfitting occurs.
What are the hidden factors?
Consider the effect of the Pokémon's species on its CP value.
Back to step 1: Redesign the Model
$\text{if } x_s = \text{Pidgey}: y = b_1 + w_1 \cdot x_{cp}$
$\text{if } x_s = \text{Weedle}: y = b_2 + w_2 \cdot x_{cp}$
$\text{if } x_s = \text{Caterpie}: y = b_3 + w_3 \cdot x_{cp}$
$\text{if } x_s = \text{Eevee}: y = b_4 + w_4 \cdot x_{cp}$
$\downarrow$
$y = b_1 \cdot \delta(x_s = \text{Pidgey}) + w_1 \cdot \delta(x_s = \text{Pidgey}) \cdot x_{cp}$
$\quad + b_2 \cdot \delta(x_s = \text{Weedle}) + w_2 \cdot \delta(x_s = \text{Weedle}) \cdot x_{cp}$
$\quad + b_3 \cdot \delta(x_s = \text{Caterpie}) + w_3 \cdot \delta(x_s = \text{Caterpie}) \cdot x_{cp}$
$\quad + b_4 \cdot \delta(x_s = \text{Eevee}) + w_4 \cdot \delta(x_s = \text{Eevee}) \cdot x_{cp}$
Here $\delta(\cdot)$ equals 1 when its condition holds and 0 otherwise, so the combined expression is still a linear model.
Training error = 3.8, testing error = 14.3.
This model performs better on the testing data.
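A sketch of this species-conditional model using indicator (one-hot) features; the species list matches the equations above, but the data points are fabricated:

```python
import numpy as np

SPECIES = ["Pidgey", "Weedle", "Caterpie", "Eevee"]

def features(x_cp, x_s):
    """delta(x_s = s) and delta(x_s = s) * x_cp for each species s."""
    d = [1.0 if x_s == s else 0.0 for s in SPECIES]
    return np.array(d + [di * x_cp for di in d])   # [b_1..b_4, w_1..w_4] slots

# Fabricated training examples: (x_cp, species, y_hat).
data = [(10, "Pidgey", 105), (23, "Weedle", 240), (7, "Caterpie", 63),
        (31, "Eevee", 350), (14, "Pidgey", 145), (19, "Weedle", 200),
        (12, "Caterpie", 110), (25, "Eevee", 280)]

X = np.stack([features(x, s) for x, s, _ in data])
y = np.array([t for _, _, t in data])
params, *_ = np.linalg.lstsq(X, y, rcond=None)   # recovers b_1..b_4, w_1..w_4
print(params)
```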
Are there any other hidden factors?
The effect of HP, weight, and height on the CP value.
Back to step 1: Redesign the Model Again
$\text{if } x_s = \text{Pidgey}: y' = b_1 + w_1 \cdot x_{cp} + w_5 \cdot (x_{cp})^2$
$\text{if } x_s = \text{Weedle}: y' = b_2 + w_2 \cdot x_{cp} + w_6 \cdot (x_{cp})^2$
$\text{if } x_s = \text{Caterpie}: y' = b_3 + w_3 \cdot x_{cp} + w_7 \cdot (x_{cp})^2$
$\text{if } x_s = \text{Eevee}: y' = b_4 + w_4 \cdot x_{cp} + w_8 \cdot (x_{cp})^2$
$\downarrow$
$y = y' + w_9 \cdot x_{hp} + w_{10} \cdot (x_{hp})^2 + w_{11} \cdot x_h + w_{12} \cdot (x_h)^2 + w_{13} \cdot x_w + w_{14} \cdot (x_w)^2$
Training error = 1.9, testing error = 102.3. Overfitting!
If we take all these other attributes into account at once and choose a very complex model, the result is overfitting.
Back to step 2: Regularization
A method that is generally useful across many different testing scenarios: regularization.
$L(f) = L(w, b) = \sum_{n} (\hat{y}^n - (b + \sum_i w_i \cdot x_i))^2 + \lambda \sum_i (w_i)^2$
Minimizing this also pushes the weights $w_i$ to be small, which means the function is smoother.
$y = b + \sum_i w_i x_i$
$y + \sum_i w_i \Delta x_i = b + \sum_i w_i (x_i + \Delta x_i)$
If the $w_i$ are small, a change $\Delta x_i$ in the input causes only a small change in the output, i.e., the function is smooth and less sensitive to input noise.
$\lambda$ | Training error | Testing error |
---|---|---|
0 | 1.9 | 102.3 |
1 | 2.3 | 68.7 |
10 | 3.5 | 25.7 |
100 | 4.1 | 11.1 |
1000 | 5.6 | 12.8 |
10000 | 6.3 | 18.7 |
100000 | 8.5 | 26.8 |
As $\lambda$ increases, we obtain a smoother function. The larger $\lambda$ is, the less weight is placed on the training error. Tune $\lambda$ and choose the value that minimizes the testing error, as in the sketch below.
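A minimal sketch of this tuning loop (fabricated data and hyperparameters; the bias $b$ is not penalized, consistent with the smoothness argument above):

```python
import numpy as np

# Fabricated 1-feature data split into training and testing sets.
rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 20); y_tr = 1.0 + 3.0 * x_tr + rng.normal(0, 0.3, 20)
x_te = rng.uniform(0, 1, 20); y_te = 1.0 + 3.0 * x_te + rng.normal(0, 0.3, 20)

def train(lam, eta=0.005, steps=5000):
    """Gradient descent on L(w, b) = sum (y - (b + w*x))^2 + lam * w^2."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = (b + w * x_tr) - y_tr
        grad_w = 2 * np.sum(err * x_tr) + 2 * lam * w   # penalty on w only
        grad_b = 2 * np.sum(err)                        # b is not regularized
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

for lam in [0, 1, 10, 100]:
    w, b = train(lam)
    test_err = np.sum((y_te - (b + w * x_te)) ** 2)
    print(f"lambda={lam}: w={w:.3f}, testing error={test_err:.2f}")
```

Larger $\lambda$ shrinks $w$ toward zero, which smooths the function but eventually hurts the fit, mirroring the table above.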