ML _10.5_P4-7_ Hung-yi Lee (李宏毅) Lecture Notes

weixin_43550040

Category: ML. Tags: machine learning

Article link: https://blog.csdn.net/weixin_43550040/article/details/108933658

2. Error

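Error = Bias + Variance
$f^*$ is the estimator (the estimated function); averaging it over many training runs gives $\bar{f} = E[f^*]$. $\hat{f}$ is the real, perfect function, which cannot be obtained by computation.
The mean of $x$ is $\mu$. The variance of $x$ is $\sigma^2$.
Bias: if we average all the $f^*$, is the average $\bar{f}$ close to $\hat{f}$? Here $f^*$ is the best function (model) found in each training run (each run uses many data samples), and $\hat{f}$ is the true function (model).
Variance: how widely the individual $f^*$ scatter around $\bar{f}$. From probability theory (the law of large numbers), $m = \frac{1}{N}\sum_{n} x^n \neq \mu$ in general, but $E[m] = \mu$. That is, the mean computed from the samples is $m$, while the expected value of that sample mean is $\mu$.
Note: as the number of samples $N$ increases, the computed $m$ gets closer to its expected value $\mu$, because $\mathrm{Var}[m] = \sigma^2 / N$. Likewise, $s^2 = \frac{1}{N}\sum_{n}(x^n - m)^2$ serves as the estimator of $\sigma^2$.
Note: $E[s^2] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$, so $s^2$ is a biased estimator, but as the number of samples increases $E[s^2]$ approaches $\sigma^2$.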
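The behaviour of the sample mean and the divide-by-N sample variance can be checked with a small simulation. This is only a sketch: the Gaussian distribution, the values of `mu`, `sigma`, `n` and `trials`, and the function name `simulate` are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate(mu=0.0, sigma=2.0, n=5, trials=20000, seed=0):
    """Draw `trials` samples of size n and summarise the estimators m and s^2."""
    rng = random.Random(seed)
    means, biased_vars = [], []
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(xs) / n                           # sample mean
        s2 = sum((x - m) ** 2 for x in xs) / n    # divides by N, not N - 1
        means.append(m)
        biased_vars.append(s2)
    # average of m, variance of m across trials, average of s^2
    return (statistics.fmean(means),
            statistics.pvariance(means),
            statistics.fmean(biased_vars))

avg_m, var_m, avg_s2 = simulate()
print("average of m :", avg_m)    # should sit near mu
print("variance of m:", var_m)    # should sit near sigma^2 / N
print("average of s2:", avg_s2)   # should sit near (N - 1) / N * sigma^2
```

With these assumed settings the three targets are 0, 0.8 and 3.2; raising `n` shrinks both the variance of `m` and the gap between the average `s2` and `sigma ** 2`.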
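Figure: the relationship between bias and variance.
Simple models have larger bias and smaller variance, while complex models have smaller bias and larger variance.
This is because a more complex model (one with higher-order terms) has a larger function space containing more functions, so the functions it learns are spread over a wider range. At the same time, the larger function space offers more functions close to the true function, so the average of the learned functions lands nearer to the true function and the bias is smaller.
If the error comes mostly from large variance, the current model is overfitting; if it comes mostly from large bias, the current model is underfitting.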
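The trade-off can be illustrated by training a simple and a more flexible model on many independent training sets and comparing their predictions at one point. This is a sketch under stated assumptions: the true function `true_f`, the noise level, the query point `x0` and all function names are hypothetical, and the "complex" model here is just a straight line compared against a constant.

```python
import random

def true_f(x):
    return 2.0 * x + 1.0  # assumed true function, for illustration only

def fit_constant(xs, ys):
    c = sum(ys) / len(ys)            # simple model: always predict the mean
    return lambda x: c

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - w * mx                  # closed-form least squares line
    return lambda x: w * x + b

def bias_variance(fit, x0=0.9, runs=2000, n=10, noise=1.0, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(runs):            # each run is one training set
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0, noise) for x in xs]
        preds.append(fit(xs, ys)(x0))
    f_bar = sum(preds) / runs        # average of the learned functions at x0
    bias_sq = (f_bar - true_f(x0)) ** 2
    variance = sum((p - f_bar) ** 2 for p in preds) / runs
    return bias_sq, variance

bias_c, var_c = bias_variance(fit_constant)
bias_l, var_l = bias_variance(fit_linear)
print("constant model: bias^2 =", bias_c, "variance =", var_c)
print("linear model  : bias^2 =", bias_l, "variance =", var_l)
```

In this setup the constant model shows the larger squared bias and the linear model the larger variance, matching the pattern described above.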
For large bias: add more features, or use a more complex model.
For large variance: get more data, or apply regularization (which makes all the fitted curves smoother).
Model selection. Note: a model that does well on the public testing set may not do well on the private testing set. Cross Validation or N-fold Cross Validation is therefore needed.
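N-fold cross validation can be sketched in a few lines: split the training data into N folds, train on N - 1 of them, validate on the remaining one, and average the validation error per candidate model. The helper names (`k_fold_splits`, `cv_error`), the two candidate models and the synthetic data below are assumptions made for this example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def fit_mean(xs, ys):
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return lambda x: w * x + (my - w * mx)

def cv_error(fit, xs, ys, k=3):
    """Average validation mean squared error over k folds."""
    total = 0.0
    for train, val in k_fold_splits(len(xs), k):
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        total += sum((model(xs[j]) - ys[j]) ** 2 for j in val) / len(val)
    return total / k

rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [2 * x + 1 + rng.gauss(0, 0.3) for x in xs]   # synthetic data
errors = {"mean": cv_error(fit_mean, xs, ys), "line": cv_error(fit_line, xs, ys)}
best = min(errors, key=errors.get)
print(errors, "-> selected:", best)
```

The model is chosen using validation error only, so the public testing set is not used for selection.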

3. Gradient Descent

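For example, in a linear function, $w$ and $b$ are both variables among the parameters.
Suppose there are two variables $\theta_1$ and $\theta_2$. Then each gradient descent step is $\theta^{t+1} = \theta^{t} - \eta \nabla L(\theta^{t})$, where $\nabla L(\theta) = \left[\partial L / \partial \theta_1,\ \partial L / \partial \theta_2\right]^{T}$ and $\eta$ is the learning rate.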
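A minimal sketch of gradient descent with two variables, taking them to be the `w` and `b` of a linear function. The mean squared error loss, the tiny dataset, the learning rate `eta` and the step count are assumptions chosen so the example converges quickly.

```python
def loss_grad(w, b, data):
    """Gradient of the mean squared error of y = w * x + b with respect to (w, b)."""
    n = len(data)
    dw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    db = sum(2 * (w * x + b - y) for x, y in data) / n
    return dw, db

def gradient_descent(data, eta=0.1, steps=500):
    w, b = 0.0, 0.0                          # arbitrary starting point
    for _ in range(steps):
        dw, db = loss_grad(w, b, data)
        w, b = w - eta * dw, b - eta * db    # update both variables at once
    return w, b

# Points generated from y = 2x + 1, so the minimum is at w = 2, b = 1.
data = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
w, b = gradient_descent(data)
print(w, b)
```

Both variables are updated from the same gradient evaluation, which mirrors the vector form of the update.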
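Ways to tune the learning rate: 1. Plot the loss as training proceeds. 2. Adaptive learning rate, e.g. the learning rate at step $t$ is $\eta^t = \eta / \sqrt{t+1}$. 3. Adagrad: $w^{t+1} = w^t - \frac{\eta^t}{\sigma^t} g^t$.
$g$ denotes the derivative, $g^t = \partial L(\theta^t) / \partial w$. Combining this with the learning rate from method 2, $\eta^t = \eta / \sqrt{t+1}$ and $\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$.
The update above simplifies to $w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}} g^t$.
Note: the gradient appears twice with opposite effects. In the numerator a larger $g^t$ means a larger step, while in the denominator larger past gradients mean a smaller step. This conflict produces a contrast effect, which helps prevent gradient explosion or vanishing.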
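From the figure (not reproduced here), for a quadratic $y = ax^2 + bx + c$ the best step from $x_0$ is $\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a}$, i.e. $|\text{first derivative}| / \text{second derivative}$.
So Adagrad is in effect approximating this best step. It also saves time compared with computing the second derivative directly, as Newton's method would.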
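The simplified Adagrad update, in which each step divides by the root of the accumulated squared gradients, might look like the sketch below for a single parameter. The quadratic test loss, the function name `adagrad`, the default `eta` and the small `eps` added to avoid division by zero are assumptions for illustration and are not taken from the notes.

```python
import math

def adagrad(grad, w0, eta=1.0, steps=200, eps=1e-8):
    """Single-parameter Adagrad: w <- w - eta / sqrt(sum of g^2) * g."""
    w, sum_sq = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        sum_sq += g * g                           # accumulate all past squared gradients
        w -= eta / (math.sqrt(sum_sq) + eps) * g  # eps is an assumed safeguard
    return w

# Assumed loss (w - 3)^2, whose derivative is 2 * (w - 3); the minimum is at w = 3.
w = adagrad(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w)
```

Only first derivatives are evaluated; the accumulated squared gradients stand in for the second-derivative information.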
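In gradient descent, the loss function covers all examples: it is the sum of the losses over every example (update after seeing all examples). SGD instead picks one example at random, computes the loss for that single example, and then updates the parameters (update for each example).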
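The difference in update schedule can be sketched as follows. The one-parameter model `y = w * x`, the noise-free data, the learning rate and the function names are assumptions; the SGD variant here visits the examples in a random order within each pass, which is one common way to pick examples at random.

```python
import random

def gd_epoch(w, data, eta):
    """One update using the gradient of the loss summed over all examples."""
    grad = sum(2 * (w * x - y) * x for x, y in data)
    return w - eta * grad

def sgd_epoch(w, data, eta, rng):
    """One update per example, visiting the examples in random order."""
    for x, y in rng.sample(data, len(data)):
        w -= eta * 2 * (w * x - y) * x
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]   # generated from y = 3x
rng = random.Random(0)
w_gd, w_sgd = 0.0, 0.0
for _ in range(50):
    w_gd = gd_epoch(w_gd, data, eta=0.01)             # 1 update per pass
    w_sgd = sgd_epoch(w_sgd, data, eta=0.01, rng=rng) # 4 updates per pass
print(w_gd, w_sgd)
```

Over the same 50 passes through the data, gradient descent makes 50 updates while SGD makes 200.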
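Feature Scaling (feature normalization):
Why it is used: it speeds up training.
Method: for each dimension $i$, compute the mean $m_i$ and standard deviation $\sigma_i$ over all examples, then replace $x_i^r$ with $(x_i^r - m_i) / \sigma_i$.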
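One common way to do feature scaling is to standardize each feature dimension to zero mean and unit variance. The sketch below assumes that approach; the function name `standardize` and the sample rows are hypothetical.

```python
import statistics

def standardize(rows):
    """Scale every feature column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))                       # one tuple per feature dimension
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]   # assumes no feature is constant
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]  # two features on very different scales
scaled = standardize(rows)
for row in scaled:
    print(row)
```

After scaling, both features span a similar range, so one learning rate suits both directions of the loss surface.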
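Mathematical basis of gradient descent: if a function has derivatives of every order at $x = x_0$, it can be expanded as a Taylor series around $x_0$, and close to $x_0$ the second-order and higher terms are usually neglected. Seen this way, the Taylor series is well approximated by its first-order terms only when the learning rate is small enough, and only then is each update guaranteed to move in the direction that lowers the loss.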
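The first-order argument can be written out as a short worked derivation for two parameters. The symbols $(a, b)$ for the current point, $u$ and $v$ for the partial derivatives there, and $d$ for the radius of the neighbourhood are notation introduced here for illustration.

$$L(\theta) \approx L(a, b) + u\,(\theta_1 - a) + v\,(\theta_2 - b), \qquad u = \frac{\partial L(a, b)}{\partial \theta_1},\quad v = \frac{\partial L(a, b)}{\partial \theta_2}$$

This approximation holds only inside a small circle $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$. Within that circle the linear expression is smallest when $(\theta_1 - a,\ \theta_2 - b)$ points opposite to $(u, v)$, which gives

$$\begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix} = \begin{bmatrix}a \\ b\end{bmatrix} - \eta \begin{bmatrix}u \\ v\end{bmatrix}$$

This is the gradient descent update. The learning rate $\eta$ must be small enough that the new point stays inside the circle where the approximation is valid.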
Gradient descent limitations (common cases where gradient descent does not work): very slow at a plateau, stuck at a saddle point, stuck at a local minimum.
