Hung-yi Lee Machine Learning Notes
1.2 Introduction to Basic Concepts of Deep Learning
Step 1: Function with unknown parameters
$y = b + c^T \sigma(b + Wx)$
Concatenate all of these parameters into one long vector $\theta$. Instead of brute-force search, use gradient descent to find it; the more sigmoids, the more complex the functions that can be approximated.
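A minimal sketch of this model in PyTorch (the sizes and variable names are illustrative assumptions; the inner bias vector is written b_vec to distinguish it from the scalar output bias):

```python
import torch

# Illustrative sizes: 3 input features, 16 sigmoid units
W = torch.randn(16, 3, requires_grad=True)   # weight matrix
b_vec = torch.randn(16, requires_grad=True)  # per-unit bias vector
c = torch.randn(16, requires_grad=True)      # output weights
b_out = torch.randn(1, requires_grad=True)   # output bias

def f(x):
    # y = b + c^T * sigmoid(b_vec + W x)
    return b_out + c @ torch.sigmoid(b_vec + W @ x)

x = torch.randn(3)
y = f(x)
```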
Step 2: Define Loss from Training Data
Loss is a function of the parameters: $L(\theta)$
Step 3: Optimization of New Model
$\theta^* = \arg\ \underset{\theta}{\min}\ L$ - the value of $\theta$ at which $L$ is minimized
Pick initial values $\theta^0$; compute the gradient $g = \nabla L(\theta^{n-1})$
Update: $\theta^n \leftarrow \theta^{n-1} - \eta g$
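A minimal sketch of this gradient-descent loop in PyTorch (the toy model, data, and learning rate are illustrative assumptions):

```python
import torch

# Toy linear model and data (illustrative)
theta = torch.randn(4, requires_grad=True)
xs = torch.randn(100, 4)
ys = torch.randn(100)

eta = 0.01  # learning rate
for step in range(1000):
    loss = ((xs @ theta - ys) ** 2).mean()   # L(theta)
    loss.backward()                          # g = grad L(theta^{n-1})
    with torch.no_grad():
        theta -= eta * theta.grad            # theta^n <- theta^{n-1} - eta * g
        theta.grad = None
```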
Split the data into batches and train on one batch at a time; 1 epoch = see all the batches once; each parameter update is called one update.
Example: N = 10000, B = 10 → 1000 batches. How many updates in 1 epoch? 1000 updates.
(The batch size B, the number of sigmoids, and the learning rate are hyperparameters.) sigmoid → ReLU (activation function)
Neural networks: VGG-19, GoogLeNet. Performance on training data improves, but performance on unseen data gets worse: overfitting.
Why make the network deeper rather than wider (fatter)?
Gradient descent - backpropagation
Network parameters $\theta = \{w_1, w_2, \dots, b_1, b_2, \dots\}$
Starting parameters $\theta^0 \rightarrow \theta^1 \rightarrow \dots$
Millions of parameters: $\theta$ is a vector with millions of dimensions. To compute the gradients efficiently, we use backpropagation.
Chain Rule
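For reference, the chain rule that backpropagation applies layer by layer (standard form, written for one weight $w$ through an intermediate output $y$; this expansion is an addition, not spelled out in the notes):

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w}$$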
PyTorch
Tensor: high-dimensional matrix (array)
Data type - torch.float, torch.long - torch.FloatTensor, torch.LongTensor
Shape of tensors - dim 0 / dim 1 / dim 2
Constructor - from list: x = torch.tensor([[1, -1], [-1, 1]])
from numpy: x = torch.from_numpy(np.array([[1, -1], [-1, 1]]))
Zero Tensor - x = torch.zeros([2,2])
Unit tensor - x = torch.ones([1, 2, 5])
Operators - Squeeze: remove the specified dimension with length = 1
x = torch.zeros([1, 2, 3]), x.shape == torch.Size([1, 2, 3]); x = x.squeeze(0) → x.shape == torch.Size([2, 3])
- Unsqueeze: expand a new dimension, e.g. x = x.unsqueeze(1) inserts a new dimension at dim 1
- Transpose: transpose (swap) two dimensions
- Cat: concatenate multiple tensors along a dimension
torch.cat([x, y, z], dim=1)
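A quick shape check of these operators (the shapes are illustrative):

```python
import torch

x = torch.zeros([1, 2, 3])
print(x.squeeze(0).shape)        # torch.Size([2, 3])      remove dim 0 (length 1)
print(x.unsqueeze(1).shape)      # torch.Size([1, 1, 2, 3]) insert a new dim at position 1
print(x.transpose(0, 1).shape)   # torch.Size([2, 1, 3])    swap dims 0 and 1

a = torch.zeros([2, 1, 3])
b = torch.zeros([2, 3, 3])
c = torch.zeros([2, 2, 3])
print(torch.cat([a, b, c], dim=1).shape)  # torch.Size([2, 6, 3])
```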
Device
x = x.to('cuda')
Check whether a GPU is available: torch.cuda.is_available()
Multiple GPUs: 'cuda:0', 'cuda:1', …
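A small sketch of the usual device-selection pattern (the tensor and variable names are illustrative):

```python
import torch

# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

x = torch.ones([2, 2])
x = x.to(device)   # move the tensor to the chosen device
# With multiple GPUs, a specific card can be selected, e.g. x.to('cuda:0')
```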
Developer: Facebook AI. Interfaces: Python & C++. PyTorch is widely used for research.
Training procedure:
- Load data
- Define neural network (torch.nn)
- Loss function (torch.nn)
- Optimizer (torch.optim)
- Training, validation, testing
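A minimal training-loop skeleton using torch.nn and torch.optim (the model, data, and hyperparameters are illustrative assumptions, not from the course):

```python
import torch
import torch.nn as nn

# Illustrative model and data
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()                                  # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimizer

xs, ys = torch.randn(64, 10), torch.randn(64, 1)          # one batch of fake data

model.train()
for epoch in range(10):
    optimizer.zero_grad()                 # clear old gradients
    loss = criterion(model(xs), ys)       # compute the loss
    loss.backward()                       # gradients via backpropagation
    optimizer.step()                      # update the parameters
```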
Overfitting - remedies:
- data augmentation (more training data)
- constrained model: fewer parameters, sharing parameters
Possible reasons why a neural network fails to train
Optimization fails because of critical points
Saddle point - possible to escape // local minimum - no way to go
Taylor series approximation, Hessian
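For reference, the second-order Taylor approximation around a point $\theta'$ that makes the Hessian appear (standard form; $g$ is the gradient and $H$ the Hessian at $\theta'$):

$$L(\theta) \approx L(\theta') + (\theta - \theta')^T g + \frac{1}{2}(\theta - \theta')^T H (\theta - \theta')$$

At a critical point the gradient term vanishes, so the sign of $(\theta - \theta')^T H (\theta - \theta')$ determines the local geometry.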
For all $v$: $v^T H v > 0$ - local minimum ⇔ $H$ is positive definite ⇔ all eigenvalues are positive
For all $v$: $v^T H v < 0$ - local maximum ⇔ $H$ is negative definite ⇔ all eigenvalues are negative
Sometimes $v^T H v > 0$, sometimes $v^T H v < 0$ - saddle point
Compute the eigenvalues of $H$ to tell whether a critical point is a saddle point or a local minimum.
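A small sketch of this eigenvalue check on a toy two-parameter loss (the loss function itself is an illustrative assumption):

```python
import torch

# Toy loss with a saddle point at the origin: L(w) = w0^2 - w1^2
def L(w):
    return w[0] ** 2 - w[1] ** 2

w = torch.zeros(2)
H = torch.autograd.functional.hessian(L, w)   # Hessian at the critical point
eigvals = torch.linalg.eigvalsh(H)            # eigenvalues of the symmetric Hessian
print(eigvals)   # tensor([-2., 2.]) -> mixed signs, so this is a saddle point
```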
Something that has no way out in three-dimensional space may not actually be closed off in higher dimensions. (from The Three-Body Problem)
Batch
For each batch: look at the data in that batch, compute the gradient, and update the parameters (one update); going through all batches once is one epoch.
Shuffle - in each epoch the data grouped into the same batch is different (the order is reshuffled); use shuffle=True for training and shuffle=False for testing.
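A minimal DataLoader sketch showing the shuffle flag (the dataset and batch size are illustrative assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randn(100))

train_loader = DataLoader(dataset, batch_size=10, shuffle=True)   # reshuffled every epoch
test_loader = DataLoader(dataset, batch_size=10, shuffle=False)   # fixed order for testing

for x_batch, y_batch in train_loader:   # one pass over all batches = one epoch
    pass                                # each batch would produce one parameter update
```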
Small batch vs. large batch - large batch: see many examples before each update; small batch (in the extreme, batch size 1): see only one example, then update the parameters.
Game analogy: a skill with a long cooldown but powerful (large batch) vs. a short cooldown but noisy (small batch).
A large batch size does not require a longer time to compute the gradient (Tesla V100 GPU - parallel computing).
Noisy gradients actually help training: a larger batch size gives worse training/testing performance, likely an optimization issue. When optimization gets stuck on one batch, it may not be stuck on other batches.
Batch size is a hyperparameter.
With a large batch size training is fast, but the results are worse; there is follow-up work discussing whether we can have the best of both worlds.
Momentum
(Vanilla gradient descent) compute the gradient and move the parameters in the direction opposite to the gradient.
With momentum, the movement is no longer just the opposite of the gradient: it is the (negative) gradient direction plus the movement of the previous step.
Starting at $\theta^0$, movement $m^0 = 0$
Compute gradient $g^0$, movement $m^1 = \lambda m^0 - \eta g^0$
Move to $\theta^1 = \theta^0 + m^1$
Compute gradient $g^1$, movement $m^2 = \lambda m^1 - \eta g^1$
Move to $\theta^2 = \theta^1 + m^2$
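A minimal sketch of these momentum updates on a toy one-parameter loss (the loss, $\lambda$, and $\eta$ values are illustrative assumptions):

```python
import torch

def L(theta):                 # toy 1-D loss (illustrative)
    return (theta - 3) ** 2

theta = torch.tensor(0.0, requires_grad=True)
m = 0.0                       # movement m^0 = 0
lam, eta = 0.9, 0.05          # momentum factor lambda, learning rate eta

for t in range(50):
    loss = L(theta)
    loss.backward()
    with torch.no_grad():
        m = lam * m - eta * theta.grad   # m^{t+1} = lambda * m^t - eta * g^t
        theta += m                       # theta^{t+1} = theta^t + m^{t+1}
        theta.grad = None
```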
Error surface is rugged… Adaptive learning rate
Training stuck $\neq$ small gradient. When the loss gets stuck, it is not necessarily at a local minimum or saddle point; most training gets stuck before even reaching a critical point.
The gradient magnitude differs along different parameter dimensions, and also changes at different points during training.
Different parameters need different learning rates.
Formulation for one parameter: in $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta g_i^{t}$, replace $\eta$ with a version that depends on parameter $i$: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^{t}} g_i^{t}$
$\sigma_i^0 = |g_i^0|$
Even for the same parameter, $\sigma$ changes over time - the learning rate adapts dynamically.
RMSProp (no published paper)
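The notes only record $\sigma_i^0$; for reference, the usual root-mean-square form and the RMSProp form are (standard formulations, reconstructed rather than copied from the notes):

$$\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}(g_i^k)^2} \qquad \text{(root mean square of all past gradients)}$$

$$\sigma_i^t = \sqrt{\alpha\,(\sigma_i^{t-1})^2 + (1-\alpha)\,(g_i^t)^2} \qquad \text{(RMSProp, with recency weight } \alpha\text{)}$$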
Adam: RMSProp + Momentum
Learning Rate Scheduling - learning rate decay - as training goes on, we get closer to the destination, so we reduce the learning rate.
Warm up: the learning rate first increases and then decreases - a "black magic" trick (no explanation is given for why it works). See RAdam.
Summary of Optimization
Momentum (considers the magnitude and direction of past gradients), $\sigma$ (root mean square - considers only the magnitudes), learning rate scheduling (warm up).
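Putting the three pieces together, the summarized update takes the standard form below (reconstructed from the individual pieces above; $\eta^t$ is the scheduled learning rate, $m_i^t$ the momentum, $\sigma_i^t$ the root mean square):

$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^t}{\sigma_i^t}\, m_i^t$$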
"When the mountain doesn't turn, the road turns." (If one path is blocked, there is another way around.)
Classification - Softmax, cross-entropy
class as one-hot vector
$y' = \mathrm{softmax}(y)$ with $y_i' = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$; $y$ can have any values - softmax maps all values into the range 0 to 1 (normalization, and the gaps between different values are amplified).
cross-entropy - minimizing cross-entropy is equivalent to maximizing likelihood.
In PyTorch, the cross-entropy loss applies softmax automatically.
Mean squared error tends to get stuck during training (its error surface has large flat regions when the loss is large).
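A small sketch of cross-entropy in PyTorch; nn.CrossEntropyLoss takes raw logits and applies (log-)softmax internally, so no explicit softmax layer is needed (the shapes here are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw network outputs y for 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])   # class indices (not one-hot vectors)

criterion = nn.CrossEntropyLoss()     # softmax + cross-entropy in one call
loss = criterion(logits, labels)

probs = torch.softmax(logits, dim=1)  # y' = softmax(y), values in (0, 1) summing to 1
```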
Batch Normalization
Training
Changing the landscape: when the values in different dimensions differ by a lot, the error surface is dominated by the dimensions with large values; we want the different dimensions to have values of the same order of magnitude.
Feature normalization: $\tilde{z}^i = \frac{z^i - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation.
But this makes the examples coupled to each other. Normalization is done only over the data within one batch, because the full dataset is far too large; with a reasonably large batch size, the data in one batch can approximate the distribution of the whole dataset.
Testing
At testing time there may be no batch (examples can come in one at a time), so $\mu$ and $\sigma$ cannot be computed. Instead, the $\mu$ and $\sigma$ accumulated during training (moving averages) are used; PyTorch handles this automatically.
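A minimal BatchNorm sketch showing this train/test distinction (the feature size and batch size are illustrative assumptions):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3)

x = torch.randn(10, 3)     # one batch of 10 examples, 3 features

bn.train()
y_train = bn(x)            # normalizes with this batch's mean/std, updates running stats

bn.eval()
x_single = torch.randn(1, 3)
y_test = bn(x_single)      # uses the running mean/std accumulated during training
```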
New Optimizers for Deep Learning
SGD
SGD with momentum
Adagrad
RMSProp
Adam
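All of these optimizers are available in torch.optim (the model and learning rates here are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)
opt_sgdm = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01)
opt_adam = torch.optim.Adam(model.parameters(), lr=0.001)
```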