2021-10-28

Hung-yi Lee Machine Learning Notes

1.2 Introduction to the Basic Concepts of Deep Learning

Step1: Function with unknown parameters

$y = b + c^T \sigma(\mathbf{b} + Wx)$ (the outer $b$ is a scalar bias, the inner $\mathbf{b}$ is a bias vector)

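Concatenate all of these parameters into one long vector $\theta$. Instead of a brute-force search over $\theta$, we use gradient descent. The more sigmoids we use, the more complex the function we can approximate.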
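
A minimal sketch of the Step 1 model in plain Python, assuming a single input vector and small hand-picked parameter values. The names `model`, `flatten` and `b_vec` are illustrative and not from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, W, b_vec, c, b):
    """Compute y = b + c^T sigmoid(b_vec + W x) for one input vector x."""
    hidden = []
    for w_row, b_i in zip(W, b_vec):
        r = b_i + sum(w * x_j for w, x_j in zip(w_row, x))
        hidden.append(sigmoid(r))
    return b + sum(c_i * a_i for c_i, a_i in zip(c, hidden))

def flatten(W, b_vec, c, b):
    """Concatenate every parameter into one long list that plays the role of theta."""
    return [w for row in W for w in row] + list(b_vec) + list(c) + [b]

# Assumed toy values: 3 sigmoids, 2 input features.
W = [[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]
b_vec = [0.0, 0.0, 0.0]
c = [1.0, 2.0, -1.0]
b = 0.5

theta = flatten(W, b_vec, c, b)
y = model([0.0, 0.0], W, b_vec, c, b)
```

Adding one more sigmoid means one more row in `W` and one more entry in `b_vec` and `c`, so `theta` grows accordingly.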

Step2: Define Loss from Training Data

Loss is a function of the parameters: $L(\theta)$

Step3: Optimization of New Model

$\theta^* = \arg\min_{\theta} L$, i.e. the value of $\theta$ at which $L$ is smallest

Pick initial values $\theta^0$

Compute the gradient $g = \nabla L(\theta^0)$

Update $\theta^n \leftarrow \theta^{n-1} - \eta g$, where $g$ is the gradient at $\theta^{n-1}$ and $\eta$ is the learning rate
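
The three steps above can be sketched as a short loop. The quadratic `loss` is an assumed toy function chosen so the minimum is known, and `eta` and `n_updates` are arbitrary illustrative values.

```python
def loss(theta):
    # Assumed toy loss with its minimum at theta = (3, -1).
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient of the toy loss above.
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

def gradient_descent(theta0, eta=0.1, n_updates=200):
    theta = list(theta0)                 # pick initial values
    for _ in range(n_updates):
        g = grad(theta)                  # compute the gradient at the current theta
        theta = [t - eta * g_i for t, g_i in zip(theta, g)]  # update
    return theta

theta_star = gradient_descent([0.0, 0.0])
```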

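Split the training data into batches and compute each update from the data in one batch only. 1 epoch = see all the batches once; each change of the parameters is called one update.

Example: N = 10000 examples, batch size B = 10 ⇒ 1000 batches. How many updates in 1 epoch? 1000 updates.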
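
The arithmetic in this example can be checked with a small sketch. `make_batches` is a hypothetical helper that splits example indices into batches, optionally after shuffling.

```python
import random

def make_batches(n_examples, batch_size, shuffle=True, seed=0):
    """Split the indices 0..n_examples-1 into consecutive batches."""
    indices = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)   # reorder the examples before batching
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]

batches = make_batches(n_examples=10000, batch_size=10)
updates_per_epoch = len(batches)   # one update per batch
```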

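(The batch size B, the number of sigmoids and the learning rate are hyperparameters.) Sigmoid can be replaced by ReLU; both are activation functions.

Neural networks: VGG-19, GoogLeNet. Doing better on the training data but worse on data never seen in training is overfitting.

Why make the network deeper instead of wider?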

gradient descent - back propagation

network parameters $\theta = \{w_1, w_2, \ldots, b_1, b_2, \ldots\}$

starting parameters $\theta^0 \rightarrow \theta^1$

With millions of parameters, the gradient is a vector with millions of dimensions. To compute the gradients efficiently, we use back propagation.

Chain Rule
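
As a minimal sketch of how the chain rule gives the gradients, consider an assumed one-neuron network $z = wx + b$, $a = \sigma(z)$, $L = (a - y)^2$. The function names and numbers below are illustrative only; the analytic gradient is compared with a finite-difference estimate.

```python
import math

def forward(w, b, x, y):
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
    loss = (a - y) ** 2
    return z, a, loss

def backward(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and likewise for b."""
    _, a, _ = forward(w, b, x, y)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)            # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz          # dz/dw = x, dz/db = 1

w, b, x, y = 0.5, -0.3, 2.0, 1.0
dw, db = backward(w, b, x, y)

# Finite-difference check of dL/dw.
eps = 1e-6
dw_numeric = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
```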

PyTorch

Tensor: a high-dimensional matrix (array)

Data Type - torch.float (tensor type torch.FloatTensor), torch.long (tensor type torch.LongTensor)

Shape of Tensors - dim 0, dim 1, dim 2

Constructor - from a list: x = torch.tensor([[1, -1], [-1, 1]]); from NumPy: x = torch.from_numpy(np.array([[1, -1], [-1, 1]]))

Zero tensor - x = torch.zeros([2, 2]); unit tensor - x = torch.ones([1, 2, 5])

Operators

  • Squeeze: remove the specified dimension when its length is 1

x = torch.zeros([1, 2, 3]) gives x.shape == torch.Size([1, 2, 3]); after x = x.squeeze(0), x.shape == torch.Size([2, 3])

  • Unsqueeze: insert a new dimension of length 1

x = x.unsqueeze(1) (dim=1) gives x.shape == torch.Size([2, 1, 3])

  • Transpose: swap two dimensions, e.g. x = x.transpose(0, 1)

  • Cat: concatenate tensors along a dimension, e.g. torch.cat([x, y, z], dim=1)

Device - x = x.to('cuda') moves a tensor to the GPU; torch.cuda.is_available() checks whether a GPU can be used; multiple GPUs are named 'cuda:0', 'cuda:1', ...

Developer: Facebook AI. Interface: Python & C++. PyTorch is popular in research.

Load Data

Define Neural Network, Loss Function, Optimizer (torch.nn, torch.optim)

Training, Validation, Testing

Overfitting - ways to reduce it:

  • data augmentation (more training data)

  • constrained model - fewer parameters, sharing parameters

Possible reasons why neural network training fails

Optimization fails because of a critical point (the gradient is zero)

saddle point - can escape // local minima - no way to go

Taylor series approximation, Hessian
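
For reference, the standard second-order Taylor approximation of the loss around a point $\theta'$ is written out below. The symbols $\theta'$ and $v$ are notation assumed here, with $g$ the gradient and $H$ the Hessian at $\theta'$:

$$L(\theta) \approx L(\theta') + (\theta - \theta')^T g + \frac{1}{2} (\theta - \theta')^T H (\theta - \theta'), \qquad H_{ij} = \frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j}\bigg|_{\theta = \theta'}$$

At a critical point $g = 0$, so with $v = \theta - \theta'$:

$$L(\theta) \approx L(\theta') + \frac{1}{2} v^T H v$$

The sign of $v^T H v$ therefore decides whether the loss rises or falls when moving away from $\theta'$ in direction $v$.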


For all $v$, $v^T H v > 0$ - local minima = $H$ is positive definite = all eigenvalues are positive

For all $v$, $v^T H v < 0$ - local maxima = $H$ is negative definite = all eigenvalues are negative

Sometimes $v^T H v > 0$, sometimes $v^T H v < 0$ - saddle point

Compute the eigenvalues of the Hessian to tell whether a critical point is a saddle point or a local minimum.
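
A small sketch of this eigenvalue test for a two-parameter model, using the closed-form eigenvalues of a symmetric 2x2 matrix. The helper names and the tolerance `tol` are assumptions for illustration. The example Hessian belongs to the toy loss $L(w_1, w_2) = (1 - w_1 w_2)^2$ at the origin, where the gradient is zero.

```python
import math

def eigenvalues_2x2_symmetric(h11, h12, h22):
    """Eigenvalues of [[h11, h12], [h12, h22]], smallest first."""
    mean = (h11 + h22) / 2.0
    radius = math.sqrt(((h11 - h22) / 2.0) ** 2 + h12 ** 2)
    return mean - radius, mean + radius

def classify_critical_point(h11, h12, h22, tol=1e-12):
    lo, hi = eigenvalues_2x2_symmetric(h11, h12, h22)
    if lo > tol:
        return "local minimum"     # all eigenvalues positive
    if hi < -tol:
        return "local maximum"     # all eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle point"      # mixed signs
    return "inconclusive"          # some eigenvalue is (numerically) zero

# Hessian of (1 - w1 * w2)^2 at (0, 0) is [[0, -2], [-2, 0]].
kind = classify_critical_point(0.0, -2.0, 0.0)
```

At a saddle point, the eigenvector of a negative eigenvalue gives a direction along which the loss decreases, which is why a saddle point can be escaped.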

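Something that looks like a dead end in three-dimensional space is not actually closed off in higher dimensions. (from "The Three-Body Problem")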

Batch

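Going through the data in one batch, computing the gradient and updating the parameters is one update. Going through all the batches once is one epoch.

shuffle - the data is reshuffled before every epoch, so the batches differ from epoch to epoch. Use shuffle=True for training and shuffle=False for testing.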

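large batch vs. small batch - see all examples before updating the parameters vs. see only one example, then update the parameters

Large batch: long cooldown (like a skill cooldown in a game) but powerful || small batch: short cooldown but noisy

A larger batch size does not necessarily require more time to compute the gradient, because a GPU (e.g. Tesla V100) processes the examples in parallel, unless the batch size is very large

A noisy gradient actually helps training. A larger batch size tends to give worse training and testing performance, which may be an optimization issue: when optimization gets stuck on one batch, another batch may not be stuck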


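Batch size is a hyperparameter.

A large batch size is faster per epoch but gives worse results; there is ongoing discussion of how to get the best of both.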

Momentum

Vanilla gradient descent: compute the gradient and move the parameters in the opposite direction of the gradient

With momentum, the movement is not just the negative gradient: it is the negative gradient plus the movement of the previous step

starting at $\theta^0$, movement $m^0 = 0$

compute gradient $g^0$, movement $m^1 = \lambda m^0 - \eta g^0$

move to $\theta^1 = \theta^0 + m^1$

compute gradient $g^1$, movement $m^2 = \lambda m^1 - \eta g^1$

move to $\theta^2 = \theta^1 + m^2$
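
The momentum recursion above can be sketched for a single parameter as follows. The toy loss $L(\theta) = \theta^2$ and the values of `eta`, `lam` and `n_updates` are assumptions chosen for illustration.

```python
def grad(theta):
    # Gradient of the assumed toy loss L(theta) = theta ** 2.
    return 2.0 * theta

def momentum_descent(theta0, eta=0.1, lam=0.9, n_updates=300):
    theta, m = theta0, 0.0            # start at theta^0 with movement m^0 = 0
    for _ in range(n_updates):
        g = grad(theta)               # compute gradient g^t
        m = lam * m - eta * g         # m^{t+1} = lambda * m^t - eta * g^t
        theta = theta + m             # theta^{t+1} = theta^t + m^{t+1}
    return theta

theta_after_one = momentum_descent(5.0, n_updates=1)
theta_final = momentum_descent(5.0)
```

Because each movement carries part of the previous one, the parameter can keep moving even where the gradient itself is close to zero.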

Error surface is rugged… Adaptive learning rate

Training stuck $\neq$ small gradient. When the loss stops decreasing, the cause may not be a local minimum or a saddle point; most training gets stuck well before it reaches a critical point.

The size of the gradient differs from one dimension to another, and also from one point in training to another.

Different parameters need different learning rates.

Formulation for one parameter: in $\theta_i^{t+1} \leftarrow \theta_i^t - \eta g_i^t$, replace $\eta$ with a rate that depends on parameter $i$: $\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$

$\sigma_i^0 = |g_i^0|$
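
A minimal sketch of one common choice of $\sigma_i^t$, assuming the root-mean-square form usually presented as Adagrad: $\sigma_i^t = \sqrt{\frac{1}{t+1} \sum_{k=0}^{t} (g_i^k)^2}$. This formula is an assumption not spelled out in the text above, and `eps` is a small constant added here only to avoid division by zero.

```python
import math

def adagrad_sigmas(grads):
    """sigma^t = sqrt(mean of the squared gradients g^0..g^t) for one parameter."""
    sigmas, sum_sq = [], 0.0
    for t, g in enumerate(grads):
        sum_sq += g * g
        sigmas.append(math.sqrt(sum_sq / (t + 1)))
    return sigmas

def adagrad_step(theta, g, sigma, eta=0.1, eps=1e-8):
    """theta^{t+1} = theta^t - eta / sigma^t * g^t (eps is an added safeguard)."""
    return theta - eta / (sigma + eps) * g

sigmas = adagrad_sigmas([3.0, 4.0])
theta_next = adagrad_step(1.0, 3.0, sigmas[0])
```

With this choice, the first entry equals $|g^0|$, and a parameter with consistently large gradients gets a smaller effective learning rate.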


Even for the same parameter, the right learning rate changes over time: the learning rate adapts dynamically.

RMSProp (it has no paper)
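
A sketch of the RMSProp running average, assuming the usual form $\sigma_i^t = \sqrt{\alpha (\sigma_i^{t-1})^2 + (1 - \alpha)(g_i^t)^2}$ with $\sigma_i^0 = |g_i^0|$. The value of `alpha` is an arbitrary illustrative choice.

```python
import math

def rmsprop_sigmas(grads, alpha=0.9):
    """Exponentially weighted root mean square of the gradients of one parameter."""
    sigmas = []
    for t, g in enumerate(grads):
        if t == 0:
            sigma = abs(g)                       # sigma^0 = |g^0|
        else:
            sigma = math.sqrt(alpha * sigmas[-1] ** 2 + (1.0 - alpha) * g * g)
        sigmas.append(sigma)
    return sigmas

# A sudden large gradient raises sigma, which lowers the step size eta / sigma.
sigmas = rmsprop_sigmas([1.0, 1.0, 10.0])
```

Unlike a plain average over all past gradients, `alpha` controls how quickly old gradients are forgotten, so the step size can react to a recent change in the gradient.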


Adam: RMSProp + Momentum

Learning Rate Scheduling - Learning Rate Decay - As the training goes, we are closer to the destination, so we reduce the learning rate.

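Warm Up - the learning rate first increases and then decreases. It is something of a black art (there is no full explanation of why it helps); see RAdam.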
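
To make the shape concrete, here is a hypothetical schedule with a linear warm up followed by a linear decay. The function name, the linear shape and all constants are assumptions for illustration; this is not the schedule of RAdam or of any particular paper.

```python
def warmup_then_decay(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Learning rate at a given step: rise linearly to peak_lr, then fall linearly."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # warm up phase
    remaining = total_steps - step
    return peak_lr * max(remaining, 0) / (total_steps - warmup_steps)  # decay phase

schedule = [warmup_then_decay(s) for s in range(1000)]
```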

Summary of Optimization

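momentum (considers both the magnitude and the direction of past gradients), $\sigma$ (root mean square, considers only the magnitude), learning rate scheduling (Warm Up)

"If the mountain will not turn, the road will" (a Chinese proverb: when you cannot change the obstacle, change your route)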

Classification - Softmax, cross-entropy

class as one-hot vector

$y'_i = \mathrm{softmax}(y)_i = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$. $y$ can have any value - softmax makes all values between 0 and 1 (it normalizes them to sum to 1 and widens the gap between large and small values)

cross-entropy - minimizing cross-entropy is equivalent to maximizing likelihood.

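In PyTorch, the cross-entropy loss (nn.CrossEntropyLoss) applies softmax internally, so the network should output raw scores without its own softmax layer.

With mean square error, training can get stuck, which is why cross-entropy is preferred for classification.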
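
A minimal sketch of softmax followed by cross-entropy in plain Python, for one example with a one-hot target. The input scores are made-up values, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(y):
    shift = max(y)                                # for numerical stability only
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y_prime):
    """-sum_i y_hat_i * ln(y'_i), where y_hat is the one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(y_hat, y_prime) if t > 0)

scores = [3.0, 1.0, -3.0]                         # assumed raw network outputs
probs = softmax(scores)
loss = cross_entropy([1.0, 0.0, 0.0], probs)      # target is class 0
```

With a one-hot target the loss reduces to the negative log-probability of the correct class, which is why minimizing it is the same as maximizing likelihood.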

Batch Normalization

Training

changing landscape

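When the values in different dimensions differ greatly in scale, the error surface is dominated by the dimensions with large values. We want different dimensions to have values of the same order of magnitude.

Feature normalization: $\tilde{z}^i = \frac{z^i - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the data (reliable estimates when there are many samples, by the law of large numbers)

This makes the examples depend on each other, because $\mu$ and $\sigma$ are computed from all of them. The full dataset is far too large, so normalization only uses the data in one batch; when the batch size is large enough, the data in a batch approximates the whole data distribution.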

Testing

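At test time there may be no batch: examples arrive one at a time and are tested one at a time, so $\mu$ and $\sigma$ cannot be computed. Instead, the $\mu$ and $\sigma$ accumulated during training (as moving averages) are used; PyTorch handles this automatically.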
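
A simplified sketch of the training and testing behaviour for one scalar feature. The class name, the `momentum` and `eps` values and the moving-average update are assumptions for illustration; the learnable scale and shift of a real batch normalization layer are left out, and PyTorch's own implementation differs in details.

```python
import math

class BatchNorm1D:
    """Normalizes one scalar feature and keeps running statistics for test time."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward_train(self, batch):
        # Statistics come from the current batch only.
        mu = sum(batch) / len(batch)
        var = sum((z - mu) ** 2 for z in batch) / len(batch)
        # Moving averages are updated so they can be used at test time.
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return [(z - mu) / math.sqrt(var + self.eps) for z in batch]

    def forward_test(self, z):
        # A single example: use the statistics accumulated during training.
        return (z - self.running_mean) / math.sqrt(self.running_var + self.eps)

bn = BatchNorm1D()
out = bn.forward_train([1.0, 2.0, 3.0, 4.0])
single = bn.forward_test(2.5)
```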

New Optimizers for Deep Learning

SGD

SGD with momentum

Adagrad

RMSProp

Adam
