Hung-yi Lee Machine Learning Notes
1.2 Introduction to Basic Concepts of Deep Learning
Step 1: Function with unknown parameters
$y = b + c^T \sigma(b + Wx)$
Concatenate all of these parameters into one long vector $\theta$. Instead of brute-force search, use gradient descent to find it; the more sigmoids, the more complex the functions that can be approximated.
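A minimal sketch of this model in PyTorch (the sizes and variable names are illustrative assumptions; the inner bias vector is written b_vec to distinguish it from the scalar output bias):

```python
import torch

# Illustrative sizes: 3 input features, 16 sigmoid units
W = torch.randn(16, 3, requires_grad=True)   # weight matrix
b_vec = torch.randn(16, requires_grad=True)  # per-unit bias vector
c = torch.randn(16, requires_grad=True)      # output weights
b_out = torch.randn(1, requires_grad=True)   # output bias

def f(x):
    # y = b + c^T * sigmoid(b_vec + W x)
    return b_out + c @ torch.sigmoid(b_vec + W @ x)

x = torch.randn(3)
y = f(x)
```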
Step 2: Define Loss from Training Data
Loss is a function of the parameters: $L(\theta)$
Step 3: Optimization of New Model
$\theta^* = \arg\ \underset{\theta}{\min}\ L$ - the value of $\theta$ at which $L$ is minimized
Pick initial values $\theta^0$; compute the gradient $g = \nabla L(\theta^{n-1})$
Update: $\theta^n \leftarrow \theta^{n-1} - \eta g$
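A minimal sketch of this gradient-descent loop in PyTorch (the toy model, data, and learning rate are illustrative assumptions):

```python
import torch

# Toy linear model and data (illustrative)
theta = torch.randn(4, requires_grad=True)
xs = torch.randn(100, 4)
ys = torch.randn(100)

eta = 0.01  # learning rate
for step in range(1000):
    loss = ((xs @ theta - ys) ** 2).mean()   # L(theta)
    loss.backward()                          # g = grad L(theta^{n-1})
    with torch.no_grad():
        theta -= eta * theta.grad            # theta^n <- theta^{n-1} - eta * g
        theta.grad = None
```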
Split the data into batches and train on one batch at a time; 1 epoch = see all the batches once; each parameter update is called one update.
Example: N = 10000, B = 10 → 1000 batches. How many updates in 1 epoch? 1000 updates.
(The batch size B, the number of sigmoids, and the learning rate are hyperparameters.) sigmoid → ReLU (activation function)
Neural networks: VGG-19, GoogLeNet. Performance on training data improves, but performance on unseen data gets worse: overfitting.
Why make the network deeper rather than wider (fatter)?
Gradient descent - backpropagation
Network parameters $\theta = \{w_1, w_2, \dots, b_1, b_2, \dots\}$
Starting parameters $\theta^0 \rightarrow \theta^1 \rightarrow \dots$
Millions of parameters: $\theta$ is a vector with millions of dimensions. To compute the gradients efficiently, we use backpropagation.
Chain Rule
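For reference, the chain rule that backpropagation applies layer by layer (standard form, written for one weight $w$ through an intermediate output $y$; this expansion is an addition, not spelled out in the notes):

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w}$$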
PyTorch
Tensor: high-dimensional matrix (array)
Data type - torch.float, torch.long - torch.FloatTensor, torch.LongTensor
Shape of tensors - dim 0 / dim 1 / dim 2
Constructor - from list: x = torch.tensor([[1, -1], [-1, 1]])
from numpy: x = torch.from_numpy(np.array([[1, -1], [-1, 1]]))
Zero Tensor - x = torch.zeros([2,2])
Unit tensor - x = torch.ones([1, 2, 5])
Operators - Squeeze: remove the specified dimension with length = 1
x = torch.zeros([1, 2, 3]), x.shape == torch.Size([1, 2, 3]); x = x.squeeze(0) → x.shape == torch.Size([2, 3])
- Unsqueeze: expand a new dimension, e.g. x = x.unsqueeze(1) inserts a new dimension at dim 1
- Transpose: transpose (swap) two dimensions
- Cat: concatenate multiple tensors along a dimension
torch.cat([x, y, z], dim=1)
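A quick shape check of these operators (the shapes are illustrative):

```python
import torch

x = torch.zeros([1, 2, 3])
print(x.squeeze(0).shape)        # torch.Size([2, 3])      remove dim 0 (length 1)
print(x.unsqueeze(1).shape)      # torch.Size([1, 1, 2, 3]) insert a new dim at position 1
print(x.transpose(0, 1).shape)   # torch.Size([2, 1, 3])    swap dims 0 and 1

a = torch.zeros([2, 1, 3])
b = torch.zeros([2, 3, 3])
c = torch.zeros([2, 2, 3])
print(torch.cat([a, b, c], dim=1).shape)  # torch.Size([2, 6, 3])
```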
Device
x = x.to('cuda')
Check whether a GPU is available: torch.cuda.is_available()
Multiple GPUs: 'cuda:0', 'cuda:1', …
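A small sketch of the usual device-selection pattern (the tensor and variable names are illustrative):

```python
import torch

# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

x = torch.ones([2, 2])
x = x.to(device)   # move the tensor to the chosen device
# With multiple GPUs, a specific card can be selected, e.g. x.to('cuda:0')
```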
Developer: Facebook AI. Interfaces: Python & C++. PyTorch is widely used for research.
Training procedure:
- Load data
- Define neural network (torch.nn)
- Loss function (torch.nn)
- Optimizer (torch.optim)
- Training, validation, testing
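A minimal training-loop skeleton using torch.nn and torch.optim (the model, data, and hyperparameters are illustrative assumptions, not from the course):

```python
import torch
import torch.nn as nn

# Illustrative model and data
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()                                  # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimizer

xs, ys = torch.randn(64, 10), torch.randn(64, 1)          # one batch of fake data

model.train()
for epoch in range(10):
    optimizer.zero_grad()                 # clear old gradients
    loss = criterion(model(xs), ys)       # compute the loss
    loss.backward()                       # gradients via backpropagation
    optimizer.step()                      # update the parameters
```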
Overfitting - remedies:
- data augmentation (more training data)
- constrained model: fewer parameters, sharing parameters
Possible reasons why a neural network fails to train
Optimization fails because of critical points
Saddle point - possible to escape // local minimum - no way to go
Taylor series approximation, Hessian
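For reference, the second-order Taylor approximation around a point $\theta'$ that makes the Hessian appear (standard form; $g$ is the gradient and $H$ the Hessian at $\theta'$):

$$L(\theta) \approx L(\theta') + (\theta - \theta')^T g + \frac{1}{2}(\theta - \theta')^T H (\theta - \theta')$$

At a critical point the gradient term vanishes, so the sign of $(\theta - \theta')^T H (\theta - \theta')$ determines the local geometry.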
For all $v$: $v^T H v > 0$ - local minimum ⇔ $H$ is positive definite ⇔ all eigenvalues are positive
For all $v$: $v^T H v < 0$ - local maximum ⇔ $H$ is negative definite ⇔ all eigenvalues are negative
Sometimes $v^T H v > 0$, sometimes $v^T H v < 0$ - saddle point
Compute the eigenvalues of $H$ to tell whether a critical point is a saddle point or a local minimum.
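A small sketch of this eigenvalue check on a toy two-parameter loss (the loss function itself is an illustrative assumption):

```python
import torch

# Toy loss with a saddle point at the origin: L(w) = w0^2 - w1^2
def L(w):
    return w[0] ** 2 - w[1] ** 2

w = torch.zeros(2)
H = torch.autograd.functional.hessian(L, w)   # Hessian at the critical point
eigvals = torch.linalg.eigvalsh(H)            # eigenvalues of the symmetric Hessian
print(eigvals)   # tensor([-2., 2.]) -> mixed signs, so this is a saddle point
```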
Something that has no way out in three-dimensional space may not actually be closed off in higher dimensions. (from The Three-Body Problem)
Batch
For each batch: look at the data in that batch, compute the gradient, and update the parameters (one update); going through all batches once is one epoch.
Shuffle - in each epoch the data grouped into the same batch is different (the order is reshuffled); use shuffle=True for training and shuffle=False for testing.
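A minimal DataLoader sketch showing the shuffle flag (the dataset and batch size are illustrative assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randn(100))

train_loader = DataLoader(dataset, batch_size=10, shuffle=True)   # reshuffled every epoch
test_loader = DataLoader(dataset, batch_size=10, shuffle=False)   # fixed order for testing

for x_batch, y_batch in train_loader:   # one pass over all batches = one epoch
    pass                                # each batch would produce one parameter update
```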
Small batch vs. large batch - large batch: see many examples before each update; small batch (in the extreme, batch size 1): see only one example, then update the parameters.
Game analogy: a skill with a long cooldown but powerful (large batch) vs. a short cooldown but noisy (small batch).
A large batch size does not require a longer time to compute the gradient (Tesla V100 GPU - parallel computing).
Noisy gradients actually help training: a larger batch size gives worse training/testing performance, likely an optimization issue. When optimization gets stuck on one batch, it may not be stuck on other batches.
Batch size is a hyperparameter.
With a large batch size training is fast, but the results are worse; there is follow-up work discussing whether we can have the best of both worlds.
Momentum
(Vanilla gradient descent) compute the gradient and move the parameters in the direction opposite to the gradient.
With momentum, the movement is no longer just the opposite of the gradient: it is the (negative) gradient direction plus the movement of the previous step.
Starting at $\theta^0$, movement $m^0 = 0$
Compute gradient $g^0$, movement $m^1 = \lambda m^0 - \eta g^0$
Move to $\theta^1 = \theta^0 + m^1$
Compute gradient $g^1$, movement $m^2 = \lambda m^1 - \eta g^1$
Move to $\theta^2 = \theta^1 + m^2$
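A minimal sketch of these momentum updates on a toy one-parameter loss (the loss, $\lambda$, and $\eta$ values are illustrative assumptions):

```python
import torch

def L(theta):                 # toy 1-D loss (illustrative)
    return (theta - 3) ** 2

theta = torch.tensor(0.0, requires_grad=True)
m = 0.0                       # movement m^0 = 0
lam, eta = 0.9, 0.05          # momentum factor lambda, learning rate eta

for t in range(50):
    loss = L(theta)
    loss.backward()
    with torch.no_grad():
        m = lam * m - eta * theta.grad   # m^{t+1} = lambda * m^t - eta * g^t
        theta += m                       # theta^{t+1} = theta^t + m^{t+1}
        theta.grad = None
```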
Error surface is rugged… Adaptive learning rate
Training stuck $\neq$ small gradient. When the loss gets stuck, it is not necessarily at a local minimum or saddle point; most training gets stuck before even reaching a critical point.
The gradient magnitude differs along different parameter dimensions, and also changes at different points during training.
Different parameters need different learning rates.
Formulation for one parameter: in $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta g_i^{t}$, replace $\eta$ with a version that depends on parameter $i$: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^{t}} g_i^{t}$
$\sigma_i^0 = |g_i^0|$
Even for the same parameter, $\sigma$ changes over time - the learning rate adapts dynamically.
RMSProp (no published paper)
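The notes only record $\sigma_i^0$; for reference, the usual root-mean-square form and the RMSProp form are (standard formulations, reconstructed rather than copied from the notes):

$$\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}(g_i^k)^2} \qquad \text{(root mean square of all past gradients)}$$

$$\sigma_i^t = \sqrt{\alpha\,(\sigma_i^{t-1})^2 + (1-\alpha)\,(g_i^t)^2} \qquad \text{(RMSProp, with recency weight } \alpha\text{)}$$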
Adam: RMSProp + Momentum
Learning Rate Scheduling - learning rate decay - as training goes on, we get closer to the destination, so we reduce the learning rate.
Warm up: the learning rate first increases and then decreases - a "black magic" trick (no explanation is given for why it works). See RAdam.
Summary of Optimization
Momentum (considers the magnitude and direction of past gradients), $\sigma$ (root mean square - considers only the magnitudes), learning rate scheduling (warm up).
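Putting the three pieces together, the summarized update takes the standard form below (reconstructed from the individual pieces above; $\eta^t$ is the scheduled learning rate, $m_i^t$ the momentum, $\sigma_i^t$ the root mean square):

$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^t}{\sigma_i^t}\, m_i^t$$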
"When the mountain doesn't turn, the road turns." (If one path is blocked, there is another way around.)
Classification - Softmax, cross-entropy
class as one-hot vector
$y' = \mathrm{softmax}(y)$ with $y_i' = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$; $y$ can have any values - softmax maps all values into the range 0 to 1 (normalization, and the gaps between different values are amplified).
cross-entropy - minimizing cross-entropy is equivalent to maximizing likelihood.
In PyTorch, the cross-entropy loss applies softmax automatically.
Mean squared error tends to get stuck during training (its error surface has large flat regions when the loss is large).
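A small sketch of cross-entropy in PyTorch; nn.CrossEntropyLoss takes raw logits and applies (log-)softmax internally, so no explicit softmax layer is needed (the shapes here are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw network outputs y for 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])   # class indices (not one-hot vectors)

criterion = nn.CrossEntropyLoss()     # softmax + cross-entropy in one call
loss = criterion(logits, labels)

probs = torch.softmax(logits, dim=1)  # y' = softmax(y), values in (0, 1) summing to 1
```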
Batch Normalization
Training
Changing the landscape: when the values in different dimensions differ by a lot, the error surface is dominated by the dimensions with large values; we want the different dimensions to have values of the same order of magnitude.
Feature normalization: $\tilde{z}^i = \frac{z^i - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation.
But this makes the examples coupled to each other. Normalization is done only over the data within one batch, because the full dataset is far too large; with a reasonably large batch size, the data in one batch can approximate the distribution of the whole dataset.
Testing
At testing time there may be no batch (examples can come in one at a time), so $\mu$ and $\sigma$ cannot be computed. Instead, the $\mu$ and $\sigma$ accumulated during training (moving averages) are used; PyTorch handles this automatically.
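A minimal BatchNorm sketch showing this train/test distinction (the feature size and batch size are illustrative assumptions):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3)

x = torch.randn(10, 3)     # one batch of 10 examples, 3 features

bn.train()
y_train = bn(x)            # normalizes with this batch's mean/std, updates running stats

bn.eval()
x_single = torch.randn(1, 3)
y_test = bn(x_single)      # uses the running mean/std accumulated during training
```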
New Optimizers for Deep Learning
SGD
SGD with momentum
Adagrad
RMSProp
Adam
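All of these optimizers are available in torch.optim (the model and learning rates here are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)
opt_sgdm = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01)
opt_adam = torch.optim.Adam(model.parameters(), lr=0.001)
```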