Deep Learning Review Outline

Introduction

  • Classification/regression $\iff$ a function $f_i : \mathbb{R}^d \to \mathbb{R}$
  • Classes, probabilities, one-hot encoding, softmax + cross entropy (see the cross-entropy sketch after this list).
  • Loss functions
    • mean square error
    • mean absolute error
    • cross entropy (for probability ∈ [0,1])
  • Gradient descent
    • Zero gradient: local optima vs. saddle points; gradients extremely close to zero (plateaus).
    • $\nabla L(\theta_t) = \left.\frac{\partial L}{\partial \theta}\right|_{\theta=\theta_t}$
    • vanilla
      $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$
      $\theta_{t+1} = \theta_t + G_t$, where $G_t = -\eta \nabla L(\theta_t)$.
    • momentum
      $\theta_{t+1} = \theta_t - \eta \sum\limits_{s=1}^{t} \lambda^{t-s} \nabla L(\theta_s)$
      $\theta_{t+1} = \theta_t + G_t$, where $G_t = \lambda G_{t-1} - \eta \nabla L(\theta_t)$.
    • adam = rmsprop + momentum (see the optimizer sketch after this list)
  • adaptive learning rate
    • $\theta_{t+1} = \theta_t + \frac{1}{H_t} G_t$
    • adagrad
      $H_t = \sqrt{\frac{1}{t} \sum\limits_{s=1}^{t} |\nabla L(\theta_s)|^2}$
    • rmsprop
      $H_t = \sqrt{\alpha^{t-1} |\nabla L(\theta_1)|^2 + (1-\alpha) \sum\limits_{s=2}^{t} \alpha^{t-s} |\nabla L(\theta_s)|^2}$
      ${H_t}^2 = \alpha {H_{t-1}}^2 + (1-\alpha) |\nabla L(\theta_t)|^2$
  • learning rate scheduling (see the scheduler sketch after this list)
    • decay ↘
    • warm up ↗ ↘
  • batch
    • iterations per epoch = #samples / batch size
    • Large batch: faster per-epoch training but tends toward sharp minima; small batch: noisy gradients, often better optimization and generalization.
  • Activation functions
    • hard sigmoid
    • sigmoid
    • ReLU
  • Neural networks
    • $z = \sigma(b + Wx)$; $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \frac{\partial \sigma}{\partial (b+Wx)} \frac{\partial (b+Wx)}{\partial W} = \frac{\partial L}{\partial z} \sigma(1-\sigma) x^T$, $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \frac{\partial \sigma}{\partial (b+Wx)} \frac{\partial (b+Wx)}{\partial b} = \frac{\partial L}{\partial z} \sigma(1-\sigma)$ (checked against autograd in the sketch after this list).
  • Deep learning vs. representation learning
  • Deep vs. wide networks
  • pytorch
    • computation graph
    • x.grad
    • torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None)
    • torch.nn.LSTM(input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, proj_size=0)
  • Validation and testing
    • OOF (out-of-fold) and CV (cross-validation) (see the k-fold sketch after this list)
    • train loss vs. test loss (optimization vs. generalization)
      Large train loss (fails to optimize): high bias (model too simple), or apply better optimization techniques.
      Small train loss but large test loss (fails to generalize): high variance (model too complex), or distribution shift.
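
A minimal cross-entropy sketch, assuming made-up logits and a 3-class toy label: with a one-hot target, cross entropy is just the negative log-softmax value at the true class, which F.cross_entropy computes directly from the logits.

```python
import torch
import torch.nn.functional as F

# Toy logits and label (made-up values), 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0]])
label = torch.tensor([0])

# Manual cross entropy with a one-hot target: -sum(one_hot * log_softmax(logits)).
one_hot = F.one_hot(label, num_classes=3).float()
manual = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1)

# Built-in cross entropy (takes raw logits and class indices).
builtin = F.cross_entropy(logits, label, reduction='none')

print(manual, builtin)  # both print the same loss value
```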
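
An optimizer sketch of the update rules above (vanilla, momentum, rmsprop, and an Adam-style combination) on a toy quadratic loss $L(\theta) = \frac{1}{2}\|\theta\|^2$, whose gradient is simply $\theta$; the hyper-parameters and step counts are arbitrary, and the Adam variant omits bias correction.

```python
import torch

# Toy loss L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta; every variant should drive theta to 0.
def grad_fn(theta):
    return theta.clone()

eta, lam, alpha, eps = 0.1, 0.9, 0.99, 1e-8
theta0 = torch.tensor([3.0, -2.0])

# Vanilla: theta <- theta - eta * grad
theta = theta0.clone()
for _ in range(100):
    theta -= eta * grad_fn(theta)
print(theta)

# Momentum: G_t = lam * G_{t-1} - eta * grad;  theta <- theta + G_t
theta, G = theta0.clone(), torch.zeros_like(theta0)
for _ in range(100):
    G = lam * G - eta * grad_fn(theta)
    theta += G
print(theta)

# RMSProp: H_t^2 = alpha * H_{t-1}^2 + (1 - alpha) * grad^2;  theta <- theta - eta * grad / H_t
theta, H2 = theta0.clone(), torch.zeros_like(theta0)
for _ in range(100):
    g = grad_fn(theta)
    H2 = alpha * H2 + (1 - alpha) * g ** 2
    theta -= eta * g / (H2.sqrt() + eps)
print(theta)

# Adam-style: momentum in the numerator, rmsprop in the denominator (bias correction omitted here).
theta, m, H2 = theta0.clone(), torch.zeros_like(theta0), torch.zeros_like(theta0)
for _ in range(100):
    g = grad_fn(theta)
    m = lam * m + (1 - lam) * g
    H2 = alpha * H2 + (1 - alpha) * g ** 2
    theta -= eta * m / (H2.sqrt() + eps)
print(theta)
```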
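
A scheduler sketch for warm-up followed by decay, using torch.optim.lr_scheduler.LambdaLR; the linear model, warmup_steps, and total_steps are placeholder choices for illustration.

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 100, 1000               # illustrative values

def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps            # warm up: learning rate ramps up
    # afterwards decay linearly towards zero
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.step()                                # normally preceded by loss.backward()
    scheduler.step()
    if step in (0, warmup_steps - 1, total_steps - 1):
        print(step, scheduler.get_last_lr())
```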
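
A sketch that checks the hand-derived gradients of $z = \sigma(b + Wx)$ above against autograd (the .grad fields mentioned in the pytorch bullet); the loss $L = \sum_i z_i$ is an arbitrary choice so that $\partial L / \partial z$ is all ones.

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(4)

z = torch.sigmoid(b + W @ x)   # forward pass builds the computation graph
L = z.sum()                    # arbitrary loss, so dL/dz = 1 for every component
L.backward()                   # autograd fills W.grad and b.grad

zd = z.detach()
dLdz = torch.ones(3)
manual_dW = (dLdz * zd * (1 - zd)).unsqueeze(1) * x.unsqueeze(0)  # dL/dz * sigma(1-sigma) * x^T
manual_db = dLdz * zd * (1 - zd)                                  # dL/dz * sigma(1-sigma)

print(torch.allclose(W.grad, manual_dW), torch.allclose(b.grad, manual_db))  # True True
```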
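
A k-fold / out-of-fold sketch in plain NumPy; the random data, the constant-mean "model", and k = 5 are made-up stand-ins, just to show how each sample's OOF prediction comes from a model that never saw it.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)                 # toy targets
k = 5

idx = rng.permutation(len(y))
folds = np.array_split(idx, k)
oof = np.empty_like(y)

for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    prediction = y[train_idx].mean()     # trivial "model": predict the training-fold mean
    oof[val_idx] = prediction            # out-of-fold prediction for the held-out fold

cv_score = np.mean((oof - y) ** 2)       # CV estimate of the generalization error
print(cv_score)
```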

CNN

  • receptive field
  • parameter sharing
  • subsampling / pooling
  • feature map
  • activation maximization vs saliency map (gradient)
  • Image tasks
    • LeNet, AlexNet, VGG, GoogLeNet, ResNet.
    • LeNet: convolution + pooling, convolutional layers first, then fully-connected layers (see the LeNet-style sketch at the end of this section). (LeCun)
    • AlexNet: ReLU to counter vanishing gradients in deep nets, Dropout to counter overfitting. (Hinton)
    • VGG: stacked 3x3 kernels. (Oxford)
    • GoogLeNet: multiple kernel sizes in parallel (Inception modules). (Google)
    • ResNet: residual connections address both vanishing gradients and overfitting in very deep nets. (Kaiming He, Jian Sun)


Jian Sun passed away today.
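
A minimal LeNet-style sketch (convolution + pooling first, then fully-connected layers); the 1x28x28 input size and layer widths are illustrative assumptions, not the exact LeNet-5 configuration.

```python
import torch
import torch.nn as nn

class TinyLeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 1x28x28 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),             # -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 10),                          # 10 classes, e.g. digits
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(8, 1, 28, 28)          # a fake batch of 8 grayscale images
print(TinyLeNet()(x).shape)            # torch.Size([8, 10])
```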

RNN

  • Elman: $y_t = f(c_{t-1}, x_t)$
    Jordan: $y_t = f(y_{t-1}, x_t)$
  • bidirectional: $\bar{y} = y_{\leftarrow} + y_{\rightarrow}$
  • LSTM (see the manual LSTM-step sketch after this list)
    $$
    \begin{aligned}
    g_t &= \tanh(W^{g}_x x_t + W^{g}_h h_{t-1} + b^{g}) \\
    i_t &= \sigma(W^{i}_x x_t + W^{i}_h h_{t-1} + b^{i}) \\
    f_t &= \sigma(W^{f}_x x_t + W^{f}_h h_{t-1} + b^{f}) \\
    c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
    o_t &= \sigma(W^{o}_x x_t + W^{o}_h h_{t-1} + b^{o}) \\
    h_t &= o_t \odot \tanh(c_t)
    \end{aligned}
    $$
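
A sketch of a single LSTM step implementing the gate equations above and checking it against torch.nn.LSTMCell; it relies on PyTorch packing the gates in (i, f, g, o) order inside weight_ih / weight_hh.

```python
import torch

torch.manual_seed(0)
d_in, d_h = 4, 3
cell = torch.nn.LSTMCell(d_in, d_h)
x = torch.randn(1, d_in)
h0, c0 = torch.zeros(1, d_h), torch.zeros(1, d_h)

h1, c1 = cell(x, (h0, c0))                      # reference step from PyTorch

# Manual step using the same parameters; gates are stored as (i, f, g, o).
W_x, W_h = cell.weight_ih, cell.weight_hh       # shapes (4*d_h, d_in), (4*d_h, d_h)
b = cell.bias_ih + cell.bias_hh
gates = x @ W_x.T + h0 @ W_h.T + b
i, f, g, o = gates.split(d_h, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)
c = f * c0 + i * g                              # c_t = f ⊙ c_{t-1} + i ⊙ g
h = o * torch.tanh(c)                           # h_t = o ⊙ tanh(c_t)

print(torch.allclose(h, h1), torch.allclose(c, c1))  # True True
```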