Dropout network, DropConnect network

This post introduces two deep-learning regularization techniques, Dropout and DropConnect. Dropout prevents overfitting by randomly switching off neurons during training, while DropConnect randomly zeroes out weight parameters instead. Both admit an approximate averaging at test time, partially supported by the central limit theorem, but they apply to fully connected layers only.

Notations

  • input $v$
  • output $r$
  • weight parameter $W \in \mathbb{R}^{d \times m}$
  • activation function $a$
  • mask: $m$ for a vector, $M$ for a matrix

Dropout

  • Randomly set the activations of each layer to zero with probability $1-p$:
    $$r = m \circ a(Wv), \qquad m_j \sim \text{Bernoulli}(p).$$
  • Since many activation functions satisfy $a(0) = 0$, this is equivalent to
    $$r = a(m \circ Wv).$$
    A minimal sketch of this forward pass follows the list below.
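A minimal NumPy sketch of one Dropout layer under the notation above. The ReLU activation, the function name `dropout_layer`, and the default keep probability $p=0.5$ are illustrative assumptions, not from the source; the test-time $p$-scaling is the standard dropout weight-scaling rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(v, W, p=0.5, train=True):
    """One fully connected layer with Dropout: r = a(m o Wv).

    v: input of shape (m,); W: weights of shape (d, m); p: keep probability.
    ReLU is used as a, so a(0) = 0 and masking before or after a is equivalent.
    """
    u = W @ v                                  # pre-activations Wv
    if train:
        m = rng.binomial(1, p, size=u.shape)   # m_j ~ Bernoulli(p)
        u = m * u                              # zero each unit with probability 1 - p
    else:
        u = p * u                              # test time: standard scaling to approximate the mask average
    return np.maximum(u, 0.0)                  # a = ReLU
```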

DropConnect

  • Randomly set each weight of the layer to zero with probability $1-p$:
    $$r = a((M \circ W)v), \qquad M_{ij} \sim \text{Bernoulli}(p).$$
  • Each $M_{ij}$ is drawn independently for each example during training.
    The memory requirement for the masks $M$ therefore grows with the size of each mini-batch, so the implementation needs to be carefully designed.
  • overall model $f(x; \theta, M)$, where $\theta = \{W_g, W, W_s\}$:
    $$\begin{aligned} o = \mathbb{E}_M[f(x;\theta,M)] &= \sum_M p(M)\, f(x;\theta,M) \\ &= \frac{1}{|M|} \sum_M s\big(a((M \circ W)v);\, W_s\big) \quad \text{if } p = 0.5 \end{aligned}$$
  • inference (test stage):
    $$\begin{aligned} r &= \frac{1}{|M|} \sum_M a((M \circ W)v) \\ &\approx \frac{1}{Z} \sum_{z=1}^Z r_z \approx \frac{1}{Z} \sum_{z=1}^Z a(u_z), \end{aligned}$$
    where $u_z \sim \mathcal{N}\big(pWv,\; p(1-p)(W \circ W)(v \circ v)\big)$ and $Z$ denotes the number of random samples drawn from the Gaussian distribution.
    Idea: approximate a sum of weighted Bernoulli random variables by a Gaussian random variable, partially justified by the central limit theorem. Sketches of the training forward pass and of this sampling-based inference follow the list.
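To make the training-time computation concrete, here is a hedged NumPy sketch of the DropConnect forward pass. The ReLU activation and the name `dropconnect_forward` are assumptions; drawing a fresh mask per example is what makes the mask memory grow with the mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_forward(v, W, p=0.5):
    """Training-time DropConnect: r = a((M o W) v), M_ij ~ Bernoulli(p).

    A fresh mask M (same shape as W) is drawn for each example, which is
    why the memory for the masks grows with the mini-batch size.
    """
    M = rng.binomial(1, p, size=W.shape)    # M_ij ~ Bernoulli(p)
    return np.maximum((M * W) @ v, 0.0)     # a = ReLU
```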
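And a sketch of the sampling-based inference, matching the first two moments of $(M \circ W)v$ with a Gaussian as in the equations above. Again, the ReLU activation and the defaults $p=0.5$, $Z=100$ are assumptions, not from the source.

```python
import numpy as np

def dropconnect_inference(v, W, p=0.5, Z=100):
    """Moment matching: u_z ~ N(pWv, p(1-p)(W o W)(v o v)); r ~ (1/Z) sum_z a(u_z)."""
    rng = np.random.default_rng(0)
    mean = p * (W @ v)                        # E[(M o W)v] = p W v
    var = p * (1 - p) * ((W * W) @ (v * v))   # Var[(M o W)v]_i = p(1-p) sum_j W_ij^2 v_j^2
    u = rng.normal(mean, np.sqrt(var), size=(Z, mean.size))  # Z Gaussian samples u_z
    return np.maximum(u, 0.0).mean(axis=0)    # r ~ (1/Z) sum_z a(u_z), a = ReLU
```

For instance, with `W = np.ones((3, 5))` and `v = np.ones(5)`, the returned vector approaches $a(pWv) = 2.5$ per unit as $Z$ grows, illustrating the averaging over masks without enumerating them.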

$\textcolor{red}{\text{Limitations}}$: both techniques are suitable for fully connected layers only.
